Thanks. I am using option #3. I know this is not a Tika forum, but just wondering if you have input: Initially I was trying to use DcXMLParser. I tried with
1. If you don't want things stripped out while parsing then you need to call the parser with a ParseContext set up (similar to the question that Michele had
... The SimpleLinkExtractor should already set up the result Outlink records with the anchor text ("coolest new thing" in your example above). -- Ken ... Ken
... It will, but Tika winds up stripping out some elements & attributes, which you'd likely need to extract the pubDate. You can configure Tika to not strip
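
For reference, here is a minimal sketch of why the raw elements matter when you are after the pubDate. It does not use Tika or its ParseContext at all; it is plain JDK SAX, shown only to illustrate what you can read when nothing is stripped from the feed XML. The class name `PubDateSketch`, the method `extractPubDates`, and the sample feed are hypothetical and chosen just for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or Bixo.
public class PubDateSketch {

    /** Returns the trimmed text of every <pubDate> element, in document order. */
    public static List<String> extractPubDates(String xml) throws Exception {
        List<String> dates = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a <pubDate> element.
            private StringBuilder text;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("pubDate".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver element text in several chunks, so accumulate.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("pubDate".equals(qName) && text != null) {
                    dates.add(text.toString().trim());
                    text = null;
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return dates;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed for illustration only.
        String feed = "<rss><channel><item>"
                + "<title>coolest new thing</title>"
                + "<pubDate>Mon, 06 Sep 2010 16:20:00 GMT</pubDate>"
                + "</item></channel></rss>";
        System.out.println(extractPubDates(feed));
    }
}
```

If the parser you hand the feed to drops the pubDate element before your handler sees it, a handler like this has nothing to collect, which is the problem described above.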