I am attempting to parse a page of mime-type application/xml using Tika. The debug log shows that it is being parsed by org.apache.tika.parser.xml.DcXMLParser.
However, if the document is structured as follows:

    <urlset>
    <url><loc>http://www.example.com/index.html</loc><lastmod>2014-02-15T01:30Z</lastmod><changefreq>monthly</changefreq></url>
    </urlset>

I receive a single outlink which incorrectly concatenates the loc and lastmod elements, so that the outlink reads as:

    http://www.example.com/index.html2014-02-15T01:30Z

If I reformat with carriage returns and line feeds, but in no other way change the XML document, so that it is now:

    <urlset>
    <url>
    <loc>http://www.example.com/index.html</loc>
    <lastmod>2014-02-15T01:30Z</lastmod>
    <changefreq>monthly</changefreq>
    </url>
    </urlset>

I then receive two outlinks, the first being correct and the second being an erroneous extraction from the lastmod element:

    http://www.example.com/index.html
    T01:30Z

If I then remove the colon from the timestamp in the lastmod element, I receive the single outlink http://www.example.com/index.html that I would originally have expected.

Any ideas as to what might be going on, and how I can correctly parse the original document? If Tika cannot parse this correctly, shouldn't Nutch at least perform a format validation on the returned outlinks and discard those that are invalid?

Thanks!
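
P.S. To make the "format validation" idea concrete, here is a minimal sketch of the kind of check I have in mind. It is written in Python purely for illustration and is not Nutch or Tika code: the helper name is_plausible_outlink and the scheme whitelist ALLOWED_SCHEMES are my own assumptions.

```python
from urllib.parse import urlsplit

# Schemes treated as crawlable; this whitelist is an assumption for the sketch.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_plausible_outlink(candidate: str) -> bool:
    """Return True if candidate is an absolute URL with a known scheme and a host."""
    try:
        parts = urlsplit(candidate.strip())
    except ValueError:
        # urlsplit rejects some malformed inputs (e.g. a broken IPv6 literal).
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)


# The three outlinks reported above.
candidates = [
    "http://www.example.com/index.html",
    "T01:30Z",
    "http://www.example.com/index.html2014-02-15T01:30Z",
]
valid = [c for c in candidates if is_plausible_outlink(c)]
```

A check like this would discard T01:30Z, because "T01" is read as an unknown scheme and there is no host. It would still accept the concatenated link, since that string is syntactically a valid URL, so validation alone would not fix the first case.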

