I am attempting to parse a page with MIME type application/xml using Tika.
The debug log shows that it is being parsed by
org.apache.tika.parser.xml.DcXMLParser.

 

If the document is structured as follows:

 

<urlset>

 
<url><loc>http://www.example.com/index.html</loc><lastmod>2014-02-15T01:30Z</lastmod><changefreq>monthly</changefreq></url>

</urlset>

 

I receive a single outlink which incorrectly concatenates the loc and
lastmod elements so that the outlink reads as:
http://www.example.com/index.html2014-02-15T01:30Z

 

If I reformat the XML document with carriage returns and line feeds,
changing nothing else, so that it is now:

 

<urlset>

        <url>

<loc>http://www.example.com/index.html</loc>

<lastmod>2014-02-15T01:30Z</lastmod>

<changefreq>monthly</changefreq>

        </url>

</urlset>

 

I then receive two outlinks, the first being correct and the second being an
erroneous extraction from the lastmod element:
http://www.example.com/index.html and T01:30Z

 

If I then remove the colon from the timestamp in the lastmod element, I
receive the single outlink http://www.example.com/index.html that I would
originally have expected.

 

Any ideas as to what might be going on and how I can correctly parse the
original document?  If Tika cannot parse this correctly shouldn't Nutch at
least perform a format validation on the returned outlinks and discard those
that are invalid?
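
To illustrate the kind of format validation I mean, here is a minimal
sketch in plain Java. It is not Nutch or Tika code: the class OutlinkCheck
and its methods are hypothetical names, and the rule it applies (an
absolute http or https URI with a host) is only my assumption of what a
"valid" outlink would be.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
class OutlinkCheck {

    // Returns true only for absolute http/https URIs that carry a host.
    static boolean isPlausibleOutlink(String candidate) {
        if (candidate == null) {
            return false;
        }
        try {
            URI uri = new URI(candidate.trim());
            String scheme = uri.getScheme();
            boolean webScheme = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            return webScheme && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Not parseable as a URI at all, so discard it.
            return false;
        }
    }

    // Keeps only the candidates that pass the check above, in order.
    static List<String> keepPlausible(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (isPlausibleOutlink(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

A check like this would discard the stray T01:30Z outlink from the
reformatted document. It would not catch the concatenated outlink from the
original document, because
http://www.example.com/index.html2014-02-15T01:30Z is itself a
syntactically valid URL, so the parsing problem would still need to be
fixed.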

 

Thanks!

 
