Sebastian,

Thank you.  It is indeed a sitemap.

I had reviewed NUTCH-1465.

Interestingly, however, I have successfully parsed literally dozens of other 
sitemaps using Tika.

Perhaps I have just been lucky so far?

Thanks

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]] 
Sent: Wednesday, March 05, 2014 8:57 AM
To: [email protected]
Subject: Re: Tika Parsing XML Incorrect Outlink Extraction

Hi Iain,

the document looks like a sitemap (sitemaps.org).
Support for sitemaps is ongoing work, see NUTCH-1465.
The point is: you cannot expect that any XML-based format is properly parsed by 
Tika. In the case of sitemaps, it's not only the outlinks but also re-fetch 
intervals and last-modified times which have to be transferred into Nutch's data 
structures.
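
To illustrate the fields involved, here is a minimal sketch in Python of reading a sitemap with a real XML parser so that <loc>, <lastmod> and <changefreq> stay separate. It is not Nutch or Tika code. The names `parse_sitemap` and `local_name` are hypothetical, and the sample document is the one from the original report.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix, so namespaced and bare tags both match."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict per <url> entry with loc, lastmod and changefreq."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.iter():
        if local_name(url.tag) != "url":
            continue
        entry: Dict[str, Optional[str]] = {
            "loc": None,
            "lastmod": None,
            "changefreq": None,
        }
        # Read each child element on its own; no text is concatenated.
        for child in url:
            name = local_name(child.tag)
            if name in entry and child.text:
                entry[name] = child.text.strip()
        # Skip entries without a location, since there is nothing to fetch.
        if entry["loc"]:
            entries.append(entry)
    return entries


sample = (
    "<urlset><url><loc>http://www.example.com/index.html</loc>"
    "<lastmod>2014-02-15T01:30Z</lastmod>"
    "<changefreq>monthly</changefreq></url></urlset>"
)
for entry in parse_sitemap(sample):
    print(entry["loc"], entry["lastmod"], entry["changefreq"])
```

Because each element is read individually, the result does not depend on whether the elements share a line or sit on separate lines.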

Sebastian


On 03/05/2014 01:39 PM, Iain Lopata wrote:
> My apologies, but I realized that the formatting of my XML was not 
> preserved in the email.  Hopefully it is clear enough that in the 
> first case the <loc>, <lastmod>, and <changefreq> elements are all on 
> the same line, and in the second case they have been moved to separate lines.
> 
> -----Original Message-----
> From: Iain Lopata [mailto:[email protected]]
> Sent: Wednesday, March 05, 2014 6:19 AM
> To: [email protected]
> Subject: Tika Parsing XML Incorrect Outlink Extraction
> 
> I am attempting to parse a page of mime-type application/xml using Tika.
> The debug log shows that it is being parsed by 
> org.apache.tika.parser.xml.DcXMLParser.
> 
>  
> 
> However, if the document is structured as follows:
> 
>  
> 
> <urlset>
> 
>  
> <url><loc>http://www.example.com/index.html</loc><lastmod>2014-02-15T01:30Z</lastmod><changefreq>monthly</changefreq></url>
> 
> </urlset>
> 
>  
> 
> I receive a single outlink which incorrectly concatenates the loc and 
> lastmod elements so that the outlink reads as:
> http://www.example.com/index.html2014-02-15T01:30Z
> 
>  
> 
> If I reformat with carriage returns and line feeds, but in no other 
> way change the xml document so that it is now:
> 
>  
> 
> <urlset>
> 
>         <url>
> 
> <loc>http://www.example.com/index.html</loc>
> 
> <lastmod>2014-02-15T01:30Z</lastmod>
> 
> <changefreq>monthly</changefreq>
> 
>         </url>
> 
> </urlset>
> 
>  
> 
> I then receive two outlinks, the first being correct and the second 
> being an erroneous extraction from the lastmod element:
> http://www.example.com/index.html and T01:30Z
> 
>  
> 
> If I then remove the colon from the time/datestamp in the lastmod 
> element I receive the single outlink http://www.example.com/index.html 
> that I would originally have expected.
> 
>  
> 
> Any ideas as to what might be going on and how I can correctly parse 
> the original document?  If Tika cannot parse this correctly, shouldn't 
> Nutch at least perform a format validation on the returned outlinks 
> and discard those that are invalid?
> 
>  
> 
> Thanks!
> 
>  
> 
> 
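
On the closing question about validating outlinks, here is a minimal sketch in Python of what such a check might look like. It does not show how Nutch filters URLs. The names `is_plausible_outlink` and `filter_outlinks` and the allowed-scheme set are assumptions made for illustration.

```python
from typing import Iterable, List
from urllib.parse import urlsplit

# Assumption: only ordinary web URLs are wanted as outlinks.
ALLOWED_SCHEMES = {"http", "https"}


def is_plausible_outlink(candidate: str) -> bool:
    """Accept a candidate only if it has an allowed scheme and a host."""
    try:
        parts = urlsplit(candidate.strip())
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs.
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)


def filter_outlinks(candidates: Iterable[str]) -> List[str]:
    """Keep only the candidates that pass the plausibility check."""
    return [c for c in candidates if is_plausible_outlink(c)]


extracted = ["http://www.example.com/index.html", "T01:30Z"]
print(filter_outlinks(extracted))
```

A check of this kind would discard the stray "T01:30Z" from the second case. It would not catch the concatenated link from the first case, because http://www.example.com/index.html2014-02-15T01:30Z is still a syntactically valid URL. Validation alone therefore cannot replace parsing the sitemap structure correctly.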

