Hi Iain,

The document looks like a sitemap (sitemaps.org).
Support for sitemaps is ongoing work, see NUTCH-1465.
The point is: you cannot expect Tika to parse every
XML-based format properly. In the case of sitemaps,
it's not only the outlinks but also the re-fetch intervals
and last-modified times which have to be transferred
into Nutch's data structures.
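
To illustrate what sitemap-aware handling means, here is a minimal
sketch in Python. It is not the Nutch or Tika implementation and not
what NUTCH-1465 does; the function names (parse_sitemap,
flattened_text) are made up for the example. It keeps loc, lastmod
and changefreq as separate fields, and it also shows that simply
flattening the element text runs the values together. Whether that is
what actually happens inside the Tika parser is an assumption.

```python
import xml.etree.ElementTree as ET

# Sample taken from the original message (single-line variant).
SAMPLE = (
    "<urlset><url>"
    "<loc>http://www.example.com/index.html</loc>"
    "<lastmod>2014-02-15T01:30Z</lastmod>"
    "<changefreq>monthly</changefreq>"
    "</url></urlset>"
)


def local_name(tag):
    """Reduce a namespaced tag such as '{ns}loc' to 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return one dict per <url> entry, keeping the three fields apart."""
    root = ET.fromstring(xml_text)
    entries = []
    for element in root.iter():
        if local_name(element.tag) != "url":
            continue
        entry = {"loc": None, "lastmod": None, "changefreq": None}
        for child in element:
            name = local_name(child.tag)
            if name in entry and child.text:
                entry[name] = child.text.strip()
        if entry["loc"]:  # an entry without <loc> is of no use as an outlink
            entries.append(entry)
    return entries


def flattened_text(xml_text):
    """Concatenate all character data, ignoring element boundaries."""
    return "".join(ET.fromstring(xml_text).itertext())


if __name__ == "__main__":
    print(parse_sitemap(SAMPLE))
    print(flattened_text(SAMPLE))
```

With the structured parse, lastmod and changefreq stay available as
metadata instead of leaking into the link text.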

Sebastian


On 03/05/2014 01:39 PM, Iain Lopata wrote:
> My apologies, but I realized that the formatting of my XML was not preserved
> in the email.  Hopefully it is clear enough that in the first case the <loc>
> <lastmod> and <changefreq> elements are all on the same line and in the
> second case they have been moved to separate lines.
> 
> -----Original Message-----
> From: Iain Lopata [mailto:[email protected]] 
> Sent: Wednesday, March 05, 2014 6:19 AM
> To: [email protected]
> Subject: Tika Parsing XML Incorrect Outlink Extraction
> 
> I am attempting to parse a page of mime-type application/xml using Tika.
> The debug log shows that it is being parsed by
> org.apache.tika.parser.xml.DcXMLParser.
> 
>  
> 
> However, if the document is structured as follows:
> 
>  
> 
> <urlset>
> 
>  
> <url><loc>http://www.example.com/index.html</loc><lastmod>2014-02-15T01:30Z</lastmod><changefreq>monthly</changefreq></url>
> 
> </urlset>
> 
>  
> 
> I receive a single outlink which incorrectly concatenates the loc and
> lastmod elements so that the outlink reads as:
> http://www.example.com/index.html2014-02-15T01:30Z
> 
>  
> 
> If I reformat with carriage returns and line feeds, but in no other way
> change the xml document so that it is now:
> 
>  
> 
> <urlset>
> 
>         <url>
> 
> <loc>http://www.example.com/index.html</loc>
> 
> <lastmod>2014-02-15T01:30Z</lastmod>
> 
> <changefreq>monthly</changefreq>
> 
>         </url>
> 
> </urlset>
> 
>  
> 
> I then receive two outlinks, the first being correct and the second being an
> erroneous extraction from the lastmod element:
> http://www.example.com/index.html and T01:30Z
> 
>  
> 
> If I then remove the colon from the time/datestamp in the lastmod element I
> receive the single outlink http://www.example.com/index.html that I would
> originally have expected.
> 
>  
> 
> Any ideas as to what might be going on and how I can correctly parse the
> original document?  If Tika cannot parse this correctly shouldn't Nutch at
> least perform a format validation on the returned outlinks and discard those
> that are invalid?
> 
>  
> 
> Thanks!
> 
>  
> 
> 
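
On the question of validating the returned outlinks: a rough sketch of
such a check, again in Python and not Nutch code, could look like the
following. The names is_plausible_outlink and filter_outlinks and the
list of allowed schemes are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only web links count as outlinks.
ALLOWED_SCHEMES = {"http", "https"}


def is_plausible_outlink(candidate):
    """Accept only absolute URLs with an allowed scheme and a host."""
    try:
        parts = urlsplit(candidate.strip())
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host
        return False
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False
    return bool(parts.hostname)


def filter_outlinks(candidates):
    """Keep the candidates that pass the plausibility check."""
    return [c for c in candidates if is_plausible_outlink(c)]
```

A check like this would drop the stray "T01:30Z" from the second case,
but the concatenated value from the first case would still pass,
because it is a syntactically valid URL. Validation alone therefore
does not replace parsing the sitemap structure.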
