Sebastian,

Thank you. It is indeed a sitemap.
I had reviewed NUTCH-1465. Interestingly, however, I have successfully parsed literally dozens of other sitemaps using Tika. Perhaps I have just been lucky so far?

Thanks

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Wednesday, March 05, 2014 8:57 AM
To: [email protected]
Subject: Re: Tika Parsing XML Incorrect Outlink Extraction

Hi Iain,

the document looks like a sitemap (sitemaps.org). Support for sitemaps is ongoing work, see NUTCH-1465.

The point is: you cannot expect that any XML-based format is properly parsed by Tika. In the case of sitemaps, it's not only the outlinks but also re-fetch intervals and last-modified times which have to be transferred into Nutch's data structures.

Sebastian

On 03/05/2014 01:39 PM, Iain Lopata wrote:
> My apologies, but I realized that the formatting of my XML was not
> preserved in the email. Hopefully it is clear enough that in the
> first case the <loc>, <lastmod> and <changefreq> elements are all on
> the same line and in the second case they have been moved to separate lines.
>
> -----Original Message-----
> From: Iain Lopata [mailto:[email protected]]
> Sent: Wednesday, March 05, 2014 6:19 AM
> To: [email protected]
> Subject: Tika Parsing XML Incorrect Outlink Extraction
>
> I am attempting to parse a page of mime-type application/xml using Tika.
> The debug log shows that it is being parsed by
> org.apache.tika.parser.xml.DcXMLParser.
>
> However, if the document is structured as follows:
>
> <urlset>
> <url><loc>http://www.example.com/index.html</loc><lastmod>2014-02-15T01:30Z</lastmod><changefreq>monthly</changefreq></url>
> </urlset>
>
> I receive a single outlink which incorrectly concatenates the loc and
> lastmod elements so that the outlink reads as:
> http://www.example.com/index.html2014-02-15T01:30Z
>
> If I reformat with carriage returns and line feeds, but in no other
> way change the xml document so that it is now:
>
> <urlset>
> <url>
> <loc>http://www.example.com/index.html</loc>
> <lastmod>2014-02-15T01:30Z</lastmod>
> <changefreq>monthly</changefreq>
> </url>
> </urlset>
>
> I then receive two outlinks, the first being correct and the second
> being an erroneous extraction from the lastmod element:
> http://www.example.com/index.html and T01:30Z
>
> If I then remove the colon from the time/datestamp in the lastmod
> element I receive the single outlink http://www.example.com/index.html
> that I would originally have expected.
>
> Any ideas as to what might be going on and how I can correctly parse
> the original document? If Tika cannot parse this correctly shouldn't
> Nutch at least perform a format validation on the returned outlinks
> and discard those that are invalid?
>
> Thanks!
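For illustration only, here is a minimal Python sketch of element-aware sitemap parsing, as opposed to scanning the document's flattened text for URLs. It is not Nutch or Tika code and does not reflect what NUTCH-1465 implements; the names `parse_sitemap`, `is_valid_outlink` and `local_name` are hypothetical, and the validity check (http/https scheme plus a host) is an assumed rule chosen for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# The single-line sitemap from the original report.
SAMPLE = (
    "<urlset>"
    "<url><loc>http://www.example.com/index.html</loc>"
    "<lastmod>2014-02-15T01:30Z</lastmod>"
    "<changefreq>monthly</changefreq></url>"
    "</urlset>"
)


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def is_valid_outlink(url):
    """Assumed rule for this sketch: http(s) scheme and a non-empty host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def parse_sitemap(xml_text):
    """Return one dict per <url> entry, keeping each field separate."""
    entries = []
    root = ET.fromstring(xml_text)
    for url_el in root.iter():
        if local_name(url_el.tag) != "url":
            continue
        # Read each child element by name, so <loc> and <lastmod>
        # can never run together regardless of whitespace.
        fields = {
            local_name(child.tag): (child.text or "").strip()
            for child in url_el
        }
        loc = fields.get("loc", "")
        if is_valid_outlink(loc):
            entries.append({
                "loc": loc,
                "lastmod": fields.get("lastmod"),
                "changefreq": fields.get("changefreq"),
            })
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE):
        print(entry)
```

Under this assumed rule, the stray `T01:30Z` outlink from the second case would be rejected, but the concatenated `http://www.example.com/index.html2014-02-15T01:30Z` from the first case is still a syntactically valid URL. Format validation of outlinks alone would therefore not catch it; the fields have to be read per element.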

