Re: Sitemap URL's concatenated, causing status 14 not found
Hi Markus,

ok, no problem. Done:
https://github.com/crawler-commons/crawler-commons/issues/213

Sebastian

On 06/07/2018 12:21 AM, Markus Jelsma wrote:
> Sebastian, I do not want to be a pain in the arse, but I do not have a
> GitHub account. If you would do the honours of opening a ticket, please do so.
>
> Apologies,
> Markus
RE: Sitemap URL's concatenated, causing status 14 not found
Sebastian, I do not want to be a pain in the arse, but I do not have a GitHub account. If you would do the honours of opening a ticket, please do so.

Apologies,
Markus

-Original message-
> From: Sebastian Nagel
> Sent: Tuesday 29th May 2018 11:33
> To: user@nutch.apache.org
> Subject: Re: Sitemap URL's concatenated, causing status 14 not found
Re: Sitemap URL's concatenated, causing status 14 not found
> I agree that this is not the ideal error behaviour, but I guess the code
> was written from the assumption that the document is valid and conformant.

Over time the crawler-commons sitemap parser has been extended to get as much as possible from non-conforming sitemaps as well. Of course, it's hard to foresee and handle all possible mistakes...
The equivalent syntax error for sitemaps (missing closing/next in is handled.

@Markus: Please open an issue for crawler-commons
https://github.com/crawler-commons/crawler-commons/issues/

Thanks,
Sebastian

On 05/26/2018 02:57 AM, Yossi Tamari wrote:
> Hi Markus,
>
> I don't believe this is a valid sitemapindex. Each <sitemap> should include
> exactly one <loc>.
> See also https://www.sitemaps.org/protocol.html#index and
> https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> I agree that this is not the ideal error behaviour, but I guess the code
> was written from the assumption that the document is valid and conformant.
>
> Yossi.
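To illustrate what "getting as much as possible from a non-conforming sitemap" could look like for this case, here is a minimal Python sketch. It is not the crawler-commons code: the handler class, the `lenient_urls` helper, and the sample document (a guess at the shape of the index discussed in this thread, using a default namespace) are all assumptions made for illustration. The idea is simply to emit a URL at every closing `</loc>` instead of waiting for the enclosing `</sitemap>`.

```python
import xml.sax

# Hypothetical sample: one <sitemap> entry that wrongly carries two <loc> children.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
    <loc>https://www.saxion.nl/content-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
"""


class LenientIndexHandler(xml.sax.ContentHandler):
    """Illustrative handler: emits one URL per closing </loc>."""

    def __init__(self):
        super().__init__()
        self.in_loc = False
        self.buffer = []
        self.urls = []

    def startElement(self, name, attrs):
        if name == "loc":
            self.in_loc = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self.in_loc:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "loc":
            url = "".join(self.buffer).strip()
            if url:
                self.urls.append(url)
            self.in_loc = False
            self.buffer = []


def lenient_urls(xml_text):
    handler = LenientIndexHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.urls


if __name__ == "__main__":
    for url in lenient_urls(SAMPLE_INDEX):
        print(url)
```

With this approach the malformed entry yields two separate sitemap URLs, at the cost of silently accepting a document that the schema would reject.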
RE: Sitemap URL's concatenated, causing status 14 not found
Ah, of course, I missed that!

Thanks,
Markus

-Original message-
> From: Yossi Tamari
> Sent: Saturday 26th May 2018 2:57
> To: user@nutch.apache.org
> Subject: RE: Sitemap URL's concatenated, causing status 14 not found
RE: Sitemap URL's concatenated, causing status 14 not found
Hi Markus,

I don't believe this is a valid sitemapindex. Each <sitemap> should include exactly one <loc>.
See also https://www.sitemaps.org/protocol.html#index and https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
I agree that this is not the ideal error behaviour, but I guess the code was written from the assumption that the document is valid and conformant.

Yossi.

> -Original Message-
> From: Markus Jelsma
> Sent: 25 May 2018 23:45
> To: User
> Subject: Sitemap URL's concatenated, causing status 14 not found
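The rule above (exactly one <loc> per <sitemap>) is easy to check before handing an index to a crawler. Below is a minimal Python sketch of such a check. The `find_bad_entries` helper and the sample document are hypothetical; the sample is only a guess at the shape of the index in question, since the archive did not preserve its markup, and it assumes the sitemaps.org namespace is the default namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one <sitemap> entry with two <loc> children.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
    <loc>https://www.saxion.nl/content-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
"""

# ElementTree spells namespaced tags as "{namespace-uri}localname".
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def find_bad_entries(xml_text):
    """Return (position, loc_count) for each <sitemap> without exactly one <loc>."""
    root = ET.fromstring(xml_text)
    problems = []
    for position, entry in enumerate(root.findall(NS + "sitemap"), start=1):
        loc_count = len(entry.findall(NS + "loc"))
        if loc_count != 1:
            problems.append((position, loc_count))
    return problems


if __name__ == "__main__":
    for position, loc_count in find_bad_entries(SAMPLE_INDEX):
        print(f"<sitemap> #{position} has {loc_count} <loc> elements, expected 1")
```

Validating against siteindex.xsd would be the stricter option; this sketch only covers the one constraint discussed here.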
Sitemap URL's concatenated, causing status 14 not found
Hello,

We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but Nutch thinks those two sitemap URLs are actually one, consisting of both concatenated.

Here is https://www.saxion.nl/sitemap.xml

xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
https://www.saxion.nl/opleidingen-sitemap.xml
https://www.saxion.nl/content-sitemap.xml

This seems fine, but Nutch attempts, and obviously fails, to load:

2018-05-25 16:27:50,515 ERROR [Thread-30] org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap. Status code: 14 for https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml

What is going on here? Why does Nutch, or CC's sitemap util, behave like this?

Thanks,
Markus
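One plausible way a concatenated URL like the one in the log can arise is a SAX-style handler that collects <loc> text but only emits it when the enclosing <sitemap> closes. The Python sketch below reproduces that effect. It is an illustration under assumptions, not the Nutch or crawler-commons code: the handler class, the `naive_urls` helper, and the sample document (a guess at the index's markup, with two <loc> elements inside one <sitemap>) are all hypothetical.

```python
import xml.sax

# Hypothetical sample: two <loc> elements inside a single <sitemap> entry.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
    <loc>https://www.saxion.nl/content-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
"""


class NaiveIndexHandler(xml.sax.ContentHandler):
    """Illustrative handler: buffers <loc> text, flushes only at </sitemap>."""

    def __init__(self):
        super().__init__()
        self.in_loc = False
        self.buffer = []
        self.urls = []

    def startElement(self, name, attrs):
        if name == "loc":
            self.in_loc = True

    def characters(self, content):
        if self.in_loc:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "loc":
            # The buffer is NOT flushed here, so a second <loc> appends to it.
            self.in_loc = False
        elif name == "sitemap":
            self.urls.append("".join(self.buffer).strip())
            self.buffer = []


def naive_urls(xml_text):
    handler = NaiveIndexHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.urls


if __name__ == "__main__":
    print(naive_urls(SAMPLE_INDEX))
```

For a conforming index with one <loc> per <sitemap>, such a handler behaves correctly, which would explain why the problem only shows up on this particular file.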