Re: Sitemap URL's concatenated, causing status 14 not found

2018-06-07 Thread Sebastian Nagel
Hi Markus,

ok, no problem. Done:
  https://github.com/crawler-commons/crawler-commons/issues/213

Sebastian




RE: Sitemap URL's concatenated, causing status 14 not found

2018-06-06 Thread Markus Jelsma
Sebastian, I do not want to be a pain in the arse, but I do not have a GitHub
account. If you would do the honours of opening a ticket, please do so.

Apologies,
Markus

 
 


Re: Sitemap URL's concatenated, causing status 14 not found

2018-05-29 Thread Sebastian Nagel
> I agree that this is not the ideal error behaviour, but I guess the code
> was written from the assumption that the document is valid and conformant.

Over time the crawler-commons sitemap parser has been extended to get as much
as possible from non-conforming sitemaps as well. Of course, it's hard to
foresee and handle all possible mistakes...
The equivalent syntax error for sitemaps (missing closing/next <url> in
<urlset>) is handled.

@Markus: Please open an issue for crawler-commons
  https://github.com/crawler-commons/crawler-commons/issues/

Thanks,
Sebastian





RE: Sitemap URL's concatenated, causing status 14 not found

2018-05-29 Thread Markus Jelsma
Ah, of course, I missed that!

Thanks,
Markus
 


RE: Sitemap URL's concatenated, causing status 14 not found

2018-05-25 Thread Yossi Tamari
Hi Markus,

I don’t believe this is a valid sitemapindex. Each <sitemap> should include
exactly one <loc>.
See also https://www.sitemaps.org/protocol.html#index and
https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
I agree that this is not the ideal error behaviour, but I guess the code
was written on the assumption that the document is valid and conformant.

Yossi.
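
The rule above (exactly one <loc> per <sitemap>) can be checked before a sitemap index is handed to a crawler. What follows is only a minimal sketch and is not part of Nutch or crawler-commons: the function name find_bad_sitemap_entries and the sample document are hypothetical, and the ns2-prefixed element names in the sample are assumed from the sitemaps.org schema, because the archived copy of the XML lost its tags.

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol, in ElementTree's {uri}tag form.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sample: one <sitemap> entry that wrongly holds two <loc> children.
BROKEN_INDEX = """<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<ns2:sitemap>
<ns2:loc>https://www.saxion.nl/opleidingen-sitemap.xml</ns2:loc>
<ns2:loc>https://www.saxion.nl/content-sitemap.xml</ns2:loc>
</ns2:sitemap>
</ns2:sitemapindex>"""


def find_bad_sitemap_entries(xml_text):
    """Return (position, locs) for every <sitemap> that does not hold exactly one <loc>."""
    root = ET.fromstring(xml_text)
    problems = []
    for position, entry in enumerate(root.findall(NS + "sitemap")):
        locs = [(loc.text or "").strip() for loc in entry.findall(NS + "loc")]
        if len(locs) != 1:
            problems.append((position, locs))
    return problems
```

Calling find_bad_sitemap_entries(BROKEN_INDEX) reports the first entry together with both of its URLs; an index in which every <sitemap> wraps a single <loc> yields an empty list.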

> -Original Message-
> From: Markus Jelsma 
> Sent: 25 May 2018 23:45
> To: User 
> Subject: Sitemap URL's concatenated, causing status 14 not found
> 
> Hello,
> 
> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> Nutch thinks those two sitemap URLs are actually one, consisting of both
> concatenated.
> 
> Here is https://www.saxion.nl/sitemap.xml
> 
> 
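> <ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
> <ns2:sitemap>
> <ns2:loc>https://www.saxion.nl/opleidingen-sitemap.xml</ns2:loc>
> <ns2:loc>https://www.saxion.nl/content-sitemap.xml</ns2:loc>
> </ns2:sitemap>
> </ns2:sitemapindex>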
> 
> This seems fine, but Nutch attempts, and obviously fails, to load:
> 
> 2018-05-25 16:27:50,515 ERROR [Thread-30]
> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
> Status code: 14 for https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> 
> What is going on here? Why does Nutch, or CC's sitemap util, behave like this?
> 
> Thanks,
> Markus
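
One plausible way the concatenated URL in the log above could arise is a SAX-style handler that buffers <loc> text and only emits it when the enclosing <sitemap> element closes. This is an assumption for illustration, not a reading of the crawler-commons source. The sketch uses Python's standard xml.sax module; the class names, the helper extract_locs, and the sample document (with assumed ns2-prefixed tags) are all hypothetical.

```python
import xml.sax

# Hypothetical sample: two <loc> elements inside a single <sitemap> entry.
BROKEN_INDEX = """<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<ns2:sitemap>
<ns2:loc>https://www.saxion.nl/opleidingen-sitemap.xml</ns2:loc>
<ns2:loc>https://www.saxion.nl/content-sitemap.xml</ns2:loc>
</ns2:sitemap>
</ns2:sitemapindex>"""


def local_name(qname):
    """Strip an optional namespace prefix such as 'ns2:' from an element name."""
    return qname.split(":")[-1]


class NaiveLocHandler(xml.sax.ContentHandler):
    """Buffers <loc> text and emits it only when </sitemap> is reached."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._buffer = []
        self._in_loc = False

    def _flush(self):
        # Emit whatever has been buffered as one URL, then reset the buffer.
        url = "".join(self._buffer).strip()
        self._buffer = []
        if url:
            self.urls.append(url)

    def startElement(self, name, attrs):
        if local_name(name) == "loc":
            self._in_loc = True

    def characters(self, content):
        if self._in_loc:
            self._buffer.append(content)

    def endElement(self, name):
        if local_name(name) == "loc":
            self._in_loc = False
        elif local_name(name) == "sitemap":
            self._flush()


class TolerantLocHandler(NaiveLocHandler):
    """Emits each URL as soon as its own </loc> is reached."""

    def endElement(self, name):
        if local_name(name) == "loc":
            self._in_loc = False
            self._flush()


def extract_locs(xml_text, handler):
    """Parse xml_text with the given handler and return the URLs it collected."""
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.urls
```

With the naive handler, extract_locs(BROKEN_INDEX, NaiveLocHandler()) returns a single string in which the two URLs are glued together, matching the shape of the URL in the error message. The tolerant variant returns the two URLs separately, which is the kind of leniency towards non-conforming input discussed in this thread.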