Hi Sebastian,

Sorry, I forgot to reply to the list.

I remember enabling debug logging once before and finding that the parsing of 
robots.txt stops after it finds the entry relevant to the crawler ID. Is the 
sitemap information displayed there too?

It would also be great if someone could test it on 2.x, which should be very 
quick. I'm positive that there is something specific to MSCDirect that's 
blocking the sitemap extraction; other sites work.
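
For a quick check outside of Nutch, something like the sketch below should show 
whether crawler-commons by itself picks up the Sitemap: lines from that file. 
The agent name "nutch" and the class name SitemapCheck are just my assumptions; 
it only needs crawler-commons on the classpath:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.util.List;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class SitemapCheck {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "https://www.mscdirect.com/robots.txt";
        // Fetch robots.txt with plain java.net (Nutch itself goes through
        // its protocol plugins instead).
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = new URL(robotsUrl).openStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        // Parse with crawler-commons; "nutch" is just my guess for the
        // agent name passed by the crawler.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(robotsUrl,
                buf.toByteArray(), "text/plain", "nutch");
        // Sitemap: directives should be collected regardless of which
        // user-agent group was matched.
        List<String> sitemaps = rules.getSitemaps();
        System.out.println("sitemaps: " + sitemaps);
    }
}

If getSitemaps() comes back empty here too, the problem is in crawler-commons; 
if the URL shows up, it's somewhere on the Nutch side.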

Thanks!
Michael

> On Aug 18, 2017, at 12:41, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> 
> Hi Michael,
> 
> yes, I tried the mentioned sitemap with crawler-commons. The sitemap URL was 
> detected in the robots.txt file. It needs some more debugging. The problem 
> for me: I don't know 2.x from running any production crawler, so it will 
> take longer for me to get into it.
> 
> But would you mind moving all discussions to user@nutch? It's important
> to keep them public, as a sort of documentation.
> 
> Thanks,
> Sebastian
> 
> 
>> On 08/18/2017 08:10 PM, Michael Chen wrote:
>> Could you check it for mscdirect.com? Some documentation on sitemaps suggests 
>> that there should be a blank line before sitemap entries, which MSCDirect's 
>> robots.txt doesn't have. It might also have something to do with the 
>> crawler ID?
>> 
>> Please let me know if I can provide you with any additional information.
>> 
>> Thank you!
>> 
>> Michael
>> 
>> 
>>> On 08/18/2017 06:16 AM, Sebastian Nagel wrote:
>>> Hi Michael,
>>> 
>>> I've checked crawler-commons, which is used for robots.txt parsing (the 
>>> recent version and also 0.5, as used by Nutch 2.x). It seems to work, but 
>>> it needs a closer look to find where the problem is.
>>> 
>>> Best,
>>> Sebastian
>>> 
>>>> On 08/18/2017 03:40 AM, Michael Chen wrote:
>>>> Hi,
>>>> 
>>>> I've been unable to detect the sitemap for 
>>>> https://www.mscdirect.com/robots.txt. I did some searching, and I think it 
>>>> might be due to the line spacing format of their robots.txt. I tried 
>>>> user-agent=Googlebot, but that didn't help either. Could someone reproduce 
>>>> the problem?
>>>> 
>>>> Thanks!
>>>> 
>>>> Michael
>>>> 
>> 
> 
