Hi Luke, hi Nima,

>     The Robot Exclusion Standard does not mention anything about the "*"
> character in the Disallow: statement.
Indeed, the RFC draft [1] does not. However, since Google [2] supports them,
wildcard patterns are frequently used in robots.txt. With crawler-commons 0.4 [3]
these rules are also followed by Nutch (from versions 1.9 and 2.3, respectively).
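Just to illustrate the idea: Google-style wildcard matching treats "*" as matching any character sequence and a trailing "$" as anchoring the pattern at the end of the path. A minimal Python sketch (this is not the crawler-commons code, only the concept):

```python
import re

def disallow_matches(pattern, path):
    """Check a URL path against a Disallow pattern with Google-style
    wildcards: '*' matches any character sequence, a trailing '$'
    anchors the match at the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn '*' back into '.*'
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    return re.search(regex, path) is not None

# The Macy's rule "Disallow: *natuzzi*" matches any path containing "natuzzi":
print(disallow_matches("*natuzzi*", "/shop/natuzzi-sofa"))  # True
print(disallow_matches("/bag/add*", "/bag/add?id=42"))      # True
print(disallow_matches("/search", "/shop/product"))         # False
```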

But the error message is about the noindex lines:
 noindex: *natuzzi*
These lines are redundant (and also invalid, I suppose):
if a page/URL is disallowed, it is not fetched at all
and will hardly slip into the index.
I think you can ignore the warning.
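That is also why the message is only a warning: a lenient robots.txt parser records unrecognized lines such as "noindex:" and skips them rather than failing. Roughly like this hypothetical sketch (directive set and function name are made up for illustration, not the SimpleRobotRulesParser code):

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "crawl-delay", "sitemap"}

def parse_robots(content):
    """Collect known directives; warn on (and skip) unrecognized
    lines such as 'noindex:' instead of failing."""
    rules, warnings = [], []
    for line in content.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in KNOWN_DIRECTIVES:
            rules.append((key, value.strip()))
        else:
            warnings.append("Unknown line in robots.txt file: " + line)
    return rules, warnings

rules, warnings = parse_robots(
    "User-agent: *\nDisallow: *natuzzi*\nnoindex: *natuzzi*\n")
print(warnings)  # the noindex line is reported, the rest is kept
```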

> One might also question the crawl-delay setting of 120 seconds, but that's
> another issue...
Yeah, at 120 seconds per fetch it will take very long to crawl the site.
With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:

<property>
 <name>fetcher.max.crawl.delay</name>
 <value>30</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>
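The skip logic described in that property boils down to a simple comparison. A hedged Python sketch (function name is mine, values taken from this thread):

```python
def should_skip(robots_crawl_delay, fetcher_max_crawl_delay):
    """Decide whether the fetcher skips a page, following the description
    of fetcher.max.crawl.delay: skip if robots.txt asks for a longer
    delay than tolerated, unless the limit is disabled with -1."""
    if fetcher_max_crawl_delay == -1:
        return False  # never skip; wait however long robots.txt demands
    return robots_crawl_delay > fetcher_max_crawl_delay

# Macy's asks for 120 s; with a 30 s limit the page is skipped:
print(should_skip(120, 30))   # True
print(should_skip(120, 150))  # False
print(should_skip(120, -1))   # False
```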

Cheers,
Sebastian

[1] http://www.robotstxt.org/norobots-rfc.txt
[2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt

On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> From wikipedia:
>     The Robot Exclusion Standard does not mention anything about the "*"
> character in the Disallow: statement. Some crawlers like Googlebot recognize
> strings containing "*", while MSNbot and Teoma interpret it in different ways.
> 
> So the 'problem' is with Macy's. Really, there is no problem for you: 
> presumably that line is just
> ignored from robots.txt.
> 
> One might also question the crawl-delay setting of 120 seconds, but that's
> another issue...
> 
> 
> 
> On 31/05/2014 12:16 AM, Nima Falaki wrote:
>> Hello Everyone:
>>
>> Just have a question about an issue I discovered while trying to crawl the
>> macys robots.txt, I am using nutch 1.8 and used crawler-commons 0.3 and
>> crawler-commons 0.4. This is the robots.txt file from macys
>>
>> User-agent: *
>> Crawl-delay: 120
>> Disallow: /compare
>> Disallow: /registry/wedding/compare
>> Disallow: /catalog/product/zoom.jsp
>> Disallow: /search
>> Disallow: /shop/search
>> Disallow: /shop/registry/wedding/search
>> Disallow: *natuzzi*
>> noindex: *natuzzi*
>> Disallow: *Natuzzi*
>> noindex: *Natuzzi*
>> Disallow:  /bag/add*
>>
>>
>> When I run this robots.txt through the RobotsRulesParser with this url
>> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
>>
>> I get the following exceptions
>>
>> 2014-05-30 17:02:20,570 WARN  robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) -     Unknown line in
>> robots.txt file (size 672): noindex: *natuzzi*
>>
>> 2014-05-30 17:02:20,571 WARN  robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) -     Unknown line in
>> robots.txt file (size 672): noindex: *Natuzzi*
>>
>> 2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) -     Unknown line in
>> robots.txt file (size 672): noindex: *natuzzi*
>>
>> 2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) -     Unknown line in
>> robots.txt file (size 672): noindex: *Natuzzi*
>>
>> Is there anything I can do to solve this problem? Is this a problem
>> with nutch or does macys.com have a really bad robots.txt file?
>>
>>
>>
>>
>> Nima Falaki
>> Software Engineer
>> [email protected]
>>
> 
> 
