I had the same issue with Macy's robots.txt; I was never able to crawl it. I think its robots.txt file is broken.
Sent from my HTC

----- Reply message -----
From: "Sebastian Nagel" <[email protected]>
To: <[email protected]>
Subject: Problem with crawling macys robots.txt
Date: Sun, Jun 1, 2014 3:53 PM

Hi Luke, hi Nima,

> The /Robot Exclusion Standard/ does not mention anything about the "*"
> character in the |Disallow:| statement.

Indeed, the RFC draft [1] does not. However, since Google [2] does, wildcard patterns are frequently used in robots.txt files. With crawler-commons 0.4 [3] these rules are also followed by Nutch (as of versions 1.9 and 2.3, respectively).

But the error message is about the noindex lines:

noindex: *natuzzi*

These lines are redundant (and also invalid, I suppose): if a page/URL is disallowed, it is not fetched at all and will hardly slip into the index. I think you can ignore the warning.

> One might also question the crawl-delay setting of 120 seconds, but that's
> another issue...

Yeah, it will take very long to crawl the site. With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>
    If the Crawl-Delay in robots.txt is set to greater than this value (in
    seconds) then the fetcher will skip this page, generating an error report.
    If set to -1 the fetcher will never skip such pages and will wait the
    amount of time retrieved from robots.txt Crawl-Delay, however long that
    might be.
  </description>
</property>

Cheers,
Sebastian

[1] http://www.robotstxt.org/norobots-rfc.txt
[2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt

On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> From Wikipedia:
> The /Robot Exclusion Standard/ does not mention anything about the "*"
> character in the |Disallow:| statement. Some crawlers like Googlebot
> recognize strings containing "*", while MSNbot and Teoma interpret it in
> different ways.
>
> So the 'problem' is with Macy's.
> Really, there is no problem for you: presumably that line is just
> ignored from robots.txt.
>
> One might also question the crawl-delay setting of 120 seconds, but that's
> another issue...
>
> On 31/05/2014 12:16 AM, Nima Falaki wrote:
>> Hello everyone,
>>
>> I just have a question about an issue I discovered while trying to crawl
>> the macys robots.txt. I am using Nutch 1.8 and tried both crawler-commons
>> 0.3 and crawler-commons 0.4. This is the robots.txt file from macys:
>>
>> User-agent: *
>> Crawl-delay: 120
>> Disallow: /compare
>> Disallow: /registry/wedding/compare
>> Disallow: /catalog/product/zoom.jsp
>> Disallow: /search
>> Disallow: /shop/search
>> Disallow: /shop/registry/wedding/search
>> Disallow: *natuzzi*
>> noindex: *natuzzi*
>> Disallow: *Natuzzi*
>> noindex: *Natuzzi*
>> Disallow: /bag/add*
>>
>> When I run this robots.txt through the RobotsRulesParser with this URL
>> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
>> I get the following exceptions:
>>
>> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
>> robots.txt file (size 672): noindex: *natuzzi*
>>
>> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
>> robots.txt file (size 672): noindex: *Natuzzi*
>>
>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
>> robots.txt file (size 672): noindex: *natuzzi*
>>
>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
>> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
>> robots.txt file (size 672): noindex: *Natuzzi*
>>
>> Is there anything I can do to solve this problem? Is this a problem with
>> Nutch, or does macys.com have a really bad robots.txt file?
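For what it's worth, the Google-style wildcard matching that crawler-commons 0.4 applies can be sketched in a few lines of Python. This is a hypothetical re-implementation for illustration, not the actual crawler-commons code: "*" in a pattern matches any character sequence, "$" anchors the end of the path, and every other character is matched literally from the start of the URL path. Unknown directives such as "noindex:" simply never become rules, which is why they only produce warnings.

```python
import re

def rule_to_regex(pattern):
    # Translate a Google-style robots.txt path pattern into a regex
    # that matches from the start of the URL path: '*' matches any
    # sequence of characters, '$' anchors the end of the path.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "$":
            parts.append("$")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def is_disallowed(path, disallow_patterns):
    # A path is disallowed if any Disallow pattern matches it.
    return any(rule_to_regex(p).match(path) for p in disallow_patterns)

rules = ["/compare", "/search", "*natuzzi*", "*Natuzzi*", "/bag/add*"]
print(is_disallowed("/shop/product/natuzzi-sofa?ID=123", rules))       # True
print(is_disallowed("/shop/product/some-t-shirt?ID=1430219", rules))   # False
```

Under this interpretation the t-shirt URL from the original question is allowed (no pattern matches it), while any URL containing "natuzzi" is blocked by the wildcard rule alone, with or without the noindex lines.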
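The effect of the fetcher.max.crawl.delay property Sebastian describes can be sketched as follows (a hypothetical helper for illustration, not actual Nutch code): with the default-style cap, a 120-second Crawl-Delay causes pages to be skipped; with -1 the fetcher waits the full delay, however long.

```python
def effective_crawl_delay(robots_delay, max_crawl_delay=30):
    # Mirror the semantics of Nutch's fetcher.max.crawl.delay property:
    # -1 means always honor the robots.txt Crawl-Delay, however long;
    # otherwise pages whose delay exceeds the cap are skipped (None).
    if max_crawl_delay == -1:
        return robots_delay          # wait as long as robots.txt asks
    if robots_delay > max_crawl_delay:
        return None                  # skip the page, report an error
    return robots_delay

print(effective_crawl_delay(120))      # None -> pages skipped at the default cap
print(effective_crawl_delay(120, -1))  # 120  -> wait two minutes per fetch
```

So for macys.com, either raise fetcher.max.crawl.delay above 120 (or set it to -1) and accept a very slow crawl, or leave the cap in place and accept that the site's pages are skipped.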

