Hello Everyone:

I have a question about an issue I discovered while trying to crawl the
macys.com robots.txt. I am using Nutch 1.8 and tried both crawler-commons 0.3
and crawler-commons 0.4. This is the robots.txt file from macys.com:

User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /registry/wedding/compare
Disallow: /catalog/product/zoom.jsp
Disallow: /search
Disallow: /shop/search
Disallow: /shop/registry/wedding/search
Disallow: *natuzzi*
noindex: *natuzzi*
Disallow: *Natuzzi*
noindex: *Natuzzi*
Disallow:  /bag/add*


When I run this robots.txt through the RobotsRulesParser with this URL
(http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=),
I get the following warnings (the "exceptions" turn out to be WARN log lines):

2014-05-30 17:02:20,570 WARN  robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) -      Unknown line in
robots.txt file (size 672): noindex: *natuzzi*

2014-05-30 17:02:20,571 WARN  robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) -      Unknown line in
robots.txt file (size 672): noindex: *Natuzzi*

2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) -      Unknown line in
robots.txt file (size 672): noindex: *natuzzi*

2014-05-30 17:02:20,574 WARN  robots.SimpleRobotRulesParser
(SimpleRobotRulesParser.java:reportWarning(456)) -      Unknown line in
robots.txt file (size 672): noindex: *Natuzzi*

Is there anything I can do to solve this problem? Is this a problem
with Nutch, or does macys.com have a really bad robots.txt file?
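For what it's worth, the warnings seem to come from the parser not recognizing
the `noindex` directive at all. Here is a toy sketch (not the actual
crawler-commons code; the set of known directives is my assumption) that
classifies lines the same way the log output suggests: known directives are
consumed, and anything else, like `noindex`, just produces an "Unknown line"
warning:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RobotsDirectiveCheck {
    // Directives a typical robots.txt parser understands.
    // Note: "noindex" is deliberately NOT in this set.
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "user-agent", "disallow", "allow", "crawl-delay", "sitemap"));

    // Returns the warnings a parser like this would emit for unrecognized lines.
    static List<String> unknownLines(String robotsTxt) {
        List<String> warnings = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            String trimmed = line.trim();
            int colon = trimmed.indexOf(':');
            // Skip blank lines, comments, and lines with no "directive:" shape.
            if (trimmed.isEmpty() || trimmed.startsWith("#") || colon < 0) {
                continue;
            }
            String directive = trimmed.substring(0, colon).trim().toLowerCase();
            if (!KNOWN.contains(directive)) {
                warnings.add("Unknown line in robots.txt file: " + trimmed);
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        String robotsTxt = String.join("\n",
                "User-agent: *",
                "Crawl-delay: 120",
                "Disallow: *natuzzi*",
                "noindex: *natuzzi*",
                "noindex: *Natuzzi*");
        // Only the two noindex lines should be flagged.
        for (String w : unknownLines(robotsTxt)) {
            System.out.println(w);
        }
    }
}
```

If that reading is right, the Disallow/Crawl-delay rules would still be
honored and the warnings would be cosmetic, but I'd like to confirm.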




 <http://www.popsugar.com>
Nima Falaki
Software Engineer
[email protected]
