Hi Sebastian,

One thing I noticed is that when I tested the robots.txt with RobotRulesParser
(in org.apache.nutch.protocol) against the following URL:

http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
it gave me this output:

2014-06-02 18:27:16,949 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing robots.txt for /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
2014-06-02 18:27:16,954 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
2014-06-02 18:27:16,955 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*

allowed: http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=

This directly contradicts what happened when I ran the crawl script with the
same URL as my seed URL. I got this in my crawlDb:

http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
Version: 7
Status: 3 (db_gone)
Fetch time: Thu Jul 17 18:05:47 PDT 2014
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata:
  _pst_=robots_denied(18), lastModified=0

Is this a
bug in crawler-commons 0.3? When you test the macys robots.txt file with
RobotRulesParser it allows the URL, but when you run the same macys URL as a
seed URL in the crawl script, the URL is denied.

On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Luke, hi Nima,
>
> > The Robot Exclusion Standard does not mention anything about the "*"
> > character in the Disallow: statement.
> Indeed the RFC draft [1] does not. However, since Google [2] does, wildcard
> patterns are frequently used in robots.txt. With crawler-commons 0.4 [3]
> these rules are also followed by Nutch (to be in versions 1.9 resp. 2.3).
>
> But the error message is about the noindex lines:
>   noindex: *natuzzi*
> These lines are redundant (and also invalid, I suppose): if a page/URL is
> disallowed, it's not fetched at all, and will hardly slip into the index.
> I think you can ignore the warning.
>
> > One might also question the crawl-delay setting of 120 seconds, but
> > that's another issue...
> Yeah, it will take very long to crawl the site.
> With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>30</value>
>   <description>
>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>   seconds) then the fetcher will skip this page, generating an error report.
>   If set to -1 the fetcher will never skip such pages and will wait the
>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>   might be.
>   </description>
> </property>
>
> Cheers,
> Sebastian
>
> [1] http://www.robotstxt.org/norobots-rfc.txt
> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
>
> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> > From wikipedia:
> > The Robot Exclusion Standard does not mention anything about the "*"
> > character in the Disallow: statement. Some crawlers like Googlebot
> > recognize strings containing "*", while MSNbot and Teoma interpret it
> > in different ways.
> >
> > So the 'problem' is with Macy's. Really, there is no problem for you:
> > presumably that line is just ignored from robots.txt.
> >
> > One might also question the crawl-delay setting of 120 seconds, but
> > that's another issue...
> >
> > On 31/05/2014 12:16 AM, Nima Falaki wrote:
> >> Hello Everyone:
> >>
> >> Just have a question about an issue I discovered while trying to crawl
> >> the macys robots.txt. I am using Nutch 1.8 and tried crawler-commons 0.3
> >> and crawler-commons 0.4.
> >> This is the robots.txt file from macys:
> >>
> >> User-agent: *
> >> Crawl-delay: 120
> >> Disallow: /compare
> >> Disallow: /registry/wedding/compare
> >> Disallow: /catalog/product/zoom.jsp
> >> Disallow: /search
> >> Disallow: /shop/search
> >> Disallow: /shop/registry/wedding/search
> >> Disallow: *natuzzi*
> >> noindex: *natuzzi*
> >> Disallow: *Natuzzi*
> >> noindex: *Natuzzi*
> >> Disallow: /bag/add*
> >>
> >> When I run this robots.txt through the RobotRulesParser with this URL
> >> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
> >> I get the following warnings:
> >>
> >> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser
> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> >> robots.txt file (size 672): noindex: *natuzzi*
> >>
> >> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser
> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> >> robots.txt file (size 672): noindex: *Natuzzi*
> >>
> >> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> >> robots.txt file (size 672): noindex: *natuzzi*
> >>
> >> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> >> robots.txt file (size 672): noindex: *Natuzzi*
> >>
> >> Is there anything I can do to solve this problem? Is this a problem
> >> with Nutch, or does macys.com have a really bad robots.txt file?
> >>
> >> Nima Falaki
> >> Software Engineer
> >> [email protected]

--
Nima Falaki
Software Engineer
[email protected]
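[Editor's note: since the thread turns on how "*" wildcards in Disallow lines
are matched (Google-style matching, supported by crawler-commons 0.4 per
Sebastian's reply), here is a minimal self-contained sketch of that kind of
matching. It is an illustration only, not the actual crawler-commons or Nutch
implementation; the class and method names are hypothetical.]

```java
// Sketch of Google-style '*' wildcard matching for robots.txt Disallow
// rules. Illustration only; not the real crawler-commons implementation.
import java.util.regex.Pattern;

public class WildcardRuleSketch {

    // Translate a robots.txt pattern into a regex: '*' matches any
    // sequence of characters, every other character is matched literally.
    // A Disallow rule is a prefix match, so only the start is anchored.
    static boolean matches(String pattern, String path) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        regex.append(".*"); // prefix match: anything may follow the rule
        return path.matches(regex.toString());
    }

    public static void main(String[] args) {
        String tee = "/shop/product/inc-international-concepts-dont-forget-me"
                   + "-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=";
        // The t-shirt URL contains no "natuzzi", so the wildcard rule
        // does not apply and the URL stays allowed:
        System.out.println(matches("*natuzzi*", tee));                  // false
        // A natuzzi URL would be caught by the same rule:
        System.out.println(matches("*natuzzi*", "/shop/natuzzi-sofa")); // true
        // Trailing-wildcard rules behave like plain prefix rules:
        System.out.println(matches("/bag/add*", "/bag/add?item=1"));    // true
    }
}
```

Under this kind of matching the t-shirt URL is allowed, which is consistent
with the RobotRulesParser output above; one possible source of the
robots_denied status, suggested by Sebastian's quoted config description, is
the Crawl-delay of 120 seconds exceeding fetcher.max.crawl.delay, which makes
the fetcher skip the page.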