> though, I wonder if anyone uses Nutch in production and how they overcome
> this limitation being imposed by sites like macys.com where they have a
> Crawl-Delay specified?
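For reference, the 720-pages-per-day figure discussed in this thread is just the length of a day divided by the Crawl-Delay. A minimal sketch of the arithmetic (the 120-second delay is the one quoted from macys.com's robots.txt below):

```python
# Back-of-the-envelope crawl budget under a robots.txt Crawl-Delay.
# With one fetch every `delay` seconds from a single polite fetcher,
# the daily page budget for the host is simply:

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def daily_page_budget(crawl_delay_seconds: int) -> int:
    """Pages fetchable from one host in 24 hours at the given delay."""
    return SECONDS_PER_DAY // crawl_delay_seconds

print(daily_page_budget(120))  # Crawl-Delay: 120 -> 720 pages/day
```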
If you follow the rules of politeness, there is no way to overcome the
crawl-delay from robots.txt: crawling will be horribly slow. So slow that
completeness and freshness seem unreachable targets. But maybe that's
exactly the intention of the site owner.

On 06/03/2014 04:29 PM, S.L wrote:
> That's a good piece of info, Nima. It means you won't be able to crawl
> more than 720 pages in 24 hrs, which sounds like a pretty serious
> limitation though. I wonder if anyone uses Nutch in production and how
> they overcome this limitation being imposed by sites like macys.com where
> they have a Crawl-Delay specified?
>
> On Tue, Jun 3, 2014 at 3:24 AM, Nima Falaki <[email protected]> wrote:
>
>> Never mind, I figured it out. I adjusted my fetcher.max.crawl.delay
>> accordingly and it solved the issue. Macys.com has a crawl-delay of 120;
>> Nutch by default has a maximum crawl delay of 30, so I had to change
>> that and it worked. You must either set the max crawl delay to -1
>> (something I don't recommend, but I did for example purposes), or to
>> over 120 (for macys.com), in order to crawl macys.com:
>>
>> <property>
>>   <name>fetcher.max.crawl.delay</name>
>>   <value>-1</value>
>>   <description>
>>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>>   seconds) then the fetcher will skip this page, generating an error report.
>>   If set to -1 the fetcher will never skip such pages and will wait the
>>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>>   might be.
>>   </description>
>> </property>
>>
>> On Mon, Jun 2, 2014 at 6:31 PM, Nima Falaki <[email protected]> wrote:
>>
>>> Hi Sebastian:
>>>
>>> One thing I noticed is that when I tested the robots.txt with
>>> RobotsRulesParser, which is in org.apache.nutch.protocol, with the
>>> following URL
>>>
>>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>>>
>>> it gave me this message:
>>>
>>> 2014-06-02 18:27:16,949 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing robots.txt for /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
>>> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
>>> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
>>> 2014-06-02 18:27:16,954 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
>>> 2014-06-02 18:27:16,955 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
>>>
>>> allowed: http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>>>
>>> This is in direct contradiction to what happened when I ran the crawl
>>> script with
>>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>>> as my
>>> seed URL.
>>>
>>> I got this in my crawlDB:
>>>
>>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>>> Version: 7
>>> Status: 3 (db_gone)
>>> Fetch time: Thu Jul 17 18:05:47 PDT 2014
>>> Modified time: Wed Dec 31 16:00:00 PST 1969
>>> Retries since fetch: 0
>>> Retry interval: 3888000 seconds (45 days)
>>> Score: 1.0
>>> Signature: null
>>> Metadata:
>>>   _pst_=robots_denied(18), lastModified=0
>>>
>>> Is this a bug in crawler-commons 0.3, where testing the Macys
>>> robots.txt file with RobotRulesParser allows the URL, but running the
>>> same Macys URL as a seed URL in the crawl script denies it?
>>>
>>> On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <[email protected]> wrote:
>>>
>>>> Hi Luke, hi Nima,
>>>>
>>>>> The Robot Exclusion Standard does not mention anything about the
>>>>> "*" character in the Disallow: statement.
>>>> Indeed, the RFC draft [1] does not. However, since Google [2] does,
>>>> wildcard patterns are frequently used in robots.txt. With
>>>> crawler-commons 0.4 [3] these rules are also followed by Nutch (to be
>>>> in versions 1.9 resp. 2.3).
>>>>
>>>> But the error message is about the noindex lines:
>>>>   noindex: *natuzzi*
>>>> These lines are redundant (and also invalid, I suppose):
>>>> if a page/URL is disallowed, it's not fetched at all,
>>>> and will hardly slip into the index.
>>>> I think you can ignore the warning.
>>>>
>>>>> One might also question the crawl-delay setting of 120 seconds, but
>>>>> that's another issue...
>>>> Yeah, it will take very long to crawl the site.
>>>> With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:
>>>>
>>>> <property>
>>>>   <name>fetcher.max.crawl.delay</name>
>>>>   <value>30</value>
>>>>   <description>
>>>>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>>>>   seconds) then the fetcher will skip this page, generating an error
>>>>   report.
>>>>   If set to -1 the fetcher will never skip such pages and will wait the
>>>>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>>>>   might be.
>>>>   </description>
>>>> </property>
>>>>
>>>> Cheers,
>>>> Sebastian
>>>>
>>>> [1] http://www.robotstxt.org/norobots-rfc.txt
>>>> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
>>>> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
>>>>
>>>> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
>>>>> From Wikipedia:
>>>>> The Robot Exclusion Standard does not mention anything about the "*"
>>>>> character in the Disallow: statement. Some crawlers like Googlebot
>>>>> recognize strings containing "*", while MSNbot and Teoma interpret it
>>>>> in different ways.
>>>>>
>>>>> So the 'problem' is with Macy's. Really, there is no problem for you:
>>>>> presumably that line is just ignored from robots.txt.
>>>>>
>>>>> One might also question the crawl-delay setting of 120 seconds, but
>>>>> that's another issue...
>>>>>
>>>>> On 31/05/2014 12:16 AM, Nima Falaki wrote:
>>>>>> Hello Everyone:
>>>>>>
>>>>>> Just have a question about an issue I discovered while trying to
>>>>>> crawl the macys robots.txt. I am using Nutch 1.8 and used
>>>>>> crawler-commons 0.3 and crawler-commons 0.4.
>>>>>> This is the robots.txt file from Macys:
>>>>>>
>>>>>> User-agent: *
>>>>>> Crawl-delay: 120
>>>>>> Disallow: /compare
>>>>>> Disallow: /registry/wedding/compare
>>>>>> Disallow: /catalog/product/zoom.jsp
>>>>>> Disallow: /search
>>>>>> Disallow: /shop/search
>>>>>> Disallow: /shop/registry/wedding/search
>>>>>> Disallow: *natuzzi*
>>>>>> noindex: *natuzzi*
>>>>>> Disallow: *Natuzzi*
>>>>>> noindex: *Natuzzi*
>>>>>> Disallow: /bag/add*
>>>>>>
>>>>>> When I run this robots.txt through the RobotsRulesParser with this URL
>>>>>> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
>>>>>> I get the following warnings:
>>>>>>
>>>>>> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
>>>>>> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
>>>>>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
>>>>>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
>>>>>>
>>>>>> Is there anything I can do to solve this problem? Is this a problem
>>>>>> with Nutch, or does macys.com have a really bad robots.txt file?
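As an aside, a robots.txt like this one can also be explored outside Nutch. A minimal sketch using Python's standard-library urllib.robotparser against a trimmed copy of the rules quoted in this thread (note: like the RFC draft, the stdlib parser treats `*natuzzi*` as a literal path prefix rather than a wildcard, and it silently ignores the non-standard noindex lines instead of warning about them):

```python
from urllib import robotparser

# A trimmed copy of the macys.com robots.txt quoted in the thread.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 120
Disallow: /search
Disallow: /shop/search
Disallow: *natuzzi*
noindex: *natuzzi*
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The 120-second delay is exposed to polite clients:
print(rp.crawl_delay("mybot"))                                # 120

# Plain path prefixes are honored:
print(rp.can_fetch("mybot", "http://www1.macys.com/search"))  # False

# A product URL (a shortened stand-in for the one in the thread)
# matches no Disallow rule and is allowed:
print(rp.can_fetch("mybot",
    "http://www1.macys.com/shop/product/some-shirt?ID=1430219"))  # True
```

The unknown `noindex:` lines simply drop out during parsing, which matches Sebastian's advice that the SimpleRobotRulesParser warnings about them can be ignored.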
>>>>>> --
>>>>>> Nima Falaki
>>>>>> Software Engineer
>>>>>> [email protected]
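To summarize the fetcher.max.crawl.delay semantics quoted twice in this thread, here is a small illustrative sketch (the function is my own paraphrase of the property description, not Nutch code): given the Crawl-Delay from robots.txt and the configured maximum, the fetcher either waits out the delay or skips the page.

```python
def effective_delay(robots_crawl_delay: int, max_crawl_delay: int):
    """Paraphrase of the fetcher.max.crawl.delay description above.

    Returns the number of seconds to wait between fetches, or None if
    the page would be skipped (generating an error report, per the
    property description). A max_crawl_delay of -1 means "never skip".
    """
    if max_crawl_delay != -1 and robots_crawl_delay > max_crawl_delay:
        return None  # skipped: Crawl-Delay exceeds the configured maximum
    return robots_crawl_delay  # wait, however long that might be

# macys.com (Crawl-Delay: 120) against the Nutch default maximum of 30:
print(effective_delay(120, 30))   # None -> page skipped
# After raising the maximum above 120, or setting it to -1, as Nima did:
print(effective_delay(120, 130))  # 120
print(effective_delay(120, -1))   # 120
```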

