Out of curiosity, what if one needs to set politeness rules that are more realistic? I.e., if I want to cap the crawl-delay at a certain maximum value regardless of what a particular site specifies, which Java class should I be looking to change, assuming this cannot be achieved using the config parameters? Thanks.
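As of Nutch 1.x the Crawl-Delay handling, including the comparison against fetcher.max.crawl.delay, lives in the fetcher (org.apache.nutch.fetcher.Fetcher), which would be the place to start looking. A minimal sketch of a capped-delay policy is below; note that this class and its method names are hypothetical illustrations, not Nutch API, and capping below the site's requested delay violates the politeness rules discussed later in this thread.

```java
// Hypothetical sketch of a capped politeness delay: obey robots.txt
// Crawl-Delay when present, but never wait longer than a configured
// maximum. Nutch's stock behavior (skip the page, or honor the full
// delay with fetcher.max.crawl.delay = -1) differs from this.
public class CrawlDelayPolicy {

    private final long maxDelayMs;     // hard cap, e.g. 30_000 ms
    private final long defaultDelayMs; // used when robots.txt sets no Crawl-Delay

    public CrawlDelayPolicy(long maxDelayMs, long defaultDelayMs) {
        this.maxDelayMs = maxDelayMs;
        this.defaultDelayMs = defaultDelayMs;
    }

    /** robotsDelayMs < 0 means robots.txt did not specify a Crawl-Delay. */
    public long effectiveDelayMs(long robotsDelayMs) {
        if (robotsDelayMs < 0) {
            return defaultDelayMs;
        }
        // Cap the site's requested delay instead of skipping the page.
        return Math.min(robotsDelayMs, maxDelayMs);
    }
}
```

With a 30-second cap, a site requesting 120 seconds (like macys.com below) would be fetched every 30 seconds.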
On Tue, Jun 3, 2014 at 5:52 PM, Sebastian Nagel <[email protected]> wrote:

> > though, I wonder if anyone uses Nutch in production and how they overcome
> > this limitation being imposed by sites like macys.com where they have a
> > Crawl-Delay specified?
>
> If you follow the rules of politeness, there will be no way to overcome the
> crawl-delay from robots.txt: crawling will be horribly slow. So slow that
> completeness and freshness seem unreachable targets. But maybe that's
> exactly the intention of the site owner.
>
> On 06/03/2014 04:29 PM, S.L wrote:
> > That's a good piece of info, Nima. It means you won't be able to crawl
> > more than 720 pages in 24 hrs. This sounds like a pretty serious
> > limitation, though; I wonder if anyone uses Nutch in production and how
> > they overcome this limitation being imposed by sites like macys.com
> > where they have a Crawl-Delay specified?
> >
> > On Tue, Jun 3, 2014 at 3:24 AM, Nima Falaki <[email protected]> wrote:
> >
> >> Never mind, I figured it out: I adjusted my fetcher.max.crawl.delay
> >> accordingly and it solved the issue. Macys.com has a Crawl-delay of 120;
> >> Nutch by default has a max crawl delay of 30, so I had to change that
> >> and it worked. You must either set fetcher.max.crawl.delay to -1
> >> (something I don't recommend, but I did for example purposes) or to over
> >> 120 in order to crawl macys.com:
> >>
> >> <property>
> >>   <name>fetcher.max.crawl.delay</name>
> >>   <value>-1</value>
> >>   <description>
> >>     If the Crawl-Delay in robots.txt is set to greater than this value
> >>     (in seconds) then the fetcher will skip this page, generating an
> >>     error report. If set to -1 the fetcher will never skip such pages
> >>     and will wait the amount of time retrieved from robots.txt
> >>     Crawl-Delay, however long that might be.
> >>   </description>
> >> </property>
> >>
> >> On Mon, Jun 2, 2014 at 6:31 PM, Nima Falaki <[email protected]> wrote:
> >>
> >>> Hi Sebastian:
> >>>
> >>> One thing I noticed is that when I tested the robots.txt with
> >>> RobotsRulesParser (in org.apache.nutch.protocol) against the following URL
> >>>
> >>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> >>>
> >>> it gave me this message:
> >>>
> >>> 2014-06-02 18:27:16,949 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing robots.txt for /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
> >>> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >>> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >>> 2014-06-02 18:27:16,954 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >>> 2014-06-02 18:27:16,955 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >>>
> >>> allowed: http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> >>>
> >>> This is directly contrary to what happened when I ran the crawl script
> >>> with that same URL as my seed URL. I got this in my crawlDB:
> >>>
> >>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> >>> Version: 7
> >>> Status: 3 (db_gone)
> >>> Fetch time: Thu Jul 17 18:05:47 PDT 2014
> >>> Modified time: Wed Dec 31 16:00:00 PST 1969
> >>> Retries since fetch: 0
> >>> Retry interval: 3888000 seconds (45 days)
> >>> Score: 1.0
> >>> Signature: null
> >>> Metadata:
> >>>   _pst_=robots_denied(18), lastModified=0
> >>>
> >>> Is this a bug in crawler-commons 0.3, where when you test the macys
> >>> robots.txt file with RobotsRulesParser it allows the URL, but when you
> >>> run the macys URL as a seed URL in the crawl script it denies it?
> >>>
> >>> On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <[email protected]> wrote:
> >>>
> >>>> Hi Luke, hi Nima,
> >>>>
> >>>>> The Robot Exclusion Standard does not mention anything about the "*"
> >>>>> character in the Disallow: statement.
> >>>>
> >>>> Indeed the RFC draft [1] does not. However, since Google [2] does,
> >>>> wildcard patterns are frequently used in robots.txt. With
> >>>> crawler-commons 0.4 [3] these rules are also followed by Nutch (to be
> >>>> in versions 1.9 resp. 2.3).
> >>>>
> >>>> But the error message is about the noindex lines:
> >>>>   noindex: *natuzzi*
> >>>> These lines are redundant (and also invalid, I suppose): if a page/URL
> >>>> is disallowed, it's not fetched at all, and will hardly slip into the
> >>>> index.
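The "Unknown line" warnings quoted above come from the parser recognizing only a fixed set of robots.txt directives and warning on (but otherwise ignoring) everything else, such as "noindex". The sketch below illustrates that warn-and-skip behavior; it is a simplified illustration, not the crawler-commons SimpleRobotRulesParser implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch: classify robots.txt lines into known directives
// and unknown ones. A real parser (e.g. SimpleRobotRulesParser) logs a
// warning for unknown directives like "noindex" and keeps parsing, so
// such lines never affect the allow/disallow decision.
public class RobotsLineCheck {

    private static final Set<String> KNOWN =
        Set.of("user-agent", "disallow", "allow", "crawl-delay", "sitemap");

    /** Returns the directive names that would trigger an "Unknown line" warning. */
    public static List<String> unknownDirectives(String robotsTxt) {
        List<String> unknown = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue; // skip blank lines, comments, and malformed lines
            }
            String directive =
                line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (!KNOWN.contains(directive)) {
                unknown.add(directive);
            }
        }
        return unknown;
    }
}
```

Running this over the macys.com robots.txt shown later in the thread would flag only the four "noindex" lines, matching the four warnings in the log.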
> >>>> I think you can ignore the warning.
> >>>>
> >>>>> One might also question the crawl-delay setting of 120 seconds, but
> >>>>> that's another issue...
> >>>>
> >>>> Yeah, it will take very long to crawl the site. With Nutch the
> >>>> property "fetcher.max.crawl.delay" needs to be adjusted:
> >>>>
> >>>> <property>
> >>>>   <name>fetcher.max.crawl.delay</name>
> >>>>   <value>30</value>
> >>>>   <description>
> >>>>     If the Crawl-Delay in robots.txt is set to greater than this value
> >>>>     (in seconds) then the fetcher will skip this page, generating an
> >>>>     error report. If set to -1 the fetcher will never skip such pages
> >>>>     and will wait the amount of time retrieved from robots.txt
> >>>>     Crawl-Delay, however long that might be.
> >>>>   </description>
> >>>> </property>
> >>>>
> >>>> Cheers,
> >>>> Sebastian
> >>>>
> >>>> [1] http://www.robotstxt.org/norobots-rfc.txt
> >>>> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> >>>> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
> >>>>
> >>>> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> >>>>> From Wikipedia: The Robot Exclusion Standard does not mention
> >>>>> anything about the "*" character in the Disallow: statement. Some
> >>>>> crawlers like Googlebot recognize strings containing "*", while
> >>>>> MSNbot and Teoma interpret it in different ways.
> >>>>>
> >>>>> So the 'problem' is with Macy's. Really, there is no problem for you:
> >>>>> presumably that line is just ignored from robots.txt.
> >>>>>
> >>>>> One might also question the crawl-delay setting of 120 seconds, but
> >>>>> that's another issue...
> >>>>>
> >>>>> On 31/05/2014 12:16 AM, Nima Falaki wrote:
> >>>>>> Hello Everyone:
> >>>>>>
> >>>>>> I just have a question about an issue I discovered while trying to
> >>>>>> crawl the macys robots.txt. I am using Nutch 1.8 and tried both
> >>>>>> crawler-commons 0.3 and crawler-commons 0.4. This is the robots.txt
> >>>>>> file from macys:
> >>>>>>
> >>>>>> User-agent: *
> >>>>>> Crawl-delay: 120
> >>>>>> Disallow: /compare
> >>>>>> Disallow: /registry/wedding/compare
> >>>>>> Disallow: /catalog/product/zoom.jsp
> >>>>>> Disallow: /search
> >>>>>> Disallow: /shop/search
> >>>>>> Disallow: /shop/registry/wedding/search
> >>>>>> Disallow: *natuzzi*
> >>>>>> noindex: *natuzzi*
> >>>>>> Disallow: *Natuzzi*
> >>>>>> noindex: *Natuzzi*
> >>>>>> Disallow: /bag/add*
> >>>>>>
> >>>>>> When I run this robots.txt through the RobotsRulesParser with this URL
> >>>>>> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
> >>>>>> I get the following exceptions:
> >>>>>>
> >>>>>> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >>>>>> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >>>>>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >>>>>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >>>>>>
> >>>>>> Is there anything I can do to solve this problem? Is this a problem
> >>>>>> with Nutch, or does macys.com have a really bad robots.txt file?
> >>>>>>
> >>>>>> Nima Falaki
> >>>>>> Software Engineer
> >>>>>> [email protected]
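The throughput limit quoted earlier in the thread (at most 720 pages in 24 hours for macys.com) follows directly from the 120-second Crawl-Delay, assuming a single polite fetcher doing one fetch per delay interval:

```java
// Back-of-envelope check of the figure quoted in the thread: with a
// Crawl-Delay of 120 seconds and one fetch per delay interval, a
// single polite fetcher gets at most 86400 / 120 = 720 pages per day.
public class CrawlThroughput {

    public static long maxPagesPerDay(long crawlDelaySeconds) {
        long secondsPerDay = 24 * 60 * 60; // 86400
        return secondsPerDay / crawlDelaySeconds;
    }

    public static void main(String[] args) {
        System.out.println(maxPagesPerDay(120)); // prints 720
    }
}
```

By comparison, Nutch's default fetcher.delay of a few seconds would allow tens of thousands of pages per day from the same host, which is why a 120-second Crawl-Delay feels like such a severe restriction.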

