That's a good piece of info, Nima. It means you won't be able to crawl more than 720 pages in 24 hours, which sounds like a pretty serious limitation. I wonder if anyone uses Nutch in production, and how they overcome this limitation imposed by sites like macys.com that specify a Crawl-Delay?
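For what it's worth, the 720 figure is just the seconds in a day divided by the delay. A quick sanity check (assuming a single fetch thread per host politely honoring the delay, and ignoring fetch time itself):

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86400
crawl_delay = 120                # seconds, from the macys.com robots.txt

# With one fetcher queue per host honoring the delay, the daily budget is:
pages_per_day = SECONDS_PER_DAY // crawl_delay
print(pages_per_day)  # 720
```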
On Tue, Jun 3, 2014 at 3:24 AM, Nima Falaki <[email protected]> wrote:

> Never mind, I figured it out: I adjusted my fetcher.max.crawl.delay
> accordingly and that solved the issue. Macys.com has a crawl-delay of 120,
> and Nutch by default has a maximum crawl delay of 30, so I had to change
> that and it worked. You must either set the value to -1 (something I don't
> recommend, but I did for example purposes) or to over 120 (for macys.com)
> in order to crawl macys.com.
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>-1</value>
>   <description>
>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>   seconds) then the fetcher will skip this page, generating an error report.
>   If set to -1 the fetcher will never skip such pages and will wait the
>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>   might be.
>   </description>
> </property>
>
> On Mon, Jun 2, 2014 at 6:31 PM, Nima Falaki <[email protected]> wrote:
>
> > Hi Sebastian:
> >
> > One thing I noticed is that when I tested the robots.txt with
> > RobotsRulesParser (which is in org.apache.nutch.protocol) against the
> > following URL
> > http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> > it gave me this message:
> >
> > 2014-06-02 18:27:16,949 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing robots.txt for /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
> > 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> > 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> > 2014-06-02 18:27:16,954 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> > 2014-06-02 18:27:16,955 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >
> > allowed:
> > http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> >
> > This is in direct contradiction to what happened when I ran the crawl
> > script with that same URL as my seed URL. I got this in my crawlDB:
> >
> > http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> > Version: 7
> > Status: 3 (db_gone)
> > Fetch time: Thu Jul 17 18:05:47 PDT 2014
> > Modified time: Wed Dec 31 16:00:00 PST 1969
> > Retries since fetch: 0
> > Retry interval: 3888000 seconds (45 days)
> > Score: 1.0
> > Signature: null
> > Metadata:
> >   _pst_=robots_denied(18), lastModified=0
> >
> > Is this a bug in crawler-commons 0.3, where testing the macys robots.txt
> > with RobotRulesParser allows the URL, but running the same URL as a seed
> > in the crawl script denies it?
> >
> > On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <[email protected]> wrote:
> >
> >> Hi Luke, hi Nima,
> >>
> >> > The Robot Exclusion Standard does not mention anything about the
> >> > "*" character in the Disallow: statement.
> >> Indeed, the RFC draft [1] does not. However, since Google [2] does,
> >> wildcard patterns are frequently used in robots.txt. With crawler-commons
> >> 0.4 [3] these rules are also followed by Nutch (to be in versions 1.9
> >> resp. 2.3).
> >>
> >> But the error message is about the noindex lines:
> >>   noindex: *natuzzi*
> >> These lines are redundant (and also invalid, I suppose):
> >> if a page/URL is disallowed, it's not fetched at all,
> >> and will hardly slip into the index.
> >> I think you can ignore the warning.
> >>
> >> > One might also question the crawl-delay setting of 120 seconds, but
> >> > that's another issue...
> >> Yeah, it will take very long to crawl the site.
> >> With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:
> >>
> >> <property>
> >>   <name>fetcher.max.crawl.delay</name>
> >>   <value>30</value>
> >>   <description>
> >>   If the Crawl-Delay in robots.txt is set to greater than this value (in
> >>   seconds) then the fetcher will skip this page, generating an error
> >>   report.
> >>   If set to -1 the fetcher will never skip such pages and will wait the
> >>   amount of time retrieved from robots.txt Crawl-Delay, however long that
> >>   might be.
> >>   </description>
> >> </property>
> >>
> >> Cheers,
> >> Sebastian
> >>
> >> [1] http://www.robotstxt.org/norobots-rfc.txt
> >> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> >> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
> >>
> >> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> >> > From Wikipedia:
> >> > The Robot Exclusion Standard does not mention anything about the
> >> > "*" character in the Disallow: statement. Some crawlers like Googlebot
> >> > recognize strings containing "*", while MSNbot and Teoma interpret it
> >> > in different ways.
> >> >
> >> > So the 'problem' is with Macy's. Really, there is no problem for you:
> >> > presumably that line is just ignored from robots.txt.
> >> >
> >> > One might also question the crawl-delay setting of 120 seconds, but
> >> > that's another issue...
> >> >
> >> > On 31/05/2014 12:16 AM, Nima Falaki wrote:
> >> >> Hello Everyone:
> >> >>
> >> >> Just have a question about an issue I discovered while trying to crawl
> >> >> the macys robots.txt. I am using nutch 1.8 and used crawler-commons 0.3
> >> >> and crawler-commons 0.4.
> >> >> This is the robots.txt file from macys:
> >> >>
> >> >> User-agent: *
> >> >> Crawl-delay: 120
> >> >> Disallow: /compare
> >> >> Disallow: /registry/wedding/compare
> >> >> Disallow: /catalog/product/zoom.jsp
> >> >> Disallow: /search
> >> >> Disallow: /shop/search
> >> >> Disallow: /shop/registry/wedding/search
> >> >> Disallow: *natuzzi*
> >> >> noindex: *natuzzi*
> >> >> Disallow: *Natuzzi*
> >> >> noindex: *Natuzzi*
> >> >> Disallow: /bag/add*
> >> >>
> >> >> When I run this robots.txt through the RobotsRulesParser with this URL
> >> >> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
> >> >> I get the following warnings:
> >> >>
> >> >> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >> >> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >> >> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >> >> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >> >>
> >> >> Is there anything I can do to solve this problem? Is this a problem
> >> >> with Nutch, or does macys.com have a really bad robots.txt file?
> >> >>
> >> >> <http://www.popsugar.com>
> >> >> Nima Falaki
> >> >> Software Engineer
> >> >> [email protected]
> >
> > --
> > <http://www.popsugar.com>
> > Nima Falaki
> > Software Engineer
> > [email protected]
>
> --
> Nima Falaki
> Software Engineer
> [email protected]
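For anyone who wants to poke at the behaviour discussed in this thread without a full Nutch run, here is a rough sketch using Python's standard-library urllib.robotparser rather than crawler-commons, so it is only an approximation: in particular, the stdlib parser does not implement Google-style "*" wildcards (it treats such patterns as literal prefixes), though like crawler-commons it ignores unknown directives such as "noindex:". The agent name "mybot" and the max_crawl_delay mirror of Nutch's skip logic are illustrative, not taken from either library; the directives are copied (abridged) from the macys.com robots.txt quoted above.

```python
import urllib.robotparser

# Abridged copy of the macys.com robots.txt quoted earlier in the thread.
robots_txt = """\
User-agent: *
Crawl-delay: 120
Disallow: /compare
Disallow: /search
Disallow: /shop/search
Disallow: *natuzzi*
noindex: *natuzzi*
"""

rp = urllib.robotparser.RobotFileParser()
rp.modified()  # mark as fetched; without this, can_fetch() always returns False
rp.parse(robots_txt.splitlines())

# Unknown directives like "noindex:" are silently skipped by this parser
# (crawler-commons logs an "Unknown line" warning and then ignores them too).
delay = rp.crawl_delay("mybot")
print("crawl-delay:", delay)  # crawl-delay: 120

# Illustrative mirror of the fetcher.max.crawl.delay check described in the
# property text quoted above (-1 means "never skip, wait however long"):
max_crawl_delay = 30  # Nutch's default, per the thread
if max_crawl_delay != -1 and delay is not None and delay > max_crawl_delay:
    print("fetcher would skip this host")  # what happened with the default 30

print(rp.can_fetch("mybot", "http://www1.macys.com/compare"))        # False
print(rp.can_fetch("mybot", "http://www1.macys.com/shop/product/x")) # True
```

Note the `rp.modified()` call: `can_fetch()` assumes nothing is allowed until the robots.txt has been marked as read, which is easy to trip over when feeding `parse()` a string instead of using `set_url()`/`read()`.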

