Out of curiosity, what if one needs to set politeness rules that are more realistic? I.e., if I want to cap the crawl-delay at a certain maximum value regardless of what a particular site specifies, which Java class should I be looking to change, assuming this cannot be achieved using the config parameters? Thanks.
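As of Nutch 1.x the Crawl-Delay handling, including the comparison against fetcher.max.crawl.delay, lives in the fetcher (org.apache.nutch.fetcher.Fetcher), which would be the place to start looking. A minimal sketch of a capped-delay policy is below; note that this class and its method names are hypothetical illustrations, not Nutch API, and capping below the site's requested delay violates the politeness rules discussed later in this thread.

```java
// Hypothetical sketch of a capped politeness delay: obey robots.txt
// Crawl-Delay when present, but never wait longer than a configured
// maximum. Nutch's stock behavior (skip the page, or honor the full
// delay with fetcher.max.crawl.delay = -1) differs from this.
public class CrawlDelayPolicy {

    private final long maxDelayMs;     // hard cap, e.g. 30_000 ms
    private final long defaultDelayMs; // used when robots.txt sets no Crawl-Delay

    public CrawlDelayPolicy(long maxDelayMs, long defaultDelayMs) {
        this.maxDelayMs = maxDelayMs;
        this.defaultDelayMs = defaultDelayMs;
    }

    /** robotsDelayMs < 0 means robots.txt did not specify a Crawl-Delay. */
    public long effectiveDelayMs(long robotsDelayMs) {
        if (robotsDelayMs < 0) {
            return defaultDelayMs;
        }
        // Cap the site's requested delay instead of skipping the page.
        return Math.min(robotsDelayMs, maxDelayMs);
    }
}
```

With a 30-second cap, a site requesting 120 seconds (like macys.com below) would be fetched every 30 seconds.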
On Tue, Jun 3, 2014 at 5:52 PM, Sebastian Nagel <[email protected]> wrote:

> > though, I wonder if anyone uses Nutch in production and how they overcome
> > this limitation being imposed by sites like macys.com where they have a
> > Crawl-Delay specified?
>
> If you follow the rules of politeness, there will be no way to overcome the
> crawl-delay from robots.txt: crawling will be horribly slow. So slow that
> completeness and freshness seem unreachable targets. But maybe that's
> exactly the intention of the site owner.
>
> On 06/03/2014 04:29 PM, S.L wrote:
> > That's a good piece of info, Nima. It means you won't be able to crawl
> > more than 720 pages in 24 hrs. This sounds like a pretty serious
> > limitation, though; I wonder if anyone uses Nutch in production and how
> > they overcome this limitation being imposed by sites like macys.com
> > where they have a Crawl-Delay specified?
> >
> > On Tue, Jun 3, 2014 at 3:24 AM, Nima Falaki <[email protected]> wrote:
> >
> >> Never mind, I figured it out: I adjusted my fetcher.max.crawl.delay
> >> accordingly and it solved the issue. Macys.com has a Crawl-delay of 120;
> >> Nutch by default has a max crawl delay of 30, so I had to change that
> >> and it worked. You must either set fetcher.max.crawl.delay to -1
> >> (something I don't recommend, but I did for example purposes) or to over
> >> 120 in order to crawl macys.com:
> >>
> >> <property>
> >>   <name>fetcher.max.crawl.delay</name>
> >>   <value>-1</value>
> >>   <description>
> >>     If the Crawl-Delay in robots.txt is set to greater than this value
> >>     (in seconds) then the fetcher will skip this page, generating an
> >>     error report. If set to -1 the fetcher will never skip such pages
> >>     and will wait the amount of time retrieved from robots.txt
> >>     Crawl-Delay, however long that might be.
> >>   </description>
> >> </property>
> >>
> >> On Mon, Jun 2, 2014 at 6:31 PM, Nima Falaki <[email protected]> wrote:
> >>
> >>> Hi Sebastian:
> >>>
> >>> One thing I noticed is that when I tested the robots.txt with
> >>> RobotsRulesParser (in org.apache.nutch.protocol) against the following URL
> >>>
> >>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> >>>
> >>> it gave me this message:
> >>>
> >>> 2014-06-02 18:27:16,949 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing robots.txt for /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
> >>> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >>> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >>> 2014-06-02 18:27:16,954 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >>> 2014-06-02 18:27:16,955 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >>>
> >>> allowed: http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> >>>
> >>> This is directly contrary to what happened when I ran the crawl script
> >>> with that same URL as my seed URL. I got this in my crawlDB:
> >>>
> >>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> >>> Version: 7
> >>> Status: 3 (db_gone)
> >>> Fetch time: Thu Jul 17 18:05:47 PDT 2014
> >>> Modified time: Wed Dec 31 16:00:00 PST 1969
> >>> Retries since fetch: 0
> >>> Retry interval: 3888000 seconds (45 days)
> >>> Score: 1.0
> >>> Signature: null
> >>> Metadata:
> >>>   _pst_=robots_denied(18), lastModified=0
> >>>
> >>> Is this a bug in crawler-commons 0.3, where when you test the macys
> >>> robots.txt file with RobotsRulesParser it allows the URL, but when you
> >>> run the macys URL as a seed URL in the crawl script it denies it?
> >>>
> >>> On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <[email protected]> wrote:
> >>>
> >>>> Hi Luke, hi Nima,
> >>>>
> >>>>> The Robot Exclusion Standard does not mention anything about the "*"
> >>>>> character in the Disallow: statement.
> >>>>
> >>>> Indeed the RFC draft [1] does not. However, since Google [2] does,
> >>>> wildcard patterns are frequently used in robots.txt. With
> >>>> crawler-commons 0.4 [3] these rules are also followed by Nutch (to be
> >>>> in versions 1.9 resp. 2.3).
> >>>>
> >>>> But the error message is about the noindex lines:
> >>>>   noindex: *natuzzi*
> >>>> These lines are redundant (and also invalid, I suppose): if a page/URL
> >>>> is disallowed, it's not fetched at all, and will hardly slip into the
> >>>> index.
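The "Unknown line" warnings quoted above come from the parser recognizing only a fixed set of robots.txt directives and warning on (but otherwise ignoring) everything else, such as "noindex". The sketch below illustrates that warn-and-skip behavior; it is a simplified illustration, not the crawler-commons SimpleRobotRulesParser implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch: classify robots.txt lines into known directives
// and unknown ones. A real parser (e.g. SimpleRobotRulesParser) logs a
// warning for unknown directives like "noindex" and keeps parsing, so
// such lines never affect the allow/disallow decision.
public class RobotsLineCheck {

    private static final Set<String> KNOWN =
        Set.of("user-agent", "disallow", "allow", "crawl-delay", "sitemap");

    /** Returns the directive names that would trigger an "Unknown line" warning. */
    public static List<String> unknownDirectives(String robotsTxt) {
        List<String> unknown = new ArrayList<>();
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue; // skip blank lines, comments, and malformed lines
            }
            String directive =
                line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (!KNOWN.contains(directive)) {
                unknown.add(directive);
            }
        }
        return unknown;
    }
}
```

Running this over the macys.com robots.txt shown later in the thread would flag only the four "noindex" lines, matching the four warnings in the log.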
> >>>> I think you can ignore the warning.
> >>>>
> >>>>> One might also question the crawl-delay setting of 120 seconds, but
> >>>>> that's another issue...
> >>>>
> >>>> Yeah, it will take very long to crawl the site. With Nutch the
> >>>> property "fetcher.max.crawl.delay" needs to be adjusted:
> >>>>
> >>>> <property>
> >>>>   <name>fetcher.max.crawl.delay</name>
> >>>>   <value>30</value>
> >>>>   <description>
> >>>>     If the Crawl-Delay in robots.txt is set to greater than this value
> >>>>     (in seconds) then the fetcher will skip this page, generating an
> >>>>     error report. If set to -1 the fetcher will never skip such pages
> >>>>     and will wait the amount of time retrieved from robots.txt
> >>>>     Crawl-Delay, however long that might be.
> >>>>   </description>
> >>>> </property>
> >>>>
> >>>> Cheers,
> >>>> Sebastian
> >>>>
> >>>> [1] http://www.robotstxt.org/norobots-rfc.txt
> >>>> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> >>>> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
> >>>>
> >>>> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> >>>>> From Wikipedia: The Robot Exclusion Standard does not mention
> >>>>> anything about the "*" character in the Disallow: statement. Some
> >>>>> crawlers like Googlebot recognize strings containing "*", while
> >>>>> MSNbot and Teoma interpret it in different ways.
> >>>>>
> >>>>> So the 'problem' is with Macy's. Really, there is no problem for you:
> >>>>> presumably that line is just ignored from robots.txt.
> >>>>>
> >>>>> One might also question the crawl-delay setting of 120 seconds, but
> >>>>> that's another issue...
> >>>>>
> >>>>> On 31/05/2014 12:16 AM, Nima Falaki wrote:
> >>>>>> Hello Everyone:
> >>>>>>
> >>>>>> I just have a question about an issue I discovered while trying to
> >>>>>> crawl the macys robots.txt. I am using Nutch 1.8 and tried both
> >>>>>> crawler-commons 0.3 and crawler-commons 0.4. This is the robots.txt
> >>>>>> file from macys:
> >>>>>>
> >>>>>> User-agent: *
> >>>>>> Crawl-delay: 120
> >>>>>> Disallow: /compare
> >>>>>> Disallow: /registry/wedding/compare
> >>>>>> Disallow: /catalog/product/zoom.jsp
> >>>>>> Disallow: /search
> >>>>>> Disallow: /shop/search
> >>>>>> Disallow: /shop/registry/wedding/search
> >>>>>> Disallow: *natuzzi*
> >>>>>> noindex: *natuzzi*
> >>>>>> Disallow: *Natuzzi*
> >>>>>> noindex: *Natuzzi*
> >>>>>> Disallow: /bag/add*
> >>>>>>
> >>>>>> When I run this robots.txt through the RobotsRulesParser with this URL
> >>>>>> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
> >>>>>> I get the following exceptions:
> >>>>>>
> >>>>>> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >>>>>> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >>>>>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
> >>>>>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
> >>>>>>
> >>>>>> Is there anything I can do to solve this problem? Is this a problem
> >>>>>> with Nutch, or does macys.com have a really bad robots.txt file?
> >>>>>>
> >>>>>> Nima Falaki
> >>>>>> Software Engineer
> >>>>>> [email protected]
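The throughput limit quoted earlier in the thread (at most 720 pages in 24 hours for macys.com) follows directly from the 120-second Crawl-Delay, assuming a single polite fetcher doing one fetch per delay interval:

```java
// Back-of-envelope check of the figure quoted in the thread: with a
// Crawl-Delay of 120 seconds and one fetch per delay interval, a
// single polite fetcher gets at most 86400 / 120 = 720 pages per day.
public class CrawlThroughput {

    public static long maxPagesPerDay(long crawlDelaySeconds) {
        long secondsPerDay = 24 * 60 * 60; // 86400
        return secondsPerDay / crawlDelaySeconds;
    }

    public static void main(String[] args) {
        System.out.println(maxPagesPerDay(120)); // prints 720
    }
}
```

By comparison, Nutch's default fetcher.delay of a few seconds would allow tens of thousands of pages per day from the same host, which is why a 120-second Crawl-Delay feels like such a severe restriction.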

