> though, I wonder if anyone uses Nutch in production and how they overcome
> this limitation being imposed by sites like macys.com where they have a
> Crawl-Delay specified?
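For reference, the 720-pages-per-day figure discussed in this thread is just the length of a day divided by the Crawl-Delay. A minimal sketch of the arithmetic (the 120-second delay is the one quoted from macys.com's robots.txt below):

```python
# Back-of-the-envelope crawl budget under a robots.txt Crawl-Delay.
# With one fetch every `delay` seconds from a single polite fetcher,
# the daily page budget for the host is simply:

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def daily_page_budget(crawl_delay_seconds: int) -> int:
    """Pages fetchable from one host in 24 hours at the given delay."""
    return SECONDS_PER_DAY // crawl_delay_seconds

print(daily_page_budget(120))  # Crawl-Delay: 120 -> 720 pages/day
```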
If you follow the rules of politeness, there is no way to overcome the
crawl-delay from robots.txt: crawling will be horribly slow. So slow that
completeness and freshness seem unreachable targets. But maybe that's
exactly the intention of the site owner.

On 06/03/2014 04:29 PM, S.L wrote:
> That's a good piece of info, Nima. It means you won't be able to crawl
> more than 720 pages in 24 hrs, which sounds like a pretty serious
> limitation though. I wonder if anyone uses Nutch in production and how
> they overcome this limitation being imposed by sites like macys.com where
> they have a Crawl-Delay specified?
>
> On Tue, Jun 3, 2014 at 3:24 AM, Nima Falaki <[email protected]> wrote:
>
>> Never mind, I figured it out. I adjusted my fetcher.max.crawl.delay
>> accordingly and it solved the issue. Macys.com has a crawl-delay of 120;
>> Nutch by default has a maximum crawl delay of 30, so I had to change
>> that and it worked. You must either set the max crawl delay to -1
>> (something I don't recommend, but I did for example purposes), or to
>> over 120 (for macys.com), in order to crawl macys.com:
>>
>> <property>
>>   <name>fetcher.max.crawl.delay</name>
>>   <value>-1</value>
>>   <description>
>>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>>   seconds) then the fetcher will skip this page, generating an error report.
>>   If set to -1 the fetcher will never skip such pages and will wait the
>>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>>   might be.
>>   </description>
>> </property>
>>
>> On Mon, Jun 2, 2014 at 6:31 PM, Nima Falaki <[email protected]> wrote:
>>
>>> Hi Sebastian:
>>>
>>> One thing I noticed is that when I tested the robots.txt with
>>> RobotsRulesParser, which is in org.apache.nutch.protocol, with the
>>> following URL
>>>
>>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>>>
>>> it gave me this message:
>>>
>>> 2014-06-02 18:27:16,949 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing robots.txt for /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
>>> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
>>> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
>>> 2014-06-02 18:27:16,954 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
>>> 2014-06-02 18:27:16,955 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
>>>
>>> allowed: http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>>>
>>> This is in direct contradiction to what happened when I ran the crawl
>>> script with
>>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>>> as my
>>> seed URL.
>>>
>>> I got this in my crawlDB:
>>>
>>> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>>> Version: 7
>>> Status: 3 (db_gone)
>>> Fetch time: Thu Jul 17 18:05:47 PDT 2014
>>> Modified time: Wed Dec 31 16:00:00 PST 1969
>>> Retries since fetch: 0
>>> Retry interval: 3888000 seconds (45 days)
>>> Score: 1.0
>>> Signature: null
>>> Metadata:
>>>   _pst_=robots_denied(18), lastModified=0
>>>
>>> Is this a bug in crawler-commons 0.3, where testing the Macys
>>> robots.txt file with RobotRulesParser allows the URL, but running the
>>> same Macys URL as a seed URL in the crawl script denies it?
>>>
>>> On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <[email protected]> wrote:
>>>
>>>> Hi Luke, hi Nima,
>>>>
>>>>> The Robot Exclusion Standard does not mention anything about the
>>>>> "*" character in the Disallow: statement.
>>>> Indeed, the RFC draft [1] does not. However, since Google [2] does,
>>>> wildcard patterns are frequently used in robots.txt. With
>>>> crawler-commons 0.4 [3] these rules are also followed by Nutch (to be
>>>> in versions 1.9 resp. 2.3).
>>>>
>>>> But the error message is about the noindex lines:
>>>>   noindex: *natuzzi*
>>>> These lines are redundant (and also invalid, I suppose):
>>>> if a page/URL is disallowed, it's not fetched at all,
>>>> and will hardly slip into the index.
>>>> I think you can ignore the warning.
>>>>
>>>>> One might also question the crawl-delay setting of 120 seconds, but
>>>>> that's another issue...
>>>> Yeah, it will take very long to crawl the site.
>>>> With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:
>>>>
>>>> <property>
>>>>   <name>fetcher.max.crawl.delay</name>
>>>>   <value>30</value>
>>>>   <description>
>>>>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>>>>   seconds) then the fetcher will skip this page, generating an error
>>>>   report.
>>>>   If set to -1 the fetcher will never skip such pages and will wait the
>>>>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>>>>   might be.
>>>>   </description>
>>>> </property>
>>>>
>>>> Cheers,
>>>> Sebastian
>>>>
>>>> [1] http://www.robotstxt.org/norobots-rfc.txt
>>>> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
>>>> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
>>>>
>>>> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
>>>>> From Wikipedia:
>>>>> The Robot Exclusion Standard does not mention anything about the "*"
>>>>> character in the Disallow: statement. Some crawlers like Googlebot
>>>>> recognize strings containing "*", while MSNbot and Teoma interpret it
>>>>> in different ways.
>>>>>
>>>>> So the 'problem' is with Macy's. Really, there is no problem for you:
>>>>> presumably that line is just ignored from robots.txt.
>>>>>
>>>>> One might also question the crawl-delay setting of 120 seconds, but
>>>>> that's another issue...
>>>>>
>>>>> On 31/05/2014 12:16 AM, Nima Falaki wrote:
>>>>>> Hello Everyone:
>>>>>>
>>>>>> Just have a question about an issue I discovered while trying to
>>>>>> crawl the macys robots.txt. I am using Nutch 1.8 and used
>>>>>> crawler-commons 0.3 and crawler-commons 0.4.
>>>>>> This is the robots.txt file from Macys:
>>>>>>
>>>>>> User-agent: *
>>>>>> Crawl-delay: 120
>>>>>> Disallow: /compare
>>>>>> Disallow: /registry/wedding/compare
>>>>>> Disallow: /catalog/product/zoom.jsp
>>>>>> Disallow: /search
>>>>>> Disallow: /shop/search
>>>>>> Disallow: /shop/registry/wedding/search
>>>>>> Disallow: *natuzzi*
>>>>>> noindex: *natuzzi*
>>>>>> Disallow: *Natuzzi*
>>>>>> noindex: *Natuzzi*
>>>>>> Disallow: /bag/add*
>>>>>>
>>>>>> When I run this robots.txt through the RobotsRulesParser with this URL
>>>>>> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
>>>>>> I get the following warnings:
>>>>>>
>>>>>> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
>>>>>> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
>>>>>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
>>>>>> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
>>>>>>
>>>>>> Is there anything I can do to solve this problem? Is this a problem
>>>>>> with Nutch, or does macys.com have a really bad robots.txt file?
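As an aside, a robots.txt like this one can also be explored outside Nutch. A minimal sketch using Python's standard-library urllib.robotparser against a trimmed copy of the rules quoted in this thread (note: like the RFC draft, the stdlib parser treats `*natuzzi*` as a literal path prefix rather than a wildcard, and it silently ignores the non-standard noindex lines instead of warning about them):

```python
from urllib import robotparser

# A trimmed copy of the macys.com robots.txt quoted in the thread.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 120
Disallow: /search
Disallow: /shop/search
Disallow: *natuzzi*
noindex: *natuzzi*
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The 120-second delay is exposed to polite clients:
print(rp.crawl_delay("mybot"))                                # 120

# Plain path prefixes are honored:
print(rp.can_fetch("mybot", "http://www1.macys.com/search"))  # False

# A product URL (a shortened stand-in for the one in the thread)
# matches no Disallow rule and is allowed:
print(rp.can_fetch("mybot",
    "http://www1.macys.com/shop/product/some-shirt?ID=1430219"))  # True
```

The unknown `noindex:` lines simply drop out during parsing, which matches Sebastian's advice that the SimpleRobotRulesParser warnings about them can be ignored.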
>>>>>> --
>>>>>> Nima Falaki
>>>>>> Software Engineer
>>>>>> [email protected]
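To summarize the fetcher.max.crawl.delay semantics quoted twice in this thread, here is a small illustrative sketch (the function is my own paraphrase of the property description, not Nutch code): given the Crawl-Delay from robots.txt and the configured maximum, the fetcher either waits out the delay or skips the page.

```python
def effective_delay(robots_crawl_delay: int, max_crawl_delay: int):
    """Paraphrase of the fetcher.max.crawl.delay description above.

    Returns the number of seconds to wait between fetches, or None if
    the page would be skipped (generating an error report, per the
    property description). A max_crawl_delay of -1 means "never skip".
    """
    if max_crawl_delay != -1 and robots_crawl_delay > max_crawl_delay:
        return None  # skipped: Crawl-Delay exceeds the configured maximum
    return robots_crawl_delay  # wait, however long that might be

# macys.com (Crawl-Delay: 120) against the Nutch default maximum of 30:
print(effective_delay(120, 30))   # None -> page skipped
# After raising the maximum above 120, or setting it to -1, as Nima did:
print(effective_delay(120, 130))  # 120
print(effective_delay(120, -1))   # 120
```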

