Never mind, I figured it out: I adjusted fetcher.max.crawl.delay and that
solved the issue. Macys.com sets a Crawl-delay of 120 seconds in its
robots.txt, while Nutch's fetcher.max.crawl.delay defaults to 30, so the
fetcher was skipping the pages. To crawl macys.com you have to either set
fetcher.max.crawl.delay to -1 (which I don't recommend, but did here for
example purposes) or to something above 120.
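If you want to sanity-check a robots.txt and its Crawl-delay outside of a
full crawl, a small standalone program against the crawler-commons API also
works. This is only a sketch, not the exact tool discussed below: the local
robots.txt path, the class name RobotsCheck and the agent name "mycrawler"
are placeholders, and I'm assuming the crawler-commons 0.3/0.4 API
(SimpleRobotRulesParser.parseContent, BaseRobotRules.isAllowed and
getCrawlDelay).

import java.nio.file.Files;
import java.nio.file.Paths;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
  public static void main(String[] args) throws Exception {
    // A locally saved copy of the robots.txt and the URL to test --
    // adjust the path and the agent name to your own setup.
    byte[] content = Files.readAllBytes(Paths.get("robots4.txt"));
    String url = "http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=";

    SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
    BaseRobotRules rules = parser.parseContent(
        "http://www1.macys.com/robots.txt", content, "text/plain", "mycrawler");

    System.out.println("allowed:     " + rules.isAllowed(url));
    // As far as I can tell crawler-commons stores Crawl-delay in
    // milliseconds, so 120 in robots.txt should show up as 120000 here.
    System.out.println("crawl delay: " + rules.getCrawlDelay());
  }
}

For reference, here is the property I ended up changing: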
<property>
 <name>fetcher.max.crawl.delay</name>
 <value>-1</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property>


On Mon, Jun 2, 2014 at 6:31 PM, Nima Falaki <[email protected]> wrote:

> Hi Sebastian:
>
> One thing I noticed is that when I tested the robots.txt with
> RobotsRulesParser, which is in org.apache.nutch.protocol, against the
> following URL
> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>
> it gave me this message:
>
> 2014-06-02 18:27:16,949 WARN robots.SimpleRobotRulesParser (
> SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing
> robots.txt for
> /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
>
> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (
> SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> robots.txt file (size 672): noindex: *natuzzi*
>
> 2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (
> SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> robots.txt file (size 672): noindex: *Natuzzi*
>
> 2014-06-02 18:27:16,954 WARN robots.SimpleRobotRulesParser (
> SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> robots.txt file (size 672): noindex: *natuzzi*
>
> 2014-06-02 18:27:16,955 WARN robots.SimpleRobotRulesParser (
> SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> robots.txt file (size 672): noindex: *Natuzzi*
>
> allowed:
> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
>
> This is in direct contradiction to what happened when I ran the crawl
> script with
> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> as my seed URL. I got this in my crawlDB:
>
> http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
> Version: 7
> Status: 3 (db_gone)
> Fetch time: Thu Jul 17 18:05:47 PDT 2014
> Modified time: Wed Dec 31 16:00:00 PST 1969
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 1.0
> Signature: null
> Metadata:
>  _pst_=robots_denied(18), lastModified=0
>
> Is this a bug in crawler-commons 0.3? When I test the macys robots.txt
> with RobotRulesParser it allows the URL, but when I run the same URL as
> a seed in the crawl script it is denied.
>
> On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <
> [email protected]> wrote:
>
>> Hi Luke, hi Nima,
>>
>> > The Robot Exclusion Standard does not mention anything about the
>> > "*" character in the Disallow: statement.
>> Indeed the RFC draft [1] does not. However, since Google [2] does,
>> wildcard patterns are frequently used in robots.txt.
>> With crawler-commons 0.4 [3] these rules are also followed by Nutch
>> (to be in versions 1.9 resp. 2.3).
>>
>> But the error message is about the noindex lines:
>>   noindex: *natuzzi*
>> These lines are redundant (and also invalid, I suppose):
>> if a page/URL is disallowed, it's not fetched at all,
>> and will hardly slip into the index.
>> I think you can ignore the warning.
>>
>> > One might also question the crawl-delay setting of 120 seconds, but
>> > that's another issue...
>> Yeah, it will take very long to crawl the site.
>> With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:
>>
>> <property>
>>  <name>fetcher.max.crawl.delay</name>
>>  <value>30</value>
>>  <description>
>>  If the Crawl-Delay in robots.txt is set to greater than this value (in
>>  seconds) then the fetcher will skip this page, generating an error
>>  report.
>>  If set to -1 the fetcher will never skip such pages and will wait the
>>  amount of time retrieved from robots.txt Crawl-Delay, however long that
>>  might be.
>>  </description>
>> </property>
>>
>> Cheers,
>> Sebastian
>>
>> [1] http://www.robotstxt.org/norobots-rfc.txt
>> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
>> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
>>
>> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
>> > From wikipedia:
>> > The Robot Exclusion Standard does not mention anything about the "*"
>> > character in the Disallow: statement. Some crawlers like Googlebot
>> > recognize strings containing "*", while MSNbot and Teoma interpret it
>> > in different ways.
>> >
>> > So the 'problem' is with Macy's. Really, there is no problem for you:
>> > presumably that line is just ignored from robots.txt.
>> >
>> > One might also question the crawl-delay setting of 120 seconds, but
>> > that's another issue...
>> >
>> > On 31/05/2014 12:16 AM, Nima Falaki wrote:
>> >> Hello Everyone:
>> >>
>> >> Just have a question about an issue I discovered while trying to
>> >> crawl macys.com. I am using nutch 1.8 and tried both crawler-commons
>> >> 0.3 and crawler-commons 0.4.
>> >> This is the robots.txt file from macys:
>> >>
>> >> User-agent: *
>> >> Crawl-delay: 120
>> >> Disallow: /compare
>> >> Disallow: /registry/wedding/compare
>> >> Disallow: /catalog/product/zoom.jsp
>> >> Disallow: /search
>> >> Disallow: /shop/search
>> >> Disallow: /shop/registry/wedding/search
>> >> Disallow: *natuzzi*
>> >> noindex: *natuzzi*
>> >> Disallow: *Natuzzi*
>> >> noindex: *Natuzzi*
>> >> Disallow: /bag/add*
>> >>
>> >> When I run this robots.txt through the RobotsRulesParser with this url
>> >> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
>> >> I get the following warnings:
>> >>
>> >> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser
>> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
>> >> robots.txt file (size 672): noindex: *natuzzi*
>> >>
>> >> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser
>> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
>> >> robots.txt file (size 672): noindex: *Natuzzi*
>> >>
>> >> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
>> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
>> >> robots.txt file (size 672): noindex: *natuzzi*
>> >>
>> >> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
>> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
>> >> robots.txt file (size 672): noindex: *Natuzzi*
>> >>
>> >> Is there anything I can do to solve this problem? Is this a problem
>> >> with nutch or does macys.com have a really bad robots.txt file?
>> >>
>> >> Nima Falaki
>> >> Software Engineer
>> >> [email protected]
>
> --
> Nima Falaki
> Software Engineer
> [email protected]

--
Nima Falaki
Software Engineer
[email protected]

