Hi Sebastian,

One thing I noticed is that when I tested the robots.txt with RobotRulesParser
(in org.apache.nutch.protocol) against the following URL:

http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
it gave me this output:

2014-06-02 18:27:16,949 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(452)) - Problem processing robots.txt for /Users/nfalaki/shopstyle/apache-nutch-1.8/runtime/local/robots4.txt
2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
2014-06-02 18:27:16,952 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*
2014-06-02 18:27:16,954 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *natuzzi*
2014-06-02 18:27:16,955 WARN robots.SimpleRobotRulesParser (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in robots.txt file (size 672): noindex: *Natuzzi*

allowed: http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=

This directly contradicts what happened when I ran the crawl script with the
same URL as my seed URL. I got this in my crawlDb:

http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=
Version: 7
Status: 3 (db_gone)
Fetch time: Thu Jul 17 18:05:47 PDT 2014
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata:
  _pst_=robots_denied(18), lastModified=0

Is this a
bug in crawler-commons 0.3? When you test the macys robots.txt file with
RobotRulesParser it allows the URL, but when you run the same macys URL as a
seed URL in the crawl script, the URL is denied.

On Sun, Jun 1, 2014 at 12:53 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Luke, hi Nima,
>
> > The Robot Exclusion Standard does not mention anything about the "*"
> > character in the Disallow: statement.
> Indeed the RFC draft [1] does not. However, since Google [2] does, wildcard
> patterns are frequently used in robots.txt. With crawler-commons 0.4 [3]
> these rules are also followed by Nutch (to be in versions 1.9 resp. 2.3).
>
> But the error message is about the noindex lines:
>   noindex: *natuzzi*
> These lines are redundant (and also invalid, I suppose): if a page/URL is
> disallowed, it's not fetched at all, and will hardly slip into the index.
> I think you can ignore the warning.
>
> > One might also question the crawl-delay setting of 120 seconds, but
> > that's another issue...
> Yeah, it will take very long to crawl the site.
> With Nutch the property "fetcher.max.crawl.delay" needs to be adjusted:
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>30</value>
>   <description>
>   If the Crawl-Delay in robots.txt is set to greater than this value (in
>   seconds) then the fetcher will skip this page, generating an error report.
>   If set to -1 the fetcher will never skip such pages and will wait the
>   amount of time retrieved from robots.txt Crawl-Delay, however long that
>   might be.
>   </description>
> </property>
>
> Cheers,
> Sebastian
>
> [1] http://www.robotstxt.org/norobots-rfc.txt
> [2] https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [3] http://crawler-commons.googlecode.com/svn/tags/crawler-commons-0.4/CHANGES.txt
>
> On 05/31/2014 04:27 PM, Luke Mawbey wrote:
> > From wikipedia:
> > The Robot Exclusion Standard does not mention anything about the "*"
> > character in the Disallow: statement. Some crawlers like Googlebot
> > recognize strings containing "*", while MSNbot and Teoma interpret it
> > in different ways.
> >
> > So the 'problem' is with Macy's. Really, there is no problem for you:
> > presumably that line is just ignored from robots.txt.
> >
> > One might also question the crawl-delay setting of 120 seconds, but
> > that's another issue...
> >
> > On 31/05/2014 12:16 AM, Nima Falaki wrote:
> >> Hello Everyone:
> >>
> >> Just have a question about an issue I discovered while trying to crawl
> >> the macys robots.txt. I am using Nutch 1.8 and tried crawler-commons 0.3
> >> and crawler-commons 0.4.
> >> This is the robots.txt file from macys:
> >>
> >> User-agent: *
> >> Crawl-delay: 120
> >> Disallow: /compare
> >> Disallow: /registry/wedding/compare
> >> Disallow: /catalog/product/zoom.jsp
> >> Disallow: /search
> >> Disallow: /shop/search
> >> Disallow: /shop/registry/wedding/search
> >> Disallow: *natuzzi*
> >> noindex: *natuzzi*
> >> Disallow: *Natuzzi*
> >> noindex: *Natuzzi*
> >> Disallow: /bag/add*
> >>
> >> When I run this robots.txt through the RobotRulesParser with this URL
> >> (http://www1.macys.com/shop/product/inc-international-concepts-dont-forget-me-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=)
> >> I get the following warnings:
> >>
> >> 2014-05-30 17:02:20,570 WARN robots.SimpleRobotRulesParser
> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> >> robots.txt file (size 672): noindex: *natuzzi*
> >>
> >> 2014-05-30 17:02:20,571 WARN robots.SimpleRobotRulesParser
> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> >> robots.txt file (size 672): noindex: *Natuzzi*
> >>
> >> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> >> robots.txt file (size 672): noindex: *natuzzi*
> >>
> >> 2014-05-30 17:02:20,574 WARN robots.SimpleRobotRulesParser
> >> (SimpleRobotRulesParser.java:reportWarning(456)) - Unknown line in
> >> robots.txt file (size 672): noindex: *Natuzzi*
> >>
> >> Is there anything I can do to solve this problem? Is this a problem
> >> with Nutch, or does macys.com have a really bad robots.txt file?
> >>
> >> Nima Falaki
> >> Software Engineer
> >> [email protected]

--
Nima Falaki
Software Engineer
[email protected]
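[Editor's note: since the thread turns on how "*" wildcards in Disallow lines
are matched (Google-style matching, supported by crawler-commons 0.4 per
Sebastian's reply), here is a minimal self-contained sketch of that kind of
matching. It is an illustration only, not the actual crawler-commons or Nutch
implementation; the class and method names are hypothetical.]

```java
// Sketch of Google-style '*' wildcard matching for robots.txt Disallow
// rules. Illustration only; not the real crawler-commons implementation.
import java.util.regex.Pattern;

public class WildcardRuleSketch {

    // Translate a robots.txt pattern into a regex: '*' matches any
    // sequence of characters, every other character is matched literally.
    // A Disallow rule is a prefix match, so only the start is anchored.
    static boolean matches(String pattern, String path) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        regex.append(".*"); // prefix match: anything may follow the rule
        return path.matches(regex.toString());
    }

    public static void main(String[] args) {
        String tee = "/shop/product/inc-international-concepts-dont-forget-me"
                   + "-split-t-shirt?ID=1430219&CategoryID=30423&LinkType=";
        // The t-shirt URL contains no "natuzzi", so the wildcard rule
        // does not apply and the URL stays allowed:
        System.out.println(matches("*natuzzi*", tee));                  // false
        // A natuzzi URL would be caught by the same rule:
        System.out.println(matches("*natuzzi*", "/shop/natuzzi-sofa")); // true
        // Trailing-wildcard rules behave like plain prefix rules:
        System.out.println(matches("/bag/add*", "/bag/add?item=1"));    // true
    }
}
```

Under this kind of matching the t-shirt URL is allowed, which is consistent
with the RobotRulesParser output above; one possible source of the
robots_denied status, suggested by Sebastian's quoted config description, is
the Crawl-delay of 120 seconds exceeding fetcher.max.crawl.delay, which makes
the fetcher skip the page.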