I've been able to run the ParserChecker now, but I'm not sure how to interpret the results. Here's what I got:
# bin/nutch org.apache.nutch.parse.ParserChecker http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
--------- Url ---------------
http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 1
  outlink: toUrl: GR:32:A:128 anchor:
Content Metadata: ETag="1fa962a-56f20-485df79c50980" Date=Fri, 30 Sep 2011 19:54:14 GMT Content-Length=356128 Last-Modified=Wed, 05 May 2010 21:26:14 GMT Content-Type=text/xml Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
Parse Metadata: Content-Type=application/xml

Curl also retrieves this file, and yet I can't get my crawl to pick it up. Could it be an issue with robots.txt? The robots.txt for this site reads as follows:

User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /

That first user-agent is, as near as I can tell, what I'm sending. My log shows the following:

2011-09-30 15:54:17,712 INFO http.Http - http.agent = PHFAWS/Nutch-1.3 (American Institute of Physics: Physics History Finding Aids Web Site; http://www.aip.org/history/nbl/findingaids.html; [email protected])

Can anyone tell what I'm missing? Thanks.

Chip

-----Original Message-----
From: Chip Calhoun [mailto:[email protected]]
Sent: Thursday, September 29, 2011 4:12 PM
To: [email protected]
Subject: RE: What could be blocking me, if not robots.txt?

Ah, sorry. I had already deleted the local copy from my server (aip.org) to avoid clutter. So yeah, that will definitely 404 now. Curl retrieves the whole file with no problems.

I can't try the ParserChecker today, as I'm stuck away from my own machine, but I will try it tomorrow. The fact that I can curl it at least tells me this is a problem I need to fix in Nutch.
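A side check, not from the thread itself: Python's stdlib `urllib.robotparser` is not the parser Nutch uses, but it illustrates how a robots.txt group named with a version-suffixed token ("PHFAWS/Nutch-1.3") can fail to match the crawler's agent name and fall through to the catch-all "Disallow: /". The URL and robots.txt below are copied from the messages above; everything else is illustration only.

```python
# Sketch: evaluating the quoted robots.txt with Python's stdlib parser.
# This is NOT Nutch's robots implementation; it only demonstrates that
# agent-token matching is fragile when the group name carries "/version".
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /
"""

URL = "http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml"

rp = RobotFileParser()
rp.modified()  # mark rules as loaded so can_fetch() evaluates them
rp.parse(ROBOTS.splitlines())

# The versioned group token never matches the agent name "PHFAWS",
# so the "*" group (Disallow: /) wins and the fetch is refused.
print(rp.can_fetch("PHFAWS/Nutch-1.3", URL))   # False under this matcher

# With a bare agent name in robots.txt, the group matches and the
# empty Disallow permits everything.
rp2 = RobotFileParser()
rp2.modified()
rp2.parse(ROBOTS.replace("PHFAWS/Nutch-1.3", "PHFAWS").splitlines())
print(rp2.can_fetch("PHFAWS/Nutch-1.3", URL))  # True
```

If Nutch's matcher behaves similarly, listing the bare agent name (and setting it in http.robots.agents) rather than the name/version token would be worth trying.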
Chip

________________________________________
From: Markus Jelsma [[email protected]]
Sent: Thursday, September 29, 2011 1:01 PM
To: [email protected]
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

Oh, it's a 404. That makes sense.

> Hi everyone,
>
> I'm using Nutch to crawl a few friendly sites, and am having trouble
> with some of them. One site in particular has created an exception for
> me in its robots.txt, and yet I can't crawl any of its pages. I've
> tried copying the files I want to index (3 XML documents) to my own
> server and crawling that, and it works fine that way; so something is
> keeping me from indexing any files on this other site.
>
> I compared the logs of my attempt to crawl the friendly site with my
> attempt to crawl my own site, and I've found few differences. Most
> differences come from the fact that my own site requires a crawlDelay,
> so there are many log sections along the lines of:
>
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   maxThreads    = 1
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   inProgress    = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   crawlDelay    = 5000
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   minCrawlDelay = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   nextFetchTime = 1317308262122
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   now           = 1317308257529
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
>
> That strikes me as probably irrelevant, but I figured I should mention it.
> The main difference I see in the logs is that the crawl of my own site
> (the crawl that worked) has the following two lines which do not
> appear in the log of my failed crawl:
>
> 2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> 2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
>
> Also, while my successful crawl has three lines like the following, my
> failed one only has two:
>
> 2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
>
> Can anyone think of something I might have missed?
>
> Chip

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, September 29, 2011 1:00 PM
To: [email protected]
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

> Hi everyone,
>
> I'm using Nutch to crawl a few friendly sites, and am having trouble
> with some of them. One site in particular has created an exception for
> me in its robots.txt, and yet I can't crawl any of its pages. I've
> tried copying the files I want to index (3 XML documents) to my own
> server and crawling that, and it works fine that way; so something is
> keeping me from indexing any files on this other site.
>
> I compared the logs of my attempt to crawl the friendly site with my
> attempt to crawl my own site, and I've found few differences.
> Most differences come from the fact that my own site requires a crawlDelay,
> so there are many log sections along the lines of:
>
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   maxThreads    = 1
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   inProgress    = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   crawlDelay    = 5000
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   minCrawlDelay = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   nextFetchTime = 1317308262122
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   now           = 1317308257529
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
>
> That strikes me as probably irrelevant, but I figured I should mention it.
>
> The main difference I see in the logs is that the crawl of my own site
> (the crawl that worked) has the following two lines which do not
> appear in the log of my failed crawl:
>
> 2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> 2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature

If this doesn't pop up when crawling the site, that means it's not fetched (properly). Can you try using the ParserChecker to download it? Can you use curl? The fetcher should throw an exception if there's trouble, but it may also be stopped by the http.content.limit setting.
> Also, while my successful crawl has three lines like the following, my
> failed one only has two:
>
> 2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
>
> Can anyone think of something I might have missed?
>
> Chip
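One concrete thing worth checking from Markus's http.content.limit remark: the ParseData output earlier in the thread reports Content-Length=356128, while Nutch's default http.content.limit is 65536 bytes, so the fetched XML would be truncated at the default setting. A possible nutch-site.xml override is sketched below; the property name is real in Nutch 1.3, but the 1 MB value is only an example (-1 disables the limit entirely).

```xml
<!-- nutch-site.xml: raise the per-page download cap above the 356 KB file.
     Example value; -1 means no limit. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
  <description>Maximum number of bytes to download per page.</description>
</property>
```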

