Aha! That's done it. Thanks!

Incidentally, I only asked them to add the /Nutch-1.3 because originally I had 
a user-agent of "PHFAWS Spider" and had them add "PHFAWS Spider" to their 
user-agent, and it didn't work. It seems that at least some sites have trouble 
with a user-agent that's more than one word. And I only went with multiple 
words because the tutorial gives " <value>My Nutch Spider</value>" as an 
example. This might be something to warn people about in the documentation.
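For what it's worth, a single-token agent name can be set in nutch-site.xml; a minimal sketch (the property names are the standard ones from nutch-default.xml, the values here are just illustrative):

```xml
<!-- nutch-site.xml: a single-token agent name avoids the multi-word
     matching trouble described above; values are illustrative. -->
<property>
  <name>http.agent.name</name>
  <value>PHFAWS</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>Nutch-1.3</value>
</property>
```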

Chip

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Monday, October 03, 2011 9:42 AM
To: [email protected]
Subject: Re: What could be blocking me, if not robots.txt?

Oh, I misread — your user agent is PHFAWS/Nutch-1.3? Are you sure that's 
what is configured as your user agent name? If your name is PHFAWS, then the 
robots.txt must list it without the /Nutch-1.3 suffix.

Or maybe change the robots.txt to 
> User-agent: PHFAWS/Nutch-1.3
> Allow: /
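Worth noting: how a robots.txt User-agent line is matched against the agent string a crawler sends is parser-dependent. As one concrete illustration (using Python's stdlib parser, since Nutch 1.3 ships its own robots parser which may differ; the URL is just an example), a record naming the full "name/version" token can silently fail to match:

```python
# Illustration of parser-dependent robots.txt matching, using Python's
# stdlib parser. Nutch 1.3 has its own robots parser, which may behave
# differently, so treat this as a sketch of the general issue.
from urllib import robotparser

# A robots.txt that names the full "product/version" token:
ROBOTS = """\
User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# The stdlib parser keeps only the token before "/" from the crawler's
# agent string, so "PHFAWS/Nutch-1.3" reduces to "phfaws", the first
# record never matches, and the request falls through to "User-agent: *":
print(rp.can_fetch("PHFAWS/Nutch-1.3", "http://example.org/doc.xml"))  # False

# With the bare name in robots.txt, the same agent string does match:
rp2 = robotparser.RobotFileParser()
rp2.parse(ROBOTS.replace("PHFAWS/Nutch-1.3", "PHFAWS").splitlines())
print(rp2.can_fetch("PHFAWS/Nutch-1.3", "http://example.org/doc.xml"))  # True
```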


On Monday 03 October 2011 15:31:46 Chip Calhoun wrote:
> I apologize, but I haven't found much Nutch documentation that deals 
> with the user-agent and robots.txt. Why am I being blocked when the 
> user-agent I'm sending matches the user-agent in that robots.txt?
> 
> Chip
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Friday, September 30, 2011 6:28 PM
> To: [email protected]
> Cc: Chip Calhoun
> Subject: Re: What could be blocking me, if not robots.txt?
> 
> > I've been able to run the ParserChecker now, but I'm not sure how to 
> > understand the results. Here's what I got:
> > 
> > # bin/nutch org.apache.nutch.parse.ParserChecker http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
> > ---------------
> > Url
> > ---------------
> > http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
> > ---------
> > ParseData
> > ---------
> > Version: 5
> > Status: success(1,0)
> > Title:
> > Outlinks: 1
> >   outlink: toUrl: GR:32:A:128 anchor:
> > Content Metadata: ETag="1fa962a-56f20-485df79c50980" Date=Fri, 30 Sep 2011 19:54:14 GMT Content-Length=356128 Last-Modified=Wed, 05 May 2010 21:26:14 GMT Content-Type=text/xml Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
> > Parse Metadata: Content-Type=application/xml
> 
> This means almost everything is good to go but...
> 
> > Curl also retrieves this file, and yet I can't get my crawl to pick 
> > it up.
> > 
> > Could it be an issue with robots.txt? The robots file for this site
> > reads as follows:
> > 
> > User-agent: PHFAWS/Nutch-1.3
> > Disallow:
> > 
> > User-agent: archive.org_bot
> > Disallow:
> > 
> > User-agent: *
> > Disallow: /
> 
> This is the problem.
> 
> > That first user-agent is, as near as I can tell, what I'm sending.
> > My log shows the following:
> > 
> > 2011-09-30 15:54:17,712 INFO  http.Http - http.agent = PHFAWS/Nutch-1.3 (American Institute of Physics: Physics History Finding Aids Web Site; http://www.aip.org/history/nbl/findingaids.html; [email protected])
> > 
> > Can anyone tell what I'm missing? Thanks.
> > 
> > Chip
> > 
> > 
> > -----Original Message-----
> > From: Chip Calhoun [mailto:[email protected]]
> > Sent: Thursday, September 29, 2011 4:12 PM
> > To: [email protected]
> > Subject: RE: What could be blocking me, if not robots.txt?
> > 
> > Ah, sorry. I had already deleted the local copy from my server
> > (aip.org) to avoid clutter. So yeah, that will definitely 404 now.
> > 
> > Curl retrieves the whole file with no problems. I can't try the 
> > ParserChecker today as I'm stuck away from my own machine, but I 
> > will try it tomorrow. The fact that I can curl it at least tells me 
> > this is a problem I need to fix in Nutch.
> > 
> > Chip
> > 
> > ________________________________________
> > From: Markus Jelsma [[email protected]]
> > Sent: Thursday, September 29, 2011 1:01 PM
> > To: [email protected]
> > Cc: Chip Calhoun
> > Subject: Re: What could be blocking me, if not robots.txt?
> > 
> > Oh, it's a 404. That makes sense.
> > 
> > > Hi everyone,
> > > 
> > > I'm using Nutch to crawl a few friendly sites, and am having 
> > > trouble with some of them. One site in particular has created an 
> > > exception for me in its robots.txt, and yet I can't crawl any of its 
> > > pages.
> > > I've tried copying the files I want to index (3 XML documents) to 
> > > my own server and crawling that, and it works fine that way; so 
> > > something is keeping me from indexing any files on this other site.
> > > 
> > > I compared the logs of my attempt to crawl the friendly site with 
> > > my attempt to crawl my own site, and I've found few differences. 
> > > Most differences come from the fact that my own site requires a 
> > > crawlDelay, so there are many log sections along the lines of:
> > > 
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - * queue: http://www.aip.org
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   maxThreads    = 1
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   inProgress    = 0
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   crawlDelay    = 5000
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   minCrawlDelay = 0
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   nextFetchTime = 1317308262122
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   now           = 1317308257529
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
> > > 
> > > That strikes me as probably irrelevant, but I figured I should 
> > > mention it. The main difference I see in the logs is that the 
> > > crawl of my own site (the crawl that worked) has the following two 
> > > lines which do not appear in the log of my failed crawl:
> > > 
> > > 2011-09-29 10:57:50,497 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> > > 2011-09-29 10:58:23,559 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > > 
> > > Also, while my successful crawl has three lines like the 
> > > following, my failed one only has two:
> > > 
> > > 2011-09-29 10:58:44,824 WARN  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> > > 
> > > Can anyone think of something I might have missed?
> > > 
> > > Chip
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Thursday, September 29, 2011 1:00 PM
> > To: [email protected]
> > Cc: Chip Calhoun
> > Subject: Re: What could be blocking me, if not robots.txt?
> > 
> > > Hi everyone,
> > > 
> > > I'm using Nutch to crawl a few friendly sites, and am having 
> > > trouble with some of them. One site in particular has created an 
> > > exception for me in its robots.txt, and yet I can't crawl any of its 
> > > pages.
> > > I've tried copying the files I want to index (3 XML documents) to 
> > > my own server and crawling that, and it works fine that way; so 
> > > something is keeping me from indexing any files on this other site.
> > > 
> > > I compared the logs of my attempt to crawl the friendly site with 
> > > my attempt to crawl my own site, and I've found few differences. 
> > > Most differences come from the fact that my own site requires a 
> > > crawlDelay, so there are many log sections along the lines of:
> > > 
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - * queue: http://www.aip.org
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   maxThreads    = 1
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   inProgress    = 0
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   crawlDelay    = 5000
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   minCrawlDelay = 0
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   nextFetchTime = 1317308262122
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   now           = 1317308257529
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> > > 2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
> > > 
> > > That strikes me as probably irrelevant, but I figured I should 
> > > mention it. The main difference I see in the logs is that the 
> > > crawl of my own site (the crawl that worked) has the following two 
> > > lines which do not appear in the log of my failed crawl:
> > > 
> > > 2011-09-29 10:57:50,497 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> > > 2011-09-29 10:58:23,559 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > 
> > If this doesn't pop up when crawling the site, that means it's not
> > fetched (properly). Can you try using the parser checker to download
> > it? Can you use curl? The fetcher should throw an exception if
> > there's trouble, but it may also be stopped by the http.content.limit
> > setting.
> > 
> > > Also, while my successful crawl has three lines like the 
> > > following, my failed one only has two:
> > > 
> > > 2011-09-29 10:58:44,824 WARN  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> > > 
> > > Can anyone think of something I might have missed?
> > > 
> > > Chip
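On the http.content.limit point: the ParserChecker output earlier in the thread reports Content-Length=356128, while the default limit in nutch-default.xml is 65536 bytes, so truncation is at least plausible for these XML files. A sketch of an override in nutch-site.xml (the value here is illustrative):

```xml
<!-- nutch-site.xml: raise the fetch truncation limit above the
     356128-byte Content-Length seen in the ParserChecker output;
     the default in nutch-default.xml is 65536. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
```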

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
