Aha! That's done it. Thanks! Incidentally, I only asked them to add the /Nutch-1.3 because originally I had a user-agent of "PHFAWS Spider" and had them add "PHFAWS Spider" to their robots.txt, and it didn't work. It seems that at least some sites' robots.txt handling has trouble with a user-agent of more than one word. And I only went with multiple words because the tutorial gives "<value>My Nutch Spider</value>" as an example. This might be something to warn people about in the documentation.
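For anyone who lands on this thread later: the name in question is the http.agent.name property in conf/nutch-site.xml. A minimal sketch of the single-word setup, using the PHFAWS name from this thread (any single token should behave the same way; this is illustrative, not Chip's verbatim configuration):

  <!-- conf/nutch-site.xml: single-token agent name. Multi-word values
       like "My Nutch Spider" are legal in Nutch but, as noted above,
       some sites' robots.txt matching appears to choke on them. -->
  <property>
    <name>http.agent.name</name>
    <value>PHFAWS</value>
  </property>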
Chip

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Monday, October 03, 2011 9:42 AM
To: [email protected]
Subject: Re: What could be blocking me, if not robots.txt?

Oh, I misread: your user agent is PHFAWS/Nutch-1.3? Are you sure that's what is configured as your user agent name? If your name is PHFAWS, then the robots.txt must list your name without the /Nutch-1.3. Or maybe change the robots.txt to

> User-agent: PHFAWS/Nutch-1.3
> Allow: /

On Monday 03 October 2011 15:31:46 Chip Calhoun wrote:
> I apologize, but I haven't found much Nutch documentation that deals
> with the user-agent and robots.txt. Why am I being blocked when the
> user-agent I'm sending matches the user-agent in that robots.txt?
>
> Chip
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Friday, September 30, 2011 6:28 PM
> To: [email protected]
> Cc: Chip Calhoun
> Subject: Re: What could be blocking me, if not robots.txt?
>
> > I've been able to run the ParserChecker now, but I'm not sure how to
> > understand the results. Here's what I got:
> >
> > # bin/nutch org.apache.nutch.parse.ParserChecker http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
> > ---------
> > Url
> > ---------------
> > http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
> > ---------
> > ParseData
> > ---------
> > Version: 5
> > Status: success(1,0)
> > Title:
> > Outlinks: 1
> >   outlink: toUrl: GR:32:A:128 anchor:
> > Content Metadata: ETag="1fa962a-56f20-485df79c50980" Date=Fri, 30 Sep 2011 19:54:14 GMT Content-Length=356128 Last-Modified=Wed, 05 May 2010 21:26:14 GMT Content-Type=text/xml Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
> > Parse Metadata: Content-Type=application/xml
>
> This means almost everything is good to go, but...
>
> > Curl also retrieves this file, and yet I can't get my crawl to pick
> > it up.
> >
> > Could it be an issue with robots.txt? The robots file for this site
> > reads as follows:
> >
> > User-agent: PHFAWS/Nutch-1.3
> > Disallow:
> >
> > User-agent: archive.org_bot
> > Disallow:
> >
> > User-agent: *
> > Disallow: /
>
> This is the problem.
>
> > That first user-agent is, as near as I can tell, what I'm sending.
> > My log shows the following:
> >
> > 2011-09-30 15:54:17,712 INFO http.Http - http.agent = PHFAWS/Nutch-1.3 (American Institute of Physics: Physics History Finding Aids Web Site; http://www.aip.org/history/nbl/findingaids.html; [email protected])
> >
> > Can anyone tell what I'm missing? Thanks.
> >
> > Chip
> >
> > -----Original Message-----
> > From: Chip Calhoun [mailto:[email protected]]
> > Sent: Thursday, September 29, 2011 4:12 PM
> > To: [email protected]
> > Subject: RE: What could be blocking me, if not robots.txt?
> >
> > Ah, sorry. I had already deleted the local copy from my server
> > (aip.org) to avoid clutter. So yeah, that will definitely 404 now.
> >
> > Curl retrieves the whole file with no problems. I can't try the
> > ParserChecker today as I'm stuck away from my own machine, but I
> > will try it tomorrow. The fact that I can curl it at least tells me
> > this is a problem I need to fix in Nutch.
> >
> > Chip
> >
> > ________________________________________
> > From: Markus Jelsma [[email protected]]
> > Sent: Thursday, September 29, 2011 1:01 PM
> > To: [email protected]
> > Cc: Chip Calhoun
> > Subject: Re: What could be blocking me, if not robots.txt?
> >
> > Oh, it's a 404. That makes sense.
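Some background on the name/version split Markus describes: Nutch assembles the full User-Agent header from several http.agent.* properties, which is how "PHFAWS" and "Nutch-1.3" end up combined in the log line above as "PHFAWS/Nutch-1.3 (description; url; email)", while robots.txt checks are made against the bare agent name (see the http.robots.agents property). A hedged reconstruction of the relevant nutch-site.xml fragment, inferred from that log line rather than copied from Chip's actual config:

  <!-- An assumption inferred from "http.agent = PHFAWS/Nutch-1.3 (...)";
       not Chip's verbatim configuration. -->
  <property>
    <name>http.agent.name</name>
    <value>PHFAWS</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>Nutch-1.3</value>
  </property>
  <property>
    <!-- Names checked against robots.txt User-agent lines, in decreasing
         order of precedence; keeping the bare name first matches what the
         friendly site has whitelisted. -->
    <name>http.robots.agents</name>
    <value>PHFAWS,*</value>
  </property>

With the site's robots.txt listing the bare name (for example "User-agent: PHFAWS" followed by "Disallow:" or "Allow: /"), the mismatch Markus points out should go away.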
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Thursday, September 29, 2011 1:00 PM
> > To: [email protected]
> > Cc: Chip Calhoun
> > Subject: Re: What could be blocking me, if not robots.txt?
> >
> > > Hi everyone,
> > >
> > > I'm using Nutch to crawl a few friendly sites, and am having
> > > trouble with some of them. One site in particular has created an
> > > exception for me in its robots.txt, and yet I can't crawl any of
> > > its pages. I've tried copying the files I want to index (3 XML
> > > documents) to my own server and crawling that, and it works fine
> > > that way; so something is keeping me from indexing any files on
> > > this other site.
> > >
> > > I compared the logs of my attempt to crawl the friendly site with
> > > my attempt to crawl my own site, and I've found few differences.
> > > Most differences come from the fact that my own site requires a
> > > crawlDelay, so there are many log sections along the lines of:
> > >
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - maxThreads = 1
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - inProgress = 0
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - crawlDelay = 5000
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - minCrawlDelay = 0
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - nextFetchTime = 1317308262122
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - now = 1317308257529
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - 0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> > > 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - 1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
> > >
> > > That strikes me as probably irrelevant, but I figured I should
> > > mention it. The main difference I see in the logs is that the
> > > crawl of my own site (the crawl that worked) has the following
> > > two lines which do not appear in the log of my failed crawl:
> > >
> > > 2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> > > 2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> >
> > If this doesn't pop up when crawling the site, that means it's not
> > fetched (properly). Can you try using the parser checker to
> > download it? Can you curl it? The fetcher should throw an exception
> > if there's trouble, but it may also be stopped by the
> > http.content.limit setting.
> >
> > > Also, while my successful crawl has three lines like the
> > > following, my failed one only has two:
> > >
> > > 2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> > >
> > > Can anyone think of something I might have missed?
> > >
> > > Chip

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
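One footnote on the http.content.limit remark above: the ParserChecker output earlier in the thread shows Content-Length=356128, while nutch-default.xml caps downloads at 65536 bytes by default, so even once robots.txt allows the fetch, that XML file would be truncated to its first 64 kB. A sketch of the override in conf/nutch-site.xml (the 1 MB figure is an arbitrary illustration):

  <property>
    <name>http.content.limit</name>
    <!-- Default is 65536; this must exceed the 356,128-byte document
         above. A negative value disables truncation entirely. -->
    <value>1048576</value>
  </property>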

