I've been able to run the ParserChecker now, but I'm not sure how to interpret the results. Here's what I got:
# bin/nutch org.apache.nutch.parse.ParserChecker http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
--------- Url ---------------
http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 1
  outlink: toUrl: GR:32:A:128 anchor:
Content Metadata: ETag="1fa962a-56f20-485df79c50980" Date=Fri, 30 Sep 2011 19:54:14 GMT Content-Length=356128 Last-Modified=Wed, 05 May 2010 21:26:14 GMT Content-Type=text/xml Connection=close Accept-Ranges=bytes Server=Apache/2.2.3 (Red Hat)
Parse Metadata: Content-Type=application/xml

Curl also retrieves this file, and yet I can't get my crawl to pick it up. Could it be an issue with robots.txt? The robots.txt for this site reads as follows:

User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /

That first user-agent is, as near as I can tell, what I'm sending. My log shows the following:

2011-09-30 15:54:17,712 INFO http.Http - http.agent = PHFAWS/Nutch-1.3 (American Institute of Physics: Physics History Finding Aids Web Site; http://www.aip.org/history/nbl/findingaids.html; [email protected])

Can anyone tell what I'm missing? Thanks.

Chip

-----Original Message-----
From: Chip Calhoun [mailto:[email protected]]
Sent: Thursday, September 29, 2011 4:12 PM
To: [email protected]
Subject: RE: What could be blocking me, if not robots.txt?

Ah, sorry. I had already deleted the local copy from my server (aip.org) to avoid clutter. So yeah, that will definitely 404 now. Curl retrieves the whole file with no problems.

I can't try the ParserChecker today, as I'm stuck away from my own machine, but I will try it tomorrow. The fact that I can curl it at least tells me this is a problem I need to fix in Nutch.
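A side check, not from the thread itself: Python's stdlib `urllib.robotparser` is not the parser Nutch uses, but it illustrates how a robots.txt group named with a version-suffixed token ("PHFAWS/Nutch-1.3") can fail to match the crawler's agent name and fall through to the catch-all "Disallow: /". The URL and robots.txt below are copied from the messages above; everything else is illustration only.

```python
# Sketch: evaluating the quoted robots.txt with Python's stdlib parser.
# This is NOT Nutch's robots implementation; it only demonstrates that
# agent-token matching is fragile when the group name carries "/version".
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: PHFAWS/Nutch-1.3
Disallow:

User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /
"""

URL = "http://digital.lib.umd.edu/oclc/MdU.ead.histms.0094.xml"

rp = RobotFileParser()
rp.modified()  # mark rules as loaded so can_fetch() evaluates them
rp.parse(ROBOTS.splitlines())

# The versioned group token never matches the agent name "PHFAWS",
# so the "*" group (Disallow: /) wins and the fetch is refused.
print(rp.can_fetch("PHFAWS/Nutch-1.3", URL))   # False under this matcher

# With a bare agent name in robots.txt, the group matches and the
# empty Disallow permits everything.
rp2 = RobotFileParser()
rp2.modified()
rp2.parse(ROBOTS.replace("PHFAWS/Nutch-1.3", "PHFAWS").splitlines())
print(rp2.can_fetch("PHFAWS/Nutch-1.3", URL))  # True
```

If Nutch's matcher behaves similarly, listing the bare agent name (and setting it in http.robots.agents) rather than the name/version token would be worth trying.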
Chip

________________________________________
From: Markus Jelsma [[email protected]]
Sent: Thursday, September 29, 2011 1:01 PM
To: [email protected]
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

Oh, it's a 404. That makes sense.

> Hi everyone,
>
> I'm using Nutch to crawl a few friendly sites, and am having trouble
> with some of them. One site in particular has created an exception for
> me in its robots.txt, and yet I can't crawl any of its pages. I've
> tried copying the files I want to index (3 XML documents) to my own
> server and crawling that, and it works fine that way; so something is
> keeping me from indexing any files on this other site.
>
> I compared the logs of my attempt to crawl the friendly site with my
> attempt to crawl my own site, and I've found few differences. Most
> differences come from the fact that my own site requires a crawlDelay,
> so there are many log sections along the lines of:
>
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   maxThreads    = 1
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   inProgress    = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   crawlDelay    = 5000
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   minCrawlDelay = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   nextFetchTime = 1317308262122
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   now           = 1317308257529
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
>
> That strikes me as probably irrelevant, but I figured I should mention it.
> The main difference I see in the logs is that the crawl of my own site
> (the crawl that worked) has the following two lines which do not
> appear in the log of my failed crawl:
>
> 2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> 2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
>
> Also, while my successful crawl has three lines like the following, my
> failed one only has two:
>
> 2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
>
> Can anyone think of something I might have missed?
>
> Chip

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Thursday, September 29, 2011 1:00 PM
To: [email protected]
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

> Hi everyone,
>
> I'm using Nutch to crawl a few friendly sites, and am having trouble
> with some of them. One site in particular has created an exception for
> me in its robots.txt, and yet I can't crawl any of its pages. I've
> tried copying the files I want to index (3 XML documents) to my own
> server and crawling that, and it works fine that way; so something is
> keeping me from indexing any files on this other site.
>
> I compared the logs of my attempt to crawl the friendly site with my
> attempt to crawl my own site, and I've found few differences.
> Most differences come from the fact that my own site requires a crawlDelay,
> so there are many log sections along the lines of:
>
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   maxThreads    = 1
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   inProgress    = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   crawlDelay    = 5000
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   minCrawlDelay = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   nextFetchTime = 1317308262122
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   now           = 1317308257529
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
>
> That strikes me as probably irrelevant, but I figured I should mention it.
>
> The main difference I see in the logs is that the crawl of my own site
> (the crawl that worked) has the following two lines which do not
> appear in the log of my failed crawl:
>
> 2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
> 2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature

If this doesn't pop up when crawling the site, that means it's not fetched (properly). Can you try using the ParserChecker to download it? Can you use curl? The fetcher should throw an exception if there's trouble, but it may also be stopped by the http.content.limit setting.
> Also, while my successful crawl has three lines like the following, my
> failed one only has two:
>
> 2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
>
> Can anyone think of something I might have missed?
>
> Chip
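One concrete thing worth checking from Markus's http.content.limit remark: the ParseData output earlier in the thread reports Content-Length=356128, while Nutch's default http.content.limit is 65536 bytes, so the fetched XML would be truncated at the default setting. A possible nutch-site.xml override is sketched below; the property name is real in Nutch 1.3, but the 1 MB value is only an example (-1 disables the limit entirely).

```xml
<!-- nutch-site.xml: raise the per-page download cap above the 356 KB file.
     Example value; -1 means no limit. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
  <description>Maximum number of bytes to download per page.</description>
</property>
```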

