Ah, sorry. I had already deleted the local copy from my server (aip.org) to avoid clutter. So yeah, that will definitely 404 now.
Curl retrieves the whole file with no problems. I can't try the ParserChecker today as I'm stuck away from my own machine, but I will try it tomorrow. The fact that I can curl it at least tells me this is a problem I need to fix in Nutch.

Chip

________________________________________
From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Thursday, September 29, 2011 1:01 PM
To: user@nutch.apache.org
Cc: Chip Calhoun
Subject: Re: What could be blocking me, if not robots.txt?

Oh, it's a 404. That makes sense.

> Hi everyone,
>
> I'm using Nutch to crawl a few friendly sites, and am having trouble with
> some of them. One site in particular has created an exception for me in
> its robots.txt, and yet I can't crawl any of its pages. I've tried copying
> the files I want to index (3 XML documents) to my own server and crawling
> that, and it works fine that way; so something is keeping me from indexing
> any files on this other site.
>
> I compared the logs of my attempt to crawl the friendly site with my
> attempt to crawl my own site, and I've found few differences. Most
> differences come from the fact that my own site requires a crawlDelay, so
> there are many log sections along the lines of:
>
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - maxThreads    = 1
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - inProgress    = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - crawlDelay    = 5000
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - minCrawlDelay = 0
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - nextFetchTime = 1317308262122
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - now           = 1317308257529
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - 0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
> 2011-09-29 10:57:37,529 INFO fetcher.Fetcher - 1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml
>
> That strikes me as probably irrelevant, but I figured I should mention it.
> The main difference I see in the logs is that the crawl of my own site
> (the crawl that worked) has the following two lines which do not appear in
> the log of my failed crawl:
>
> 2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content type
> application/xml, but they are not mapped to it in the parse-plugins.xml file
> 2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl:
> org.apache.nutch.crawl.MD5Signature
>
> Also, while my successful crawl has three lines like the following, my
> failed one only has two:
>
> 2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules
> for scope 'crawldb', using default
>
> Can anyone think of something I might have missed?
>
> Chip
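
P.S. For reference, when I'm back at my machine I plan to check the fetch and parse roughly like this (the URL below is just a stand-in for a document on the problem site, and the paths assume a stock Nutch 1.x install):

    # confirm the document is reachable and see what Content-Type the server reports
    curl -I http://example.org/path/to/finding-aid.xml

    # run Nutch's own fetch + parse on the single URL to see whether Tika picks it up
    bin/nutch org.apache.nutch.parse.ParserChecker http://example.org/path/to/finding-aid.xml

And if the missing application/xml mapping mentioned in that ParserFactory log line turns out to matter, I'd guess the fix would be adding something along these lines to conf/parse-plugins.xml (just a sketch, not tested yet):

    <mimeType name="application/xml">
      <plugin id="parse-tika" />
    </mimeType>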