Hi everyone,

I'm using Nutch to crawl a few friendly sites, and I'm having trouble with some of them. One site in particular has added an exception for my crawler to its robots.txt, and yet I can't crawl any of its pages. I tried copying the files I want to index (3 XML documents) to my own server and crawling them there, and that works fine, so something is keeping me from indexing any files on the other site.
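One way to rule out the robots.txt exception itself is to test the rules against the agent name the crawler announces. Below is a minimal sketch in Python using the standard library's urllib.robotparser. The rules, the agent name "mycrawler", and the example.org URL are all hypothetical placeholders, not the real site's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named crawler is allowed into a directory,
# every other agent is shut out. This is NOT the real site's robots.txt.
SAMPLE_RULES = """\
User-agent: mycrawler
Allow: /history/ead/

User-agent: *
Disallow: /
""".splitlines()


def is_allowed(rules, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse in-memory lines, no network access needed
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://www.example.org/history/ead/sample.xml"  # placeholder URL
    for agent in ("mycrawler", "otherbot"):
        print(agent, is_allowed(SAMPLE_RULES, agent, url))
```

If the exception only applies to a specific agent name, the name the crawler sends has to match it. In Nutch that name is set, as far as I know, by the http.agent.name property, so a mismatch there would produce exactly this kind of silent refusal.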
I compared the logs of my attempt to crawl the friendly site with my attempt to crawl my own site, and I've found few differences. Most differences come from the fact that my own site requires a crawlDelay, so there are many log sections along the lines of:

2011-09-29 10:57:37,529 INFO fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - * queue: http://www.aip.org
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - maxThreads = 1
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - inProgress = 0
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - crawlDelay = 5000
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - minCrawlDelay = 0
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - nextFetchTime = 1317308262122
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - now = 1317308257529
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - 0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
2011-09-29 10:57:37,529 INFO fetcher.Fetcher - 1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml

That strikes me as probably irrelevant, but I figured I should mention it.

The main difference I see in the logs is that the crawl of my own site (the crawl that worked) has the following two lines, which do not appear in the log of my failed crawl:

2011-09-29 10:57:50,497 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it in the parse-plugins.xml file
2011-09-29 10:58:23,559 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature

Also, while my successful crawl has three lines like the following, my failed one only has two:

2011-09-29 10:58:44,824 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default

Can anyone think of something I might have missed?

Chip

