Hi everyone,

I'm using Nutch to crawl a few friendly sites, and am having trouble with some 
of them. One site in particular has added an exception for my crawler to its 
robots.txt, and yet I can't crawl any of its pages. I've tried copying the 
files I want to index (3 XML documents) to my own server and crawling them 
there, and that works fine; so something on the other site is blocking me from 
indexing any of its files.
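To double-check how that robots.txt exception is actually being interpreted, here is a quick sketch using Python's standard-library robots.txt parser. The robots.txt content and the agent name "nutch-test" are assumptions for illustration -- substitute the site's real robots.txt and whatever http.agent.name is set to in your nutch-site.xml, since a mismatch there would explain the blocked fetches:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt illustrating an "exception" entry like the one
# the site owner says they added for my crawler (agent name is assumed):
robots_txt = """\
User-agent: *
Disallow: /

User-agent: nutch-test
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

url = "http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml"
print(rp.can_fetch("nutch-test", url))    # exception entry should allow this
print(rp.can_fetch("SomeOtherBot", url))  # everyone else stays disallowed
```

If can_fetch comes back False for the exact agent string Nutch sends, the robots.txt exception isn't matching, which would produce exactly the symptom above.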

I compared the log of my attempt to crawl the friendly site with the log of my 
attempt to crawl my own site, and I've found few differences. Most of the 
differences stem from the fact that my own site enforces a crawl delay, so 
there are many log sections along these lines:

2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=2
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher - * queue: http://www.aip.org
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   maxThreads    = 1
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   inProgress    = 0
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   crawlDelay    = 5000
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   minCrawlDelay = 0
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   nextFetchTime = 1317308262122
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   now           = 1317308257529
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   0. http://www.aip.org/history/ead/umd/MdU.ead.histms.0067.xml
2011-09-29 10:57:37,529 INFO  fetcher.Fetcher -   1. http://www.aip.org/history/ead/umd/MdU.ead.histms.0312.xml

That strikes me as probably irrelevant, but I figured I should mention it. The 
main difference I see is that the log of the crawl of my own site (the crawl 
that worked) has the following two lines, which do not appear in the log of my 
failed crawl:

2011-09-29 10:57:50,497 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/xml, but they are not mapped to it  in the parse-plugins.xml file
2011-09-29 10:58:23,559 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
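Incidentally, that ParserFactory warning can be silenced by mapping the content type explicitly in conf/parse-plugins.xml. A sketch of what I believe the entry would look like, modeled on the stock file's existing mimeType entries (double-check the plugin id against the aliases section of your copy):

```xml
<!-- Map application/xml to the Tika parser so ParserFactory
     no longer has to guess which enabled plugin to use. -->
<mimeType name="application/xml">
    <plugin id="parse-tika" />
</mimeType>
```

That warning appeared in my *successful* crawl, though, so it seems unlikely to be the cause of the failure.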

Also, while my successful crawl's log has three lines like the following, my 
failed one has only two:

2011-09-29 10:58:44,824 WARN  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default

Can anyone think of something I might have missed?

Chip
