I switched my Nutch setup to use HDFS, and small crawls were running fine. I figured I would use hadoop dfs -rmr to delete crawl/crawldb, crawl/linkdb, and crawl/segments and start a new crawl to see how fast it could scan our website. I kick off my script and it starts: the log shows it injecting URLs and then entering the loop of generating URLs, fetching them, and updating the crawldb. There are no errors, but the fetching step finishes in less than 30 seconds.
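
For reference, the cleanup was just the three deletes, roughly like this (exact paths depend on where the crawl dir lives in HDFS; mine are relative to my HDFS home directory):

    # remove the old crawl data so the next crawl starts from scratch
    hadoop dfs -rmr crawl/crawldb
    hadoop dfs -rmr crawl/linkdb
    hadoop dfs -rmr crawl/segments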
So I look at the hadoop.log and I see this:

2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=11
2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=10
2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2010-12-08 21:37:44,744 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2010-12-08 21:37:44,745 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2010-12-08 21:37:44,745 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2010-12-08 21:37:44,758 INFO http.Http - http.proxy.host = null
2010-12-08 21:37:44,758 INFO http.Http - http.proxy.port = 8080
2010-12-08 21:37:44,758 INFO http.Http - http.timeout = 17000
2010-12-08 21:37:44,758 INFO http.Http - http.content.limit = 65536
2010-12-08 21:37:44,758 INFO http.Http - http.agent = Nutch-1.0 (Lucene Crawler)
2010-12-08 21:37:44,758 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2010-12-08 21:37:44,758 INFO http.Http - protocol.plugin.check.blocking = false
2010-12-08 21:37:44,758 INFO http.Http - protocol.plugin.check.robots = false
2010-12-08 21:37:45,726 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2010-12-08 21:37:46,206 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2010-12-08 21:37:46,237 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-12-08 21:37:46,728 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-12-08 21:37:46,728 INFO fetcher.Fetcher - -activeThreads=0
2010-12-08 21:37:47,101 INFO fetcher.Fetcher - Fetcher: threads: 60
2010-12-08 21:37:47,105 INFO fetcher.Fetcher - QueueFeeder finished: total 0 records + hit by time limit :0

No errors, but no records either. Now, when the crawl finishes, it runs solrindex and solrdedup, and when I try searching, surprisingly enough, I actually am getting the expected results, even though I am not fetching anything. How do I delete all of the contents so I can test a full crawl? I guess I could just reformat the HDFS filesystem, but is there another way?

Thanks,
Steve Cohen
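
P.S. I am guessing the documents from the earlier crawls are still sitting in the Solr index, which would explain why searches still return results. Would a delete-by-query against the update handler, something like the command below, be the right way to clear it out? (The URL assumes a single core on the default port, so it may differ for other setups.)

    # wipe everything in the Solr index and commit - assuming Solr on localhost:8983
    curl "http://localhost:8983/solr/update?commit=true" \
         -H "Content-Type: text/xml" \
         --data-binary '<delete><query>*:*</query></delete>'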

