I switched my Nutch setup to use HDFS, and it was working fine for the small
crawls I was running. I figured I would use hadoop dfs -rmr to delete
crawl/crawldb, crawl/linkdb, and crawl/segments and start a new crawl to see
how fast it was at scanning our website. I kick off my script and it starts.
The log shows it injects the URLs and then starts the loop of generating URLs,
fetching them, and putting them into the crawldb. No errors, but the fetching
step finishes in less than 30 seconds. So I look at hadoop.log and I see this:

2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=11
2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=10
2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
2010-12-08 21:37:44,744 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
2010-12-08 21:37:44,745 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
2010-12-08 21:37:44,745 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2010-12-08 21:37:44,758 INFO  http.Http - http.proxy.host = null
2010-12-08 21:37:44,758 INFO  http.Http - http.proxy.port = 8080
2010-12-08 21:37:44,758 INFO  http.Http - http.timeout = 17000
2010-12-08 21:37:44,758 INFO  http.Http - http.content.limit = 65536
2010-12-08 21:37:44,758 INFO  http.Http - http.agent = Nutch-1.0 (Lucene Crawler)
2010-12-08 21:37:44,758 INFO  http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2010-12-08 21:37:44,758 INFO  http.Http - protocol.plugin.check.blocking = false
2010-12-08 21:37:44,758 INFO  http.Http - protocol.plugin.check.robots = false
2010-12-08 21:37:45,726 INFO  fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2010-12-08 21:37:46,206 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2010-12-08 21:37:46,237 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2010-12-08 21:37:46,728 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2010-12-08 21:37:46,728 INFO  fetcher.Fetcher - -activeThreads=0
2010-12-08 21:37:47,101 INFO  fetcher.Fetcher - Fetcher: threads: 60
2010-12-08 21:37:47,105 INFO  fetcher.Fetcher - QueueFeeder finished: total 0 records + hit by time limit :0

No errors, but no records either.
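
For context, the loop in my script is just the usual Nutch command sequence; a
rough sketch of what it does each round (the paths are from my setup, and the
segment-picking line is just one way I grab the newest segment out of HDFS):

# inject the seed URLs into the crawldb (done once, before the loop)
bin/nutch inject crawl/crawldb urls
# each round: generate a fetch list in a new segment under crawl/segments
bin/nutch generate crawl/crawldb crawl/segments
# grab the path of the segment that generate just created
SEGMENT=`hadoop dfs -ls crawl/segments | tail -1 | awk '{print $NF}'`
# fetch that segment and fold the results back into the crawldb
bin/nutch fetch $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT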

Now, when the crawl finishes, my script runs solrindex and solrdedup, and when
I try searching I am, surprisingly enough, actually getting the expected
results, even though nothing is being fetched.
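
My guess is those hits are coming from documents that were indexed during the
earlier crawls and are still sitting in Solr, since removing the crawl
directories from HDFS wouldn't touch the Solr index at all. If that's the
case, I assume I could wipe it with a delete-all query against the update
handler, something like this (the URL is just a placeholder for my Solr
instance):

# delete every document in the index and commit
curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'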

How do I delete all of the contents so I can test a full crawl from scratch?
I guess I could just reformat the HDFS filesystem, but is there another way?
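
For reference, this is roughly the cleanup I ran before kicking off the new
crawl:

# remove the old crawl data from HDFS before starting over
hadoop dfs -rmr crawl/crawldb
hadoop dfs -rmr crawl/linkdb
hadoop dfs -rmr crawl/segments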

Thanks,
Steve Cohen
