Ignore this email. I figured out what was wrong: I made a mistake on the updatedb line of my crawl script, so my updates weren't going to the correct crawl database, and because of that only the top page was being fetched and nothing else.
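
In case it helps anyone else, the loop in question is the usual Nutch 1.x inject/generate/fetch/updatedb cycle. This is only a rough sketch (paths, depth, and the segment-picking line are illustrative, not my exact script); the point is that updatedb has to write to the same crawldb that generate reads from:

    #!/bin/bash
    # Rough sketch of a Nutch 1.x crawl loop on HDFS (illustrative, not my exact script).
    # The important part: updatedb must point at the same crawldb that generate reads,
    # otherwise later rounds generate nothing and only the injected top page gets fetched.
    CRAWLDB=crawl/crawldb
    SEGMENTS=crawl/segments
    DEPTH=3

    bin/nutch inject $CRAWLDB urls

    for ((i = 1; i <= DEPTH; i++)); do
      bin/nutch generate $CRAWLDB $SEGMENTS
      # pick the newest segment created by generate (last entry in the listing)
      segment=$(hadoop dfs -ls $SEGMENTS | tail -1 | awk '{print $NF}')
      bin/nutch fetch $segment
      bin/nutch parse $segment      # skip if fetcher.parse=true in your config
      # this is the line I had wrong -- it has to update $CRAWLDB, not some other path
      bin/nutch updatedb $CRAWLDB $segment
    done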
On Wed, Dec 8, 2010 at 10:14 PM, Steve Cohen <[email protected]> wrote:
> I switched my nutch setup to use hdfs and I was running small crawls and it
> was working fine. I figured I would use hadoop dfs -rmr to delete the
> crawl/crawldb, crawl/linkdb, and crawl/segments and start a new crawl to see
> how fast it was at scanning our website. I kick off my script and it starts.
> The log shows it injects urls, then it starts the loop of generating urls,
> fetching them, and putting them in the crawldb. No errors, but the fetching
> section is finishing in less than 30 seconds. So I look at the hadoop.log
> and I see this:
>
> 2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=11
> 2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=10
> 2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=9
> 2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=8
> 2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=7
> 2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=6
> 2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=5
> 2010-12-08 21:37:44,721 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=4
> 2010-12-08 21:37:44,744 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=3
> 2010-12-08 21:37:44,745 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=2
> 2010-12-08 21:37:44,745 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2010-12-08 21:37:44,758 INFO http.Http - http.proxy.host = null
> 2010-12-08 21:37:44,758 INFO http.Http - http.proxy.port = 8080
> 2010-12-08 21:37:44,758 INFO http.Http - http.timeout = 17000
> 2010-12-08 21:37:44,758 INFO http.Http - http.content.limit = 65536
> 2010-12-08 21:37:44,758 INFO http.Http - http.agent = Nutch-1.0 (Lucene Crawler)
> 2010-12-08 21:37:44,758 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2010-12-08 21:37:44,758 INFO http.Http - protocol.plugin.check.blocking = false
> 2010-12-08 21:37:44,758 INFO http.Http - protocol.plugin.check.robots = false
> 2010-12-08 21:37:45,726 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> 2010-12-08 21:37:46,206 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 2010-12-08 21:37:46,237 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2010-12-08 21:37:46,728 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2010-12-08 21:37:46,728 INFO fetcher.Fetcher - -activeThreads=0
> 2010-12-08 21:37:47,101 INFO fetcher.Fetcher - Fetcher: threads: 60
> 2010-12-08 21:37:47,105 INFO fetcher.Fetcher - QueueFeeder finished: total 0 records + hit by time limit :0
>
> No errors, but no records either.
>
> Now, when it finishes, it runs solrindex and solrdedup, and when I try
> searching, surprisingly enough, I actually am getting expected results,
> even though I am not fetching anything.
>
> How do I delete all the contents so I can test a full crawl? I guess I
> could just reformat the hdfs filesystem, but is there another way?
>
> Thanks,
> Steve Cohen
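
For the archives: the cleanup I was already doing between test crawls was simply removing the crawl data on HDFS with hadoop dfs -rmr, and since the real problem turned out to be the updatedb line, that by itself is enough; no need to reformat the filesystem. Roughly (paths assume the crawl lives under crawl/ in my HDFS home directory):

    # clear the crawl data on HDFS before a fresh test crawl
    hadoop dfs -rmr crawl/crawldb
    hadoop dfs -rmr crawl/linkdb
    hadoop dfs -rmr crawl/segments

Note that this doesn't touch the Solr index, which is presumably why searches were still returning results even though nothing new had been fetched.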

