Ignore this email. I figured out what was wrong: I made a mistake on the
updatedb line of my crawl script, so my updates weren't going into the
correct crawl database, and because of that I was only fetching the top page
and nothing else.
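
(For reference, a typical Nutch 1.x crawl loop looks roughly like the sketch
below; the paths, iteration count, and segment lookup are placeholders, not
the actual script from this thread. The point is that updatedb must be given
the same crawldb that inject and generate use; otherwise generate only ever
sees the seed URLs and each pass fetches just the top page.)

  CRAWL=crawl
  bin/nutch inject $CRAWL/crawldb urls
  for i in 1 2 3; do
    bin/nutch generate $CRAWL/crawldb $CRAWL/segments
    # pick the segment that generate just created (last one listed)
    SEGMENT=`hadoop dfs -ls $CRAWL/segments | tail -1 | awk '{print $NF}'`
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT    # only if the fetcher is not set to parse
    # must point at the same crawldb as inject/generate above; passing a
    # wrong path here is exactly the mistake described in this reply
    bin/nutch updatedb $CRAWL/crawldb $SEGMENT
  done
  bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments

With the updatedb path fixed, the next generate pass picks up the links
discovered by the previous fetch, which is what was missing here.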

On Wed, Dec 8, 2010 at 10:14 PM, Steve Cohen <[email protected]> wrote:

> I switched my nutch setup to use hdfs and I was running small crawls and it
> was working fine. I figured I would use hadoop dfs -rmr to delete the
> crawl/crawldb, crawl/linkdb, and crawl/segments and start a new crawl to see
> how fast it was at scanning our website. I kick off my script and it starts.
> The log shows it injects urls then it starts the loop of generating urls,
> fetching them, and putting them in the crawldb. No errors, but the fetching
> section is finishing in less than 30 seconds. So I look at the hadoop.log
> and I see this:
>
> 2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=11
> 2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=10
> 2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=9
> 2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=8
> 2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=7
> 2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=6
> 2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=5
> 2010-12-08 21:37:44,721 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=4
> 2010-12-08 21:37:44,744 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=3
> 2010-12-08 21:37:44,745 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=2
> 2010-12-08 21:37:44,745 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2010-12-08 21:37:44,758 INFO  http.Http - http.proxy.host = null
> 2010-12-08 21:37:44,758 INFO  http.Http - http.proxy.port = 8080
> 2010-12-08 21:37:44,758 INFO  http.Http - http.timeout = 17000
> 2010-12-08 21:37:44,758 INFO  http.Http - http.content.limit = 65536
> 2010-12-08 21:37:44,758 INFO  http.Http - http.agent = Nutch-1.0 (Lucene
> Crawler)
> 2010-12-08 21:37:44,758 INFO  http.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2010-12-08 21:37:44,758 INFO  http.Http - protocol.plugin.check.blocking =
> false
> 2010-12-08 21:37:44,758 INFO  http.Http - protocol.plugin.check.robots =
> false
> 2010-12-08 21:37:45,726 INFO  fetcher.Fetcher - -activeThreads=1,
> spinWaiting=0, fetchQueues.totalSize=0
> 2010-12-08 21:37:46,206 INFO  crawl.SignatureFactory - Using Signature
> impl: org.apache.nutch.crawl.MD5Signature
> 2010-12-08 21:37:46,237 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=0
> 2010-12-08 21:37:46,728 INFO  fetcher.Fetcher - -activeThreads=0,
> spinWaiting=0, fetchQueues.totalSize=0
> 2010-12-08 21:37:46,728 INFO  fetcher.Fetcher - -activeThreads=0
> 2010-12-08 21:37:47,101 INFO  fetcher.Fetcher - Fetcher: threads: 60
> 2010-12-08 21:37:47,105 INFO  fetcher.Fetcher - QueueFeeder finished: total
> 0 records + hit by time limit :0
>
> No errors but no records either.
>
> Now, when it finishes, it runs the solrindex and solrdedup steps, and when
> I try searching, surprisingly enough I actually get the expected results,
> even though I am not fetching anything.
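
(For context, solrindex and solrdedup are the standard Nutch 1.x Solr
commands; a rough sketch, with the Solr URL and segment path as placeholders:

  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/<segment>
  bin/nutch solrdedup http://localhost:8983/solr

Solr keeps its own index, so documents from earlier crawls stay searchable
even when a later fetch pass brings in nothing new, which is why the searches
still look fine here.)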
>
> How do I delete all the contents so I can test a full crawl? I guess I
> could just reformat the hdfs filesystem, but is there another way?
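
(Reformatting HDFS should not be necessary just to start over; removing the
crawl directories is enough on the Hadoop side. A minimal sketch, assuming
everything lives under a top-level crawl directory:

  hadoop dfs -lsr crawl    # see what is actually still there
  hadoop dfs -rmr crawl    # removes crawldb, linkdb and segments in one go

As noted above, this does not touch whatever has already been indexed into
Solr.)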
>
> Thanks,
> Steve Cohen
