On 12/17/10 2:08 AM, brad wrote:
To Generate, I use the following:
nutch generate -all -topN 100000
crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for
fetch.
crawl.GeneratorJob - GeneratorJob: starting
crawl.GeneratorJob - GeneratorJob: filtering: true
crawl.GeneratorJob - GeneratorJob: topN: 100000
crawl.GeneratorJob - GeneratorJob: done
crawl.GeneratorJob - GeneratorJob: generated batch id: 1292541893-1499060629
No other log information is provided...
Unlike the old way, which included log items like:
Generator: starting at 2010-10-08 23:16:02
If in doubt you should check the logs/hadoop.log - if there were any
exceptions they should be reported there.
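As a rough way to do that check, here is a minimal sketch that scans the log for lines that look like errors. It assumes the default logs/hadoop.log location relative to the Nutch directory, and it assumes problems appear as lines containing an ERROR or FATAL level or a Java class name ending in Exception or Error. The function names and the pattern are illustrative and not part of Nutch.

```python
import os
import re
import sys

# Assumption: errors show up as an ERROR/FATAL level or as a Java
# class name ending in "Exception" or "Error". Adjust to your log layout.
PROBLEM = re.compile(r"\b(ERROR|FATAL)\b|\b\w+(Exception|Error)\b")


def find_problems(lines):
    """Return (line_number, text) pairs for lines that look like errors."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PROBLEM.search(line)
    ]


def scan_log(path="logs/hadoop.log"):
    """Scan a log file; the default path is an assumption about the layout."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_problems(handle)


if __name__ == "__main__":
    log_path = sys.argv[1] if len(sys.argv) > 1 else "logs/hadoop.log"
    if os.path.exists(log_path):
        for number, text in scan_log(log_path):
            print(f"{number}: {text}")
    else:
        print(f"log file not found: {log_path}")
```

If this prints nothing after a generate or fetch run, the log contains no lines matching the pattern, which is worth mentioning when reporting the problem.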
Same type of issue occurs with Fetch:
nutch fetch -all -threads 100 -parse
The log files show:
fetcher.FetcherJob - FetcherJob: starting
fetcher.FetcherJob - FetcherJob : timelimit set for : -1
fetcher.FetcherJob - FetcherJob: threads: 10
fetcher.FetcherJob - FetcherJob: parsing: false
fetcher.FetcherJob - FetcherJob: resuming: false
fetcher.FetcherJob - FetcherJob: fetching all
fetcher.FetcherJob - FetcherJob: done
Again, there should be some data in the log. Also, at this point you can
re-run readdb and check whether the statistics have changed.
So, the question is: is Nutch 2.0 ready to beta test? Or am I doing
something very wrong?
I guess it could be a config error - basic usage should just work...
So what am I missing?
I don't know, we need more information. BTW, dev@ list may be more
appropriate for this discussion.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com