I have been trying to test Nutch 2.0, but my results are not good, so time to ask for some help!
URL injection is the only piece that is working for me:

    crawl.InjectorJob - InjectorJob: starting
    crawl.InjectorJob - InjectorJob: urlDir: /urls
    crawl.InjectorJob - InjectorJob: finished

Nothing is logged about the success, but the webpage table has data:

    WebTable statistics start
    Statistics for WebTable:
    min score:       1.0
    retry 0:         2894
    max score:       1.0
    TOTAL urls:      2894
    status 0 (null): 2894
    avg score:       1.0
    WebTable statistics: done

But from here on out, it is downhill. To generate, I use the following:

    nutch generate -all -topN 100000

    crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
    crawl.GeneratorJob - GeneratorJob: starting
    crawl.GeneratorJob - GeneratorJob: filtering: true
    crawl.GeneratorJob - GeneratorJob: topN: 100000
    crawl.GeneratorJob - GeneratorJob: done
    crawl.GeneratorJob - GeneratorJob: generated batch id: 1292541893-1499060629

No other log information is provided, unlike the old way, which included log items like:

    Generator: starting at 2010-10-08 23:16:02
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: topN: 100000
    Generator: jobtracker is 'local', generating exactly one partition.
    Host or domain www.abc123.org has more than 50 URLs for all 1 segments - skipping
    Host or domain www.bcd123.com has more than 50 URLs for all 1 segments - skipping
    Host or domain www.cde123.com has more than 50 URLs for all 1 segments - skipping
    Host or domain www.def123.com has more than 50 URLs for all 1 segments - skipping
    ...
    Generator: Partitioning selected urls for politeness.
    Generator: segment: crawl_www/segments/20101008232451
    Generator: finished at 2010-10-08 23:25:41, elapsed: 00:09:39

The same type of issue occurs with fetch:

    nutch fetch -all -threads 100 -parse

The log files show:

    fetcher.FetcherJob - FetcherJob: starting
    fetcher.FetcherJob - FetcherJob : timelimit set for : -1
    fetcher.FetcherJob - FetcherJob: threads: 10
    fetcher.FetcherJob - FetcherJob: parsing: false
    fetcher.FetcherJob - FetcherJob: resuming: false
    fetcher.FetcherJob - FetcherJob: fetching all
    fetcher.FetcherJob - FetcherJob: done

(Note that the log reports threads: 10 and parsing: false even though I passed -threads 100 -parse.) Essentially nothing is generated, and nothing shows that the fetch was successful. It completes in about a second, so I figure it is not working. Under the old method I would get something like this:

    Fetcher: starting at 2010-10-08 23:25:42
    Fetcher: segment: crawl_www/segments/20101008232451
    Fetcher Timelimit set for : 1286663142207
    Fetcher: threads: 360
    fetching http://www.xyz.com/
    fetching http://www.zyzz.com/cda/expert/
    ...
    Fetcher: finished at 2010-10-09 05:03:51, elapsed: 05:38:09

So the question is: is Nutch 2.0 ready to beta test, or am I doing something very wrong? I'm using the same seed URL list as I used in 1.2. The configuration is virtually identical to what I used in Nutch 1.2; the only changes are to accommodate the use of HBase/Gora (i.e., I took the delivered 2.0 configuration files and added my changes to them).

So what am I missing?

Thanks,
Brad
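For reference, here is the per-round cycle I have been trying to run, sketched as a shell script. This is my understanding of the 2.0 workflow, not a verified recipe: the command names come from the 2.0 bin/nutch script, but the exact flags may differ in a given build (check bin/nutch usage), and the key difference from 1.2 seems to be that fetch and parse take the batch id that generate prints, rather than a segment directory.

```shell
# Sketch of one Nutch 2.0 generate/fetch/parse/update round.
# Assumptions: bin/nutch exists on PATH as shown, flags match my build,
# and "generated batch id:" is the line GeneratorJob logs on success.

NUTCH=bin/nutch
TOPN=100000

$NUTCH inject urls    # seed the webpage table from the urls directory

# Capture the batch id that generate prints, then operate on that batch
# instead of using -all.
BATCH_ID=$($NUTCH generate -topN "$TOPN" 2>&1 \
           | sed -n 's/.*generated batch id: //p')

$NUTCH fetch "$BATCH_ID" -threads 100
$NUTCH parse "$BATCH_ID"
$NUTCH updatedb

# Sanity check: a successful fetch/parse round should change the status
# counts away from "status 0 (null)".
$NUTCH readdb -stats
```

If the fetch is really working, I would expect the readdb -stats counts to move off "status 0 (null): 2894" after a round like this.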

