I have been trying to test Nutch 2.0, but my results are not good, so time to ask for some help!
URL injection is the only piece that is working for me:

    crawl.InjectorJob - InjectorJob: starting
    crawl.InjectorJob - InjectorJob: urlDir: /urls
    crawl.InjectorJob - InjectorJob: finished

Nothing is logged about the success, but the webpage table has data:

    WebTable statistics start
    Statistics for WebTable:
    min score:       1.0
    retry 0:         2894
    max score:       1.0
    TOTAL urls:      2894
    status 0 (null): 2894
    avg score:       1.0
    WebTable statistics: done

But from here on out, it is downhill. To generate, I use the following:

    nutch generate -all -topN 100000

    crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
    crawl.GeneratorJob - GeneratorJob: starting
    crawl.GeneratorJob - GeneratorJob: filtering: true
    crawl.GeneratorJob - GeneratorJob: topN: 100000
    crawl.GeneratorJob - GeneratorJob: done
    crawl.GeneratorJob - GeneratorJob: generated batch id: 1292541893-1499060629

No other log information is provided, unlike the old way, which included log items like:

    Generator: starting at 2010-10-08 23:16:02
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: topN: 100000
    Generator: jobtracker is 'local', generating exactly one partition.
    Host or domain www.abc123.org has more than 50 URLs for all 1 segments - skipping
    Host or domain www.bcd123.com has more than 50 URLs for all 1 segments - skipping
    Host or domain www.cde123.com has more than 50 URLs for all 1 segments - skipping
    Host or domain www.def123.com has more than 50 URLs for all 1 segments - skipping
    ...
    Generator: Partitioning selected urls for politeness.
    Generator: segment: crawl_www/segments/20101008232451
    Generator: finished at 2010-10-08 23:25:41, elapsed: 00:09:39

The same type of issue occurs with fetch:

    nutch fetch -all -threads 100 -parse

The log files show:

    fetcher.FetcherJob - FetcherJob: starting
    fetcher.FetcherJob - FetcherJob : timelimit set for : -1
    fetcher.FetcherJob - FetcherJob: threads: 10
    fetcher.FetcherJob - FetcherJob: parsing: false
    fetcher.FetcherJob - FetcherJob: resuming: false
    fetcher.FetcherJob - FetcherJob: fetching all
    fetcher.FetcherJob - FetcherJob: done

(Note that the log reports threads: 10 and parsing: false even though I passed -threads 100 -parse.) Essentially nothing is generated, and nothing shows that the fetch was successful. It completes in about a second, so I figure it is not working. Under the old method I would get something like this:

    Fetcher: starting at 2010-10-08 23:25:42
    Fetcher: segment: crawl_www/segments/20101008232451
    Fetcher Timelimit set for : 1286663142207
    Fetcher: threads: 360
    fetching http://www.xyz.com/
    fetching http://www.zyzz.com/cda/expert/
    ...
    Fetcher: finished at 2010-10-09 05:03:51, elapsed: 05:38:09

So the question is: is Nutch 2.0 ready to beta test, or am I doing something very wrong? I'm using the same seed URL list as I used in 1.2. The configuration is virtually identical to what I used in Nutch 1.2; the only changes are to accommodate the use of HBase/Gora (i.e., I took the delivered 2.0 configuration files and added my changes to them).

So what am I missing?

Thanks,
Brad
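For reference, here is the per-round cycle I have been trying to run, sketched as a shell script. This is my understanding of the 2.0 workflow, not a verified recipe: the command names come from the 2.0 bin/nutch script, but the exact flags may differ in a given build (check bin/nutch usage), and the key difference from 1.2 seems to be that fetch and parse take the batch id that generate prints, rather than a segment directory.

```shell
# Sketch of one Nutch 2.0 generate/fetch/parse/update round.
# Assumptions: bin/nutch exists on PATH as shown, flags match my build,
# and "generated batch id:" is the line GeneratorJob logs on success.

NUTCH=bin/nutch
TOPN=100000

$NUTCH inject urls    # seed the webpage table from the urls directory

# Capture the batch id that generate prints, then operate on that batch
# instead of using -all.
BATCH_ID=$($NUTCH generate -topN "$TOPN" 2>&1 \
           | sed -n 's/.*generated batch id: //p')

$NUTCH fetch "$BATCH_ID" -threads 100
$NUTCH parse "$BATCH_ID"
$NUTCH updatedb

# Sanity check: a successful fetch/parse round should change the status
# counts away from "status 0 (null)".
$NUTCH readdb -stats
```

If the fetch is really working, I would expect the readdb -stats counts to move off "status 0 (null): 2894" after a round like this.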

