Hello,
I am testing Nutch 2.0 with MySQL storage and a single seed URL. The updatedb
command does not appear to do anything: it does not add the outlinks to the
table as new URLs, and I do not see any error messages in hadoop.log. Here are
the log entries, without the plugin load info:
INFO crawl.DbUpdaterJob - DbUpdaterJob: starting
2012-07-24 10:53:46,142 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-07-24 10:53:46,979 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2012-07-24 10:53:49,801 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2012-07-24 10:53:49,806 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule - defaultInterval=25920000
2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule - maxInterval=25920000
2012-07-24 10:53:52,741 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2012-07-24 10:53:53,584 INFO crawl.DbUpdaterJob - DbUpdaterJob: done
Also, I noticed that updatedb has a crawlId option. Where does its value come from?
By the way, updatedb with no arguments works fine when HBase is chosen for storage.
Thanks.
Alex.