I am following this article http://wiki.apache.org/nutch/RunNutchInEclipse. My environment is Windows XP + Cygwin + Eclipse. I think the first few WARN messages are not the blocker. (plugin.folders contained an additional folder; after I removed it, the job still fails.) We can compare them with the logs from InjectorJob, which runs successfully:
2013-07-21 12:45:01,968 INFO crawl.InjectorJob - InjectorJob: starting at 2013-07-21 12:45:01
2013-07-21 12:45:01,968 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls/dev
2013-07-21 12:45:02,921 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
2013-07-21 12:45:02,968 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-07-21 12:45:02,984 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:45:03,015 WARN snappy.LoadSnappy - Snappy native library not loaded
2013-07-21 12:45:03,328 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-07-21 12:45:03,437 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
2013-07-21 12:45:03,484 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2013-07-21 12:45:03,625 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 12:45:04,218 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
2013-07-21 12:45:04,218 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
2013-07-21 12:45:04,218 INFO crawl.InjectorJob - Injector: finished at 2013-07-21 12:45:04, elapsed: 00:00:02

At 2013-07-21 12:36:29,"Lewis John Mcgibbney" <[email protected]> wrote:
>Please read the exception trace. You are running on Hadoop? You need to
>ensure that your plugins.directory points to the right path. There is also
>a mention of a missing job file. Please ensure that your nutch job file is
>on the Hadoop jobtracker classpath.
>hth
>
>On Saturday, July 20, 2013, Rui Gao <[email protected]> wrote:
>> Hi Lewis,
>>
>> I tried to downgrade gora-core to 0.2.1. then, I could run InjectorJob with both hsql and mysql. But the Crawler job still fail.
>> here's the log:
>> 2013-07-21 12:23:41,156 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
>> 2013-07-21 12:23:41,203 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2013-07-21 12:23:41,234 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> 2013-07-21 12:23:41,265 WARN snappy.LoadSnappy - Snappy native library not loaded
>> 2013-07-21 12:23:41,578 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>> 2013-07-21 12:23:41,718 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
>> 2013-07-21 12:23:41,765 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>> 2013-07-21 12:23:41,937 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
>> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
>> 2013-07-21 12:23:42,468 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2013-07-21 12:23:42,593 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> 2013-07-21 12:23:42,796 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
>> 2013-07-21 12:23:43,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2013-07-21 12:23:43,093 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
>> 2013-07-21 12:23:43,234 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>> 2013-07-21 12:23:43,250 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>> 2013-07-21 12:23:43,250 WARN mapred.LocalJobRunner - job_local1378002997_0002
>> java.lang.NullPointerException
>>     at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>>     at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>
>> I don't know if this is the right direction I should continue with. But any way, hopefully my experience could help others.
>>
>> Regards,
>> Rui
>>
>> At 2013-07-20 23:07:41,"Rui Gao" <[email protected]> wrote:
>>>Hi Lewis,
>>>
>>>Thanks for your answer.
>>>So, what direction will Nutch go? Will it co-operate with relationship database or will it only work on non-relationship database like hbase?
>>>I remember when 2.2.1 has been released, I checked the release note, it says some bugs related with mysql has been fixed. That's why I try to integrate it with mysql or hsql.
>>>And also, in the wiki, there's a link talking about how to integrate nutch with mysql: http://nlp.solutions.asia/?p=362
>>>
>>>Do you have any suggestion?
>>>
>>>Thanks.
>>>
>>>Best Regards,
>>>Rui
>>>
>>>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <[email protected]> wrote:
>>>>Hi Rui,
>>>>This should not work.
>>>>The SqlStore module and support for it is now deprecated within Apache Gora.
>>>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>>>Gora artifacts but this is not recommended.
>>>>Thanks
>>>>Lewis
>>>>
>>>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have set up eclipse environment according to the WIKI. Here's some
>>>>> something I did before I run the inject job:
>>>>> 1. I use SqlStore as storage class
>>>>> 2. I started HSql database which contains the table 'webpage'.
>>>>> 3. I added 1 URL in seed.txt.
>>>>> Then I run the inject job. It seems the job is finished successfully. But
>>>>> I there's no change be made to my HSql database. Any thought about this?
>>>>> Here's the log:
>>>>> InjectorJob: starting at 2013-07-07 15:28:42
>>>>> InjectorJob: Injecting urlDir: urls/dev
>>>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>>>> storage class.
>>>>> InjectorJob: total number of urls rejected by filters: 0
>>>>> InjectorJob: total number of urls injected after normalization and
>>>>> filtering: 1
>>>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>>>
>>>>> Best Regards,
>>>>> Rui
>>>>>
>>>>
>>>>--
>>>>*Lewis*
>>
>
>--
>*Lewis*

