I get the same error when using HBase 0.90.x. Here's the log:
2013-07-21 14:51:29,500 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-07-21 14:51:29,500 WARN mapred.LocalJobRunner - job_local196483647_0002
java.lang.NullPointerException
at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
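
For reference, the NullPointerException at Utf8.<init> means a null String reached the Utf8 constructor inside GeneratorReducer.setup(). A minimal sketch of the likely failure path, assuming the reducer reads its batch id from the job configuration (the property name "generate.batch.id" below is my assumption, for illustration only):

import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;

public class BatchIdNpeSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed property name: if the generate step never stored a batch id
    // in the job configuration, get() returns null here.
    String batchId = conf.get("generate.batch.id");
    // Utf8(String) dereferences its argument immediately, so a null value
    // throws a NullPointerException matching the trace above.
    Utf8 id = new Utf8(batchId);
    System.out.println(id);
  }
}

If that is the cause, the fix is to ensure the generate step actually puts a batch id into the configuration before the reduce phase runs, not to guard the constructor call.
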
At 2013-07-21 13:58:33,"Rui Gao" <[email protected]> wrote:
>I checked the DB; the URL is already there.
>The plugin property is configured like this:
><property>
> <name>plugin.folders</name>
> <value>./src/plugin,./plugins</value>
> <description>Directories where nutch plugins are located. Each
> element may be a relative or absolute path. If absolute, it is used
> as is. If relative, it is searched for on the classpath.</description>
></property>
>
>I guess the plugin property is configured properly, because when I change it
>to another value, it complains that the plugins cannot be found.
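>
>For illustration, a variant with an absolute path, which avoids depending on the working directory Eclipse launches the job from (C:/work/nutch is a placeholder for your own checkout):
>
><property>
>  <name>plugin.folders</name>
>  <!-- placeholder path; point this at the build/plugins dir of your checkout -->
>  <value>C:/work/nutch/build/plugins</value>
></property>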
>
>At 2013-07-21 13:48:33, "Lewis John Mcgibbney" <[email protected]> wrote:
>>Yes, the warnings you've shown now are fine; they come from the old mapred API.
>>It's OK.
>>It's now stating that you've got 1 URL injected. Can you check the db,
>>either by inspecting its contents or by dumping/reading them with the readdb tool?
>>Please remember that somewhere in the tutorial you reference, the absolute
>>path to the plugins folder needs to be changed. This is your problem here.
>>InjectorJob doesn't require plugins to work... however, when your indexing
>>plugins are called you are in trouble. You need to sort this out.
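>>
>>For example, a sketch of checking with the readdb tool (the output directory is a placeholder, and exact options may differ by Nutch version):
>>
>>bin/nutch readdb -stats
>>bin/nutch readdb -dump dumpdir
>>
>>-stats prints per-status counts; -dump writes the stored records into the given directory for inspection.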
>>
>>On Saturday, July 20, 2013, Rui Gao <[email protected]> wrote:
>>> I am following this article: http://wiki.apache.org/nutch/RunNutchInEclipse.
>>> My environment is Windows XP + Cygwin + Eclipse.
>>> I think the first several WARN logs are not the blocker. (plugin.folders
>>> contains an additional folder; after I remove it, the job still fails.)
>>> We can compare them with the logs from InjectorJob, which runs
>>> successfully:
>>>
>>> 2013-07-21 12:45:01,968 INFO crawl.InjectorJob - InjectorJob: starting at 2013-07-21 12:45:01
>>> 2013-07-21 12:45:01,968 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls/dev
>>> 2013-07-21 12:45:02,921 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
>>> 2013-07-21 12:45:02,968 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2013-07-21 12:45:02,984 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>>> 2013-07-21 12:45:03,015 WARN snappy.LoadSnappy - Snappy native library not loaded
>>> 2013-07-21 12:45:03,328 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>>> 2013-07-21 12:45:03,437 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
>>> 2013-07-21 12:45:03,484 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>>> 2013-07-21 12:45:03,625 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>>> 2013-07-21 12:45:04,218 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
>>> 2013-07-21 12:45:04,218 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
>>> 2013-07-21 12:45:04,218 INFO crawl.InjectorJob - Injector: finished at 2013-07-21 12:45:04, elapsed: 00:00:02
>>>
>>> At 2013-07-21 12:36:29, "Lewis John Mcgibbney" <[email protected]> wrote:
>>>>Please read the exception trace. Are you running on Hadoop? You need to
>>>>ensure that your plugin.folders property points to the right path. There is
>>>>also a mention of a missing job file. Please ensure that your Nutch job
>>>>file is on the Hadoop jobtracker classpath.
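>>>>
>>>>On the job file point: a minimal sketch of how a job normally declares its jar under the new mapreduce API, so Hadoop can ship the user classes (the class and job names below are placeholders):
>>>>
>>>>import org.apache.hadoop.conf.Configuration;
>>>>import org.apache.hadoop.mapreduce.Job;
>>>>
>>>>public class JobJarSketch {
>>>>  public static void main(String[] args) throws Exception {
>>>>    Configuration conf = new Configuration();
>>>>    Job job = new Job(conf, "sketch");
>>>>    // Declares which jar to ship to the cluster; when no jar is set,
>>>>    // JobClient warns "No job jar file set. User classes may not be
>>>>    // found."
>>>>    job.setJarByClass(JobJarSketch.class);
>>>>  }
>>>>}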
>>>>hth
>>>>
>>>>On Saturday, July 20, 2013, Rui Gao <[email protected]> wrote:
>>>>> Hi Lewis,
>>>>>
>>>>> I tried downgrading gora-core to 0.2.1; after that, I could run InjectorJob
>>>>> with both HSQL and MySQL. But the Crawler job still fails. Here's the log:
>>>>> 2013-07-21 12:23:41,156 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
>>>>> 2013-07-21 12:23:41,203 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>> 2013-07-21 12:23:41,234 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>>>>> 2013-07-21 12:23:41,265 WARN snappy.LoadSnappy - Snappy native library not loaded
>>>>> 2013-07-21 12:23:41,578 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>>>>> 2013-07-21 12:23:41,718 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
>>>>> 2013-07-21 12:23:41,765 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>>>>> 2013-07-21 12:23:41,937 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>>>>> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
>>>>> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
>>>>> 2013-07-21 12:23:42,468 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>>> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>>> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>>> 2013-07-21 12:23:42,593 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>>>>> 2013-07-21 12:23:42,796 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
>>>>> 2013-07-21 12:23:43,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>>> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>>> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>>> 2013-07-21 12:23:43,093 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
>>>>> 2013-07-21 12:23:43,234 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>>>>> 2013-07-21 12:23:43,250 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>>>>> 2013-07-21 12:23:43,250 WARN mapred.LocalJobRunner - job_local1378002997_0002
>>>>> java.lang.NullPointerException
>>>>> at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>>>>> at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>>>>> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>>>>> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>>>
>>>>> I don't know if this is the right direction to continue in. But
>>>>> anyway, hopefully my experience can help others.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Rui
>>>>>
>>>>> At 2013-07-20 23:07:41, "Rui Gao" <> wrote:
>>>
>>
>>--
>>*Lewis*