I checked the DB; the URL is already in the DB. The plugin property is configured like this:

<property>
  <name>plugin.folders</name>
  <value>./src/plugin,./plugins</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
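The resolution rule in that property description can be sketched as a standalone check. This is illustrative code, not Nutch's actual PluginRepository implementation; the class and method names are made up for debugging the "directory not found" warnings below:

```java
import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class PluginFolderCheck {
    // Illustrative helper, not part of Nutch: applies the plugin.folders
    // rule from the property description -- absolute entries are used
    // as-is, relative entries are searched for on the classpath.
    static List<String> resolve(String pluginFolders) {
        List<String> found = new ArrayList<>();
        for (String entry : pluginFolders.split(",")) {
            String name = entry.trim();
            File f = new File(name);
            if (f.isAbsolute()) {
                if (f.isDirectory()) {
                    found.add(f.getPath());
                }
            } else {
                // Strip a leading "./" so the classpath lookup sees a plain name.
                String resource = name.startsWith("./") ? name.substring(2) : name;
                URL url = PluginFolderCheck.class.getClassLoader().getResource(resource);
                if (url != null) {
                    found.add(url.getPath());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // An entry that resolves to nothing here corresponds to warnings
        // like "Plugins: directory not found: ./plugins" in the logs.
        System.out.println(resolve("./src/plugin,./plugins"));
    }
}
```

Running this from the Eclipse working directory shows which entries actually resolve; if neither does, the indexing plugins will not load even though InjectorJob succeeds.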
I guess the plugin property is configured properly, because when I change it to another value, it complains that the plugins could not be found.

At 2013-07-21 13:48:33, "Lewis John Mcgibbney" <[email protected]> wrote:
>Yes, the warns which you've shown now are fine; this is the old mapred API.
>It's OK.
>It's now stating that you've got 1 URL injected. Can you check the db?
>Either check contents or dump/read them with the readdb tool?
>Please remember that somewhere in the tutorial you reference, the absolute
>path to the plugins folder needs to be changed. This is your problem here.
>InjectorJob doesn't require plugins to work... however when your indexing
>plugins are called you are in trouble. You need to sort this out.
>
>On Saturday, July 20, 2013, Rui Gao <[email protected]> wrote:
>> I am following this article http://wiki.apache.org/nutch/RunNutchInEclipse.
>> My environment is Windows XP + cygwin + eclipse.
>> I think the top several WARN logs are not the blocker. (The plugin.folders
>> contains an additional folder; after I remove it the job still fails.)
>> We can compare it with the logs from InjectorJob, which runs successfully:
>>
>> 2013-07-21 12:45:01,968 INFO crawl.InjectorJob - InjectorJob: starting at 2013-07-21 12:45:01
>> 2013-07-21 12:45:01,968 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls/dev
>> 2013-07-21 12:45:02,921 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
>> 2013-07-21 12:45:02,968 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 2013-07-21 12:45:02,984 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> 2013-07-21 12:45:03,015 WARN snappy.LoadSnappy - Snappy native library not loaded
>> 2013-07-21 12:45:03,328 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>> 2013-07-21 12:45:03,437 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
>> 2013-07-21 12:45:03,484 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>> 2013-07-21 12:45:03,625 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>> 2013-07-21 12:45:04,218 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
>> 2013-07-21 12:45:04,218 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
>> 2013-07-21 12:45:04,218 INFO crawl.InjectorJob - Injector: finished at 2013-07-21 12:45:04, elapsed: 00:00:02
>>
>> At 2013-07-21 12:36:29, "Lewis John Mcgibbney" <[email protected]> wrote:
>>> Please read the exception trace. You are running on Hadoop? You need to
>>> ensure that your plugins.directory points to the right path. There is also
>>> a mention of a missing job file. Please ensure that your nutch job file is
>>> on the Hadoop jobtracker classpath.
>>> hth
>>>
>>> On Saturday, July 20, 2013, Rui Gao <[email protected]> wrote:
>>>> Hi Lewis,
>>>>
>>>> I tried to downgrade gora-core to 0.2.1. Then I could run InjectorJob
>>>> with both hsql and mysql, but the Crawler job still fails. Here's the log:
>>>> 2013-07-21 12:23:41,156 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
>>>> 2013-07-21 12:23:41,203 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>> 2013-07-21 12:23:41,234 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>>>> 2013-07-21 12:23:41,265 WARN snappy.LoadSnappy - Snappy native library not loaded
>>>> 2013-07-21 12:23:41,578 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>>>> 2013-07-21 12:23:41,718 WARN plugin.PluginRepository - Plugins: directory not found: ./plugins
>>>> 2013-07-21 12:23:41,765 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
>>>> 2013-07-21 12:23:41,937 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>>>> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
>>>> 2013-07-21 12:23:42,468 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
>>>> 2013-07-21 12:23:42,468 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>> 2013-07-21 12:23:42,468 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>> 2013-07-21 12:23:42,593 WARN mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>>>> 2013-07-21 12:23:42,796 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
>>>> 2013-07-21 12:23:43,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>>> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>>> 2013-07-21 12:23:43,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>>> 2013-07-21 12:23:43,093 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
>>>> 2013-07-21 12:23:43,234 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
>>>> 2013-07-21 12:23:43,250 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>>>> 2013-07-21 12:23:43,250 WARN mapred.LocalJobRunner - job_local1378002997_0002
>>>> java.lang.NullPointerException
>>>>     at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>>>>     at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>>>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>>>>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>>
>>>> I don't know if this is the right direction I should continue with. But
>>>> anyway, hopefully my experience could help others.
>>>>
>>>> Regards,
>>>> Rui
>
>--
>*Lewis*
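The trace above shows `new Utf8(null)` failing inside GeneratorReducer.setup(), i.e. a string read during reducer setup (apparently a batch id from the job configuration) was null because nothing had set it. The sketch below is illustrative, not Nutch's actual code, and the key name is hypothetical; it shows the general defensive pattern of failing fast with a clear message instead of an NPE deep inside Avro:

```java
import java.util.Map;

public class BatchIdGuard {
    // Hypothetical key name for illustration; not necessarily what
    // GeneratorReducer actually reads.
    static final String BATCH_ID_KEY = "generate.batch.id";

    static String requireBatchId(Map<String, String> conf) {
        String id = conf.get(BATCH_ID_KEY);
        if (id == null) {
            // Surface the real problem (missing generate-phase state)
            // instead of letting new Utf8(null) throw an opaque NPE.
            throw new IllegalStateException(
                BATCH_ID_KEY + " is not set -- did the generate step "
                + "complete and record its batch id before this job ran?");
        }
        return id;
    }
}
```

If the guard fires, the fix is upstream: the generate step never handed its batch id to the failing job, which matches running the Crawler before the generate state is in place.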

