Re:Re: [2.2.1] What does inject job do?

Rui Gao Sat, 20 Jul 2013 21:58:39 -0700

I am following this article: http://wiki.apache.org/nutch/RunNutchInEclipse. My 
environment is Windows XP + Cygwin + Eclipse.
I think the first several WARN lines are not the blocker. (plugin.folders 
contained an additional folder; after I removed it, the job still fails.) We can 
compare that log with the log from an InjectorJob run that completes successfully:


2013-07-21 12:45:01,968 INFO  crawl.InjectorJob - InjectorJob: starting at 
2013-07-21 12:45:01
2013-07-21 12:45:01,968 INFO  crawl.InjectorJob - InjectorJob: Injecting 
urlDir: urls/dev
2013-07-21 12:45:02,921 INFO  crawl.InjectorJob - InjectorJob: Using class 
org.apache.gora.sql.store.SqlStore as the Gora storage class.
2013-07-21 12:45:02,968 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2013-07-21 12:45:02,984 WARN  mapred.JobClient - No job jar file set.  User 
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:45:03,015 WARN  snappy.LoadSnappy - Snappy native library not 
loaded
2013-07-21 12:45:03,328 INFO  mapreduce.GoraRecordWriter - 
gora.buffer.write.limit = 10000
2013-07-21 12:45:03,437 WARN  plugin.PluginRepository - Plugins: directory not 
found: ./plugins
2013-07-21 12:45:03,484 INFO  regex.RegexURLNormalizer - can't find rules for 
scope 'inject', using default
2013-07-21 12:45:03,625 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2013-07-21 12:45:04,218 INFO  crawl.InjectorJob - InjectorJob: total number of 
urls rejected by filters: 0
2013-07-21 12:45:04,218 INFO  crawl.InjectorJob - InjectorJob: total number of 
urls injected after normalization and filtering: 1
2013-07-21 12:45:04,218 INFO  crawl.InjectorJob - Injector: finished at 
2013-07-21 12:45:04, elapsed: 00:00:02
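
The comparison can also be done mechanically. Below is a minimal Java sketch, 
not part of Nutch: the class name LogCompare and both of its methods are made 
up for illustration. It assumes every log entry has first been re-joined onto 
a single line (the mail client wrapped them here) and that entries follow the 
"timestamp LEVEL logger - message" layout seen above. It drops the timestamps 
and reports the WARN entries that appear only in the failing run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Nutch or Hadoop class.
public class LogCompare {
    // Matches e.g. "2013-07-21 12:45:02,984 WARN  mapred.JobClient - message"
    private static final Pattern ENTRY = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}\\s+(\\w+)\\s+(\\S+) - (.*)$");

    /** Returns "LEVEL logger - message" without the timestamp, or null for non-entries. */
    public static String normalize(String line) {
        Matcher m = ENTRY.matcher(line.trim());
        return m.matches() ? m.group(1) + " " + m.group(2) + " - " + m.group(3) : null;
    }

    /** WARN entries present in the failing log but absent from the successful one. */
    public static List<String> warnsOnlyInFailing(List<String> okLog, List<String> failingLog) {
        Set<String> seen = new HashSet<String>();
        for (String line : okLog) {
            String entry = normalize(line);
            if (entry != null) {
                seen.add(entry);
            }
        }
        List<String> result = new ArrayList<String>();
        for (String line : failingLog) {
            String entry = normalize(line);
            if (entry != null && entry.startsWith("WARN ") && !seen.contains(entry)) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two entries from the successful run, three from the failing run.
        List<String> ok = Arrays.asList(
            "2013-07-21 12:45:03,437 WARN  plugin.PluginRepository - Plugins: directory not found: ./plugins",
            "2013-07-21 12:45:03,625 WARN  mapred.FileOutputCommitter - Output path is null in cleanup");
        List<String> failing = Arrays.asList(
            "2013-07-21 12:23:41,718 WARN  plugin.PluginRepository - Plugins: directory not found: ./plugins",
            "2013-07-21 12:23:43,250 WARN  mapred.FileOutputCommitter - Output path is null in cleanup",
            "2013-07-21 12:23:43,250 WARN  mapred.LocalJobRunner - job_local1378002997_0002");
        for (String entry : warnsOnlyInFailing(ok, failing)) {
            System.out.println(entry);
        }
        // prints: WARN mapred.LocalJobRunner - job_local1378002997_0002
    }
}
```

On the entries quoted in this thread, the only WARN that differs is the 
LocalJobRunner one that precedes the NullPointerException, which is consistent 
with the earlier WARN lines not being the cause.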

At 2013-07-21 12:36:29,"Lewis John Mcgibbney" <[email protected]> wrote:
>Please read the exception trace. You are running on Hadoop? You need to
>ensure that your plugins.directory points to the right path. There is also
>a mention of a missing job file. Please ensure that your nutch job file is
>on the Hadoop jobtracker classpath.
>hth
>
>On Saturday, July 20, 2013, Rui Gao <[email protected]> wrote:
>> Hi Lewis,
>>
>> I downgraded gora-core to 0.2.1. After that, I could run InjectorJob
>with both HSQL and MySQL, but the Crawler job still fails. Here's the log:
>> 2013-07-21 12:23:41,156 INFO  crawl.InjectorJob - InjectorJob: Using
>class org.apache.gora.sql.store.SqlStore as the Gora storage class.
>> 2013-07-21 12:23:41,203 WARN  util.NativeCodeLoader - Unable to load
>native-hadoop library for your platform... using builtin-java classes where
>applicable
>> 2013-07-21 12:23:41,234 WARN  mapred.JobClient - No job jar file set.
> User classes may not be found. See JobConf(Class) or
>JobConf#setJar(String).
>> 2013-07-21 12:23:41,265 WARN  snappy.LoadSnappy - Snappy native library
>not loaded
>> 2013-07-21 12:23:41,578 INFO  mapreduce.GoraRecordWriter -
>gora.buffer.write.limit = 10000
>> 2013-07-21 12:23:41,718 WARN  plugin.PluginRepository - Plugins:
>directory not found: ./plugins
>> 2013-07-21 12:23:41,765 INFO  regex.RegexURLNormalizer - can't find rules
>for scope 'inject', using default
>> 2013-07-21 12:23:41,937 WARN  mapred.FileOutputCommitter - Output path is
>null in cleanup
>> 2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total
>number of urls rejected by filters: 0
>> 2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total
>number of urls injected after normalization and filtering: 1
>> 2013-07-21 12:23:42,468 INFO  crawl.FetchScheduleFactory - Using
>FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule -
>defaultInterval=2592000
>> 2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule -
>maxInterval=7776000
>> 2013-07-21 12:23:42,593 WARN  mapred.JobClient - No job jar file set.
> User classes may not be found. See JobConf(Class) or
>JobConf#setJar(String).
>> 2013-07-21 12:23:42,796 INFO  mapreduce.GoraRecordReader -
>gora.buffer.read.limit = 10000
>> 2013-07-21 12:23:43,062 INFO  crawl.FetchScheduleFactory - Using
>FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule -
>defaultInterval=2592000
>> 2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule -
>maxInterval=7776000
>> 2013-07-21 12:23:43,093 INFO  regex.RegexURLNormalizer - can't find rules
>for scope 'generate_host_count', using default
>> 2013-07-21 12:23:43,234 INFO  mapreduce.GoraRecordWriter -
>gora.buffer.write.limit = 10000
>> 2013-07-21 12:23:43,250 WARN  mapred.FileOutputCommitter - Output path is
>null in cleanup
>> 2013-07-21 12:23:43,250 WARN  mapred.LocalJobRunner -
>job_local1378002997_0002
>> java.lang.NullPointerException
>>     at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>>     at
>org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>>     at
>org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>>     at
>org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>
>> I don't know whether this is the right direction to continue in.
>Either way, hopefully my experience can help others.
>>
>>
>> Regards,
>> Rui
>>
>>
>>
>>
>>
>>
>> At 2013-07-20 23:07:41,"Rui Gao" <[email protected]> wrote:
>>>Hi Lewis,
>>>
>>>Thanks for your answer.
>>>So, what direction will Nutch go? Will it work with relational
>databases, or only with non-relational databases like HBase?
>>>I remember that when 2.2.1 was released, I checked the release notes, and
>they said some bugs related to MySQL had been fixed. That's why I am trying
>to integrate it with MySQL or HSQL. Also, the wiki has a link describing
>how to integrate Nutch with MySQL:
>http://nlp.solutions.asia/?p=362
>>>
>>>Do you have any suggestion?
>>>
>>>Thanks.
>>>
>>>Best Regards,
>>>Rui
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <[email protected]>
>wrote:
>>>>Hi Rui,
>>>>This should not work.
>>>>The SqlStore module and support for it is now deprecated within Apache
>Gora.
>>>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>>>Gora artifacts but this is not recommended.
>>>>Thanks
>>>>Lewis
>>>>
>>>>
>>>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have set up the Eclipse environment according to the wiki. Here is
>>>>> what I did before running the inject job:
>>>>> 1. I used SqlStore as the storage class.
>>>>> 2. I started an HSQL database which contains the table 'webpage'.
>>>>> 3. I added 1 URL to seed.txt.
>>>>> Then I ran the inject job. It seems the job finished successfully, but
>>>>> no change was made to my HSQL database. Any thoughts about this?
>>>>> Here's the log:
>>>>> InjectorJob: starting at 2013-07-07 15:28:42
>>>>> InjectorJob: Injecting urlDir: urls/dev
>>>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>>>> storage class.
>>>>> InjectorJob: total number of urls rejected by filters: 0
>>>>> InjectorJob: total number of urls injected after normalization and
>>>>> filtering: 1
>>>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>>>
>>>>> Best Regards,
>>>>> Rui
>>>>>
>>>>
>>>>
>>>>
>>>>--
>>>>*Lewis*
>>
>
>-- 
>*Lewis*
