Re: [2.2.1] What does inject job do?

Lewis John Mcgibbney Sat, 20 Jul 2013 22:43:43 -0700

Please read the exception trace. Are you running on Hadoop? If so, you need to
ensure that your plugin.folders property points to the right path. The log also
mentions a missing job file. Please ensure that your Nutch job file is
on the Hadoop JobTracker classpath.
HTH
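
One way to check the plugin path advice above is a small script. This is a minimal sketch, not part of Nutch. It assumes the setting in question is the `plugin.folders` property and that it lives in a Hadoop-style configuration file such as nutch-site.xml. The helper names `read_property` and `missing_plugin_dirs` are hypothetical. It only tests the filesystem relative to a base directory, and Nutch's own lookup of relative entries may differ.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_property(conf_xml, name):
    """Return the value of a Hadoop-style <property> entry, or None if absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def missing_plugin_dirs(conf_xml, base_dir="."):
    """List configured plugin directories that do not exist under base_dir.

    Assumption: plugin.folders holds a comma-separated list of directories,
    and relative entries are resolved against base_dir.
    """
    value = read_property(conf_xml, "plugin.folders")
    if not value:
        return ["<plugin.folders not set>"]
    missing = []
    for entry in value.split(","):
        path = Path(entry.strip())
        if not path.is_absolute():
            path = Path(base_dir) / path
        if not path.is_dir():
            missing.append(str(path))
    return missing
```

For example, `missing_plugin_dirs(Path("conf/nutch-site.xml").read_text(), ".")` would list every configured entry that is not a directory under the current working directory (the `conf/nutch-site.xml` location is itself an assumption about your layout). An empty list means every entry was found.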


On Saturday, July 20, 2013, Rui Gao <[email protected]> wrote:
> Hi Lewis,
>
> I tried downgrading gora-core to 0.2.1. After that, I could run InjectorJob
> with both HSQL and MySQL, but the Crawler job still fails. Here's the log:
> 2013-07-21 12:23:41,156 INFO  crawl.InjectorJob - InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
> 2013-07-21 12:23:41,203 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-07-21 12:23:41,234 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-07-21 12:23:41,265 WARN  snappy.LoadSnappy - Snappy native library not loaded
> 2013-07-21 12:23:41,578 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> 2013-07-21 12:23:41,718 WARN  plugin.PluginRepository - Plugins: directory not found: ./plugins
> 2013-07-21 12:23:41,765 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2013-07-21 12:23:41,937 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 0
> 2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 1
> 2013-07-21 12:23:42,468 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2013-07-21 12:23:42,593 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-07-21 12:23:42,796 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> 2013-07-21 12:23:43,062 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
> 2013-07-21 12:23:43,093 INFO  regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> 2013-07-21 12:23:43,234 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> 2013-07-21 12:23:43,250 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
> 2013-07-21 12:23:43,250 WARN  mapred.LocalJobRunner - job_local1378002997_0002
> java.lang.NullPointerException
>     at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
>     at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>
> I don't know whether this is the right direction to continue in, but
> hopefully my experience can help others.
>
>
> Regards,
> Rui
>
>
>
>
>
>
> At 2013-07-20 23:07:41,"Rui Gao" <[email protected]> wrote:
>>Hi Lewis,
>>
>>Thanks for your answer.
>>So, what direction will Nutch go in? Will it work with relational
>>databases, or only with non-relational databases like HBase?
>>I remember that when 2.2.1 was released, I checked the release notes, which
>>said some bugs related to MySQL had been fixed. That's why I tried to
>>integrate it with MySQL or HSQL. Also, the wiki has a link about how to
>>integrate Nutch with MySQL:
>>http://nlp.solutions.asia/?p=362
>>
>>Do you have any suggestion?
>>
>>Thanks.
>>
>>Best Regards,
>>Rui
>>
>>
>>
>>
>>
>>
>>
>>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <[email protected]> wrote:
>>>Hi Rui,
>>>This should not work.
>>>The SqlStore module and support for it is now deprecated within Apache
>>>Gora.
>>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>>Gora artifacts but this is not recommended.
>>>Thanks
>>>Lewis
>>>
>>>
>>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have set up the Eclipse environment according to the wiki. Here is
>>>> what I did before running the inject job:
>>>> 1. I used SqlStore as the storage class.
>>>> 2. I started the HSQL database, which contains the table 'webpage'.
>>>> 3. I added 1 URL to seed.txt.
>>>> Then I ran the inject job. It seems the job finished successfully, but
>>>> no change was made to my HSQL database. Any thoughts about this?
>>>> Here's the log:
>>>> InjectorJob: starting at 2013-07-07 15:28:42
>>>> InjectorJob: Injecting urlDir: urls/dev
>>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>>> storage class.
>>>> InjectorJob: total number of urls rejected by filters: 0
>>>> InjectorJob: total number of urls injected after normalization and
>>>> filtering: 1
>>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>>
>>>> Best Regards,
>>>> Rui
>>>>
>>>
>>>
>>>
>>>--
>>>*Lewis*
>

-- 
*Lewis*
