Re:Re:Re: [2.2.1] What does inject job do?

Rui Gao Sat, 20 Jul 2013 21:33:37 -0700

Hi Lewis,

I tried downgrading gora-core to 0.2.1. After that, I could run InjectorJob with 
both HSQL and MySQL, but the Crawler job still fails. Here's the log:
2013-07-21 12:23:41,156 INFO  crawl.InjectorJob - InjectorJob: Using class 
org.apache.gora.sql.store.SqlStore as the Gora storage class.
2013-07-21 12:23:41,203 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2013-07-21 12:23:41,234 WARN  mapred.JobClient - No job jar file set.  User 
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:23:41,265 WARN  snappy.LoadSnappy - Snappy native library not 
loaded
2013-07-21 12:23:41,578 INFO  mapreduce.GoraRecordWriter - 
gora.buffer.write.limit = 10000
2013-07-21 12:23:41,718 WARN  plugin.PluginRepository - Plugins: directory not 
found: ./plugins
2013-07-21 12:23:41,765 INFO  regex.RegexURLNormalizer - can't find rules for 
scope 'inject', using default
2013-07-21 12:23:41,937 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of 
urls rejected by filters: 0
2013-07-21 12:23:42,468 INFO  crawl.InjectorJob - InjectorJob: total number of 
urls injected after normalization and filtering: 1
2013-07-21 12:23:42,468 INFO  crawl.FetchScheduleFactory - Using FetchSchedule 
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - 
defaultInterval=2592000
2013-07-21 12:23:42,468 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-07-21 12:23:42,593 WARN  mapred.JobClient - No job jar file set.  User 
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2013-07-21 12:23:42,796 INFO  mapreduce.GoraRecordReader - 
gora.buffer.read.limit = 10000
2013-07-21 12:23:43,062 INFO  crawl.FetchScheduleFactory - Using FetchSchedule 
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - 
defaultInterval=2592000
2013-07-21 12:23:43,062 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2013-07-21 12:23:43,093 INFO  regex.RegexURLNormalizer - can't find rules for 
scope 'generate_host_count', using default
2013-07-21 12:23:43,234 INFO  mapreduce.GoraRecordWriter - 
gora.buffer.write.limit = 10000
2013-07-21 12:23:43,250 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2013-07-21 12:23:43,250 WARN  mapred.LocalJobRunner - job_local1378002997_0002
java.lang.NullPointerException
    at org.apache.avro.util.Utf8.<init>(Utf8.java:37)
    at org.apache.nutch.crawl.GeneratorReducer.setup(GeneratorReducer.java:100)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
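
For anyone hitting the same trace: the top frame shows the NullPointerException being raised inside the Utf8 constructor, called from GeneratorReducer.setup, which suggests that a null string reached that constructor. One possible cause is a configuration value that was never set before the reducer started. That is an assumption and has not been checked against the Nutch source. The sketch below only illustrates the failure pattern and a guard that reports which key is missing. EncodedText, the map-based configuration, and the key name "example.batch.id" are all hypothetical stand-ins, not Nutch, Hadoop, or Avro classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class NullConfigSketch {

    // Hypothetical stand-in for a string wrapper that encodes its argument
    // eagerly in the constructor; it is NOT the Avro Utf8 class.
    static final class EncodedText {
        private final byte[] bytes;

        EncodedText(String s) {
            // Throws NullPointerException when s is null.
            this.bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        int length() {
            return bytes.length;
        }
    }

    // Unguarded lookup: a missing key surfaces as a bare NullPointerException.
    static EncodedText unguarded(Map<String, String> conf, String key) {
        return new EncodedText(conf.get(key));
    }

    // Guarded lookup: a missing key fails with a message naming the key.
    static EncodedText guarded(Map<String, String> conf, String key) {
        String value = conf.get(key);
        if (value == null) {
            throw new IllegalStateException("Missing configuration value: " + key);
        }
        return new EncodedText(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // plain map standing in for a job configuration
        String key = "example.batch.id";            // hypothetical key name

        try {
            unguarded(conf, key);
        } catch (NullPointerException e) {
            System.out.println("unguarded: NullPointerException");
        }

        try {
            guarded(conf, key);
        } catch (IllegalStateException e) {
            System.out.println("guarded: " + e.getMessage());
        }
    }
}
```

With a guard of this kind the failure names the missing value, which is easier to trace back to a setup step than a bare NullPointerException in a constructor.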


I don't know whether this is the right direction to continue in. Anyway, 
hopefully my experience can help others.


Regards,
Rui






At 2013-07-20 23:07:41,"Rui Gao" <[email protected]> wrote:
>Hi Lewis,
>
>Thanks for your answer.
>So, what direction will Nutch go? Will it work with relational 
>databases, or only with non-relational databases like HBase?
>I remember that when 2.2.1 was released, I checked the release notes, and they 
>said some bugs related to MySQL had been fixed. That's why I am trying to 
>integrate it with MySQL or HSQL. Also, the wiki has a link describing how to 
>integrate Nutch with MySQL: http://nlp.solutions.asia/?p=362
>
>Do you have any suggestions?
>
>Thanks.
>
>Best Regards,
>Rui
>
>
>
>
>
>
>
>At 2013-07-11 03:53:12,"Lewis John Mcgibbney" <[email protected]> 
>wrote:
>>Hi Rui,
>>This should not work.
>>The SqlStore module and support for it is now deprecated within Apache Gora.
>>If you would like to downgrade to use Nutch 2.1, then you can use older
>>Gora artifacts but this is not recommended.
>>Thanks
>>Lewis
>>
>>
>>On Sun, Jul 7, 2013 at 12:36 AM, Rui Gao <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I have set up the Eclipse environment according to the wiki. Here is
>>> what I did before running the inject job:
>>> 1. I used SqlStore as the storage class.
>>> 2. I started an HSQL database which contains the table 'webpage'.
>>> 3. I added 1 URL to seed.txt.
>>> Then I ran the inject job. The job seems to finish successfully, but
>>> no change was made to my HSQL database. Any thoughts about this?
>>> Here's the log:
>>> InjectorJob: starting at 2013-07-07 15:28:42
>>> InjectorJob: Injecting urlDir: urls/dev
>>> InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora
>>> storage class.
>>> InjectorJob: total number of urls rejected by filters: 0
>>> InjectorJob: total number of urls injected after normalization and
>>> filtering: 1
>>> Injector: finished at 2013-07-07 15:28:44, elapsed: 00:00:02
>>>
>>> Best Regards,
>>> Rui
>>>
>>
>>
>>
>>-- 
>>*Lewis*
