Re: error using generate in 2.x

Lewis John Mcgibbney Sat, 30 Mar 2013 16:03:30 -0700

I think we may also need to add BATCH_ID to one Job's HashSet:

private static final Collection<WebPage.Field> FIELDS =
    new HashSet<WebPage.Field>();
static {
...
  FIELDS.add(WebPage.Field.BATCH_ID);
}
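
A minimal sketch of why the missing entry could matter, assuming (this is not confirmed anywhere in this thread) that a job only loads the fields it lists in FIELDS, so a field left out of the set comes back as null. Field, storedRow() and the sample values are hypothetical stand-ins, not the real WebPage.Field or real web table contents:

```java
import java.util.Collection;
import java.util.EnumMap;
import java.util.HashSet;
import java.util.Map;

public class FieldsSketch {
  // Hypothetical stand-in for WebPage.Field; the real enum has more entries.
  enum Field { STATUS, FETCH_TIME, BATCH_ID }

  // Hypothetical stored row in which every field has a value.
  static Map<Field, String> storedRow() {
    Map<Field, String> row = new EnumMap<Field, String>(Field.class);
    row.put(Field.STATUS, "fetched");
    row.put(Field.FETCH_TIME, "1364598809");
    row.put(Field.BATCH_ID, "batch-1");
    return row;
  }

  // Copies only the requested fields (assumed behaviour of a
  // field-restricted query; not taken from the Gora source).
  static Map<Field, String> load(Map<Field, String> row,
                                 Collection<Field> requested) {
    Map<Field, String> loaded = new EnumMap<Field, String>(Field.class);
    for (Field f : requested) {
      if (row.containsKey(f)) {
        loaded.put(f, row.get(f));
      }
    }
    return loaded;
  }

  // A job's field set before the proposed change.
  static Collection<Field> fieldsWithoutBatchId() {
    Collection<Field> fields = new HashSet<Field>();
    fields.add(Field.STATUS);
    fields.add(Field.FETCH_TIME);
    return fields;
  }

  // The same set with BATCH_ID added, as suggested above.
  static Collection<Field> fieldsWithBatchId() {
    Collection<Field> fields = fieldsWithoutBatchId();
    fields.add(Field.BATCH_ID);
    return fields;
  }

  public static void main(String[] args) {
    // BATCH_ID was never requested, so this prints "null".
    System.out.println(
        load(storedRow(), fieldsWithoutBatchId()).get(Field.BATCH_ID));
    // With BATCH_ID in the set, this prints "batch-1".
    System.out.println(
        load(storedRow(), fieldsWithBatchId()).get(Field.BATCH_ID));
  }
}
```

Any code that dereferences the first result without a null check would fail in the same way as the NPE reported further down, though whether that is the actual cause here is still unverified.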



On Sat, Mar 30, 2013 at 3:55 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
> I've tried to sort this out locally this morning...
> I can almost replicate this behaviour with gora-cassandra and it looks
> most likely that the patch(es) applied in
> * NUTCH-1533 - NUTCH-1532 Implement getPrevModifiedTime(),
> setPrevModifiedTime(), getBatchId() and setBatchId() accessors in
> o.a.n.storage.WebPage, and
> * NUTCH-1532 - Replace 'segment' mapping field with batchId,
> respectively are not backwards compatible because some URLs within the web
> database do not contain values for the batchId.
> Of course this is a major problem.
> I opened NUTCH-1551 [0] and submitted a patch to make WebTableReader
> backwards compatible with the above patches. Please try out the patch if
> you can and comment so I can commit.
>
> We have a couple of options here.
> 1) Revert both of the above until we can get a fix
> 2) Get a fix just now and commit it.
> What do you guys want to do?
>
> I have a question about whether or not we can dynamically add fields to
> existing database entries by injecting them.
> Say, for example, you inject URLs without the batchId field in your mapping
> file, then add the field and inject some more URLs... will the field be
> added to your database? If so, then why are we getting the NPE?
> There must be some other location in the Nutch code where an
> attempt is being made to obtain the batchId for some given key... it
> cannot be obtained and we receive the NPE.
>
> [0] https://issues.apache.org/jira/browse/NUTCH-1551
>
>
> On Fri, Mar 29, 2013 at 5:05 PM, kaveh minooie <[email protected]> wrote:
>
>> I use git and I fetch from GitHub
>> (https://github.com/apache/nutch.git);
>> currently I am on this commit:
>>
>> commit 4bb01d6b908dc230c8be89d398b03a86581ec42b
>> Author: lufeng <[email protected]>
>> Date:   Thu Mar 28 13:09:09 2013 +0000
>>
>>     NUTCH-1547 BasicIndexingFilter - Problem to index full title
>>
>>     git-svn-id: https://svn.apache.org/repos/asf/nutch/branches/2.x@1462079 13f79535-47bb-0310-9956-ffa450edef68
>>
>>
>> before I was on this commit :
>>
>>
>> commit f02dcf62566583551426c08bd388080e5b2bc93e
>>
>> >  f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml
>>
>>
>> On 03/29/2013 04:35 PM, [email protected] wrote:
>>
>>> Yes, with hbase. Here is the error
>>>
>>> 13/03/29 16:33:29 INFO zookeeper.ZooKeeper: Session: 0x13d7770d67d005f closed
>>> 13/03/29 16:33:29 ERROR crawl.WebTableReader: WebTableReader: java.lang.NullPointerException
>>>          at org.apache.gora.hbase.store.HBaseStore.addFields(HBaseStore.java:398)
>>>          at org.apache.gora.hbase.store.HBaseStore.execute(HBaseStore.java:360)
>>>          at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:234)
>>>          at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:476)
>>>          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>          at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
>>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>>          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>>
>>> If I revert to previous release it works fine.
>>>
>>> Thanks.
>>> Alex.
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Lewis John Mcgibbney <[email protected]>
>>> To: user <[email protected]>
>>> Sent: Fri, Mar 29, 2013 4:30 pm
>>> Subject: Re: error using generate in 2.x
>>>
>>>
>>> Hi Alex,
>>> With HBase also?
>>> There 'was' a bug in gora-cassandra module for this command + params
>>> however I thought it had been addressed and therefore resolved it.
>>> Lewis
>>>
>>>
>>> On Fri, Mar 29, 2013 at 4:00 PM, <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> It seems that trunk has a few bugs. I found out that readdb -url urlname
>>>> also gives errors.
>>>>
>>>> Thanks.
>>>> Alex.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: kaveh minooie <[email protected]>
>>>> To: user <[email protected]>
>>>> Sent: Fri, Mar 29, 2013 1:53 pm
>>>> Subject: Re: error using generate in 2.x
>>>>
>>>>
>>>> Hi lewis
>>>>
>>>> the mapping file that I am using is the one that comes with nutch, and I
>>>> haven't touched it. this message in the log is caused by using the
>>>> -crawlId on the command line. for example this log was the result of
>>>> this command :
>>>>
>>>> bin/nutch generate -topN 1000 -crawlId t1
>>>>
>>>> which causes Nutch (or, I guess, technically Gora) to use a table
>>>> named 't1_webpage'. Though I have to say that I don't understand the
>>>> rationale behind the code generating a warning like this (I know
>>>> it is not actually a warning, just that the way the message has been
>>>> phrased makes it look like one) for something that should be a
>>>> routine operation for someone like me who is crawling (I mean hoping
>>>> to, because it is not working right now) thousands of websites and
>>>> maintaining multiple crawldbs (or their equivalent in Gora, webpage
>>>> tables) for different groups of websites.
>>>>
>>>>
>>>> That being said, it has nothing to do with the problem that I am
>>>> having. It is the same when I omit the -crawlId parameter (forcing it
>>>> to use the default name webpage), and more importantly it is new. I
>>>> haven't had this problem before; it just started happening 2 days ago
>>>> when I pulled the latest commits to the 2.x branch.
>>>>
>>>>
>>>> On 03/29/2013 09:50 AM, Lewis John Mcgibbney wrote:
>>>>
>>>>> Hi Kaveh,
>>>>> Firstly, as logged below, Gora attempts to associate your HBase table
>>>>> configuration with specified tables (from within
>>>>> gora-hbase-mapping.xml)
>>>>> however it seems that your case satisfies the condition "if
>>>>> (!tableName.equals(tableNameFromMapping))", meaning that the table
>>>>> name is not equal to the value for the table name attribute or that
>>>>> this value is null.
>>>>> This is allowed, but I am interested to find out what the mapping file
>>>>> looks like... the entire file is not required, just the <class
>>>>> name="value" snippet if this is possible.
>>>>> I am not using the gora-hbase module and haven't ever seen anyone come
>>>>> across this problem before.
>>>>> Thanks
>>>>> Lewis
>>>>>
>>>>> On Thursday, March 28, 2013, kaveh minooie <[email protected]> wrote:
>>>>>
>>>>>> 2013-03-28 11:06:25,158 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 't1_webpage' , assuming they are the same.
>>>>>
>>>>>
>>>> --
>>>> Kaveh Minooie
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>> --
>> Kaveh Minooie
>>
>
>
>
> --
> *Lewis*
>



-- 
*Lewis*
