Hi,
I tried to sort this out locally this morning.
I can almost replicate this behaviour with gora-cassandra, and it looks most
likely that the patches applied in
* NUTCH-1533 - NUTCH-1532 Implement getPrevModifiedTime(),
setPrevModifiedTime(), getBatchId() and setBatchId() accessors in
o.a.n.storage.WebPage, and
* NUTCH-1532 - Replace 'segment' mapping field with batchId,
are not backwards compatible, because some URLs already in the web
database do not have a value for batchId.
This is of course a major problem.
I opened NUTCH-1551 [0] and submitted a patch to make WebTableReader
backwards compatible with the above patches. Please try out the patch if
you can and comment so that I can commit it.

We have a couple of options here.
1) Revert both of the above patches until we have a fix.
2) Get a fix in now and commit it.
What do you guys want to do?

I have a question about whether we can dynamically add fields to
existing database entries by injecting them.
Say, for example, you inject URLs without the batchId field in your mapping
file, then add the field and inject some more URLs... will the field be
added to your database? If so, then why are we getting the NPE?
There must be some other location in the Nutch code where an attempt is
made to obtain the batchId for some given key... it cannot be obtained
and we receive the NPE.
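
To illustrate the kind of failure I mean, here is a minimal sketch in plain
Java. It is not Nutch or Gora code: the class BatchIdSketch, its methods and
the map-based "row" are all made up for illustration, and I am only assuming
that somewhere a value for batchId is dereferenced without a null check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class BatchIdSketch {

    // Hypothetical stand-in for a stored row: field name -> value.
    // Rows written before the batchId change simply have no "batchId" key.
    public static Optional<String> readBatchId(Map<String, String> row) {
        return Optional.ofNullable(row.get("batchId"));
    }

    // Unsafe variant: assumes the field is always present.
    public static int unsafeBatchIdLength(Map<String, String> row) {
        // row.get(...) returns null for old rows, so this throws an NPE.
        return row.get("batchId").length();
    }

    public static void main(String[] args) {
        Map<String, String> oldRow = new HashMap<>();
        oldRow.put("baseUrl", "http://example.org/");

        Map<String, String> newRow = new HashMap<>(oldRow);
        newRow.put("batchId", "example-batch");

        // The null-safe read works for both old and new rows.
        System.out.println(readBatchId(oldRow).orElse("<none>"));
        System.out.println(readBatchId(newRow).orElse("<none>"));

        // The unsafe read fails only for rows written before the change.
        try {
            unsafeBatchIdLength(oldRow);
        } catch (NullPointerException e) {
            System.out.println("NPE on row without batchId");
        }
    }
}
```

If something like the unsafe variant exists in the read path, a backwards
compatible fix would need to treat a missing batchId as "no value" rather
than dereference it.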

[0] https://issues.apache.org/jira/browse/NUTCH-1551


On Fri, Mar 29, 2013 at 5:05 PM, kaveh minooie <[email protected]> wrote:

> I use git and I fetch from GitHub
> (https://github.com/apache/nutch.git).
> Currently I am on this commit:
>
> commit 4bb01d6b908dc230c8be89d398b03a86581ec42b
> Author: lufeng <[email protected]>
> Date:   Thu Mar 28 13:09:09 2013 +0000
>
>     NUTCH-1547 BasicIndexingFilter - Problem to index full title
>
>     git-svn-id: https://svn.apache.org/repos/asf/nutch/branches/2.x@1462079 13f79535-47bb-0310-9956-ffa450edef68
>
>
> before I was on this commit :
>
>
> commit f02dcf62566583551426c08bd388080e5b2bc93e
>
> >  f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml
>
>
> On 03/29/2013 04:35 PM, [email protected] wrote:
>
>> Yes, with hbase. Here is the error
>>
>> 13/03/29 16:33:29 INFO zookeeper.ZooKeeper: Session: 0x13d7770d67d005f closed
>> 13/03/29 16:33:29 ERROR crawl.WebTableReader: WebTableReader: java.lang.NullPointerException
>>          at org.apache.gora.hbase.store.HBaseStore.addFields(HBaseStore.java:398)
>>          at org.apache.gora.hbase.store.HBaseStore.execute(HBaseStore.java:360)
>>          at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:234)
>>          at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:476)
>>          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>          at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>          at java.lang.reflect.Method.invoke(Method.java:597)
>>          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>>
>> If I revert to previous release it works fine.
>>
>> Thanks.
>> Alex.
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Lewis John Mcgibbney <[email protected]>
>> To: user <[email protected]>
>> Sent: Fri, Mar 29, 2013 4:30 pm
>> Subject: Re: error using generate in 2.x
>>
>>
>> Hi Alex,
>> With HBase also?
>> There 'was' a bug in gora-cassandra module for this command + params
>> however I thought it had been addressed and therefore resolved it.
>> Lewis
>>
>>
>> On Fri, Mar 29, 2013 at 4:00 PM, <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> It seems that trunk has a few bugs. I found out that readdb -url urlname
>>> also gives errors.
>>>
>>> Thanks.
>>> Alex.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: kaveh minooie <[email protected]>
>>> To: user <[email protected]>
>>> Sent: Fri, Mar 29, 2013 1:53 pm
>>> Subject: Re: error using generate in 2.x
>>>
>>>
>>> Hi Lewis,
>>>
>>> The mapping file that I am using is the one that comes with Nutch, and I
>>> haven't touched it. This message in the log is caused by using
>>> -crawlId on the command line. For example, this log was the result of
>>> this command:
>>>
>>> bin/nutch generate -topN 1000 -crawlId t1
>>>
>>> which causes Nutch (or I guess technically Gora) to use the table
>>> name 't1_webpage'. Though I have to say that I don't understand the
>>> rationale behind the code generating a warning like this (I know
>>> it is not actually a warning, just that the way the message is
>>> phrased makes it look like one) for something that should be a
>>> routine operation for someone like me, who is crawling (I mean hoping
>>> to, because it is not working right now) thousands of websites and
>>> maintains multiple crawldbs (or the equivalent in Gora, webpage tables)
>>> for different groups of websites.
>>>
>>>
>>> That being said, it has nothing to do with the problem that I am
>>> having. It is the same when I omit the -crawlId parameter (forcing it
>>> to use the default name webpage), and more importantly it is new. I
>>> haven't had this problem before; it just started happening 2 days ago
>>> when I pulled the latest commits to the 2.x branch.
>>>
>>>
>>> On 03/29/2013 09:50 AM, Lewis John Mcgibbney wrote:
>>>
>>>> Hi Kaveh,
>>>> Firstly, as logged below, Gora attempts to associate your HBase table
>>>> configuration with specified tables (from within gora-hbase-mapping.xml)
>>>> however it seems that your case satisfies the condition "if
>>>> (!tableName.equals(tableNameFromMapping))", meaning that the table name is
>>>> not equal to the value for the table name attribute or that this value is
>>>> null.
>>>> This is allowed, but I am interested to find out what the mapping file
>>>> looks like... the entire file is not required, just the <class name="value"
>>>> snippet if this is possible.
>>>> I am not using the gora-hbase module and haven't ever seen anyone come
>>>> across this problem before.
>>>> Thanks
>>>> Lewis
>>>>
>>>> On Thursday, March 28, 2013, kaveh minooie <[email protected]> wrote:
>>>>
>>>>> 2013-03-28 11:06:25,158 INFO  store.HBaseStore - Keyclass and nameclass
>>>>> match but mismatching table names  mappingfile schema is 'webpage' vs
>>>>> actual schema 't1_webpage' , assuming they are the same.
>>>>
>>>>
>>> --
>>> Kaveh Minooie
>>>
>>>
>>>
>>>
>>
>>
> --
> Kaveh Minooie
>



-- 
*Lewis*
