Hello,

Yes, I mean GORA-94, with hbase-0.94.17 and avro-1.7.6.




I have investigated further, and it turns out that the culprit is not
inlinkedScoreData.clear().
I also found another issue in addition to the deletion of custom metadata.


For simplicity, let's consider only one seed URL, say mydomain.com, whose
page contains two <a> tags:


http://mydomain.com has <a href="http://mydomain.com">Home</a> and
<a href="http://mydomain.com/page1">Page1</a>


The same two <a> tags are in http://mydomain.com/page1, i.e.




http://mydomain.com/page1 has <a href="http://mydomain.com">Home</a> and
<a href="http://mydomain.com/page1">Page1</a>


When we run


bin/nutch inject seed 
bin/nutch generate -batchId 1
bin/nutch fetch 1
bin/nutch updatedb 1


mydomain.com is fetched, and after bin/nutch updatedb 1


http://mydomain.com/page1
appears as an outlink.


In the second round



bin/nutch generate -batchId 2
bin/nutch fetch 2





http://mydomain.com/page1 is fetched and parsed. However, in


bin/nutch updatedb 2


http://mydomain.com comes in as an outlink of http://mydomain.com/page1, and
DbUpdateReducer.java treats it as a new page.


So the first issue is that the custom metadata for http://mydomain.com is
deleted after bin/nutch updatedb 2.
The second issue is that the status of http://mydomain.com changes from
fetched to unfetched.
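
To make the two symptoms concrete, here is a minimal sketch of how an update
step that treats an already-fetched URL as new would produce both of them.
This is not Nutch's actual code: PageSketch, UpdateSketch, and the treatAsNew
flag are hypothetical names used only for illustration, and the real cause in
DbUpdateReducer.java is still to be confirmed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a stored row; NOT the Nutch WebPage class.
class PageSketch {
    String status;
    final Map<String, String> metadata = new HashMap<>();

    PageSketch(String status) {
        this.status = status;
    }
}

// Hypothetical update step; NOT the real DbUpdateReducer logic.
class UpdateSketch {
    // 'stored' is the existing row, or null if the URL was never seen.
    // 'treatAsNew' simulates the suspected problem: the existing row is
    // ignored when the URL arrives again as an outlink of another page.
    static PageSketch update(PageSketch stored, boolean treatAsNew) {
        if (stored == null || treatAsNew) {
            // A fresh row starts unfetched and carries no custom metadata.
            return new PageSketch("unfetched");
        }
        // Otherwise the existing status and metadata are kept.
        return stored;
    }

    public static void main(String[] args) {
        PageSketch home = new PageSketch("fetched");
        home.metadata.put("custom", "value");

        PageSketch after = update(home, true);
        // Prints "unfetched {}": both the status and the metadata are lost.
        System.out.println(after.status + " " + after.metadata);
    }
}
```

With treatAsNew set to false the same call returns the stored row unchanged,
which is the behaviour I would expect for a URL that has already been fetched.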


I will investigate further and post again.


Thanks.
Alex.




-----Original Message-----

From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Jun 18, 2014 7:30 am
Subject: Re: updatedb deletes all metadata except _csh_


Hi Alex,

On Tue, Jun 17, 2014 at 2:06 PM, <[email protected]> wrote:

>
> I am using nutch-2.x with GORA_97.


You mean GORA-94, the Avro upgrade?
With which gora- backend please?


> Further investigation shows that DbUpdateReducer
> calls
>  inlinkedScoreData.clear();
>

I see this on line ~72 of DbUpdateReducer.


>
> and it calls this function
>
>  public void readFields(DataInput in) throws IOException {
>

Can you please point me to where ScoreDatum#readFields is called?


>
> And metaData.clear(); line clears all metadata.
>

Yes this should result in an empty HashMap data structure.


>
> Why metaData.clear(); line is needed in this function?
>
>
It is poorly documented and this class has not been altered for some time, so
off the top of my head I have to say that I do not know why. According to the
Javadoc for Writable, implementations of readFields "...should attempt to
re-use storage in the existing object where possible", so I am not sure why
we clear the metadata from the HashMap structure. I would need to debug
this to understand.
If you can provide more context on where ScoreDatum#readFields is called,
then I can set a breakpoint and debug up to that point.
Thanks Alex
Lewis
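
One plausible reason for the metaData.clear() call, offered as an assumption
rather than a confirmed explanation: Hadoop commonly reuses a single Writable
instance across records, so readFields has to discard whatever the previous
record left in the object. The sketch below uses only java.io and a
hypothetical DatumSketch class (not the real ScoreDatum, and its wire format
is made up) to show what happens with and without the clear.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Writable with a metadata map; NOT ScoreDatum.
class DatumSketch {
    final Map<String, String> metaData = new HashMap<>();
    private final boolean clearOnRead;

    DatumSketch(boolean clearOnRead) {
        this.clearOnRead = clearOnRead;
    }

    // Made-up format: entry count, then key/value pairs as UTF strings.
    void write(DataOutput out) throws IOException {
        out.writeInt(metaData.size());
        for (Map.Entry<String, String> e : metaData.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    void readFields(DataInput in) throws IOException {
        if (clearOnRead) {
            metaData.clear(); // drop entries left over from the previous record
        }
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            metaData.put(key, value);
        }
    }

    // Serializes a datum holding a single metadata entry.
    static byte[] serialize(String key, String value) throws IOException {
        DatumSketch d = new DatumSketch(true);
        d.metaData.put(key, value);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        d.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    // Reads two different records into ONE reused instance.
    static Map<String, String> readTwoRecords(boolean clearOnRead) {
        try {
            DatumSketch reused = new DatumSketch(clearOnRead);
            reused.readFields(new DataInputStream(
                    new ByteArrayInputStream(serialize("a", "1"))));
            reused.readFields(new DataInputStream(
                    new ByteArrayInputStream(serialize("b", "2"))));
            return reused.metaData;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readTwoRecords(true));  // prints {b=2}
        System.out.println(readTwoRecords(false)); // stale "a" entry survives
    }
}
```

Note that this only concerns the in-memory object being deserialized into. It
would not by itself explain metadata disappearing from a row that is already
stored in the web table.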

 
