Hello,
I have gora_94 with hbase-0.94.17 and avro-1.7.6.

I have investigated further, and it turns out that the culprit is not inlinkedScoreData.clear(). I also found another issue in addition to the deletion of custom metadata.

For simplicity, let's consider only one seed URL, say mydomain.com, whose page has two <a> tags in it:

http://mydomain.com has
<a href="http://mydomain.com">Home</a> and <a href="http://mydomain.com/page1">Page1</a>

The same <a> tags are in http://mydomain.com/page1, i.e.

http://mydomain.com/page1 has
<a href="http://mydomain.com">Home</a> and <a href="http://mydomain.com/page1">Page1</a>

When we do

bin/nutch inject seed
bin/nutch generate -batchId 1
bin/nutch fetch 1
bin/nutch updatedb 1

mydomain.com is fetched, and after bin/nutch updatedb 1, http://mydomain.com/page1 comes in as an outlink.

In the second round

bin/nutch generate -batchId 2
bin/nutch fetch 2

http://mydomain.com/page1 is fetched and parsed. However, in bin/nutch updatedb 2, http://mydomain.com comes in as an outlink of http://mydomain.com/page1 and is treated as a new page by DbUpdateReducer.java.

So the first issue is that the custom metadata for http://mydomain.com is deleted after bin/nutch updatedb 2. The second issue is that the status of http://mydomain.com is changed from fetched to unfetched.

I will investigate further and post again.

Thanks.
Alex.

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Wed, Jun 18, 2014 7:30 am
Subject: Re: updatedb deletes all metadata except _csh_

Hi Alex,

On Tue, Jun 17, 2014 at 2:06 PM, <[email protected]> wrote:

> I am using nutch-2.x with GORA_97.

You mean GORA-94, the Avro upgrade? With which gora- backend please?
> Further investigation shows that DbUpdateReducer calls
> inlinkedScoreData.clear();

I see this on line ~72 of DbUpdateReducer.

> and it calls this function
>
> public void readFields(DataInput in) throws IOException {

Can you please point me to where ScoreDatum#readFields is called?

> And metaData.clear(); line clears all metadata.

Yes, this should result in an empty HashMap data structure.

> Why metaData.clear(); line is needed in this function?

It is poorly documented, and this class has not been altered for some time, so off the top of my head I have to say that I do not know why. Based on the Javadoc for Writable, an @Override of readFields "...should attempt to re-use storage in the existing object where possible.", so I am not sure why we clear the metadata from the HashMap structure. I would need to debug this to understand. If you can provide more context on where ScoreDatum#readFields is called, then I can set a breakpoint up to that point.

Thanks Alex
Lewis
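
On the metaData.clear() question above: one likely reason is that Hadoop reuses Writable instances when it deserializes successive records, so a readFields that does not reset its map would carry entries over from the previous record. The sketch below only illustrates that general pattern using plain java.io. ReadFieldsDemo, MetaHolder, and the clearOnRead flag are hypothetical names made up for this example; this is not the actual ScoreDatum code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; MetaHolder only mimics the shape of a Writable
// that carries a metadata map. It is NOT Nutch's ScoreDatum.
class ReadFieldsDemo {

  static class MetaHolder {
    final Map<String, String> metaData = new HashMap<>();
    private final boolean clearOnRead; // assumption: a switch added for the demo

    MetaHolder(boolean clearOnRead) {
      this.clearOnRead = clearOnRead;
    }

    // Writes the entry count followed by each key/value pair.
    void write(DataOutput out) throws IOException {
      out.writeInt(metaData.size());
      for (Map.Entry<String, String> e : metaData.entrySet()) {
        out.writeUTF(e.getKey());
        out.writeUTF(e.getValue());
      }
    }

    // Reads one record into this (possibly reused) instance.
    void readFields(DataInput in) throws IOException {
      if (clearOnRead) {
        metaData.clear(); // drop whatever the previous record left behind
      }
      int n = in.readInt();
      for (int i = 0; i < n; i++) {
        String key = in.readUTF();
        String value = in.readUTF();
        metaData.put(key, value);
      }
    }
  }

  // Serializes a record holding a single metadata entry.
  static byte[] serialize(String key, String value) {
    try {
      MetaHolder h = new MetaHolder(true);
      h.metaData.put(key, value);
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      h.write(new DataOutputStream(bytes));
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Deserializes two different records into ONE reused instance and
  // returns the map as it looks after the second read.
  static Map<String, String> readTwoRecords(boolean clearOnRead) {
    try {
      MetaHolder reused = new MetaHolder(clearOnRead);
      reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize("a", "1"))));
      reused.readFields(new DataInputStream(new ByteArrayInputStream(serialize("b", "2"))));
      return reused.metaData;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("with clear:    " + readTwoRecords(true));
    System.out.println("without clear: " + readTwoRecords(false));
  }
}
```

With the clear in place, the reused object holds only the second record's entry; without it, the key from the first record is still present after the second read. If that is the reason for the clear() here, it would not by itself explain metadata disappearing from the stored page, which fits the observation that the culprit lies elsewhere.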

