Hi Kiran,

On Wed, Jan 30, 2013 at 11:10 AM, kiran chitturi <[email protected]> wrote:
> I have checked the database after the dbupdate job is ran and i could see
> only markers, signature and fetch fields.
>

Which Gora artifacts are you using? We've recently fixed a bug in
gora-cassandra [0] where the state for map values was not being correctly
recorded; this prevented us from writing the values during the dbUpdaterJob.
I was not aware (and no-one flagged it up during either the Gora 0.2.1 or
Nutch 2.1 RC testing) that there was a problem with similar fields being
written to HBase.

>
> The initial seed which was crawled and parsed, has only outlinks. I notice
> one of the outlink is actually the inlink.
>

Can you reproduce? Is there any way of being more verbose here? This is
starting to sound like a bug. Unfortunately, I am not 100% sure about the
HBase module either!

>
> Aren't inlinks supposed to be saved during the dbUpdatedJob ?

Yes, specifically in the dbUpdaterReducerJob [1].

> When i tried
> to debug, i could see in eclipse and in the dbUpdateReducer job that the
> inlinks are being saved to the page object along with fetch fields, markers
> but i did not understood where the data is going from there.
>

We need to narrow this down and document it fully then. I cannot look into
this for a couple of hours, Kiran.

Lewis

[0] https://issues.apache.org/jira/browse/GORA-182
[1] http://wiki.apache.org/nutch/Nutch2Crawling#DbUpdate
[2] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/util/WebPageWritable.java

