Dear list,
I'd like to store additional data in the webpage rows (for example, all 
distinct anchor texts for each inlink).

I'm wondering what's the best way to do this, and would appreciate any 
suggestion before digging in the wrong direction.

From what I understand, I have at least two options, maybe more:

1/ Write a custom job that iterates over the webpage db and writes the 
desired output to a dedicated db for my own use, much like the updatedb command 
does but producing a second db as a result. This option seems to lower the risk 
of messing up Nutch's internals. The only drawbacks I can see are the need to 
run yet another full pass over the whole db to populate the required fields, 
and the duplication of information in another db.

2/ Extend the WebPage structure with my additions, and extend DbUpdaterJob 
with customized DbUpdateMapper and DbUpdateReducer classes that do what I want 
with the inlinks' anchor texts. I see several drawbacks to this choice, which 
may stem from my misunderstanding of how Nutch works. In practice, I don't 
think I can simply subclass the WebPage storage class; I would have to copy and 
modify it, because of the way Gora persists things.

3/ A better idea?
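
To make the goal concrete, here is a minimal sketch in plain Java of the 
aggregation I have in mind, whichever option ends up carrying it. It does not 
use any Nutch or Gora API: the class name, the method name and the shape of the 
input (a map from source URL to anchor text for one page) are assumptions made 
for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DistinctAnchors {

    /**
     * Collects the distinct, non-empty anchor texts of one page.
     *
     * @param inlinks hypothetical map of source URL -> anchor text
     * @return the distinct anchor texts, trimmed and sorted
     */
    public static Set<String> distinctAnchors(Map<String, String> inlinks) {
        Set<String> anchors = new TreeSet<>();
        for (String anchor : inlinks.values()) {
            if (anchor == null) {
                continue; // inlink without anchor text
            }
            String trimmed = anchor.trim();
            if (!trimmed.isEmpty()) {
                anchors.add(trimmed); // the set drops duplicates
            }
        }
        return anchors;
    }

    public static void main(String[] args) {
        // Made-up sample data: three inlinks, two of them with the same anchor.
        Map<String, String> inlinks = new LinkedHashMap<>();
        inlinks.put("http://a.example/", "Nutch home");
        inlinks.put("http://b.example/", "Nutch home");
        inlinks.put("http://c.example/", "Apache Nutch");
        System.out.println(distinctAnchors(inlinks));
    }
}
```

In option 1 this logic would sit in the custom job, in option 2 in the 
customized reducer; the question is only where it belongs.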

I run Nutch 2.1 and rely on HBase for storage.

Thanks in advance for any insights.

Tanguy
