Dear list,

I'd like to store additional data in the webpage rows (something like all distinct anchor texts for each inlink).

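To make the goal concrete, here is a minimal sketch of the aggregation I have in mind. It assumes the inlinks of a page are available as a map from source URL to anchor text; the class and method names are hypothetical and not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Nutch class.
public class DistinctAnchors {

    // Collect the distinct, non-empty anchor texts found in a page's inlinks.
    // `inlinks` maps source URL -> anchor text (assumed input shape).
    public static Set<String> distinctAnchors(Map<String, String> inlinks) {
        Set<String> anchors = new TreeSet<String>();
        for (String anchor : inlinks.values()) {
            if (anchor == null) {
                continue; // an inlink may carry no anchor text
            }
            String trimmed = anchor.trim();
            if (!trimmed.isEmpty()) {
                anchors.add(trimmed);
            }
        }
        return anchors;
    }

    public static void main(String[] args) {
        // Made-up sample data, for illustration only.
        Map<String, String> inlinks = new LinkedHashMap<String, String>();
        inlinks.put("http://a.example/", "Nutch home");
        inlinks.put("http://b.example/", "Nutch home");
        inlinks.put("http://c.example/", " crawler ");
        inlinks.put("http://d.example/", "");
        System.out.println(distinctAnchors(inlinks));
    }
}
```

Whichever option is chosen below, this per-page step would be the same; only where the result gets stored differs.
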
I'm wondering what the best way to do this is, and would appreciate any suggestions before I start digging in the wrong direction. From what I understand, I have at least two options, maybe more:

1/ Write a custom job that iterates over the webpage db and writes the desired output to a dedicated db for my own use. It would work much like the updatedb command, but produce a second db as a result, which seems to lower the risk of messing up Nutch's internals. The only downsides I see are the need to run yet another full pass over the whole db to populate the required fields, and the duplication of information in another db.

2/ Extend the WebPage structure with my additions, and extend the DbUpdaterJob with customized DbUpdateMapper and DbUpdateReducer classes that do what I want with the inlinks' anchor texts. I see several downsides to this choice, which may result from my misunderstanding of how Nutch works. In practice, I think I can't simply extend the WebPage storage class; I'd have to copy and modify it, because of the way Gora persists things.

3/ A better idea?

I run Nutch 2.1 and rely on HBase for storage.

Thanks in advance for your insights.

Tanguy

