Hi, I would certainly not extend WebPage, since that would require a lot of work. (Simply extending it in Java won't work because it is a generated class: you would have to modify the Avro schema, regenerate the classes, and so on.) Putting the data in a separate table is also not great because of the duplication and separation of the data.
In my opinion the best way to add extra data is to use the Metadata field, since it is a freeform map already provided with WebPage. If you have several pieces of data, you can prefix the keys to indicate what data it is. You can write a separate Job (or extend/modify DbUpdaterJob) to work on this data.

On Thu, May 30, 2013 at 11:01 AM, Tanguy Moal <[email protected]> wrote:

> Dear list,
> I'd like to store additional data into the webpages rows (something like
> all distinct anchor texts for each inlink)
>
> I'm wondering what's the best way to do this, and would appreciate any
> suggestion before digging in the wrong direction.
>
> From what I could understand, I have at least two options, may be more :
>
> 1/ Write a custom job that would iterate over the webpage db and produce
> the desired output in a dedicated db for my own use. This option seems to
> lower the risk of messing up nutch's internals, just like the updatedb
> command does but producing a second db as a result. The only CONS of this
> choice from my point of view is the need to run yet another whole iteration
> over all the db to populate the required fields, and the duplication of
> information in an other db.
>
> 2/ Extend the webpage structure with my additions, and extend the
> DBUpdaterJob with pimped DBUpdateMapper and DBUpdateReducer classes doing
> what I want with inlinks's anchor texts. I see several CONS to this choice,
> which may result from my misunderstanding of how nutch works. Practically I
> think I can't simply extend the WebPage storage class, but I'd have to
> copy-and-modify it, because of the way gora persists things and so on.
>
> 3/ Better idea ?
>
> I run nutch 2.1 and rely on hbase for storage.
>
> Thanks in advance for your lights.
>
> Tanguy

--
*Ferdy Galema*
Kalooga Development
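
To make the key-prefixing suggestion concrete, here is a minimal sketch. It is not Nutch code: `PrefixedMetadata`, the `anchor:` prefix, and both method names are hypothetical, and a plain `Map<String, ByteBuffer>` stands in for the WebPage metadata map so that the example runs without Nutch or Gora on the classpath. As far as I know the generated WebPage class in Nutch 2.x keys its metadata with Avro `Utf8` objects rather than `String`, so the key handling would need adapting.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: namespaces custom entries in a
// free-form metadata map by prefixing their keys.
final class PrefixedMetadata {

    // Assumed prefix for anchor-text entries; any string that does not
    // clash with keys written by other jobs or plugins would do.
    static final String ANCHOR_PREFIX = "anchor:";

    private PrefixedMetadata() {
    }

    // Stores a UTF-8 encoded value under the key (prefix + name).
    static void put(Map<String, ByteBuffer> metadata, String prefix,
                    String name, String value) {
        metadata.put(prefix + name,
                ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns every entry whose key starts with the prefix, keyed by the
    // name with the prefix stripped, sorted by name.
    static Map<String, String> getAll(Map<String, ByteBuffer> metadata,
                                      String prefix) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, ByteBuffer> entry : metadata.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                // duplicate() keeps the stored buffer's position unchanged.
                String value = StandardCharsets.UTF_8
                        .decode(entry.getValue().duplicate()).toString();
                result.put(entry.getKey().substring(prefix.length()), value);
            }
        }
        return result;
    }
}
```

A custom job (or a modified DbUpdaterJob reducer) could call `put` once per distinct anchor text, and any later reader could call `getAll` with the same prefix to recover only those entries while ignoring metadata written by others.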

