Hi Ferdy,

Thank you for the fast response. I'll try what you suggested and come back here if I face another issue ;-)
--
Tanguy

On May 30, 2013, at 11:14 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> I would certainly not extend WebPage since that will require a lot of work.
> (Simple Java extending won't work because it is a generated class. You'd
> have to modify the Avro Schema, regenerate the classes and such). Putting
> it in a separate table is also not great because of the duplication and
> separation of the data.
>
> In my opinion the best way to add extra data is to use the Metadata field,
> since it is a freeform map already provided with WebPage. If you have
> several pieces of data, you can prefix the keys to indicate what data it
> is. You can write a separate Job (or extend/modify DbUpdaterJob) to work on
> this data.
>
>
> On Thu, May 30, 2013 at 11:01 AM, Tanguy Moal <[email protected]> wrote:
>
>> Dear list,
>> I'd like to store additional data into the webpages rows (something like
>> all distinct anchor texts for each inlink)
>>
>> I'm wondering what's the best way to do this, and would appreciate any
>> suggestion before digging in the wrong direction.
>>
>> From what I could understand, I have at least two options, may be more :
>>
>> 1/ Write a custom job that would iterate over the webpage db and produce
>> the desired output in a dedicated db for my own use. This option seems to
>> lower the risk of messing up nutch's internals, just like the updatedb
>> command does but producing a second db as a result. The only CONS of this
>> choice from my point of view is the need to run yet another whole iteration
>> over all the db to populate the required fields, and the duplication of
>> information in an other db.
>>
>> 2/ Extend the webpage structure with my additions, and extend the
>> DBUpdaterJob with pimped DBUpdateMapper and DBUpdateReducer classes doing
>> what I want with inlinks's anchor texts. I see several CONS to this choice,
>> which may result from my misunderstanding of how nutch works. Practically I
>> think I can't simply extend the WebPage storage class, but I'd have to
>> copy-and-modify it, because of the way gora persists things and so on.
>>
>> 3/ Better idea ?
>>
>> I run nutch 2.1 and rely on hbase for storage.
>>
>> Thanks in advance for your lights.
>>
>> Tanguy
>
> --
> *Ferdy Galema*
> Kalooga Development
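
For illustration, here is a minimal sketch of the Metadata approach Ferdy describes above: storing the distinct anchor texts under a prefixed key in a freeform map. It deliberately does not use the Nutch classes. A plain Map<String, ByteBuffer> stands in for the WebPage metadata map, and the key and value types of the real field should be checked against the generated WebPage class in your Nutch version. The class name, the "anchors:" prefix, the "inlinks" key and the newline separator are all assumptions made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch. The Map below is only a stand-in
// for the WebPage metadata map.
class AnchorMetadataSketch {
    // Assumed key layout: a prefix saying what kind of data this is.
    static final String KEY = "anchors:" + "inlinks";

    // Stores the distinct anchor texts as one UTF-8 value, one per line.
    // Assumes anchor texts contain no newline characters.
    static void putAnchors(Map<String, ByteBuffer> metadata, Collection<String> anchors) {
        String joined = String.join("\n", new TreeSet<>(anchors)); // dedupe and sort
        metadata.put(KEY, ByteBuffer.wrap(joined.getBytes(StandardCharsets.UTF_8)));
    }

    // Reads the anchor texts back; returns an empty list if the key is absent.
    static List<String> getAnchors(Map<String, ByteBuffer> metadata) {
        ByteBuffer stored = metadata.get(KEY);
        if (stored == null) {
            return Collections.emptyList();
        }
        ByteBuffer view = stored.duplicate(); // leave the stored buffer's position untouched
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        String joined = new String(bytes, StandardCharsets.UTF_8);
        if (joined.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(joined.split("\n"));
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new HashMap<>();
        putAnchors(metadata, Arrays.asList("home", "docs", "home"));
        System.out.println(getAnchors(metadata));
    }
}
```

In a real job (a separate one, or a modified DbUpdaterJob as suggested above), the same encode and decode steps would be applied to the page's metadata in the reducer once all inlinks for a row are available. Packing everything into one value keeps the number of metadata entries small, at the cost of rewriting the whole value on each update.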

