Hi Ferdy,

Thank you for the fast response. I'll try what you suggested and come back here 
if I run into another issue ;-)
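
A minimal sketch of the prefixed-key idea from the reply quoted below, to make the approach concrete. It is not Nutch code: a plain java.util.Map stands in for the WebPage metadata map (the real key and value types should be checked against the generated WebPage class), and the class name, method names and the "anchor:" prefix are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; none of these names come from Nutch or Gora.
class PrefixedMetadata {
    // Assumed prefix marking metadata entries that hold inlink anchor texts.
    static final String ANCHOR_PREFIX = "anchor:";

    // Stores one anchor text under a prefixed key. Using the anchor itself in
    // the key means a repeated anchor overwrites its own entry, so only
    // distinct anchors are kept.
    static void addAnchor(Map<String, ByteBuffer> metadata, String anchor) {
        byte[] bytes = anchor.getBytes(StandardCharsets.UTF_8);
        metadata.put(ANCHOR_PREFIX + anchor, ByteBuffer.wrap(bytes));
    }

    // Collects the anchor texts back, skipping entries with other prefixes.
    static List<String> anchors(Map<String, ByteBuffer> metadata) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, ByteBuffer> entry : metadata.entrySet()) {
            if (entry.getKey().startsWith(ANCHOR_PREFIX)) {
                // duplicate() so decoding does not move the stored buffer's position
                ByteBuffer copy = entry.getValue().duplicate();
                result.add(StandardCharsets.UTF_8.decode(copy).toString());
            }
        }
        return result;
    }
}
```

Keying by anchor text is just one possible choice; keying by inlink URL with the anchor as the value would work the same way if the source of each anchor matters.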

--
Tanguy

On May 30, 2013, at 11:14 AM, Ferdy Galema <[email protected]> wrote:

> Hi,
> 
> I would certainly not extend WebPage, since that would require a lot of work.
> (Plain Java subclassing won't work because it is a generated class; you'd
> have to modify the Avro schema, regenerate the classes, and so on.) Putting
> it in a separate table is also not great because of the duplication and
> separation of the data.
> 
> In my opinion the best way to add extra data is to use the Metadata field,
> since it is a freeform map already provided with WebPage. If you have
> several pieces of data, you can prefix the keys to indicate what data it
> is. You can write a separate Job (or extend/modify DbUpdaterJob) to work on
> this data.
> 
> 
> On Thu, May 30, 2013 at 11:01 AM, Tanguy Moal <[email protected]> wrote:
> 
>> Dear list,
>> I'd like to store additional data in the webpage rows (something like
>> all distinct anchor texts for each inlink).
>> 
>> I'm wondering what's the best way to do this, and would appreciate any
>> suggestion before digging in the wrong direction.
>> 
>> From what I could understand, I have at least two options, maybe more:
>> 
>> 1/ Write a custom job that would iterate over the webpage db and produce
>> the desired output in a dedicated db for my own use. This option seems to
>> lower the risk of messing up Nutch's internals; it works just like the
>> updatedb command, but produces a second db as a result. The only con of this
>> choice, from my point of view, is the need to run yet another full iteration
>> over the whole db to populate the required fields, and the duplication of
>> information in another db.
>> 
>> 2/ Extend the webpage structure with my additions, and extend the
>> DbUpdaterJob with customized DbUpdateMapper and DbUpdateReducer classes doing
>> what I want with the inlinks' anchor texts. I see several cons to this choice,
>> which may result from my misunderstanding of how Nutch works. Practically, I
>> think I can't simply extend the WebPage storage class, but would have to
>> copy and modify it, because of the way Gora persists things and so on.
>> 
>> 3/ A better idea?
>> 
>> I run Nutch 2.1 and rely on HBase for storage.
>> 
>> Thanks in advance for your insights.
>> 
>> Tanguy
> 
> 
> 
> 
> -- 
> *Ferdy Galema*
> Kalooga Development
