Hi,
Just a heads-up: if you do need to add to the nested Metadata
structure within WebPage.avsc, you can simply make your changes and
run the ant 'generate-gora-src' target from the build script. The
GoraCompiler will then compile everything in /src/gora to the path you
specify, adding whichever license header you specify (ASLv2 by default
now).
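
If you go with the Metadata map approach Ferdy describes below instead,
here is a minimal sketch of the key-prefix idea. It is only an
illustration under assumptions: a plain Map<String, ByteBuffer> stands in
for the WebPage metadata map (check the generated WebPage class for the
actual key and value types), and the AnchorMetadata class, the "anchors."
prefix and the method names are all hypothetical, not part of Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch: keeps anchor texts in a
// freeform metadata map under a common key prefix.
class AnchorMetadata {
    // Assumed prefix marking the entries that belong to this feature.
    static final String PREFIX = "anchors.";

    // Store one anchor text, keyed by the prefixed inlink URL.
    static void putAnchor(Map<String, ByteBuffer> metadata,
                          String inlinkUrl, String anchorText) {
        byte[] bytes = anchorText.getBytes(StandardCharsets.UTF_8);
        metadata.put(PREFIX + inlinkUrl, ByteBuffer.wrap(bytes));
    }

    // Collect the distinct anchor texts, ignoring unrelated metadata keys.
    static Set<String> distinctAnchors(Map<String, ByteBuffer> metadata) {
        Set<String> anchors = new TreeSet<>();
        for (Map.Entry<String, ByteBuffer> e : metadata.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                // duplicate() so decoding does not move the stored
                // buffer's position.
                ByteBuffer copy = e.getValue().duplicate();
                anchors.add(StandardCharsets.UTF_8.decode(copy).toString());
            }
        }
        return anchors;
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new LinkedHashMap<>();
        putAnchor(metadata, "http://a.example/", "Nutch home");
        putAnchor(metadata, "http://b.example/", "Nutch home");
        putAnchor(metadata, "http://c.example/", "crawler");
        // An unrelated entry that the prefix filter should skip.
        metadata.put("other.key", ByteBuffer.wrap(new byte[] {1}));
        System.out.println(distinctAnchors(metadata));
    }
}
```

In a real job the same put/filter logic would presumably run against the
page's own metadata map inside a separate Job or a modified DbUpdaterJob,
as Ferdy suggests.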

On Thursday, May 30, 2013, Tanguy Moal <[email protected]> wrote:
> Hi Ferdy,
>
> Thank you for the fast response. I'll try what you suggested and come
> back here if I face another issue ;-)
>
> --
> Tanguy
>
> On May 30, 2013, at 11:14 AM, Ferdy Galema <[email protected]> wrote:
>
>> Hi,
>>
>> I would certainly not extend WebPage since that will require a lot of
>> work.
>> (Simple Java extending won't work because it is a generated class. You'd
>> have to modify the Avro Schema, regenerate the classes and such). Putting
>> it in a separate table is also not great because of the duplication and
>> separation of the data.
>>
>> In my opinion the best way to add extra data is to use the Metadata
>> field, since it is a freeform map already provided with WebPage. If you
>> have several pieces of data, you can prefix the keys to indicate what
>> data it is. You can write a separate Job (or extend/modify DbUpdaterJob)
>> to work on this data.
>>
>>
>> On Thu, May 30, 2013 at 11:01 AM, Tanguy Moal <[email protected]> wrote:
>>
>>> Dear list,
>>> I'd like to store additional data in the webpage rows (something like
>>> all distinct anchor texts for each inlink).
>>>
>>> I'm wondering what's the best way to do this, and would appreciate any
>>> suggestion before digging in the wrong direction.
>>>
>>> From what I could understand, I have at least two options, maybe more:
>>>
>>> 1/ Write a custom job that would iterate over the webpage db and produce
>>> the desired output in a dedicated db for my own use. This option seems
>>> to lower the risk of messing up nutch's internals, just like the updatedb
>>> command does but producing a second db as a result. The only drawback of
>>> this choice from my point of view is the need to run yet another whole
>>> iteration over all the db to populate the required fields, and the
>>> duplication of information in another db.
>>>
>>> 2/ Extend the webpage structure with my additions, and extend the
>>> DbUpdaterJob with pimped DbUpdateMapper and DbUpdateReducer classes
>>> doing what I want with the inlinks' anchor texts. I see several drawbacks
>>> to this choice, which may result from my misunderstanding of how nutch
>>> works. Practically I think I can't simply extend the WebPage storage
>>> class, but I'd have to copy-and-modify it, because of the way gora
>>> persists things and so on.
>>>
>>> 3/ A better idea?
>>>
>>> I run nutch 2.1 and rely on hbase for storage.
>>>
>>> Thanks in advance for your insights.
>>>
>>> Tanguy
>>
>>
>>
>>
>> --
>> *Ferdy Galema*
>> Kalooga Development
>>
>
>

-- 
*Lewis*
