Hi,

I was planning to parse img tags from a URL's content and put them in the
metadata field of the WebPage storage class in Nutch 2.0, to retrieve them
later in the indexing step.
However, since there is no metadata-type variable in the Parse class (compare
with outlinks), this cannot be done in Nutch 2.0 (compare with the Parse class
in Nutch 1.x, which has a metadata-type variable). One is restricted to the
putToMetadata function of the WebPage class, which overwrites values: if I try
to put two metadata entries, img_alt:alt1 and img_alt:alt2, I get only the last
value, img_alt:alt2, in the metadata field.
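
To illustrate the overwrite described above, and one possible workaround, here
is a minimal sketch. It does not use the Nutch API: a plain
Map<String, ByteBuffer> stands in for the WebPage metadata field, and the class
name ImgAltMetadata, the helpers join and split, and the tab separator are all
hypothetical. The idea is to pack every alt value into a single metadata value
at parse time and unpack it again at indexing time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ImgAltMetadata {
    // Hypothetical separator; tabs inside alt text are replaced by spaces
    // so the separator stays unambiguous.
    static final String SEP = "\t";

    // Pack all alt values into one value for a single metadata key.
    static ByteBuffer join(List<String> alts) {
        List<String> cleaned = new ArrayList<>();
        for (String alt : alts) {
            cleaned.add(alt.replace(SEP, " "));
        }
        String joined = String.join(SEP, cleaned);
        return ByteBuffer.wrap(joined.getBytes(StandardCharsets.UTF_8));
    }

    // Unpack the stored value back into individual alt values.
    static List<String> split(ByteBuffer value) {
        // duplicate() leaves the position of the stored buffer untouched
        String joined = StandardCharsets.UTF_8.decode(value.duplicate()).toString();
        if (joined.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(joined.split(SEP, -1)));
    }

    public static void main(String[] args) {
        // Stand-in for the metadata field; NOT the real WebPage class.
        Map<String, ByteBuffer> metadata = new HashMap<>();

        // Two puts under the same key: the second replaces the first.
        metadata.put("img_alt", join(Arrays.asList("alt1")));
        metadata.put("img_alt", join(Arrays.asList("alt2")));
        System.out.println(split(metadata.get("img_alt"))); // prints [alt2]

        // One put carrying both values keeps them.
        metadata.put("img_alt", join(Arrays.asList("alt1", "alt2")));
        System.out.println(split(metadata.get("img_alt"))); // prints [alt1, alt2]
    }
}
```

If something like this works, an indexing filter could presumably split the
value and add each piece to a multivalued field, but I have not verified that
against Nutch 2.0.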

So my question is: how can img tag alt values be indexed in Nutch 2.0, given
that there is more than one img tag across the crawled URLs?
Do I need to parse them and store them in one of the fields of the WebPage
storage class, or is this step not needed?

Thanks.
Alex.



-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Tue, Jul 3, 2012 5:08 am
Subject: Re: parse and solrindex in nutch-2.0


Hi,

On Mon, Jul 2, 2012 at 8:21 PM,  <[email protected]> wrote:

> Regarding the metadata, what would be a proper way of parsing and indexing
> multivalued tags in nutch-2.0 then?
>

Assuming you've taken a look into the schema, 'some' multivalued fields
are permitted out of the box. Are you having problems obtaining
multiple values for some fields within the documents you're trying to
parse + index?

Lewis
