Hi,

I was planning to parse img tags from a page's content and put them in the metadata field of the WebPage storage class in Nutch 2.0, so that I can retrieve them later in the indexing step. However, the Parse class in Nutch 2.0 has no metadata member (although it does have outlinks), so this cannot be done the way it is in Nutch 1.x, where the parse class carries a metadata variable. One is restricted to the putToMetadata function of the WebPage class, which overwrites values: if I put two entries, img_alt:alt1 and img_alt:alt2, only the last value, img_alt:alt2, ends up in the metadata field.
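To make the overwrite concrete, here is a minimal sketch in plain Java. It is not Nutch code: a Map<String, ByteBuffer> stands in for the WebPage metadata field, and the class name, the helper names, and the tab separator are all made up for illustration. It shows the last-write-wins behaviour and one possible workaround, packing every alt value into a single entry and splitting it again at indexing time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the Map is only a stand-in for the WebPage metadata field.
class MultiValueMetadata {
    // Assumed separator. It must never occur inside an alt value.
    static final String SEPARATOR = "\t";

    // Plain put: a second call with the same key replaces the first value.
    static void put(Map<String, ByteBuffer> metadata, String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Workaround: join the new value onto whatever is already stored under the key.
    static void append(Map<String, ByteBuffer> metadata, String key, String value) {
        ByteBuffer existing = metadata.get(key);
        String joined = (existing == null) ? value : decode(existing) + SEPARATOR + value;
        put(metadata, key, joined);
    }

    // Split the packed entry back into its individual values.
    static List<String> values(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer packed = metadata.get(key);
        if (packed == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(decode(packed).split(SEPARATOR, -1)));
    }

    // Decode a copy of the buffer so the stored buffer's position is left untouched.
    static String decode(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> metadata = new HashMap<>();
        append(metadata, "img_alt", "alt1");
        append(metadata, "img_alt", "alt2");
        System.out.println(values(metadata, "img_alt"));
    }
}
```

Whether an indexing filter could then add each split value to a multivalued Solr field is exactly what I am unsure about. The sketch also breaks if an alt text itself contains a tab.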
So, my question is: how can img tag alt values be indexed in Nutch 2.0, given that crawled pages will often contain more than one img tag? Do I need to parse them and store them in one of the fields of the WebPage storage class, or is this step not needed?

Thanks.
Alex.

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Tue, Jul 3, 2012 5:08 am
Subject: Re: parse and solrindex in nutch-2.0

Hi,

On Mon, Jul 2, 2012 at 8:21 PM, <[email protected]> wrote:
> Regarding the metadata, what would be a proper way of parsing and indexing multivalued tags in nutch-2.0 then?
>

Assuming you've taken a look into the schema, 'some' multivalued fields are permitted out of the box. Are you having problems obtaining multiple values for some fields within the documents you're trying to parse + index?

Lewis

