Hi Matt,

The urlmeta plugin does NOT extract metadata from HTML pages. The 'meta'
in its name refers to crawldb metadata, i.e. metadata injected alongside
the seed URLs and propagated to their outlinks.

You need to use the patch in https://issues.apache.org/jira/browse/NUTCH-809
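
For contrast, here is a minimal sketch of what urlmeta is meant for, based on
the urlmeta.tags description quoted below. The tag name "category", the URL,
and the value are hypothetical placeholders, and the seed-line format
(tab-separated key=value pairs after the URL) is my understanding of the
injector behaviour from NUTCH-655, so check it against your Nutch version:

```xml
<!-- nutch-site.xml: urlmeta propagates metadata injected with the seeds. -->
<property>
  <name>urlmeta.tags</name>
  <!-- "category" is a hypothetical tag name, not taken from this thread. -->
  <value>category</value>
</property>

<!--
  Matching seed list entry (plain text, not XML). The URL is followed by
  tab-separated key=value pairs; both URL and value are placeholders.

  http://intranet.example.com/<TAB>category=banking
-->
```

With a setup like this the value comes from the seed list, not from the
page's <meta> elements, which is why the Keywords tag never reaches the index.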

HTH

Julien


On 26 September 2011 21:18, Wilson, Matt <[email protected]> wrote:

> Also,
>
> In case this helps: I removed the Keywords field from the Solr schema to
> see whether SolrIndexer would raise an error when it runs, and it does not.
> This has led me to believe that Nutch is either not indexing the meta
> content or not sending the update to Solr when SolrIndexer runs.
>
> Matt Wilson
>
> -----Original Message-----
> From: lewis john mcgibbney [mailto:[email protected]]
> Sent: Monday, September 26, 2011 3:04 PM
> To: [email protected]
> Subject: Re: Indexing specific metadata tags with urlmeta
>
> Hi Matt,
>
> Try changing
>
> <field name="keywords" type="string" stored="true" indexed="true"
> multiValued="true"/>
>
> to
>
> <field name="Keywords" type="string" stored="true" indexed="true"
> multiValued="true"/> as per your metadata tags.
>
> We also have a configuration option in nutch-site.xml which you could check
> out.
>
> <property>
>  <name>urlmeta.tags</name>
>  <value></value>
>  <description>
>    To be used in conjunction with features introduced in NUTCH-655,
>    which allows for custom metatags to be injected alongside your crawl
>    URLs. Specifying those custom tags here will allow for their
>    propagation into a page's outlinks, as well as allow for them to be
>    included as part of an index.
>    Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the
>    tags with white-space at their boundaries, if you are using anything
>    earlier than Hadoop-0.21.
>  </description>
> </property>
>
> On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt
> <[email protected]> wrote:
>
> > I am attempting to crawl a corporate intranet site and allow it to be
> > searched in Solr.  As part of the requirements I have to be able to
> > index certain metadata tags as their own field in Solr (for faceted
> > search).  For example, the pages being crawled contain the following
> > meta tag:
> >
> > <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> > Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529
> > Plans" />
> >
> > I have updated the nutch-site.xml with the following:
> >
> > <property>
> >    <name>plugin.includes</name>
> >    <value>urlmeta|protocol-httpclient|... </value>
> > </property>
> > <property>
> >    <name>urlmeta.tags</name>
> >    <value>keywords</value>
> > </property>
> >
> > I have updated the solr schema.xml with the following addition:
> >
> > <field name="keywords" type="string" stored="true" indexed="true"
> > multiValued="true"/>
> >
> > I can see that the field has been created in Solr via the admin
> > interface.  I also see in hadoop.log that Nutch is loading the urlmeta
> > plugin and adding the index filters etc.  The problem is that Nutch does
> > not appear to be indexing the keywords field.  All of the pages crawled
> > have the tag present and I am receiving no errors in the Nutch log.  I am
> > unsure as to what I am missing.  This seems to be pretty straightforward;
> > however, I must be misunderstanding the urlmeta plugin or missing
> > something in the configuration.
> >
>
>
>
> --
> *Lewis*
>
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
