Hi Matt,

The plugin urlmeta does NOT extract the metadata from HTML pages. The
'meta' in its name means 'crawldb metadata'.
You need to use the patch in https://issues.apache.org/jira/browse/NUTCH-809

HTH

Julien

On 26 September 2011 21:18, Wilson, Matt <[email protected]> wrote:

> Also,
>
> In case this helps. I removed the Keywords field from the solr schema to
> see if it would generate an error when the SolrIndexer runs and it does
> not. This has led me to believe that nutch is either not indexing the
> meta content or it is not sending the update to solr when SolrIndexer
> runs.
>
> Matt Wilson
>
> -----Original Message-----
> From: lewis john mcgibbney [mailto:[email protected]]
> Sent: Monday, September 26, 2011 3:04 PM
> To: [email protected]
> Subject: Re: Indexing specific metadata tags with urlmeta
>
> Hi Matt,
>
> Try changing
>
> <field name="keywords" type="string" stored="true" indexed="true"
>        multiValued="true"/>
>
> to
>
> <field name="Keywords" type="string" stored="true" indexed="true"
>        multiValued="true"/>
>
> as per your metadata tags.
>
> We also have a configuration option in nutch-site.xml which you could
> check out.
>
> <property>
>   <name>urlmeta.tags</name>
>   <value></value>
>   <description>
>     To be used in conjunction with features introduced in NUTCH-655,
>     which allows for custom metatags to be injected alongside your crawl
>     URLs. Specifying those custom tags here will allow for their
>     propagation into a pages outlinks, as well as allow for them to be
>     included as part of an index.
>     Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the
>     tags with white-space at their boundaries, if you are using anything
>     earlier than Hadoop-0.21.
>   </description>
> </property>
>
> On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt
> <[email protected]> wrote:
>
> > I am attempting to crawl a corporate intranet site and allow it to be
> > searched in solr. As part of the requirements I have to be able to
> > index certain metadata tags as their own field in solr (for faceted
> > search). For example, the pages being crawled contain the following
> > meta tag:
> >
> > <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> > Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"
> > />
> >
> > I have updated the nutch-site.xml with the following:
> >
> > <property>
> >   <name>plugin.includes</name>
> >   <value>urlmeta|protocol-httpclient|... </value>
> > </property>
> > <property>
> >   <name>urlmeta.tags</name>
> >   <value>keywords</value>
> > </property>
> >
> > I have updated the solr schema.xml with the following addition:
> >
> > <field name="keywords" type="string" stored="true" indexed="true"
> >        multiValued="true"/>
> >
> > I can see that the field has been created in Solr via the admin
> > interface. I also see that nutch is loading the urlmeta plugin and
> > adding the indexfilters etc. in the hadoop.log. The problem is that
> > nutch does not appear to be indexing the keywords field. All of the
> > pages crawled have the tag present and I am receiving no errors in the
> > nutch log. I am unsure as to what I am missing. This seems to be pretty
> > straightforward; however, I must be misunderstanding either the urlmeta
> > plugin or missing something in the configuration.
>
> --
> *Lewis*

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
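
To illustrate what 'crawldb metadata' means here: as far as I understand the NUTCH-655 feature referenced in the urlmeta.tags description, the tags are attached to seed URLs at inject time, as tab-separated key=value pairs following the URL on each line of the seed file. urlmeta.tags then names the keys to propagate and index. Below is a minimal Python sketch that builds such a seed line. The helper name seed_line, the example URL and the tag value are made up for illustration and are not part of Nutch; check the seed format against your Nutch version before relying on it.

```python
def seed_line(url, metadata):
    """Build one seed-file line: the URL, then tab-separated key=value pairs.

    Assumption: the injector reads metadata in this tab-separated form.
    `seed_line` is a hypothetical helper, not a Nutch API.
    """
    fields = [url]
    for key, value in metadata.items():
        # Tabs and newlines would break the field and line structure.
        if any(ch in key or ch in value for ch in ("\t", "\n")):
            raise ValueError("keys and values must not contain tabs or newlines")
        # The first '=' separates key from value, so keys cannot contain one.
        if "=" in key:
            raise ValueError("keys must not contain '='")
        fields.append(f"{key}={value}")
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical intranet URL and tag value, for illustration only.
    print(seed_line("http://intranet.example.com/", {"keywords": "banking"}))
```

Note that a value injected this way is whatever you put in the seed file. It is never read from the page's own <meta name="Keywords"> tag, which is why the patch in NUTCH-809 is needed for that case.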

