Also, in case this helps: I removed the Keywords field from the Solr schema to see whether the SolrIndexer would raise an error when it runs, and it does not. This has led me to believe that Nutch is either not indexing the meta content or not sending the update to Solr when SolrIndexer runs.
Matt Wilson

-----Original Message-----
From: lewis john mcgibbney [mailto:[email protected]]
Sent: Monday, September 26, 2011 3:04 PM
To: [email protected]
Subject: Re: Indexing specific metadata tags with urlmeta

Hi Matt,

Try changing

<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>

to

<field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>

as per your metadata tags. We also have a configuration option in nutch-site.xml which you could check out.

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which
    allows for custom metatags to be injected alongside your crawl URLs.
    Specifying those custom tags here will allow for their propagation into
    a pages outlinks, as well as allow for them to be included as part of an
    index. Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad
    the tags with white-space at their boundaries, if you are using anything
    earlier than Hadoop-0.21.
  </description>
</property>

On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt <[email protected]> wrote:

> I am attempting to crawl a corporate intranet site and allow it to be
> searched in solr. As part of the requirements I have to be able to index
> certain metadata tags as their own field in solr (for faceted search). For
> example, the pages being crawled contain the following meta tag:
>
> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"
> />
>
> I have updated the nutch-site.xml with the following:
>
> <property>
>   <name>plugin.includes</name>
>   <value>urlmeta|protocol-httpclient|...</value>
> </property>
> <property>
>   <name>urlmeta.tags</name>
>   <value>keywords</value>
> </property>
>
> I have updated the solr schema.xml with the following addition:
>
> <field name="keywords" type="string" stored="true" indexed="true"
> multiValued="true"/>
>
> I can see that the field has been created in Solr via the admin interface.
> I also see that nutch is loading the urlmeta plugin and adding the
> indexfilters etc in the hadoop.log. The problem is that nutch does not
> appear to be indexing the keywords field. All of the pages crawled have the
> tag present and I am receiving no errors in the nutch log. I am unsure as
> to what I am missing. This seems to be pretty straightforward; however, I
> must be misunderstanding either the urlmeta plugin or missing something in
> the configuration.
>

--
*Lewis*
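
The question raised in the top message, whether the field ever reaches Solr, could be checked directly by asking Solr how many documents carry a value for it. The following is only a minimal sketch: the base URL `http://localhost:8983/solr` and the helper names are assumptions for illustration, not something from this thread, and the field name should match whatever casing is declared in schema.xml.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_field_check_url(base_url, field):
    """Build a Solr select URL matching every document that has any value in `field`."""
    params = {
        "q": f"{field}:[* TO *]",  # open-ended range query: the field is present
        "rows": 0,                 # only the hit count is needed, not the documents
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


def count_docs_with_field(base_url, field):
    """Return numFound for the field-presence query (requires a reachable Solr)."""
    with urlopen(build_field_check_url(base_url, field)) as resp:
        return json.load(resp)["response"]["numFound"]


# Example call (assumed local Solr instance; not executed here):
#   count_docs_with_field("http://localhost:8983/solr", "keywords")
```

A count of zero after a SolrIndexer run would suggest the field is not being sent at all, which would be consistent with removing the field from the schema producing no error.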

