Lewis,

Thank you for your reply. I changed the capitalization in the solr/conf/schema.xml file to match that of the field in the crawled HTML and the other entries in nutch-site.xml. I had already added the urlmeta.tags property. Unfortunately, I get the same results: after a successful crawl I run a query in Solr requesting that the Keywords field be returned, and it appears to have no value. Any ideas on how I can debug where the issue is?
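One way to make the check described above repeatable is a small script that requests the field and counts how many returned documents actually carry a value. This is only a minimal sketch: the endpoint http://localhost:8983/solr/select, the JSON response writer (wt=json), and the helper names build_query_url, count_docs_with_field and check_field are all assumptions for illustration, not part of the setup described in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a single-core Solr on the default local port; adjust as needed.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(field, rows=10, base_url=SOLR_SELECT_URL):
    """Build a match-all query that asks Solr to return `id` and `field`."""
    params = {"q": "*:*", "fl": "id," + field, "rows": rows, "wt": "json"}
    return base_url + "?" + urlencode(params)


def count_docs_with_field(response, field):
    """Return (docs with a non-empty `field`, total docs) for a parsed response."""
    docs = response.get("response", {}).get("docs", [])
    with_value = sum(1 for doc in docs if doc.get(field))
    return with_value, len(docs)


def check_field(field):
    """Run the query against Solr and summarise how many docs carry `field`."""
    with urlopen(build_query_url(field)) as resp:
        return count_docs_with_field(json.load(resp), field)
```

Calling check_field("Keywords") and check_field("keywords") in turn would show whether either capitalization of the field is populated in any of the returned documents.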
Thanks,
Matt Wilson

-----Original Message-----
From: lewis john mcgibbney [mailto:[email protected]]
Sent: Monday, September 26, 2011 3:04 PM
To: [email protected]
Subject: Re: Indexing specific metadata tags with urlmeta

Hi Matt,

Try changing

<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>

to

<field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>

as per your metadata tags. We also have a configuration option in nutch-site.xml which you could check out.

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which
    allows for custom metatags to be injected alongside your crawl URLs.
    Specifying those custom tags here will allow for their propagation into a
    pages outlinks, as well as allow for them to be included as part of an
    index. Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the
    tags with white-space at their boundaries, if you are using anything
    earlier than Hadoop-0.21.
  </description>
</property>

On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt <[email protected]> wrote:

> I am attempting to crawl a corporate intranet site and allow it to be
> searched in solr. As part of the requirements I have to be able to index
> certain metadata tags as their own field in solr (for faceted search). For
> example, the pages being crawled contain the following meta tag:
>
> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"
> />
>
> I have updated the nutch-site.xml with the following:
>
> <property>
>   <name>plugin.includes</name>
>   <value>urlmeta|protocol-httpclient|...
>   </value>
> </property>
> <property>
>   <name>urlmeta.tags</name>
>   <value>keywords</value>
> </property>
>
> I have updated the solr schema.xml with the following addition:
>
> <field name="keywords" type="string" stored="true" indexed="true"
> multiValued="true"/>
>
> I can see that the field has been created in Solr via the admin interface.
> I also see that nutch is loading the urlmeta plugin and adding the
> indexfilters etc in the hadoop.log. The problem is that nutch does not
> appear to be indexing the keywords field. All of the pages crawled have the
> tag present and I am receiving no errors in the nutch log. I am unsure as
> to what I am missing. This seems to be pretty straightforward; however, I
> must be misunderstanding either the urlmeta plugin or missing something in
> the configuration.
>

--
*Lewis*

