I am attempting to crawl a corporate intranet site with Nutch and make it 
searchable in Solr.  As part of the requirements, I have to index certain 
meta tags as their own fields in Solr (for faceted search).  For example, 
the pages being crawled contain the following meta tag:

<meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings, Student 
Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans" />

I have updated the nutch-site.xml with the following:

<property>
    <name>plugin.includes</name>
    <value>urlmeta|protocol-httpclient|... </value>
</property>
<property>
    <name>urlmeta.tags</name>
    <value>keywords</value>
</property>

I have added the following field to the Solr schema.xml:

<field name="keywords" type="string" stored="true" indexed="true" 
multiValued="true"/>

I can see in the admin interface that the field has been created in Solr.  I 
also see in hadoop.log that Nutch is loading the urlmeta plugin and adding the 
index filters, etc.  The problem is that Nutch does not appear to be 
indexing the keywords field.  All of the pages crawled have the tag present, and 
I am receiving no errors in the Nutch log.  I am unsure what I am 
missing.  This seems pretty straightforward; however, I must be either 
misunderstanding the urlmeta plugin or missing something in the 
configuration.
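
To make the expected result concrete, here is a rough sketch (in Python, purely 
for illustration) of the values I would expect to end up in the multiValued 
keywords field for the meta tag above.  The helper name split_keywords is my 
own and is not part of Nutch or Solr; it only shows the outcome I am after, 
not how either tool actually processes the tag.

```python
def split_keywords(content: str) -> list[str]:
    """Split a comma-separated meta 'content' value into trimmed terms.

    Hypothetical helper for illustration only; not a Nutch or Solr API.
    """
    # Collapse internal whitespace (the tag wraps across lines in the page
    # source) and drop any empty entries left by stray commas.
    return [" ".join(part.split()) for part in content.split(",") if part.strip()]


# The content attribute of the Keywords meta tag quoted above.
content = """Banking, Savings, Student 
Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"""

keywords = split_keywords(content)
print(keywords)
```

Each of the resulting terms would ideally be stored as a separate value of the 
keywords field so that it can be used as a facet.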
