I am trying to crawl a corporate intranet site with Nutch and make it
searchable in Solr. One requirement is that certain meta tags be indexed as
their own fields in Solr (for faceted search). For example, the pages being
crawled contain the following meta tag:
<meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings, Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans" />
I have updated nutch-site.xml with the following:
<property>
  <name>plugin.includes</name>
  <value>urlmeta|protocol-httpclient|... </value>
</property>
<property>
  <name>urlmeta.tags</name>
  <value>keywords</value>
</property>
I have updated the Solr schema.xml with the following addition:
<field name="keywords" type="string" stored="true" indexed="true"
multiValued="true"/>
I can see in the admin interface that the field has been created in Solr. I
can also see in hadoop.log that Nutch is loading the urlmeta plugin and
registering its indexing filters. The problem is that Nutch does not appear
to be populating the keywords field. Every crawled page has the tag present,
and the Nutch log shows no errors. I am unsure what I am missing. This seems
pretty straightforward, so I must either be misunderstanding the urlmeta
plugin or missing something in the configuration.
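
For comparison, here is a sketch of an alternative configuration that may be
worth testing. It is an assumption, not something verified on this crawl. As
far as I understand, urlmeta propagates metadata attached to seed URLs to
their outlinks rather than reading HTML meta tags, while the parse-metatags
and index-metadata plugins are the ones aimed at tags in the page head. The
values below are illustrative, and the rest of plugin.includes is left elided
exactly as above.

```xml
<!-- nutch-site.xml: unverified sketch using parse-metatags + index-metadata -->
<property>
  <name>plugin.includes</name>
  <!-- the remaining plugins are elided here, as in the original value -->
  <value>parse-metatags|index-metadata|protocol-httpclient|... </value>
</property>
<property>
  <!-- meta tag names to extract during parsing -->
  <name>metatags.names</name>
  <value>keywords</value>
</property>
<property>
  <!-- parse metadata keys to copy into the index -->
  <name>index.parse.md</name>
  <value>metatag.keywords</value>
</property>
```

If this works as I expect, the indexed field would be named metatag.keywords
rather than keywords, so schema.xml would need a field with that name (or a
copyField into keywords) for the facet to be populated.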