Hi Matt,
Try changing
<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
to
<field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>
so that the field name matches the capitalisation of the name in your meta tag.
There is also a configuration option in nutch-site.xml which you could check
out:
<property>
<name>urlmeta.tags</name>
<value></value>
<description>
To be used in conjunction with features introduced in NUTCH-655, which allows
for custom metatags to be injected alongside your crawl URLs. Specifying those
custom tags here will allow for their propagation into a page's outlinks, as
well as allow for them to be included as part of an index.
Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
white-space at their boundaries, if you are using anything earlier than
Hadoop-0.21.
</description>
</property>
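As a rough illustration of the description above, the property in nutch-site.xml might be filled in as follows. This is only a sketch: it assumes the tag is named Keywords with the same capitalisation as in the page's meta tag and the Solr field, and Description is a hypothetical second tag included only to show the comma-delimited form without padding.

```xml
<property>
  <name>urlmeta.tags</name>
  <!-- Assumption: "Keywords" matches the case used in the meta tag and in schema.xml.
       "Description" is a hypothetical second tag. No spaces around the comma. -->
  <value>Keywords,Description</value>
</property>
```

Whether the plugin treats tag names case-sensitively is worth verifying against your Nutch version before relying on this.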
On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt wrote:
> I am attempting to crawl a corporate intranet site and allow it to be
> searched in Solr. As part of the requirements I have to be able to index
> certain metadata tags as their own field in Solr (for faceted search). For
> example, the pages being crawled contain the following meta tag:
>
> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"
> />
>
> I have updated the nutch-site.xml with the following:
>
> <property>
> <name>plugin.includes</name>
> <value>urlmeta|protocol-httpclient|... </value>
> </property>
> <property>
> <name>urlmeta.tags</name>
> <value>keywords</value>
> </property>
>
> I have updated the Solr schema.xml with the following addition:
>
> <field name="keywords" type="string" stored="true" indexed="true"
> multiValued="true"/>
>
> I can see that the field has been created in Solr via the admin interface.
> I also see that Nutch is loading the urlmeta plugin and adding the
> index filters etc. in hadoop.log. The problem is that Nutch does not
> appear to be indexing the keywords field. All of the pages crawled have the
> tag present and I am receiving no errors in the Nutch log. I am unsure as
> to what I am missing. This seems to be pretty straightforward; however, I
> must be either misunderstanding the urlmeta plugin or missing something in
> the configuration.
>
--
*Lewis*