Lewis, 

Thank you for your reply.  I changed the capitalization in the 
solr/conf/schema.xml file to match that of the field in the crawled HTML and 
the other entries in nutch-site.xml.  I had already added the urlmeta.tags 
property.  Unfortunately, I get the same results.  After a successful crawl I 
execute a query in Solr requesting that the Keywords field be returned, and it 
appears to have no value.  Any ideas on how I can debug where the issue is? 
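
One way to narrow this down on the Solr side is to ask only for documents
that actually carry a value in the field.  The following is a minimal sketch
in Python, not a definitive procedure.  It assumes a Solr instance at
http://localhost:8983/solr (a placeholder), that the index has a "url" field
as in the stock Nutch schema, and the helper name build_field_check_url is
made up for illustration.  It only builds the query URL; paste the result
into a browser or curl.

```python
from urllib.parse import urlencode


def build_field_check_url(solr_base, field, rows=5):
    """Build a Solr select URL that matches only documents having a value for `field`.

    solr_base is an assumed base URL such as http://localhost:8983/solr;
    adjust it to the actual installation.
    """
    params = {
        # Open-ended range query: matches any document where the field is set.
        "q": f"{field}:[* TO *]",
        # Return only the page URL (assumed to exist in the schema) and the field.
        "fl": f"url,{field}",
        "rows": rows,
        "wt": "json",
    }
    return f"{solr_base.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Field names are case sensitive, so try both spellings.
    for name in ("Keywords", "keywords"):
        print(build_field_check_url("http://localhost:8983/solr", name))
```

If both queries return zero documents, the field is probably never populated
at indexing time, which would point at the Nutch side rather than at the Solr
schema.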

Thanks, 

Matt Wilson

-----Original Message-----
From: lewis john mcgibbney [mailto:[email protected]] 
Sent: Monday, September 26, 2011 3:04 PM
To: [email protected]
Subject: Re: Indexing specific metadata tags with urlmeta

Hi Matt,

Try changing

<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>

to

<field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>

as per your metadata tags.

We also have a configuration option in nutch-site.xml which you could check
out.

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
    To be used in conjunction with features introduced in NUTCH-655, which allows
    for custom metatags to be injected alongside your crawl URLs. Specifying those
    custom tags here will allow for their propagation into a pages outlinks, as
    well as allow for them to be included as part of an index.
    Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
    white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
  </description>
</property>

On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt
<[email protected]> wrote:

> I am attempting to crawl a corporate intranet site and allow it to be
> searched in solr.  As part of the requirements I have to be able to index
> certain metadata tags as their own field in solr (for faceted search).  For
> example, the pages being crawled contain the following meta tag:
>
> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings,
> Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans"
> />
>
> I have updated the nutch-site.xml with the following:
>
> <property>
>    <name>plugin.includes</name>
>    <value>urlmeta|protocol-httpclient|... </value>
> </property>
> <property>
>    <name>urlmeta.tags</name>
>    <value>keywords</value>
> </property>
>
> I have updated the solr schema.xml with the following addition:
>
> <field name="keywords" type="string" stored="true" indexed="true"
> multiValued="true"/>
>
> I can see that the field has been created in Solr via the admin interface.
>  I also see that nutch is loading the urlmeta plugin and adding the
> indexfilters etc in the hadoop.log.  The problem is that nutch does not
> appear to be indexing the keywords field.  All of the pages crawled have the
> tag present and I am receiving no errors in the nutch log.  I am unsure as
> to what I am missing.  This seems to be pretty straightforward; however, I
> must be misunderstanding either the urlmeta plugin or missing something in
> the configuration.
>



-- 
*Lewis*

