Hi Dean,
I added my documentation and the bundled plugin to JIRA
(https://issues.apache.org/jira/browse/NUTCH-809). Hope this helps.
On 11.01.2012 22:44, Dean Del Ponte wrote:
Thank you for your response.
My goal is to get Nutch to index meta tags. It's been quite an adventure
so far!
On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney <[email protected]> wrote:
Hi Dean,
Unfortunately, nothing official. If you look at the issue, you will see that
this plugin, if eventually integrated, will be combined with two other issues
that all revolve around roughly the same area.
I have never used this patch or any of the others.
Anyone else?
On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte <[email protected]> wrote:
Any documentation on how to use the patch at
https://issues.apache.org/jira/browse/NUTCH-809?
My apologies for the newbie question.
Thanks,
Dean Del Ponte
On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche <[email protected]> wrote:
Hi Matt,
The plugin urlmeta does NOT extract the metadata from HTML pages. The 'meta'
in its name means 'crawldb metadata'.
You need to use the patch in
https://issues.apache.org/jira/browse/NUTCH-809
HTH
Julien
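For readers landing here later, a sketch of what extracting HTML meta tags can look like in configuration. It assumes the parse-metatags and index-metadata plugins as shipped in later Nutch releases, which is where the NUTCH-809 work ended up. The property names below come from those releases, not from the patch attached to the issue, so check them against the nutch-default.xml of your version.

```xml
<!-- nutch-site.xml (sketch; assumption: a Nutch release that bundles
     parse-metatags and index-metadata, both listed in plugin.includes). -->
<property>
  <name>metatags.names</name>
  <!-- Meta tag to pull from the HTML head. -->
  <value>keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- Parse metadata keys to copy into the index; parse-metatags
       prefixes the extracted names with "metatag.". -->
  <value>metatag.keywords</value>
</property>
```

Under these assumptions the Solr schema would presumably need a field named metatag.keywords rather than keywords.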
On 26 September 2011 21:18, Wilson, Matt <[email protected]> wrote:
Also, in case this helps: I removed the Keywords field from the solr schema
to see if it would generate an error when the SolrIndexer runs, and it does
not. This has led me to believe that nutch is either not indexing the meta
content or not sending the update to solr when SolrIndexer runs.
Matt Wilson
-----Original Message-----
From: lewis john mcgibbney [mailto:[email protected]]
Sent: Monday, September 26, 2011 3:04 PM
To: [email protected]
Subject: Re: Indexing specific metadata tags with urlmeta
Hi Matt,
Try changing
<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
to
<field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>
as per your metadata tags.
We also have a configuration option in nutch-site.xml which you could check
out.
<property>
<name>urlmeta.tags</name>
<value></value>
<description>
To be used in conjunction with features introduced in NUTCH-655, which
allows for custom metatags to be injected alongside your crawl URLs.
Specifying those custom tags here will allow for their propagation into a
pages outlinks, as well as allow for them to be included as part of an index.
Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags
with white-space at their boundaries, if you are using anything earlier than
Hadoop-0.21.
</description>
</property>
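For context, a minimal sketch of the use case this description points at: metadata attached to seed URLs at inject time, not tags read from the page HTML. It assumes the seed-list format from NUTCH-655 (a URL followed by tab-separated key=value pairs). The host name and the tag names department and region are hypothetical.

```xml
<!-- nutch-site.xml (sketch). Assumed seed-list line, fields separated by tabs:
     http://intranet.example.com/    department=banking    region=us
     The URL and both tag names are hypothetical. -->
<property>
  <name>urlmeta.tags</name>
  <!-- Comma-delimited, with no white-space around the commas,
       as the description advises. -->
  <value>department,region</value>
</property>
```

With this in place, urlmeta would be expected to carry the injected department and region values to outlinks and into the index. A matching field for each tag would presumably still be needed in the Solr schema.xml, as with the keywords field in this thread.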
On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt <[email protected]> wrote:
I am attempting to crawl a corporate intranet site and allow it to be
searched in solr. As part of the requirements I have to be able to index
certain metadata tags as their own field in solr (for faceted search). For
example, the pages being crawled contain the following meta tag:
<meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings, Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans" />
I have updated the nutch-site.xml with the following:
<property>
<name>plugin.includes</name>
<value>urlmeta|protocol-httpclient|...</value>
</property>
<property>
<name>urlmeta.tags</name>
<value>keywords</value>
</property>
I have updated the solr schema.xml with the following addition:
<field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
I can see that the field has been created in Solr via the admin interface.
I also see in the hadoop.log that nutch is loading the urlmeta plugin and
adding the indexfilters etc. The problem is that nutch does not appear to be
indexing the keywords field. All of the pages crawled have the tag present
and I am receiving no errors in the nutch log. I am unsure as to what I am
missing. This seems to be pretty straightforward; however, I must be either
misunderstanding the urlmeta plugin or missing something in the
configuration.
--
*Lewis*
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com