Re: Tagging records by seed list

Sebastian Nagel Mon, 06 Nov 2017 00:46:45 -0800

Hi Sol,

> 4, I edited nutch-site.xml and changed index-(basic|anchor) to be
> index-(basic|anchor|urlmeta)


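The name of the plugin is "urlmeta" (not "index-urlmeta"), so it has to
be listed as its own alternative in plugin.includes: the pattern
index-(basic|anchor|urlmeta) only matches the non-existent "index-urlmeta".
The plugin implements two extension points: an indexing filter and
a scoring filter. The scoring filter makes sure the metadata is
transferred to the linked pages.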
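
As a rough sketch, the two properties in nutch-site.xml could look like
the fragment below. The other plugins in the value are only an assumed,
typical Nutch 1.x set, not taken from your configuration; keep whatever
your plugin.includes already contains and just add "urlmeta" as a
separate alternative.

```xml
<!-- nutch-site.xml (fragment) -->
<!-- Assumption: every plugin here except "urlmeta" is a placeholder for
     your existing plugin.includes value. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- Metadata key(s) from the seed list to pass on to outlinks and index -->
<property>
  <name>urlmeta.tags</name>
  <value>source</value>
</property>
```

With this in place, a seed line of the form "url <tab> source=source1"
should end up as a "source" field on the seed page and on the pages
linked from it.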

Sebastian

On 11/06/2017 04:32 AM, Sol Lederman wrote:
> Hi Sebastian,
> 
> I tried using the urlmeta plugin but my indexed records don't have the
> field I expected.
> 
> Here's what I did:
> 
> 1. I dropped the nutch core in Solr.
> 2. I recursively removed the files in crawldb, linkdb, and segments
> 3. I edited seed.txt to have a tab after the url and then source=source1
> 4, I edited nutch-site.xml and changed index-(basic|anchor) to be
> index-(basic|anchor|urlmeta)
> 5. I set the value of urlmeta.tags to be this: <value>source</value>
> 6. I went through the tutorial and loaded some data into Solr.
> 7. I queried that nutch core in the Solr UI. I see records but no "source"
> field.
> 
> What am I missing?
> 
> Thanks.
> 
> Sol
> 
> 
> On Wed, Oct 25, 2017 at 3:08 PM, Sebastian Nagel <[email protected]> wrote:
> 
>> Hi Sol,
>>
>> yes, that's the right way to go:
>>  1. add metadata to the seed list
>>      url \t key=val
>>  2. use the urlmeta plugin (links below) to
>>   a) pass metadata forward from seeds to linked pages
>>   b) and index it
>>
>> Or did you mean another plugin?
>>
>> Best,
>> Sebastian
>>
>>
>> https://issues.apache.org/jira/browse/NUTCH-655
>>
>> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/scoring/urlmeta/package-summary.html
>>
>> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/package-summary.html
>>
>> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.html
>>
>> (please open a Jira issue to fix the Javadoc, formatting has been lost.
>> Thanks!)
>>
>> On 10/25/2017 08:03 PM, Sol Lederman wrote:
>>> Hi,
>>>
>>> I've got a requirement to crawl three different sets of seed lists. I'd
>>> like to put the crawl results documents into a single Solr index BUT I need
>>> to tag the records with which seed list they came from. Using facets is one
>>> way. Having a field that identifies the seed list is another way. I've seen
>>> a little bit of documentation that mentions using the metadata plugin for
>>> this purpose. Is this a good approach for this requirement?
>>>
>>> Thanks.
>>>
>>> Sol
>>>
>>
>>
>
