Re: Tagging records by seed list

Sebastian Nagel Wed, 25 Oct 2017 14:08:33 -0700

Hi Sol,

Yes, that's the right way to go:
 1. Add metadata to each line of the seed list:
      url \t key=val
 2. Use the urlmeta plugin (links below) to
    a) pass the metadata forward from seeds to linked pages
    b) index it
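
To illustrate step 1, here is a minimal sketch (a helper script, not part of Nutch) that merges several seed lists into tagged seed lines of the form "url \t key=val". The key name "seedlist", the list names and the example URLs are assumptions made up for this illustration. As far as I recall, the key you choose also has to be listed in the urlmeta.tags property and the urlmeta plugin enabled in plugin.includes; please check that against nutch-default.xml.

```python
def tag_seeds(seed_lists, key="seedlist"):
    """Build seed lines of the form 'url<TAB>key=value'.

    seed_lists maps a (hypothetical) list name to an iterable of URLs.
    Blank lines and lines starting with '#' are skipped.
    """
    lines = []
    for name, urls in seed_lists.items():
        for url in urls:
            url = url.strip()
            if not url or url.startswith("#"):
                continue
            # A tab separates the URL from its metadata.
            lines.append(f"{url}\t{key}={name}")
    return lines


if __name__ == "__main__":
    # Hypothetical seed lists; replace with the contents of your own files.
    seeds = {
        "listA": ["http://a.example.org/"],
        "listB": ["http://b.example.org/"],
        "listC": ["http://c.example.org/"],
    }
    for line in tag_seeds(seeds):
        print(line)
```

The output can be saved as a single seed file and injected as usual. With the urlmeta plugin active, the tag should then be passed on to outlinks and end up as a field in the index.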


Or did you mean another plugin?

Best,
Sebastian


https://issues.apache.org/jira/browse/NUTCH-655

https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/scoring/urlmeta/package-summary.html

https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/package-summary.html

https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.html

(Please open a Jira issue to fix the Javadoc; its formatting has been lost. Thanks!)

On 10/25/2017 08:03 PM, Sol Lederman wrote:
> Hi,
> 
> I've got a requirement to crawl three different sets of seed lists. I'd
> like to put the crawl results documents into a single Solr index BUT I need
> to tag the records with which seed list they came from. Using facets is one
> way. Having a field that identifies the seed list is another way. I've seen
> a little bit of documentation that mentions using the metadata plugin for
> this purpose. Is this a good approach for this requirement?
> 
> Thanks.
> 
> Sol
>
