Hi Sol,
yes, that's the right way to go:
1. add metadata to each seed list, one URL per line, followed by
   tab-separated key=value pairs:
   url \t key=val
2. use the urlmeta plugin (links below) to
   a) pass the metadata forward from seeds to linked pages
   b) index it
Or did you mean another plugin?
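
In case it helps, here is a minimal sketch of how the tagged seed files from step 1 could be generated. The key name "seedlist", the list name "listA" and the example URLs are only assumptions for illustration; any key=value pair works the same way.

```python
def seed_line(url, metadata):
    """Build one seed-list line: the URL followed by tab-separated key=value pairs."""
    pairs = [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join([url] + pairs)


def tag_seed_list(urls, list_name, key="seedlist"):
    """Tag every URL of one seed list with the same metadata value.

    The default key name "seedlist" is a hypothetical choice, not a Nutch built-in.
    """
    return [seed_line(url, {key: list_name}) for url in urls]


if __name__ == "__main__":
    # Hypothetical example URLs and list name.
    for line in tag_seed_list(["http://example.org/", "http://example.com/docs/"], "listA"):
        print(line)
```

If I remember correctly, the key you choose also has to be listed in the urlmeta.tags property, and urlmeta has to be added to plugin.includes in nutch-site.xml. Please check this against the package docs linked below.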
Best,
Sebastian
https://issues.apache.org/jira/browse/NUTCH-655
https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/scoring/urlmeta/package-summary.html
https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/package-summary.html
https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.html
(Please open a Jira issue to fix the Javadoc; its formatting has been lost. Thanks!)
On 10/25/2017 08:03 PM, Sol Lederman wrote:
> Hi,
>
> I've got a requirement to crawl three different sets of seed lists. I'd
> like to put the crawl results documents into a single Solr index BUT I need
> to tag the records with which seed list they came from. Using facets is one
> way. Having a field that identifies the seed list is another way. I've seen
> a little bit of documentation that mentions using the metadata plugin for
> this purpose. Is this a good approach for this requirement?
>
> Thanks.
>
> Sol
>