Hi Sebastian, I tried using the urlmeta plugin but my indexed records don't have the field I expected.
Here's what I did:

1. I dropped the nutch core in Solr.
2. I recursively removed the files in crawldb, linkdb, and segments.
3. I edited seed.txt to have a tab after the URL, followed by source=source1.
4. I edited nutch-site.xml and changed index-(basic|anchor) to index-(basic|anchor|urlmeta) in plugin.includes.
5. I set the value of urlmeta.tags to <value>source</value>.
6. I went through the tutorial and loaded some data into Solr.
7. I queried the nutch core in the Solr UI.

I see records but no "source" field. What am I missing?

Thanks.

Sol

On Wed, Oct 25, 2017 at 3:08 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Sol,
>
> yes, that's the right way to go:
> 1. add metadata to the seed list:
>    url \t key=val
> 2. use the urlmeta plugin (links below) to
>    a) pass metadata forward from seeds to linked pages
>    b) index it
>
> Or did you mean another plugin?
>
> Best,
> Sebastian
>
> https://issues.apache.org/jira/browse/NUTCH-655
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/scoring/urlmeta/package-summary.html
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/package-summary.html
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.html
>
> (Please open a Jira issue to fix the Javadoc; the formatting has been lost. Thanks!)
>
> On 10/25/2017 08:03 PM, Sol Lederman wrote:
> > Hi,
> >
> > I've got a requirement to crawl three different sets of seed lists. I'd
> > like to put the crawled documents into a single Solr index, BUT I need
> > to tag the records with which seed list they came from. Using facets is
> > one way. Having a field that identifies the seed list is another. I've
> > seen a little bit of documentation that mentions using the metadata
> > plugin for this purpose. Is this a good approach for this requirement?
> >
> > Thanks.
> >
> > Sol
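P.S. In case it helps, here is roughly what I ended up with after steps 3-5. The URL is just a placeholder and the plugin.includes line is from memory, so it may not match my files exactly.

seed.txt (one line per URL: the URL, a real tab, then key=value; <TAB> below stands for the tab character):

    http://www.example.com/<TAB>source=source1

nutch-site.xml (only the two properties I changed):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    <property>
      <name>urlmeta.tags</name>
      <value>source</value>
    </property>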

