Hi Sol,

> 4, I edited nutch-site.xml and changed index-(basic|anchor) to be
> index-(basic|anchor|urlmeta)

The name of the plugin is "urlmeta" (not "index-urlmeta"). It implements two
plugin extension points: an indexing filter and a scoring filter. The scoring
filter makes sure the metadata is transferred to the linked pages.

Sebastian

On 11/06/2017 04:32 AM, Sol Lederman wrote:
> Hi Sebastian,
>
> I tried using the urlmeta plugin but my indexed records don't have the
> field I expected.
>
> Here's what I did:
>
> 1. I dropped the nutch core in Solr.
> 2. I recursively removed the files in crawldb, linkdb, and segments
> 3. I edited seed.txt to have a tab after the url and then source=source1
> 4, I edited nutch-site.xml and changed index-(basic|anchor) to be
> index-(basic|anchor|urlmeta)
> 5. I set the value of urlmeta.tags to be this: <value>source</value>
> 6. I went through the tutorial and loaded some data into Solr.
> 7. I queried that nutch core in the Solr UI. I see records but no "source"
> field.
>
> What am I missing?
>
> Thanks.
>
> Sol
>
>
> On Wed, Oct 25, 2017 at 3:08 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi Sol,
>>
>> yes, that's the right way to go:
>> 1. add metadata to the seed list
>>      url \t key=val
>> 2. use the urlmeta plugin (links below) to
>>    a) pass metadata forward from seeds to linked pages
>>    b) and index it
>>
>> Or did you mean another plugin?
>>
>> Best,
>> Sebastian
>>
>>
>> https://issues.apache.org/jira/browse/NUTCH-655
>>
>> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/scoring/urlmeta/package-summary.html
>>
>> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/package-summary.html
>>
>> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.html
>>
>> (please open a Jira issue to fix the Javadoc, formatting has been lost.
>> Thanks!)
>>
>> On 10/25/2017 08:03 PM, Sol Lederman wrote:
>>> Hi,
>>>
>>> I've got a requirement to crawl three different sets of seed lists. I'd
>>> like to put the crawl results documents into a single Solr index BUT I need
>>> to tag the records with which seed list they came from. Using facets is one
>>> way. Having a field that identifies the seed list is another way. I've seen
>>> a little bit of documentation that mentions using the metadata plugin for
>>> this purpose. Is this a good approach for this requirement?
>>>
>>> Thanks.
>>>
>>> Sol
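
For reference, the correction in the reply at the top of this thread could look roughly like the nutch-site.xml fragment below. It is only a sketch: the pattern index-(basic|anchor|urlmeta) selects a plugin named "index-urlmeta", so "urlmeta" is listed as its own alternative instead. The other plugin names in plugin.includes are an assumed baseline, not taken from the thread; keep whatever your installation already lists.

```xml
<!-- nutch-site.xml fragment (sketch, not a complete configuration) -->
<property>
  <name>plugin.includes</name>
  <!-- "urlmeta" stands alone, outside the index-(...) group.
       All other plugin names here are assumptions for illustration. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlmeta|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>urlmeta.tags</name>
  <!-- Metadata key used in the seed list, e.g. a seed.txt line of the form:
       url <TAB> source=source1 -->
  <value>source</value>
</property>
```

Since the scoring filter is what carries the metadata from seeds to linked pages, the seeds probably need to be injected again after this change. Whether the Solr schema also needs a "source" field defined depends on the schema in use; the thread does not say.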

