Hi Sebastian, I tried using the urlmeta plugin but my indexed records don't have the field I expected.
Here's what I did:

1. I dropped the nutch core in Solr.
2. I recursively removed the files in crawldb, linkdb, and segments.
3. I edited seed.txt to have a tab after the URL, followed by source=source1.
4. I edited nutch-site.xml and changed index-(basic|anchor) to index-(basic|anchor|urlmeta) in plugin.includes.
5. I set the value of urlmeta.tags to <value>source</value>.
6. I went through the tutorial and loaded some data into Solr.
7. I queried the nutch core in the Solr UI.

I see records but no "source" field. What am I missing?

Thanks.

Sol

On Wed, Oct 25, 2017 at 3:08 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Sol,
>
> yes, that's the right way to go:
> 1. add metadata to the seed list:
>    url \t key=val
> 2. use the urlmeta plugin (links below) to
>    a) pass metadata forward from seeds to linked pages
>    b) index it
>
> Or did you mean another plugin?
>
> Best,
> Sebastian
>
> https://issues.apache.org/jira/browse/NUTCH-655
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/scoring/urlmeta/package-summary.html
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/package-summary.html
> https://builds.apache.org/job/nutch-trunk/javadoc/org/apache/nutch/indexer/urlmeta/URLMetaIndexingFilter.html
>
> (Please open a Jira issue to fix the Javadoc; the formatting has been lost. Thanks!)
>
> On 10/25/2017 08:03 PM, Sol Lederman wrote:
> > Hi,
> >
> > I've got a requirement to crawl three different sets of seed lists. I'd
> > like to put the crawled documents into a single Solr index, BUT I need
> > to tag the records with which seed list they came from. Using facets is
> > one way. Having a field that identifies the seed list is another. I've
> > seen a little bit of documentation that mentions using the metadata
> > plugin for this purpose. Is this a good approach for this requirement?
> >
> > Thanks.
> >
> > Sol
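P.S. In case it helps, here is roughly what I ended up with after steps 3-5. The URL is just a placeholder and the plugin.includes line is from memory, so it may not match my files exactly.

seed.txt (one line per URL: the URL, a real tab, then key=value; <TAB> below stands for the tab character):

    http://www.example.com/<TAB>source=source1

nutch-site.xml (only the two properties I changed):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    <property>
      <name>urlmeta.tags</name>
      <value>source</value>
    </property>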

