Hi Pablo,
This question has been raised a number of times on the user@nutch list; you
can search the archives linked from the Nutch website.
I would suggest storing the seed URL in a new page metadata field, which
could then be added to the index via an indexing filter.
There may be other ways of achieving this; I am sure the archives
will tell you.
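As a rough illustration of the seed-metadata idea, here is a minimal sketch
that writes a seed file in which each URL carries its own address as
metadata. It assumes the Nutch 1.x injector accepts tab-separated key=value
pairs after the URL on each seed line; the key name "seed" and the helper
names are hypothetical choices, not Nutch built-ins. Propagating that value
to the subpages and getting it into Solr is a separate step (for example the
indexing filter mentioned above), so please check the archives for what fits
your version.

```python
# Sketch only: generate a Nutch seed list where every line has the form
#   <url>\t<key>=<url>
# so that each seed is injected with its own URL as metadata.
# Assumption: the injector parses tab-separated key=value metadata.
# The key "seed" is a hypothetical name chosen for this example.

SEEDS = [
    "http://domainone.com",
    "http://domaintwo.com",
    "http://domainthree.com",
    "http://domainfour.com",
    "http://domainfive.com",
]


def seed_line(url, key="seed"):
    """Return one seed-file line: the URL, a tab, then key=url."""
    return f"{url}\t{key}={url}"


def build_seed_file(urls):
    """Return the full seed-file text, one line per URL."""
    return "\n".join(seed_line(u) for u in urls) + "\n"


if __name__ == "__main__":
    # Print to stdout; redirect into your seed directory, e.g. urls/seed.txt
    print(build_seed_file(SEEDS), end="")
```

The output can be redirected into the seed file you pass to the inject step.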
Thanks
Lewis

On Mon, Oct 20, 2014 at 4:35 AM, <[email protected]> wrote:

> Hello,
>
> I'm trying to set up a Nutch+Solr installation to crawl a list of domains.
> I want to get 50 pages per seed in the list (no external links) and
> save the seed each page came from in the result.
>
> The goal is to be able to query for a word and get all the seeds from
> my list that lead to a page containing it.
>
> Example:
> I have a seed list with:
>
> http://domainone.com
> http://domaintwo.com
> http://domainthree.com
> http://domainfour.com
> http://domainfive.com
>
> I save 50 subpages from each of them to solr. (Total of 5*50=250 pages
> indexed in solr)
>
> Now I query for "foobar" and want to get the items back from the
> seedlist which contained the word "foobar" or have subpages
> (http://domainthree.com/somepage.html) that contained that word.
>
>
>
> How would I save the seed a page originally came from in Solr?
>
