Hi Pablo,

This question has been raised a number of times on the user@nutch list; you can search the archives, which are linked from the Nutch website. I would suggest storing the seed as a new piece of page metadata, which could then be added to the index via an indexing filter. There may be other ways of achieving this; I am sure the archives will tell you.

Thanks,
Lewis

On Mon, Oct 20, 2014 at 4:35 AM, <[email protected]> wrote:
> Hello,
>
> I'm trying to set up a Nutch+Solr to crawl a list of domains.
> I want to get 50 pages per seed in the list (no external links) and
> save the seed each page came from in the result.
>
> The goal is to be able to query for a word and get all the seeds from
> my list that lead to a page containing it.
>
> Example:
> I have a seed list with:
>
> http://domainone.com
> http://domaintwo.com
> http://domainthree.com
> http://domainfour.com
> http://domainfive.com
>
> I save 50 subpages from each of them to solr. (Total of 5*50=250 pages
> indexed in solr)
>
> Now I query for "foobar" and want to get the items back from the
> seedlist which contained the word "foobar" or have subpages
> (http://domainthree.com/somepage.html) that contained that word.
>
> How would I save the seed a page came originally from in solr?
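
As a rough illustration of the metadata suggestion, here is a minimal sketch that builds a seed list in which every URL carries its own address as metadata. It relies on the Nutch 1.x injector accepting tab-separated key=value pairs after each URL. The key name "seed" and the helper seed_line are made up for this example. This only tags the seeds themselves: carrying the value over to the pages discovered from each seed and into Solr needs further configuration (in Nutch 1.x the urlmeta plugin and its urlmeta.tags property may be one option, plus a matching field in the Solr schema), so please check the documentation for your version.

```python
# Sketch: build Nutch seed-list lines that carry a "seed" metadata entry.
# Assumption: the key name "seed" is arbitrary and chosen for illustration.

def seed_line(url, key="seed"):
    """Return one seed-list line: the URL, a tab, then key=value metadata."""
    return f"{url}\t{key}={url}"

seeds = [
    "http://domainone.com",
    "http://domaintwo.com",
    "http://domainthree.com",
    "http://domainfour.com",
    "http://domainfive.com",
]

# One line per seed, ready to be saved as the seed file that gets injected.
seed_file_text = "\n".join(seed_line(u) for u in seeds) + "\n"
print(seed_file_text, end="")
```

Once the value reaches Solr as a field, a query for "foobar" could return (or facet on) that field to list the seeds whose pages matched.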

