I'm currently implementing quite a narrow vertical search, with a
heavily tailored set of seed URLs, each with a number of metatags. The
crawl has following of external links turned off. I have two
regex-urlfilter files, one for crawling and one specific to indexing.
The setup works fine when there is a clear one-to-one mapping between
host and metatags: all pages from each site get indexed with the
metatags I assigned to the host in the seed file. However, there are a
number of hosts I'd like to crawl where I'd like to assign half the
site one metadata tag and the other half another metadata tag.
For example, a seed.txt file that looks a bit like this:
http://www.example.org/first_half/ mytag=value1
http://www.example.org/second_half/ mytag=value2
So all pages under http://www.example.org/first_half/ are indexed with
mytag=value1 and all pages under http://www.example.org/second_half/
are indexed with mytag=value2.
The problem is that pages in each half will link to pages in the other
half. So, presumably, Nutch will follow these links and carry the wrong
metatag value over to pages in the other half. Correct?
Any ideas on how to achieve my goal here? That is, each page from the
host www.example.org should have just one 'mytag' metadata tag, with a
value that is correct for the subfolder the page is in. Am I going to
need some custom plugin to parse URLs and assign metadata accordingly?
Or can this be achieved through config alone?
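
To make the goal concrete, here is a rough standalone Java sketch of the
URL-to-tag mapping I have in mind. It does not use any Nutch API; the
class name PrefixTagger, the method tagFor and the hard-coded prefix
table are all made up for illustration. In a real setup I assume this
logic would have to live inside a custom plugin and read its prefixes
from configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixTagger {
    // Hypothetical prefix -> tag table mirroring the seed file above.
    private static final Map<String, String> PREFIX_TO_TAG = new LinkedHashMap<>();
    static {
        PREFIX_TO_TAG.put("http://www.example.org/first_half/", "value1");
        PREFIX_TO_TAG.put("http://www.example.org/second_half/", "value2");
    }

    /** Returns the tag of the longest matching prefix, or null if none matches. */
    public static String tagFor(String url) {
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, String> e : PREFIX_TO_TAG.entrySet()) {
            String prefix = e.getKey();
            // Prefer the most specific (longest) prefix that matches.
            if (url.startsWith(prefix) && prefix.length() > bestLen) {
                best = e.getValue();
                bestLen = prefix.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(tagFor("http://www.example.org/first_half/a.html"));  // value1
        System.out.println(tagFor("http://www.example.org/second_half/b.html")); // value2
        System.out.println(tagFor("http://www.example.org/other/c.html"));       // null
    }
}
```

The point is that the tag is derived from each page's own URL rather
than inherited from whichever page linked to it, so cross-links between
the two halves would not matter.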
BTW I'm using Nutch 1.10.
--
Arthur Yarwood