Arthur, I think you may be able to achieve what you describe using the Subcollection plugin.
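
As far as I know, that plugin tags documents at index time according to URL patterns listed in its configuration, so the value depends on the page's own URL and not on which seed the crawl reached it from. The matching idea is roughly the following. This is only a sketch in plain Python, not the plugin's code; the names PREFIX_TAGS and tag_for_url are made up for illustration, and the prefixes are the ones from your seed list.

```python
# Hypothetical prefix-to-tag table, mirroring the seed list in the question.
PREFIX_TAGS = {
    "http://www.example.org/first_half/": "value1",
    "http://www.example.org/second_half/": "value2",
}


def tag_for_url(url, prefix_tags=PREFIX_TAGS):
    """Return the tag of the longest prefix that matches url, or None."""
    matches = [prefix for prefix in prefix_tags if url.startswith(prefix)]
    if not matches:
        return None
    # Prefer the most specific (longest) prefix if several match.
    return prefix_tags[max(matches, key=len)]
```

Because the tag is derived from the URL at index time, a link from one half of the site to the other cannot carry the wrong value along with it.
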
Iain

-----Original Message-----
From: Arthur Yarwood [mailto:[email protected]]
Sent: Friday, November 6, 2015 7:45 AM
To: [email protected]
Subject: Assigning different meta tags to different parts of a website

I'm currently implementing quite a narrow vertical search, with a heavily tailored set of seed URLs, each with a number of metatags. The crawl has follow external links turned off. I have two regex-urlfilters files, one for crawling and one specific to indexing.

The setup is fine when there is a clear delimitation between host and metatags: all pages from each site get indexed with the metatags I assigned to the host in the seed file. However, there are a number of hosts I'd like to crawl where I'd like to assign half the site one metadata tag and the other half another metadata tag. For example, a seed.txt file that looks a bit like this:

    http://www.example.org/first_half/ mytag=value1
    http://www.example.org/second_half/ mytag=value2

So all pages under http://www.example.org/first_half/* are indexed with mytag = value1, and all pages under http://www.example.org/second_half/* are indexed with mytag = value2.

The problem is that pages from both halves will have links to pages in the other half. So, presumably, Nutch will potentially follow these and carry the wrong meta tag over to the other half. Correct?

Any ideas on how to achieve my goal here, where each page from the host www.example.org has just the 'mytag' metadata tag, with a value that is correct for the subfolder the page is in? Am I going to need a custom plugin to parse URLs and assign metadata accordingly, or can this be achieved through config alone?

BTW I'm using Nutch 1.10.

--
Arthur Yarwood

