Arthur,

I think you may be able to achieve what you describe using the Subcollection
plugin.
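
The general idea is to derive the tag from each page's own URL at indexing
time, rather than inheriting it from whichever seed led the crawler to the
page. A minimal Python sketch of that idea follows. It is only an
illustration, not the plugin's code, and the table and function names are
made up for the example:

```python
# Hypothetical prefix-to-tag table; names and values are illustrative only.
PREFIX_TAGS = {
    "http://www.example.org/first_half/": "value1",
    "http://www.example.org/second_half/": "value2",
}


def tag_for_url(url, prefix_tags=PREFIX_TAGS):
    """Return the tag of the longest URL prefix that matches, or None."""
    matches = [prefix for prefix in prefix_tags if url.startswith(prefix)]
    if not matches:
        return None
    # Prefer the most specific (longest) prefix when several match.
    return prefix_tags[max(matches, key=len)]
```

Because the lookup depends only on the page URL, a page under second_half/
gets value2 even if it was discovered through a link from first_half/.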

Iain

-----Original Message-----
From: Arthur Yarwood [mailto:[email protected]] 
Sent: Friday, November 6, 2015 7:45 AM
To: [email protected]
Subject: Assigning different meta tags to different parts of a website

I'm currently implementing quite a narrow vertical search, with a heavily
tailored set of seed URLs, each with a number of metatags. The crawl has
following of external links turned off. I have two regex-urlfilters files,
one for the crawl and one specific to indexing. The setup is fine when there
is a clear one-to-one mapping between host and metatags: all pages from each
site get indexed with the metatags I assigned to the host in the seed file.
However, I have a number of hosts I'd like to crawl, where I'd like to
assign half the site one metadata tag and the other half the site another
metadata tag.

For example, a seed.txt file that looks a bit like this:

http://www.example.org/first_half/  mytag=value1
http://www.example.org/second_half/  mytag=value2

So all pages under http://www.example.org/first_half/* are indexed with
mytag = value1 and all pages under http://www.example.org/second_half/
are indexed with mytag = value2.

The problem is, pages from both halves will have links to pages on the other
half. So, presumably, Nutch will potentially follow these and carry forth
the wrong meta tag to the other half. Correct?

Any ideas on how to achieve my goal here? I want each page from the host
www.example.org to have just one 'mytag' metadata tag, with a value that is
correct for the subfolder the page is in. Am I going to need a custom plugin
to parse URLs and assign metadata accordingly, or can this be achieved
through config alone?


BTW I'm using Nutch 1.10.


--
Arthur Yarwood
