Hi Joe,

In all honesty, this might sound slightly optimistic, and it may also depend on the size and calibre of the different sites/domains, but if you are attempting a depth-first, domain-specific crawl, then separate Nutch instances may well be your friend...
Lewis

On Thu, Nov 15, 2012 at 11:53 PM, Joe Zhang <[email protected]> wrote:
> well, these are all details. The bigger question is, how to seperate the
> crawling policy of site A from that of site B?
>
> On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak <[email protected]> wrote:
>
>> You probably need to customize parse-metatags plugin.
>>
>> I think you go ahead and include all possible metatags. And take care of
>> missing metatags in solr.
>>
>> On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang <[email protected]> wrote:
>>
>> > I understand conf/regex-urlfilter.txt; I can put domain names into the
>> > URL patterns.
>> >
>> > But what about meta tags? What if I want to parse out different meta tags
>> > for different sites?
>> >
>> > On Wed, Nov 14, 2012 at 1:33 AM, Sourajit Basak <[email protected]> wrote:
>> >
>> > > 1) For parsing & indexing customized meta tags enable & configure
>> > > plugin "parse-metatags"
>> > >
>> > > 2) There are several filters of url, like regex based. For regex, the
>> > > patterns are specified via conf/regex-urlfilter.txt
>> > >
>> > > On Wed, Nov 14, 2012 at 1:33 PM, Tejas Patil <[email protected]> wrote:
>> > >
>> > > > While defining url patterns, have the domain name in it so that you get
>> > > > site/domain specific rules. I don't know about configuring meta tags.
>> > > >
>> > > > Thanks,
>> > > > Tejas
>> > > >
>> > > > On Tue, Nov 13, 2012 at 11:34 PM, Joe Zhang <[email protected]> wrote:
>> > > >
>> > > > > How to enforce site-specific crawling policies, i.e, different URL
>> > > > > patterns, meta tags, etc. for different websites to be crawled? I got
>> > > > > the sense that multiple instances of nutch are needed? Is it correct?
>> > > > > If yes, how?

--
Lewis
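
To illustrate the suggestion in the thread of putting the domain name into the URL patterns, here is a minimal Python sketch of how ordered "+"/"-" regex rules in the style of conf/regex-urlfilter.txt can give site A and site B different policies within one rule file. It is not Nutch code. It assumes, as I understand the regex filter to work, that the first matching rule decides and that a URL matching no rule is rejected. The host names site-a.example and site-b.example, the paths, and the function names are made up for the example.

```python
import re

# Hypothetical rules in the style of conf/regex-urlfilter.txt.
# '+' accepts, '-' rejects; the domain names and paths are invented.
RULES = r"""
# site A: only crawl the /docs/ section
+^https?://www\.site-a\.example/docs/
# site B: crawl everything except /private/
-^https?://www\.site-b\.example/private/
+^https?://www\.site-b\.example/
# reject anything else
-.
"""


def parse_rules(text):
    """Turn rule text into an ordered list of (accept, compiled_regex)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the rules in order; the first matching rule decides.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


rules = parse_rules(RULES)

if __name__ == "__main__":
    for url in (
        "http://www.site-a.example/docs/intro.html",
        "http://www.site-a.example/blog/",
        "http://www.site-b.example/private/report",
        "http://www.site-b.example/news/1",
    ):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

Because order matters under this assumption, the more specific rule for site B's /private/ path has to come before the general accept rule for that host. This only covers URL patterns; it does not address per-site meta tags, which is where separate instances or a customized parse-metatags plugin come in.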

