What if I want to index different metatags for different sites?

On Fri, Nov 16, 2012 at 11:03 AM, Markus Jelsma <[email protected]> wrote:
> you can override some URL Filter paths in nutch site or with command line
> options (tools) such as bin/nutch fetch -Durlfilter.regex.file=bla. You can
> also set NUTCH_HOME and keep everything separate if you're running it
> locally. On Hadoop you'll need separate job files.
>
> -----Original message-----
> > From: Joe Zhang <[email protected]>
> > Sent: Fri 16-Nov-2012 18:35
> > To: [email protected]
> > Subject: Re: site-specific crawling policies
> >
> > That's easy to do. But what about the configuration files? The same
> > nutch-site.xml, urlfilter files will be read.
> >
> > On Fri, Nov 16, 2012 at 3:28 AM, Sourajit Basak <[email protected]> wrote:
> >
> > > Group related sites together and use separate crawldb, segment
> > > directories.
> > >
> > > On Fri, Nov 16, 2012 at 9:40 AM, Joe Zhang <[email protected]> wrote:
> > >
> > > > So how exactly do I set up different nutch instances then?
> > > >
> > > > On Thu, Nov 15, 2012 at 7:52 PM, Lewis John Mcgibbney <[email protected]> wrote:
> > > >
> > > > > Hi Joe,
> > > > >
> > > > > In all honesty, it might sound slightly optimistic, it may also depend
> > > > > upon the size and calibre of the different sites/domains but if you
> > > > > are attempting a depth first, domain specific crawl, then maybe
> > > > > separate Nutch instances will be your friend...
> > > > >
> > > > > Lewis
> > > > >
> > > > > On Thu, Nov 15, 2012 at 11:53 PM, Joe Zhang <[email protected]> wrote:
> > > > > > well, these are all details. The bigger question is, how to separate the
> > > > > > crawling policy of site A from that of site B?
> > > > > >
> > > > > > On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak <[email protected]> wrote:
> > > > > >
> > > > > >> You probably need to customize parse-metatags plugin.
> > > > > >>
> > > > > >> I think you go ahead and include all possible metatags. And take care of
> > > > > >> missing metatags in solr.
> > > > > >>
> > > > > >> On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang <[email protected]> wrote:
> > > > > >>
> > > > > >> > I understand conf/regex-urlfilter.txt; I can put domain names into the URL
> > > > > >> > patterns.
> > > > > >> >
> > > > > >> > But what about meta tags? What if I want to parse out different meta tags
> > > > > >> > for different sites?
> > > > > >> >
> > > > > >> > On Wed, Nov 14, 2012 at 1:33 AM, Sourajit Basak <[email protected]> wrote:
> > > > > >> >
> > > > > >> > > 1) For parsing & indexing customized meta tags enable & configure plugin
> > > > > >> > > "parse-metatags"
> > > > > >> > >
> > > > > >> > > 2) There are several filters of url, like regex based. For regex, the
> > > > > >> > > patterns are specified via conf/regex-urlfilter.txt
> > > > > >> > >
> > > > > >> > > On Wed, Nov 14, 2012 at 1:33 PM, Tejas Patil <[email protected]> wrote:
> > > > > >> > >
> > > > > >> > > > While defining url patterns, have the domain name in it so that you get
> > > > > >> > > > site/domain specific rules. I don't know about configuring meta tags.
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > > Tejas
> > > > > >> > > >
> > > > > >> > > > On Tue, Nov 13, 2012 at 11:34 PM, Joe Zhang <[email protected]> wrote:
> > > > > >> > > >
> > > > > >> > > > > How to enforce site-specific crawling policies, i.e., different URL
> > > > > >> > > > > patterns, meta tags, etc. for different websites to be crawled? I got the
> > > > > >> > > > > sense that multiple instances of nutch are needed? Is it correct? If yes,
> > > > > >> > > > > how?
> > > > >
> > > > > --
> > > > > Lewis
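
For reference, the suggestion made earlier in the thread (put the domain name into each URL pattern so one filter file carries site-specific rules) can be illustrated with a small Python sketch. It is not Nutch code: it only mimics the convention of conf/regex-urlfilter.txt as I understand it, where a leading '+' accepts, a leading '-' rejects, the first matching rule decides, and a URL that matches no rule is dropped. The hosts site-a.example and site-b.example and the helper names parse_rules and accepts are made up for the illustration.

```python
import re

# Hypothetical rules written in the style of conf/regex-urlfilter.txt.
# Assumption: '+' accepts, '-' rejects, and the first matching rule wins.
RULES_TEXT = r"""
# site A: only crawl the /docs/ section
+^https?://(www\.)?site-a\.example/docs/
-^https?://(www\.)?site-a\.example/
# site B: crawl everything except /private/
-^https?://(www\.)?site-b\.example/private/
+^https?://(www\.)?site-b\.example/
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    # Assumption: a URL that matches no rule is rejected.
    return False


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://site-a.example/docs/intro.html",
        "http://site-a.example/blog/",
        "http://www.site-b.example/private/x",
        "http://site-b.example/news",
    ):
        print(url, accepts(RULES, url))
```

Because the first match wins, the narrow rule for each site has to come before the broad one. This only covers URL patterns; per-site meta tags remain the open question of this thread.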

