What if I want to index different metatags for different sites?

On Fri, Nov 16, 2012 at 11:03 AM, Markus Jelsma
<[email protected]>wrote:

> You can override some URL filter paths in nutch-site.xml or with command-line
> options (tools) such as bin/nutch fetch -Durlfilter.regex.file=bla. You can
> also set NUTCH_HOME and keep everything separate if you're running it
> locally. On Hadoop you'll need separate job files.
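For concreteness, the per-site separation described above might look like this in local mode. The directory layout, site names, and the use of NUTCH_CONF_DIR are illustrative assumptions, not details from the thread; check them against your Nutch version's bin/nutch script.

```shell
# Hypothetical layout: one config directory per site group, e.g.
#   conf-siteA/regex-urlfilter.txt, conf-siteA/nutch-site.xml
#   conf-siteB/regex-urlfilter.txt, conf-siteB/nutch-site.xml

# Override the URL filter file for a single tool run:
bin/nutch fetch -Durlfilter.regex.file=conf-siteA/regex-urlfilter.txt crawlA/segments/<segment>

# Or keep the crawls fully separate, each with its own config dir,
# crawldb, and segments:
NUTCH_CONF_DIR=conf-siteA bin/nutch crawl urls-siteA -dir crawlA -depth 3
NUTCH_CONF_DIR=conf-siteB bin/nutch crawl urls-siteB -dir crawlB -depth 3
```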
>
> -----Original message-----
> > From:Joe Zhang <[email protected]>
> > Sent: Fri 16-Nov-2012 18:35
> > To: [email protected]
> > Subject: Re: site-specific crawling policies
> >
> > That's easy to do. But what about the configuration files? The same
> > nutch-site.xml, urlfilter files will be read.
> >
> > On Fri, Nov 16, 2012 at 3:28 AM, Sourajit Basak <
> [email protected]>wrote:
> >
> > > Group related sites together and use separate crawldb, segment
> > > directories.
> > >
> > > On Fri, Nov 16, 2012 at 9:40 AM, Joe Zhang <[email protected]>
> wrote:
> > >
> > > > So how exactly do I set up different nutch instances then?
> > > >
> > > > On Thu, Nov 15, 2012 at 7:52 PM, Lewis John Mcgibbney <
> > > > [email protected]> wrote:
> > > >
> > > > > Hi Joe,
> > > > >
> > > > > In all honesty, this might sound slightly optimistic, and it may
> > > > > also depend upon the size and calibre of the different sites/domains,
> > > > > but if you are attempting a depth-first, domain-specific crawl, then
> > > > > maybe separate Nutch instances will be your friend...
> > > > >
> > > > > Lewis
> > > > >
> > > > >
> > > > > On Thu, Nov 15, 2012 at 11:53 PM, Joe Zhang <[email protected]>
> > > > wrote:
> > > > > > Well, these are all details. The bigger question is, how to
> > > > > > separate the crawling policy of site A from that of site B?
> > > > > >
> > > > > > On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak <
> > > > > [email protected]>wrote:
> > > > > >
> > > > > >> You probably need to customize parse-metatags plugin.
> > > > > >>
> > > > > >> I think you can go ahead and include all possible metatags, and
> > > > > >> take care of missing metatags in Solr.
> > > > > >>
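As a sketch of the "include all possible metatags" approach, the relevant nutch-site.xml properties would look roughly as follows. The tag names here are examples, and the property names (`metatags.names` for the parse-metatags plugin, `index.parse.md` for index-metadata) should be verified against the plugin documentation for your Nutch version.

```xml
<!-- Sketch for conf/nutch-site.xml; verify property names for your version. -->
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>
```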
> > > > > >> On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang <
> [email protected]>
> > > > > wrote:
> > > > > >>
> > > > > >> > I understand conf/regex-urlfilter.txt; I can put domain names
> > > > > >> > into the URL patterns.
> > > > > >> >
> > > > > >> > But what about meta tags? What if I want to parse out different
> > > > > >> > meta tags for different sites?
> > > > > >> >
> > > > > >> > On Wed, Nov 14, 2012 at 1:33 AM, Sourajit Basak <
> > > > > >> [email protected]
> > > > > >> > >wrote:
> > > > > >> >
> > > > > >> > > 1) For parsing & indexing customized meta tags, enable &
> > > > > >> > > configure the plugin "parse-metatags".
> > > > > >> > >
> > > > > >> > > 2) There are several URL filters, e.g. regex-based. For regex,
> > > > > >> > > the patterns are specified via conf/regex-urlfilter.txt.
> > > > > >> > >
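The first-match-wins behaviour of regex-urlfilter.txt rules can be sketched in a few lines of Python; the domains and rules below are made up for illustration, not taken from the thread.

```python
import re

# Rules in regex-urlfilter.txt style: applied in order, first match wins.
# "+" means accept the URL, "-" means reject it.
RULES = [
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*siteA\.com/")),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*siteB\.org/news/")),
    ("-", re.compile(r".")),  # reject everything else
]

def accept(url):
    """Return True if the first rule matching the URL is an include (+) rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched

print(accept("http://www.siteA.com/page.html"))  # True
print(accept("http://siteB.org/news/item1"))     # True
print(accept("http://other.net/"))               # False
```

Grouping related sites into separate config directories, each with its own rule file like this, is what keeps site A's policy from leaking into site B's crawl.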
> > > > > >> > > On Wed, Nov 14, 2012 at 1:33 PM, Tejas Patil <
> > > > > [email protected]
> > > > > >> > > >wrote:
> > > > > >> > >
> > > > > >> > > > While defining URL patterns, include the domain name in
> > > > > >> > > > them so that you get site/domain-specific rules. I don't
> > > > > >> > > > know about configuring meta tags.
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > > Tejas
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On Tue, Nov 13, 2012 at 11:34 PM, Joe Zhang <
> > > > [email protected]
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > > How to enforce site-specific crawling policies, i.e.,
> > > > > >> > > > > different URL patterns, meta tags, etc. for different
> > > > > >> > > > > websites to be crawled? I got the sense that multiple
> > > > > >> > > > > instances of Nutch are needed? Is that correct? If yes,
> > > > > >> > > > > how?
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Lewis
> > > > >
> > > >
> > >
> >
>
