So how exactly do I set up different Nutch instances then?

On Thu, Nov 15, 2012 at 7:52 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Joe,
>
> In all honesty it might sound slightly optimistic, and it may also
> depend upon the size and calibre of the different sites/domains, but
> if you are attempting a depth-first, domain-specific crawl, then
> maybe separate Nutch instances will be your friend...
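>
> As a rough sketch of what that looks like (untested, and the version
> number, site names, and seed URLs below are just assumptions for
> illustration), each instance is simply its own copy of the Nutch
> runtime with its own conf/ and its own crawl dir:
>
>   # one self-contained runtime per site, each with its own config
>   cp -r apache-nutch-1.5.1 nutch-siteA
>   cp -r apache-nutch-1.5.1 nutch-siteB
>
>   # give each instance its own seed list and URL filters
>   mkdir -p nutch-siteA/urls nutch-siteB/urls
>   echo "http://www.siteA.com/" > nutch-siteA/urls/seed.txt
>   echo "http://www.siteB.com/" > nutch-siteB/urls/seed.txt
>   vi nutch-siteA/conf/regex-urlfilter.txt   # only allow siteA
>   vi nutch-siteB/conf/regex-urlfilter.txt   # only allow siteB
>
>   # then run each crawl into that instance's own crawl dir
>   (cd nutch-siteA && bin/nutch crawl urls -dir crawl -depth 10)
>   (cd nutch-siteB && bin/nutch crawl urls -dir crawl -depth 10)
>
> Because each instance reads its own conf/, the two crawls can differ
> in URL filters, parsing plugins, politeness settings, and so on.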
>
> Lewis
>
>
> On Thu, Nov 15, 2012 at 11:53 PM, Joe Zhang <[email protected]> wrote:
> > well, these are all details. The bigger question is: how do I separate
> > the crawling policy of site A from that of site B?
> >
> > On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak <[email protected]> wrote:
> >
> >> You probably need to customize parse-metatags plugin.
> >>
> >> I think you can go ahead and include all possible metatags, and take
> >> care of any missing metatags on the Solr side.
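> >>
> >> As a rough sketch, the relevant bits of conf/nutch-site.xml might look
> >> like this (exact property names, separators, and defaults vary by
> >> Nutch version, so check your nutch-default.xml and the parse-metatags
> >> plugin docs before copying any of this):
> >>
> >>   <property>
> >>     <name>plugin.includes</name>
> >>     <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>   </property>
> >>   <property>
> >>     <name>metatags.names</name>
> >>     <value>description;keywords;author</value>
> >>   </property>
> >>   <property>
> >>     <name>index.parse.md</name>
> >>     <value>metatag.description,metatag.keywords,metatag.author</value>
> >>   </property>
> >>
> >> The idea: parse-metatags pulls the named tags into the parse metadata
> >> as metatag.* keys, and index-metadata pushes those keys through to the
> >> index, so documents that lack a given tag just end up without that
> >> field in Solr.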
> >>
> >> On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang <[email protected]> wrote:
> >>
> >> > I understand conf/regex-urlfilter.txt; I can put domain names into
> >> > the URL patterns.
> >> >
> >> > But what about meta tags? What if I want to parse out different
> >> > meta tags for different sites?
> >> >
> >> > On Wed, Nov 14, 2012 at 1:33 AM, Sourajit Basak <[email protected]> wrote:
> >> >
> >> > > 1) For parsing & indexing customized meta tags, enable & configure
> >> > > the "parse-metatags" plugin.
> >> > >
> >> > > 2) There are several kinds of URL filters, e.g. regex-based ones.
> >> > > For the regex filter, the patterns are specified via
> >> > > conf/regex-urlfilter.txt.
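> >> > >
> >> > > For instance, a per-domain conf/regex-urlfilter.txt might look
> >> > > like this (the domain is made up; rules are checked top to bottom
> >> > > and the first matching rule wins):
> >> > >
> >> > >   # accept anything under the one domain this instance crawls
> >> > >   +^http://([a-z0-9]*\.)*siteA\.com/
> >> > >   # reject everything else
> >> > >   -.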
> >> > >
> >> > > On Wed, Nov 14, 2012 at 1:33 PM, Tejas Patil <[email protected]> wrote:
> >> > >
> >> > > > While defining url patterns, have the domain name in it so that
> >> > > > you get site/domain-specific rules. I don't know about
> >> > > > configuring meta tags.
> >> > > >
> >> > > > Thanks,
> >> > > > Tejas
> >> > > >
> >> > > >
> >> > > > On Tue, Nov 13, 2012 at 11:34 PM, Joe Zhang <[email protected]> wrote:
> >> > > >
> >> > > > > How do I enforce site-specific crawling policies, i.e.,
> >> > > > > different URL patterns, meta tags, etc. for different websites
> >> > > > > to be crawled? I got the sense that multiple instances of
> >> > > > > Nutch are needed. Is that correct? If yes, how?
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>
>
>
> --
> Lewis
>
