Hi Joe,

In all honesty, this might sound slightly optimistic, and it may also
depend on the size and calibre of the different sites/domains, but if
you are attempting a depth-first, domain-specific crawl, then separate
Nutch instances may well be your friend...
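
As a rough sketch of what I mean (the domain below is made up, and the
rules are only an assumption about what a site-A policy might look
like), the instance that crawls site A could carry its own
conf/regex-urlfilter.txt along these lines:

```
# regex-urlfilter.txt for the site-A instance only (hypothetical domain)
# Each rule is '+' (accept) or '-' (reject) followed by a regex;
# the first rule that matches a URL decides whether it is kept.

# skip URLs containing characters that usually indicate queries/sessions
-[?*!@=]

# accept anything on the (hypothetical) site-A domain and its subdomains
+^https?://([a-z0-9-]+\.)*example-a\.com/

# reject everything else
-.
```

The instance for site B would then have its own copy of conf/ with
different patterns, so neither policy leaks into the other crawl.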

Lewis


On Thu, Nov 15, 2012 at 11:53 PM, Joe Zhang <[email protected]> wrote:
> Well, these are all details. The bigger question is how to separate the
> crawling policy of site A from that of site B.
>
> On Thu, Nov 15, 2012 at 7:41 AM, Sourajit Basak <[email protected]> wrote:
>
>> You probably need to customize the parse-metatags plugin.
>>
>> I think you should go ahead and include all possible metatags, and take
>> care of missing metatags in Solr.
>>
>> On Thu, Nov 15, 2012 at 12:22 AM, Joe Zhang <[email protected]> wrote:
>>
>> > I understand conf/regex-urlfilter.txt; I can put domain names into the
>> > URL patterns.
>> >
>> > But what about meta tags? What if I want to parse out different meta tags
>> > for different sites?
>> >
>> > On Wed, Nov 14, 2012 at 1:33 AM, Sourajit Basak <[email protected]> wrote:
>> >
>> > > 1) For parsing and indexing customized meta tags, enable and
>> > > configure the "parse-metatags" plugin.
>> > >
>> > > 2) There are several kinds of URL filter, e.g. regex-based. For the
>> > > regex filter, the patterns are specified via conf/regex-urlfilter.txt.
>> > >
>> > > On Wed, Nov 14, 2012 at 1:33 PM, Tejas Patil <[email protected]> wrote:
>> > >
>> > > > When defining URL patterns, include the domain name in them so
>> > > > that you get site/domain-specific rules. I don't know about
>> > > > configuring meta tags.
>> > > >
>> > > > Thanks,
>> > > > Tejas
>> > > >
>> > > >
>> > > > On Tue, Nov 13, 2012 at 11:34 PM, Joe Zhang <[email protected]> wrote:
>> > > >
>> > > > > How can I enforce site-specific crawling policies, i.e.,
>> > > > > different URL patterns, meta tags, etc. for the different
>> > > > > websites to be crawled? I got the sense that multiple instances
>> > > > > of Nutch are needed. Is that correct? If yes, how?
>> > > > >
>> > > >
>> > >
>> >
>>



-- 
Lewis
