Hi Tiger,

 But the negative keywords are usually regex patterns.

 And, if I am not wrong, HtmlParseFilter is a plug-in, right? How do I enable
plug-ins in Nutch or write my own?

 In other words, I understood what you said but am unsure where this logic
should go :( Sorry for the trouble.
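[Editorial sketch: the keyword check Tiger describes below could look roughly like this in plain Java. The class and method names here are illustrative only, not the actual Nutch HtmlParseFilter API; it simply shows the regex-based scoring idea.]

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch of the keyword logic only; this is NOT the real
// Nutch HtmlParseFilter interface.
public class KeywordCheck {
    private final List<Pattern> negative; // regex patterns to block
    private final List<Pattern> positive; // regex patterns to boost

    public KeywordCheck(List<Pattern> negative, List<Pattern> positive) {
        this.negative = negative;
        this.positive = positive;
    }

    // Returns the "crawl_me" value: 0.0 to drop the page,
    // 0.9 to boost it, or a neutral default otherwise.
    public double scoreText(String text) {
        for (Pattern p : negative) {
            if (p.matcher(text).find()) {
                return 0.0; // page matches a negative keyword
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(text).find()) {
                return 0.9; // page matches a wanted keyword
            }
        }
        return 0.5; // neutral
    }

    public static void main(String[] args) {
        KeywordCheck check = new KeywordCheck(
            List.of(Pattern.compile("(?i)casino|free money")),
            List.of(Pattern.compile("(?i)nutch|crawl")));
        System.out.println(check.scoreText("online casino bonus")); // 0.0
        System.out.println(check.scoreText("nutch tutorial"));      // 0.9
        System.out.println(check.scoreText("weather today"));       // 0.5
    }
}
```

In a real plugin, the resulting value would be written into the page's content metadata (as Tiger's crawl_me) rather than printed.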

Thanks,
Abhishek


On Mon, Jan 31, 2011 at 3:47 PM, 黄淑明 <[email protected]> wrote:

> For the first feature, you can actually use a simple string search (such
> as indexOf) in your class that implements HtmlParseFilter.
> When a page contains words that you want to ignore, just return
> null and add a metadata entry to the content (say, crawl_me=0);
> for pages that contain words you like, set crawl_me=0.9 instead.
>
> When Nutch generates URLs, put the check in your URLFilter: if
> "crawl_me" is present and its value is zero, return null.
> Also check it in your ScoringFilter class, and assign a higher score to
> the URLs that have a higher crawl_me value.
>
>
> tiger
> 2011/1/31
>
> 2011/1/31 .: Abhishek :. <[email protected]>:
> > Hi,
> >
> >  I am looking for ways to stop nutch from crawling or showing the
> negative
> > keywords in the search. What is the best way of doing it? Should I be
> using
> > any plugins?
> >
> >  Apart from this, I am also looking out ways to ignore or prioritize some
> > pattern of URL's that nutch is crawling.
> >
> >  Some help would be really appreciated.
> >
> > Thanks,
> > Abhi
> >
>
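[Editorial sketch: the generate-time step Tiger describes, dropping any URL whose metadata carries crawl_me=0, could look roughly like this. The class and method names are illustrative only, not the real Nutch URLFilter API; only the null-means-filtered convention matches Nutch.]

```java
import java.util.Map;

// Illustrative sketch of the URL-filter check only; this is NOT the real
// Nutch URLFilter interface.
public class CrawlMeFilter {
    // Returns the URL to keep it, or null to drop it (Nutch's URL-filter
    // convention: returning null means "filtered out").
    public static String filter(String url, Map<String, String> metadata) {
        String crawlMe = metadata.get("crawl_me");
        if ("0".equals(crawlMe)) {
            return null; // negative page: never generate this URL
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/spam",
                Map.of("crawl_me", "0")));   // null
        System.out.println(filter("http://example.com/good",
                Map.of("crawl_me", "0.9"))); // http://example.com/good
    }
}
```

A ScoringFilter counterpart would read the same crawl_me value and scale the page's generator score instead of dropping the URL outright.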