Hi Tiger,

But the negative keywords are usually regex patterns.
And, if I am not wrong, HtmlParseFilter is a plug-in, right? How do I enable plug-ins in Nutch, or write my own? In other words, I understood what you said but I am unsure where this logic has to go :( Sorry for the trouble.

Thanks,
Abhishek

On Mon, Jan 31, 2011 at 3:47 PM, 黄淑明 <[email protected]> wrote:
> For the first feature, you can actually use a simple string find (such
> as indexOf) in your class that implements HtmlParseFilter.
> When a page contains words that you want to ignore, just return
> null, and add a metadata entry to the content (say: crawl_me=0);
> and for those pages that contain words you like, set crawl_me=0.9.
>
> While Nutch is generating URLs, put the judging code in your URLFilter:
> when there is a "crawl_me" entry and its value is zero, return null.
> Also judge in the ScoreFilter class, and set a higher score for the URLs
> that have a higher crawl_me rate.
>
>
> tiger
> 2011/1/31
>
> 2011/1/31 .: Abhishek :. <[email protected]>:
> > Hi,
> >
> > I am looking for ways to stop nutch from crawling or showing the negative
> > keywords in the search. What is the best way of doing it? Should I be using
> > any plugins?
> >
> > Apart from this, I am also looking out for ways to ignore or prioritize some
> > patterns of URLs that nutch is crawling.
> >
> > Some help would be really appreciated.
> >
> > Thanks,
> > Abhi
> >
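[For later readers of this thread: plug-ins in Nutch 1.x are enabled by listing their ids in the `plugin.includes` property of conf/nutch-site.xml. A minimal sketch, assuming a custom plugin whose id is `my-parse-filter` (that id is hypothetical; the other entries are typical defaults you would keep):

```xml
<!-- conf/nutch-site.xml: the value is a regex matched against plugin ids.
     "my-parse-filter" is a made-up id for a custom HtmlParseFilter plugin;
     it must match the id in that plugin's plugin.xml. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|my-parse-filter</value>
</property>
```

The custom plugin itself goes in its own directory under plugins/, with a plugin.xml declaring which extension point (here HtmlParseFilter) it implements.]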
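[A minimal self-contained sketch of the keyword-matching logic tiger describes, using regex patterns as Abhishek suggests rather than plain indexOf. The class name, the neutral 0.5 default, and the example patterns are illustrative only; in a real plugin this logic would sit inside a class implementing Nutch's HtmlParseFilter extension point, writing the result into the parse metadata as "crawl_me":

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the crawl_me scoring idea: scan extracted page text against
// negative and positive regex patterns and decide a score. Negative match
// -> 0.0 (the URLFilter would then drop these outlinks); positive match
// -> 0.9 (the ScoringFilter would boost these URLs); otherwise neutral.
public class CrawlMeScorer {

    private final List<Pattern> negativePatterns;
    private final List<Pattern> positivePatterns;

    public CrawlMeScorer(List<Pattern> negative, List<Pattern> positive) {
        this.negativePatterns = negative;
        this.positivePatterns = positive;
    }

    public double score(String pageText) {
        for (Pattern p : negativePatterns) {
            if (p.matcher(pageText).find()) {
                return 0.0;   // page contains a negative keyword: skip it
            }
        }
        for (Pattern p : positivePatterns) {
            if (p.matcher(pageText).find()) {
                return 0.9;   // page contains a preferred keyword: boost it
            }
        }
        return 0.5;           // no match: neutral (value is an assumption)
    }

    public static void main(String[] args) {
        CrawlMeScorer scorer = new CrawlMeScorer(
                List.of(Pattern.compile("(?i)casino|lottery")),    // example negative patterns
                List.of(Pattern.compile("(?i)apache\\s+nutch")));  // example positive patterns
        System.out.println(scorer.score("Win big at our casino!"));
        System.out.println(scorer.score("An Apache Nutch tutorial"));
        System.out.println(scorer.score("Weather forecast"));
    }
}
```

The HtmlParseFilter would store this value in the content metadata, and the URLFilter/ScoringFilter would read it back when generating and scoring URLs, as tiger outlines above.]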

