I don't know of an existing plugin for this, but you can develop your own; it's
quite easy. First, read this article on how to develop a custom plugin:
[http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html][1].
You need a plugin like that one. Check for the key in a class implementing
HtmlParseFilter (if you want to check the HTML code) or in a class implementing
IndexingFilter (if you want to check whether the key is present in the "visible
text"; in that case you can omit the HtmlParseFilter class). In the first case,
get the HTML code as follows:

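    String url = content.getUrl();
    Parse parse = parseResult.get(url);
    // Caveat: getData() returns the ParseData (title, outlinks, metadata);
    // if you need the raw page bytes, content.getContent() is the place to look.
    String pageText = parse.getData().toString();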

Then check for the key in it and set a boolean flag in the metadata. At the
indexing stage, check that flag and return null if the page should not be
indexed. Sample code from my project, where the crawler checks for some code on
the page:

    // The parse stage set the NO_INDEX flag, so skip this page at indexing.
    if (metadata.getValues(RgPluginParser.NO_INDEX).length != 0) {
      LOG.debug("No code found on " + url + " page - indexing has been cancelled");
      return null;
    }
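To make the two stages easier to see together, here is a rough plain-Java sketch of the same flow. It is not Nutch code: a `Map` stands in for the parse metadata, and the names `NoIndexFlagSketch`, `markAtParseStage`, `filterAtIndexStage` and the `"no_index"` key are made up for illustration. In a real plugin the two methods would live in your HtmlParseFilter and IndexingFilter classes.

```java
import java.util.Map;

/**
 * Plain-Java sketch of the two-stage flow. A Map stands in for the
 * parse metadata; none of these names are Nutch API.
 */
final class NoIndexFlagSketch {
    // Hypothetical metadata key, playing the role of RgPluginParser.NO_INDEX.
    static final String NO_INDEX = "no_index";

    /** Parse stage: mark the page when the key is missing from its source. */
    static void markAtParseStage(String pageSource, String key, Map<String, String> metadata) {
        if (!pageSource.contains(key)) {
            metadata.put(NO_INDEX, "true");
        }
    }

    /** Indexing stage: return null to skip a marked page, otherwise pass the document through. */
    static <D> D filterAtIndexStage(D doc, Map<String, String> metadata) {
        return metadata.containsKey(NO_INDEX) ? null : doc;
    }
}
```

The flag is only written when the key is missing, so the indexing stage has to test for nothing more than its presence, the same way the snippet from my project does.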


In the second case, just do something like this at the indexing stage:

    String text = parse.getText();
    if(!text.contains(key)){
      return null;
    }
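Since you have several keywords and want the page kept if at least one of them is present, the single `contains` check could be replaced by a small helper. This is only a sketch: `KeywordMatcher` and `containsAny` are names I made up, and ignoring case is my assumption, not something you asked for.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Nutch. */
final class KeywordMatcher {
    /** Returns true if the text contains at least one keyword, ignoring case. */
    static boolean containsAny(String text, List<String> keywords) {
        if (text == null) {
            return false;
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            // Skip empty keywords: every string contains "", which would match every page.
            if (keyword.isEmpty()) {
                continue;
            }
            if (haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

At the indexing stage you would then return null when `KeywordMatcher.containsAny(parse.getText(), keys)` is false, with `keys` being your own keyword list.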

  [1]: http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

WBR, Alexander Chepurnoy.

--- 17.7.12, mausmust <[email protected]> :

From: mausmust <[email protected]>
Subject: Nutch Content Filtering
To: [email protected]
Date: Tuesday, 17 July 2012, 11:41

While Apache Nutch 1.3 is crawling pages, I want to analyze the content of each
page. If the content contains some keywords, the page should go on to the next
steps, say indexing. If the content does not contain at least one key, the
crawler should just get the links from that page and ignore it. How can I do
that? Is there any filtering plugin available for this purpose? Thanks.
