http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

is a good starting point. To get the content of an HTML page, I use:

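    // content is the fetched page; getContent() returns its raw bytes
    StringBuilder text = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                new ByteArrayInputStream(content.getContent()), "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator, so put one back
            text.append(line).append('\n');
        }
    } catch (IOException e) {
        // log or rethrow as appropriate for your plugin
    }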
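If you want to try the same idea outside of Nutch, here is a minimal standalone sketch. The class name ContentText and the method decode are made up for illustration and are not part of Nutch; rawBytes stands in for whatever content.getContent() returns, and UTF-8 is assumed just as in the snippet above.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Nutch class): turns raw page bytes into text.
class ContentText {

    // rawBytes is an assumption standing in for content.getContent().
    static String decode(byte[] rawBytes) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(
                        new ByteArrayInputStream(rawBytes), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() drops the terminator; normalise every line end to '\n'
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream, but the reader API declares it
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] page = "<html>\n<body>Hello</body>\n</html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.print(decode(page));
    }
}
```

Once the page is a plain String you can cut out the parts you do not want before handing the rest on to the parser.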

Good luck.