http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
is a good starting point. To get the content from an HTML page I use:
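// Needs java.io.BufferedReader, ByteArrayInputStream, InputStreamReader, IOException.
// "content" is assumed to be the Nutch Content object handed to the parser.
StringBuilder text = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new ByteArrayInputStream(
                content.getContent()), "UTF-8"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        text.append(line).append('\n'); // readLine() drops the line terminator
    }
} catch (IOException e) {
    throw new IllegalStateException("Could not read page content", e);
}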
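
If you want to try the reading step outside of Nutch first, here is a minimal standalone sketch of the same idea. The class name PageText and the method fromBytes are made up for illustration; in a real plugin the byte array would come from content.getContent(), and the page is assumed to be UTF-8 encoded.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PageText {
    // Hypothetical helper: decodes raw page bytes as UTF-8 and returns the
    // text with a single '\n' appended after every line.
    static String fromBytes(byte[] raw) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the terminator, so put a newline back
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a fetched page would provide
        byte[] raw = "<html>\n<body>Hello</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(fromBytes(raw));
    }
}
```

Once the page is in a String you can cut out the parts you do not want before passing the rest on to the parser.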
Good luck.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Fwd-How-to-write-a-plugin-to-ignore-certain-parts-of-a-HTML-Page-tp2272526p2272706.html
Sent from the Nutch - User mailing list archive at Nabble.com.