Hi,

What is the best way to give Nutch a whitelist (or blacklist) of HTML
classes (or names or IDs) to include (or exclude) before it inserts
data into Lucene?

I ask because we want to index pages from sites, but without the
boilerplate parts of each page, such as the header, menu, and footer.
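
To make the goal concrete, here is a rough sketch of the kind of
filtering we have in mind. It is plain Python using only the standard
library, not Nutch or Lucene code, and the class names, the id, and the
helper names (BlacklistTextExtractor, extract_text) are made up for
illustration. It also assumes well-nested markup, so treat it as a
description of the desired behaviour rather than a proposed
implementation.

```python
from html.parser import HTMLParser

# Hypothetical blacklist; these class names and ids are illustrative only.
BLACKLIST_CLASSES = {"header", "menu", "footer"}
BLACKLIST_IDS = {"nav"}

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlacklistTextExtractor(HTMLParser):
    """Collects text, skipping any element whose class or id is blacklisted."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a blacklisted element
        self.chunks = []

    def _is_blacklisted(self, attrs):
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = set((attr_map.get("class") or "").split())
        return bool(classes & BLACKLIST_CLASSES) or attr_map.get("id") in BLACKLIST_IDS

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a blacklisted element, every nested element is skipped too.
        if self.skip_depth or self._is_blacklisted(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the page text with blacklisted elements left out."""
    parser = BlacklistTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For a page whose header, menu, and footer sit in elements with those
classes or that id, extract_text(page) would return only the remaining
body text, which is what we would like to end up in the index. A
whitelist would be the same idea inverted: keep text only while inside
a listed element.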

Thanks for considering,
-Gavin
