Hi, what is the best way to give Nutch a whitelist (or blacklist) of HTML classes (or names or IDs) to include (or exclude) before the data is inserted into Lucene?
I ask because we want to index pages from sites, but without much of each page, such as the header, menu, and footer. Thanks for considering, -Gavin

