I have the following in my nutch-site.xml, yet boilerplate removal isn't working well. Many webpages (including ones from reputable sources such as reuters.com) still come through with sidebars and other junk that was not removed. Any suggestions from the experts?
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<!-- tika properties to use BoilerPipe, according to Marcus Jelsma -->
<property>
  <name>tika.use_boilerpipe</name>
  <value>true</value>
</property>
<property>
  <name>tika.boilerpipe.extractor</name>
  <value>ArticleExtractor</value>
</property>

