Hi Dave,

the boilerplate removal (boilerpipe) works if parse-tika is used for parsing, but the parser.html.NodesToExclude property belongs to a feature which never made it into the code base, see
https://issues.apache.org/jira/browse/NUTCH-585
Or do you work with a patched version?

Best,
Sebastian

On 9/12/19 9:24 PM, Dave Beckstrom wrote:
> Hi All,
>
> I'm running NUTCH 1.15.
>
> In my nutch-site.xml I configured the below parameters and
> specifically under parser.html.NodesToExclude I'm telling it not to index
> "div id=sidebar" or "div id=footer" and yet it continues to index those
> regions on the page.
>
> Does anyone have suggestions on why this isn't working and what I should do
> to resolve this?
>
> Thank you!
>
>
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or
>   none.
>   </description>
> </property>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> <property>
>   <name>parser.html.NodesToExclude</name>
>   <value>div;id;sidebar|div;id;footer</value>
>   <description>
>   A list of nodes whose content will not be indexed separated by "|".
>   Use this to tell the HTML parser to ignore, for example, site
>   navigation text.
>
>   Each node has three elements, separated by semi-colon:
>   the first one is the tag name,
>   the second one the attribute name,
>   the third one the value of the attribute.
>
>   Example: table;summary;header|div;id;navigation
>
>   Note that nodes with these attributes, and their children, will be
>   silently ignored by the parser so verify the indexed content
>   with Luke to confirm results.
>   </description>
> </property>
>
>
> Regards,
>
> Dave Beckstrom
> Technical Delivery Manager / Senior Developer
> em: dbeckst...@collectivefls.com <aha...@collectivefls.com>
> ph: 763.323.3499
>
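
P.S. To make the parse-tika point concrete, here is a minimal sketch of a nutch-site.xml fragment that activates parse-tika in place of parse-html, so that the tika.extractor settings quoted above can take effect. The plugin list is only an assumption modelled on a typical Nutch 1.x setup, not your actual configuration: keep your own protocol, index and indexer plugins and change only the parse-* part. Depending on the setup, the text/html mapping in conf/parse-plugins.xml may also need to point to parse-tika instead of parse-html.

```xml
<!-- Sketch only: the plugin list below is an assumed example, adapt it to your own. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-tika replaces parse-(html|tika) so that HTML is parsed by Tika -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

After changing the parser, it is worth re-parsing a page that contains the sidebar and footer and checking the extracted text before re-indexing.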