Hi All, I'm running NUTCH 1.15.
In my nutch-site.xml I configured the below parameters and specifically under parser.html.NodesToExclude I'm telling it not to index "div id=sidebar" or "div id=footer" and yet it continues to index those regions on the page.

Does anyone have suggestions on why this isn't working and what I should do to resolve this? Thank you!

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>

<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar|div;id;footer</value>
  <description>
  A list of nodes whose content will not be indexed separated by "|". Use this
  to tell the HTML parser to ignore, for example, site navigation text. Each
  node has three elements, separated by semi-colon: the first one is the tag
  name, the second one the attribute name, the third one the value of the
  attribute.
  Example: table;summary;header|div;id;navigation
  Note that nodes with these attributes, and their children, will be silently
  ignored by the parser so verify the indexed content with Luke to confirm
  results.
  </description>
</property>

Regards,
Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: dbeckst...@collectivefls.com
ph: 763.323.3499
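
P.S. To make the value format from the property description concrete, here is a minimal Python sketch of how a string like the one I configured breaks down into tag / attribute / value triples. This is not Nutch's own parsing code; the function name parse_nodes_to_exclude is hypothetical and it only follows the format as the description states it ("|" between nodes, ";" between the three elements).

```python
def parse_nodes_to_exclude(value):
    """Split a NodesToExclude-style string into (tag, attribute, value) triples.

    Illustration only: follows the format given in the property description,
    not the actual Nutch implementation.
    """
    triples = []
    for entry in value.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a stray leading or trailing "|"
        parts = [p.strip() for p in entry.split(";")]
        if len(parts) != 3:
            raise ValueError(f"expected tag;attribute;value, got {entry!r}")
        triples.append(tuple(parts))
    return triples


if __name__ == "__main__":
    # The value from my nutch-site.xml
    for tag, attr, val in parse_nodes_to_exclude("div;id;sidebar|div;id;footer"):
        print(f'exclude <{tag} {attr}="{val}"> and its children')
```

Read this way, my value should describe two nodes, the div with id "sidebar" and the div with id "footer", so as far as I can tell the format itself is what the description asks for.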