Hi All,
I'm running Nutch 1.15.
In my nutch-site.xml I configured the parameters below. Specifically, under
parser.html.NodesToExclude I'm telling it not to index <div id="sidebar"> or
<div id="footer">, and yet Nutch continues to index those regions of the
page.
Does anyone have suggestions on why this isn't working and what I should do
to resolve it?
Thank you!
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
    Which text extraction algorithm to use. Valid values are: boilerpipe or
    none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
    Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
    ArticleExtractor or CanolaExtractor.
  </description>
</property>
<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar|div;id;footer</value>
  <description>
    A list of nodes whose content will not be indexed separated by "|".
    Use this to tell the HTML parser to ignore, for example, site
    navigation text.
    Each node has three elements, separated by semi-colon:
    the first one is the tag name,
    the second one the attribute name,
    the third one the value of the attribute.
    Example: table;summary;header|div;id;navigation
    Note that nodes with these attributes, and their children, will be
    silently ignored by the parser so verify the indexed content
    with Luke to confirm results.
  </description>
</property>
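To show how I am reading the value format described above, here is a small
standalone Java sketch that splits the value into tag/attribute/value rules.
It is only my interpretation of the description and is not taken from the
Nutch source; the class name NodesToExcludeSketch and the method parseRules
are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: illustrates the documented
// "tag;attribute;value|tag;attribute;value" format only.
public class NodesToExcludeSketch {

    // Splits the property value on "|" into rules, then each rule on ";"
    // into exactly three elements: tag name, attribute name, attribute value.
    static List<String[]> parseRules(String value) {
        List<String[]> rules = new ArrayList<>();
        for (String rule : value.split("\\|")) {
            String[] parts = rule.trim().split(";");
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                    "Expected tag;attribute;value but got: " + rule);
            }
            rules.add(parts);
        }
        return rules;
    }

    public static void main(String[] args) {
        // The value from my nutch-site.xml
        for (String[] r : parseRules("div;id;sidebar|div;id;footer")) {
            System.out.println(
                "tag=" + r[0] + " attribute=" + r[1] + " value=" + r[2]);
        }
    }
}
```

Under this reading my value yields two rules, one for the sidebar div and one
for the footer div, so I believe the value itself is well formed.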
Regards,
Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: [email protected]
ph: 763.323.3499
--
Collective FLS, Inc. (formerly Fig Leaf Software)
https://www.collectivefls.com/