Hi All,

I'm running Nutch 1.15.

In my nutch-site.xml I configured the properties below. In particular,
parser.html.NodesToExclude is set to exclude "div id=sidebar" and
"div id=footer", yet Nutch continues to index those regions of the page.

Does anyone have suggestions on why this isn't working and what I should do
to resolve this?

Thank you!




<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>
<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar|div;id;footer</value>
  <description>
  A list of nodes whose content will not be indexed separated by "|".
  Use this to tell the HTML parser to ignore, for example, site
  navigation text.

  Each node has three elements, separated by semi-colon:
  the first one is the tag name,
  the second one the attribute name,
  the third one the value of the attribute.

  Example: table;summary;header|div;id;navigation

  Note that nodes with these attributes, and their children, will be
  silently ignored by the parser so verify the indexed content
  with Luke to confirm results.
  </description>
</property>
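
To make sure I'm reading the value format correctly, here is a minimal Python sketch of how I understand the description above: entries separated by "|", each entry being tag;attribute;value. This is only an illustration of the syntax, not Nutch's actual parsing code; the names parse_nodes_to_exclude and ExcludedNode are hypothetical and exist only for this example.

```python
from typing import List, NamedTuple


class ExcludedNode(NamedTuple):
    """One exclusion rule: tag name, attribute name, attribute value."""
    tag: str
    attribute: str
    value: str


def parse_nodes_to_exclude(raw: str) -> List[ExcludedNode]:
    """Split a 'tag;attr;value|tag;attr;value' string into its rules.

    Hypothetical helper: it mirrors the format in the property
    description, not the code Nutch itself runs.
    """
    rules = []
    for entry in raw.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty segments such as a trailing "|"
        parts = [p.strip() for p in entry.split(";")]
        if len(parts) != 3:
            raise ValueError(f"expected tag;attribute;value, got {entry!r}")
        rules.append(ExcludedNode(*parts))
    return rules


# The value from my nutch-site.xml
rules = parse_nodes_to_exclude("div;id;sidebar|div;id;footer")
for rule in rules:
    print(rule.tag, rule.attribute, rule.value)
```

Read this way, my value should yield two rules, one for the sidebar div and one for the footer div, so as far as I can tell the syntax itself matches the documented format.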




Regards,

Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: dbeckst...@collectivefls.com <aha...@collectivefls.com>
ph: 763.323.3499

-- 
Fig Leaf Software is now Collective FLS, Inc.

https://www.collectivefls.com/
