Hi Dave,

The boilerplate removal (boilerpipe) works if parse-tika is used for parsing.
The parser.html.NodesToExclude property, however, belongs to a feature which
never made it into the code base, so in an unpatched release it has no effect.
See
  https://issues.apache.org/jira/browse/NUTCH-585

Or do you work with a patched version?
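
For reference, here is a rough sketch of what the nutch-site.xml side could look like when parse-tika does the parsing. The plugin.includes value is only an assumption modelled on a typical Nutch 1.x default with parse-(html|tika) narrowed to parse-tika, so adapt it to the plugins you actually use. Depending on your setup you may also need to check that conf/parse-plugins.xml routes text/html to parse-tika.

```xml
<!-- Sketch only: the plugin list below is an assumed example, not your actual setup. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-tika is listed instead of parse-html so that HTML goes through Tika -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- enables boilerpipe text extraction inside parse-tika -->
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
</property>
```

The two tika.extractor properties are the ones you already have; only the parser selection differs.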

Best,
Sebastian


On 9/12/19 9:24 PM, Dave Beckstrom wrote:
> Hi All,
> 
> I'm running Nutch 1.15.
> 
> In my nutch-site.xml I configured the parameters below. Specifically, under
> parser.html.NodesToExclude I'm telling it not to index "div id=sidebar" or
> "div id=footer", and yet it continues to index those regions on the page.
> 
> Does anyone have suggestions on why this isn't working and what I should do
> to resolve this?
> 
> Thank you!
> 
> 
> 
> 
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>  <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor or CanolaExtractor.
>   </description>
> </property>
> <property>
>     <name>parser.html.NodesToExclude</name>
>     <value>div;id;sidebar|div;id;footer</value>
>     <description>
>       A list of nodes whose content will not be indexed separated by "|".
>       Use this to tell the HTML parser to ignore, for example, site
>       navigation text.
> 
>       Each node has three elements, separated by semi-colon:
>       the first one is the tag name,
>       the second one the attribute name,
>       the third one the value of the attribute.
> 
>       Example: table;summary;header|div;id;navigation
> 
>       Note that nodes with these attributes, and their children, will be
>       silently ignored by the parser so verify the indexed content
>       with Luke to confirm results.
>     </description>
>   </property>
> 
> 
> 
> 
> Regards,
> 
> Dave Beckstrom
> Technical Delivery Manager / Senior Developer
> em: dbeckst...@collectivefls.com <aha...@collectivefls.com>
> ph: 763.323.3499
> 
