Re: parser.html.NodesToExclude

2019-09-12 Thread Sebastian Nagel
Hi Dave,

the boilerplate removal (boilerpipe) works if parse-tika is used for parsing,
but the parser.html.NodesToExclude property belongs to a feature which never
made it into the code base, see
  https://issues.apache.org/jira/browse/NUTCH-585

Or do you work with a patched version?
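
For reference, a minimal sketch of routing HTML through parse-tika so that the
tika.extractor settings are actually applied. It assumes a stock
conf/parse-plugins.xml and that plugin.includes in your nutch-site.xml still
activates parse-tika; both are assumptions about your setup, so please check
them against your installation:

```xml
<!-- conf/parse-plugins.xml (sketch, assumes the stock file layout) -->
<!-- Map HTML to parse-tika instead of parse-html so that Boilerpipe
     is applied to these documents. -->
<mimeType name="text/html">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
  <plugin id="parse-tika" />
</mimeType>
```

Afterwards you could re-parse one of the affected pages (e.g. with
bin/nutch parsechecker) and look at the extracted text. Note that Boilerpipe
decides heuristically what counts as boilerplate, it does not target specific
ids such as "sidebar" or "footer".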

Best,
Sebastian


On 9/12/19 9:24 PM, Dave Beckstrom wrote:
> Hi All,
> 
> I'm running NUTCH 1.15.
> 
> In my nutch-site.xml I configured the below parameters and
> specifically under   parser.html.NodesToExclude I'm telling it not to index
> "div id=sidebar" or "div id=footer" and yet it continues to index those
> regions on the page.
> 
> Does anyone have suggestions on why this isn't working and what I should do
> to resolve this?
> 
> Thank you!
> 
> 
> 
> 
> 
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or
>   none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
>   ArticleExtractor or CanolaExtractor.
>   </description>
> </property>
>
> <property>
>   <name>parser.html.NodesToExclude</name>
>   <value>div;id;sidebar|div;id;footer</value>
>   <description>
>   A list of nodes whose content will not be indexed separated by "|".
>   Use this to tell the HTML parser to ignore, for example, site
>   navigation text.
>
>   Each node has three elements, separated by semi-colon:
>   the first one is the tag name,
>   the second one the attribute name,
>   the third one the value of the attribute.
>
>   Example: table;summary;header|div;id;navigation
>
>   Note that nodes with these attributes, and their children, will be
>   silently ignored by the parser so verify the indexed content
>   with Luke to confirm results.
>   </description>
> </property>
> 
> 
> 
> 
> Regards,
> 
> Dave Beckstrom
> Technical Delivery Manager / Senior Developer
> em: dbeckst...@collectivefls.com 
> ph: 763.323.3499
> 



parser.html.NodesToExclude

2019-09-12 Thread Dave Beckstrom
Hi All,

I'm running NUTCH 1.15.

In my nutch-site.xml I configured the below parameters and
specifically under   parser.html.NodesToExclude I'm telling it not to index
"div id=sidebar" or "div id=footer" and yet it continues to index those
regions on the page.

Does anyone have suggestions on why this isn't working and what I should do
to resolve this?

Thank you!





<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or
  none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>

<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar|div;id;footer</value>
  <description>
  A list of nodes whose content will not be indexed separated by "|".
  Use this to tell the HTML parser to ignore, for example, site
  navigation text.

  Each node has three elements, separated by semi-colon:
  the first one is the tag name,
  the second one the attribute name,
  the third one the value of the attribute.

  Example: table;summary;header|div;id;navigation

  Note that nodes with these attributes, and their children, will be
  silently ignored by the parser so verify the indexed content
  with Luke to confirm results.
  </description>
</property>




Regards,

Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: dbeckst...@collectivefls.com 
ph: 763.323.3499

-- 
*Fig Leaf Software is now Collective FLS, Inc.*
*Collective FLS, Inc.* 

https://www.collectivefls.com/