Re: metatags missing with parse-html

2019-10-14 Thread Sebastian Nagel
Hi Dave,

could you share an example document? Which Nutch version is used?

I tried to reproduce the problem without success using Nutch v1.16:

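- example document:

<html>
<head>
<title>Test metatags</title>
<meta name="language" content="en" />
<meta name="subject" content="test" />
<meta name="category" content="meta data" />
</head>
<body>
test for metatag extraction
</body>
</html>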

- using parse-html (works)

> bin/nutch indexchecker -Dmetatags.names='*' \
   -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
   -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
   http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp :Mon Oct 14 13:24:14 CEST 2019
metatag.language :  en
metatag.language :  en
metatag.category :  meta data
metatag.category :  meta data
digest :50d08494ba791bb52fcdeebfc08ba640
host :  localhost
metatag.subject :   test
metatag.subject :   test
id :http://localhost/nutch/test_metatags.html
title : Test metatags
url :   http://localhost/nutch/test_metatags.html
content :   Test metatags
test for metatag extraction

- using parse-tika (works)

> bin/nutch indexchecker -Dmetatags.names='*' \
   -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
   -Dplugin.includes='protocol-http|parse-(tika|metatags)|index-(basic|metadata)' \
   http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp :Mon Oct 14 13:25:34 CEST 2019
metatag.language :  en
metatag.language :  en
metatag.category :  meta data
metatag.category :  meta data
digest :50d08494ba791bb52fcdeebfc08ba640
host :  localhost
metatag.subject :   test
metatag.subject :   test
id :http://localhost/nutch/test_metatags.html
title : Test metatags
url :   http://localhost/nutch/test_metatags.html
content :   Test metatags
test for metatag extraction
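
To check outside of Nutch which meta tags a page actually carries, a small standalone script can help. The following is only a rough Python sketch, not how parse-metatags is implemented: the `metatag.` prefix and the lowercased names merely mimic the field names in the indexchecker output above, and `SAMPLE_HTML`, `MetaTagCollector` and `extract_metatags` are hypothetical names with an assumed stand-in page.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs under metatag.<name> keys."""

    def __init__(self):
        super().__init__()
        self.metatags = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metatags["metatag." + name.lower()] = content


# Assumption: a stand-in for the page under test, not the real document.
SAMPLE_HTML = """
<html>
<head>
<title>Test metatags</title>
<meta name="language" content="en" />
<meta name="subject" content="test" />
<meta name="category" content="meta data" />
</head>
<body>test for metatag extraction</body>
</html>
"""


def extract_metatags(html):
    """Return a dict of metatag.<name> -> content for the given HTML text."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.metatags


if __name__ == "__main__":
    for key, value in sorted(extract_metatags(SAMPLE_HTML).items()):
        print(key, ":", value)
```

If the script finds the tags but Nutch does not index them, the page markup is probably fine and the plugin configuration is the more likely place to look.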


There are currently two open issues around metatags:
 https://issues.apache.org/jira/browse/NUTCH-1559
 https://issues.apache.org/jira/browse/NUTCH-2525

Maybe it's related to one of those?


Best,
Sebastian


On 11.10.19 22:38, Dave Beckstrom wrote:
> Hi Everyone,
> 
> It seems like I take 1 step forward and 2 steps backwards.
> 
> I was using parse-tika and I needed to change to parse-html in order to use
> a plug-in for excluding content such as headers and footers.
> 
> I have the excludes working with the plug-in.  But now I see that all of
> the metatags are missing from Solr.  The metatag fields are defined in Solr
> but not populated.
> 
> Metatags were working prior to the change to parse-html.  What would
> explain the metatags not being indexed when the configuration
> parameters didn't change?  Is there some other setting for parse-html that
> I need to look into?
> 
> Thanks!
> 
> 
> <property>
>   <name>plugin.includes</name>
>   <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
> </property>
>
> <property>
>   <name>metatags.names</name>
>   <value>*</value>
> </property>
>
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.language,metatag.subject,metatag.category</value>
> </property>
> 
> 
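
The plugin.includes value quoted above is a regular expression over plugin ids, so a quick way to confirm which plugins it would activate is to test a few ids against it. This is a minimal Python sketch under the assumption that Nutch requires the whole plugin id to match the expression; `is_included` is a hypothetical helper, and Python's `re` stands in for the Java regex engine, which should behave the same for a simple pattern like this one.

```python
import re

# plugin.includes value copied from the configuration quoted above.
PLUGIN_INCLUDES = (
    "exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)"
    "|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)"
    "|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)"
    "|index-blacklist-whitelist"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Ids relevant to metatag extraction, plus one that should not match.
    for plugin_id in ("parse-html", "parse-tika", "parse-metatags",
                      "index-metadata", "parse-js"):
        print(plugin_id, is_included(plugin_id))
```

With this expression parse-html, parse-tika, parse-metatags and index-metadata all match, so the expression itself does not exclude the plugins needed for metatags.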



metatags missing with parse-html

2019-10-11 Thread Dave Beckstrom
Hi Everyone,

It seems like I take 1 step forward and 2 steps backwards.

I was using parse-tika and I needed to change to parse-html in order to use
a plug-in for excluding content such as headers and footers.

I have the excludes working with the plug-in.  But now I see that all of
the metatags are missing from Solr.  The metatag fields are defined in Solr
but not populated.

Metatags were working prior to the change to parse-html.  What would
explain the metatags not being indexed when the configuration
parameters didn't change?  Is there some other setting for parse-html that
I need to look into?

Thanks!


 
<property>
  <name>plugin.includes</name>
  <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
</property>

<property>
  <name>metatags.names</name>
  <value>*</value>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.language,metatag.subject,metatag.category</value>
</property>

