Thanks.
FYI, here is the entire value of plugin.includes:
protocol-http|urlfilter-(regex|suffix)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
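
Note: as far as I understand the parse-metatags and index-metadata plugins, listing them in plugin.includes may not be enough on its own. Two further nutch-site.xml properties control which meta tags are extracted and which parse-metadata keys are passed to the indexer. A minimal sketch, assuming only description and keywords are wanted (these values are an assumption, not taken from the setup in this thread):

```xml
<!-- nutch-site.xml: sketch only; the values below are assumptions -->
<property>
  <name>metatags.names</name>
  <!-- meta tags for parse-metatags to extract; keys get a "metatag." prefix -->
  <value>description,keywords</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- parse-metadata keys that index-metadata copies into the indexed document -->
  <value>metatag.description,metatag.keywords</value>
</property>
```

Since the meta tags are picked up at parse time, already-parsed segments would probably need to be parsed again (or the pages re-crawled) before the new fields appear in Solr.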
 

----- Original Message -----

From: "BlackIce" <[email protected]> 
To: [email protected] 
Sent: Friday, September 9, 2016 9:00:12 AM 
Subject: Re: indexing metatags with Nutch 1.12 

I had a similar problem once.. it was some stupid syntax thing, lemme 
check my setup.... 

On Fri, Sep 9, 2016 at 2:46 PM, KRIS MUSSHORN <[email protected]> wrote: 

> Looks like this is NOT in fact working. 
> 
> How do I get the metatags into Solr? 
> 
> I have a webpage at https://snip/inside/directorates/cisd/asset.cfm that 
> has this in its source: 
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
> <html xmlns="http://www.w3.org/1999/xhtml"> 
> <head> 
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
> <title>Asset Control and Behavior Branch</title> 
> <meta name="keywords" content="Computational and Information Sciences, 
> CISD, Tokarcik, research, data fusion, knowledge management, battlespace 
> weather, environmental effects, computational science and engineering, 
> battlefield communications and networks "> 
> <meta name="description" content="This page explains the CISD mission and 
> hosts the biographies of the CISD Director and Deputy Director."> 
> 
> The parse-metatags plugin is set up in nutch-site.xml as 
> parse-(html|tika|metatags) 
> 
> Solr schema.xml is correctly set up to receive the metatags: 
> <fieldType name="text_general" class="solr.TextField" 
> positionIncrementGap="100"> 
> <analyzer type="index"> 
> <tokenizer class="solr.StandardTokenizerFactory" /> 
> <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" /> 
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
> ignoreCase="true" expand="false" /> 
> <filter class="solr.LowerCaseFilterFactory" /> 
> </analyzer> 
> <analyzer type="query"> 
> <tokenizer class="solr.StandardTokenizerFactory" /> 
> <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" /> 
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
> ignoreCase="true" expand="true" /> 
> <filter class="solr.LowerCaseFilterFactory" /> 
> </analyzer> 
> </fieldType> 
> 
> <field name="metatag.description" type="text_general" stored="true" 
> indexed="true" default="none" /> 
> <field name="metatag.keywords" type="text_general" stored="true" 
> indexed="true" default="none" /> 
> <field name="metatag.date" type="text_general" stored="true" 
> indexed="true" default="none" /> 
> 
> After indexing the document, Solr shows: 
> "title": "Asset Control and Behavior Branch", 
> "metatag.date": "none", 
> "metatag.description": "none", 
> "metatag.keywords": "none" 
> 
> How do I get a Solr result of: 
> "title": "Asset Control and Behavior Branch", 
> "metatag.date": "none", 
> "metatag.description": "This page explains the CISD mission and hosts 
> the biographies of the CISD Director and Deputy Director.", 
> "metatag.keywords": "Computational and Information Sciences, CISD, 
> Tokarcik, research, data fusion, knowledge management, battlespace weather, 
> environmental effects, computational science and engineering, battlefield 
> communications and networks" 
> 
> Kris 
> 
