Thanks. FYI, here is the entire line from plugin.includes:

protocol-http|urlfilter-(regex|suffix)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
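
For readers following along, this is roughly how that value would sit in nutch-site.xml. It is only a sketch: the value is copied verbatim from the line above, while the surrounding `<configuration>` and `<property>` wrapper is assumed from the usual Hadoop-style layout of Nutch config files, not taken from the poster's actual file.

```xml
<?xml version="1.0"?>
<!-- Assumed wrapper: standard Hadoop-style configuration layout. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Value copied as-is from the message; it is a regular expression
         matched against plugin names, so parse-(html|tika|metatags)
         enables parse-html, parse-tika and parse-metatags. -->
    <value>protocol-http|urlfilter-(regex|suffix)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```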
----- Original Message -----
From: "BlackIce" <[email protected]>
To: [email protected]
Sent: Friday, September 9, 2016 9:00:12 AM
Subject: Re: indexing metatags with Nutch 1.12

I had a similar problem once. It was some stupid syntax thing, lemme check my setup...

On Fri, Sep 9, 2016 at 2:46 PM, KRIS MUSSHORN <[email protected]> wrote:

> Looks like this is NOT in fact working.
>
> How do I get the metatags into Solr?
>
> I have a webpage @ https://snip/inside/directorates/cisd/asset.cfm that
> has this in source:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> <title>Asset Control and Behavior Branch</title>
> <meta name="keywords" content="Computational and Information Sciences,
> CISD, Tokarcik, research, data fusion, knowledge management, battlespace
> weather, environmental effects, computational science and engineering,
> battlefield communications and networks ">
> <meta name="description" content="This page explains the CISD mission and
> hosts the biographies of the CISD Director and Deputy Director.">
>
> The parse-metatags plugin is set up in nutch-site.xml as
> parse-(html|tika|metatags)
>
> Solr schema.xml is correctly set up to receive the metatags:
>
> <fieldType name="text_general" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="false" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
>
> <field name="metatag.description" type="text_general" stored="true"
>        indexed="true" default="none" />
> <field name="metatag.keywords" type="text_general" stored="true"
>        indexed="true" default="none" />
> <field name="metatag.date" type="text_general" stored="true"
>        indexed="true" default="none" />
>
> After indexing the document, Solr shows:
>
> "title": "Asset Control and Behavior Branch",
> "metatag.date": "none",
> "metatag.description": "none",
> "metatag.keywords": "none"
>
> How do I get a Solr result of:
>
> "title": "Asset Control and Behavior Branch",
> "metatag.date": "none",
> "metatag.description": "This page explains the CISD mission and hosts
> the biographies of the CISD Director and Deputy Director.",
> "metatag.keywords": "Computational and Information Sciences, CISD,
> Tokarcik, research, data fusion, knowledge management, battlespace weather,
> environmental effects, computational science and engineering, battlefield
> communications and networks"
>
> Kris

