CLASSIFICATION: UNCLASSIFIED

Are you suggesting I should remove the index.metadata property completely or just supply no value?

Thanks,
Kris

~~~~~~~~~~~~~~~~~~~~~~~~~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.
US Army Research Lab
Aberdeen Proving Ground
Application Management & Development Branch
410-278-7251
[email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~

-----Original Message-----
From: BlackIce [mailto:[email protected]]
Sent: Friday, September 09, 2016 9:31 AM
To: [email protected]
Subject: [Non-DoD Source] Re: indexing metatags with Nutch 1.12

I had a similar problem. It took me days to figure out, and I can't remember exactly what was going on, but it was some sort of conflict between parameters in nutch-site.xml. I think you need to leave this BLANK:

<property>
  <name>index.metadata</name>
  <value>description,keywords</value>
</property>

My set-up (Nutch 1.11):

nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|headings|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- index-metadata plugin properties -->
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords,h1,h2,h3,h4,h5,h6,metatag.title</value>
</property>

<!-- parse-metatags plugin properties -->
<property>
  <name>metatags.names</name>
  <value>description,keywords,title,h1,h2,h3,h4,h5,h6</value>
</property>

On Fri, Sep 9, 2016 at 3:00 PM, BlackIce <[email protected]> wrote:

> I had a similar problem once.. it was some stupid syntax thing, lemme
> check my setup....
>
> On Fri, Sep 9, 2016 at 2:46 PM, KRIS MUSSHORN <[email protected]>
> wrote:
>
>> Looks like this is NOT in fact working.
>>
>> How do I get the metatags into Solr?
>>
>> I have a webpage at
>> https://snip/inside/directorates/cisd/asset.cfm that has this in its
>> source:
>>
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>>   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> <html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
>> <title>Asset Control and Behavior Branch</title>
>> <meta name="keywords" content="Computational and Information Sciences, CISD, Tokarcik, research, data fusion, knowledge management, battlespace weather, environmental effects, computational science and engineering, battlefield communications and networks ">
>> <meta name="description" content="This page explains the CISD mission and hosts the biographies of the CISD Director and Deputy Director.">
>>
>> The parse-metatags plugin is set up in nutch-site.xml as
>> parse-(html|tika|metatags)
>>
>> Solr schema.xml is correctly set up to receive the metatags:
>>
>> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory" />
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory" />
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>   </analyzer>
>> </fieldType>
>>
>> <field name="metatag.description" type="text_general" stored="true" indexed="true" default="none" />
>> <field name="metatag.keywords" type="text_general" stored="true" indexed="true" default="none" />
>> <field name="metatag.date" type="text_general" stored="true" indexed="true" default="none" />
>>
>> After indexing the document, Solr shows:
>>
>> "title": "Asset Control and Behavior Branch",
>> "metatag.date": "none",
>> "metatag.description": "none",
>> "metatag.keywords": "none"
>>
>> How do I get a Solr result of:
>>
>> "title": "Asset Control and Behavior Branch",
>> "metatag.date": "none",
>> "metatag.description": "This page explains the CISD mission and hosts the biographies of the CISD Director and Deputy Director.",
>> "metatag.keywords": "Computational and Information Sciences, CISD, Tokarcik, research, data fusion, knowledge management, battlespace weather, environmental effects, computational science and engineering, battlefield communications and networks"
>>
>> Kris
>>
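
For reference, the settings discussed in this thread can be condensed into one nutch-site.xml fragment. This is only a sketch, trimmed from the Nutch 1.11 set-up quoted in the thread down to the description and keywords tags; it has not been checked against the Nutch 1.12 install in question. It assumes that parse-metatags stores each tag listed in metatags.names under a "metatag."-prefixed key, which is why index.parse.md and the Solr field names use metatag.description and metatag.keywords. The index.metadata property does not appear here because the quoted working set-up does not define it; whether it must be removed or merely left empty is not settled in the thread.

```xml
<!-- Sketch only: trimmed from the set-up quoted in this thread, not a verified configuration. -->

<!-- Enable parse-metatags and index-metadata. Keep the value on a single line:
     whitespace inside the pattern may keep a plugin from matching. -->
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- parse-metatags: which <meta name="..."> tags to extract at parse time. -->
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
</property>

<!-- index-metadata: which parse-metadata keys to copy into the indexed document.
     Assumption: the keys carry the "metatag." prefix, matching the Solr field names. -->
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
</property>
```

With a fragment like this, the page would need to be parsed and indexed again before the Solr fields could show anything other than their "none" default, since the metatag values are collected at parse time.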

