RE: [Non-DoD Source] Re: indexing metatags with Nutch 1.12 (UNCLASSIFIED)

BlackIce Fri, 09 Sep 2016 07:09:01 -0700

I don't have it at all; the index.metadata property isn't in my nutch-site.xml.
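
For reference, here is a minimal sketch of the two properties that matter once index.metadata is dropped. It assumes parse-metatags and index-metadata are already enabled in plugin.includes, and that parse-metatags exposes each tag under a "metatag." prefix, as in my setup quoted below. The values are trimmed down to the two tags from your page, so treat it as an illustration rather than a tested config:

```xml
<configuration>
  <!-- parse-metatags: which <meta name="..."> tags to pull out of the page -->
  <property>
    <name>metatags.names</name>
    <value>description,keywords</value>
  </property>

  <!-- index-metadata: which parse metadata keys to send to the indexer.
       Assumption: the keys carry the "metatag." prefix, so they line up
       with the metatag.description and metatag.keywords fields in your
       Solr schema.xml. -->
  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords</value>
  </property>

  <!-- No index.metadata property anywhere in the file. -->
</configuration>
```

After changing these you would most likely need to re-parse and re-index the page before Solr shows anything other than the "none" defaults.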

On Sep 9, 2016 3:42 PM, "Musshorn, Kris T CTR USARMY RDECOM ARL (US)" <[email protected]> wrote:


> CLASSIFICATION: UNCLASSIFIED
>
> Are you suggesting I should remove the index.metadata property completely
> or just supply no value?
>
> Thanks,
> Kris
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> Kris T. Musshorn
> FileMaker Developer - Contractor – Catapult Technology Inc.
> US Army Research Lab
> Aberdeen Proving Ground
> Application Management & Development Branch
> 410-278-7251
> [email protected]
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> -----Original Message-----
> From: BlackIce [mailto:[email protected]]
> Sent: Friday, September 09, 2016 9:31 AM
> To: [email protected]
> Subject: [Non-DoD Source] Re: indexing metatags with Nutch 1.12
>
> I had a similar problem that took me days to figure out. I can't remember
> exactly what was going on, but it was some sort of conflict between
> parameters in nutch-site.xml. I think you need to leave this BLANK:
>
> <property>
>   <name>index.metadata</name>
>   <value>description,keywords</value>
> </property>
>
>
> My Set-up (Nutch 1.11):
>
> nutch-site.xml:
>
> <property>
>   <name>plugin.includes</name>
>   <value>nutch-extensionpoints|headings|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> </property>
>
> <!-- index-metadata plugin properties -->
>
> <property>
>   <name>index.parse.md</name>
>   <value>metatag.description,metatag.keywords,h1,h2,h3,h4,h5,h6,metatag.title</value>
>
> </property>
>
>
>
> <!-- parse-metatags plugin properties -->
>
> <property>
>   <name>metatags.names</name>
>   <value>description,keywords,title,h1,h2,h3,h4,h5,h6</value>
>
> </property>
>
> On Fri, Sep 9, 2016 at 3:00 PM, BlackIce <[email protected]> wrote:
>
> > I had a similar problem once... it was some stupid syntax thing, lemme
> > check my setup....
> >
> > On Fri, Sep 9, 2016 at 2:46 PM, KRIS MUSSHORN <[email protected]>
> > wrote:
> >
> >> Looks like this is NOT in fact working.
> >>
> >> How do I get the metatags into Solr?
> >>
> >> I have a web page at
> >> https://snip/inside/directorates/cisd/asset.cfm that has this in its
> >> source:
> >> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >>   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> >> <html xmlns="http://www.w3.org/1999/xhtml">
> >> <head>
> >> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> >> <title>Asset Control and Behavior Branch</title>
> >> <meta name="keywords" content="Computational and Information Sciences,
> >>   CISD, Tokarcik, research, data fusion, knowledge management,
> >>   battlespace weather, environmental effects, computational science and
> >>   engineering, battlefield communications and networks ">
> >> <meta name="description" content="This page explains the CISD mission and
> >>   hosts the biographies of the CISD Director and Deputy Director.">
> >>
> >> The parse-metatags plugin is set up in nutch-site.xml as
> >> parse-(html|tika|metatags)
> >>
> >> Solr schema.xml is correctly set up to receive the metatags:
> >> <fieldType name="text_general" class="solr.TextField"
> >>     positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <tokenizer class="solr.StandardTokenizerFactory" />
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>         words="stopwords.txt" />
> >>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>         ignoreCase="true" expand="false" />
> >>     <filter class="solr.LowerCaseFilterFactory" />
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <tokenizer class="solr.StandardTokenizerFactory" />
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>         words="stopwords.txt" />
> >>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>         ignoreCase="true" expand="true" />
> >>     <filter class="solr.LowerCaseFilterFactory" />
> >>   </analyzer>
> >> </fieldType>
> >>
> >> <field name="metatag.description" type="text_general" stored="true"
> >> indexed="true" default="none" />
> >> <field name="metatag.keywords" type="text_general" stored="true"
> >> indexed="true" default="none" />
> >> <field name="metatag.date" type="text_general" stored="true"
> >> indexed="true" default="none" />
> >>
> >> After indexing the document, Solr shows:
> >> "title": "Asset Control and Behavior Branch",
> >> "metatag.date": "none",
> >> "metatag.description": "none",
> >> "metatag.keywords": "none"
> >>
> >> How do I get a Solr result of:
> >> "title": "Asset Control and Behavior Branch",
> >> "metatag.date": "none",
> >> "metatag.description": "This page explains the CISD mission and hosts the biographies of the CISD Director and Deputy Director.",
> >> "metatag.keywords": "Computational and Information Sciences, CISD, Tokarcik, research, data fusion, knowledge management, battlespace weather, environmental effects, computational science and engineering, battlefield communications and networks"
> >>
> >> Kris
> >>
> >
> >
>
>
> CLASSIFICATION: UNCLASSIFIED
>
