I had a similar problem that took me days to figure out. I can't remember
exactly what was going on, but it was some sort of conflict between
parameters in nutch-site.xml. I think you need to leave this BLANK:
<property>
<name>index.metadata</name>
<value>description,keywords</value>
</property>
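For reference, a blank version of that property might look like the sketch below. It is untested and assumes that index.metadata (the name quoted above) is the property present in your nutch-site.xml, and that "blank" means an empty value element rather than removing the property:

```xml
<!-- Sketch only: same property as above, with the value left empty.
     Assumption: an empty <value> is what "blank" means here. -->
<property>
<name>index.metadata</name>
<value></value>
</property>
```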
My set-up (Nutch 1.11):
nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|headings|language-identifier|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<!-- index-metadata plugin properties -->
<property>
<name>index.parse.md</name>
<value>metatag.description,metatag.keywords,h1,h2,h3,h4,h5,h6,metatag.title</value>
</property>
<!-- parse-metatags plugin properties -->
<property>
<name>metatags.names</name>
<value>description,keywords,title,h1,h2,h3,h4,h5,h6</value>
</property>
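On the Solr side, as far as I know index-metadata writes each key listed in index.parse.md to a field of the same name, so schema.xml needs a matching field for every key. A sketch for the extra keys in my list, assuming the text_general type from your schema; multiValued="true" is my assumption, since a page can carry several headings of the same level:

```xml
<!-- Sketch only: Solr fields matching the extra keys in index.parse.md above.
     Assumptions: text_general is the field type already defined in schema.xml;
     multiValued="true" is a guess to allow several headings per page. -->
<field name="metatag.title" type="text_general" stored="true" indexed="true" />
<field name="h1" type="text_general" stored="true" indexed="true" multiValued="true" />
<field name="h2" type="text_general" stored="true" indexed="true" multiValued="true" />
<field name="h3" type="text_general" stored="true" indexed="true" multiValued="true" />
<field name="h4" type="text_general" stored="true" indexed="true" multiValued="true" />
<field name="h5" type="text_general" stored="true" indexed="true" multiValued="true" />
<field name="h6" type="text_general" stored="true" indexed="true" multiValued="true" />
```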
On Fri, Sep 9, 2016 at 3:00 PM, BlackIce <[email protected]> wrote:
> I had a similar problem once... it was some stupid syntax thing, lemme
> check my setup....
>
> On Fri, Sep 9, 2016 at 2:46 PM, KRIS MUSSHORN <[email protected]>
> wrote:
>
>> Looks like this is NOT in fact working.
>>
>> How do I get the metatags into Solr?
>>
>> I have a webpage at https://snip/inside/directorates/cisd/asset.cfm that
>> has this in source:
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>> <html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
>> <title>Asset Control and Behavior Branch</title>
>> <meta name="keywords" content="Computational and Information Sciences,
>> CISD, Tokarcik, research, data fusion, knowledge management, battlespace
>> weather, environmental effects, computational science and engineering,
>> battlefield communications and networks ">
>> <meta name="description" content="This page explains the CISD mission and
>> hosts the biographies of the CISD Director and Deputy Director.">
>>
>> The parse-metatags plugin is set up in nutch-site.xml as
>> parse-(html|tika|metatags)
>>
>> Solr schema.xml is correctly set up to receive the metatags:
>> <fieldType name="text_general" class="solr.TextField"
>> positionIncrementGap="100">
>> <analyzer type="index">
>> <tokenizer class="solr.StandardTokenizerFactory" />
>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords.txt" />
>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> ignoreCase="true" expand="false" />
>> <filter class="solr.LowerCaseFilterFactory" />
>> </analyzer>
>> <analyzer type="query">
>> <tokenizer class="solr.StandardTokenizerFactory" />
>> <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords.txt" />
>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> ignoreCase="true" expand="true" />
>> <filter class="solr.LowerCaseFilterFactory" />
>> </analyzer>
>> </fieldType>
>>
>> <field name="metatag.description" type="text_general" stored="true"
>> indexed="true" default="none" />
>> <field name="metatag.keywords" type="text_general" stored="true"
>> indexed="true" default="none" />
>> <field name="metatag.date" type="text_general" stored="true"
>> indexed="true" default="none" />
>>
>> After indexing the document, Solr shows:
>> "title": "Asset Control and Behavior Branch",
>> "metatag.date": "none",
>> "metatag.description": "none",
>> "metatag.keywords": "none"
>>
>> How do I get a Solr result of:
>> "title": "Asset Control and Behavior Branch",
>> "metatag.date": "none",
>> "metatag.description": "This page explains the CISD mission and hosts
>> the biographies of the CISD Director and Deputy Director.",
>> "metatag.keywords": "Computational and Information Sciences, CISD,
>> Tokarcik, research, data fusion, knowledge management, battlespace weather,
>> environmental effects, computational science and engineering, battlefield
>> communications and networks"
>>
>> Kris
>>
>
>