Re: problem with index-more and solr

kaveh minooie Fri, 27 Jan 2012 14:12:26 -0800

sorry, I sent the wrong solrindex-mapping in the previous email. this is the correct one:
<mapping>


<!--
 Simple mapping of fields created by Nutch IndexingFilters
             to fields defined (and expected) in Solr schema.xml.

             Any fields in NutchDocument that match a name defined
             in field/@source will be renamed to the corresponding
             field/@dest.
             Additionally, if a field name (before mapping) matches
             a copyField/@source then its values will be copied to
             the corresponding copyField/@dest.

             uniqueKey has the same meaning as in Solr schema.xml
             and defaults to "id" if not defined.

-->
<fields>
<field dest="cache" source="cache"/>
<field dest="anchor" source="anchor"/>
<field dest="type" source="type"/>
<field dest="contentLength" source="contentLength"/>
<field dest="lastModified" source="lastModified"/>
<field dest="date" source="date"/>
<field dest="lang" source="lang"/>
<field dest="subcollection" source="subcollection"/>
<field dest="plutoz_ranking" source="plutoz_ranking"/>
<field dest="user_ranking" source="user_ranking"/>
<field dest="domain_ranking" source="domain_ranking"/>
<field dest="categories" source="categories"/>
<field dest="author" source="author"/>
<field dest="terms" source="terms"/>
<field dest="publishedDate" source="publishedDate"/>
<field dest="updatedDate" source="updatedDate"/>
<field dest="content" source="content"/>
<field dest="site" source="site"/>
<field dest="title" source="title"/>
<field dest="host" source="host"/>
<field dest="segment" source="segment"/>
<field dest="boost" source="boost"/>
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>
<!--field dest="id" source="url"/-->
<!--copyField source="url" dest="url"/-->
</fields>
<uniqueKey>url</uniqueKey>
</mapping>
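
The renaming rule described in the mapping comment can be illustrated with a small Python sketch. This is not Nutch's implementation: `apply_mapping`, the trimmed `MAPPING` string and the sample document are hypothetical, and passing unmapped field names through unchanged is an assumption, since the comment does not say what happens to them. The `id`/`url` entries (commented out in the mapping above) are used only to show one rename and one copy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed mapping used only for illustration.
MAPPING = """
<mapping>
<fields>
<field dest="contentLength" source="contentLength"/>
<field dest="id" source="url"/>
<copyField source="url" dest="url"/>
</fields>
<uniqueKey>url</uniqueKey>
</mapping>
"""


def apply_mapping(mapping_xml, doc):
    """Rename and copy fields of `doc` (name -> list of values) as the comment describes."""
    root = ET.fromstring(mapping_xml)
    # field/@source -> field/@dest
    renames = {f.get("source"): f.get("dest") for f in root.iter("field")}
    # copyField/@source -> list of copyField/@dest
    copies = {}
    for c in root.iter("copyField"):
        copies.setdefault(c.get("source"), []).append(c.get("dest"))

    out = {}
    for name, values in doc.items():
        # Assumption: a name with no matching field/@source keeps its name.
        out.setdefault(renames.get(name, name), []).extend(values)
        # copyField matches on the name *before* mapping.
        for dest in copies.get(name, []):
            out.setdefault(dest, []).extend(values)
    return out


doc = {"url": ["http://example.com/"], "contentLength": ["0"]}
print(apply_mapping(MAPPING, doc))
```

With this sample, the `url` value ends up under both `id` (rename) and `url` (copy), while `contentLength` keeps its name.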

On 01/27/2012 02:08 PM, kaveh minooie wrote:

thanks for that. so I did that, and content is empty and contentLength is
0. I am not storing the content field in the schema.xml file:

<field name="content" type="text" stored="false" indexed="true"
termVectors="true"/>
<field name="contentLength" type="long" stored="true" indexed="false"/>

do I have to store content to be able to get the size?

this is the result of the indexchecker btw:

kaveh@index9:~/build/nutch/runtime/local$ bin/nutch indexchecker http://www.hafingtonpost.com
fetching: http://www.hafingtonpost.com
parsing: http://www.hafingtonpost.com
contentType: text/html
user_ranking : 25.0
content :
domain_ranking : 10.0
host : www.hafingtonpost.com
plutoz_ranking : 0.0
terms :
categories :
tstamp : Fri Jan 27 13:56:39 PST 2012
type : text/html
type : text
type : html
date : Fri Jan 27 13:56:39 PST 2012
contentLength : 0
url : http://www.hafingtonpost.com


also, while we are at it, some of the other fields here do not appear in
the solr result either, such as user_ranking, domain_ranking, date,
contentLength and terms. content is not there either, but no surprise
there. I know solr does not include empty fields, for example terms,
but I don't know why user_ranking and domain_ranking do not show up:



here are my schema and solrindex-mapping files:

<schema name="nutch" version="1.1">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"
omitNorms="true"/>
<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<fieldType name="float" class="solr.FloatField" omitNorms="true"/>
<fieldType name="integer" class="solr.IntField" omitNorms="true"/>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!--<charFilter class="solr.HTMLStripCharFilterFactory"/>-->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="ContentSynonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url" class="solr.StrField" positionIncrementGap="100">
<!--<analyzer>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>-->
</fieldType>
<fieldType name="url2" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.LetterTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LetterTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url3" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="textSpell" class="solr.TextField"
positionIncrementGap="100" stored="false" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="id" type="string" stored="true" indexed="true"/>
<field name="segment" type="string" stored="true" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="dupid" type="string" stored="true" indexed="false"/>
<field name="host" type="url" stored="true" indexed="true"/>
<field name="host2" type="url2" stored="false" indexed="true"/>
<field name="host3" type="url3" stored="false" indexed="true"/>
<field name="site" type="string" stored="false" indexed="true"/>
<field name="url" type="url" stored="true" indexed="true" required="true"/>
<field name="url2" type="url3" stored="false" indexed="true"/>
<field name="content" type="text" stored="false" indexed="true"
termVectors="true"/>
<field name="title" type="text" stored="false" indexed="true"
termVectors="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="tstamp" type="long" stored="true" indexed="true"/>
<field name="anchor" type="string" stored="false" indexed="true"
multiValued="true"/>
<field name="type" type="string" stored="true" indexed="true"
multiValued="true"/>
<field name="contentLength" type="long" stored="true" indexed="false"/>
<field name="lastModified" type="long" stored="true" indexed="false"/>
<field name="date" type="integer" stored="false" indexed="false"/>
<field name="lang" type="string" stored="false" indexed="true"/>
<field name="subcollection" type="string" stored="true" indexed="true"/>
<field name="plutoz_ranking" type="float" stored="true" indexed="true"/>
<field name="user_ranking" type="float" stored="false" indexed="true"/>
<field name="domain_ranking" type="float" stored="false" indexed="true"/>
<field name="categories" type="string" stored="true" indexed="true"
multiValued="true"/>
<field name="categories2" type="text" stored="false" indexed="true"
multiValued="true"/>
<field name="terms" type="text" stored="false" indexed="true"
multiValued="true"/>
<field name="author" type="string" stored="false" indexed="true"/>
<field name="tag" type="string" stored="false" indexed="true"/>
<field name="feed" type="string" stored="false" indexed="true"/>
<field name="publishedDate" type="integer" stored="true" indexed="false"/>
<field name="updatedDate" type="integer" stored="false" indexed="true"/>
<field name="a_spell" type="textSpell" multiValued="true"/>
</fields>
<uniqueKey>url</uniqueKey>
<defaultSearchField>title</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
<copyField source="url" dest="id"/>
<copyField source="url" dest="url2"/>
<copyField source="host" dest="host2"/>
<copyField source="host" dest="host3"/>
<copyField source="categories" dest="categories2"/>
<copyField source="terms" dest="a_spell"/>
<copyField source="title" dest="a_spell"/>
<copyField source="content" dest="a_spell"/>
</schema>
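
One thing worth checking in a schema like the one above is the `stored` flag: Solr returns only fields declared `stored="true"` in search results, so a field that is indexed but not stored is searchable yet never shows up in a result document. A minimal Python sketch, assuming a trimmed copy of the `<fields>` section (the `SCHEMA` string and `split_by_stored` are hypothetical names), that lists which fields can appear in results:

```python
import xml.etree.ElementTree as ET

# Trimmed subset of the <fields> section from the schema above.
SCHEMA = """
<schema name="nutch" version="1.1">
<fields>
<field name="contentLength" type="long" stored="true" indexed="false"/>
<field name="date" type="integer" stored="false" indexed="false"/>
<field name="plutoz_ranking" type="float" stored="true" indexed="true"/>
<field name="user_ranking" type="float" stored="false" indexed="true"/>
<field name="domain_ranking" type="float" stored="false" indexed="true"/>
<field name="terms" type="text" stored="false" indexed="true" multiValued="true"/>
</fields>
</schema>
"""


def split_by_stored(schema_xml):
    """Return (returned, hidden): field names with stored="true" / stored="false"."""
    root = ET.fromstring(schema_xml)
    returned, hidden = [], []
    for field in root.iter("field"):
        stored = field.get("stored")
        if stored is None:
            # No explicit attribute: the value is inherited from the field
            # type, which this sketch does not resolve.
            continue
        (returned if stored.lower() == "true" else hidden).append(field.get("name"))
    return returned, hidden


returned, hidden = split_by_stored(SCHEMA)
print("can appear in results:", returned)
print("never returned:", hidden)
```

In this subset, `user_ranking`, `domain_ranking`, `date` and `terms` land in the second list, which is one possible explanation for their absence from the Solr response; it does not explain the empty content or the contentLength of 0 reported by indexchecker.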


solrindex-mapping:

<mapping>
<!-- Simple mapping of fields created by Nutch IndexingFilters
to fields defined (and expected) in Solr schema.xml.

Any fields in NutchDocument that match a name defined
in field/@source will be renamed to the corresponding
field/@dest.
Additionally, if a field name (before mapping) matches
a copyField/@source then its values will be copied to
the corresponding copyField/@dest.

uniqueKey has the same meaning as in Solr schema.xml
and defaults to "id" if not defined.
-->
<fields>
<!--field dest="cache" source="cache"/>
<field dest="anchor" source="anchor"/>
<field dest="type" source="type"/>
<field dest="contentLength" source="contentLength"/>
<field dest="lastModified" source="lastModified"/>
<field dest="date" source="date"/>
<field dest="lang" source="lang"/>
<field dest="subcollection" source="subcollection"/>
<field dest="plutoz_ranking" source="plutoz_ranking"/>
<field dest="user_ranking" source="user_ranking"/>
<field dest="domain_ranking" source="domain_ranking"/>
<field dest="categories" source="categories"/>
<field dest="author" source="author"/>
<field dest="terms" source="terms"/>
<field dest="publishedDate" source="publishedDate"/>
<field dest="updatedDate" source="updatedDate"/-->

<field dest="content" source="content"/>
<field dest="site" source="site"/>
<field dest="title" source="title"/>
<field dest="host" source="host"/>
<field dest="segment" source="segment"/>
<field dest="boost" source="boost"/>
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>
<field dest="id" source="url"/>
<copyField source="url" dest="url"/>
</fields>
<uniqueKey>url</uniqueKey>
</mapping>







On 01/27/2012 01:41 PM, Markus Jelsma wrote:

Strange, the defaults work when the plugin is added to plugin.includes. You
can use the indexchecker tool to make sure all is well on Nutch's side.

I apologize if this is a stupid question, but I have been googling and I
haven't been able to find anything useful. I am trying to use the
index-more plugin to get the contentLength and date. I have put them in
nutch-site.xml and I can see that index-more gets loaded (in the hadoop.log
file), but after indexing I can't see those fields in the result. both of
them are in the schema.xml file and in solrindex-mapping.xml. I'd
appreciate any hint you guys can spare.

btw I am using nutch revision 1236417

thanks,


--
Kaveh Minooie

www.plutoz.com
