Re: problem with index-more and solr

kaveh minooie Fri, 27 Jan 2012 14:08:35 -0800

Thanks for that. I ran it, and content is empty and contentLength is 0. I am not storing the content in the schema.xml file:

         <field name="content" type="text" stored="false" indexed="true" termVectors="true"/>

         <field name="contentLength" type="long" stored="true" indexed="false"/>


do I have to store content to be able to get the size?

this is the result of the indexchecker btw:

kaveh@index9:~/build/nutch/runtime/local$ bin/nutch indexchecker http://www.hafingtonpost.com

fetching: http://www.hafingtonpost.com
parsing: http://www.hafingtonpost.com
contentType: text/html
user_ranking :  25.0
content :       
domain_ranking :        10.0
host :  www.hafingtonpost.com
plutoz_ranking :        0.0
terms : 
categories :    
tstamp :        Fri Jan 27 13:56:39 PST 2012
type :  text/html
type :  text
type :  html
date :  Fri Jan 27 13:56:39 PST 2012
contentLength : 0
url :   http://www.hafingtonpost.com

Also, while we are at it, some of the other fields here do not appear in the Solr result either, such as user_ranking, domain_ranking, date, contentLength and terms. content is not there either, but no surprise there. I know Solr does not include empty fields, for example terms, but I don't know why user_ranking and domain_ranking do not show up:




Here are my schema and solrindex-mapping files:

<schema name="nutch" version="1.1">
    <types>

        <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

        <fieldType name="long" class="solr.LongField" omitNorms="true"/>
        <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
        <fieldType name="integer" class="solr.IntField" omitNorms="true"/>

        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">

          <analyzer type="index">
            <!--<charFilter class="solr.HTMLStripCharFilterFactory"/>-->
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>

            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

            <filter class="solr.LowerCaseFilterFactory"/>

            <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
            <filter class="solr.SynonymFilterFactory" synonyms="ContentSynonyms.txt" ignoreCase="true" expand="true"/>

            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>

            <filter class="solr.LowerCaseFilterFactory"/>

            <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
        </fieldType>

        <fieldType name="url" class="solr.StrField" positionIncrementGap="100">

          <!--<analyzer>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>-->       
        </fieldType>

        <fieldType name="url2" class="solr.TextField" positionIncrementGap="100">

          <analyzer type="index">
            <tokenizer class="solr.LetterTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>

            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.LetterTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EnglishPorterFilterFactory"/>

            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
        </fieldType>

        <fieldType name="url3" class="solr.TextField" positionIncrementGap="100">

          <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
          <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
        </fieldType>

        <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" stored="false" omitNorms="true">

         <analyzer type="index">
           <tokenizer class="solr.StandardTokenizerFactory"/>

           <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>

           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.StandardFilterFactory"/>
         </analyzer>
         <analyzer type="query">
           <tokenizer class="solr.StandardTokenizerFactory"/>

           <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>

           <filter class="solr.LowerCaseFilterFactory"/>
           <filter class="solr.StandardFilterFactory"/>
         </analyzer>
        </fieldType>
   </types>
       <fields>
         <field name="id" type="string" stored="true" indexed="true"/>
         <field name="segment" type="string" stored="true" indexed="true"/>
         <field name="digest" type="string" stored="true" indexed="false"/>
         <field name="boost" type="float" stored="true" indexed="false"/>
         <field name="dupid" type="string" stored="true" indexed="false"/>
         <field name="host" type="url" stored="true" indexed="true"/>
         <field name="host2" type="url2" stored="false" indexed="true"/>
         <field name="host3" type="url3" stored="false" indexed="true"/>
         <field name="site" type="string" stored="false" indexed="true"/>

         <field name="url" type="url" stored="true" indexed="true" required="true"/>

         <field name="url2" type="url3" stored="false" indexed="true"/>

         <field name="content" type="text" stored="false" indexed="true" termVectors="true"/>
         <field name="title" type="text" stored="false" indexed="true" termVectors="true"/>

         <field name="cache" type="string" stored="true" indexed="false"/>
         <field name="tstamp" type="long" stored="true" indexed="true"/>

         <field name="anchor" type="string" stored="false" indexed="true" multiValued="true"/>
         <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>

         <field name="contentLength" type="long" stored="true" indexed="false"/>
         <field name="lastModified" type="long" stored="true" indexed="false"/>
         <field name="date" type="integer" stored="false" indexed="false"/>
         <field name="lang" type="string" stored="false" indexed="true"/>
         <field name="subcollection" type="string" stored="true" indexed="true"/>
         <field name="plutoz_ranking" type="float" stored="true" indexed="true"/>
         <field name="user_ranking" type="float" stored="false" indexed="true"/>
         <field name="domain_ranking" type="float" stored="false" indexed="true"/>

         <field name="categories" type="string" stored="true" indexed="true" multiValued="true"/>
         <field name="categories2" type="text" stored="false" indexed="true" multiValued="true"/>
         <field name="terms" type="text" stored="false" indexed="true" multiValued="true"/>

         <field name="author" type="string" stored="false" indexed="true"/>
         <field name="tag" type="string" stored="false" indexed="true"/>
         <field name="feed" type="string" stored="false" indexed="true"/>
         <field name="publishedDate" type="integer" stored="true" indexed="false"/>
         <field name="updatedDate" type="integer" stored="false" indexed="true"/>
         <field name="a_spell" type="textSpell" multiValued="true"/>
       </fields>
       <uniqueKey>url</uniqueKey>
       <defaultSearchField>title</defaultSearchField>
       <solrQueryParser defaultOperator="AND"/>
       <copyField source="url" dest="id"/>
       <copyField source="url" dest="url2"/>
       <copyField source="host" dest="host2"/>
       <copyField source="host" dest="host3"/>
       <copyField source="categories" dest="categories2"/>
       <copyField source="terms" dest="a_spell"/>
       <copyField source="title" dest="a_spell"/>
       <copyField source="content" dest="a_spell"/>
</schema>
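
One thing that may be worth checking against the schema above: as far as I know, Solr only returns fields declared stored="true" in query results, while fields that are only indexed can be searched but are not shown. The Python sketch below is not part of Nutch or Solr. The helper name split_by_stored and the inline excerpt (a few fields copied from the schema above, with attribute spacing fixed) are just for illustration, and it ignores stored defaults inherited from a fieldType.

```python
import xml.etree.ElementTree as ET

# A few <field> declarations copied from the schema above (illustrative excerpt).
SCHEMA_EXCERPT = """
<fields>
  <field name="content" type="text" stored="false" indexed="true" termVectors="true"/>
  <field name="contentLength" type="long" stored="true" indexed="false"/>
  <field name="date" type="integer" stored="false" indexed="false"/>
  <field name="plutoz_ranking" type="float" stored="true" indexed="true"/>
  <field name="user_ranking" type="float" stored="false" indexed="true"/>
  <field name="domain_ranking" type="float" stored="false" indexed="true"/>
  <field name="terms" type="text" stored="false" indexed="true" multiValued="true"/>
</fields>
"""


def split_by_stored(schema_xml):
    """Return (stored, not_stored) lists of field names, in document order.

    Only the explicit stored attribute on each <field> is inspected; a field
    without that attribute is listed as not stored (a simplification).
    """
    stored, not_stored = [], []
    for field in ET.fromstring(schema_xml.strip()).iter("field"):
        target = stored if field.get("stored") == "true" else not_stored
        target.append(field.get("name"))
    return stored, not_stored


if __name__ == "__main__":
    stored, not_stored = split_by_stored(SCHEMA_EXCERPT)
    print("can be returned in results:", stored)
    print("not stored:", not_stored)
```

With this excerpt, user_ranking and domain_ranking land in the not-stored list, while contentLength and plutoz_ranking are stored.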


solrindex-mapping:

<mapping>
        <!-- Simple mapping of fields created by Nutch IndexingFilters
             to fields defined (and expected) in Solr schema.xml.

             Any fields in NutchDocument that match a name defined
             in field/@source will be renamed to the corresponding
             field/@dest.
             Additionally, if a field name (before mapping) matches
             a copyField/@source then its values will be copied to
             the corresponding copyField/@dest.

             uniqueKey has the same meaning as in Solr schema.xml
             and defaults to "id" if not defined.
         -->
        <fields>
                <!--field dest="cache" source="cache"/>               
                <field dest="anchor" source="anchor"/>
                <field dest="type" source="type"/>
                <field dest="contentLength" source="contentLength"/>
                <field dest="lastModified" source="lastModified"/>
                <field dest="date" source="date"/>
                <field dest="lang" source="lang"/>
                <field dest="subcollection" source="subcollection"/>
                <field dest="plutoz_ranking" source="plutoz_ranking"/>
                <field dest="user_ranking" source="user_ranking"/>
                <field dest="domain_ranking" source="domain_ranking"/>
                <field dest="categories" source="categories"/>
                <field dest="author" source="author"/>
                <field dest="terms" source="terms"/>
                <field dest="publishedDate" source="publishedDate"/>
                <field dest="updatedDate" source="updatedDate"/-->
                
                <field dest="content" source="content"/>
                <field dest="site" source="site"/>
                <field dest="title" source="title"/>
                <field dest="host" source="host"/>
                <field dest="segment" source="segment"/>
                <field dest="boost" source="boost"/>
                <field dest="digest" source="digest"/>
                <field dest="tstamp" source="tstamp"/>
                <field dest="id" source="url"/>
                <copyField source="url" dest="url"/>
        </fields>
        <uniqueKey>url</uniqueKey>
</mapping>
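
To make the renaming rules from the comment in that file concrete, here is a small Python sketch of how I read them. It is not Nutch's actual code: apply_mapping is a made-up name, documents are modelled as plain dicts of value lists, and passing unmapped field names through unchanged is my assumption, since the comment does not say what happens to them.

```python
def apply_mapping(doc, field_map, copy_map):
    """Rename and copy fields as described in the solrindex-mapping comment.

    doc       -- dict: field name -> list of values (stand-in for a NutchDocument)
    field_map -- dict: field/@source -> field/@dest
    copy_map  -- dict: copyField/@source -> copyField/@dest
    """
    out = {}
    for name, values in doc.items():
        # Rename when the name matches a field/@source.
        # Assumption: names without a mapping keep their original name.
        dest = field_map.get(name, name)
        out.setdefault(dest, []).extend(values)
        # copyField is matched against the name *before* mapping.
        if name in copy_map:
            out.setdefault(copy_map[name], []).extend(values)
    return out


if __name__ == "__main__":
    doc = {"url": ["http://www.hafingtonpost.com"], "contentLength": ["0"]}
    field_map = {"url": "id", "content": "content"}  # from <field dest="id" source="url"/> etc.
    copy_map = {"url": "url"}                        # from <copyField source="url" dest="url"/>
    print(apply_mapping(doc, field_map, copy_map))
```

Under these assumptions, the url value ends up in both id and url, and contentLength goes through under its own name even though its mapping line is commented out.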







On 01/27/2012 01:41 PM, Markus Jelsma wrote:

Strange, the defaults work when the plugin is added to plugin.includes. You
can use the indexchecker tool to make sure all is well on Nutch's side.

I apologize if it is a stupid question, but I have been googling and I
haven't been able to find anything useful. I am trying to use the
index-more plugin to get the contentLength and date. I have put them in
nutch-site.xml and I can see that index-more gets loaded (in the hadoop.log
file), but after indexing I can't see those fields in the result. Both of
them are in both the schema.xml file and solrindex-mapping.xml. I'll
appreciate any hint you guys can spare.

btw i am using nutch Revision: 1236417

thanks,


--
Kaveh Minooie

www.plutoz.com
