Here is the same question on Stack Overflow, where the formatting is better:

http://stackoverflow.com/questions/42370231/solr-dynamic-field-blowing-up-the-index-size

Recently, I upgraded from Solr 5.0 to Solr 6.4.1. My app runs fine, but the
index is far larger under Solr 6. In Solr 5 the index was about 15GB; in
Solr 6, for the same data, it is 300GB! I cannot work out what accounts for
such a huge difference.

I have been able to identify the field that is blowing up the index size.
It is defined as follows.

<dynamicField name="*_note" type="text_general" indexed="true"
stored="true" multiValued="true"  />

<field name="textproperty" type="text_general" indexed="true"
stored="false" multiValued="true"  />
<copyField source="*_note" dest="textproperty"/>

When this field is commented out, the index size reduces to less than 10GB.
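
For anyone who wants to reproduce the size comparison, here is a minimal
Python sketch. It assumes a Solr instance at http://localhost:8983/solr and a
core named collection1 (both are assumptions, adjust to your setup). It only
builds the CoreAdmin STATUS request URL, whose response should report the
index size, and does the arithmetic on the numbers quoted above. The helper
names are mine, not part of Solr.

```python
from urllib.parse import urlencode


def core_status_url(base_url, core):
    """Build a CoreAdmin STATUS URL for one core (base_url and core are assumed values)."""
    params = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/cores?{params}"


def growth_factor(old_size_gb, new_size_gb):
    """How many times larger the new index is than the old one."""
    return new_size_gb / old_size_gb


def bytes_per_doc(size_in_bytes, num_docs):
    """Average index bytes per document; 0.0 for an empty index."""
    return size_in_bytes / num_docs if num_docs else 0.0


if __name__ == "__main__":
    print(core_status_url("http://localhost:8983/solr", "collection1"))
    # Sizes reported in this post: 15GB under Solr 5, 300GB under Solr 6.
    print(growth_factor(15, 300))
```

Comparing bytes per document with and without the field, on the same data,
should show how much of the growth comes from this field alone.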

This field is of type text_general, which is defined as follows.

<fieldType name="text_general" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="((?m)[a-z]+)'s" replacement="$1s" />
        <filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.KStemFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="((?m)[a-z]+)'s" replacement="$1s" />
        <filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.KStemFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="C:/Users/pratik/Desktop/solr-6.4.1_playground/solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
/>
      </analyzer>
  </fieldType>

A few things I did to debug this issue:

   - I have ensured that the field type definition is the same as the one I
   was using in Solr 5 and that it is also valid in version 6. This field
   type takes a list of stopwords to be ignored during indexing. I have
   supplied the same stopword list that we were using in Solr 5. I have
   verified that the path to this file is correct and that it loads fine in
   the Solr admin UI. When I analyse these fields using the "Analysis" tab
   of the Solr admin UI, I can see that stopwords are being filtered out.
   However, when I query for some of these stopwords, I do get results
   back, which makes me think that the stopwords are probably being indexed
   after all.
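
One way to check the stopword suspicion directly is to list the most
frequent indexed terms of the field and see whether any stopwords are among
them. The Python sketch below builds a request URL for Solr's Luke request
handler (/admin/luke with the fl and numTerms parameters) and intersects a
term list with a stopword list. The host, port and core name are assumptions,
the helper names are hypothetical, and extracting the terms from the Luke
response is left out because I have not verified its exact layout here.

```python
from urllib.parse import urlencode


def luke_top_terms_url(base_url, core, field, num_terms=50):
    """Build a Luke handler URL that asks for the top indexed terms of one field."""
    params = urlencode({"fl": field, "numTerms": num_terms, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/admin/luke?{params}"


def indexed_stopwords(top_terms, stopwords):
    """Return, sorted, the stopwords that also appear among the indexed terms."""
    return sorted(set(top_terms) & {w.lower() for w in stopwords})


if __name__ == "__main__":
    # Assumed local instance and core name; textproperty is the copyField target.
    print(luke_top_terms_url("http://localhost:8983/solr", "collection1", "textproperty"))
```

If stopwords were really being filtered at index time, indexed_stopwords
should come back empty for the terms reported on textproperty.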

Any idea what could increase the size of the index by so much in Solr 6?
