Is there a way to get an approximate measure of the memory used by an indexed 
field (or fields)? I’m looking into a problem with one of our Solr indexes: a 
Japanese query causes the replicas to run out of memory while it is being 
processed.
Also, is there a way to change or disable the timeout in the Solr admin 
console? When I run this query there it always times out, which is a real 
pain. I know that it will complete eventually.

I have this field type:
  <!-- Field type to support Asian languages:
       transforms Traditional Han to Simplified Han,
       transforms Hiragana to Katakana,
       tokenizes to unigrams and bigrams for analysis and searching -->
  <fieldtype name="text_deep_cjk" class="solr.TextField"
             positionIncrementGap="10000" autoGeneratePhraseQueries="false">
    <analyzer type="index">
      <!-- remove spaces between CJK characters -->
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                  replacement="$1"/>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- normalize width before bigramming, as e.g. half-width dakuten combine -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <!-- Transform Traditional Han to Simplified Han -->
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      <!-- Transform Hiragana to Katakana, just as was done for Endeca -->
      <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
      <filter class="solr.ICUFoldingFilterFactory"/>   <!-- NFKC, case folding, diacritics removed -->
      <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
              katakana="true" hangul="true" outputUnigrams="true"/>
    </analyzer>

    <analyzer type="query">
      <!-- remove spaces between CJK characters -->
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                  replacement="$1"/>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true" tokenizerFactory="solr.ICUTokenizerFactory"/>
      <!-- normalize width before bigramming, as e.g. half-width dakuten combine -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <!-- Transform Traditional Han to Simplified Han -->
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      <!-- Transform Hiragana to Katakana, just as was done for Endeca -->
      <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
      <filter class="solr.ICUFoldingFilterFactory"/>   <!-- NFKC, case folding, diacritics removed -->
      <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
              katakana="true" hangul="true" outputUnigrams="true"/>
    </analyzer>
  </fieldtype>
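
To see what this chain actually emits, a sample string can be fed straight to the field analysis handler. A minimal sketch, assuming a collection named mycollection on localhost:8983 (both placeholders); the handler path and the analysis.* parameters are standard Solr, but the exact layout of the JSON response may vary slightly between versions:

# Sketch: run a sample string through the text_deep_cjk index- and query-time
# chains via Solr's field analysis handler and count the final tokens.
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/mycollection"   # placeholder host/collection
SAMPLE = "モノクローナル抗ニコチン性アセチルコリンレセプター"

params = urllib.parse.urlencode({
    "analysis.fieldtype": "text_deep_cjk",
    "analysis.fieldvalue": SAMPLE,   # exercises the index-time analyzer
    "analysis.query": SAMPLE,        # exercises the query-time analyzer
    "wt": "json",
})
with urllib.request.urlopen(f"{SOLR}/analysis/field?{params}") as resp:
    data = json.load(resp)

# The response lists every stage (char filters, tokenizer, filters) in order;
# the last entry of each chain should be the token stream that is actually
# indexed or turned into a query.
for chain in ("index", "query"):
    stages = data["analysis"]["field_types"]["text_deep_cjk"][chain]
    final_tokens = stages[-1]
    print(chain, len(final_tokens), "tokens:",
          [t["text"] for t in final_tokens[:10]], "...")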
I have a number of fields of this type. The CJKBigramFilterFactory can generate 
a lot of tokens, and I’m concerned that this combination is what is killing our 
Solr instances.
This is the query that is causing my problems:
モノクローナル抗ニコチン性アセチルコリンレセプター(??7サブユニット)抗体 マウス宿主抗体
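
To put a rough number on the bigram concern: with outputUnigrams="true", a run of N CJK characters comes out as N unigrams plus N-1 bigrams, i.e. roughly 2N-1 tokens, before the query-time synonym graph adds anything. A back-of-the-envelope count for the query above (the Unicode ranges are approximate, and the char filter that strips whitespace between CJK runs will merge some runs, so the real number differs slightly):

# Rough estimate of CJK token expansion: each run of N CJK characters
# becomes about 2N-1 tokens with unigrams + bigrams enabled.
import re

# The problem query, reproduced as received (including the "??").
query = ("モノクローナル抗ニコチン性アセチルコリンレセプター"
         "(??7サブユニット)抗体 マウス宿主抗体")

# Approximate Hiragana, Katakana, CJK ideograph, and Hangul ranges.
cjk_run = re.compile(r"[\u3040-\u30ff\u3400-\u9fff\uf900-\ufaff\uac00-\ud7af]+")

total = 0
for run in cjk_run.findall(query):
    n = len(run)
    total += 2 * n - 1   # n unigrams + (n - 1) bigrams
print(total, "CJK tokens (before synonym expansion)")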

We are using Solr 7.2 in a SolrCloud setup.
