Recently a few Japanese queries killed our production SolrCloud instance. Our schemas support multiple languages, with language-specific search fields.
This query and similar ones caused OOM errors in Solr:

    モノクローナル抗ニコチン性アセチルコリンレセプター(??7サブユニット)抗体 マウス宿主抗体

(roughly: "monoclonal anti-nicotinic acetylcholine receptor (??7 subunit) antibody, mouse host antibody"). The query doesn't match anything.

We are running Solr 7.2 in Google Cloud. The SolrCloud has 4 Solr nodes (plus 3 ZooKeepers on their own nodes) holding 18 collections. Usage on most of the collections is currently fairly light, but one of them gets a lot of traffic: it has 500,000 documents, of which 25,000 contain some Japanese fields.

We did a lot of testing, but I think we used historical search data, which tends to have short queries. A 44-character CJK string generates ~80 tokens. I ran the query above against a single Japanese field and it took ~30 seconds to come back; removing the ?? made no significant difference in performance. Other Japanese queries of a similar length return in ~200 ms. Our SolrCloud usually performs quite well, but in this case it was horrible. The bigram filter creates a lot of tokens, but this seems to be a fairly standard approach for Chinese and Japanese search.

- How can I debug what is going on with this query?
- How resource-intensive will searches against these fields be?
- How do we estimate the additional memory these fields seem to require?

We have about a dozen Japanese search fields, all of which use this CJKBigram field type:

<!-- Field type to support Asian languages:
     - transforms Traditional Han to Simplified Han
     - transforms Hiragana to Katakana
     - tokenizes to unigrams and bigrams for analysis and searching -->
<fieldtype name="text_deep_cjk" class="solr.TextField"
           positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <!-- remove spaces between CJK characters -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                replacement="$1"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Transform Traditional Han to Simplified Han -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <!-- Transform Hiragana to Katakana just as was done for Endeca -->
    <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
    <filter class="solr.ICUFoldingFilterFactory"/> <!-- NFKC, case folding, diacritics removed -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
            katakana="true" hangul="true" outputUnigrams="true"/>
  </analyzer>
  <analyzer type="query">
    <!-- remove spaces between CJK characters -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}]+)\s+(?=[\p{IsHangul}\p{IsHan}\p{IsKatakana}\p{IsHiragana}])"
                replacement="$1"/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true" tokenizerFactory="solr.ICUTokenizerFactory"/>
    <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Transform Traditional Han to Simplified Han -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <!-- Transform Hiragana to Katakana just as was done for Endeca -->
    <filter class="solr.ICUTransformFilterFactory" id="Hiragana-Katakana"/>
    <filter class="solr.ICUFoldingFilterFactory"/> <!-- NFKC, case folding, diacritics removed -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
            katakana="true" hangul="true" outputUnigrams="true"/>
  </analyzer>
</fieldtype>
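For what it's worth, the ~80-token figure is consistent with how CJKBigramFilterFactory behaves when outputUnigrams="true": each unbroken run of n CJK characters yields n unigrams plus n-1 bigrams, i.e. 2n-1 tokens. Here is a back-of-the-envelope sketch of that arithmetic (the character ranges are rough approximations, not the tokenizer's actual rules, and it ignores the space-removing charFilter; the real token stream can be inspected on Solr's Analysis screen or via the /analysis/field handler):

```python
import re

# Approximate CJK ranges: Hiragana/Katakana, Han (incl. Ext A and
# compatibility ideographs), Hangul syllables. Sketch only.
CJK_RUN = re.compile(
    r'[\u3040-\u30ff\u3400-\u9fff\uf900-\ufaff\uac00-\ud7af]+')

def estimated_tokens(text: str) -> int:
    """Rough token count for CJKBigramFilter with outputUnigrams=true:
    a run of n CJK characters -> n unigrams + (n - 1) bigrams = 2n - 1."""
    return sum(2 * len(run) - 1 for run in CJK_RUN.findall(text))

query = ("モノクローナル抗ニコチン性アセチルコリンレセプター"
         "(??7サブユニット)抗体 マウス宿主抗体")
print(estimated_tokens(query))  # in the same ballpark as the ~80 observed
```

Is that per-query token count alone enough to explain the 30-second response and the OOM, or should I be looking elsewhere (e.g. the SynonymGraphFilter expanding over those tokens)?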