If it's regular, you could try using a PatternReplaceCharFilterFactory (PRCFF), which gets applied to the input before tokenization (note, this is NOT PatternReplaceFilterFactory, which gets applied after tokenization).
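
For illustration, a PRCFF entry in the index analyzer might look something like the sketch below. The pattern and replacement are assumptions chosen for the "abc - def" example and have not been tested against this field type. The charFilter has to be declared before the tokenizer, and the Analysis screen in the admin UI will show what the rest of the chain does with the result.

```xml
<!-- Hypothetical sketch: collapse a hyphen surrounded by whitespace
     into a single space before the tokenizer sees the input.
     The pattern and replacement values are illustrative assumptions. -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="\s+-\s+" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
```
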
I don't really see how you could make this work though. WhitespaceTokenizer will break "abc def" into "abc" and "def" even if you use PRCFF. WordDelimiterGraphFilterFactory would break up "abc-def" into "abc" "def" and possibly "abcdef" depending on catenateWords' value.

Instead of this, would it answer to use _phrase_ searches when you wanted to find "abc def"?

Best,
Erick

On Wed, Jan 3, 2018 at 12:04 PM, Nawab Zada Asad Iqbal <khi...@gmail.com> wrote:
> Hi,
>
> So, I have a string for indexing:
>
> abc - def (notice the space on either side of hyphen)
>
> which is being processed with this filter-list:-
>
>
> <fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>         name="nfkc" mode="compose"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="0"
>         catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>         splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>         pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>         outputUnigrams="false" fillerToken=""/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <filter class="solr.LimitTokenCountFilterFactory"
>         maxTokenCount="10000" consumeAllTokens="false"/>
>     <filter class="solr.LengthFilterFactory" min="1" max="255"/>
>   </analyzer>
>
>
> I get two shingle tokens at the end:
>
> "abc" "def"
>
> I want to get "abc def". What can I tweak to get this?
>
>
> Thanks
> Nawab