Re: Small Tokenization issue

Erick Erickson Wed, 03 Jan 2018 12:23:16 -0800

If it's regular, you could try using a PatternReplaceCharFilterFactory
(PRCFF), which gets applied to the input before tokenization (note,
this is NOT PatternReplaceFilterFactory, which gets applied after
tokenization).
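
For illustration, a minimal sketch of where such a char filter would sit
in the analyzer. The pattern and replacement here are assumptions (they
collapse a space-padded hyphen into a bare hyphen) and are untested
against this field type, so check the result on the admin UI's Analysis
page:

```xml
<analyzer type="index">
  <!-- Char filters run on the raw input, before the tokenizer sees it. -->
  <!-- Assumed pattern: rewrite "abc - def" to "abc-def". -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\s+-\s+" replacement="-"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- the remaining filters stay as they are -->
</analyzer>
```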


I don't really see how you could make this work though.
WhitespaceTokenizer will break "abc def" into "abc" and "def" even if
you use PRCFF. WordDelimiterGraphFilterFactory would break up
"abc-def" into "abc" "def" and possibly "abcdef" depending on
catenateWords' value.
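
As a sketch of the catenateWords point, this is the same filter with only
that option switched on. The attribute values are illustrative, not a
recommendation for this field type:

```xml
<!-- With catenateWords="1", "abc-def" should yield "abc", "def" -->
<!-- and the catenated token "abcdef". -->
<filter class="solr.WordDelimiterGraphFilterFactory"
        generateWordParts="1" catenateWords="1" preserveOriginal="0"/>
```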

Instead, would it be enough to use _phrase_ searches when you
want to find "abc def"?

Best,
Erick

On Wed, Jan 3, 2018 at 12:04 PM, Nawab Zada Asad Iqbal <khi...@gmail.com> wrote:
> Hi,
>
> So, I have a string for indexing:
>
> abc - def (notice the space on either side of hyphen)
>
> which is being processed with this filter-list:-
>
>
>     <fieldType name="shingle" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <charFilter
> class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
> name="nfkc" mode="compose"/>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.WordDelimiterGraphFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" preserveOriginal="0"
> splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>         <filter class="solr.FlattenGraphFilterFactory"/>
>         <filter class="solr.PatternReplaceFilterFactory"
> pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.ASCIIFoldingFilterFactory"/>
>         <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> outputUnigrams="false" fillerToken=""/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>         <filter class="solr.LimitTokenCountFilterFactory"
> maxTokenCount="10000" consumeAllTokens="false"/>
>         <filter class="solr.LengthFilterFactory" min="1" max="255"/>
>       </analyzer>
>
>
> I get two shingle tokens at the end:
>
> "abc" "def"
>
> I want to get "abc def" . What can I tweak to get this?
>
>
> Thanks
> Nawab
