Re: Small Tokenization issue

Erick Erickson Wed, 03 Jan 2018 13:41:35 -0800

WordDelimiterGraphFilterFactory is a new implementation so it's also
quite possible that the behavior just changed.


I just took a look and indeed it does. WordDelimiterFilterFactory
(run on "p / n whatever") produces:
token:     p  n  whatever
position:  1  2  3

whereas WordDelimiterGraphFilterFactory produces:

token:     p  n  whatever
position:  1  3  4


Arguably the Graph version is the correct behavior: the "/" token is
dropped, but the position it occupied (2) is preserved as a gap.
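
If you want to reproduce the comparison on the Analysis screen, two
throwaway field types along these lines should be enough. This is only a
sketch: the field type names are made up, the filters are left at their
defaults rather than your settings, and it assumes a Solr version that
still ships the deprecated WordDelimiterFilterFactory.

```xml
<!-- Hypothetical names; meant for the Analysis screen only, not for indexing -->
<fieldType name="wdf_old" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- older, non-graph implementation -->
    <filter class="solr.WordDelimiterFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="wdf_graph" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- graph implementation -->
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
  </analyzer>
</fieldType>
```

Analyzing "p / n whatever" against both types and comparing the position
row should show whether the gap appears on your version too.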

What if you use phrases to search for this instead?
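
If phrase queries are not an option, the char filter Emir suggested could
be widened to cover "/" as well as "-", so that "P/N" and "P / N" both
reach the tokenizer as "P N". A rough sketch, untested against your
schema; the character class is an assumption and the rest of the chain is
abbreviated from your field type:

```xml
<analyzer type="index">
  <!-- Assumption: collapse "-" or "/" plus any surrounding whitespace into one space -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\s*[-/]\s*" replacement=" "/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- remaining filters as in your field type -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
          outputUnigrams="false"/>
</analyzer>
```

Since the delimiter never becomes a token, there is no dropped token to
leave a position gap, and the shingle filter should see "p" and "n" as
adjacent. The same replacement would presumably be needed on the query
side for the two to match.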

Best,
Erick

On Wed, Jan 3, 2018 at 12:56 PM, Nawab Zada Asad Iqbal <khi...@gmail.com> wrote:
> Thanks Emir, Erick.
>
> What I want to do is remove the empty tokens after WordDelimiterGraphFilter.
> Is there an option in WordDelimiterGraphFilter to not generate empty
> tokens?
>
> This index field is intended for unusual strings, e.g. part numbers:
> P/N HSC0424PP
> The benefit of removing the empty tokens is that if someone unintentionally
> puts a space around the '/' (as in the above example), this field is still
> able to match.
>
> In the previous Solr version, ShingleFilter worked fine with empty
> positions and made shingles across the empty space. It is possible,
> though, that I have learned to rely on a bug.
>
> On Wed, Jan 3, 2018 at 12:23 PM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Nawab,
>> You do not get a shingle because there is an empty position between the
>> tokens: after the tokenizer you have 3 tokens, 'abc', '-' and 'def', so the
>> tokens you are interested in are not next to each other and cannot form a
>> shingle.
>> What you can do is apply a char filter before tokenization to remove the '-',
>> something like:
>>
>> <charFilter class="solr.PatternReplaceCharFilterFactory"
>>              pattern="\s*-\s*" replacement=" "/>
>>
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal <khi...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > So, I have a string for indexing:
>> >
>> > abc - def (notice the space on either side of hyphen)
>> >
>> > which is processed with this filter chain:
>> >
>> >
>> >    <fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
>> >      <analyzer type="index">
>> >        <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>> >                    name="nfkc" mode="compose"/>
>> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >        <filter class="solr.WordDelimiterGraphFilterFactory"
>> >                generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> >                catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>> >                splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>> >        <filter class="solr.FlattenGraphFilterFactory"/>
>> >        <filter class="solr.PatternReplaceFilterFactory"
>> >                pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>> >        <filter class="solr.LowerCaseFilterFactory"/>
>> >        <filter class="solr.ASCIIFoldingFilterFactory"/>
>> >        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
>> >                outputUnigrams="false" fillerToken=""/>
>> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >        <filter class="solr.LimitTokenCountFilterFactory"
>> >                maxTokenCount="10000" consumeAllTokens="false"/>
>> >        <filter class="solr.LengthFilterFactory" min="1" max="255"/>
>> >      </analyzer>
>> >
>> >
>> > I get two shingle tokens at the end:
>> >
>> > "abc" "def"
>> >
>> > I want to get "abc def". What can I tweak to get this?
>> >
>> >
>> > Thanks
>> > Nawab
>>
>>
