Re: Small Tokenization issue

2018-01-05 Thread Rick Leir

Nawab

Look at ClassicTokenizer. It is a good choice if you have part numbers
with hyphens. It is the second tokenizer on this page:
https://lucene.apache.org/solr/guide/6_6/tokenizers.html
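
In schema.xml that would look something like the following sketch (the
field type name is just a placeholder, and the lowercase filter is an
assumption, not something from your setup):

```xml
<!-- ClassicTokenizer keeps product codes containing digits,
     e.g. HSC-0424-PP, together as a single token instead of
     splitting on the hyphens -->
<fieldType name="text_partnum" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```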


Cheers -- Rick


On 01/03/2018 04:52 PM, Shawn Heisey wrote:

On 1/3/2018 1:56 PM, Nawab Zada Asad Iqbal wrote:

Thanks Emir, Erick.

What I want to do is remove empty tokens after WordDelimiterGraphFilter.
Is there any such option in WordDelimiterGraphFilter to not generate empty
tokens?


I use LengthFilterFactory with a minimum of 1 and a maximum of 512 to 
remove empty tokens.


Thanks,
Shawn





Re: Small Tokenization issue

2018-01-03 Thread Shawn Heisey

On 1/3/2018 1:56 PM, Nawab Zada Asad Iqbal wrote:

Thanks Emir, Erick.

What I want to do is remove empty tokens after WordDelimiterGraphFilter.
Is there any such option in WordDelimiterGraphFilter to not generate empty
tokens?


I use LengthFilterFactory with a minimum of 1 and a maximum of 512 to 
remove empty tokens.
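
In the analyzer chain that is just one extra filter after the
WordDelimiterGraphFilter; a sketch (the min/max values are the ones from
my setup described above):

```xml
<!-- Drops the zero-length tokens that WordDelimiterGraphFilter can
     leave behind: min="1" removes empty tokens, max="512" caps
     pathologically long ones -->
<filter class="solr.LengthFilterFactory" min="1" max="512"/>
```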


Thanks,
Shawn



Re: Small Tokenization issue

2018-01-03 Thread Erick Erickson
WordDelimiterGraphFilterFactory is a new implementation so it's also
quite possible that the behavior just changed.

I just took a look and indeed it does. WordDelimiterFilterFactory
(run on "p / n whatever") produces:

token:     p  n  whatever
position:  1  2  3

whereas WordDelimiterGraphFilterFactory produces:

token:     p  n  whatever
position:  1  3  4


Arguably the Graph version is correct behavior.

What if you use phrases to search for this instead?

Best,
Erick

On Wed, Jan 3, 2018 at 12:56 PM, Nawab Zada Asad Iqbal  wrote:
> Thanks Emir, Erick.
>
> What i want to do is remove empty tokens after WordDelimiterGraphFilter ?
> Is there any such option in WordDelimiterGraphFilter to not generate empty
> tokens?
>
> This index field is intended to use for strange strings e.g. part numbers.
> P/N HSC0424PP
> The benefit of removing the empty tokens is that if someone unintentionally
> puts a space around the '/' (in above example) this field is still able to
> match.
>
> In previous solr version, ShingleFilter used to work fine in case of empty
> positions and was making shingles across the empty space. Although, it is
> possible that i have learned to rely on a bug.
>
>
>
>
>
>
> On Wed, Jan 3, 2018 at 12:23 PM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Nawab,
>> The reason why you do not get shingle is because there is empty token
>> because after tokenizer you have 3 tokens ‘abc’, ‘-’ and ‘def’ so the token
>> that you are interested in are not next to each other and cannot form
>> shingle.
>> What you can do is apply char filter before tokenization to remove ‘-‘
>> something like:
>>
>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*-\s*" replacement=" "/>
>>
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal  wrote:
>> >
>> > Hi,
>> >
>> > So, I have a string for indexing:
>> >
>> > abc - def (notice the space on either side of hyphen)
>> >
>> > which is being processed with this filter-list:-
>> >
>> >
>> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>> >   <analyzer>
>> >     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>> >       name="nfkc" mode="compose"/>
>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >     <filter class="solr.WordDelimiterGraphFilterFactory"
>> >       generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> >       catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>> >       splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>> >     <filter class="solr.PatternReplaceFilterFactory"
>> >       pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>> >     <filter class="solr.ShingleFilterFactory"
>> >       outputUnigrams="false" fillerToken=""/>
>> >     <filter class="solr.LimitTokenCountFilterFactory"
>> >       maxTokenCount="1" consumeAllTokens="false"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> > I get two shingle tokens at the end:
>> >
>> > "abc" "def"
>> >
>> > I want to get "abc def" . What can I tweak to get this?
>> >
>> >
>> > Thanks
>> > Nawab
>>
>>


Re: Small Tokenization issue

2018-01-03 Thread Nawab Zada Asad Iqbal
Thanks Emir, Erick.

What I want to do is remove empty tokens after WordDelimiterGraphFilter.
Is there any such option in WordDelimiterGraphFilter to not generate empty
tokens?

This index field is intended for unusual strings, e.g. part numbers such as
P/N HSC0424PP. The benefit of removing the empty tokens is that if someone
unintentionally puts spaces around the '/' (in the above example), this
field is still able to match.

In previous Solr versions, ShingleFilter worked fine with empty positions
and made shingles across the empty space, although it is possible that I
had learned to rely on a bug.






On Wed, Jan 3, 2018 at 12:23 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Nawab,
> The reason why you do not get shingle is because there is empty token
> because after tokenizer you have 3 tokens ‘abc’, ‘-’ and ‘def’ so the token
> that you are interested in are not next to each other and cannot form
> shingle.
> What you can do is apply char filter before tokenization to remove ‘-‘
> something like:
>
> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*-\s*" replacement=" "/>
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal  wrote:
> >
> > Hi,
> >
> > So, I have a string for indexing:
> >
> > abc - def (notice the space on either side of hyphen)
> >
> > which is being processed with this filter-list:-
> >
> >
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >   <analyzer>
> >     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
> >       name="nfkc" mode="compose"/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterGraphFilterFactory"
> >       generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >       catenateNumbers="0" catenateAll="0" preserveOriginal="0"
> >       splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
> >     <filter class="solr.PatternReplaceFilterFactory"
> >       pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
> >     <filter class="solr.ShingleFilterFactory"
> >       outputUnigrams="false" fillerToken=""/>
> >     <filter class="solr.LimitTokenCountFilterFactory"
> >       maxTokenCount="1" consumeAllTokens="false"/>
> >   </analyzer>
> > </fieldType>
> >
> > I get two shingle tokens at the end:
> >
> > "abc" "def"
> >
> > I want to get "abc def" . What can I tweak to get this?
> >
> >
> > Thanks
> > Nawab
>
>


Re: Small Tokenization issue

2018-01-03 Thread Emir Arnautović
Hi Nawab,
The reason you do not get a shingle is that there is an empty token: after
the tokenizer you have three tokens, 'abc', '-' and 'def', so the tokens
you are interested in are not next to each other and cannot form a shingle.
What you can do is apply a char filter before tokenization to remove the
'-', something like:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*-\s*" replacement=" "/>

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal  wrote:
> 
> Hi,
> 
> So, I have a string for indexing:
> 
> abc - def (notice the space on either side of hyphen)
> 
> which is being processed with this filter-list:-
> 
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>       name="nfkc" mode="compose"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>       generateWordParts="1" generateNumberParts="1" catenateWords="0"
>       catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>       splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>       pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>     <filter class="solr.ShingleFilterFactory"
>       outputUnigrams="false" fillerToken=""/>
>     <filter class="solr.LimitTokenCountFilterFactory"
>       maxTokenCount="1" consumeAllTokens="false"/>
>   </analyzer>
> </fieldType>
> 
> I get two shingle tokens at the end:
> 
> "abc" "def"
> 
> I want to get "abc def" . What can I tweak to get this?
> 
> 
> Thanks
> Nawab



Re: Small Tokenization issue

2018-01-03 Thread Erick Erickson
If it's regular, you could try using a PatternReplaceCharFilterFactory
(PRCFF), which gets applied to the input before tokenization (note,
this is NOT PatternReplaceFilterFactory, which gets applied after
tokenization).

I don't really see how you could make this work though.
WhitespaceTokenizer will break "abc def" into "abc" and "def" even if
you use PRCFF. WordDelimiterGraphFilterFactory would break up
"abc-def" into "abc" "def" and possibly "abcdef" depending on
catenateWords' value.
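
For example, with catenateWords enabled the filter would also emit the
joined form (a sketch; every attribute other than the two shown is
omitted here):

```xml
<!-- With catenateWords="1", "abc-def" yields "abc", "def", and
     also the catenated "abcdef" -->
<filter class="solr.WordDelimiterGraphFilterFactory"
        generateWordParts="1" catenateWords="1"/>
```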

Instead of this, would it answer to use _phrase_ searches when you
wanted to find "abc def"?

Best,
Erick

On Wed, Jan 3, 2018 at 12:04 PM, Nawab Zada Asad Iqbal  wrote:
> Hi,
>
> So, I have a string for indexing:
>
> abc - def (notice the space on either side of hyphen)
>
> which is being processed with this filter-list:-
>
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>       name="nfkc" mode="compose"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>       generateWordParts="1" generateNumberParts="1" catenateWords="0"
>       catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>       splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>       pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>     <filter class="solr.ShingleFilterFactory"
>       outputUnigrams="false" fillerToken=""/>
>     <filter class="solr.LimitTokenCountFilterFactory"
>       maxTokenCount="1" consumeAllTokens="false"/>
>   </analyzer>
> </fieldType>
>
> I get two shingle tokens at the end:
>
> "abc" "def"
>
> I want to get "abc def" . What can I tweak to get this?
>
>
> Thanks
> Nawab


Small Tokenization issue

2018-01-03 Thread Nawab Zada Asad Iqbal
Hi,

So, I have a string for indexing:

abc - def (notice the space on either side of hyphen)

which is being processed with this filter list:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
      name="nfkc" mode="compose"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
      generateWordParts="1" generateNumberParts="1" catenateWords="0"
      catenateNumbers="0" catenateAll="0" preserveOriginal="0"
      splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
    <filter class="solr.PatternReplaceFilterFactory"
      pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
    <filter class="solr.ShingleFilterFactory"
      outputUnigrams="false" fillerToken=""/>
    <filter class="solr.LimitTokenCountFilterFactory"
      maxTokenCount="1" consumeAllTokens="false"/>
  </analyzer>
</fieldType>


I get two shingle tokens at the end:

"abc" "def"

I want to get "abc def". What can I tweak to get this?


Thanks
Nawab