Re: Small Tokenization issue
Nawab,

Look at ClassicTokenizer. It is a good choice if you have part numbers with hyphens. It is the second tokenizer on this page: https://lucene.apache.org/solr/guide/6_6/tokenizers.html

Cheers -- Rick

On 01/03/2018 04:52 PM, Shawn Heisey wrote:
> On 1/3/2018 1:56 PM, Nawab Zada Asad Iqbal wrote:
>> Thanks Emir, Erick. What i want to do is remove empty tokens after
>> WordDelimiterGraphFilter? Is there any such option in
>> WordDelimiterGraphFilter to not generate empty tokens?
>
> I use LengthFilterFactory with a minimum of 1 and a maximum of 512 to
> remove empty tokens.
>
> Thanks,
> Shawn
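Rick's suggestion would look roughly like the sketch below. The field-type name and the `solr.TextField` class are hypothetical; only the tokenizer choice comes from his message:

```xml
<!-- Sketch only: "text_partnum" and class="solr.TextField" are assumed.
     ClassicTokenizer does not split tokens that contain digits and
     embedded hyphens, so part numbers survive as single tokens. -->
<fieldType name="text_partnum" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
  </analyzer>
</fieldType>
```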
Re: Small Tokenization issue
On 1/3/2018 1:56 PM, Nawab Zada Asad Iqbal wrote:
> Thanks Emir, Erick. What i want to do is remove empty tokens after
> WordDelimiterGraphFilter? Is there any such option in
> WordDelimiterGraphFilter to not generate empty tokens?

I use LengthFilterFactory with a minimum of 1 and a maximum of 512 to remove empty tokens.

Thanks,
Shawn
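Shawn's workaround drops the zero-length tokens that the graph filter can emit. A minimal sketch; the one assumption is that it is placed directly after WordDelimiterGraphFilterFactory in the chain:

```xml
<!-- min="1" discards empty tokens produced upstream;
     max="512" caps pathologically long tokens -->
<filter class="solr.LengthFilterFactory" min="1" max="512"/>
```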
Re: Small Tokenization issue
WordDelimiterGraphFilterFactory is a new implementation, so it's also quite possible that the behavior just changed. I just took a look and indeed it does. WordDelimiterFilterFactory (run on "p / n whatever") produces:

    token:    p  n  whatever
    position: 1  2  3

whereas WordDelimiterGraphFilterFactory produces:

    token:    p  n  whatever
    position: 1  3  4

Arguably the Graph version is the correct behavior. What if you use phrases to search for this instead?

Best,
Erick

On Wed, Jan 3, 2018 at 12:56 PM, Nawab Zada Asad Iqbal wrote:
> Thanks Emir, Erick.
>
> What i want to do is remove empty tokens after WordDelimiterGraphFilter?
> Is there any such option in WordDelimiterGraphFilter to not generate empty
> tokens?
>
> This index field is intended to be used for strange strings, e.g. part
> numbers:
>
>     P/N HSC0424PP
>
> The benefit of removing the empty tokens is that if someone unintentionally
> puts a space around the '/' (in the above example) this field is still able
> to match.
>
> In the previous Solr version, ShingleFilter used to work fine in case of
> empty positions and was making shingles across the empty space. Although,
> it is possible that i have learned to rely on a bug.
>
> On Wed, Jan 3, 2018 at 12:23 PM, Emir Arnautović
> <emir.arnauto...@sematext.com> wrote:
>> Hi Nawab,
>> The reason why you do not get a shingle is that there is an empty token:
>> after the tokenizer you have 3 tokens, 'abc', '-' and 'def', so the tokens
>> that you are interested in are not next to each other and cannot form a
>> shingle.
>> What you can do is apply a char filter before tokenization to remove '-',
>> something like:
>>
>>     <charFilter ... pattern="\s*-\s*" replacement=" "/>
>>
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>> On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal wrote:
>>>
>>> Hi,
>>>
>>> So, I have a string for indexing:
>>>
>>>     abc - def   (notice the space on either side of the hyphen)
>>>
>>> which is being processed with this filter list:
>>>
>>>     <fieldType ... positionIncrementGap="100">
>>>       <analyzer>
>>>         <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>>>                     name="nfkc" mode="compose"/>
>>>         <tokenizer .../>
>>>         <filter class="solr.WordDelimiterGraphFilterFactory"
>>>                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>>                 catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>>>                 splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>>>         <filter class="solr.PatternReplaceFilterFactory"
>>>                 pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>>>         <filter class="solr.ShingleFilterFactory" outputUnigrams="false"
>>>                 fillerToken=""/>
>>>         <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1"
>>>                 consumeAllTokens="false"/>
>>>       </analyzer>
>>>     </fieldType>
>>>
>>> I get two shingle tokens at the end:
>>>
>>>     "abc" "def"
>>>
>>> I want to get "abc def". What can I tweak to get this?
>>>
>>> Thanks
>>> Nawab
Re: Small Tokenization issue
Thanks Emir, Erick.

What i want to do is remove empty tokens after WordDelimiterGraphFilter? Is there any such option in WordDelimiterGraphFilter to not generate empty tokens?

This index field is intended to be used for strange strings, e.g. part numbers:

    P/N HSC0424PP

The benefit of removing the empty tokens is that if someone unintentionally puts a space around the '/' (in the above example) this field is still able to match.

In the previous Solr version, ShingleFilter used to work fine in case of empty positions and was making shingles across the empty space. Although, it is possible that i have learned to rely on a bug.

On Wed, Jan 3, 2018 at 12:23 PM, Emir Arnautović
<emir.arnauto...@sematext.com> wrote:
> Hi Nawab,
> The reason why you do not get a shingle is that there is an empty token:
> after the tokenizer you have 3 tokens, 'abc', '-' and 'def', so the tokens
> that you are interested in are not next to each other and cannot form a
> shingle.
> What you can do is apply a char filter before tokenization to remove '-',
> something like:
>
>     <charFilter ... pattern="\s*-\s*" replacement=" "/>
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>> On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal wrote:
>>
>> Hi,
>>
>> So, I have a string for indexing:
>>
>>     abc - def   (notice the space on either side of the hyphen)
>>
>> which is being processed with this filter list:
>>
>>     <fieldType ... positionIncrementGap="100">
>>       <analyzer>
>>         <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>>                     name="nfkc" mode="compose"/>
>>         <tokenizer .../>
>>         <filter class="solr.WordDelimiterGraphFilterFactory"
>>                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>                 catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>>                 splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>>         <filter class="solr.PatternReplaceFilterFactory"
>>                 pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>>         <filter class="solr.ShingleFilterFactory" outputUnigrams="false"
>>                 fillerToken=""/>
>>         <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1"
>>                 consumeAllTokens="false"/>
>>       </analyzer>
>>     </fieldType>
>>
>> I get two shingle tokens at the end:
>>
>>     "abc" "def"
>>
>> I want to get "abc def". What can I tweak to get this?
>>
>> Thanks
>> Nawab
Re: Small Tokenization issue
Hi Nawab,

The reason why you do not get a shingle is that there is an empty token: after the tokenizer you have 3 tokens, 'abc', '-' and 'def', so the tokens that you are interested in are not next to each other and cannot form a shingle.

What you can do is apply a char filter before tokenization to remove '-', something like:

    <charFilter ... pattern="\s*-\s*" replacement=" "/>

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal wrote:
>
> Hi,
>
> So, I have a string for indexing:
>
>     abc - def   (notice the space on either side of the hyphen)
>
> which is being processed with this filter list:
>
>     <fieldType ... positionIncrementGap="100">
>       <analyzer>
>         <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>                     name="nfkc" mode="compose"/>
>         <tokenizer .../>
>         <filter class="solr.WordDelimiterGraphFilterFactory"
>                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
>                 catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>                 splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>                 pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>         <filter class="solr.ShingleFilterFactory" outputUnigrams="false"
>                 fillerToken=""/>
>         <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1"
>                 consumeAllTokens="false"/>
>       </analyzer>
>     </fieldType>
>
> I get two shingle tokens at the end:
>
>     "abc" "def"
>
> I want to get "abc def". What can I tweak to get this?
>
> Thanks
> Nawab
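Assuming Emir's example is Solr's PatternReplaceCharFilterFactory (the class name is an inference; Erick's other reply also points at PRCFF), the full element would be something like:

```xml
<!-- Sketch, class name assumed: collapse a hyphen and any surrounding
     whitespace to a single space before tokenization, so "abc - def"
     reaches the tokenizer as "abc def" -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="\s*-\s*" replacement=" "/>
```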
Re: Small Tokenization issue
If it's regular, you could try using a PatternReplaceCharFilterFactory (PRCFF), which gets applied to the input before tokenization (note, this is NOT PatternReplaceFilterFactory, which gets applied after tokenization).

I don't really see how you could make this work, though. WhitespaceTokenizer will break "abc def" into "abc" and "def" even if you use PRCFF. WordDelimiterGraphFilterFactory would break up "abc-def" into "abc", "def" and possibly "abcdef" depending on catenateWords' value.

Instead of this, would it answer to use _phrase_ searches when you wanted to find "abc def"?

Best,
Erick

On Wed, Jan 3, 2018 at 12:04 PM, Nawab Zada Asad Iqbal wrote:
> Hi,
>
> So, I have a string for indexing:
>
>     abc - def   (notice the space on either side of the hyphen)
>
> which is being processed with this filter list:
>
>     <fieldType ... positionIncrementGap="100">
>       <analyzer>
>         <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>                     name="nfkc" mode="compose"/>
>         <tokenizer .../>
>         <filter class="solr.WordDelimiterGraphFilterFactory"
>                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
>                 catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>                 splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>                 pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>         <filter class="solr.ShingleFilterFactory" outputUnigrams="false"
>                 fillerToken=""/>
>         <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1"
>                 consumeAllTokens="false"/>
>       </analyzer>
>     </fieldType>
>
> I get two shingle tokens at the end:
>
>     "abc" "def"
>
> I want to get "abc def". What can I tweak to get this?
>
> Thanks
> Nawab
Small Tokenization issue
Hi,

So, I have a string for indexing:

    abc - def   (notice the space on either side of the hyphen)

which is being processed with this filter list:

I get two shingle tokens at the end:

    "abc" "def"

I want to get "abc def". What can I tweak to get this?

Thanks
Nawab
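The schema snippet referred to above can be pieced together from the attribute fragments quoted in the replies. A reconstructed sketch; the element names, the field-type name and class, and the tokenizer class are assumptions (Erick's reply suggests a WhitespaceTokenizer was in use):

```xml
<!-- Reconstruction, not a verbatim copy of the original schema:
     "text_shingle", solr.TextField and the tokenizer class are assumed -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- NFKC-normalize the raw input before tokenization -->
    <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
                name="nfkc" mode="compose"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>  <!-- assumed -->
    <!-- Split on case changes, numerics, and intra-word delimiters;
         keep nothing catenated and drop the original token -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" preserveOriginal="0"
            splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
    <!-- Strip leading/trailing punctuation from each token -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
    <!-- Emit only shingles, with an empty filler token -->
    <filter class="solr.ShingleFilterFactory" outputUnigrams="false" fillerToken=""/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1"
            consumeAllTokens="false"/>
  </analyzer>
</fieldType>
```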