Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundaries
Thanks Jack Krupansky, it's very helpful :)

Jack Krupansky-2 wrote:
> The WDF types will treat a character the same regardless of where it appears. For something conditional, like a dot between letters vs. a dot not both preceded and followed by a letter, you either have to have a custom tokenizer or a character filter.
-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557p4060011.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundaries
I have configured WordDelimiterFilterFactory with custom tokenizer types for '&' and '-', and for a few delimiters (like . _ :) we need to split on boundaries only, e.g.:

test.com (should be tokenized to test.com)
newyear. (should be tokenized to newyear)
new_car (should be tokenized to new_car)

Below is the definition of the text field:

<fieldType name="text_general_preserved" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="0" protected="protwords_general.txt" types="wdfftypes_general.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="0" protected="protwords_general.txt" types="wdfftypes_general.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Below is the wdfftypes_general.txt content:

& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM

The types that can be used in WordDelimiter are LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM. There's no description available for the use of each type. Going by the name, I thought the type SUBWORD_DELIM might fulfill my need, but it doesn't seem to work. Can anybody suggest how I can configure the WordDelimiter factory to fulfill my requirement? Thanks.
-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundaries
The WDF types will treat a character the same regardless of where it appears. For something conditional, like a dot between letters vs. a dot not both preceded and followed by a letter, you either have to have a custom tokenizer or a character filter. Interestingly, although the standard tokenizer messes up embedded hyphens, it does handle the embedded-dot vs. trailing-dot case as you wish (but messes up U.S.A. by stripping the trailing dot) - but that doesn't help your case. A character filter like the following might help:

<fieldType name="text_ws_dot" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|[^\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(^|[^\w\d])[\._&amp;]+([\w\d])" replacement="$1 $2"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I'm not a regular expression expert, so I'm not sure whether/how those patterns could be combined. Also, that doesn't allow the case of a single '.', '&', or '_' as a word - but you didn't specify how that case should be handled.

-- Jack Krupansky

-----Original Message-----
From: meghana
Sent: Wednesday, April 24, 2013 6:49 AM
To: solr-user@lucene.apache.org
Subject: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundaries

> I have configured WordDelimiterFilterFactory with custom tokenizer types for '&' and '-', and for a few delimiters (like . _ :) we need to split on boundaries only, e.g. test.com (should be tokenized to test.com), newyear. (should be tokenized to newyear), new_car (should be tokenized to new_car).
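Jack's three char-filter patterns can be tried outside Solr, since Python and Java regexes agree on these constructs. The sketch below simulates the chained charFilters plus the whitespace tokenizer; the '&' delimiter is an assumption, reconstructed from the HTML-escaped remnants in the archived mail.

```python
import re

# The three PatternReplaceCharFilter patterns from the message above,
# applied in order as chained <charFilter>s would be. '&' as the third
# delimiter is an assumption (lost to HTML escaping in the archive).
PATTERNS = [
    (r'([\w\d])[\._&]+($|[^\w\d])', r'\1 \2'),     # delimiters trailing a word
    (r'(^|[^\w\d])[\._&]+($|[^\w\d])', r'\1 \2'),  # delimiters standing alone
    (r'(^|[^\w\d])[\._&]+([\w\d])', r'\1 \2'),     # delimiters leading a word
]

def char_filter(text):
    """Apply the three replacements in sequence."""
    for pattern, repl in PATTERNS:
        text = re.sub(pattern, repl, text)
    return text

def tokenize(text):
    """WhitespaceTokenizerFactory stand-in: split the filtered text on whitespace."""
    return char_filter(text).split()
```

With these patterns, `tokenize("test.com")` keeps the embedded dot while `tokenize("newyear.")` drops the trailing one, which is the boundary-only behavior the original question asked for.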
Re: WordDelimiterFactory
Ashok: You really, _really_ need to dive into the admin/analysis page. That'll show you exactly what WDFF (and all the other elements of your chain) do to input tokens. Understanding the index-time and query-time implications of all the settings in WDFF takes a while. But from what you're describing, WDFF may not be what you're looking for anyway; some of the regex filters could split, for instance, on all non-alphanumeric characters.

Best,
Erick

On Wed, Apr 17, 2013 at 12:25 AM, Shawn Heisey s...@elyograg.org wrote:
> I have a fieldType that is using WDF with the following settings on the index side. Both index and query analysis show it behaving correctly with terms that start with numbers, on versions 4.2.1 and 3.5.0.
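Erick's alternative can be modeled in miniature: instead of WDF, a pattern-based tokenizer that breaks on every run of non-alphanumeric characters. This is a Python sketch of roughly what solr.PatternTokenizerFactory would do with a pattern like "[^A-Za-z0-9]+" (the pattern value is an illustrative assumption, not from the thread).

```python
import re

# Tokenize by splitting on every run of non-alphanumeric characters,
# a rough stand-in for a regex-based Solr tokenizer. Empty strings from
# leading/trailing delimiters are filtered out.
def pattern_tokenize(text):
    return [t for t in re.split(r'[^A-Za-z0-9]+', text) if t]
```

For Ashok's case this yields `['20x', '30y']` from "20x-30y" without any of WDF's type-dependent behavior, at the cost of splitting on every delimiter everywhere.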
Re: WordDelimiterFactory
Yes, thank you Erick. The analysis/document handlers hold the key to deciding the type and order of the filters to employ, given one's document set and the subject matter at hand. The finalized terms they produce for SOLR search, MLT, etc. are crucial to the quality of the results.

- ashok

-- View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529p4057349.html Sent from the Solr - User mailing list archive at Nabble.com.
WordDelimiterFactory
Hi, why does WDF swallow all 'words' that start with a 'digit'? My config is:

<filter class="solr.WordDelimiterFilterFactory" generateNumberParts="0" splitOnNumerics="0" splitOnCaseChange="0" preserveOriginal="0" protected="protwords.txt"/>

For text like 20x-30y I am expecting (and want) '20x' and '30y' to be retained as the tokens after WDF is done with it. But I get nothing, as per the analysis page. Any idea why? I am using 4.1.

Thanks
- ashok

-- View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: WordDelimiterFactory
Because you told it to!!! With: generateNumberParts="0"

WDF is tricky... Tell us exactly what rules you want it to follow and then we can tell you how to set the options. Maybe more to the point: why exactly do you think you want to use WDF? Not that there aren't good reasons, but what specifically are yours? Generally, see the schema in the Solr example for suggested best practices. Copy and paste from there, or, better yet, use exactly the types that are there.

-- Jack Krupansky

-----Original Message-----
From: Ashok
Sent: Tuesday, April 16, 2013 7:52 PM
To: solr-user@lucene.apache.org
Subject: WordDelimiterFactory

> Hi, why does WDF swallow all 'words' that start with a 'digit'?
Re: WordDelimiterFactory
Thank you Jack, yes it is tricky. If my text is x20-y30, I get two nice tokens, x20 and y30, that I need to keep. But the text 20x-30y is treated differently and I get nothing; 20x-y30 gives me just 'y30'. The docs on LucidWorks say:

generateNumberParts: (integer, default 1) If non-zero, splits numeric strings at delimiters: "1947-32" -> "1947", "32"

It looks like any 'word' that starts with a digit is treated as a numeric string. Setting generateNumberParts="1" instead of "0" seems to generate the right tokens in this case, but I need to see if it has any other impacts on the finalized token list...

Thanks
- ashok

-- View this message in context: http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529p4056544.html Sent from the Solr - User mailing list archive at Nabble.com.
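The behavior ashok describes can be captured in a toy model: WDF splits "20x-30y" on the hyphen, and (per his reading of the docs) a part that starts with a digit is treated as a numeric string, so generateNumberParts=0 discards it. This sketches the rule as stated in the thread, not the real WordDelimiterFilter code.

```python
# Toy model of the thread's observation: parts starting with a digit are
# "numeric", everything else is a "word"; each flag controls whether that
# kind of part is emitted. Not the actual WDF algorithm.
def wdf_parts(token, generate_word_parts=1, generate_number_parts=1):
    out = []
    for part in token.split('-'):
        if not part:
            continue
        if part[0].isdigit():           # "starts with a digit" => numeric string
            if generate_number_parts:
                out.append(part)
        elif generate_word_parts:
            out.append(part)
    return out
```

Under this model, `wdf_parts("x20-y30", generate_number_parts=0)` keeps both tokens, `wdf_parts("20x-30y", generate_number_parts=0)` returns nothing, and `wdf_parts("20x-y30", generate_number_parts=0)` keeps only 'y30', exactly the three cases reported above.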
Re: WordDelimiterFactory
On 4/16/2013 8:12 PM, Ashok wrote:
> It looks like any 'word' that starts with a digit is treated as a numeric string. Setting generateNumberParts=1 instead of 0 seems to generate the right tokens in this case but need to see if it has any other impacts on the finalized token list...

I have a fieldType that is using WDF with the following settings on the index side. Both index and query analysis show it behaving correctly with terms that start with numbers, on versions 4.2.1 and 3.5.0:

<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>

It has different settings on the query side, but generateNumberParts is 1 for both:

<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="0"/>

I haven't tried it with generateNumberParts set to 0.

Thanks,
Shawn
Regarding WordDelimiterFactory
Hello, I have a file with the input string 91{40}9490949090, and I want this file to be returned when I search for the query string +91?40?9*. The problem is that the input string is getting indexed as 3 terms: 91, 40, 9490949090. Is there a way to treat { and } as part of the string itself? Can we configure WordDelimiterFilterFactory *not to consider* curly braces as delimiters?

Thanks,
Sandhya
Re: Regarding WordDelimiterFactory
On Thu, Sep 9, 2010 at 3:57 AM, Sandhya Agarwal sagar...@opentext.com wrote:
> Hello, I have a file with the input string 91{40}9490949090, and I want this file to be returned when I search for the query string +91?40?9*. Can we configure WordDelimiterFilterFactory *not to consider* curly braces as delimiters?

See: https://issues.apache.org/jira/browse/SOLR-2059

As a workaround, if you don't want to use trunk, you could also turn on preserveOriginal.

-- Robert Muir
rcm...@gmail.com
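The indexing behavior Sandhya reports, and Robert's preserveOriginal workaround, can be illustrated with a minimal split: the braces act as delimiters, so the input breaks into three digit runs, and preserveOriginal keeps the intact token alongside the parts. A sketch, not the actual WordDelimiterFilter.

```python
import re

# Split on '{' and '}' the way a delimiter-based filter would, dropping
# the empty strings that splitting can produce. With preserve_original
# the unmodified input is kept as an extra token, mirroring the
# preserveOriginal="1" workaround.
def split_on_braces(text, preserve_original=False):
    parts = [p for p in re.split(r'[{}]', text) if p]
    if preserve_original:
        parts.append(text)
    return parts
```

With preserve_original enabled, a wildcard query that matches the whole string "91{40}9490949090" can still hit the document even though the three parts are indexed separately.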
Re: Regarding WordDelimiterFactory
Set generateWordParts=0 and generateNumberParts=0.

- Grijesh

-- View this message in context: http://lucene.472066.n3.nabble.com/Regarding-WordDelimiterFactory-tp1444694p1444742.html Sent from the Solr - User mailing list archive at Nabble.com.