Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires
Thanks Jack Krupansky, it's very helpful :)

Jack Krupansky-2 wrote:

The WDF types will treat a character the same regardless of where it appears. For something conditional, like a dot between letters vs. a dot not preceded and followed by a letter, you have to use either a custom tokenizer or a character filter.

Interestingly, although the standard tokenizer messes up embedded hyphens, it does handle the embedded dot vs. trailing dot case as you wish (but messes up U.S.A. by stripping the trailing dot), though that doesn't help your case.

A character filter like the following might help your case:

    <fieldType name="text_ws_dot" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="([\w\d])[\._&amp;]+($|[^\w\d])"
                    replacement="$1 $2" />
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(^|[^\w\d])[\._&amp;]+($|[^\w\d])"
                    replacement="$1 $2" />
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(^|[^\w\d])[\._&amp;]+([\w\d])"
                    replacement="$1 $2" />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

I'm not a regular expression expert, so I'm not sure whether/how those patterns could be combined. Also, that doesn't allow the case of a single ., &, or _ as a word, but you didn't specify how that case should be handled.

-- Jack Krupansky

-----Original Message-----
From: meghana
Sent: Wednesday, April 24, 2013 6:49 AM
To: solr-user@lucene.apache.org
Subject: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

I have configured WordDelimiterFilterFactory with custom tokenizers for '&' and '-', and for a few tokenizers (like . _ :) we need to split on boundaries only, e.g.

    test.com (should be tokenized to test.com)
    newyear. (should be tokenized to newyear)
    new_car (should be tokenized to new_car)
    ..
    ..

Below is the definition for the text field:

    <fieldType name="text_general_preserved" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                enablePositionIncrements="false" />
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                preserveOriginal="0"
                protected="protwords_general.txt" types="wdfftypes_general.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                enablePositionIncrements="false" />
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                preserveOriginal="0"
                protected="protwords_general.txt" types="wdfftypes_general.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Below is the content of wdfftypes_general.txt:

    & => ALPHA
    - => ALPHA
    _ => SUBWORD_DELIM
    : => SUBWORD_DELIM
    . => SUBWORD_DELIM

The types that can be used in the word delimiter are LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM. There's no description available for the use of each type. Going by the name, I thought type SUBWORD_DELIM might fulfill my need, but it doesn't seem to work. Can anybody suggest how I can configure the word delimiter factory to fulfill my requirement?

Thanks.

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html
Sent from the Solr - User mailing list archive at Nabble.com.
-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557p4060011.html Sent from the Solr - User mailing list archive at Nabble.com.
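The character-filter chain suggested in the reply can be tried outside Solr before touching the schema. The following is only a rough sketch, not Solr's implementation: it uses Python's `re` module as a stand-in for the Java regex engine behind PatternReplaceCharFilterFactory, it assumes the `amp;` in the archived patterns is what remains of `&amp;` (so the character class is `[._&]`), and the function names `strip_boundary_delimiters` and `analyze` are made up for illustration. Whitespace splitting stands in for WhitespaceTokenizerFactory.

```python
import re

# The three patterns from the suggested charFilter chain, applied in order.
# Assumption: the character class is [._&] ("amp;" read as a damaged "&amp;").
PATTERNS = [
    r"([\w\d])[\._&]+($|[^\w\d])",     # delimiters trailing a word character
    r"(^|[^\w\d])[\._&]+($|[^\w\d])",  # delimiters standing on their own
    r"(^|[^\w\d])[\._&]+([\w\d])",     # delimiters leading a word character
]


def strip_boundary_delimiters(text):
    """Replace boundary delimiters with a space, like replacement="$1 $2"."""
    for pattern in PATTERNS:
        text = re.sub(pattern, r"\1 \2", text)
    return text


def analyze(text):
    """Hypothetical stand-in for the analyzer: char filters, then whitespace split."""
    return strip_boundary_delimiters(text).split()


if __name__ == "__main__":
    for sample in ["test.com", "newyear.", "new_car", ".net", "a & b"]:
        print(sample, "->", analyze(sample))
```

With these assumptions, the three examples from the original question come out as requested (test.com, newyear, new_car), and a standalone `&` disappears entirely, which is the limitation mentioned at the end of the reply. Results inside Solr should still be checked on the Analysis screen, since the Java and Python regex engines are not guaranteed to agree on every input.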