Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

2013-04-30 Thread meghana

Thanks, Jack Krupansky, it's very helpful :)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557p4060011.html
Sent from the Solr - User mailing list archive at Nabble.com.

Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

2013-04-24 Thread meghana

I have configured WordDelimiterFilterFactory with custom character types for
'&' and '-', and for a few characters (such as . _ :) we need to split on
word boundaries only.

e.g. 
test.com (should be tokenized to test.com)
newyear.  (should be tokenized to newyear)
new_car (should be tokenized to new_car)
..
..

Below is the definition of the text field:

<fieldType name="text_general_preserved" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"
            protected="protwords_general.txt"
            types="wdfftypes_general.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"
            protected="protwords_general.txt"
            types="wdfftypes_general.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Below is the content of wdfftypes_general.txt:

& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM

The types that can be used in WordDelimiterFilterFactory are LOWER, UPPER,
ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM, but there is no description
available of how each type is used. Going by the name, I thought
SUBWORD_DELIM might meet my need, but it doesn't seem to work.

Can anybody suggest how I can configure WordDelimiterFilterFactory to meet
this requirement?

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

2013-04-24 Thread Jack Krupansky

The WDF types will treat a character the same regardless of where it 
appears.
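
To make that concrete, here is a minimal Python sketch of a per-character type table. It is a deliberately simplified model and not Solr's actual WordDelimiterFilter code: the names `TYPES`, `char_type`, and `split_on_types` are hypothetical, only the ALPHA and SUBWORD_DELIM types are modelled, and letters and digits are lumped together as ALPHA. The point is only that the lookup sees the character and never its position.

```python
# Simplified model of a per-character type table (hypothetical names,
# not Solr's implementation). Mirrors part of wdfftypes_general.txt.
TYPES = {
    "-": "ALPHA",
    "_": "SUBWORD_DELIM",
    ":": "SUBWORD_DELIM",
    ".": "SUBWORD_DELIM",
}


def char_type(ch):
    """Return the type of a character; position is never consulted."""
    if ch in TYPES:
        return TYPES[ch]
    # Assumption for this sketch: letters and digits behave like ALPHA,
    # everything else behaves like a delimiter.
    return "ALPHA" if ch.isalnum() else "SUBWORD_DELIM"


def split_on_types(token):
    """Split a token at every delimiter character and drop the delimiter."""
    parts, current = [], []
    for ch in token:
        if char_type(ch) == "SUBWORD_DELIM":
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


for word in ("test.com", "newyear.", "new_car", "e-mail"):
    print(word, "->", split_on_types(word))
```

In this model the embedded dot in test.com is split exactly like the trailing dot in newyear., which is why a types file alone cannot express "split only on boundaries".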


For something conditional, like a dot between letters vs. a dot not preceded
and followed by a letter, you need either a custom tokenizer or a
character filter.


Interesting that although the standard tokenizer messes up embedded hyphens, 
it does handle the embedded dot vs. trailing dot case as you wish (but 
messes up U.S.A. by stripping the trailing dot) - but that doesn't help 
your case.


A character filter like the following might help your case:

<fieldType name="text_ws_dot" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(^|[^\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(^|[^\w\d])[\._&amp;]+([\w\d])" replacement="$1 $2"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I'm not a regular expression expert, so I'm not sure whether/how those 
patterns could be combined.


Also, that doesn't allow the case of a single ".", "&", or "_" as a word, 
but you didn't specify how that case should be handled.
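
The effect of the three patterns can be checked outside Solr with a short Python sketch. Several things here are assumptions: Python's `re` module stands in for the Java regex engine that Solr uses, the replacement `$1 $2` is written as `\1 \2`, the garbled `amp;` in the patterns is read as an XML-escaped `&`, and `str.split()` stands in for the whitespace tokenizer. The names `FILTERS` and `analyze` are hypothetical.

```python
import re

# The three char filter patterns, applied in order. re.ASCII keeps \w
# limited to [A-Za-z0-9_], as in Java's default regex mode.
FILTERS = [
    (r"([\w\d])[._&]+($|[^\w\d])", r"\1 \2"),     # trailing delimiters
    (r"(^|[^\w\d])[._&]+($|[^\w\d])", r"\1 \2"),  # standalone delimiters
    (r"(^|[^\w\d])[._&]+([\w\d])", r"\1 \2"),     # leading delimiters
]


def analyze(text):
    """Run the replacements in sequence, then split on whitespace."""
    for pattern, replacement in FILTERS:
        text = re.sub(pattern, replacement, text, flags=re.ASCII)
    return text.split()


print(analyze("test.com newyear. new_car"))
print(analyze("_lead trail_ mid_dle"))
print(analyze("a & b"))
```

Under these assumptions, embedded delimiters survive (test.com, new_car, mid_dle), leading and trailing ones are stripped, and a delimiter standing alone, as in "a & b", disappears entirely, which is the case mentioned above.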




-- Jack Krupansky