Thank you all. By the way, Jack, I'm going to buy your book. Where can I buy it? Floyd
2013/8/22 Jack Krupansky <j...@basetechnology.com>

> "I thought that the StandardTokenizer always split on punctuation,"
>
> Proving that you haven't read my book! The section on the standard
> tokenizer details the rules that the tokenizer uses (in addition to
> extensive examples). That's what I mean by "deep dive."
>
> -- Jack Krupansky
>
> -----Original Message----- From: Shawn Heisey
> Sent: Wednesday, August 21, 2013 10:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to avoid underscore sign indexing problem?
>
> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>
>> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>>
>> ST
>> text        | raw_bytes                          | start | end | type       | position
>> pacific_rim | [70 61 63 69 66 69 63 5f 72 69 6d] | 0     | 11  | <ALPHANUM> | 1
>>
>> How can this string be tokenized into the two tokens "Pacific" and "Rim"?
>> Should I set _ as a stopword?
>> Please kindly help with this.
>> Many thanks.
>
> Interesting. I thought that the StandardTokenizer always split on
> punctuation, but apparently that's not the case for the underscore
> character.
>
> You can always use the WordDelimiterFilter after the StandardTokenizer.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>
> Thanks,
> Shawn
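For anyone following the suggestion above, a minimal schema.xml field-type sketch (the field type name and the extra LowerCaseFilter are illustrative, not from the thread) that chains WordDelimiterFilterFactory after the StandardTokenizer; the word-delimiter filter treats "_" as a delimiter, so "Pacific_Rim" comes out as the two tokens "Pacific" and "Rim":

```xml
<!-- Hypothetical field type: StandardTokenizer keeps "pacific_rim" as one
     ALPHANUM token, then WordDelimiterFilter splits it on the underscore. -->
<fieldType name="text_split_underscore" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnCaseChange="0"
            preserveOriginal="0"/>
    <!-- Optional, but typical: normalize case after splitting -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

You can verify the result in the Solr admin Analysis page for this field type by entering "Pacific_Rim" as the field value.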