Thank you all.
By the way, Jack, I'm going to buy your book. Where can I buy it?
Floyd

2013/8/22 Jack Krupansky <j...@basetechnology.com>

> "I thought that the StandardTokenizer always split on punctuation, "
>
> Proving that you haven't read my book! The section on the standard
> tokenizer details the rules that the tokenizer uses (in addition to
> extensive examples.) That's what I mean by "deep dive."
>
> -- Jack Krupansky
>
> -----Original Message----- From: Shawn Heisey
> Sent: Wednesday, August 21, 2013 10:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to avoid underscore sign indexing problem?
>
>
> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>
>> When using StandardAnalyzer to tokenize string "Pacific_Rim" will get
>>
>> ST
>> text        | raw_bytes                          | start | end | type       | position
>> pacific_rim | [70 61 63 69 66 69 63 5f 72 69 6d] | 0     | 11  | <ALPHANUM> | 1
>>
>> How to make this string to be tokenized to these two tokens "Pacific",
>> "Rim"?
>> Set _ as stopword?
>> Please kindly help on this.
>> Many thanks.
>>
>
> Interesting.  I thought that the StandardTokenizer always split on
> punctuation, but apparently that's not the case for the underscore
> character.
>
> You can always use the WordDelimiterFilter after the StandardTokenizer.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>
> Thanks,
> Shawn
>
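Shawn's suggestion above — chaining a WordDelimiterFilter after the StandardTokenizer so that "Pacific_Rim" splits on the underscore — can be sketched as a Solr fieldType. This is a minimal example, not from the thread; the field type name is made up, and you would merge the filter into your own analyzer chain:

```xml
<!-- Illustrative fieldType: StandardTokenizer keeps "pacific_rim" as one
     token, so WordDelimiterFilterFactory splits it into "pacific"/"rim".
     The name "text_split_underscore" is hypothetical. -->
<fieldType name="text_split_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- generateWordParts="1" splits on delimiters such as "_";
         catenateWords="0" means no joined "pacificrim" token is added -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            catenateWords="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

You can verify the resulting token stream for an input like "Pacific_Rim" on the Solr analysis admin page, the same place the output quoted above came from.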
