Exactly - Solr does not define the punctuation, UAX#29 defines it, and I
have deciphered the UAX#29 rules and included them in my book. Some
punctuation is always punctuation and always removed, and some is
conditional on context - I tried to lay out all the implied rules.
-- Jack Krupansky
-----Original Message-----
From: Steve Rowe
Sent: Friday, August 23, 2013 12:30 AM
To: [email protected]
Subject: Re: How to avoid underscore sign indexing problem?
Dan,
StandardTokenizer implements the word boundary rules from the Unicode Text
Segmentation standard annex UAX#29:
http://www.unicode.org/reports/tr29/#Word_Boundaries
Every character sequence within UAX#29 boundaries that contains a numeric or
an alphabetic character is emitted as a term, and nothing else is emitted.
Punctuation can be included within a term, e.g. "1,248.99" or "192.168.1.1".
To split on underscores, you can convert underscores to e.g. spaces by
adding PatternReplaceCharFilterFactory to your analyzer:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_"
replacement=" "/>
This replacement will be performed prior to StandardTokenizer, which will
then see token-splitting spaces instead of underscores.
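
For context, here is a sketch of how that charFilter might sit inside a complete field type. The field type name "text_underscore_split", the positionIncrementGap value, and the added LowerCaseFilterFactory are illustrative assumptions, not part of the original suggestion; adjust them to your schema.

```xml
<!-- Hypothetical field type; the name and the lowercase filter are assumptions. -->
<fieldType name="text_underscore_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run on the raw text before the tokenizer sees it:
         every underscore becomes a space. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_"
                replacement=" "/>
    <!-- StandardTokenizer then splits on the spaces per the UAX#29 rules. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a definition along these lines, "Pacific_Rim" should come out as the two terms "pacific" and "rim". Because no separate index and query analyzers are declared, the same replacement applies at both index and query time.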
Steve
On Aug 22, 2013, at 10:23 PM, Dan Davis <[email protected]> wrote:
Ah, but what is the definition of punctuation in Solr?
On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky
<[email protected]> wrote:
"I thought that the StandardTokenizer always split on punctuation, "
Proving that you haven't read my book! The section on the standard
tokenizer details the rules that the tokenizer uses (in addition to
extensive examples.) That's what I mean by "deep dive."
-- Jack Krupansky
-----Original Message-----
From: Shawn Heisey
Sent: Wednesday, August 21, 2013 10:41 PM
To: [email protected]
Subject: Re: How to avoid underscore sign indexing problem?
On 8/21/2013 7:54 PM, Floyd Wu wrote:
When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
ST
text         raw_bytes                           start  end  type        position
pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]  0      11   <ALPHANUM>  1
How can I make this string be tokenized into the two tokens "Pacific" and
"Rim"?
Should I set _ as a stopword?
Please kindly help on this.
Many thanks.
Interesting. I thought that the StandardTokenizer always split on
punctuation, but apparently that's not the case for the underscore
character.
You can always use the WordDelimiterFilter after the StandardTokenizer.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
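
A rough sketch of that approach, assuming a hypothetical field type named "text_wdf"; the attribute values shown are one plausible starting point rather than a recommended configuration:

```xml
<!-- Hypothetical field type; the name and attribute choices are assumptions. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer emits "Pacific_Rim" as a single token. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- WordDelimiterFilter then splits that token on intra-word
         delimiters such as the underscore. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that the filter splits on other delimiters as well, not just underscores, so tokens such as "192.168.1.1" would likely be broken apart too. Check the result on the Analysis page before relying on it.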
Thanks,
Shawn