Re: ICUTokenizer labels number as Han character?

Robert Muir Wed, 19 Dec 2012 15:05:34 -0800

Your attachment didn't come through: I think the list strips them.
Maybe just open a JIRA issue and attach your screenshots? Or put them
elsewhere and just include a link?

As far as the ultimate behavior, I think it's correct. Keep in mind
tokens don't really get a script value: runs of untokenized text do.
"common" is stuff like numbers/punctuation/etc that just keeps the run
whatever it was before (e.g. Han).
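
To illustrate the run-based labeling, here is a minimal Python sketch, not
the ICUTokenizer code itself. The names script_of and script_runs are
hypothetical, the script lookup is a toy that only knows a few ranges, and
the rule that leading "common" characters take the script of the first
non-common character after them is an assumption about how "1986" ends up
labeled Han.

```python
def script_of(ch):
    """Toy script lookup (assumption): only covers what this example needs."""
    if '\u4e00' <= ch <= '\u9fff':
        return "Han"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return "Common"  # digits, punctuation, whitespace, everything else


def script_runs(text):
    """Split text into (script, run) pairs; Common never starts a new run."""
    runs = []  # each entry is [script, run_text]
    for ch in text:
        s = script_of(ch)
        if runs and (s == "Common" or runs[-1][0] == "Common" or runs[-1][0] == s):
            if runs[-1][0] == "Common":
                # A run that so far held only Common characters is resolved
                # by the first non-common character (assumed behavior).
                runs[-1][0] = s
            runs[-1][1] += ch
        else:
            runs.append([s, ch])
    return [(s, t) for s, t in runs]


print(script_runs("1986年"))
print(script_runs("abc 123 年"))
```

In this sketch "1986年" comes out as a single Han run, so any token cut
from that run, including the number, reports Han as its script.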

And the bigram filter only bigrams text with certain token types (NUM
is not one of them), so making a singleton is correct.
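
A rough Python sketch of that type check follows. It is a simplification
and not the CJKBigramFilter source: the bigram function is hypothetical, it
ignores offsets, and it assumes the tokenizer emits one token per Han
character with types written like "<IDEOGRAPHIC>" and "<NUM>".

```python
# Token types that are eligible for bigramming in this sketch.
BIGRAM_TYPES = {"<IDEOGRAPHIC>", "<HIRAGANA>", "<KATAKANA>", "<HANGUL>"}


def bigram(tokens):
    """tokens: list of (text, type). Returns a list of (text, type)."""
    out = []
    run = []  # consecutive tokens whose type is eligible for bigramming

    def flush():
        if len(run) == 1:
            # A lone eligible token has no neighbor to pair with.
            out.append((run[0], "<SINGLE>"))
        else:
            # Overlapping pairs; an empty run produces nothing.
            for a, b in zip(run, run[1:]):
                out.append((a + b, "<DOUBLE>"))
        run.clear()

    for text, typ in tokens:
        if typ in BIGRAM_TYPES:
            run.append(text)
        else:
            flush()                  # other types break the run
            out.append((text, typ))  # and pass through unchanged
    flush()
    return out


print(bigram([("1986", "<NUM>"), ("年", "<IDEOGRAPHIC>")]))
```

Because the NUM token is not eligible, it passes through untouched and the
single Han token after it has nothing to pair with.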

On Wed, Dec 19, 2012 at 5:10 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> Hello,
>
> Don't know if the Solr admin panel is lying, or if this is a weird bug.
> The string: "1986年"  gets analyzed by the ICUTokenizer with "1986" being
> identified as type:NUM and script:Han.  Then the CJKBigram filter identifies
> "1986" as type:Num and script:Han and "年" as type:Single and script: Common.
>
> This doesn't seem right.   Couldn't fit the whole analysis output on one
> screen so there are two screenshots attached.
>
> Any clues as to what is going on and whether it is a problem?
>
> Tom
