Hi, I have some questions regarding Accumulo and Unicode.

I'm working with the wikisearch example:

Given some article such as: 1975–76 ...

I see in the wikisearch example that the title is normalized and becomes
encoded as 1975\xE2\x80\x9376.
But if I ingest that same data myself and do not use the Normalizer, I get
the same title that the Normalizer produced.  Likewise, if I insert the
Wikipedia data as plain XML rather than base64-encoded, I see the same thing,
specifically where articles link to other languages: the non-ASCII
characters in those links appear normalized in the same way.
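
For reference, here is a minimal sketch (Python 3, not part of the wikisearch
example) of how the title above maps to the escaped form I am seeing. It
assumes the dash in the title is U+2013 EN DASH and that the display simply
escapes every non-ASCII byte as \xNN; the helper escape_non_ascii is a
hypothetical name of my own, not an Accumulo API.

```python
# Hypothetical helper for illustration only; not part of Accumulo or wikisearch.
def escape_non_ascii(data: bytes) -> str:
    """Render printable ASCII bytes as-is and every other byte as \\xNN."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else f"\\x{b:02X}" for b in data)

title = "1975\u201376"        # assumption: the dash is U+2013 EN DASH
raw = title.encode("utf-8")   # UTF-8 bytes of the title
escaped = escape_non_ascii(raw)
print(escaped)                # 1975\xE2\x80\x9376
```

If that assumption holds, the \xE2\x80\x93 sequence is just the plain UTF-8
encoding of the dash, which would be produced with or without a normalizer.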

Does Accumulo normalize automatically?  Am I misunderstanding what I am
seeing?  What is the general guidance for using Accumulo with Unicode
characters?

Thanks,
Ralph
 

