Unicode

Perko, Ralph J Thu, 03 May 2012 10:23:00 -0700

Hi, I have some questions regarding Accumulo and Unicode.

I'm working with the wikisearch example:


Given some article such as: 1975–76 ...

I see in the Wiki example that the title is normalized and becomes encoded
as 1975\xE2\x80\x9376
But if I ingest that same data myself and do not use the Normalizer I get
the same title that the normalizer produced.  Likewise, if I insert the
wikipedia data as plain XML and not base64 encoded, I see the same thing,
specifically where articles link to other languages.  The language
characters are normalized.
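
For reference, the bytes \xE2\x80\x93 are the UTF-8 encoding of U+2013
EN DASH, so the escaped form above may simply be how the raw stored bytes
are displayed rather than the result of any normalization. The following is
a minimal Python sketch of that idea, not a description of what Accumulo or
the wikisearch Normalizer actually does; the helper name escape_non_ascii is
hypothetical, and NFC is only an assumed stand-in for "a normalizer".

```python
import unicodedata

# The article title, written with an explicit U+2013 EN DASH.
title = "1975\u201376"

# Plain UTF-8 encoding of the title; no normalization is applied here.
raw = title.encode("utf-8")


def escape_non_ascii(data: bytes) -> str:
    """Hypothetical helper: show printable ASCII bytes as-is and every
    other byte as an uppercase \\xNN escape."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"\\x{b:02X}" for b in data
    )


# Escaped view of the un-normalized UTF-8 bytes.
shown = escape_non_ascii(raw)

# NFC normalization (an assumed example form) leaves this title unchanged,
# so normalized and un-normalized ingest would store identical bytes.
nfc = unicodedata.normalize("NFC", title)

print(shown)
print(nfc == title)
```

If this is what is happening, the identical output with and without the
Normalizer would be expected for this particular title.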

Does Accumulo normalize automatically?  Am I misunderstanding what I am
seeing?  What is the general guidance for using Accumulo with Unicode
characters?

Thanks,
Ralph
