Re: Unicode

Perko, Ralph J Wed, 06 Jun 2012 15:53:10 -0700

Thanks for the help
__________________________________________________
Ralph Perko
Pacific Northwest National Laboratory

From: Adam Fuchs <[email protected]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Unicode

Hi Ralph,

Accumulo itself doesn't do any normalization or encoding. Everything looks like 
byte arrays to Accumulo. The Accumulo shell prints unprintable bytes using the 
\xXX notation, where XX is the hex value of the given byte. This is probably 
what you are seeing. The WikiSearch application includes a bunch of code to 
parse Wikipedia files and canonicalize the encoding of data into Unicode 
before ingesting into Accumulo. That code is mostly in the 
src/examples/wikisearch/ingest/src/main/java/org/apache/accumulo/examples/wikisearch/normalizer
directory. The WikiSearch approach is certainly good enough for a 
demonstration, but this is a big area where a lot of people have done a lot of 
work, and we certainly don't try to recreate that within Accumulo. One other 
place to look for tokenization and normalization libraries is Lucene.
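
To make the \xXX point concrete, here is a minimal Python sketch of how an 
en dash (U+2013) in a title ends up displayed as \xE2\x80\x93 once it is 
stored as UTF-8 bytes. The function name shell_style is hypothetical, and the 
rule it uses (printable ASCII shown as-is, every other byte shown as uppercase 
hex) is an assumption based on the output quoted in this thread, not the 
shell's actual implementation.

```python
def shell_style(data: bytes) -> str:
    """Render bytes as printable ASCII plus \\xXX escapes.

    Assumption: this only mimics the output format quoted in the thread;
    it is not Accumulo's own formatting code.
    """
    out = []
    for b in data:
        if 0x20 <= b <= 0x7E:
            # Printable ASCII is shown unchanged.
            out.append(chr(b))
        else:
            # Any other byte is shown as a two-digit uppercase hex escape.
            out.append("\\x%02X" % b)
    return "".join(out)


# The title from the question, with an en dash (U+2013) between 1975 and 76.
title = "1975\u201376"

# A client that encodes strings as UTF-8 hands these bytes to the store.
encoded = title.encode("utf-8")

rendered = shell_style(encoded)
print(rendered)  # 1975\xE2\x80\x9376
```

The three escaped bytes are simply the UTF-8 encoding of the dash, so seeing 
them does not by itself mean that any normalization step ran.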

Cheers,
Adam

On Thu, May 3, 2012 at 1:30 PM, Perko, Ralph J <[email protected]> wrote:
The formatting got lost in the example - there is supposed to be a dash
(-) between 1975 and 76.

On 5/3/12 10:21 AM, "Perko, Ralph J" <[email protected]> wrote:

>Hi, I have some questions regarding Accumulo and Unicode.
>
>I'm working with the wikisearch example:
>
>Given some article such as: 197576 ...
>
>I see in the Wiki example that the title is normalized and becomes encoded
>as 1975\xE2\x80\x9376
>But if I ingest that same data myself and do not use the Normalizer I get
>the same title that the normalizer produced.  Likewise, if I insert the
>wikipedia data as plain XML and not base64 encoded, I see the same thing,
>specifically where articles link to other languages.  The language
>characters are normalized.
>
>Does Accumulo normalize automatically?  Am I misunderstanding what I am
>seeing?  What is the general guidance for using Accumulo with Unicode
>characters?
>
>Thanks,
>Ralph
>
>
>
