From Kudu's perspective, I think the intent is for STRING to enforce UTF-8 encoding and if that is inappropriate for your use case, you should use BINARY (which is effectively STRING minus that enforcement). The fact that the C++ client doesn't enforce the encoding is a "bug" rather than a "feature". Though, looking at this more deeply, what actually happens if you try to shoehorn your HLL intermediates into the Java STRING APIs? Does the data actually get mangled, and if so, is it at write time, or at scan time?
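
To make the mangling question concrete, here is a minimal sketch in plain Java of what a UTF-8 round trip does to arbitrary bytes. It does not call the Kudu client at all; it only assumes, per the original message, that the Java client encodes STRING values as UTF-8. The class and method names (Utf8RoundTrip, viaUtf8, encodeLatin1AsUtf8, decodeUtf8ToLatin1) and the sample bytes are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only: no Kudu APIs are used here.
class Utf8RoundTrip {
    // Treat raw bytes as if they were UTF-8 text, then encode back to UTF-8.
    // Invalid sequences are replaced with U+FFFD during decoding, so this is lossy.
    public static byte[] viaUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Map every byte to exactly one char (ISO-8859-1), then encode the chars as UTF-8.
    // Bytes >= 0x80 become two-byte sequences, so the stored value grows.
    public static byte[] encodeLatin1AsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
    }

    // Inverse of encodeLatin1AsUtf8: this is the per-value decode a reader would need.
    public static byte[] decodeUtf8ToLatin1(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Sample "binary" payload containing bytes that are not valid UTF-8.
        byte[] raw = {(byte) 0x00, (byte) 0x7f, (byte) 0x80, (byte) 0xff, (byte) 0xc3, (byte) 0x28};

        byte[] lossy = viaUtf8(raw);
        System.out.println("direct UTF-8 round trip preserved bytes: " + Arrays.equals(raw, lossy));

        byte[] stored = encodeLatin1AsUtf8(raw);
        byte[] restored = decodeUtf8ToLatin1(stored);
        System.out.println("Latin-1 workaround preserved bytes: " + Arrays.equals(raw, restored));
        System.out.println("raw length = " + raw.length + ", stored length = " + stored.length);
    }
}
```

If the intermediates are turned into a Java String by decoding them as UTF-8, the damage happens in that conversion, before any client call. If they are mapped one byte per char instead, the bytes survive but the stored value is larger and has to be decoded on every read, which would match the UDAF overhead described in the original message.
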

Of course, Kudu doesn't operate in a vacuum so Impala's considerations are important too. Unfortunately, there doesn't appear to have been any progress on IMPALA-5323, which would be the clearest path forward. Maybe you could update that ticket with your use case and hopefully get the attention of some Impala developers?

On Mon, Dec 16, 2019 at 10:16 AM Cliff Resnick <cre...@gmail.com> wrote:
>
> Hi Kudu team,
>
> We use Kudu with Impala, and usually update Kudu through the Java api. We
> store some binary HLL intermediates in Kudu, but must use String type since
> Impala does not have a Binary type. Kudu's java client forces UTF-8 encoding
> and we have a C++ UDAF in Impala that must decode Kudu's UTF-8 on every value.
>
> It looks like UTF-8 is not enforced in Kudu's C++ client, so I'm wondering
> why we could not have control over the String encoding in Java as well? As-is
> it looks like we'd have to fork the java code to add this support. Or is
> there another way?
>
> -Cliff