I guess the issue is that we use rapidjson's 'String' support to write out C++ strings, which can hold arbitrary binary data and are not guaranteed to be valid UTF-8. That's somewhat incorrect of us, and we should be base64-encoding such binary data.
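To illustrate the base64 idea, here is a minimal Python sketch, not the actual Kudu change. The helper names, the sample key bytes, and the "partition_key_b64" field name are all hypothetical; the point is only that base64 text is plain ASCII, so the resulting JSON always decodes as UTF-8 and the original bytes can be recovered exactly.

```python
import base64
import json


def encode_key_for_json(raw: bytes) -> str:
    """Return an ASCII-only, JSON-safe representation of arbitrary bytes."""
    return base64.b64encode(raw).decode("ascii")


def decode_key_from_json(text: str) -> bytes:
    """Recover the original bytes from the base64 text."""
    return base64.b64decode(text)


# Hypothetical partition key containing 0x80, which is not valid UTF-8 on its own.
partition_key = b"\x80\x00\x00\x01key"

# "partition_key_b64" is an assumed field name, not one ksck emits.
doc = json.dumps({"partition_key_b64": encode_key_for_json(partition_key)})

# The serialized document is pure ASCII, so decoding it as UTF-8 cannot fail.
doc.encode("ascii").decode("utf-8")

# The round trip gives back exactly the original bytes.
round_tripped = decode_key_from_json(json.loads(doc)["partition_key_b64"])
```

The cost is that the field is no longer human-readable, and existing consumers have to decode it, which is why such a change would be incompatible.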
Fixing this is a little bit incompatible, but for something like partition keys I think we probably should do it anyway and release note it, considering partition keys are quite likely to be invalid UTF8.

-Todd

On Tue, Jun 11, 2019 at 6:08 AM Pavel Martynov <[email protected]> wrote:

> Hi, guys!
>
> We trying to use an output of "kudu cluster ksck master -ksck_format
> json_compact" for integration with our monitoring system and hit a little
> strange. Some part of output can't be read as UTF-8 with Python 3:
> $ kudu cluster ksck master -ksck_format json_compact > kudu.json
> $ python
> with open(' kudu.json', mode='rb') as file:
>     bs = file.read()
>     bs.decode('utf-8')
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position
> 705196: invalid start byte
>
> There how SublimeText shows this block of text:
> https://yadi.sk/i/4zpWKZ37iP8OEA
> As you can see kudu tool encodes zeros as \u0000, but don't encode some
> other non-text bytes.
>
> What do you think about it?
>
> --
> with best regards, Pavel Martynov
>

--
Todd Lipcon
Software Engineer, Cloudera
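Until the output format changes, a consumer can possibly work around the UnicodeDecodeError on the reading side. The following is a sketch under assumptions, not a recommendation from the Kudu project: the function names and the sample document are made up, and it relies on Python's "surrogateescape" error handler, which maps each undecodable byte to a lone surrogate so that the original byte can be restored later.

```python
import json


def load_lenient(raw: bytes):
    """Parse JSON whose string values may contain bytes that are not valid UTF-8."""
    # Undecodable bytes become lone surrogates (0x80 -> U+DC80) instead of raising.
    text = raw.decode("utf-8", errors="surrogateescape")
    return json.loads(text)


def recover_bytes(value: str) -> bytes:
    """Turn a parsed string value back into the bytes it represented."""
    return value.encode("utf-8", errors="surrogateescape")


# Made-up sample mimicking the report: a raw 0x80 byte next to an escaped \u0000.
sample = b'{"key": "\x80\\u0000a"}'
parsed = load_lenient(sample)
key_bytes = recover_bytes(parsed["key"])
```

Strings parsed this way contain lone surrogates, so they cannot be printed or re-encoded as strict UTF-8; convert them back to bytes with recover_bytes before using them.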
