Re: parquet using encoding other than UTF-8

2019-02-07 Thread Manik Singla
Thanks all, I decided to convert to UTF-8 only.
Regards
Manik Singla +91-9996008893 +91-9665639677
"Life doesn't consist in holding good cards but playing those you hold well."
On Thu, Feb 7, 2019 at 5:41 AM Ryan Blue wrote:
> Ok, thanks. Sorry I misread what you meant!
>
> On Wed, Feb 6, 201

Re: parquet using encoding other than UTF-8

2019-02-06 Thread Ryan Blue
Ok, thanks. Sorry I misread what you meant!
On Wed, Feb 6, 2019 at 3:02 PM Wes McKinney wrote:
> > I think the right thing is to transcode your data to UTF-8.
>
> I agree -- I wasn't recommending the approach I described, just to say
> that it is not impossible.
>
> On Wed, Feb 6, 2019 at 12:10

Re: parquet using encoding other than UTF-8

2019-02-06 Thread Wes McKinney
> I think the right thing is to transcode your data to UTF-8.
I agree -- I wasn't recommending the approach I described, just to say that it is not impossible.
On Wed, Feb 6, 2019 at 12:10 PM Ryan Blue wrote:
>
> I disagree with Wes. He's right that you *could* just use binary and keep
> extra m

Re: parquet using encoding other than UTF-8

2019-02-06 Thread Ryan Blue
I disagree with Wes. He's right that you *could* just use binary and keep extra metadata somewhere, but it is very unlikely that Parquet would ever support such a scheme. And it is bad for the community when people attempt to go around the format spec, as we see with the INT96 timestamp mess. I think

Re: parquet using encoding other than UTF-8

2019-02-06 Thread Wes McKinney
Hi Manik -- you are free to store the data as opaque binary using the BYTE_ARRAY type and add some metadata to the schema so your readers can recognize that it's UTF-16 stored as binary.
On Wed, Feb 6, 2019 at 12:24 AM Manik Singla wrote:
>
> I am not the producer of data so I can not control encoding. I

Re: parquet using encoding other than UTF-8

2019-02-05 Thread Manik Singla
I am not the producer of the data, so I cannot control the encoding. I receive a ByteBuffer and its encoding. I can decode the data with the given encoding and convert it to UTF-8 for storing with Parquet. I was hoping to remove that overhead if possible.
Regards
Manik Singla +91-9996008893 +91-9665639677
"Life doesn'
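
The decode-and-re-encode step described in this message can be sketched with the Java standard library alone. This is a minimal sketch, not code from the thread: the class name `Utf8Transcoder` and the method `toUtf8` are hypothetical, and it assumes the source encoding arrives as a `java.nio.charset.Charset`. Writing the resulting bytes to Parquet is left out.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Parquet): transcodes text bytes to UTF-8.
final class Utf8Transcoder {

    private Utf8Transcoder() {}

    // Decodes `input` using `sourceCharset` and returns the same text as UTF-8 bytes.
    // Malformed or unmappable input raises an exception instead of being silently
    // replaced, so bad data is not written to the file unnoticed.
    static byte[] toUtf8(ByteBuffer input, Charset sourceCharset) {
        try {
            // duplicate() leaves the caller's position and limit untouched.
            CharBuffer chars = sourceCharset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(input.duplicate());
            ByteBuffer utf8 = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            byte[] out = new byte[utf8.remaining()];
            utf8.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "Input is not valid " + sourceCharset.name(), e);
        }
    }
}
```

For example, `Utf8Transcoder.toUtf8(buffer, StandardCharsets.UTF_16LE)` would return bytes suitable for a column annotated as a UTF-8 string. The cost is one decode and one encode per value, which is the overhead the message refers to.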

Re: parquet using encoding other than UTF-8

2019-02-05 Thread Uwe L. Korn
Hello Manik, this is not possible at the moment. As Parquet is a portable on-disk format, we focus on having a single representation for each data type. Readers and writers therefore only need to handle these representations, which keeps their implementation simpler. Especially as you are the producer but not the con

parquet using encoding other than UTF-8

2019-02-05 Thread Manik Singla
Hi, I am new to Parquet. I am trying to save data in UTF-16 or some encoding other than UTF-8. I am also trying to pass an encoding hint when saving a ByteBuffer. I can't find a way to use anything other than UTF-8. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md says we can extend primitive