Re: Interest in Parquet V3

2024-05-14 Thread wish maple
I also think most of the proposed benefits from these new formats can be achieved using the current Parquet format and improved implementations. My concerns are: 1. For encoding, although many interesting encodings have been introduced, most implementations currently use and implement only PLAIN and

Re: Repeated fields spec clarification

2024-05-11 Thread wish maple
IMO when Page V2 is present or PageIndex is enabled, the boundaries should be checked [1]. [1] https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237 On Sat, May 11, 2024 at 01:15, Jan Finis wrote: > Hey Parquet devs, > > I so far thought that

Re: Reading/Writing ColumnIndexes in Rust

2024-05-07 Thread wish maple
> get > > involved? > > > > ~Fe2O3 > > > > On Tue, May 7, 2024 at 8:51 AM wish maple > wrote: > > > > > You can refer to [1] or [2] for help > > > > > > [1] https://github.com/apache/arrow-rs/tree/master/parquet > >

Re: Reading/Writing ColumnIndexes in Rust

2024-05-07 Thread wish maple
You can refer to [1] or [2] for help. [1] https://github.com/apache/arrow-rs/tree/master/parquet [2] https://github.com/jorgecarleitao/parquet2 Best, Xuwei Fu On Tue, May 7, 2024 at 23:46, Fe2O3 wrote: > Hi, > > Do we have any Rust libraries that allow reading/writing column indexes in > Parquet? If not, is

Re: [VOTE] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64

2024-03-07 Thread wish maple
+1 (non-binding) Best, Xuwei Fu On Thu, Mar 7, 2024 at 21:18, Antoine Pitrou wrote: > > Hello, > > As discussed previously on this ML [1], I am proposing to expand > the types supported by the BYTE_STREAM_SPLIT encoding. The currently > supported types are FLOAT and DOUBLE. The proposal expands the >

Re: [Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY

2024-01-08 Thread wish maple
+1 Best, Xuwei Fu On Tue, Jan 9, 2024 at 10:58, Gang Wu wrote: > +1 > > > What should be the way forward? Should I submit a format update > and then one or two implementations thereof? > > Based on my observation of recent format changes, it usually follows > the steps below: > (1) A PR for a format change.

Re: Pitch for Pcodec Encoding in Parquet

2024-01-02 Thread wish maple
Hi Martin, Parquet has "Compression" and "Encoding" parts. So, is this new method an integer/floating-point encoding that also does some of the compression work? Best, Xuwei Fu On Wed, Jan 3, 2024 at 13:10, Martin Loncaric wrote: > I'd like to propose and get feedback on a new encoding for numerical >

Re: [VOTE] Release Apache Parquet Format 2.10.0 RC0

2023-11-16 Thread wish maple
+1 (non-binding) Thanks, Gang, for the release! Best, Xuwei Fu On Thu, Nov 16, 2023 at 14:07, Gang Wu wrote: > Hi everyone, > > I propose the following RC to be released as the official Apache Parquet > Format 2.10.0 release. > > The commit id is b9c4fa81c3be13dc98760c92b037fa4dd465cef8 > * This corresponds to

Re: Max bitwidth for delta encoding

2023-10-25 Thread wish maple
Hi Ed, As [1] says, DELTA_BINARY_PACKED might not be suitable for all cases. [3] also talks about the same problem. I think that because we already have data like this, this could be compatible. Also, [2] introduces some optimizations for DELTA_BINARY_PACKED. Besides, maybe we can introduce PFor

RE: [C++] Parquet and Arrow overlap

2023-04-12 Thread wish maple
On 2023/02/01 19:27:22 Will Jones wrote: > Hello, > > A while back, the Parquet C++ implementation was merged into the Apache > Arrow monorepo [1]. As I understand it, this helped the development process > immensely. However, I am noticing some governance issues because of it. > > First, it's not

RE: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-27 Thread wish maple
+1 for an uncompressed size for the field. However, it's a bit tricky here. I've implemented a similar size hint in our system; here are some problems I met: 1. Null values. In an Arrow Array, a null value still occupies some space, but the field's raw size cannot represent that. 2. Size of

RE: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread wish maple
+1 for an uncompressed size for the field. However, it's a bit tricky here. I've implemented a similar size hint in our system; here are some problems I met: 1. Null values. In an Arrow Array, a null value still occupies some space, but the field's raw size cannot represent that. 2. Size of

[DISCUSS] ByteStreamSplitDecoder broken in presence of nulls

2023-02-09 Thread wish maple
This problem is shown in this issue: https://github.com/apache/arrow/issues/15173 Let me talk about it briefly: * The encoder doesn't write "num_values" in the page payload for BYTE_STREAM_SPLIT, but uses "num_values" as the stride in BYTE_STREAM_SPLIT * When decoding, for DATA_PAGE_V2, it can know the
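
For readers unfamiliar with the stride issue mentioned in this snippet, here is a minimal Python sketch of why the stride matters. It is not the Arrow C++ code: the helper names bss_encode and bss_decode are hypothetical, and it assumes 4-byte little-endian floats and a page whose nulls are not stored in the payload. It only illustrates that the stride must equal the number of values actually encoded, not a count that includes nulls.

```python
import struct

WIDTH = 4  # bytes per float32 value (assumption for this sketch)

def bss_encode(values):
    """Byte-stream-split encode non-null float32 values (hypothetical helper)."""
    raw = [struct.pack("<f", v) for v in values]
    # Stream k holds byte k of every value; the streams are concatenated.
    return b"".join(bytes(r[k] for r in raw) for k in range(WIDTH))

def bss_decode(data, stride):
    """Decode, treating `stride` as the number of values in each stream."""
    if stride * WIDTH != len(data):
        raise ValueError("stride does not match payload size")
    out = []
    for i in range(stride):
        # Gather byte i of each stream to rebuild value i.
        b = bytes(data[k * stride + i] for k in range(WIDTH))
        out.append(struct.unpack("<f", b)[0])
    return out

column = [1.0, None, 2.5, None, -3.0]            # 5 slots, 3 non-null
non_null = [v for v in column if v is not None]
payload = bss_encode(non_null)                    # only non-null values are stored

# Correct: stride derived from the values actually present in the payload.
decoded = bss_decode(payload, stride=len(payload) // WIDTH)

# Incorrect: a count that includes nulls does not describe the payload layout.
try:
    bss_decode(payload, stride=len(column))
    wrong_stride_rejected = False
except ValueError:
    wrong_stride_rejected = True
```

In this sketch the mismatch is caught by an explicit size check; a decoder without such a check would read bytes from the wrong streams or past the end of the buffer.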