I also think most of the proposed benefits from these new formats can be
achieved with the current Parquet format and improved implementations.
My concern is that:
1. For encoding, though so many interesting encodings have been introduced,
most implementations now just use and implement PLAIN and
IMO when Page V2 is present or PageIndex is enabled, the boundaries
should be checked [1]
[1]
https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
Jan Finis wrote on Sat, May 11, 2024 at 01:15:
> Hey Parquet devs,
>
> I so far thought that
You can refer to [1] or [2] for help
[1] https://github.com/apache/arrow-rs/tree/master/parquet
[2] https://github.com/jorgecarleitao/parquet2
Best,
Xuwei Fu
Fe2O3 wrote on Tue, May 7, 2024 at 23:46:
> Hi,
>
> Do we have any Rust libraries that allow reading/writing column indexes in
> Parquet? If not, is
+1 (non-binding)
Best,
Xuwei Fu
Antoine Pitrou wrote on Thu, Mar 7, 2024 at 21:18:
>
> Hello,
>
> As discussed previously on this ML [1], I am proposing to expand
> the types supported by the BYTE_STREAM_SPLIT encoding. The currently
> supported types are FLOAT and DOUBLE. The proposal expands the
>
+1
Best,
Xuwei Fu
Gang Wu wrote on Tue, Jan 9, 2024 at 10:58:
> +1
>
> > What should be the way forward? Should I submit a format update
> and then one or two implementations thereof?
>
> Based on my observation of recent format changes, it usually follows
> the steps below:
> (1) A PR for a format change.
Hi Martin,
Parquet has "Compression" and "Encoding" parts. So, this new
method is a part of integer/float-point encoding, but also doing some
compression workload?
Best,
Xuwei Fu
Martin Loncaric wrote on Wed, Jan 3, 2024 at 13:10:
> I'd like to propose and get feedback on a new encoding for numerical
>
+1 (non-binding)
Thanks Gang for the release!
Best,
Xuwei Fu
Gang Wu wrote on Thu, Nov 16, 2023 at 14:07:
> Hi everyone,
>
> I propose the following RC to be released as the official Apache Parquet
> Format 2.10.0 release.
>
> The commit id is b9c4fa81c3be13dc98760c92b037fa4dd465cef8
> * This corresponds to
Hi Ed,
As [1] said, DELTA_BINARY_PACKED might not be suitable for all cases,
and [3] talks about the same problem. I think, because we already have
data like this, this could be compatible.
Also, [2] introduces some optimizations for DELTA_BINARY_PACKED.
Besides, maybe we can introduce PFor
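To make the trade-off concrete, here is a simplified pure-Python sketch of
the delta idea behind DELTA_BINARY_PACKED. It only shows why close-together
values compress well as deltas; the real Parquet encoding additionally uses
block/miniblock structure, per-miniblock bit widths, and zigzag-encoded
minimum deltas, none of which is modeled here.

```python
# Simplified sketch of the idea behind DELTA_BINARY_PACKED: store the
# first value plus the deltas between consecutive values. When values
# are close together (timestamps, auto-increment ids), each delta fits
# in far fewer bits than the values themselves. This is NOT the actual
# Parquet on-disk layout, just the core transform.

def delta_encode(values):
    """Return (first_value, list_of_deltas)."""
    if not values:
        return None, []
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas

def delta_decode(first, deltas):
    """Rebuild the original values from the first value and deltas."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

timestamps = [1_700_000_000, 1_700_000_003, 1_700_000_004, 1_700_000_010]
first, deltas = delta_encode(timestamps)
assert delta_decode(first, deltas) == timestamps
# deltas == [3, 1, 6]: each fits in 3 bits, versus 31 bits per raw value.
```

This also illustrates the "not suitable for all cases" point: random or
widely scattered values produce deltas as large as the values themselves,
so the bit-packing gains disappear.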
On 2023/02/01 19:27:22 Will Jones wrote:
> Hello,
>
> A while back, the Parquet C++ implementation was merged into the Apache
> Arrow monorepo [1]. As I understand it, this helped the development
> process immensely. However, I am noticing some governance issues
> because of it.
>
> First, it's not
+1 for an uncompressed size for the field. However, it's a bit tricky here.
I've implemented a similar size hint in our system; here are some problems
I met:
1. Null values. In an Arrow Array, a null value still occupies some space,
but the field's raw size cannot represent that.
2. Size of
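Problem 1 above can be illustrated with a small, hypothetical sketch (the
function names and the 8-byte slot width are my own assumptions, not any
Arrow API): a fixed-width Arrow-style array reserves a slot plus a validity
bit for every element, null or not, while a naive raw-size hint counts only
the non-null values, so the two numbers diverge as soon as nulls appear.

```python
# Hypothetical illustration (not Arrow code) of the null-accounting
# mismatch: in-memory buffer size vs. a "raw data size" hint that
# skips nulls.

def arrow_style_buffer_size(values, value_width=8):
    """Space an Arrow-style fixed-width array occupies: one slot per
    element (null or not) plus one validity bit per element, with the
    validity bitmap rounded up to whole bytes."""
    n = len(values)
    validity_bytes = (n + 7) // 8
    return n * value_width + validity_bytes

def naive_raw_size(values, value_width=8):
    """A size hint that counts only the non-null values."""
    return sum(value_width for v in values if v is not None)

data = [1, None, 3, None, None, 6]
print(arrow_style_buffer_size(data))  # 49: 6 slots * 8 bytes + 1 validity byte
print(naive_raw_size(data))           # 24: only the 3 non-null values
```

A reader sizing read buffers from the 24-byte hint would under-allocate for
the 49 bytes the decoded array actually needs, which is exactly the pitfall
described above.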
This problem is shown in this issue:
https://github.com/apache/arrow/issues/15173
Let me talk about it briefly:
* The encoder doesn't write "num_values" in the page payload for
BYTE_STREAM_SPLIT, but uses "num_values" as the stride in BYTE_STREAM_SPLIT
* When decoding, for DATA_PAGE_V2, it can know the
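The stride problem above can be sketched in a few lines of Python (this is
a toy model of the byte-transposition idea, not the parquet-cpp code from
the issue): BYTE_STREAM_SPLIT lays out byte i of every value as stream i,
so the encoded payload is k streams of num_values bytes each, and the
decoder needs num_values as the stride to find where each stream starts,
because the payload itself carries no count.

```python
import struct

# Toy BYTE_STREAM_SPLIT for 4-byte floats: byte i of every value goes
# into stream i, and the four streams are laid out back to back. The
# stride between streams is num_values, which the decoder must be told,
# since the payload contains no length information of its own.

def bss_encode(floats):
    raw = b"".join(struct.pack("<f", f) for f in floats)
    n = len(floats)
    # Stream i = byte i of value 0, byte i of value 1, ...
    return bytes(raw[4 * v + i] for i in range(4) for v in range(n))

def bss_decode(payload, num_values):
    # Byte i of value v sits at offset i * num_values + v:
    # num_values is the stride between the four byte streams.
    raw = bytes(payload[i * num_values + v]
                for v in range(num_values) for i in range(4))
    return [struct.unpack("<f", raw[4 * v:4 * v + 4])[0]
            for v in range(num_values)]

vals = [1.5, -2.25, 3.0]
payload = bss_encode(vals)
assert bss_decode(payload, len(vals)) == vals
```

Passing a wrong num_values to bss_decode interleaves bytes from different
values, which is why a V1 page (whose header count includes nulls) needs
the true non-null value count derived elsewhere before decoding.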