Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-04-19 Thread Micah Kornfield
FYI https://github.com/apache/parquet-format/pull/197 captures the proposal and has gone through a few rounds of reviews. A quick summary: this adds the following optional metadata: 1. Variable byte count for BYTE_ARRAY types at the column chunk level. 2. A histogram of repetition and definition levels
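
For readers skimming the thread, here is a minimal Python sketch of the shape of the proposed optional metadata. The names are illustrative only; the exact Thrift fields are defined in the PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProposedSizeStatistics:
    # Hypothetical names mirroring the summary above, not the PR itself.
    # Total bytes of BYTE_ARRAY values in the column chunk if they were
    # plain-encoded (whether the 4-byte length prefix counts is debated
    # later in the thread).
    variable_byte_count: Optional[int] = None
    # histogram[i] = number of values with repetition level i
    repetition_level_histogram: List[int] = field(default_factory=list)
    # histogram[i] = number of values with definition level i
    definition_level_histogram: List[int] = field(default_factory=list)
```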

RE: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-27 Thread wish maple
+1 for an uncompressed size for the field. However, it's a bit tricky here. I've implemented a similar size hint in our system; here are some problems I met: 1. Null variables. In an Arrow Array, a null value should occupy some space, but the field's raw size cannot represent that. 2. Size of
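
A small pyarrow illustration of the first problem: a null still costs space in an Arrow array (validity bitmap, offset slot), so the in-memory size is larger than any raw-size field computed from the non-null values alone. This is a sketch; exact numbers depend on Arrow's buffer layout.

```python
import pyarrow as pa

arr = pa.array(["abc", None, "de"], type=pa.string())
# Bytes of the non-null values themselves: 3 + 2 = 5.
value_bytes = sum(len(s.as_py()) for s in arr if s.is_valid)
# arr.nbytes also counts the validity bitmap and the offsets buffer
# (which keeps a slot for the null), so it is larger.
print(value_bytes, arr.nbytes)
```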

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
I put together a draft PR: https://github.com/apache/parquet-format/pull/197/files Thinking about the nulls and nesting levels a bit more, I think keeping a histogram of repetition and definition levels probably strikes the right balance of simplicity and accuracy, but it would be great to hear if
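
A minimal sketch of what the writer-side bookkeeping for such a histogram could look like (hypothetical helper, not from the PR):

```python
from collections import Counter

def level_histogram(levels, max_level):
    # One bucket per level 0..max_level, zero-filled so readers can
    # index buckets by level directly.
    counts = Counter(levels)
    return [counts.get(i, 0) for i in range(max_level + 1)]

# Definition levels of an optional column: 0 = null, 1 = present.
# The null count falls out as bucket 0, with no extra field needed.
print(level_histogram([1, 0, 1, 1, 0], max_level=1))  # [2, 3]
```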

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
> 2. For repeated values, I think it is sufficient for a reasonable estimate to know the number of array starts (this includes nested arrays) contained in a page/column chunk, and we can add a new field to record this separately.

Apologies for replying to myself, but one more thought,
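
For a concrete reading of "number of array starts": with Dremel-style levels, a value begins a new array at nesting depth d exactly when its repetition level is below d, so the count can be derived from repetition levels alone. A sketch, ignoring null/empty lists (which also need definition levels):

```python
def count_array_starts(rep_levels, depth):
    # A value opens a new array at nesting depth `depth` whenever its
    # repetition level is smaller than that depth.
    return sum(1 for r in rep_levels if r < depth)

# list<list<int32>> with rows [[1, 2], [3]] and [[4]]:
rep = [0, 2, 1, 0]
print(count_array_starts(rep, 1))  # 2 outer lists
print(count_array_starts(rep, 2))  # 3 inner lists
```

Note that with the histogram from the draft PR this needs no separate field: starts at depth d are just the sum of the repetition-level buckets below d.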

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
> 1. How are primitive type sizes computed? Should we simply compute the raw size by assuming the data is plain-encoded? For example, does INT16 use the same bit-width as INT32? What about BYTE_ARRAY/FIXED_LEN_BYTE_ARRAY? Should we add an extra sizeof(int32) for its length?

Yes my
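
To make the questions concrete, a hedged sketch of one possible "plain-encoded size" convention. The widths follow Parquet's PLAIN encoding; the length-prefix choice is exactly what is being asked here.

```python
# Per-value bytes under Parquet PLAIN encoding. There is no INT16
# physical type: small integers are stored as INT32, so they cost
# the same 4 bytes. BOOLEAN is bit-packed.
PLAIN_WIDTHS = {"BOOLEAN": 0.125, "INT32": 4, "INT64": 8,
                "INT96": 12, "FLOAT": 4, "DOUBLE": 8}

def plain_byte_array_size(values, include_length_prefix):
    # PLAIN BYTE_ARRAY is a 4-byte little-endian length followed by
    # the raw bytes; the prefix may or may not be counted.
    prefix = 4 if include_length_prefix else 0
    return sum(prefix + len(v) for v in values if v is not None)
```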

Re: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
> 1. Null variables. In an Arrow Array, a null value should occupy some space, but the field's raw size cannot represent that.

This is a good point. The number of nulls can be inferred from statistics or is included in data page v2 [1]. I'd rather not bake in assumptions about the size of nulls
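
For example, with pyarrow the per-chunk null count is already recoverable from existing statistics (path and indexes below are placeholders):

```python
import pyarrow.parquet as pq

col = pq.ParquetFile("example.parquet").metadata.row_group(0).column(0)
stats = col.statistics
if stats is not None and stats.has_null_count:
    # No size-of-null assumption needed: readers can combine the null
    # count with whatever raw-size field gets standardized.
    print("nulls in this column chunk:", stats.null_count)
```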

RE: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread wish maple
+1 for an uncompressed size for the field. However, it's a bit tricky here. I've implemented a similar size hint in our system; here are some problems I met: 1. Null variables. In an Arrow Array, a null value should occupy some space, but the field's raw size cannot represent that. 2. Size of

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Gang Wu
+1 for adding the raw size of each column to the Parquet specs. I used to work around this by adding similar but hidden fields to the file formats. Let me bring some detailed questions to the table. 1. How are primitive type sizes computed? Should we simply compute the raw size by assuming the

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-24 Thread Will Jones
Hi Micah, We were just discussing in the Arrow repo how useful it would be to have utilities that could accurately estimate the deserialized size of a Parquet file [1]. So I would be very supportive of this. IIUC, the implementation of this should be trivial for many fixed-size types, although
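
Indeed, for fixed-width physical types the estimate already falls out of today's metadata; a rough pyarrow sketch (BYTE_ARRAY columns are the case that needs the new field):

```python
import pyarrow.parquet as pq

# Decoded bytes per value for fixed-width physical types.
WIDTHS = {"INT32": 4, "INT64": 8, "INT96": 12, "FLOAT": 4, "DOUBLE": 8}

def estimate_fixed_width_bytes(path):
    md = pq.ParquetFile(path).metadata
    total = 0
    for rg in range(md.num_row_groups):
        for c in range(md.num_columns):
            col = md.row_group(rg).column(c)
            width = WIDTHS.get(col.physical_type)
            if width is not None:
                total += col.num_values * width
    return total  # BYTE_ARRAY columns contribute nothing here
```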

[DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-24 Thread Micah Kornfield
Parquet metadata currently tracks uncompressed and compressed page/column sizes [1][2]. Uncompressed size here corresponds to the encoded size, which can differ substantially from the plain-encoding size due to RLE/dictionary encoding. When doing query planning/execution, it can be useful to
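
For reference, the two sizes the format tracks today are visible per column chunk, e.g. via pyarrow (placeholder path); neither reflects the decoded, plain-encoded size when pages are RLE/dictionary-encoded:

```python
import pyarrow.parquet as pq

col = pq.ParquetFile("example.parquet").metadata.row_group(0).column(0)
# Encoded-but-uncompressed bytes vs. bytes after compression.
print(col.total_uncompressed_size, col.total_compressed_size)
```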