Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-09 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r2907679487


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-06 Thread via GitHub


emkornfield commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r2897287421


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   > In other words, I am not sure the composeability of fixed size list into 
different element types makes a lot of sense
   
   +1 to this.  I think this comes up as a theoretical compatibility with arrow 
things, where Arrow places no such limitations.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-06 Thread via GitHub


alamb commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r2894162624


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   I am not even sure what a "fixed sized list of structs" even means. Would it 
mean that each struct has a known size (so that each element is fixed size 🤔 ). 
How would that work to have a fixed size list of structs where one of the 
structs was a (non fixed size) list 🤔 
   
   In other words, I am not sure the composeability of fixed size list into 
different element types makes a lot of sense



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-05 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-4004325670

   Application layer can also easily implement it's own null mask with a 
boolean vector under the new proposal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-04 Thread via GitHub


adamreeve commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-4001134563

   Ah yes you're right, sorry. And I didn't mean to suggest that nullable 
elements should be supported, I'd just misunderstood how this would work. For 
scenarios where I see this being used, zero or NaN is often used in place of 
nulls, so I think it's fine to only support required vectors and elements.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-04 Thread via GitHub


etseidl commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-4001043887

   > > as each vector value has a fixed number of non-optional child elements
   > 
   > I think the vectors themselves would be non-optional but the child 
elements should be allowed to be optional?
   
   @pitrou's proposal would increase neither the repetition nor definition 
levels, so that implies vectors and their elements are non-optional. If we want 
to allow optional vectors and elements, we'd need some kind of 3-level 
structure like what currently exists for lists. This would eliminate some 
repetition level decoding, but would still require extra definition level 
handling, so I wonder if we'd see as much of a decoding speed improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-04 Thread via GitHub


adamreeve commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-4000570329

   I like the new proposal. This previous proposal means readers could fall 
back to reading the primitive values if they don't understand the new logical 
type, whereas there wouldn't be a fallback path for readers that don't 
understand the new repetition type, but I think that's acceptable given it 
allows use of encodings that are better suited to the element values (eg. byte 
stream split for floats or maybe ALP in the future).
   
   > as each vector value has a fixed number of non-optional child elements
   
   I think the vectors themselves would be non-optional but the child elements 
should be allowed to be optional?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-04 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-3998327552

   For the record we have a [new proposal on the mailing 
list](https://lists.apache.org/thread/soqd69k8y7b6z0sxbmgrbxcwxbvlj353) and 
[here](https://github.com/apache/parquet-format/compare/master...pitrou:vector-repetition)
 that is relevant for this discussion. It would be great to get eyes on it and 
decide if we want to rather go with it.
   
   cc @tustvold @JFinis @jhorstmann @wgtmac @alippai @coastalwhite @etseidl 
@mapleFU @mhaseeb123 @rahil-c @adamreeve @emkornfield 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-03 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r2878770027


##
src/main/thrift/parquet.thrift:
##
@@ -319,6 +319,11 @@ struct ListType {}// see LogicalTypes.md
 struct EnumType {}// allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {}// allowed for INT32
 struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 
bytes (see LogicalTypes.md)
+struct FixedSizeListType {// allowed for FIXED_LEN_BYTE_ARRAY; see 
LogicalTypes.md
+1: required Type type;// element type (fixed-width primitive; must 
not be BOOLEAN, INT96, or BYTE_ARRAY)

Review Comment:
   > The only blocker for that is that there is no logical type annotation to 
indicate FLOAT or DOUBLE.
   
   Yes, I think we need either `Type` and `Enum` as you originally suggested or 
`Type` and optional `LogicalType`. I slightly prefer `LogicalType` because we 
already define it. Shall I update the language to sketch the `LogicalType` path?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-03 Thread via GitHub


jhorstmann commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r2877924765


##
src/main/thrift/parquet.thrift:
##
@@ -319,6 +319,11 @@ struct ListType {}// see LogicalTypes.md
 struct EnumType {}// allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {}// allowed for INT32
 struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 
bytes (see LogicalTypes.md)
+struct FixedSizeListType {// allowed for FIXED_LEN_BYTE_ARRAY; see 
LogicalTypes.md
+1: required Type type;// element type (fixed-width primitive; must 
not be BOOLEAN, INT96, or BYTE_ARRAY)

Review Comment:
   Adding a logical type could work, and it would then even support nested 
lists or matrices. It's not immediately obvious, but `Type` could not support 
that since the length of `FIXED_LEN_BYTE_ARRAY` is stored in `SchemaElement`.
   
   What I don't like is that here, the logical type is used to influence the 
physical layout, where as elsewhere, a PLAIN encoded INT32 with logical type 
INT_8 would still be stored using 4 bytes.
   
   Hm, thinking out loud a bit, the physical width is already defined by 
`type_length of FIXED_LEN_BYTE_ARRAY / num_values`. The logical type should 
then be enough to interpret these bytes, without the `Type` field. The only 
blocker for that is that there is no logical type annotation to indicate 
`FLOAT` or `DOUBLE`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-03 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r2877690702


##
src/main/thrift/parquet.thrift:
##
@@ -319,6 +319,11 @@ struct ListType {}// see LogicalTypes.md
 struct EnumType {}// allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {}// allowed for INT32
 struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 
bytes (see LogicalTypes.md)
+struct FixedSizeListType {// allowed for FIXED_LEN_BYTE_ARRAY; see 
LogicalTypes.md
+1: required Type type;// element type (fixed-width primitive; must 
not be BOOLEAN, INT96, or BYTE_ARRAY)

Review Comment:
   Good point, decimal is another type we'd lose annotation for. To avoid a new 
enum, how about `optional LogicalType`:
   
   ```thrift
   struct FixedSizeListType {
   1: required Type type;// element type (fixed-width primitive)
   2: required i32 num_values;
   3: optional LogicalType element_logical_type; // optional semantic 
annotation of elements, 
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-03 Thread via GitHub


jhorstmann commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r2876791483


##
src/main/thrift/parquet.thrift:
##
@@ -319,6 +319,11 @@ struct ListType {}// see LogicalTypes.md
 struct EnumType {}// allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {}// allowed for INT32
 struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 
bytes (see LogicalTypes.md)
+struct FixedSizeListType {// allowed for FIXED_LEN_BYTE_ARRAY; see 
LogicalTypes.md
+1: required Type type;// element type (fixed-width primitive; must 
not be BOOLEAN, INT96, or BYTE_ARRAY)

Review Comment:
   It might make sense to introduce a new enum for the list element types. The 
`Type` enum does not distinguish smaller integer types, signed/unsigned types 
or the float16 type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-02 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-3987738494

   Based on the discussion I made language a bit more vebose and explicit: 
https://github.com/apache/parquet-format/pull/241/changes/bc3df18b190e4a792522e91b4c6504c190aabdc6
   
   I'd love to hear more feedback, will also ping the ML.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-03-02 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-3987279321

   > This is quite an old PR so I am reading earlier comments to understand 
what's going on. Please correct me if I am wrong anywhere or missed something.
   
   Thanks for taking the time!

   > This PR adds a new `FIXED_SIZE_LIST ` logical type containing `[num_values 
x FLBA element]` per such logical list. The underlying FLBA data would encode 
as non-nullable PLAIN (no repetition levels and no def levels so no nulls at 
any level?).
   
   The [num_values x FLBA] rows themselves would be nullable with definition 
levels. What we wouldn't have is intra-row nullability.
   
   > 1. Why not allow all or at least all fixed length physical types (INTs, 
FLOAT, DOUBLE, BOOLEAN etc) to be a part of `FIXED_SIZE_LIST` instead of just 
the `FIXED_LEN_BYTE_ARRAY`s type (do we only care for FLOAT16, DECIMALS, and 
UUIDs here)?
   
   FLBA is meant as physical type (container) for arbitrary 
FIXED_SIZE_LIST(type, size) logical type. All fixed length physical types are 
allowed.
   
   > 2. Why only PLAIN encoding allowed for the contained FLBA data? Is it to 
allow zero-copy?
   
   Encoding here is meant for byte layout within the FLBA. FIXED_SIZE_LIST 
columns can use any encoding that supports FLBA (plain, dictionary, 
delta_byte_array, byte_stream_split).
   
   > Consider all above is true, then from cuDF's perspective, we would 
definitely get some speed boost from not having to decode (and write) levels 
data for such types and then viewing the data as lists would also be trivial.
   > 
   > To that effect, if this type is added to Parquet, I would prefer if it 
supports more than just FLBAs to make the effort to support it worthwhile. 
CC'ing @pmattione-nvidia as he can speak more on the overhead incurred from 
decoding levels data in the last libcudf version.
   
   As stated above, this already supports all fixed-width primitive types as 
elements — FLBA is just the container. Glad to hear this would be useful for 
cuDF!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2026-02-28 Thread via GitHub


rahil-c commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-3977803111

   Thanks @rok for revisiting this, if you need any help on this let me know. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-22 Thread via GitHub


tustvold commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2429736936

   Some points in no particular order:
   
   * The parquet schema is authoritative, with any other schema information 
merely a hint, this makes the notion of using the arrow schema, or something 
else to drive decode a little dubious.
   * The record shredding logic for lists is the single most complex, confusing 
and subtle aspect of any parquet reader, which:
 *  Limits the pool of people who can implement / review such changes
 * Sets a very high bar for including such changes
   * Even some optimal record shredding setup will never perform better than an 
implementation that can simply skip it entirely
   * Both arrow-rs and polars exploit that the hybrid RLE is effectively a 
bitmask if the max definition level is only 1, this allows for very efficient 
decode. This isn't possible when there are repetition levels
   * Performant record skipping, e.g. for predicate/index pushdown or late 
materialization, is not really possible against data with repetition levels
   * Many readers have quirky support for repetition levels and lists in 
general, especially w.r.t areas where the specification has been ambiguous in 
the past, finding ways for people to avoid these pain points is potentially 
valuable
   
   That's all to say providing a way to encode fixed size lists seems like a 
very useful capability. That being said, it does seem to be a bit of a hack to 
make this a logical type, and will potentially limit the options for encodings, 
statistics, sort orders, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-22 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2429710866

   Sorry for my abundance of caution @alippai. I'll try to summarize this 
thread to the ML and ask for some more input ASAP. It would be nice to actually 
start some work on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-22 Thread via GitHub


alippai commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2429686158

   @rok Sorry, wrong phrasing. I meant that was the recommendation to explore 
on the ML and by @coastalwhite.
   
   I didn’t see objections adding this feature to the parquet format or 
commitments for adding the fast path to any of the libraries (arrow cpp 
actually noted it’s a non-trivial part of the codebase)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-22 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2429373213

   > @rok based on the ML discussion we should add the fast path in the cases 
of polars, arrow and arrow-rs where we know the fixed size already (from schema 
stored in the metadata or if it's provided by the consumer). This is more 
fragile and less universal, but maybe a good first step forward
   
   @alippai are you sure we have a strong enough consensus yet to start 
implementing fast paths? I would really like to have some more discussion 
before committing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-21 Thread via GitHub


alippai commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2427321660

   @rok based on the ML discussion we should add the fast path in the cases of 
polars, arrow and arrow-rs where we know the fixed size already (from schema 
stored in the metadata or if it's provided by the consumer). This is more 
fragile and less universal, but maybe a good first step forward


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-21 Thread via GitHub


coastalwhite commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2426668631

   > @coastalwhite there is a 10x penalty in Polars 1.9.0 parquet _reading_ as 
well using this snippet: [apache/arrow#34510 
(comment)](https://github.com/apache/arrow/issues/34510#issuecomment-1462753928)
   
   Thank you for putting that to my attention. Still, I feel like that is more 
of a bug than an inherent performance problem in the Parquet file format. 
However, it is probably easier to optimize for what is proposed in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-21 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2426633671

   > @rok is there anything I can help with?
   
   @alippai thanks for pinging. I was advised on the parquet sync call to 
re-open a ML discussion on this, but I need a couple of weeks to get to it. If 
you'd like you can start it now, here's the existing thread: 
https://lists.apache.org/thread/xot5f3ghhtc82n1bf0wdl9zqwlrzqks3
   I suppose it'd be useful to report on the pros and cons discussed here and 
propose we move forward.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-21 Thread via GitHub


alippai commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2426579259

   @coastalwhite there is a 10x penalty in Polars 1.9.0 parquet _reading_ as 
well using this snippet: 
https://github.com/apache/arrow/issues/34510#issuecomment-1462753928


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-21 Thread via GitHub


coastalwhite commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2425895697

   I like the general idea of moving `FixedSizeList` partially away from `List` 
and towards `FixedSizeBinary`, but I doubt it would lead to serious speedups or 
simplification.
   
   The `List` based deserializer most of the time already batches decoding 
similarly to what this would allow, although it would allow skipping many 
checks that happen before the actual deserialization takes place. We would also 
still need to support the old path for a long time, since a lot of people write 
parquet files using old versions of the parquet specification and generally use 
old parquet files.
   
   The one potentially large upside I can imagine of this is getting dictionary 
encoding for array's, but I am not sure how common that will be in real-world 
scenarios.
   
   In general, I would say I am in favor. Although, I am not 100% convinced yet 
that the added complexity will result in significant performance, file size or 
other benefits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-10-18 Thread via GitHub


alippai commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2423305224

   @rok is there anything I can help with?
   
   @mapleFU I saw your questions above. Are you satisfied with the answers?
   
   @coastalwhite I see you are familiar with Parquet and Array in Polars. Do 
you think this proposal is useful for your project?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-24 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1650648046


##
LogicalTypes.md:
##
@@ -256,6 +256,24 @@ The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive 
type.
+
+The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of
+elements of the same primitive data type.
+
+The sort order used for `FIXED_SIZE_LIST` is undefined.
+
+### VARIABLE_SIZE_LIST

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-04 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2148750273

   Apologies for taking a while to reply.
   
   I've split this into two cases: `FixedSizeListType` (length is constant) and 
`VariableSizeListType` (length differs per row) for the sake of discussion. I 
would move `VariableSizeListType` into a separate PR if we even decide it is 
needed next to `ListType`.
   
   > One thing to perhaps give thought to is how this might represent nested 
lists, say you wanted to encode a m by n matrix, would you just encode this as 
a `m * n` list or do we want to support this as a first-class concept?
   
   We could start with a more general multidimensional array definition and 
have list be a 1 dimensional array. Additional metadata required would not be 
that bad. I'm just a bit scared of validation and striding logic bleeding into 
parquet implementations. Do we have any other inputs / opinions?
   
   > I had perhaps been anticipating that fixed size list would be a variant of 
"REPEATED" as opposed to a physical type, that is just able to avoid 
incrementing the max_def_level and max_rep_level. This would make it 
significantly more flexible I think, although I concede it will make it harder 
to implement.
   
   That's interesting. What would you expect performance wise with this 
approach?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-04 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1626825231


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   > Could you please provide a concrete example on how the list is structured? 
What about their definition & repetition levels? Intuitively, I thought not 
limit it to binary type. For example, it would be possible to support something 
like int[N] or double[N] and even multi-dimensional list like int[M][N].
   
   I would represent the fixed sized list as a non-nested 
`FIXED_LEN_BYTE_ARRAY` + `type` + `num_values`. Multidimensional lists/arrays 
bring much more complexity that I'm not sure makes sense to store as a logical 
type (see [FixedShapeTensor in 
Arrow](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)).
 Also see 
https://github.com/apache/parquet-format/pull/241#issuecomment-2148750273.
   
   > Perhaps use `byte_array` in this PR (see #251).
   
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-04 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1626834692


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   > * This solves the problem of redundant R-Levels. Since it's just a 
primitive column, no r-level considerations have to be taken into account.
   
   This is the main reason I'd like to propose this type, see 
https://github.com/apache/arrow/issues/34510.
   
   > * Cannot create fixed size lists of nested types (e.g., list of structs). 
I see that this isn't necessary for tensors or embedding vectors, but shouldn't 
the feature be extensible for other scenarios as well? This limits the 
composability of the feature. I can now create a struct of fixed size lists, 
but not a fixed size list of structs.
   
   Lack of composability is a downside, but I think it's still worth the 
compromise. I've not seen need for fixed_size_list(struct) in tensor computing, 
but that's probably just because it's not available.
   
   > * Cannot have null elements in fixed size lists. This might not be desired 
for all lists, but there can be use cases where having null values in them is 
preferrable.
   
   In tensor computation this is usually addressed with bitmasks, which can be 
stored as a `fixed_size_list(binary, num_values)`.
   
   > * Parquet has a concept for (non-fixed size) lists. It is conceptually 
weird that fixed size lists are totally different from (non-fixed size) lists.
   
   Perhaps we should call this type `FixedSizeArray` to disambiguate?
   
   
   > I'm now feeling that maybe wrapping a `Vector[PrimitiveType, Size]` is 
also ok, but currently representing this is a bitweird in the model. May I ask 
would a `Vector` having data below?
   > 
   > ```
   > 1. [1, 1, 1], [null, 1, 1] <-- data with null
   > 2. null, [1, 1, 1] <-- null vector
   > ```
   > 
   > And would vector contains a "nested" vector?
   
   I think case 2. is ok, but case 1. should be expressed with a separate null 
bitmask that's not part of the type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-04 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1626826071


##
src/main/thrift/parquet.thrift:
##
@@ -282,13 +282,14 @@ struct Statistics {
 }
 
 /** Empty structs to use as logical type annotations */
-struct StringType {}  // allowed for BINARY, must be encoded with UTF-8
-struct UUIDType {}// allowed for FIXED[16], must encoded raw UUID bytes
-struct MapType {} // see LogicalTypes.md
-struct ListType {}// see LogicalTypes.md
-struct EnumType {}// allowed for BINARY, must be encoded with UTF-8
-struct DateType {}// allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct StringType {}// allowed for BINARY, must be encoded with UTF-8
+struct UUIDType {}  // allowed for FIXED[16], must encoded raw UUID 
bytes
+struct MapType {}   // see LogicalTypes.md
+struct ListType {}  // see LogicalTypes.md
+struct EnumType {}  // allowed for BINARY, must be encoded with UTF-8
+struct DateType {}  // allowed for INT32
+struct Float16Type {}   // allowed for FIXED[2], must encoded raw FLOAT16 
bytes
+struct FixedSizeListType {} // see LogicalTypes.md

Review Comment:
   Changed to:
   ```
   struct FixedSizeListType {// allowed for 
FIXED_LEN_BYTE_ARRAY[num_values * width of type],
   1: required Type type;// see LogicalTypes.md
   2: required i32 num_values;
   }
   struct VariableSizeListType { // allowed for BYTE_ARRAY, see 
LogicalTypes.md
   1: required Type type;
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-04 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1626825603


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   Yes, thanks! Changed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-04 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1626825231


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   > Could you please provide a concrete example on how the list is structured? 
What about their definition & repetition levels? Intuitively, I thought not 
limit it to binary type. For example, it would be possible to support something 
like int[N] or double[N] and even multi-dimensional list like int[M][N].
   
   I would represent the fixed sized list as a non-nested 
`FIXED_LEN_BYTE_ARRAY` + `type` + `num_values`. Multidimensional lists/arrays 
bring much more complexity that I'm not sure makes sense to store as a logical 
type (see [FixedShapeTensor in 
Arrow](https://arrow.apache.org/docs/format/CanonicalExtensions.html#fixed-shape-tensor)).
   
   > Perhaps use `byte_array` in this PR (see #251).
   
   Done.



##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.
+
+The `fixed_len_byte_array` data is interpreted as a sequence of elements of
+the same primitive data type.

Review Comment:
   Added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-04 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1626450400


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   Will do, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-06-04 Thread via GitHub


etseidl commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1626434760


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   Perhaps use `byte_array` in this PR (see #251).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-24 Thread via GitHub


mapleFU commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1613765507


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   I'm now feeling that maybe wrapping a `Vector[PrimitiveType, Size]` is also 
ok, but currently representing this is a bitweird in  the model. May I ask 
would a `Vector` having data below?
   
   ```
   1. [1, 1, 1], [null, 1, 1] <-- data with null
   2. null, [1, 1, 1] <-- null vector
   ```
   
   And would vector contains a "nested" vector?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-18 Thread via GitHub


JFinis commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602697549


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   Interesting choice to annotate a binary primitive field instead of a 
repeated group field. I see pros and cons with this design:
   
   PROs:
   * Guarantees zero-copy, as the layout is defined to be just bytes. In 
contrast, would this annotate a group, a writer could decide to use a fancy 
per-value encoding (e.g., dictionary) and thus create a list that first has to 
be "decoded" before it can be used.
   * Guarantees that a list is always contained on one page instead of being 
split over multiple pages. Again, this helps in keeping decoders easy and 
guaranteeing zero copy.
   * This solves the problem of redundant R-Levels. Since it's just a primitive 
column, no r-level considerations have to be taken into account.
   
   CONs:
   * Cannot create fixed size lists of nested types (e.g., list of structs). I 
see that this isn't necessary for tensors or embedding vectors, but shouldn't 
the feature be extensible for other scenarios as well? This limits the 
composability of the feature. I can now create a struct of fixed size lists, 
but not a fixed size list of structs.
   * Cannot have null elements in fixed size lists. This might not be desired 
for all lists, but there can be use cases where having null values in them is 
preferrable.
   * Parquet has a concept for (non-fixed size) lists. It is conceptually weird 
that fixed size lists are totally different from (non-fixed size) lists.
   
   I think the PROs outweigh the CONs here, so I think this is fine with me. I 
just want everyone to be aware about the ramifications.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


JFinis commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602699118


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   cc @tustvold, as you also brought up this point. I agree that having a new 
property of a repeated group would be more flexible, but it also comes at some 
cost, as outlined above. Also, it couldn't be just a logical type in this case, 
as a logical type cannot change the handling of R-Levels.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


JFinis commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602675688


##
src/main/thrift/parquet.thrift:
##
@@ -282,13 +282,14 @@ struct Statistics {
 }
 
 /** Empty structs to use as logical type annotations */
-struct StringType {}  // allowed for BINARY, must be encoded with UTF-8
-struct UUIDType {}// allowed for FIXED[16], must encoded raw UUID bytes
-struct MapType {} // see LogicalTypes.md
-struct ListType {}// see LogicalTypes.md
-struct EnumType {}// allowed for BINARY, must be encoded with UTF-8
-struct DateType {}// allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct StringType {}// allowed for BINARY, must be encoded with UTF-8
+struct UUIDType {}  // allowed for FIXED[16], must encoded raw UUID 
bytes
+struct MapType {}   // see LogicalTypes.md
+struct ListType {}  // see LogicalTypes.md
+struct EnumType {}  // allowed for BINARY, must be encoded with UTF-8
+struct DateType {}  // allowed for INT32
+struct Float16Type {}   // allowed for FIXED[2], must encoded raw FLOAT16 
bytes
+struct FixedSizeListType {} // see LogicalTypes.md

Review Comment:
   Something is missing here. Shouldn't this type contain the element type? And 
the length of the list? The length of the list could be deduced from the size 
of the underlying `fixed_len_byte_array`, but at least the element type would 
be necessary then.



##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST

Review Comment:
   Interesting choice to annotate a binary primitive field instead of a 
repeated group field. I see pros and cons with this design:
   
   PROs:
   * Guarantees zero-copy, as the layout is defined to be just bytes. In 
contrast, would this annotate a group, a writer could decide to use a fancy 
per-value encoding (e.g., dictionary) and thus create a list that first has to 
be "decoded" before it can be used.
   * Guarantees that a list is always contained on one page instead of being 
split over multiple pages. Again, this helps in keeping decoders easy and 
guaranteeing zero copy.
   * This solves the problem of redundant R-Levels. Since it's just a primitive 
column, no r-level considerations have to be taken into account.
   
   CONs:
   * Cannot create fixed size lists of nested types (e.g., list of structs). I 
see that this isn't necessary for tensors or embedding vectors, but shouldn't 
the feature be extensible for other scenarios as well? This limits the 
composability of the feature. I can now create a struct of fixed size lists, 
but not a fixed size list of structs.
   * Parquet has a concept for (non-fixed size) lists. It is conceptually weird 
that fixed size lists are totally different from (non-fixed size) lists.
   
   I think the PROs outweigh the CONs here, so I think this is fine with me. I 
just want everyone to be aware about the ramifications.



##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   `binary` isn't a defined primitive type in Parquet. Do you mean 
FIXED_LEN_BYTE_ARRAY?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


wgtmac commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2113906022

   cc @JFinis 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


wgtmac commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602489258


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   Could you please provide a concrete example on how the list is structured? 
What about their definition & repetition levels? Intuitively, I thought not 
limit it to binary type. For example, it would be possible to support something 
like int[N] or double[N] and even multi-dimensional list like int[M][N].



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


tustvold commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2113662734

   One thing to perhaps give thought to is how this might represent nested 
lists, say you wanted to encode a m by n matrix, would you just encode this as 
a `m * n` list or do we want to support this as a first-class concept?
   
   I had perhaps been anticipating that fixed size list would be a variant of 
"REPEATED" as opposed to a physical type. This would make it significantly more 
flexible I think


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


etseidl commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602302147


##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.
+
+The `fixed_len_byte_array` data is interpreted as a sequence of elements of
+the same primitive data type.

Review Comment:
   If this is only for primitive types, should the type be added to the 
`FixedSizedListType` struct?



##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.
+
+The `fixed_len_byte_array` data is interpreted as a sequence of elements of

Review Comment:
   ```suggestion
   The `binary` data is interpreted as a sequence of elements of
   ```
   



##
LogicalTypes.md:
##
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a primitive data type. It must annotate a `binary` primitive type.

Review Comment:
   "binary" means either fixed or variable length, right? I always get confused 
😅.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602294932


##
LogicalTypes.md:
##
@@ -255,6 +255,18 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a fixed-width data type. It must annotate an N-byte fixed length binary

Review Comment:
   I suppose `binary` vs `fixed_len_byte_array` wouldn't affect performance 
that much. Changed.
   I wonder if it would make sense to also propose a non-nested LIST type?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


rok commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602290688


##
LogicalTypes.md:
##
@@ -255,6 +255,18 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a fixed-width data type. It must annotate an N-byte fixed length binary

Review Comment:
   My assumption was `fixed_len_byte_array` will give us better performance 
than `binary` but we then require constant bit width. Are you saying we could 
relax this condition, use `binary` and cover non-fixed-width types?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


tustvold commented on code in PR #241:
URL: https://github.com/apache/parquet-format/pull/241#discussion_r1602089071


##
LogicalTypes.md:
##
@@ -255,6 +255,18 @@ The primitive type is a 2-byte fixed length binary.
 
 The sort order for `FLOAT16` is signed (with special handling of NANs and 
signed zeros); it uses the same 
[logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and 
`DOUBLE`.
 
+### FIXED_SIZE_LIST
+
+The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
+of a fixed-width data type. It must annotate an N-byte fixed length binary

Review Comment:
   Why couldn't we support variable width data types here? Effectively this 
type would act like a repeated element, but with the repetition levels not 
needing to be specified?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] PARQUET-2474: Add FIXED_SIZE_LIST logical type [parquet-format]

2024-05-15 Thread via GitHub


rok commented on PR #241:
URL: https://github.com/apache/parquet-format/pull/241#issuecomment-2113116529

   cc @wgtmac @tustvold @alippai @mapleFU @AlenkaF


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]