Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-02-06 Thread via GitHub


rdblue commented on PR #461:
URL: https://github.com/apache/parquet-format/pull/461#issuecomment-2640995604

   Thanks for all the reviews, everyone! I just merged this so that we can move 
forward with the next round of discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-02-06 Thread via GitHub


rdblue merged PR #461:
URL: https://github.com/apache/parquet-format/pull/461





Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-02-05 Thread via GitHub


rdblue commented on PR #461:
URL: https://github.com/apache/parquet-format/pull/461#issuecomment-2637990323

   Thanks to everyone for reviewing! I'll merge this tomorrow unless there are 
objections. I think to avoid confusion like 
https://github.com/apache/parquet-format/issues/479, we should get this in and 
then close on any remaining issues.





Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-02-03 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1940281393


##
VariantShredding.md:
##
@@ -25,290 +25,325 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-02-03 Thread via GitHub


aihuaxu commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1940053559


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-02-03 Thread via GitHub


aihuaxu commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1841305033


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-02-03 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1939803766


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-22 Thread via GitHub


aihuaxu commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1926088105


##
VariantShredding.md:
##
@@ -25,290 +25,319 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-Th

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-14 Thread via GitHub


cashmand commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1914882151


##
VariantEncoding.md:
##
@@ -39,13 +39,42 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+Variant fields can also be _shredded_.
+Shredding refers to extracting some elements of the variant into separate 
columns for more efficient extraction/filter pushdown.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is `required` and must be a valid Variant metadata, as 
defined below.
+* The `value` field must be annotated as `required` for unshredded Variant 
values, or `optional` if parts of the value are [shredded](VariantShredding.md) 
as typed Parquet columns.
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;

Review Comment:
   I agree that the discrepancy is a bit unfortunate. Later in the shredding 
doc it says `The Parquet columns used to store variant metadata and values must 
be accessed by name, not by position`; maybe we should put it up front here.
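
As an illustration of the by-name rule quoted above, here is a minimal sketch. It is not taken from the spec or from any Parquet library: the `Field` tuple, the `field_by_name` helper, and the two example groups are hypothetical stand-ins for a reader's view of a Variant group. The point is only that a lookup keyed on the field name gives the same result whether `metadata` or `value` is listed first.

```python
# Hypothetical, simplified model of a Parquet group: an ordered list of
# (field_name, physical_type, repetition) tuples. This is not a real Parquet API.
Field = tuple[str, str, str]


def field_by_name(group: list[Field], name: str) -> Field:
    """Return the field called `name`, wherever it sits in the group."""
    for field in group:
        if field[0] == name:
            return field
    raise KeyError(f"Variant group has no field named {name!r}")


# Field order used in the spec's example: metadata first, then value.
spec_order = [
    ("metadata", "binary", "required"),
    ("value", "binary", "required"),
]

# The same group with value listed first (assumed layout, for illustration only).
value_first = [
    ("value", "binary", "required"),
    ("metadata", "binary", "required"),
]

# A by-name lookup resolves both fields regardless of their position.
for group in (spec_order, value_first):
    assert field_by_name(group, "metadata")[1] == "binary"
    assert field_by_name(group, "value")[1] == "binary"
```

A reader written this way would not depend on the order in which an engine writes the two fields, which is the concern raised in this sub-thread.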






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-13 Thread via GitHub


Zouxxyy commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1914176697


##
VariantEncoding.md:
##
@@ -39,13 +39,42 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+Variant fields can also be _shredded_.
+Shredding refers to extracting some elements of the variant into separate 
columns for more efficient extraction/filter pushdown.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is `required` and must be a valid Variant metadata, as 
defined below.
+* The `value` field must be annotated as `required` for unshredded Variant 
values, or `optional` if parts of the value are [shredded](VariantShredding.md) 
as typed Parquet columns.
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;

Review Comment:
   > Okay, I think it's fine if you want to make a PR with the change.
   
   Oh, I checked again, and it seems that 
`new VariantVal(byte[] value, byte[] metadata)` is already used everywhere in 
Spark, so this change would come at a significant cost. @cashmand @rdblue, do 
you think the inconsistency with the spec here will have a big impact in the 
future?
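
To make the trade-off concrete, here is a minimal Python sketch, not Spark's actual code. The `VariantVal` class below is a hypothetical stand-in that only mirrors the positional `(value, metadata)` order of the constructor quoted above, and `to_parquet_fields` is an invented helper. It illustrates one possible way to keep the in-memory constructor unchanged while still emitting fields in the order shown in the spec; whether that is practical in Spark is an open question, and the byte strings are placeholders rather than valid Variant encodings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantVal:
    # Hypothetical stand-in: positional order (value, metadata) mirrors the
    # constructor quoted in the comment above, nothing more.
    value: bytes
    metadata: bytes


def to_parquet_fields(variant):
    # Hypothetical helper: emit (name, bytes) pairs in the order shown in the
    # spec example (metadata first), independent of the constructor order.
    return [("metadata", variant.metadata), ("value", variant.value)]


# Placeholder bytes only; these are not claimed to be valid Variant encodings.
example = VariantVal(b"value-bytes", b"metadata-bytes")
field_names = [name for name, _ in to_parquet_fields(example)]
```

In this sketch the write order is decided in exactly one place, so existing call sites that build `VariantVal` positionally would not need to change.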






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-13 Thread via GitHub


cashmand commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1914098203


##
VariantEncoding.md:
##
@@ -39,13 +39,42 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+Variant fields can also be _shredded_.
+Shredding refers to extracting some elements of the variant into separate 
columns for more efficient extraction/filter pushdown.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is `required` and must be a valid Variant metadata, as 
defined below.
+* The `value` field must be annotated as `required` for unshredded Variant 
values, or `optional` if parts of the value are [shredded](VariantShredding.md) 
as typed Parquet columns.
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;

Review Comment:
   Okay, I think it's fine if you want to make a PR with the change.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-13 Thread via GitHub


Zouxxyy commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1914049795


##
VariantEncoding.md:
##
@@ -39,13 +39,42 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+Variant fields can also be _shredded_.
+Shredding refers to extracting some elements of the variant into separate 
columns for more efficient extraction/filter pushdown.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is `required` and must be a valid Variant metadata, as 
defined below.
+* The `value` field must be annotated as `required` for unshredded Variant 
values, or `optional` if parts of the value are [shredded](VariantShredding.md) 
as typed Parquet columns.
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;

Review Comment:
   > Hi @Zouxxyy, in the Spark shredding PRs I've been working on, I put 
metadata first. I didn't see much benefit to changing the order in the existing 
non-shredded code, but I don't feel too strongly about it either way. The spec 
is pretty clear that readers should identify the appropriate columns based on 
field names, not field order, and I think things could become quite fragile if 
they did rely on field order.
   
   Thank you, but future users or developers may find it strange if the actual 
implementation differs from the spec. I think it's better to adhere to the 
spec, as long as it doesn't affect performance, before the official release of 
Spark 4.0. If you don't mind, I can work on this and raise a PR against Spark. 
What do you think?






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-13 Thread via GitHub


cashmand commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1913567140


##
VariantEncoding.md:
##
@@ -39,13 +39,42 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+Variant fields can also be _shredded_.
+Shredding refers to extracting some elements of the variant into separate 
columns for more efficient extraction/filter pushdown.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is `required` and must be a valid Variant metadata, as 
defined below.
+* The `value` field must be annotated as `required` for unshredded Variant 
values, or `optional` if parts of the value are [shredded](VariantShredding.md) 
as typed Parquet columns.
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;

Review Comment:
   Hi @Zouxxyy, in the Spark shredding PRs I've been working on, I put metadata 
first. I didn't see much benefit to changing the order in the existing 
non-shredded code, but I don't feel too strongly about it either way. The spec 
is pretty clear that readers should identify the appropriate columns based on 
field names, not field order, and I think things could become quite fragile if 
they did rely on field order.
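
As a rough illustration of name-based resolution, here is a minimal Python sketch. It models a Parquet group as a plain ordered list of `(field_name, physical_type)` pairs, which is an assumption made for the example and not any real Parquet library API; `resolve_variant_fields` is likewise a hypothetical helper.

```python
def resolve_variant_fields(group_fields):
    """Return the positions of `metadata` and `value`, looked up by name.

    `group_fields` is a hypothetical model of a Parquet group: an ordered
    list of (field_name, physical_type) pairs.
    """
    index_by_name = {name: i for i, (name, _type) in enumerate(group_fields)}
    for required_name in ("metadata", "value"):
        if required_name not in index_by_name:
            raise ValueError(f"Variant group has no `{required_name}` field")
    return {
        "metadata": index_by_name["metadata"],
        "value": index_by_name["value"],
    }


# The order shown in the spec example (metadata first) ...
spec_order = [("metadata", "binary"), ("value", "binary")]
# ... and the order reported for Spark's current writer (value first).
spark_order = [("value", "binary"), ("metadata", "binary")]

spec_positions = resolve_variant_fields(spec_order)
spark_positions = resolve_variant_fields(spark_order)
```

A reader written this way finds both fields in either layout, whereas a reader that assumed "field 0 is metadata" would silently swap the two binary columns for one of them.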






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-12 Thread via GitHub


Zouxxyy commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1912635682


##
VariantEncoding.md:
##
@@ -39,13 +39,42 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+Variant fields can also be _shredded_.
+Shredding refers to extracting some elements of the variant into separate 
columns for more efficient extraction/filter pushdown.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is `required` and must be a valid Variant metadata, as 
defined below.
+* The `value` field must be annotated as `required` for unshredded Variant 
values, or `optional` if parts of the value are [shredded](VariantShredding.md) 
as typed Parquet columns.
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;

Review Comment:
   @chenhao-db @cashmand Hi, I noticed that Spark's Variant currently writes 
the value first and then the metadata, which is the opposite of the order used 
for shredding. Have we considered adjusting this?
   
   Spark's Variant:
   ```
   optional group variant_name (VARIANT) {
     required binary value;
     required binary metadata;
   }
   ```






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-11 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1911863780


##
VariantShredding.md:
##
@@ -25,290 +25,319 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-10 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1911863780


##
VariantShredding.md:
##
@@ -25,290 +25,319 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-10 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1911862049


##
VariantShredding.md:
##
@@ -25,290 +25,319 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
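
The shredding behaviour described in the quoted hunk can be illustrated with a small sketch. It is not the Parquet implementation and is not taken from the PR. It assumes plain Python dicts stand in for decoded Variant objects, a JSON string stands in for the binary residual `value`, and the names `shred` and `project` are hypothetical helpers introduced only for this illustration. A field whose type matches the shredding schema goes to a typed column; everything else stays in the residual.

```python
import json


def shred(rows, schema):
    """Split each object into typed columns plus a residual.

    rows:   list of dicts (stand-ins for decoded Variant objects; an assumption).
    schema: dict mapping field name -> expected Python type (hypothetical
            shredding schema, not the Parquet schema syntax).
    """
    typed = {name: [] for name in schema}
    residual = []
    for row in rows:
        rest = dict(row)
        for name, expected in schema.items():
            # Exact type match, so that bool is not accepted where int is expected.
            if name in row and type(row[name]) is expected:
                typed[name].append(row[name])
                del rest[name]
            else:
                # Missing or mismatched: leave it in the residual.
                typed[name].append(None)
        # JSON is only a placeholder for the Variant binary encoding.
        residual.append(json.dumps(rest, sort_keys=True) if rest else None)
    return typed, residual


def project(typed, residual, name):
    """Read one field, touching the residual only when the typed column is null."""
    out = []
    for val, rest in zip(typed[name], residual):
        if val is not None:
            out.append(val)
        elif rest is not None:
            out.append(json.loads(rest).get(name))
        else:
            out.append(None)
    return out


rows = [
    {"event_type": "signup", "event_ts": 1},
    {"event_type": 42, "extra": "x"},  # type mismatch stays in the residual
    {"event_ts": 2},                   # field missing entirely
]
typed, residual = shred(rows, {"event_type": str})
projected = project(typed, residual, "event_type")
```

In this sketch a query on `event_type` reads only the typed column for rows that were shredded and falls back to the residual for the mismatched row, which mirrors the projection and lazy-loading benefit the spec text describes.
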

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-08 Thread via GitHub


gene-db commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1907986419


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-03 Thread via GitHub


RussellSpitzer commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r190207


##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2025-01-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1902031603


##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-30 Thread via GitHub


Zouxxyy commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1899917696


##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-Th

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-29 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1899171394


##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-23 Thread via GitHub


Zouxxyy commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1896507319


##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-Th

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-20 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1894292367


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-15 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1886161262


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-13 Thread via GitHub


cashmand commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1884404362


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-11 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881108489


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-11 Thread via GitHub


RussellSpitzer commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881077834


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-11 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881064507


##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-11 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881064507


##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-11 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881054123


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.

Review Comment:
   My intent was to define what a "partial projection" means for readers who 
are not fluent with Variant types.  I proposed an alternative formulation of 
the sentence below.
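
To make the term concrete, here is a minimal sketch of a partial projection. It is not taken from the PR. The in-memory column layout, the `shred` and `project_event_ts` helpers, and the use of JSON as a stand-in for the Variant binary encoding are all assumptions made for illustration.

```python
import json

# Hypothetical rows holding the `event` Variant from the example query.
rows = [
    {"event_ts": 1733875200, "event_type": "signup", "extra": {"plan": "free"}},
    {"event_ts": 1733875260, "event_type": "login", "extra": {"ip": "10.0.0.1"}},
]


def shred(rows, field):
    """Split rows into a typed column for `field` and a residual column.

    JSON text stands in for the Variant binary encoding (an assumption).
    """
    typed = [row[field] for row in rows]
    residual = [
        json.dumps({k: v for k, v in row.items() if k != field}) for row in rows
    ]
    return {"typed_value": typed, "value": residual}


def project_event_ts(columns):
    # Partial projection: only the shredded column is read. The residual
    # `value` column is never loaded or deserialized.
    return list(columns["typed_value"])


columns = shred(rows, "event_ts")
print(project_event_ts(columns))
```

The point of the sketch is that `project_event_ts` never touches `columns["value"]`, which is the saving the sentence under review is trying to describe.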






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-11 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881053679


##
VariantShredding.md:
##
@@ -25,290 +25,320 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.

Review Comment:
   ```suggestion
   A partial projection occurs with a query like `SELECT variant_get(event, 
'$.event_ts', 'timestamp') FROM tbl`. In this case, an engine only needs to 
load field `event_ts`, and if the `event_ts` column is shredded, it can be read 
without reading or deserializing the rest of the `event` Variant.
   ```






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-11 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881046365


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-11 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1881045255


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


sfc-gh-saya commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1879024065


##
VariantShredding.md:
##
@@ -25,276 +25,299 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-A `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1879015316


##
VariantShredding.md:
##
@@ -25,276 +25,299 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-A `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1879010171


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-A `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1879004931


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-A `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878991946


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-A `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878995166


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-A `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


sfc-gh-saya commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878991799


##
VariantShredding.md:
##
@@ -25,276 +25,299 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-A `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878990162


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878982761


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878967986


##
VariantShredding.md:
##
@@ -25,276 +25,299 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878954798


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878945956


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.

Review Comment:
   I think it's pretty clear how this would work so I'm not sure the value here.
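
For readers following the thread, the paragraph quoted in the diff above can be illustrated with a minimal Python sketch. This is not part of the PR or the spec: the function names `shred` and `matching_rows`, the list-of-dicts row model, and the sample rows are all assumptions made for this example. Real shredding operates on Variant binary values and Parquet columns, not Python dicts.

```python
def shred(rows, field, expected_type):
    """Split one field out of each row into a typed column plus a residual.

    Hypothetical helper: values of the expected type go to the typed column;
    anything else stays in the residual and the typed slot is None.
    """
    typed_column = []
    residual = []
    for row in rows:
        value = row.get(field)
        if type(value) is expected_type:
            typed_column.append(value)
            # The shredded field is removed from the residual for this row.
            residual.append({k: v for k, v in row.items() if k != field})
        else:
            typed_column.append(None)
            # Missing or mismatched values are left untouched in the residual.
            residual.append(dict(row))
    return typed_column, residual


def matching_rows(typed_column, wanted):
    """Return row indexes whose typed value equals `wanted`.

    Only the typed column is read; residuals are never touched here.
    """
    return [i for i, value in enumerate(typed_column) if value == wanted]


rows = [
    {"event_type": "signup", "user": 1},
    {"event_type": 42, "user": 2},   # type mismatch: stays in the residual
    {"user": 3},                      # field missing
]
typed, residual = shred(rows, "event_type", str)
hits = matching_rows(typed, "signup")
```

In this sketch the filter on `event_type` reads only the typed column, and the residual for a row would be loaded only after that row passes the filter.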






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-10 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1878909221


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.

Review Comment:
   I agree with Micah and Gene. Let's not put a limit in the spec. That should 
be determined in other ways.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-09 Thread via GitHub


cashmand commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1876773911


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-T

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-08 Thread via GitHub


alamb commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1874839885


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 
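
For readers following the thread, the skipping behaviour described in the added text for the `event_type` query can be sketched roughly as below. This is only an illustration under stated assumptions: `RowGroup`, `load_rest`, and `select_signups` are hypothetical names, not part of any Parquet library, and real readers would use the column statistics stored in the file footer rather than computing min/max on the fly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group; not a real Parquet API."""
    event_type: List[str]                # shredded column, one value per row
    load_rest: Callable[[], List[Dict]]  # lazily loads the remaining Variant fields

    def min_max(self) -> Tuple[str, str]:
        # Stand-in for the min/max column statistics a writer would record.
        return min(self.event_type), max(self.event_type)


def select_signups(groups: List[RowGroup]) -> Tuple[List[Dict], int]:
    """Return matching rows and the number of row groups that were fully loaded."""
    rows: List[Dict] = []
    loaded = 0
    for group in groups:
        low, high = group.min_max()
        if not (low <= "signup" <= high):
            continue                     # statistics rule out a match: skip the group
        rest = group.load_rest()         # only now read the rest of the Variant
        loaded += 1
        for event_type, fields in zip(group.event_type, rest):
            if event_type == "signup":
                rows.append({"event_type": event_type, **fields})
    return rows, loaded


groups = [
    RowGroup(["click", "login"], lambda: [{"id": 1}, {"id": 2}]),
    RowGroup(["login", "signup"], lambda: [{"id": 3}, {"id": 4}]),
]
rows, loaded = select_signups(groups)
```

In this sketch the first group is skipped from its statistics alone, so the remaining Variant data is loaded for only one of the two groups.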

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-06 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1873830923


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
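
The metadata rule in the hunk above (all field names, whether shredded or not, must be present in the metadata) could be checked along these lines. This is a minimal sketch that assumes the metadata dictionary has already been decoded into a list of strings and that only top-level field names matter; `missing_field_names` and the sample values are hypothetical and not taken from the spec.

```python
from typing import Dict, Iterable, List, Set


def missing_field_names(metadata_names: Iterable[str],
                        shredded: Dict[str, object],
                        residual: Dict[str, object]) -> Set[str]:
    """Return field names used by the value but absent from the metadata."""
    known = set(metadata_names)
    used = set(shredded) | set(residual)  # shredded and non-shredded fields alike
    return used - known


# Hypothetical decoded metadata dictionary and one row's fields.
metadata_names: List[str] = ["event_ts", "event_type", "user_id"]
shredded = {"event_type": "signup"}
residual = {"event_ts": 1700000000, "user_id": 42}

complete = missing_field_names(metadata_names, shredded, residual)
incomplete = missing_field_names(["event_type"], shredded, residual)
```

An empty result means the row satisfies the rule; a non-empty result lists the names a writer would still need to add to the metadata.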

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-06 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868938364


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-05 Thread via GitHub


gene-db commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1872656516


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-Th

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-05 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1871880017


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-05 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1871877857


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-04 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1869037182


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-04 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1869037182


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-04 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868866777


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-04 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868977523


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-04 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868952437


##
VariantEncoding.md:
##
@@ -416,14 +444,36 @@ Field names are case-sensitive.
 Field names are required to be unique for each object.
 It is an error for an object to contain two fields with the same name, whether 
or not they have distinct dictionary IDs.
 
-# Versions and extensions
+## Versions and extensions
 
 An implementation is not expected to parse a Variant value whose metadata 
version is higher than the version supported by the implementation.
 However, new types may be added to the specification without incrementing the 
version ID.
 In such a situation, an implementation should be able to read the rest of the 
Variant value if desired.
 
-# Shredding
+## Shredding
 
 A single Variant object may have poor read performance when only a small 
subset of fields are needed.
 A better approach is to create separate columns for individual fields, 
referred to as shredding or subcolumnarization.
 [VariantShredding.md](VariantShredding.md) describes the Variant shredding 
specification in Parquet.
+
+## Conversion to JSON
+
+Values stored in the Variant encoding are a superset of JSON values.
+For example, a Variant value can be a date that has no equivalent type in JSON.
+To maximize compatibility with readers that can process JSON but not Variant, 
the following conversions should be used when producing JSON from a Variant:
+
+| Variant type  | JSON type | Representation requirements                      | Example       |
+|---------------|-----------|--------------------------------------------------|---------------|
+| Null type     | null      | `null`                                           | `null`        |
+| Boolean       | boolean   | `true` or `false`                                | `true`        |
+| Exact Numeric | number    | Digits in fraction must match scale, no exponent | `34`, 34.00   |

Review Comment:
   > Why would we require an engine to produce a normalized value?
   
   At least for me, it isn't about "requiring" an engine to produce a
normalized value first. If an engine reads Variant and converts it to JSON, it
is possibly doing so through an internal representation, so that it can still
apply operators on top of the JSON value, and it may even store the value in
that internal representation. Conversion to a string is really only an
end-user-visible step. So when I read this, it seems to require an engine NOT
to normalize, which could be hard to implement for some engines.
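
   To make the concern concrete, here is a minimal Python sketch, not taken from the PR. The helper name `exact_numeric_to_json` and the `(unscaled, scale)` input shape are assumptions made for illustration. It renders an exact numeric with fraction digits matching its scale and no exponent, then shows how a round trip through a generic internal JSON representation (here, Python's `json` module) loses the scale.

```python
import json
from decimal import Decimal


def exact_numeric_to_json(unscaled: int, scale: int) -> str:
    """Render unscaled * 10**-scale as JSON number text.

    Hypothetical helper: keeps exactly `scale` fraction digits and never
    emits an exponent, as the "Exact Numeric" table row asks for.
    """
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # The tuple constructor is exact; it does not round to context precision.
    value = Decimal((sign, digits, -scale))
    # The "f" format with no precision preserves the stored scale.
    return format(value, "f")


def normalize_via_internal_json(text: str) -> str:
    """Parse and re-serialize, as an engine with an internal JSON type might."""
    return json.dumps(json.loads(text))


scale_preserving = exact_numeric_to_json(3400, 2)
after_round_trip = normalize_via_internal_json(scale_preserving)
print(scale_preserving, after_round_trip)
```

   An engine that keeps its own parsed JSON form behaves like the second function, which is why a rule that forbids normalization could be hard for it to satisfy.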






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-04 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868938364


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868866777


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868839388


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868856218


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868845406


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868840473


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868836731


##
VariantShredding.md:
##
@@ -25,290 +25,318 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868833706


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868818645


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as 
defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded 
according to the [Variant Shredding specification](VariantShredding.md).
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:

Review Comment:
   Added a concrete suggestion. This is more about flow and what the reader is 
expected to understand at this point.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868824200


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868820400


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.

Review Comment:
   Sorry, this should have been "VARIANT" (not JSON) path. Would that make more 
sense?






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868819244


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as 
defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded 
according to the [Variant Shredding specification](VariantShredding.md).
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:
+```
+optional group shredded_variant_name (VARIANT) {
+  required binary metadata;
+  optional binary value;
+  optional int64 typed_value;
+}
+```
+
+The `VARIANT` annotation places no additional restrictions on the repetition 
of Variant groups, but repetition may be restricted by containing types (such 
as `MAP` and `LIST`).

Review Comment:
   OK, I don't think this matters too much.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-12-03 Thread via GitHub


emkornfield commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1868818298


##
VariantEncoding.md:
##
@@ -39,13 +39,42 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+Variant fields can also be _shredded_.
+Shredding refers to extracting some elements of the variant into separate 
columns for more efficient extraction/filter pushdown.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is `required` and must be a valid Variant metadata, as 
defined below.
+* The `value` field must be annotated as `required` for unshredded Variant 
values, or `optional` if parts of the value are [shredded](VariantShredding.md) 
as typed Parquet columns.
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:
+```
+optional group shredded_variant_name (VARIANT) {
+  required binary metadata;
+  optional binary value;
+  optional int64 typed_value;

Review Comment:
   ```suggestion
 // The exact semantics of this field are discussed in detail below, but this column stores the variant value when it is an integer.
 optional int64 typed_value;
   ```
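
A minimal Python sketch of how a reader might interpret this pair of columns. It assumes that a non-null `typed_value` takes precedence, that a null `typed_value` falls back to the Variant-encoded `value`, and that both being non-null is invalid for a primitive shredded type. The helper name `read_shredded_int` is hypothetical, and the `value` bytes are treated as opaque.

```python
from typing import Optional, Union


def read_shredded_int(
    value: Optional[bytes], typed_value: Optional[int]
) -> Union[int, bytes, None]:
    """Pick the column that holds the Variant for one row.

    Hypothetical helper: returns the shredded integer when present,
    otherwise the still-encoded Variant bytes (or None if both are null).
    """
    if typed_value is not None:
        if value is not None:
            # Assumption: a primitive cannot be partially shredded.
            raise ValueError("value and typed_value are both non-null")
        return typed_value
    # Either an unshredded Variant (opaque bytes) or a missing value.
    return value
```

A row holding an integer would be read from `typed_value`, while a row holding, say, a string would come back as the encoded bytes in `value`.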






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859151894


##
VariantEncoding.md:
##
@@ -416,14 +444,36 @@ Field names are case-sensitive.
 Field names are required to be unique for each object.
 It is an error for an object to contain two fields with the same name, whether 
or not they have distinct dictionary IDs.
 
-# Versions and extensions
+## Versions and extensions
 
 An implementation is not expected to parse a Variant value whose metadata 
version is higher than the version supported by the implementation.
 However, new types may be added to the specification without incrementing the 
version ID.
 In such a situation, an implementation should be able to read the rest of the 
Variant value if desired.
 
-# Shredding
+## Shredding
 
 A single Variant object may have poor read performance when only a small 
subset of fields are needed.
 A better approach is to create separate columns for individual fields, 
referred to as shredding or subcolumnarization.
 [VariantShredding.md](VariantShredding.md) describes the Variant shredding 
specification in Parquet.
+
+## Conversion to JSON
+
+Values stored in the Variant encoding are a superset of JSON values.
+For example, a Variant value can be a date that has no equivalent type in JSON.
+To maximize compatibility with readers that can process JSON but not Variant, 
the following conversions should be used when producing JSON from a Variant:
+
+| Variant type  | JSON type | Representation requirements                      | Example      |
+|---------------|-----------|--------------------------------------------------|--------------|
+| Null type     | null      | `null`                                           | `null`       |
+| Boolean       | boolean   | `true` or `false`                                | `true`       |
+| Exact Numeric | number    | Digits in fraction must match scale, no exponent | `34`, 34.00  |

Review Comment:
   > When an engine wants to convert a variant value to a JSON string, here are 
the rules
   
   Yes, this is correct. We want a clear way to convert to a JSON string. 
However, the normalization needs to happen first. We don't want to specify that 
the JSON must be any more lossy than it already is.
   
   Why would we require an engine to produce a normalized value?
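
To illustrate the "Exact Numeric" row quoted above, here is a minimal Python sketch. It assumes the decimal is available as an unscaled integer plus a non-negative scale; the function name `exact_numeric_to_json` is hypothetical and not part of the spec or any Parquet library.

```python
def exact_numeric_to_json(unscaled: int, scale: int) -> str:
    """Render (unscaled, scale) as a JSON number literal.

    The fraction keeps exactly `scale` digits and no exponent is used.
    Pure string handling avoids any rounding for wide decimals.
    """
    if scale < 0:
        raise ValueError("scale must be non-negative")
    sign = "-" if unscaled < 0 else ""
    # Left-pad so there is at least one digit before the decimal point.
    digits = str(abs(unscaled)).rjust(scale + 1, "0")
    if scale == 0:
        return sign + digits
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"
```

With these assumptions, an unscaled value of 3400 with scale 2 is written as `34.00` rather than being normalized to `34`.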






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859148222


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859147187


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859143649


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
-These represent a fixed schema suitable for constructing the full Variant value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational (scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter.
+Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the metadata.
 
-Consider the following Parquet schema together with how Variant values might be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
-Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859141628


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to represent complex, evolving data with an unbounded number of unique fields while limiting the size of file schemas, and retaining the performance benefits of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more compact data encoding, column statistics for data skipping, and partial projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group with one or more fields called `object`, `array`, `typed_value`, and `variant_value`.
-These represent a fixed schema suitable for constructing the full Variant value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') FROM tbl` only needs to load field `event_ts`, and if that column is shredded, it can be read by columnar projection without reading or deserializing the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, '$.event_type', 'string') = 'signup'`, the `event_type` shredded column metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's columnar representation, such as more compact data encoding, min/max statistics for data skipping, and I/O and CPU savings from pruning unnecessary fields not accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational (scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, ‘string’) from tbl` only needs to access `inner_field2`, and the file scan could avoid fetching the rest of the Variant value if this field was shredded into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, ‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` column, and only fetch/decode the full Variant value for rows that pass the filter.
+Variant metadata is stored in the top-level Variant group in a binary `metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the metadata.
 
-Consider the following Parquet schema together with how Variant values might be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store the shredding schema per Parquet file, and each file can contain several row groups.
-Selecting a type for each field that is acceptable for all rows would be impractical because it would require buffering the contents of an entire file before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially represented: any fields that are present in the schema must be written to those fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859139239


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859130304


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859127325


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859117065


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859108674


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859092517


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and if that column is shredded, 
it can be read by columnar projection without reading or deserializing the rest 
of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859099929


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859095957


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859093933


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859087543


##
VariantShredding.md:
##

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859086423


##
VariantShredding.md:
##
@@ -25,290 +25,316 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, column statistics for data skipping, and partial 
projections.

Review Comment:
   I think JSON makes it more confusing because these objects are not JSON and 
contain typed values.
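
A minimal sketch of the point above, not taken from the PR: the dictionary is a hypothetical stand-in for a Variant object (it is not the Variant binary encoding), and the field names are made up for illustration. It shows that serializing typed values to JSON collapses dates and decimals into strings, so the JSON text no longer carries the original types.

```python
import json
from datetime import date
from decimal import Decimal

# Hypothetical Variant-like object holding typed values.
event = {
    "event_ts": date(2024, 11, 26),
    "amount": Decimal("12.50"),
    "count": 3,
}


def to_json(obj):
    # JSON has no date or decimal type, so both are coerced to strings here.
    return json.dumps(obj, default=str, sort_keys=True)


round_tripped = json.loads(to_json(event))

# Record which Python type each field comes back as after the round trip.
type_names = {name: type(value).__name__ for name, value in round_tripped.items()}
print(type_names)
```

After the round trip only `count` keeps a numeric type; `event_ts` and `amount` come back as plain strings.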






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859084567


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as 
defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded 
according to the [Variant Shredding specification](VariantShredding.md).
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:
+```
+optional group shredded_variant_name (VARIANT) {
+  required binary metadata;
+  optional binary value;
+  optional int64 typed_value;
+}
+```
+
+The `VARIANT` annotation places no additional restrictions on the repetition 
of Variant groups, but repetition may be restricted by containing types (such 
as `MAP` and `LIST`).

Review Comment:
   I don't agree that it is considered a primitive type, and we don't need to 
treat it as one in order to state that it places no additional restrictions on 
the repetition of Variant groups.
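
The field rules listed in the quoted hunk can be sketched as a small check. This is only an illustration under stated assumptions: `Field` and `check_variant_group` are hypothetical names, not part of any Parquet library, and treating the presence of `typed_value` as the marker for a shredded group is an assumption made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    repetition: str      # "required" or "optional"
    physical_type: str   # e.g. "binary", "int64"


def check_variant_group(fields, annotated_variant=True):
    """Return a list of rule violations; an empty list means the group passes."""
    errors = []
    by_name = {f.name: f for f in fields}

    if not annotated_variant:
        errors.append("group must be annotated with VARIANT")

    metadata = by_name.get("metadata")
    if metadata is None or (metadata.repetition, metadata.physical_type) != ("required", "binary"):
        errors.append("metadata must be a required binary field")

    # Assumption: a typed_value field marks the group as shredded.
    is_shredded = "typed_value" in by_name
    value = by_name.get("value")
    if value is None or value.physical_type != "binary":
        errors.append("value must be a binary field")
    elif not is_shredded and value.repetition != "required":
        errors.append("value must be required when the Variant is unshredded")

    return errors


unshredded_group = [
    Field("metadata", "required", "binary"),
    Field("value", "required", "binary"),
]
shredded_group = [
    Field("metadata", "required", "binary"),
    Field("value", "optional", "binary"),
    Field("typed_value", "optional", "int64"),
]
invalid_group = [
    Field("metadata", "required", "binary"),
    Field("value", "optional", "binary"),
]

for group in (unshredded_group, shredded_group, invalid_group):
    print(check_variant_group(group))
```

The check looks only at the fields inside the group and ignores the repetition of the group itself, which matches the statement that the `VARIANT` annotation places no additional restrictions on it.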






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859083339


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as 
defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded 
according to the [Variant Shredding specification](VariantShredding.md).
+* When present, the `value` field must be a valid Variant value, as defined 
below. 
+
+This is the expected unshredded representation in Parquet:
+
+```
+optional group variant_name (VARIANT) {
+  required binary metadata;
+  required binary value;
+}
+```
+
+This is an example representation of a shredded Variant in Parquet:

Review Comment:
   This already points to the shredding spec in multiple places, so I think it 
is clear how to get more information about `typed_value`.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859080002


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.

Review Comment:
   Thanks! Updated.






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859077883


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859075924


##
VariantEncoding.md:
##
@@ -39,13 +39,41 @@ Another motivation for the representation is that (aside 
from metadata) each nes
 For example, in a Variant containing an Array of Variant values, the 
representation of an inner Variant value, when paired with the metadata of the 
full variant, is itself a valid Variant.
 
 This document describes the Variant Binary Encoding scheme.
-[VariantShredding.md](VariantShredding.md) describes the details of the 
Variant shredding scheme.
+The [Variant Shredding specification](VariantShredding.md) describes the 
details of shredding Variant values as typed Parquet columns.
+
+## Variant in Parquet
 
-# Variant in Parquet
 A Variant value in Parquet is represented by a group with 2 fields, named 
`value` and `metadata`.
-Both fields `value` and `metadata` are of type `binary`, and cannot be `null`.
 
-# Metadata encoding
+* The Variant group must be annotated with the `VARIANT` logical type.
+* Both fields `value` and `metadata` must be of type `binary` (called 
`BYTE_ARRAY` in the Parquet thrift definition).
+* The `metadata` field is required and must be a valid Variant metadata, as 
defined below.
+* The `value` field is required for unshredded Variant values.
+* The `value` field is optional when parts of the Variant value are shredded 
according to the [Variant Shredding specification](VariantShredding.md).

Review Comment:
   I've updated this to make it clear that this is referring to the repetition 
level. There are also examples, so I think that it is unambiguous.
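
   As a rough illustration of the rules quoted above, here is a minimal Python 
sketch that checks the repetition and type of the `metadata` and `value` 
fields. The `Field` class and `check_variant_group` function are hypothetical 
names invented for this sketch; they are not part of any Parquet library, and 
the check covers only the bullets quoted in this hunk.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one field of a Parquet group."""
    name: str
    repetition: str      # "required" or "optional"
    physical_type: str   # e.g. "binary", "int64"


def check_variant_group(fields: List[Field]) -> List[str]:
    """Return a list of rule violations for a Variant group (empty if none)."""
    by_name = {f.name: f for f in fields}
    errors = []

    metadata = by_name.get("metadata")
    if (metadata is None or metadata.repetition != "required"
            or metadata.physical_type != "binary"):
        errors.append("metadata must be a required binary field")

    value = by_name.get("value")
    # Assumption of this sketch: a typed_value field marks the group as shredded.
    shredded = "typed_value" in by_name
    if value is None:
        errors.append("value field is missing")
    elif value.physical_type != "binary":
        errors.append("value must be of type binary")
    elif not shredded and value.repetition != "required":
        errors.append("value must be required when the Variant is unshredded")

    return errors


unshredded = [Field("metadata", "required", "binary"),
              Field("value", "required", "binary")]
shredded = [Field("metadata", "required", "binary"),
            Field("value", "optional", "binary"),
            Field("typed_value", "optional", "int64")]
print(check_variant_group(unshredded))
print(check_variant_group(shredded))
```

   Both example groups mirror the unshredded and shredded schemas in the diff, 
so neither should report a violation under these assumptions.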






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859071155


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 

Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859061998


##
VariantEncoding.md:
##
@@ -416,14 +444,36 @@ Field names are case-sensitive.
 Field names are required to be unique for each object.
 It is an error for an object to contain two fields with the same name, whether 
or not they have distinct dictionary IDs.
 
-# Versions and extensions
+## Versions and extensions
 
 An implementation is not expected to parse a Variant value whose metadata 
version is higher than the version supported by the implementation.
 However, new types may be added to the specification without incrementing the 
version ID.
 In such a situation, an implementation should be able to read the rest of the 
Variant value if desired.
 
-# Shredding
+## Shredding
 
 A single Variant object may have poor read performance when only a small 
subset of fields are needed.
 A better approach is to create separate columns for individual fields, 
referred to as shredding or subcolumnarization.
 [VariantShredding.md](VariantShredding.md) describes the Variant shredding 
specification in Parquet.
+
+## Conversion to JSON
+
+Values stored in the Variant encoding are a superset of JSON values.
+For example, a Variant value can be a date that has no equivalent type in JSON.
+To maximize compatibility with readers that can process JSON but not Variant, 
the following conversions should be used when producing JSON from a Variant:
+
+| Variant type  | JSON type | Representation requirements                              | Example                            |
+|---------------|-----------|----------------------------------------------------------|------------------------------------|
+| Null type     | null      | `null`                                                   | `null`                             |
+| Boolean       | boolean   | `true` or `false`                                        | `true`                             |
+| Exact Numeric | number    | Digits in fraction must match scale, no exponent         | `34`, `34.00`                      |
+| Float         | number    | Fraction must be present                                 | `14.20`                            |
+| Double        | number    | Fraction must be present                                 | `1.0`                              |
+| Date          | string    | ISO-8601 formatted date                                  | `"2017-11-16"`                     |
+| Timestamp     | string    | ISO-8601 formatted UTC timestamp including +00:00 offset | `"2017-11-16T22:31:08.01+00:00"`   |
+| TimestampNTZ  | string    | ISO-8601 formatted UTC timestamp with no offset or zone  | `"2017-11-16T22:31:08.01"`         |

Review Comment:
   In that case, I'll require trailing 0s.
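
   To make the table's rules concrete, here is a minimal Python sketch of the 
exact numeric, date, and timestamp rows. The function names are hypothetical, 
and the sketch assumes an exact numeric arrives as an unscaled integer plus a 
scale. Rendering timestamps with a fixed six-digit fraction is also an 
assumption of this sketch, not something the table states.

```python
import json
from datetime import date, datetime, timezone
from decimal import Decimal


def exact_numeric_to_json(unscaled: int, scale: int) -> str:
    """Render a decimal so the fraction has exactly `scale` digits, no exponent."""
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a tuple keeps the exponent, so trailing zeros are preserved.
    return format(Decimal((sign, digits, -scale)), "f")


def date_to_json(value: date) -> str:
    """Render a date as a quoted ISO-8601 string."""
    return json.dumps(value.isoformat())


def timestamp_to_json(value: datetime) -> str:
    """Render a timezone-aware timestamp in UTC with a +00:00 offset."""
    utc = value.astimezone(timezone.utc)
    # Assumption: always emit microseconds, which keeps trailing zeros.
    return json.dumps(utc.isoformat(timespec="microseconds"))


print(exact_numeric_to_json(3400, 2))
print(date_to_json(date(2017, 11, 16)))
print(timestamp_to_json(datetime(2017, 11, 16, 22, 31, 8, 10000, tzinfo=timezone.utc)))
```

   With a scale of 2, the unscaled value 3400 keeps both fraction digits rather 
than collapsing to `34`, which is the behavior the "must match scale" 
requirement asks for.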






Re: [PR] Simplify Variant shredding and refactor for clarity [parquet-format]

2024-11-26 Thread via GitHub


rdblue commented on code in PR #461:
URL: https://github.com/apache/parquet-format/pull/461#discussion_r1859059592


##
VariantShredding.md:
##
@@ -25,276 +25,302 @@
 The Variant type is designed to store and process semi-structured data 
efficiently, even with heterogeneous values.
 Query engines encode each Variant value in a self-describing format, and store 
it as a group containing `value` and `metadata` binary fields in Parquet.
 Since data is often partially homogenous, it can be beneficial to extract 
certain fields into separate Parquet columns to further improve performance.
-We refer to this process as **shredding**.
-Each Parquet file remains fully self-describing, with no additional metadata 
required to read or fully reconstruct the Variant data from the file.
-Combining shredding with a binary residual provides the flexibility to 
represent complex, evolving data with an unbounded number of unique fields 
while limiting the size of file schemas, and retaining the performance benefits 
of a columnar format.
+This process is **shredding**.
 
-This document focuses on the shredding semantics, Parquet representation, 
implications for readers and writers, as well as the Variant reconstruction.
-For now, it does not discuss which fields to shred, user-facing API changes, 
or any engine-specific considerations like how to use shredded columns.
-The approach builds upon the [Variant Binary Encoding](VariantEncoding.md), 
and leverages the existing Parquet specification.
+Shredding enables the use of Parquet's columnar representation for more 
compact data encoding, the use of column statistics for data skipping, and 
partial projections from Parquet's columnar layout.
 
-At a high level, we replace the `value` field of the Variant Parquet group 
with one or more fields called `object`, `array`, `typed_value`, and 
`variant_value`.
-These represent a fixed schema suitable for constructing the full Variant 
value for each row.
+For example, the query `SELECT variant_get(event, '$.event_ts', 'timestamp') 
FROM tbl` only needs to load field `event_ts`, and shredding can enable 
columnar projection that ignores the rest of the `event` Variant.
+Similarly, for the query `SELECT * FROM tbl WHERE variant_get(event, 
'$.event_type', 'string') = 'signup'`, the `event_type` shredded column 
metadata can be used for skipping and to lazily load the rest of the Variant.
 
-Shredding allows a query engine to reap the full benefits of Parquet's 
columnar representation, such as more compact data encoding, min/max statistics 
for data skipping, and I/O and CPU savings from pruning unnecessary fields not 
accessed by a query (including the non-shredded Variant binary data).
-Without shredding, any query that accesses a Variant column must fetch all 
bytes of the full binary buffer.
-With shredding, we can get nearly equivalent performance as in a relational 
(scalar) data model.
+## Variant Metadata
 
-For example, `select variant_get(variant_col, ‘$.field1.inner_field2’, 
‘string’) from tbl` only needs to access `inner_field2`, and the file scan 
could avoid fetching the rest of the Variant value if this field was shredded 
into a separate column in the Parquet schema.
-Similarly, for the query `select * from tbl where variant_get(variant_col, 
‘$.id’, ‘integer’) = 123`, the scan could first decode the shredded `id` 
column, and only fetch/decode the full Variant value for rows that pass the 
filter.
+Variant metadata is stored in the top-level Variant group in a binary 
`metadata` column regardless of whether the Variant value is shredded.
 
-# Parquet Example
+All `value` columns within the Variant must use the same `metadata`.
+All field names of a Variant, whether shredded or not, must be present in the 
metadata.
 
-Consider the following Parquet schema together with how Variant values might 
be mapped to it.
-Notice that we represent each shredded field in `object` as a group of two 
fields, `typed_value` and `variant_value`.
-We extract all homogenous data items of a certain path into `typed_value`, and 
set aside incompatible data items in `variant_value`.
-Intuitively, incompatibilities within the same path may occur because we store 
the shredding schema per Parquet file, and each file can contain several row 
groups.
-Selecting a type for each field that is acceptable for all rows would be 
impractical because it would require buffering the contents of an entire file 
before writing.
+## Value Shredding
 
-Typically, the expectation is that `variant_value` exists at every level as an 
option, along with one of `object`, `array` or `typed_value`.
-If the actual Variant value contains a type that does not match the provided 
schema, it is stored in `variant_value`.
-An `variant_value` may also be populated if an object can be partially 
represented: any fields that are present in the schema must be written to those 
fields, and any missing fields are written to `variant_value`.
-
-The 
