Re: Extending Schema Element

2025-11-26 Thread Micah Kornfield
>
> Sure, it's conceptually nicer to have it
> within the schema, but do you see any concrete disadvantage besides
> conceptual clarity of just putting it into the existing key value metadata
> section?


Conceptual clarity actually seems pretty valuable; are there technical
downsides to adding this even though workarounds exist? Another use case
that has an open issue against it is adding a "description" field for each
column. I think there are three options here:
1. Define a well-known key-value pair in the file metadata for this purpose.
2. Add a specialized field for it.
3. Add key-value metadata to schema elements.



On Sat, Nov 15, 2025 at 3:55 AM Andrew Lamb  wrote:

> In addition to putting additional data directly in the thrift metadata
> (either as key=value pairs or thrift fields), another approach is to store
> the information "inline" in the file's body and store only an offset to the
> information in the key=value metadata (this is the approach explained in
> this blog[1] for indexes, but it can be used to store any arbitrary bytes)
>
> Andrew
>
> [1]:
> https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
>
> On Fri, Nov 14, 2025 at 3:07 PM Andrew Bell 
> wrote:
>
> > On Sun, Nov 2, 2025 at 1:00 PM Jan Finis  wrote:
> >
> > > Note that you can already put such metadata into the footer by just
> > putting
> > > it into the regular key-value metadata. Put a JSON array as value there
> > > with the same number of entries as the schema, then you have an
> implicit
> > > 1-to-1 mapping per column. We already use this to store per-column
> > metadata
> > > and haven't encountered any problems with it so far.
> > >
> >
> > Of course you can put anything you want into a single metadata slot.
> > The hope is to have something that's sensible and semantically clear. An
> > advantage of the Thrift encoding is that adding structure entries doesn't
> > impact existing readers as they ignore values that they don't recognize.
> >
> > I think this is a free lunch proposal -- there is benefit and no harm.
> >
> > Here is another possibility: how about allowing extension of the Parquet
> > Thrift IDL in general by permitting all negative values in defined
> Structs
> > to be owned by users? There could be some registry if desired, but
> > something like this would allow users to add whatever data they like to
> the
> > existing metadata layout without impacting those using the standard IDL.
> > Although the Thrift IDL doc doesn't specify size for a Struct identifier,
> > the generated .tcc code uses a signed 16 bit value. This should allow for
> > plenty of additions to the accepted spec and user additions as well.
> Again,
> > there would be no impact to existing readers or writers.
> >
> > --
> > Andrew Bell
> > [email protected]
> >
>


Re: Extending Schema Element

2025-11-15 Thread Andrew Lamb
In addition to putting additional data directly in the thrift metadata
(either as key=value pairs or thrift fields), another approach is to store
the information "inline" in the file's body and store only an offset to the
information in the key=value metadata (this is the approach explained in
this blog[1] for indexes, but it can be used to store any arbitrary bytes)

Andrew

[1]:
https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/
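
A minimal sketch of the offset approach, using an in-memory buffer in place
of a real Parquet file (the key name "my.index.location" and its JSON layout
are assumptions for illustration, not a standard):

```python
import io
import json

# Hypothetical index payload to embed in the file body.
index_bytes = b"\x00\x01\x02 arbitrary serialized index"

body = io.BytesIO()
body.write(b"...column chunks and page data...")  # the normal file contents

# Append the payload after the data, where the footer would normally follow,
# and remember where it starts.
offset = body.tell()
body.write(index_bytes)

# Store only the location as a key=value entry in the footer metadata.
kv_metadata = {
    "my.index.location": json.dumps({"offset": offset, "length": len(index_bytes)})
}

# A reader that knows the key seeks straight to the payload; readers that
# don't know it see only an opaque key=value pair.
loc = json.loads(kv_metadata["my.index.location"])
body.seek(loc["offset"])
recovered = body.read(loc["length"])
assert recovered == index_bytes
```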

On Fri, Nov 14, 2025 at 3:07 PM Andrew Bell 
wrote:

> On Sun, Nov 2, 2025 at 1:00 PM Jan Finis  wrote:
>
> > Note that you can already put such metadata into the footer by just
> putting
> > it into the regular key-value metadata. Put a JSON array as value there
> > with the same number of entries as the schema, then you have an implicit
> > 1-to-1 mapping per column. We already use this to store per-column
> metadata
> > and haven't encountered any problems with it so far.
> >
>
> Of course you can put anything you want into a single metadata slot.
> The hope is to have something that's sensible and semantically clear. An
> advantage of the Thrift encoding is that adding structure entries doesn't
> impact existing readers as they ignore values that they don't recognize.
>
> I think this is a free lunch proposal -- there is benefit and no harm.
>
> Here is another possibility: how about allowing extension of the Parquet
> Thrift IDL in general by permitting all negative values in defined Structs
> to be owned by users? There could be some registry if desired, but
> something like this would allow users to add whatever data they like to the
> existing metadata layout without impacting those using the standard IDL.
> Although the Thrift IDL doc doesn't specify size for a Struct identifier,
> the generated .tcc code uses a signed 16 bit value. This should allow for
> plenty of additions to the accepted spec and user additions as well. Again,
> there would be no impact to existing readers or writers.
>
> --
> Andrew Bell
> [email protected]
>


Re: Extending Schema Element

2025-11-14 Thread Andrew Bell
On Sun, Nov 2, 2025 at 1:00 PM Jan Finis  wrote:

> Note that you can already put such metadata into the footer by just putting
> it into the regular key-value metadata. Put a JSON array as value there
> with the same number of entries as the schema, then you have an implicit
> 1-to-1 mapping per column. We already use this to store per-column metadata
> and haven't encountered any problems with it so far.
>

Of course you can put anything you want into a single metadata slot.
The hope is to have something that's sensible and semantically clear. An
advantage of the Thrift encoding is that adding struct fields doesn't
impact existing readers, as they ignore fields they don't recognize.

I think this is a free lunch proposal -- there is benefit and no harm.

Here is another possibility: how about allowing extension of the Parquet
Thrift IDL in general by permitting all negative field IDs in defined
structs to be owned by users? There could be some registry if desired, but
something like this would allow users to add whatever data they like to the
existing metadata layout without impacting those using the standard IDL.
Although the Thrift IDL doc doesn't specify a size for a struct field
identifier, the generated .tcc code uses a signed 16-bit value. This should
leave plenty of room for both additions to the accepted spec and user
additions. Again, there would be no impact on existing readers or writers.
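
The forward-compatibility property this relies on can be illustrated with a
toy tag-length-value codec (a sketch only; this is not Thrift's actual wire
format):

```python
import struct

# Toy stand-in for Thrift's field framing: each field is encoded as
# (int16 field id, uint32 payload length, payload bytes).
def encode_fields(fields):
    out = b""
    for fid, payload in fields:
        out += struct.pack(">hI", fid, len(payload)) + payload
    return out

def decode_known(buf, known_ids):
    """Decode only recognized field ids and silently skip the rest --
    the behavior that lets writers add fields (e.g. user-owned negative
    ids) without breaking existing readers."""
    result, pos = {}, 0
    while pos < len(buf):
        fid, length = struct.unpack_from(">hI", buf, pos)
        pos += 6  # header: 2-byte id + 4-byte length
        if fid in known_ids:
            result[fid] = buf[pos:pos + length]
        pos += length  # unknown ids are stepped over, never an error
    return result

buf = encode_fields([(1, b"standard"), (-100, b"user extension")])
# A reader built against the standard IDL only sees field 1.
assert decode_known(buf, {1}) == {1: b"standard"}
```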

-- 
Andrew Bell
[email protected]


Re: Extending Schema Element

2025-11-02 Thread Jan Finis
Note that you can already put such metadata into the footer by just putting
it into the regular key-value metadata. Put a JSON array as the value, with
one entry per schema column; then you have an implicit 1-to-1 mapping per
column. We already use this to store per-column metadata and haven't
encountered any problems with it so far.
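
A minimal sketch of this convention (the key name and JSON layout here are
assumptions; only the position-to-column correspondence matters):

```python
import json

# Columns in schema order, as written to the Parquet file.
columns = ["x", "y", "z", "intensity"]

# One JSON entry per column; entry i describes column i.
per_column = [
    {"unit": "m", "scale": 0.01},
    {"unit": "m", "scale": 0.01},
    {"unit": "m", "scale": 0.01, "offset": -1000.0},
    {},  # no extra metadata for "intensity"
]

# Stored as a single entry in the file's regular key-value metadata.
kv_metadata = {"my.per_column.meta": json.dumps(per_column)}

# A reader recovers the implicit 1-to-1 mapping by position.
decoded = json.loads(kv_metadata["my.per_column.meta"])
mapping = dict(zip(columns, decoded))
assert mapping["z"]["offset"] == -1000.0
```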

IMHO, this is actually enough. I don't see what practical advantage an
extra key-value section within the schema element would give you (unless
we're going back to the use case with super-wide schemas and you want to
decode only parts of the schema). Sure, it's conceptually nicer to have it
within the schema, but do you see any concrete disadvantage, besides
conceptual clarity, to just putting it into the existing key-value metadata
section?

Cheers,
Jan

Am So., 2. Nov. 2025 um 04:18 Uhr schrieb Dewey Dunnington <
[email protected]>:

> Point cloud data in Parquet...cool!
>
> This sounds similar to the concept of Field metadata in Arrow, where we
> also have extension types but people do use ad-hoc metadata to convey
> non-extension type information like you are describing. If you are using an
> Arrow implementation to write/read the Parquet file, the embedded Arrow
> schema may already be able to roundtrip that information.
>
> If this is specific to a domain (e.g., LiDAR), you could also invent a
> top-level key/value metadata standard (this is what GeoParquet did before
> Parquet GEOMETRY/GEOGRAPHY). A wider variety of Parquet
> implementations/versions would be able to access this information than if
> the Parquet thrift were updated.
>
> Many of the techniques used to encode LiDAR data/make rows more compact can
> also be achieved by expanding things like bitpacked fields into multiple
> columns or resolving things like scaled integers into an existing Parquet
> type. Parquet's encodings and compression may be able to accomplish
> something similar to how this is frequently stored in point cloud native
> formats (although would make it harder to roundtrip).
>
> Apologies if I'm missing context here!
>
> Cheers,
>
> -dewey
>
> On Fri, Oct 31, 2025 at 2:04 PM Andrew Bell 
> wrote:
>
> > On Fri, Oct 31, 2025 at 1:14 PM Micah Kornfield 
> > wrote:
> >
> > > Hi Andrew,
> > > If this is to support new type (point cloud data), is there a reason to
> > > choose a key value member to the schema over something like the
> extension
> > > type proposal [1]
> > >
> >
> > In some ways it's no different -- you're providing some data to ride
> along
> > with a column.  The extension type has the advantage of providing an
> > indirection which *might* be useful for the case when you have many
> columns
> > of the same type, though this seems a pretty specific use case and adds
> > additional complexity. However, extension types provide no hint of
> meaning
> > to be found in the "serialization" field (JSON is suggested, which could
> > provide keys, but would also require an additional parsing step).
> >
> > Allowing the addition of data to the existing SchemaElement is trivially
> > simple and more flexible. Users could add whatever data they like to
> > annotate their schema element without introducing anything to the type
> > system. For example, one could add a description to an integer element
> > without creating an "Integer with Description" type or provide language
> > information about a string without creating a type "String in French".
> >
> > The extension type proposal suggests that readers will be modified to
> > support the extension types.  Adding metadata directly to the
> SchemaElement
> > simply allows code *outside* of a Parquet reader to use the information
> for
> > its own purpose -- a reader only needs to provide an API to access the
> > metadata to be useful.
> >
> > Some examples from point cloud data:
> >
> > - Integers to which a scale and offset are applied to create a nominal
> > value (the current integer-based scale/offset are insufficient).
> > - Units for many types.
> > - GPS times are stored in several ways -- having metadata which may or
> may
> > not include an offset allows for proper interpretation.
> > - Descriptions of bit fields packed into integers.
> > - Indication that "return" numbers are synthetically generated. (A laser
> > pulse can create multiple points, each known as a "return").
> >
> > There's certainly nothing that precludes doing both extension types and
> > adding metadata support for SchemaElements.
> >
> > --
> > Andrew Bell
> > [email protected]
> >
>


Re: Extending Schema Element

2025-11-01 Thread Dewey Dunnington
Point cloud data in Parquet...cool!

This sounds similar to the concept of Field metadata in Arrow, where we
also have extension types but people do use ad-hoc metadata to convey
non-extension type information like you are describing. If you are using an
Arrow implementation to write/read the Parquet file, the embedded Arrow
schema may already be able to roundtrip that information.

If this is specific to a domain (e.g., LiDAR), you could also invent a
top-level key/value metadata standard (this is what GeoParquet did before
Parquet GEOMETRY/GEOGRAPHY). A wider variety of Parquet
implementations/versions would be able to access this information than if
the Parquet thrift were updated.

Many of the techniques used to make encoded LiDAR rows more compact can
also be achieved by expanding bitpacked fields into multiple columns or
resolving scaled integers into an existing Parquet type. Parquet's encodings
and compression may accomplish something similar to how this data is
frequently stored in point-cloud-native formats (although that would make it
harder to roundtrip).

Apologies if I'm missing context here!

Cheers,

-dewey

On Fri, Oct 31, 2025 at 2:04 PM Andrew Bell 
wrote:

> On Fri, Oct 31, 2025 at 1:14 PM Micah Kornfield 
> wrote:
>
> > Hi Andrew,
> > If this is to support new type (point cloud data), is there a reason to
> > choose a key value member to the schema over something like the extension
> > type proposal [1]
> >
>
> In some ways it's no different -- you're providing some data to ride along
> with a column.  The extension type has the advantage of providing an
> indirection which *might* be useful for the case when you have many columns
> of the same type, though this seems a pretty specific use case and adds
> additional complexity. However, extension types provide no hint of meaning
> to be found in the "serialization" field (JSON is suggested, which could
> provide keys, but would also require an additional parsing step).
>
> Allowing the addition of data to the existing SchemaElement is trivially
> simple and more flexible. Users could add whatever data they like to
> annotate their schema element without introducing anything to the type
> system. For example, one could add a description to an integer element
> without creating an "Integer with Description" type or provide language
> information about a string without creating a type "String in French".
>
> The extension type proposal suggests that readers will be modified to
> support the extension types.  Adding metadata directly to the SchemaElement
> simply allows code *outside* of a Parquet reader to use the information for
> its own purpose -- a reader only needs to provide an API to access the
> metadata to be useful.
>
> Some examples from point cloud data:
>
> - Integers to which a scale and offset are applied to create a nominal
> value (the current integer-based scale/offset are insufficient).
> - Units for many types.
> - GPS times are stored in several ways -- having metadata which may or may
> not include an offset allows for proper interpretation.
> - Descriptions of bit fields packed into integers.
> - Indication that "return" numbers are synthetically generated. (A laser
> pulse can create multiple points, each known as a "return").
>
> There's certainly nothing that precludes doing both extension types and
> adding metadata support for SchemaElements.
>
> --
> Andrew Bell
> [email protected]
>


Re: Extending Schema Element

2025-10-31 Thread Andrew Bell
On Fri, Oct 31, 2025 at 1:14 PM Micah Kornfield 
wrote:

> Hi Andrew,
> If this is to support new type (point cloud data), is there a reason to
> choose a key value member to the schema over something like the extension
> type proposal [1]
>

In some ways it's no different -- you're providing some data to ride along
with a column. The extension type has the advantage of providing an
indirection, which *might* be useful when you have many columns of the same
type, though this seems a pretty specific use case and adds additional
complexity. However, extension types provide no hint of the meaning of the
"serialization" field (JSON is suggested, which could provide keys, but
would also require an additional parsing step).

Allowing the addition of data to the existing SchemaElement is trivially
simple and more flexible. Users could add whatever data they like to
annotate their schema element without introducing anything to the type
system. For example, one could add a description to an integer element
without creating an "Integer with Description" type or provide language
information about a string without creating a type "String in French".

The extension type proposal suggests that readers will be modified to
support the extension types.  Adding metadata directly to the SchemaElement
simply allows code *outside* of a Parquet reader to use the information for
its own purpose -- a reader only needs to provide an API to access the
metadata to be useful.

Some examples from point cloud data:

- Integers to which a scale and offset are applied to create a nominal
value (the current integer-based scale/offset are insufficient).
- Units for many types.
- GPS times are stored in several ways -- having metadata which may or may
not include an offset allows for proper interpretation.
- Descriptions of bit fields packed into integers.
- Indication that "return" numbers are synthetically generated. (A laser
pulse can create multiple points, each known as a "return").
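
Taking the first item as an example, a reader given hypothetical per-column
scale/offset metadata (the key names are illustrative) could recover nominal
values like this:

```python
# Hypothetical per-column annotations; the "scale"/"offset"/"unit" keys are
# illustrative, not a standard.
column_meta = {"z": {"scale": 0.01, "offset": -1000.0, "unit": "m"}}

def nominal_value(column, stored):
    """Recover the real-world value from a stored integer using a
    floating-point scale and offset (which the existing integer-based
    scale/offset in Parquet cannot express)."""
    meta = column_meta[column]
    return stored * meta["scale"] + meta["offset"]

# A stored elevation of 123456 decodes to roughly 234.56 meters.
assert abs(nominal_value("z", 123456) - 234.56) < 1e-6
```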

There's certainly nothing that precludes doing both extension types and
adding metadata support for SchemaElements.

-- 
Andrew Bell
[email protected]


Re: Extending Schema Element

2025-10-31 Thread Micah Kornfield
Hi Andrew,
If this is to support a new type (point cloud data), is there a reason to
choose a key-value member in the schema over something like the extension
type proposal [1]?

Thanks,
Micah

[1] https://github.com/apache/parquet-format/pull/451

On Thu, Oct 30, 2025 at 1:37 PM Andrew Bell 
wrote:

> Hi,
>
> I asked earlier about supporting a scaled integer type to support point
> cloud data. I think a non-intrusive way to handle this would be to add an
> *optional* KeyValue member for metadata to the SchemaElement struct. This
> would take care of my needs and perhaps the needs of others (someone wanted
> a "description" field). AFAICT this would require no changes on the part of
> any reader or writer -- only the "parquet.thrift" file would need to
> be changed. This would provide readers and writers the *option* to access
> SchemaElement metadata, but there would be no requirement to do so.
>
> I'm interested in feedback on this proposal.
>
> Thanks,
>
> --
> Andrew Bell
> [email protected]
>
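
The shape of the optional-member proposal quoted above can be sketched with
Python dataclasses (the field names are illustrative, not the actual
parquet.thrift definition):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KeyValue:
    key: str
    value: Optional[str] = None

@dataclass
class SchemaElement:
    name: str
    # The proposed addition: optional and absent by default, so files
    # written without it and readers that ignore it are both unaffected.
    key_value_metadata: Optional[List[KeyValue]] = None

# An existing-style element carries no extra metadata.
plain = SchemaElement(name="z")
assert plain.key_value_metadata is None

# A writer that opts in can annotate the element.
annotated = SchemaElement("z", [KeyValue("unit", "m")])
assert annotated.key_value_metadata[0].value == "m"
```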