Re: [DISCUSS] Deprecate file_path field in column chunk

2026-01-30 Thread Micah Kornfield
The PR has several approvals; if anybody still has major concerns, please
voice them.  I'll plan on merging on Monday.

Thanks,
Micah


Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-19 Thread Micah Kornfield
I also posted a PR updating the implementation status page to reflect that
what's actually supported is not file_path but _metadata files (
https://github.com/apache/parquet-site/pull/145).  I believe only
hyparquet might have support for actually reading external columns.

Thanks,
Micah


Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-10 Thread Micah Kornfield
Based on a conversation in the sync today, we thought a good path forward
is probably not to explicitly deprecate the field, but to provide guidance
and document that new uses of the field should go through a feature
addition process.

I put up https://github.com/apache/parquet-format/pull/542 for a straw-man
to capture this.


Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-09 Thread Julien Le Dem
IMO Iceberg needs to be aware of Parquet files referencing others, so that
it can prune older snapshots correctly and not delete Parquet files
referenced by others when deleting old snapshots. Whether this is a
cross-table or a within-table file-to-file reference could make it more or
less complicated.

I could imagine starting with a simple implementation for the write path:
"Create table foo using parquet options (reference_original_columns = true)
as Select content, extract(content) as metadata from bar"
That would be constrained to simple plans that have the scan and the output
in the same step (map only) so that rows are in the same order per file.
Alternatively "alter table foo add column metadata OPTIONS (column_family
= 'bar')" with a subsequent "update table set metadata=extract(content)" to
create those files.

(just some random thoughts, I'm sure others have spent more time thinking
about this)

This doesn't seem that different from the mechanism creating a deletion
vector in Iceberg.

It could also be seen as a view in iceberg joining on _row_id.
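That view-over-_row_id framing amounts to a positional join. A toy
illustration with sqlite3 standing in for the table format (purely
illustrative; the table and column names are made up, and nothing here is
Iceberg-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- source rows with a stable row id, standing in for the original files
    CREATE TABLE source (_row_id INTEGER PRIMARY KEY, content TEXT);
    INSERT INTO source VALUES (1, 'hello'), (2, 'world');

    -- the new column lives in its own table, written separately
    CREATE TABLE metadata_col (_row_id INTEGER PRIMARY KEY, metadata TEXT);
    INSERT INTO metadata_col VALUES (1, 'greeting'), (2, 'noun');

    -- the combined table is just a view joining on _row_id
    CREATE VIEW annotated AS
    SELECT s._row_id, s.content, m.metadata
    FROM source s JOIN metadata_col m USING (_row_id);
""")
rows = conn.execute("SELECT * FROM annotated ORDER BY _row_id").fetchall()
```

The join is only cheap if both sides are stored in the same row order,
which is exactly the map-only constraint described above for the write
path.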

This can be a topic in the meeting tomorrow.

Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-08 Thread Daniel Weeks
Thanks for the context Kenny.  That example is very similar to some of the
cases that come up in the multi-modal scenarios.

I agree that we're in a little bit of a difficult situation due to lack of
existing support, which also leads to Micah's concern that it's a point of
confusion for implementers.

I would be in favor of adding some additional context to the description
because there are some basic things implementers should do (e.g. validate
that the file path is either not set or set to the current file being read
if they don't support disaggregated column data).  While older clients will
likely break if they encounter files written this way, there's almost no
risk that it would result in silent failures or corruption as I suspect
most implementations will read the ranges from the referencing file and not
be able to interpret it.
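That suggested check could be sketched roughly as follows (a simplified
stand-in for the Thrift ColumnChunk metadata; the class and function names
are illustrative, not any real Parquet library's API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ColumnChunk:
    # Mirrors the Thrift ColumnChunk's optional file_path and required
    # file_offset fields (simplified stand-in, not a real API).
    file_path: Optional[str]
    file_offset: int

def check_column_chunks(chunks, current_file):
    """Reject metadata pointing at external files, for readers that
    don't support disaggregated column data."""
    for chunk in chunks:
        if chunk.file_path is not None and chunk.file_path != current_file:
            raise ValueError(
                f"unsupported external column chunk: {chunk.file_path!r}")

# An unset file_path, or one naming the file being read, is accepted:
check_column_chunks(
    [ColumnChunk(None, 4), ColumnChunk("data.parquet", 900)],
    current_file="data.parquet")
```

Failing loudly like this is what keeps the older-client behavior from
turning into silent misreads.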

Adding a read path is relatively straightforward (at least in the Java
implementation for both stream and vectored IO reads), but the write path
is where things get more complicated.

I think we want to discuss some of these use cases in more detail and see
if they are practical and reasonable.  Some cases may make more sense at a
higher-level (like table metadata) while others may make sense to handle at
the file level (like asymmetric column sizes).

-Dan




Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-07 Thread Kenny Daniel
Since I was the one who brought up file_path at the sync a couple weeks
ago, I'll share my thoughts:

I am interested in the file_path field for column chunks because it would
allow for some extremely efficient data engineering in specific cases
like *adding
a column to existing data*.

My use case is LLM data. LLM data is often huge piles of text in Parquet
format (see: all of huggingface, or any LLM request/response logs). If I
have a 400 MB source.parquet file, how can I annotate each row with an added
"score" column efficiently? I would prefer not to have to copy all 400 MB of
data just to add a "score" column. It would be slick if I could make a new
annotated.parquet file that points to source.parquet for the source
columns and only includes the new "score" column in annotated.parquet
itself. source.parquet would remain 400 MB, while annotated.parquet could
be ~10 KB and incorporate the source data by reference.
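Concretely, the footer of such an annotated.parquet might look something
like the dict below (a schematic Python model of the relevant ColumnChunk
fields, not real Parquet metadata; the helper function is hypothetical):

```python
# Schematic model of annotated.parquet's footer: the "content" chunk
# points back at source.parquet via file_path, while the new "score"
# column's data lives in annotated.parquet itself (file_path unset).
footer = {
    "row_groups": [{
        "columns": [
            {"path_in_schema": "content",
             "file_path": "source.parquet",  # external reference
             "file_offset": 4},
            {"path_in_schema": "score",
             "file_path": None,              # data is in this file
             "file_offset": 4},
        ],
    }],
}

def external_references(footer):
    # Collect the set of other files this footer depends on -- the set
    # a table format would have to keep alive when pruning snapshots.
    return {
        col["file_path"]
        for rg in footer["row_groups"]
        for col in rg["columns"]
        if col["file_path"] is not None
    }
```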

As the implementor of hyparquet I have conflicting opinions on this
feature. On the one hand, it's a cool capability, already built into
parquet. On the other hand... none of the parquet implementations support
it. Hyparquet has a branch for reading/writing file_path that I used for
testing. It does work. But I don't want to ship it unless there's at least
ONE other implementation that supports it (there isn't).

I agree that this would be better implemented at the table format level
(e.g. Iceberg). BUT... *Iceberg does not support my add-column use case*!
The problem is that, despite Parquet being a column-oriented format,
Iceberg has no way to efficiently zip a new column together with existing
data. The only option for "add column" in Iceberg would be to *add a column
with default values and then rewrite every row* (including the heavy text
data). So Iceberg fails to solve my problem at all.

Anyway, I'm fine with deprecating, or not. But I did want to at least make
the case that it could serve a purpose that I don't see any other good way
of solving at the moment.

Kenny




Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-05 Thread Micah Kornfield
Hi Dan,

> However, there are ongoing discussions around multi-modal cases where
> either separating large columns (e.g. inline blobs) or appending column
> data without rewriting existing data may leverage this.


Do you have any design docs or mailing list discussions you can point to?

I don't feel like leaving this for now while we explore those use cases
> would cause any additional confusion/complexity.


Agreed, it isn't urgent to clean this up. But having a more concrete
timeline would be helpful; this does seem to be a semi-regular source of
confusion for folks, so it would be nice to tie up the loose end.

Thanks,
Micah

On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks  wrote:

> I'd actually prefer that we don't deprecate this field (at least not
> immediately).
>
> Recognizing that we've discussed separating column data into multiple files
> for over a decade without any concrete implementations, there are emerging
> use cases that may benefit from investing in this feature.
>
> Many of the use cases in the past have been misaligned (e.g. separating
> column data for security/encryption) and better alternatives addressed
> those scenarios.
>
> However, there are ongoing discussions around multi-modal cases where
> either separating large columns (e.g. inline blobs) or appending column
> data without rewriting existing data may leverage this.
>
> I don't feel like leaving this for now while we explore those use cases
> would cause any additional confusion/complexity.
>
> -Dan


Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-05 Thread Daniel Weeks
I'd actually prefer that we don't deprecate this field (at least not
immediately).

Recognizing that we've discussed separating column data into multiple files
for over a decade without any concrete implementations, there are emerging
use cases that may benefit from investing in this feature.

Many of the use cases in the past have been misaligned (e.g. separating
column data for security/encryption) and better alternatives addressed
those scenarios.

However, there are ongoing discussions around multi-modal cases where
either separating large columns (e.g. inline blobs) or appending column
data without rewriting existing data may leverage this.

I don't feel like leaving this for now while we explore those use cases
would cause any additional confusion/complexity.

-Dan

On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield 
wrote:

> > What does "deprecated" entail here? Do we plan to remove this field
> from the format? Otherwise, is it just documentation?
>
> I was imagining just documentation, since we don't want to break the
> "_metadata file" use case.


Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-04 Thread Micah Kornfield
> What does "deprecated" entail here? Do we plan to remove this field
from the format? Otherwise, is it just documentation?

I was imagining just documentation, since we don't want to break the
"_metadata file" use case.

On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou  wrote:

>
> What does "deprecated" entail here? Do we plan to remove this field
> from the format? Otherwise, is it just documentation?
>
>


Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-04 Thread Antoine Pitrou


What does "deprecated" entail here? Do we plan to remove this field
from the format? Otherwise, is it just documentation?



On Mon, 1 Dec 2025 12:09:18 -0800
Micah Kornfield 
wrote:
> This has come up a few times in the sync and other forums.  I wanted to
> start the conversation about deprecating file_path
> 
> [1] in the parquet footer.
> 
> Outside of the "_metadata" file index use-case I don't think this is used
> or implemented in any reader (effectively a poor man's table format).
> 
> > With the rise of table formats, it seems like a reasonable design choice to
> > push the complexity of referencing columns across files to the table level and
> > keep parquet focused on single-file storage (encodings, indexing, etc.).
> 
> > Implementing this at the file level can also be challenging, since a reader
> > would need to know all the credentials required to read the different objects
> > on object storage.
> 
> Thoughts/Objections?
> 
> Thanks,
> Micah
> 
> 
> [1]
> https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
> 
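
For reference, the field under discussion sits in the ColumnChunk struct.
Paraphrased from the linked parquet.thrift (comments abbreviated, trailing
fields elided; see the linked source for the authoritative definition):

```thrift
struct ColumnChunk {
  /** File where column data is stored. If not set, assumed to be the same
    * file as the metadata. This path is relative to the current file. */
  1: optional string file_path

  /** Byte offset in file_path to the ColumnMetaData */
  2: required i64 file_offset

  ...
}
```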





Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-03 Thread Julien Le Dem
That sounds good to me.
The field will still have to remain for the _metadata file, but we can make
it explicit that it is never populated in the footer of an ordinary data file.
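
The reader-side consequence of that guidance can be sketched as follows (a
hypothetical helper over a footer decoded into plain dicts, not a real
Parquet library API): an unset file_path resolves to the current file, and a
populated one is only meaningful inside a "_metadata" index:

```python
# Hypothetical reader policy for file_path under the proposed guidance;
# footers are assumed decoded into plain dicts, not a real Parquet API.

def resolve_chunk_source(current_file, chunk, is_metadata_index=False):
    """Return the file that holds this column chunk's data."""
    path = chunk.get("file_path")
    if path is None:
        # The common case: column data lives in the same file as the footer.
        return current_file
    if is_metadata_index:
        # A "_metadata" index legitimately points at sibling data files.
        return path
    # Per the proposed guidance, an ordinary data file never sets this.
    raise ValueError(f"unexpected file_path {path!r} in a data file footer")

print(resolve_chunk_source("part-0.parquet", {"file_offset": 4}))
# prints part-0.parquet
```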

On Tue, Dec 2, 2025 at 3:36 AM Andrew Lamb  wrote:

> I for one think it is a good idea; thank you for bringing it up, Micah.
>
> I think the best rationale for deprecation, as you mention, is to bring
> the spec into alignment with actual implementation practice. The more
> unused features exist in Parquet, the harder it is to implement new readers
> or determine compatibility.
>
> Andrew


Re: [DISCUSS] Deprecate file_path field in column chunk

2025-12-02 Thread Andrew Lamb
I for one think it is a good idea; thank you for bringing it up, Micah.

I think the best rationale for deprecation, as you mention, is to bring
the spec into alignment with actual implementation practice. The more
unused features exist in Parquet, the harder it is to implement new readers
or determine compatibility.

Andrew
