Hi, are there any updates regarding the Parquet FileMetaData Python bindings? That is, is it possible to remove/add row groups from/to a _metadata file? Or is it possible to create a FileMetaData object from a list of RowGroupMetaData objects?
Why would this be useful for us: we are using a _metadata file for all of our Parquet datasets stored in S3 (opened with pyarrow.dataset.parquet_dataset("path/to/_metadata", ...)). When we apply filters to such a dataset, read performance improves and the number of read operations (and therefore our costs) decreases a lot compared to a dataset without that _metadata file (pyarrow.dataset.dataset("path/to/dataset", ...)). Unfortunately, we currently have to recreate the _metadata file for our datasets whenever a file is deleted or updated, which happens quite often.

Thanks and regards,
Volker
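A minimal sketch of the setup described above, using only existing pyarrow APIs; the bucket path, region, and the column name "foo" are placeholders rather than details from this thread:

    import pyarrow.dataset as ds
    from pyarrow import fs

    # Hypothetical S3 filesystem and dataset layout
    s3 = fs.S3FileSystem(region="eu-central-1")

    # Build the dataset from the _metadata file instead of listing and
    # opening every data file in the bucket
    dataset = ds.parquet_dataset("my-bucket/path/to/_metadata", filesystem=s3)

    # The filter is checked against the row-group statistics stored in
    # _metadata, so non-matching files/row groups can be skipped without
    # extra S3 read operations
    table = dataset.to_table(filter=ds.field("foo") == 1)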
On 2021/02/08 14:11:13 Josh Mayer wrote:

> Hi Joris,
>
> The subset method on row groups would work fine for me. I'd be happy to help expose this in Python if needed.
>
> In regards to the dataset partitioning, that route would also work (and is separately useful), assuming I can attach manual partitioning information to a dataset created from a metadata file. I would like to pass something like the partitions argument to ds.FileSystemDataset.from_paths (https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset) for each row group (or file path) in the metadata file, e.g.
>
> dataset = ds.parquet_dataset(metadata_file, partitions=[ds.field("foo") == 1, ds.field("foo") == 2, ...])
>
> Thanks for the help,
>
> Josh
>
> On Mon, Feb 8, 2021 at 6:56 AM Joris Van den Bossche <[email protected]> wrote:
>
> > Hi Josh,
> >
> > As far as I know, the Python bindings for Parquet FileMetaData (and its constituent parts) don't expose any methods to construct those objects (apart from reading them from a file). For example, creating a FileMetaData object from a list of RowGroupMetaData objects is not possible.
> >
> > So I don't think what you describe is currently possible (apart from rereading the metadata from the files you want and appending them, as done in the docs you linked to).
> >
> > Note that if you use pyarrow to read the dataset using the metadata file, filtering on the file path can be equivalent to filtering on one of the partition columns (depending on what subset you want to take). And letting the dataset API do this filtering can be quite efficient (it will filter the file paths on read), so it might not be necessary to do this in advance.
> >
> > In the C++ layer, there is a "FileMetaData::Subset" method added recently (for the purposes of the datasets API) which can create a new FileMetaData object with a subset of the row groups, selected by row group index (position in the vector of row groups). But this is a) not exposed in Python (though it could be) and b) doesn't directly allow filtering on file path.
> >
> > Joris
> >
> > On Sat, 6 Feb 2021 at 16:58, Josh Mayer <[email protected]> wrote:
> >
> >> After writing a _metadata file as done here https://arrow.apache.org/docs/python/parquet.html?highlight=write_metadata#writing-metadata-and-common-medata-files, I'm wondering if it is possible to read that _metadata file (e.g. using pyarrow.parquet.read_metadata), filter out some paths, and write it back to disk. I can see that the file path info is available, e.g.
> >>
> >> meta = pq.read_metadata(...)
> >> meta.row_group(0).column(0).file_path
> >>
> >> But I cannot figure out how to filter or create a FileMetaData object (since that is what the metadata_collector param of pyarrow.parquet.write_metadata expects) from either a set of RowGroupMetaData or ColumnChunkMetaData objects. Is this possible? I'm trying to avoid needing to reread the FileMetaData from each file in the dataset directly.

Volker Lorrmann
Am alten Sportplaz 6
97261 Güntersleben
+49 178 612 67 91
[email protected]
[email protected]
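As an illustration of the workaround mentioned in the thread (rereading the footer of each file you want to keep and collecting them into a fresh _metadata file, as in the documentation example linked above), a rough sketch; the dataset root and file names are made up, and for S3 the reads would need to go through a filesystem object:

    import pyarrow.parquet as pq

    root = "path/to/dataset"                     # hypothetical dataset root
    keep = ["part-0.parquet", "part-2.parquet"]  # files that should stay referenced by _metadata

    collector = []
    for rel_path in keep:
        md = pq.read_metadata(f"{root}/{rel_path}")  # reread the footer of each kept file
        md.set_file_path(rel_path)                   # record the path relative to the dataset root
        collector.append(md)

    # write_metadata merges the collected footers' row groups into a single _metadata file
    schema = pq.read_schema(f"{root}/{keep[0]}")
    pq.write_metadata(schema, f"{root}/_metadata", metadata_collector=collector)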

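And a sketch of the manual-partitioning route discussed in the thread, following the FileSystemDataset.from_paths example from the dataset documentation linked above; the schema, file paths, and "foo" values are illustrative only, and note that the partitions keyword for ds.parquet_dataset itself was only a proposal in this thread, not an existing API:

    import pyarrow as pa
    import pyarrow.dataset as ds
    from pyarrow import fs

    # Hypothetical schema and file list; in practice these could be taken
    # from the row groups listed in the _metadata file
    schema = pa.schema([("foo", pa.int64()), ("value", pa.float64())])
    paths = ["path/to/dataset/part-0.parquet", "path/to/dataset/part-1.parquet"]

    dataset = ds.FileSystemDataset.from_paths(
        paths,
        schema=schema,
        format=ds.ParquetFileFormat(),
        filesystem=fs.LocalFileSystem(),
        # one partition expression per file, attached manually
        partitions=[ds.field("foo") == 1, ds.field("foo") == 2],
    )

    # Files whose partition expression contradicts the filter are pruned before reading
    table = dataset.to_table(filter=ds.field("foo") == 1)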