Hi, are there any updates regarding the Parquet FileMetaData Python bindings? That is, is it possible to remove/add row groups from/to a _metadata file? Or is it possible to create a FileMetaData object from a list of RowGroupMetaData objects?
Why would this be useful for us: we are using a _metadata file for all of our Parquet datasets stored in S3 (opened with pyarrow.dataset.parquet_dataset("path/to/_metadata", ...)). When we apply filters to such a dataset, read performance improves and the number of read operations (and therefore our costs) decreases a lot compared to a dataset without that _metadata file (pyarrow.dataset.dataset("path/to/dataset", ...)). Unfortunately, we currently have to recreate the _metadata file for our datasets whenever a file is deleted or updated, which happens quite often.

Thanks and regards,
Volker
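A minimal sketch of the setup described above, using only existing pyarrow APIs; the bucket path, region, and the column name "foo" are placeholders rather than details from this thread:

    import pyarrow.dataset as ds
    from pyarrow import fs

    # Hypothetical S3 filesystem and dataset layout
    s3 = fs.S3FileSystem(region="eu-central-1")

    # Build the dataset from the _metadata file instead of listing and
    # opening every data file in the bucket
    dataset = ds.parquet_dataset("my-bucket/path/to/_metadata", filesystem=s3)

    # The filter is checked against the row-group statistics stored in
    # _metadata, so non-matching files/row groups can be skipped without
    # extra S3 read operations
    table = dataset.to_table(filter=ds.field("foo") == 1)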
On 2021/02/08 14:11:13 Josh Mayer wrote:

> Hi Joris,
>
> The subset method on row groups would work fine for me. I'd be happy to help expose this in Python if needed.
>
> In regards to the dataset partitioning, that route would also work (and is separately useful), assuming I can attach manual partitioning information to a dataset created from a metadata file. I would like to pass something like the partitions argument to ds.FileSystemDataset.from_paths (https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset) for each row group (or file path) in the metadata file, e.g.
>
> dataset = ds.parquet_dataset(metadata_file, partitions=[ds.field("foo") == 1, ds.field("foo") == 2, ...])
>
> Thanks for the help,
>
> Josh
>
> On Mon, Feb 8, 2021 at 6:56 AM Joris Van den Bossche <[email protected]> wrote:
>
> > Hi Josh,
> >
> > As far as I know, the Python bindings for Parquet FileMetaData (and its constituent parts) don't expose any methods to construct those objects (apart from reading them from a file). For example, creating a FileMetaData object from a list of RowGroupMetaData objects is not possible.
> >
> > So I don't think what you describe is currently possible (apart from rereading the metadata from the files you want and appending them, as done in the docs you linked to).
> >
> > Note that if you use pyarrow to read the dataset using the metadata file, filtering on the file path can be equivalent to filtering on one of the partition columns (depending on what subset you want to take). And letting the dataset API do this filtering can be quite efficient (it will filter the file paths on read), so it might not be necessary to do this in advance.
> >
> > In the C++ layer, there is a "FileMetaData::Subset" method added recently (for the purposes of the datasets API) which can create a new FileMetaData object with a subset of the row groups, selected by row group index (position in the vector of row groups). But this is a) not exposed in Python (though it could be) and b) doesn't directly allow filtering on file path.
> >
> > Joris
> >
> > On Sat, 6 Feb 2021 at 16:58, Josh Mayer <[email protected]> wrote:
> >
> >> After writing a _metadata file as done here https://arrow.apache.org/docs/python/parquet.html?highlight=write_metadata#writing-metadata-and-common-medata-files, I'm wondering if it is possible to read that _metadata file (e.g. using pyarrow.parquet.read_metadata), filter out some paths, and write it back to disk. I can see that the file path info is available, e.g.
> >>
> >> meta = pq.read_metadata(...)
> >> meta.row_group(0).column(0).file_path
> >>
> >> But I cannot figure out how to filter or create a FileMetaData object (since that is what the metadata_collector param of pyarrow.parquet.write_metadata expects) from either a set of RowGroupMetaData or ColumnChunkMetaData objects. Is this possible? I'm trying to avoid needing to reread the FileMetaData from each file in the dataset directly.

Volker Lorrmann
Am alten Sportplaz 6
97261 Güntersleben
+49 178 612 67 91
[email protected]
[email protected]
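As an illustration of the workaround mentioned in the thread (rereading the footer of each file you want to keep and collecting them into a fresh _metadata file, as in the documentation example linked above), a rough sketch; the dataset root and file names are made up, and for S3 the reads would need to go through a filesystem object:

    import pyarrow.parquet as pq

    root = "path/to/dataset"                     # hypothetical dataset root
    keep = ["part-0.parquet", "part-2.parquet"]  # files that should stay referenced by _metadata

    collector = []
    for rel_path in keep:
        md = pq.read_metadata(f"{root}/{rel_path}")  # reread the footer of each kept file
        md.set_file_path(rel_path)                   # record the path relative to the dataset root
        collector.append(md)

    # write_metadata merges the collected footers' row groups into a single _metadata file
    schema = pq.read_schema(f"{root}/{keep[0]}")
    pq.write_metadata(schema, f"{root}/_metadata", metadata_collector=collector)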

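And a sketch of the manual-partitioning route discussed in the thread, following the FileSystemDataset.from_paths example from the dataset documentation linked above; the schema, file paths, and "foo" values are illustrative only, and note that the partitions keyword for ds.parquet_dataset itself was only a proposal in this thread, not an existing API:

    import pyarrow as pa
    import pyarrow.dataset as ds
    from pyarrow import fs

    # Hypothetical schema and file list; in practice these could be taken
    # from the row groups listed in the _metadata file
    schema = pa.schema([("foo", pa.int64()), ("value", pa.float64())])
    paths = ["path/to/dataset/part-0.parquet", "path/to/dataset/part-1.parquet"]

    dataset = ds.FileSystemDataset.from_paths(
        paths,
        schema=schema,
        format=ds.ParquetFileFormat(),
        filesystem=fs.LocalFileSystem(),
        # one partition expression per file, attached manually
        partitions=[ds.field("foo") == 1, ds.field("foo") == 2],
    )

    # Files whose partition expression contradicts the filter are pruned before reading
    table = dataset.to_table(filter=ds.field("foo") == 1)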