rjzamora commented on pull request #7523: URL: https://github.com/apache/arrow/pull/7523#issuecomment-648269136
Thanks for working on this @jorisvandenbossche ! This does seem like the functionality needed by Dask. To test my understanding (and for the sake of discussion), I am imagining something (roughly) like the following in Dask to collect row-group statistics (note that I am using pyarrow-0.17.1 from conda, so the `get_row_group_fragments` call would be replaced):

```python
from collections import defaultdict
import json

import pandas as pd
import pyarrow.dataset as pds
from string import ascii_lowercase as letters

path = "simple.pdf"
df0 = pd.DataFrame(
    {"x": range(26), "myindex": list(letters)}
).set_index("myindex")
df0.to_parquet(path, engine="pyarrow", row_group_size=10)
ds = pds.dataset(path)

# index_cols needs to be specified by the user or encoded in
# the "pandas" metadata. Otherwise, we will not bother to
# infer an index column (and won't need statistics).
index_cols = json.loads(
    ds.schema.metadata[b"pandas"].decode("utf8")
)["index_columns"]

filter = None  # Some user-defined filter

# Collect the path and statistics for each row group
metadata = defaultdict(list)
for file_frag in ds.get_fragments(filter=filter):
    for rg_frag in file_frag.get_row_group_fragments():
        for rg in rg_frag.row_groups:
            stats = ds.get_min_max_statistics(rg.statistics)
            metadata[rg_frag.path].append((<rg-index>, stats))
```

In this case, the resulting `metadata` object would be something like:

```
defaultdict(list,
            {'simple.pdf': [(0, <stats-for-rg-0>),
                            (1, <stats-for-rg-1>),
                            (2, <stats-for-rg-2>)]})
```