The attachments didn't come through properly. I've got additional questions.
1. What filesystem are these files stored on? I've seen issues using S3 if HEAD operations aren't prioritized. I'm assuming that without HEAD operations you can't effectively scan a Parquet file's footer, and reading the entire file instead isn't efficient. For example, one S3-compatible object store documents the following consistency level:

   "available" (eventual consistency for HEAD operations): Behaves the same as the "read-after-new-write" consistency level, but only provides eventual consistency for HEAD operations. Offers higher availability for HEAD operations than "read-after-new-write" if Storage Nodes are unavailable. Differs from AWS S3 consistency guarantees for HEAD operations only.

2. Are you using pyarrow.parquet.ParquetDataset or pyarrow.dataset? From https://arrow.apache.org/docs/python/parquet.html:

   Note: The ParquetDataset is being reimplemented based on the new generic Dataset API (see the Tabular Datasets docs, https://arrow.apache.org/docs/python/dataset.html#dataset, for an overview). This is not yet the default, but can already be enabled by passing the use_legacy_dataset=False keyword to ParquetDataset or read_table():

       pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)

   Enabling this gives the following new features:

   * Filtering on all columns (using row group statistics) instead of only on the partition keys.
   * More fine-grained partitioning: support for a directory partitioning scheme in addition to the Hive-like partitioning (e.g. "/2019/11/15/" instead of "/year=2019/month=11/day=15/"), and the ability to specify a schema for the partition keys.
   * General performance improvements and bug fixes.

   It also has the following changes in behaviour:

   * The partition keys need to be explicitly included in the columns keyword when you want to include them in the result while reading a subset of the columns.

   This new implementation is already enabled in read_table, and in the future, this will be turned on by default for ParquetDataset. The new implementation does not yet cover all existing ParquetDataset features (e.g. specifying the metadata, or the pieces property API). Feedback is very welcome.
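If you are on the legacy ParquetDataset path, it may be worth trying the new Dataset API directly and filtering on the partition keys. A minimal sketch, assuming a Hive-style layout like yours; the dataset root is a placeholder, and the filter values are the ones from the cProfile command quoted below:

import pyarrow.dataset as ds

# Hive-style directory names (state_code=.../city_code=...) are parsed into
# partition columns when partitioning='hive' is given.
dataset = ds.dataset('/path/to/dataset_root', format='parquet', partitioning='hive')

# Filtering on the partition keys should prune non-matching directories
# before any Parquet footers are read.
table = dataset.to_table(
    filter=(ds.field('state_code') == 31) & (ds.field('city_code') == 6200)
)
df = table.to_pandas()

Keeping the dataset object around also avoids repeating the directory discovery on every query, which may matter with 5560 leaf directories.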
From: Tomaz Maia Suller <[email protected]>
Sent: Wednesday, August 3, 2022 2:54 PM
To: [email protected]
Subject: Issue filtering partitioned Parquet files on partition keys using PyArrow

Hi,

I'm trying to load a dataset I created consisting of Parquet files partitioned on two columns, but reading from a single partition takes over 10 minutes on the first try and still over 15 seconds on any subsequent one, while specifying the path to the partition directly takes 50 milliseconds. Am I doing something wrong?

The data is arranged in the following way:

$ tree -d
.
├── state_code=11
│   ├── city_code=1005
│   ├── city_code=106
│   ├── city_code=1104
│   ├── city_code=114
│   ├── city_code=1203
│   ├── city_code=122
│   ├── city_code=130
│   ├── city_code=1302
...

There are 27 state codes and 5560 city codes, so 27 "first level" partitions and 5560 "second level" partitions in total. Each partition often contains only a few kB worth of Parquet files, and none is greater than ~5 MB. These files were written using PySpark and I have full control of how they're generated, in case you think there's a better way to arrange them. I chose this partitioning since I wish to analyse one city at a time; I have also experimented with having only a single level of partitioning, with 5560 partitions, but didn't see any increase in performance.

I'm using Pandas to read the files, and have tried using PyArrow directly as well. Regardless, I've profiled the reading of a single partition using cProfile, and the results clearly show PyArrow is taking the longest to run. I've attached the results of two runs I did using IPython: one right after rebooting my computer, which took well over 500 seconds, and one executed right after that, which took about 15 seconds, with the following command:

>>> command = "pd.read_parquet('.', engine='pyarrow', filters=[('state_code', '==', 31), ('city_code', '==', 6200)])"
>>> cProfile.run(command, '/path/to/stats')

It was much better the second time around, but still terrible compared to specifying the path manually, which took around 50 milliseconds according to %timeit. I have absolutely no idea why the filesystem scan is taking so long. I have seen this issue, https://issues.apache.org/jira/browse/ARROW-11781, related to the same problem, but it mentions there should be no performance issue as of July 2021, whereas I'm having problems right now.

I think I'll stick to specifying the partitions to read "by hand", but I was really curious whether I messed something up, or if (Py)Arrow really is inefficient in a task which seems so trivial at first sight.

Thanks,
Tomaz.

P.S.: It's my first time sending an email to a mailing list, so I hope sending attachments is okay, and sorry if it isn't.
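For anyone reproducing the comparison above, a minimal sketch of the two access patterns being timed; the dataset root ('.') and the exact partition path are assumptions based on the layout shown in the tree output:

import pandas as pd

# Filtered read over the dataset root: PyArrow has to discover all
# state_code=*/city_code=* directories before it can prune on the filter.
df_filtered = pd.read_parquet(
    '.',
    engine='pyarrow',
    filters=[('state_code', '==', 31), ('city_code', '==', 6200)],
)

# Direct read of one partition directory (the reportedly ~50 ms path).
# Note that state_code and city_code will typically not appear as columns
# here, since they only exist in the directory names above this root.
df_direct = pd.read_parquet('state_code=31/city_code=6200', engine='pyarrow')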
