Weston, I'm interested in following up.

________________________________
From: Weston Pace <[email protected]>
Sent: Thursday, August 4, 2022 12:15
To: [email protected]
Subject: Re: Issue filtering partitioned Parquet files on partition keys using PyArrow


There is a lot of room for improvement here.  In the datasets API the call that 
you have described (read_parquet) is broken into two steps:

 * dataset discovery

During dataset discovery we don't use any partition filter.  The goal is to 
create the "total dataset" of all the files.  So in your case this means 
listing out all 150,120 directories.  For every file we discover we capture a 
partition expression for this file.  This is probably where the bulk of time is 
being spent (listing the directories).

 * dataset read

During the dataset read we apply the partition filter.  So we are going to 
iterate through all ~150k files and compare the filter expression with the 
previously captured partition expression, eliminating files that don't match.  
In this phase we don't have any idea of the original directory structure.  So 
instead of performing 27 top-level comparisons + 5560 second-level comparisons 
we end up having to calculate all 150k comparisons.
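
As a rough sketch (not code from this thread) of how these two steps look through the pyarrow.dataset API, using the column names and filter values from the example further down:

import pyarrow.dataset as ds

# Step 1 -- dataset discovery: no filter is involved. The factory walks every
# partition directory (~150k in this layout) and records a partition
# expression for each file it finds.
dataset = ds.dataset(".", format="parquet", partitioning="hive")

# Step 2 -- dataset read: only now is the filter applied, by comparing it
# against each file's previously captured partition expression.
table = dataset.to_table(
    filter=(ds.field("state_code") == 31) & (ds.field("city_code") == 6200)
)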

Both of these steps are considerably longer than they need to be.  If I had to guess, I would say a majority of the time is spent in that first step, but I don't think the time spent in the second step is negligible.

One fairly straightforward solution would be to allow the partition filter to 
be used during dataset discovery.  This would yield a much smaller dataset so 
step 2 would be much faster but it could also allow the discovery process to 
skip entire directories.  If anyone is interested in working on a fix for this 
I'd be happy to point them at the files that will need to be changed and go 
into a more detailed discussion of potential solutions.
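
Until then, a minimal workaround sketch, assuming the filter values are known up front (illustrative, not from the original message): prune by hand and point the dataset factory at just the matching partition directory, so discovery never lists the other directories.

import pyarrow.dataset as ds

# Discovery is limited to the one directory that matches the filter, instead
# of listing all ~150k partition directories.
subset = ds.dataset("./state_code=31/city_code=6200", format="parquet")
table = subset.to_table()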


On Thu, Aug 4, 2022 at 5:53 AM Tomaz Maia Suller <[email protected]> wrote:
Hi David,

I wonder if the problem with the attachments has to do with the files not 
having extensions... I'm trying to send them with .prof this time.

Anyway:

  1.  I'm writing to a local filesystem; I've mounted an NTFS partition which is on an HDD. Since the dataset is only ~1.5 GB, I'll try to move it to the SSD I have available and see if I get lower access times.
  2.  I'm trying to use ParquetDataset, though not directly most of the time, i.e. I'm using Pandas, which itself uses (if I understood correctly) ParquetDataset.

I've tried accessing the dataset with both the legacy and new versions of the API, via the use_legacy_dataset parameter. The legacy API is significantly faster, with an access time of about 1 second, though still ridiculously slow compared to accessing the path straight away.

If the attachments still don't work for some reason, I'll write up what I ran:

>>> pq_command_new = "pq.ParquetDataset('.', filters=[('state_code', '==', 31), ('city_code', '==', 6200)], use_legacy_dataset=False)"
>>> pq_command_old = "pq.ParquetDataset('.', filters=[('state_code', '==', 31), ('city_code', '==', 6200)], use_legacy_dataset=True)"
>>> pq_baseline = "pq.ParquetDataset('./state_code=31/city_code=6200')"

>>> cProfile.run(pq_command_new, '/tmp/pq_legacy_false.prof')
This took about 17 seconds.

>>> cProfile.run(pq_command_old, '/tmp/pq_legacy_true.prof')
This took about 1 second.

>>> cProfile.run(pq_baseline, '/tmp/pq_legacy_true.prof')
This took 0.0075 seconds.

These runs all came after the first run following a reboot of the computer, which took over 500 seconds, as I've said.

I'm starting to think I should send this to the development mailing list rather 
than the user one, since the obvious solution is specifying the paths directly 
rather than trying to use the API.
________________________________
From: Lee, David <[email protected]>
Sent: Wednesday, August 3, 2022 19:49
To: [email protected]
Subject: RE: Issue filtering partitioned Parquet files on partition keys using PyArrow



The attachments didn't come through properly.



I’ve got additional questions.



  1.  What filesystem are these files stored on? I've seen issues using S3 if HEAD operations aren't prioritized. I'm assuming that without HEAD operations you can't effectively scan a Parquet file's footer, and reading the entire file isn't efficient.



"Available" (eventual consistency for HEAD operations): behaves the same as the "read-after-new-write" consistency level, but only provides eventual consistency for HEAD operations. Offers higher availability for HEAD operations than "read-after-new-write" if Storage Nodes are unavailable. Differs from AWS S3 consistency guarantees for HEAD operations only.



  2.  Are you using pyarrow.parquet.ParquetDataset or pyarrow.dataset?

https://arrow.apache.org/docs/python/parquet.html

Note

The ParquetDataset is being reimplemented based on the new generic Dataset API 
(see the Tabular 
Datasets<https://arrow.apache.org/docs/python/dataset.html#dataset> docs for an 
overview). This is not yet the default, but can already be enabled by passing 
the use_legacy_dataset=False keyword to ParquetDataset or read_table():

pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)

Enabling this gives the following new features:

  *   Filtering on all columns (using row group statistics) instead of only on 
the partition keys.
  *   More fine-grained partitioning: support for a directory partitioning 
scheme in addition to the Hive-like partitioning (e.g. “/2019/11/15/” instead 
of “/year=2019/month=11/day=15/”), and the ability to specify a schema for the 
partition keys.
  *   General performance improvement and bug fixes.

It also has the following changes in behaviour:

  *   The partition keys need to be explicitly included in the columns keyword 
when you want to include them in the result while reading a subset of the 
columns

This new implementation is already enabled in read_table, and in the future, 
this will be turned on by default for ParquetDataset. The new implementation 
does not yet cover all existing ParquetDataset features (e.g. specifying the 
metadata, or the pieces property API). Feedback is very welcome.
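
A short sketch of the behaviour change listed above, assuming a hypothetical value column named 'price' (only the partition key names come from this thread): with use_legacy_dataset=False, partition keys appear in the result only if they are listed in the columns keyword.

import pyarrow.parquet as pq

# 'price' is a hypothetical value column used only for this sketch; the
# partition keys must be listed explicitly to be included in the result.
table = pq.read_table(
    "dataset_name/",
    columns=["price", "state_code", "city_code"],
    filters=[("state_code", "==", 31)],
    use_legacy_dataset=False,
)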





From: Tomaz Maia Suller <[email protected]>
Sent: Wednesday, August 3, 2022 2:54 PM
To: [email protected]
Subject: Issue filtering partitioned Parquet files on partition keys using 
PyArrow




Hi,



I'm trying to load a dataset I created consisting of Parquet files partitioned on two columns, but reading from a single partition takes over 10 minutes on the first try and still over 15 seconds on any subsequent one, while specifying the path to the partition directly takes 50 milliseconds. Am I doing something wrong?



The data is arranged in the following way:

$ tree -d
.
├── state_code=11
│   ├── city_code=1005
│   ├── city_code=106
│   ├── city_code=1104
│   ├── city_code=114
│   ├── city_code=1203
│   ├── city_code=122
│   ├── city_code=130
│   ├── city_code=1302
...



There are 27 state codes and 5560 city codes, so 27 "first level" partitions and 5560 "second level" partitions in total. Each partition often contains only a few kB worth of Parquet files, and none is greater than ~5 MB. These files were written using PySpark and I have full control over how they're generated, in case you think there's a better way to arrange them. I chose this partitioning since I wish to analyse one city at a time; I have also experimented with a single-level partitioning with 5560 partitions, but didn't see any increase in performance.
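
For reference, a layout like this is typically produced in PySpark with something along these lines (the session name and input/output paths are placeholders, not taken from this message):

from pyspark.sql import SparkSession

# partitionBy produces the state_code=.../city_code=... directory structure
# shown above; paths here are placeholders.
spark = SparkSession.builder.appName("write-partitioned").getOrCreate()
df = spark.read.parquet("input/path")
df.write.partitionBy("state_code", "city_code").parquet("output/path")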



I'm using Pandas to read the files, and have tried using PyArrow directly as 
well. Regardless, I've profiled the reading of a single partition using 
cProfile, and the results clearly show PyArrow is taking the longest to run. 
I've attached the results of two runs I did using IPython: one right after 
rebooting my computer, which took well over 500 seconds; and one executed right 
after that, which took about 15 seconds, with the following command:



>>> command = "pd.read_parquet('.', engine='pyarrow', filters=[('state_code', '==', 31), ('city_code', '==', 6200)])"

>>> cProfile.run(command, '/path/to/stats')



It was much better the second time around but still terrible compared to 
specifying the path manually, which took around 50 milliseconds according to 
%timeit.



I have absolutely no idea why the filesystem scan is taking so long. I have seen issue https://issues.apache.org/jira/browse/ARROW-11781, which is related to the same problem, but it mentions there should be no performance issue as of July 2021, whereas I'm having problems right now.



I think I'll stick to specifying the partitions to read "by hand", but I was really curious about whether I messed something up, or whether (Py)Arrow really is inefficient at a task which seems so trivial at first sight.



Thanks,

Tomaz.



P.S.: It's my first time sending an email to a mailing list, so I hope sending 
attachments is okay, and sorry if it isn't.




