Eric Kisslinger created ARROW-7059:
--------------------------------------

             Summary: Reading parquet file with many columns is still slow for 0.15.1
                 Key: ARROW-7059
                 URL: https://issues.apache.org/jira/browse/ARROW-7059
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
         Environment: Linux OS with RHEL 7.7 distribution
blkcqas037:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
            Reporter: Eric Kisslinger


Reading Parquet files with a large number of columns still seems to be very slow in 0.15.1 compared to 0.14.1. I am using the same test used in https://issues.apache.org/jira/browse/ARROW-6876, except that I set {{use_threads=False}} to make an apples-to-apples comparison with respect to the number of CPUs.

{{import numpy as np}}
{{import pyarrow as pa}}
{{import pyarrow.parquet as pq}}

{{table = pa.table(\{'c' + str(i): np.random.randn(10) for i in range(10000)})}}
{{pq.write_table(table, "test_wide.parquet")}}
{{res = pq.read_table("test_wide.parquet")}}
{{print(pa.__version__)}}

With {{use_threads=False}}:

{{%time res = pq.read_table("test_wide.parquet", use_threads=False)}}

*In 0.14.1 with use_threads=False:*

{{0.14.1}}
{{CPU times: user 515 ms, sys: 9.3 ms, total: 524 ms}}
{{Wall time: 525 ms}}

*In 0.15.1 with use_threads=False:*

{{0.15.1}}
{{CPU times: user 9.89 s, sys: 37.8 ms, total: 9.93 s}}
{{Wall time: 9.93 s}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
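
The timings in the report rely on IPython's {{%time}} magic, which is not available in a plain Python interpreter. The following is a minimal sketch of the same measurement as a standalone script, assuming numpy and pyarrow are installed. The helper names {{time_call}} and {{benchmark_wide_read}} are hypothetical and not part of the report or of pyarrow. The CPU figure comes from {{time.process_time}}, so it is user plus system time for the whole process and may not match the {{%time}} breakdown exactly.

```python
import time


def time_call(fn, *args, **kwargs):
    """Return (result, wall_seconds, cpu_seconds) for a single call to fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def benchmark_wide_read(path="test_wide.parquet", n_cols=10000, n_rows=10):
    """Write a wide table and time a single-threaded read (hypothetical helper)."""
    # Imports are local so time_call stays usable without pyarrow installed.
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Same table shape as in the report: many columns, few rows.
    table = pa.table(
        {"c" + str(i): np.random.randn(n_rows) for i in range(n_cols)}
    )
    pq.write_table(table, path)
    pq.read_table(path)  # untimed first read, as in the report

    _, wall, cpu = time_call(pq.read_table, path, use_threads=False)
    print(pa.__version__)
    print(f"wall: {wall:.3f} s, cpu: {cpu:.3f} s")
    return wall, cpu
```

Calling {{benchmark_wide_read()}} once under each pyarrow version should give numbers comparable to those quoted in the report, though absolute values will depend on the machine.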