Each file is 1-2MB with a single row group, around 2000 rows per file and 61 columns, for a total of 7,697,842 rows. Is this performance expected for this dataset, or is there anything I could do to optimize it?
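
For reference, each worker thread follows the cookbook's scan pattern, roughly like the sketch below (the 32768 batch size and the row counting are illustrative placeholders, not the actual per-batch processing):

    import org.apache.arrow.dataset.file.FileFormat;
    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
    import org.apache.arrow.dataset.jni.NativeMemoryPool;
    import org.apache.arrow.dataset.scanner.ScanOptions;
    import org.apache.arrow.dataset.scanner.Scanner;
    import org.apache.arrow.dataset.source.Dataset;
    import org.apache.arrow.dataset.source.DatasetFactory;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowReader;

    // Read one parquet file and count its rows (batch size is illustrative).
    static long readFile(String uri) throws Exception {
        ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
        long rows = 0;
        try (BufferAllocator allocator = new RootAllocator();
             DatasetFactory factory = new FileSystemDatasetFactory(
                 allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
             Dataset dataset = factory.finish();
             Scanner scanner = dataset.newScan(options);
             ArrowReader reader = scanner.scanBatches()) {
            while (reader.loadNextBatch()) {
                VectorSchemaRoot root = reader.getVectorSchemaRoot();
                rows += root.getRowCount();  // placeholder for the real per-batch work
            }
        }
        return rows;
    }

Each of the 100 threads calls this for one file:// URI at a time.
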
Thanks!

On Sat, Jul 1, 2023 at 10:44 PM Weston Pace <[email protected]> wrote:

> What size are the row groups in your parquet files? How many columns and
> rows in the files?
>
> On Sat, Jul 1, 2023, 6:08 PM Paulo Motta <[email protected]> wrote:
>
>> Hi,
>>
>> I'm trying to read 4096 parquet files with a total size of 6GB using this
>> cookbook:
>> https://arrow.apache.org/cookbook/java/dataset.html#query-parquet-file
>>
>> I'm using 100 threads, each thread processing one file at a time on a 72
>> core machine with 32GB heap. The files are pre-loaded in memory.
>>
>> However it's taking about 10 minutes to process these 4096 files with a
>> total size of only 6GB and the process seems to be cpu-bound.
>>
>> Is this expected read performance for parquet files or am I
>> doing something wrong? Any help or tips would be appreciated.
>>
>> Thanks,
>>
>> Paulo
