Each file is 1-2MB with a single row group, around 2000 rows per file and 61
columns - 7,697,842 rows in total. Is this performance expected for this
dataset, or is there any suggestion on how to optimize it?
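For context, each worker thread reads its file roughly following the cookbook
pattern linked below. Here is a minimal sketch of that per-file read path (it
assumes a recent Arrow Java release where Scanner#scanBatches is available;
FILE_URI and the row-counting loop are placeholders for illustration, not my
actual per-batch processing):

    import org.apache.arrow.dataset.file.FileFormat;
    import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
    import org.apache.arrow.dataset.jni.NativeMemoryPool;
    import org.apache.arrow.dataset.scanner.ScanOptions;
    import org.apache.arrow.dataset.scanner.Scanner;
    import org.apache.arrow.dataset.source.Dataset;
    import org.apache.arrow.dataset.source.DatasetFactory;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.ArrowReader;

    public class SingleFileScan {

        // Placeholder: in the real job each of the 100 worker threads is
        // handed one of the 4096 file URIs.
        static final String FILE_URI = "file:///tmp/example.parquet";

        // Scans one parquet file batch by batch and returns its row count.
        static long scanOneFile(String uri) throws Exception {
            ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
            long rows = 0;
            try (BufferAllocator allocator = new RootAllocator();
                 DatasetFactory factory = new FileSystemDatasetFactory(
                         allocator, NativeMemoryPool.getDefault(),
                         FileFormat.PARQUET, uri);
                 Dataset dataset = factory.finish();
                 Scanner scanner = dataset.newScan(options);
                 ArrowReader reader = scanner.scanBatches()) {
                // The reader owns the VectorSchemaRoot; each loadNextBatch()
                // call refills it with the next batch of rows.
                while (reader.loadNextBatch()) {
                    VectorSchemaRoot root = reader.getVectorSchemaRoot();
                    rows += root.getRowCount();
                }
            }
            return rows;
        }

        public static void main(String[] args) throws Exception {
            System.out.println("rows read: " + scanOneFile(FILE_URI));
        }
    }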

Thanks!

On Sat, Jul 1, 2023 at 10:44 PM Weston Pace <[email protected]> wrote:

> What size are the row groups in your parquet files?  How many columns and
> rows in the files?
>
> On Sat, Jul 1, 2023, 6:08 PM Paulo Motta <[email protected]> wrote:
>
>> Hi,
>>
>> I'm trying to read 4096 parquet files with a total size of 6GB using this
>> cookbook:
>> https://arrow.apache.org/cookbook/java/dataset.html#query-parquet-file
>>
>> I'm using 100 threads, each thread processing one file at a time, on a
>> 72-core machine with a 32GB heap. The files are pre-loaded in memory.
>>
>> However, it's taking about 10 minutes to process these 4096 files with a
>> total size of only 6GB, and the process seems to be CPU-bound.
>>
>> Is this expected read performance for parquet files, or am I doing
>> something wrong? Any help or tips would be appreciated.
>>
>> Thanks,
>>
>> Paulo
>>
>
