Hi Team, I am trying to create a PyArrow table from Parquet data files (about 1K files, roughly 4.2B rows with 9 columns) but am facing challenges. I am seeking some help and guidance to resolve this.
So far, I have tried using an Arrow dataset with filters, and a generator-based approach within Arrow Flight; a rough sketch of what I am doing is included below. I noticed that even with use_threads=True, the Arrow API does not use all of the cores available on the system. One way to load all of the data in parallel would be to split the Parquet files and process them on multiple servers, but that would be a manual process. I really appreciate any help you can provide on handling datasets of this size.
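For reference, here is a rough sketch of what I am doing. The dataset path, the filter column and value, and the Flight port are placeholders rather than my actual values:

import pyarrow.dataset as ds
import pyarrow.flight as flight

# Dataset over the ~1K Parquet files; the path is a placeholder
dataset = ds.dataset("/data/parquet_files", format="parquet")

# Approach 1: read the filtered data straight into one in-memory table
# ("some_column" and its filter value are placeholders)
table = dataset.to_table(
    filter=ds.field("some_column") > 0,
    use_threads=True,
)

# Approach 2: stream record batches from a generator inside an Arrow Flight server
class ParquetFlightServer(flight.FlightServerBase):
    def do_get(self, context, ticket):
        scanner = dataset.scanner(
            filter=ds.field("some_column") > 0,
            use_threads=True,
        )
        # to_batches() yields batches lazily instead of materializing the full table
        return flight.GeneratorStream(dataset.schema, scanner.to_batches())

server = ParquetFlightServer("grpc://0.0.0.0:8815")  # port is a placeholder
server.serve()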

Thank you,
Muru