> So far, I tried using Arrow dataset with filters and the generator
> approach within Arrow Flight. I noticed that even with use_threads = True,
> the Arrow API does not use all the cores available in the system.
It is possible you are I/O bound. What device are you reading from? Do you
know if the data is already cached in the kernel page cache or not? If you
are reading Parquet from an HDD, for example, I wouldn't expect you to be
able to utilize more than a few cores.

> Sure, I will try your suggestion. Please, can you share or point me to an
> mmap reference sample?

There is a little bit of documentation (and a sample) that was recently
added here (but it won't be on the site until 6.0.0 releases):
https://github.com/apache/arrow/blob/master/docs/source/python/ipc.rst#efficiently-writing-and-reading-arrow-data

> The use case is to build an in-memory datastore

Can you give a rough idea of what kind of performance you are getting and
what kind of performance you would expect or need? For an in-memory
datastore I wouldn't expect Parquet I/O to be on the critical path.

On Wed, Aug 4, 2021 at 10:28 AM Murugan Muthusamy <[email protected]> wrote:
>
> Thanks Chris. The use case is to build an in-memory datastore. After the
> data gets loaded, the clients will query and get the results in
> sub-seconds via an API, mostly select queries and high-level aggregations.
> Yes, not the entire dataset, just the last 90 days of data, though ideally
> it would be nice to have the entire dataset.
>
> Sure, I will try your suggestion. Please, can you share or point me to an
> mmap reference sample?
>
> On Wed, Aug 4, 2021 at 10:12 AM Chris Nuernberger <[email protected]>
> wrote:
>>
>> Murugan,
>>
>> Could you talk a bit more about what you intend to do with the dataset
>> once loaded?
>>
>> A large dataset is often best represented by a sequence of smaller
>> datasets, which sounds like how it is currently stored if I hear you
>> correctly. If you are doing some large aggregation or something, then
>> you can feed the datasets one by one into your aggregation without
>> needing to load all of them simultaneously.
>>
>> Are you trying to do some random-access pathway across the entire
>> dataset?
>>
>> One option is to convert each existing Parquet file into an Arrow table
>> and then mmap the resulting tables all at once if you need to simulate
>> having the entire system 'in memory'.
>>
>> On Wed, Aug 4, 2021 at 9:55 AM Murugan Muthusamy <[email protected]> wrote:
>>>
>>> Hi Team,
>>>
>>> I am trying to create a PyArrow table from Parquet data files (1K files,
>>> ~4.2B rows, 9 columns) but am facing challenges. I am seeking some help
>>> and guidance to resolve this.
>>>
>>> So far, I tried using Arrow dataset with filters and the generator
>>> approach within Arrow Flight. I noticed that even with use_threads =
>>> True, the Arrow API does not use all the cores available in the system.
>>>
>>> I think one way to load all the data in parallel is to split the Parquet
>>> files and run them on multiple servers, but that would be manual.
>>>
>>> I really appreciate any help you can provide on handling the large
>>> dataset.
>>>
>>> Thank you,
>>> Muru
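Regarding the mmap reference sample asked for above, here is a minimal
sketch of the convert-then-mmap approach described in the thread, following
the pattern in the linked IPC documentation. The directory paths, the
one-IPC-file-per-Parquet-file layout, and the file naming are illustrative
assumptions, not details from the thread.

import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder locations; adjust to the real dataset layout.
src_dir = pathlib.Path("/data/parquet")
dst_dir = pathlib.Path("/data/arrow")
dst_dir.mkdir(parents=True, exist_ok=True)

# Step 1: convert each Parquet file into an Arrow IPC file on disk.
for parquet_path in sorted(src_dir.glob("*.parquet")):
    table = pq.read_table(parquet_path)
    arrow_path = dst_dir / (parquet_path.stem + ".arrow")
    with pa.OSFile(str(arrow_path), "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

# Step 2: memory-map the IPC files and read them back; the column buffers
# stay backed by the mapped files instead of being copied onto the heap.
tables = []
for arrow_path in sorted(dst_dir.glob("*.arrow")):
    source = pa.memory_map(str(arrow_path), "r")
    tables.append(pa.ipc.open_file(source).read_all())

combined = pa.concat_tables(tables)
print(combined.num_rows, "rows;", pa.total_allocated_bytes(), "bytes allocated")

Because the buffers behind the combined table come straight from the
memory-mapped files, pa.total_allocated_bytes() should stay small even for
a dataset much larger than RAM, which is the behaviour the linked
documentation demonstrates.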

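For context on the use_threads observation at the top of the thread, a
rough sketch of the kind of filtered dataset scan described there; the
dataset path and the filter column name (event_date) are hypothetical
stand-ins.

import pyarrow.dataset as ds

# Scan the Parquet files as one logical dataset. use_threads=True lets
# Arrow parallelize decoding across files and row groups, but overall
# throughput can still be capped by the storage device rather than the CPU.
dataset = ds.dataset("/data/parquet", format="parquet")  # placeholder path
table = dataset.to_table(
    filter=ds.field("event_date") >= "2021-05-01",  # hypothetical column
    use_threads=True,
)
print(table.num_rows, "rows loaded")

If a scan like this stays far below full CPU utilization on fast storage,
comparing its wall-clock time against a plain sequential read of the same
files is a quick way to tell whether the bottleneck is I/O or decoding.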