> So far, I tried using Arrow dataset with filters and the generator
> approach within Arrow Flight. I noticed that even with use_threads = True,
> the Arrow API does not use all the cores available in the system.
It is possible you are I/O bound. What device are you reading from? Do you
know if the data is already cached in the kernel page cache or not? If you
are reading Parquet from an HDD, for example, I wouldn't expect you to be
able to utilize more than a few cores.

> Sure, I will try your suggestion. Please, can you share or point me to an
> mmap reference sample?

There is a little bit of documentation (and a sample) that was recently
added here (but it won't be on the site until 6.0.0 releases):
https://github.com/apache/arrow/blob/master/docs/source/python/ipc.rst#efficiently-writing-and-reading-arrow-data

> The use case is to build an in-memory datastore

Can you give a rough idea of what kind of performance you are getting and
what kind of performance you would expect or need? For an in-memory
datastore I wouldn't expect Parquet I/O to be on the critical path.

On Wed, Aug 4, 2021 at 10:28 AM Murugan Muthusamy <[email protected]> wrote:
>
> Thanks Chris. The use case is to build an in-memory datastore. After the
> data gets loaded, the clients will query and get the results in
> sub-seconds via an API, mostly select queries and high-level aggregations.
> Yes, not the entire dataset, just the last 90 days of data, though ideally
> it would be nice to have the entire dataset.
>
> Sure, I will try your suggestion. Please, can you share or point me to an
> mmap reference sample?
>
> On Wed, Aug 4, 2021 at 10:12 AM Chris Nuernberger <[email protected]>
> wrote:
>>
>> Murugan,
>>
>> Could you talk a bit more about what you intend to do with the dataset
>> once loaded?
>>
>> A large dataset is often best represented by a sequence of smaller
>> datasets, which sounds like how it is currently stored if I hear you
>> correctly. If you are doing some large aggregation or something, then
>> you can feed the datasets one by one into your aggregation without
>> needing to load all of them simultaneously.
>>
>> Are you trying to do some random-access pathway across the entire
>> dataset?
>>
>> One option is to convert each existing Parquet file into an Arrow table
>> and then mmap the resulting tables all at once if you need to simulate
>> having the entire system 'in memory'.
>>
>> On Wed, Aug 4, 2021 at 9:55 AM Murugan Muthusamy <[email protected]> wrote:
>>>
>>> Hi Team,
>>>
>>> I am trying to create a PyArrow table from Parquet data files (1K files,
>>> ~4.2B rows, 9 columns) but am facing challenges. I am seeking some help
>>> and guidance to resolve this.
>>>
>>> So far, I tried using Arrow dataset with filters and the generator
>>> approach within Arrow Flight. I noticed that even with use_threads =
>>> True, the Arrow API does not use all the cores available in the system.
>>>
>>> I think one way to load all the data in parallel is to split the Parquet
>>> files and run them on multiple servers, but that would be manual.
>>>
>>> I really appreciate any help you can provide on handling the large
>>> dataset.
>>>
>>> Thank you,
>>> Muru
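Regarding the mmap reference sample asked for above, here is a minimal
sketch of the convert-then-mmap approach described in the thread, following
the pattern in the linked IPC documentation. The directory paths, the
one-IPC-file-per-Parquet-file layout, and the file naming are illustrative
assumptions, not details from the thread.

import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder locations; adjust to the real dataset layout.
src_dir = pathlib.Path("/data/parquet")
dst_dir = pathlib.Path("/data/arrow")
dst_dir.mkdir(parents=True, exist_ok=True)

# Step 1: convert each Parquet file into an Arrow IPC file on disk.
for parquet_path in sorted(src_dir.glob("*.parquet")):
    table = pq.read_table(parquet_path)
    arrow_path = dst_dir / (parquet_path.stem + ".arrow")
    with pa.OSFile(str(arrow_path), "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

# Step 2: memory-map the IPC files and read them back; the column buffers
# stay backed by the mapped files instead of being copied onto the heap.
tables = []
for arrow_path in sorted(dst_dir.glob("*.arrow")):
    source = pa.memory_map(str(arrow_path), "r")
    tables.append(pa.ipc.open_file(source).read_all())

combined = pa.concat_tables(tables)
print(combined.num_rows, "rows;", pa.total_allocated_bytes(), "bytes allocated")

Because the buffers behind the combined table come straight from the
memory-mapped files, pa.total_allocated_bytes() should stay small even for
a dataset much larger than RAM, which is the behaviour the linked
documentation demonstrates.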

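For context on the use_threads observation at the top of the thread, a
rough sketch of the kind of filtered dataset scan described there; the
dataset path and the filter column name (event_date) are hypothetical
stand-ins.

import pyarrow.dataset as ds

# Scan the Parquet files as one logical dataset. use_threads=True lets
# Arrow parallelize decoding across files and row groups, but overall
# throughput can still be capped by the storage device rather than the CPU.
dataset = ds.dataset("/data/parquet", format="parquet")  # placeholder path
table = dataset.to_table(
    filter=ds.field("event_date") >= "2021-05-01",  # hypothetical column
    use_threads=True,
)
print(table.num_rows, "rows loaded")

If a scan like this stays far below full CPU utilization on fast storage,
comparing its wall-clock time against a plain sequential read of the same
files is a quick way to tell whether the bottleneck is I/O or decoding.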