Thanks Chris. The use case is to build an in-memory datastore. After the
data is loaded, clients will query it and get results with sub-second
latency via an API. The workload is mostly select queries and high-level
aggregations. Yes, not the entire dataset, just the last 90 days of data,
but ideally it would be nice to have the entire dataset available.
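
For the aggregation side, I think I follow your point about feeding the
smaller datasets through one at a time. Something like the rough sketch
below is what I have in mind (the path and the "amount" column name are
just placeholders), does that match what you meant?

    import pyarrow.dataset as ds
    import pyarrow.compute as pc

    # Treat the directory of Parquet files as one logical dataset and
    # aggregate batch by batch, so only one chunk is in memory at a time.
    # "/data/parquet" and the "amount" column are placeholder names.
    dataset = ds.dataset("/data/parquet", format="parquet")

    total = 0
    rows = 0
    for batch in dataset.to_batches(columns=["amount"], use_threads=True):
        # Only one column was selected, so it is at index 0.
        total += pc.sum(batch.column(0)).as_py() or 0
        rows += batch.num_rows

    print(rows, total)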

Sure, I will try your suggestion. Could you please share or point me to an
mmap reference sample?
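
In the meantime, here is roughly what I understood your suggestion to be:
convert each Parquet file to the Arrow IPC format on disk and then
memory-map it. This is only a minimal sketch with placeholder file names,
please correct me if I have it wrong:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder paths; in practice this would loop over the ~1K files.
    table = pq.read_table("part-0001.parquet")

    # Write the table to the Arrow IPC file format so it can be memory-mapped.
    with pa.OSFile("part-0001.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map the IPC file; the OS pages data in on demand instead of
    # copying the whole table onto the heap.
    source = pa.memory_map("part-0001.arrow", "r")
    mapped_table = pa.ipc.open_file(source).read_all()
    print(mapped_table.num_rows)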

On Wed, Aug 4, 2021 at 10:12 AM Chris Nuernberger <[email protected]>
wrote:

> Murugan,
>
> Could you talk a bit more about what you intend to do with the dataset
> once loaded?
>
> A large dataset is often best represented as a sequence of smaller
> datasets, which sounds like how yours is currently stored, if I hear you
> correctly.  If you are doing some large aggregation or something, then you
> can feed the datasets one by one into your aggregation without needing to
> load all of them simultaneously.
>
> Are you trying to do some random-access pattern across the entire
> dataset?
>
> One option is to convert each existing parquet file into an arrow table
> and then mmap the resulting tables all at once if you need to simulate
> having the entire system 'in memory'.
>
> On Wed, Aug 4, 2021 at 9:55 AM Murugan Muthusamy <[email protected]>
> wrote:
>
>> Hi Team,
>>
>> I am trying to create a PyArrow table from Parquet data files (1K files,
>> ~4.2B rows, 9 columns) but am facing challenges. I am seeking some help
>> and guidance to resolve them.
>>
>> So far, I have tried using the Arrow dataset API with filters and a
>> generator approach within Arrow Flight. I noticed that even with
>> use_threads=True, the Arrow API does not use all the cores available on
>> the system.
>>
>> I think one way to load all the data in parallel is to split the Parquet
>> files and process them on multiple servers, but that would be a manual
>> process.
>>
>> I really appreciate any help you can provide on handling large datasets.
>>
>> Thank you,
>> Muru
>>
>
