> Also, how is an object that implements `Array` (a `dyn Array`) downcast to other types of Arrays? I'm doing it now using as_any and then downcast_ref to the type I want, but I have to write the type in the code and I want to find a way for it to be done automatically.
I think this is the standard way in Rust -- because Rust is statically
typed, in order to do anything with the implementations, a cast to a
concrete type is typically needed. Something to look at might be the
various compute kernels in
https://github.com/apache/arrow/tree/master/rust/arrow/src/compute/kernels
which do operate on `ArrayRef`s -- either for operations you could use
directly, or at least as inspiration / examples of how to manipulate the
various array types. (I have appended a couple of small sketches at the
end of this message: one of the downcast pattern and one of reading a
parquet file into a vector of `RecordBatch`es.)

> By the way, would it make sense to create a struct Table similar to the
> one in pyarrow to collect several Record Batches?

I think this could make sense, though most of the operations I see in
pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>
(e.g. filter and select) are already supported in the DataFrame API. Is
there any operation in particular that you would like to use?

Andrew

On Sun, Jan 24, 2021 at 7:41 AM Fernando Herrera <
[email protected]> wrote:

> Thanks Andrew,
>
> I did read the examples that you mentioned and I don't think they will
> help me with what I want to do. I need to create two hash maps from the
> parquet file to do further comparisons on those maps. In both cases I need
> to create a set of unique ngrams from strings stored in the parquet file.
>
> By the way, would it make sense to create a struct Table similar to the
> one in pyarrow to collect several Record Batches?
>
> Also, how is an object that implements `Array` (a `dyn Array`) downcast
> to other types of Arrays? I'm doing it now using as_any and then
> downcast_ref to the type I want, but I have to write the type in the code
> and I want to find a way for it to be done automatically.
>
> Thanks,
> Fernando
>
> On Sun, 24 Jan 2021, 12:01 Andrew Lamb, <[email protected]> wrote:
>
>> Hi Fernando,
>>
>> Keeping the data in memory as `RecordBatch`es sounds like the way to go
>> if you want it all to be in memory.
>>
>> Another way to work in Rust with data from parquet files is to use the
>> `DataFusion` library; depending on your needs it might save you some time
>> building up your analytics (e.g. it has aggregations, filtering and
>> sorting built in).
>>
>> Here are some examples of how to use DataFusion with a parquet file
>> (with the DataFrame and the SQL APIs):
>>
>> https://github.com/apache/arrow/blob/master/rust/datafusion/examples/dataframe.rs
>>
>> https://github.com/apache/arrow/blob/master/rust/datafusion/examples/parquet_sql.rs
>>
>> If you already have RecordBatches, you can register an in-memory table
>> as well.
>>
>> Hope that helps,
>> Andrew
>>
>>
>> On Sat, Jan 23, 2021 at 7:33 AM Fernando Herrera <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> A quick question regarding reading a parquet file. What is the best way
>>> to read a parquet file and keep it in memory to do data analysis?
>>>
>>> What I'm doing now is using the record reader from the
>>> ParquetFileArrowReader, and then I read all the record batches from the
>>> file. I keep the batches in memory in a vector of record batches. This
>>> way I have access to them to do some aggregations I need from the file.
>>>
>>> Is there another way to do this?
>>>
>>> Thanks,
>>> Fernando
>>>
>>
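A minimal sketch of the downcast pattern (assuming the `arrow` crate as of
early 2021; the column types here are only for illustration). Matching on
the array's `DataType` and then calling `as_any().downcast_ref` is
essentially what the compute kernels do internally, so the concrete type
does have to be named somewhere:

use arrow::array::{Array, ArrayRef, Int64Array, StringArray};
use arrow::datatypes::DataType;

// Print the values of a dynamically typed column by dispatching on its
// DataType and downcasting to the matching concrete array type.
fn print_values(array: &ArrayRef) {
    match array.data_type() {
        DataType::Int64 => {
            let a = array
                .as_any()
                .downcast_ref::<Int64Array>()
                .expect("DataType said Int64");
            for i in 0..a.len() {
                // Real code would also check a.is_null(i) here.
                println!("{}", a.value(i));
            }
        }
        DataType::Utf8 => {
            let a = array
                .as_any()
                .downcast_ref::<StringArray>()
                .expect("DataType said Utf8");
            for i in 0..a.len() {
                println!("{}", a.value(i));
            }
        }
        other => println!("unhandled type: {:?}", other),
    }
}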

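And a sketch of reading a parquet file into a vector of `RecordBatch`es,
roughly the approach described above (again assuming the `parquet` and
`arrow` crate APIs from around this time; the path and batch size are
placeholders you would supply):

use std::fs::File;
use std::sync::Arc;

use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

// Read every record batch from a parquet file and keep them all in memory.
fn read_parquet_to_batches(path: &str, batch_size: usize) -> Vec<RecordBatch> {
    let file = File::open(path).expect("open parquet file");
    let file_reader = SerializedFileReader::new(file).expect("create parquet file reader");
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
    arrow_reader
        .get_record_reader(batch_size)
        .expect("create record batch reader")
        .map(|batch| batch.expect("read record batch"))
        .collect()
}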