> Something to look at might be the various compute kernels in https://github.com/apache/arrow/tree/master/rust/arrow/src/compute/kernels that do operate on `ArrayRef`s -- either for operations you could use directly or at least inspiration / examples of how to manipulate the various array types.

Actually, my issue boils down to accessing the methods of the downcast
type. For example, if we read a column from a record batch using the
column() method we get an Arc<dyn Array>, but I know that the column is a
StringArray, so I would like to use the methods available on StringArray.
The only way I can do that now is by using the as_any() method. I will
have a look at the kernels to see if there is another way to do it.

> I think this could make sense, though most of the operations I see in pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html> (e.g. filter and select) are already supported in the Dataframe api. Is there any operation in particular that you would like to use?

One operation that I would like to have is an iterator over the data in
one column, or possibly over the rows. I can see that I could use
collect(), but that only returns a Vec<RecordBatch>, which means I would
have to loop over the available record batches myself.

For example, right now I am trying to create two hash maps. For the first
one I loop over all the names that have an ID, create ngrams from the
names, and group the ngrams by ID. For the second one, the key is the
ngram and the value is a HashSet with the IDs that share that ngram.

That's why I think a simpler Table struct with an iterator would help a
lot, and it would be native to arrow, like the one in pyarrow. What do you
think? (A rough sketch of what I'm doing today is at the bottom of this
mail, below the quoted thread.)

Fernando

On Sun, Jan 24, 2021 at 1:38 PM Andrew Lamb <[email protected]> wrote:

> > Also, how is an object that implements Array <dyn Array> downcasted to
> other types of Arrays. I'm doing it now using as_any and then down ref to
> the type I want. But I have to write the type in the code and I want to
> find a way for it to be done automatically.
>
> I think this is the standard way in Rust -- because Rust is statically
> typed, in order to do anything with the implementations a cast to a
> concrete type is typically needed.
>
> Something to look at might be the various compute kernels in
> https://github.com/apache/arrow/tree/master/rust/arrow/src/compute/kernels
> that do operate on `ArrayRef`s -- either for operations you could use
> directly or at least inspiration / examples of how to manipulate the
> various array types.
>
>
> By the way, would it make sense to create a struct Table similar to the
> one in pyarrow to collect several Record Batches?
> I think this could make sense, though most of the operations I see in
> pyarrow.Table
> <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html> (e.g.
> filter and select) are already supported in the Dataframe api. Is there any
> operation in particular that you would like to use?
>
> Andrew
>
> On Sun, Jan 24, 2021 at 7:41 AM Fernando Herrera <
> [email protected]> wrote:
>
>> Thanks Andrew,
>>
>> I did read the examples that you mentioned and I don't think they will
>> help me with what I want to do. I need to create two hash maps from the
>> parquet file to do further comparisons on those maps. In both cases I need
>> to create a set of unique ngrams from strings stored in the parquet file.
>>
>> By the way, would it make sense to create a struct Table similar to the
>> one in pyarrow to collect several Record Batches?
>>
>> Also, how is an object that implements Array <dyn Array> downcasted to
>> other types of Arrays. I'm doing it now using as_any and then down ref to
>> the type I want. But I have to write the type in the code and I want to
>> find a way for it to be done automatically.
>>
>> Thanks,
>> Fernando
>>
>> On Sun, 24 Jan 2021, 12:01 Andrew Lamb, <[email protected]> wrote:
>>
>>> Hi Fernando,
>>>
>>> Keeping the data in memory as `RecordBatch`es sounds like the way to go
>>> if you want it all to be in memory.
>>>
>>> Another way to work in Rust with data from parquet files is to use the
>>> `DataFusion` library; depending on your needs it might save you some time
>>> building up your analytics (e.g. it has aggregations, filtering and sorting
>>> built in).
>>>
>>> Here are some examples of how to use DataFusion with a parquet file
>>> (with the dataframe and the SQL api):
>>>
>>> https://github.com/apache/arrow/blob/master/rust/datafusion/examples/dataframe.rs
>>>
>>> https://github.com/apache/arrow/blob/master/rust/datafusion/examples/parquet_sql.rs
>>>
>>> If you already have RecordBatches you can register an in memory table as
>>> well.
>>>
>>> Hope that helps,
>>> Andrew
>>>
>>>
>>> On Sat, Jan 23, 2021 at 7:33 AM Fernando Herrera <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> A quick question regarding reading a parquet file. What is the best way
>>>> to read a parquet file and keep it in memory to do data analysis?
>>>>
>>>> What I'm doing now is using the record reader from the
>>>> ParquetFileArrowReader and then I read all the record batches from the
>>>> file. I keep the batches in memory in a vector of record batches. This way
>>>> I have access to them to do some aggregations I need from the file.
>>>>
>>>> Is there another way to do this?
>>>>
>>>> Thanks,
>>>> Fernando
>>>>
>>>
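
P.S. To make the two points above a bit more concrete, here is a rough
sketch of what I am doing today with as_any() and a plain loop over the
collected record batches. The column positions and types (an Int64 id in
column 0, a Utf8 name in column 1) and the ngram size are made up for the
example, not my real schema:

    use std::collections::{HashMap, HashSet};

    use arrow::array::{Array, Int64Array, StringArray};
    use arrow::record_batch::RecordBatch;

    /// Character ngrams of a name (trigrams in this example).
    fn ngrams(name: &str, n: usize) -> Vec<String> {
        let chars: Vec<char> = name.chars().collect();
        if chars.len() < n {
            return vec![name.to_string()];
        }
        chars.windows(n).map(|w| w.iter().collect()).collect()
    }

    /// Build the two maps described above from the collected batches:
    /// id -> set of ngrams, and ngram -> set of ids that share it.
    fn build_maps(
        batches: &[RecordBatch],
    ) -> (HashMap<i64, HashSet<String>>, HashMap<String, HashSet<i64>>) {
        let mut ngrams_by_id: HashMap<i64, HashSet<String>> = HashMap::new();
        let mut ids_by_ngram: HashMap<String, HashSet<i64>> = HashMap::new();

        for batch in batches {
            // column() only gives an Arc<dyn Array>, so downcast through
            // as_any() to reach the typed accessors such as value().
            let ids = batch
                .column(0)
                .as_any()
                .downcast_ref::<Int64Array>()
                .expect("column 0 should be an Int64Array");
            let names = batch
                .column(1)
                .as_any()
                .downcast_ref::<StringArray>()
                .expect("column 1 should be a StringArray");

            for row in 0..batch.num_rows() {
                // Only rows that have both an id and a name.
                if ids.is_null(row) || names.is_null(row) {
                    continue;
                }
                let id = ids.value(row);
                for ngram in ngrams(names.value(row), 3) {
                    ngrams_by_id.entry(id).or_default().insert(ngram.clone());
                    ids_by_ngram.entry(ngram).or_default().insert(id);
                }
            }
        }

        (ngrams_by_id, ids_by_ngram)
    }

This is the part where I would hope a Table with a typed column iterator
could remove the two downcasts and the outer loop over the batches.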
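
P.P.S. For reference, this is roughly how I am reading the parquet file
into the vector of record batches at the moment. The path and the batch
size are placeholders, and the exact reader API may differ slightly
between arrow/parquet releases:

    use std::fs::File;
    use std::sync::Arc;

    use arrow::record_batch::RecordBatch;
    use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
    use parquet::file::reader::SerializedFileReader;

    /// Read every record batch of a parquet file into memory.
    fn read_batches(path: &str, batch_size: usize) -> Vec<RecordBatch> {
        let file = File::open(path).expect("file should open");
        let file_reader =
            SerializedFileReader::new(file).expect("valid parquet file");
        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));

        // get_record_reader yields Result<RecordBatch>; every batch is kept
        // in memory so the aggregations can run over all of them later.
        arrow_reader
            .get_record_reader(batch_size)
            .expect("record reader")
            .map(|batch| batch.expect("valid batch"))
            .collect()
    }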
