> Something to look at might be the various compute kernels in https://github.com/apache/arrow/tree/master/rust/arrow/src/compute/kernels that do operate on `ArrayRef`s -- either for operations you could use directly or at least inspiration / examples of how to manipulate the various array types.

Actually, my issue boils down to accessing the methods of the downcast
type. For example, if we read a column from a record batch using the
column() method we get an Arc<dyn Array>, but I know that the column is a
StringArray, so I would like to use the methods available on StringArray.
The only way I can do that now is by using the as_any() method. I will
have a look at the kernels to see if there is another way to do it.

> I think this could make sense, though most of the operations I see in pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html> (e.g. filter and select) are already supported in the Dataframe api. Is there any operation in particular that you would like to use?

One operation that I would like to have is an iterator over the data in
one column, or possibly over the rows. I can see that I could use
collect(), but that only returns a Vec<RecordBatch>, which means I would
have to loop over the available record batches myself.

For example, right now I am trying to create two hash maps. For the first
one I loop over all the names that have an ID, create ngrams from the
names, and group the ngrams by ID. For the second one, the key is the
ngram and the value is a HashSet with the IDs that share that ngram.

That's why I think a simpler Table struct with an iterator would help a
lot, and it would be native to arrow, like the one in pyarrow. What do you
think? (A rough sketch of what I'm doing today is at the bottom of this
mail, below the quoted thread.)

Fernando

On Sun, Jan 24, 2021 at 1:38 PM Andrew Lamb <[email protected]> wrote:

> > Also, how is an object that implements Array <dyn Array> downcasted to
> other types of Arrays. I'm doing it now using as_any and then down ref to
> the type I want. But I have to write the type in the code and I want to
> find a way for it to be done automatically.
>
> I think this is the standard way in Rust -- because Rust is statically
> typed, in order to do anything with the implementations a cast to a
> concrete type is typically needed.
>
> Something to look at might be the various compute kernels in
> https://github.com/apache/arrow/tree/master/rust/arrow/src/compute/kernels
> that do operate on `ArrayRef`s -- either for operations you could use
> directly or at least inspiration / examples of how to manipulate the
> various array types.
>
>
> By the way, would it make sense to create a struct Table similar to the
> one in pyarrow to collect several Record Batches?
> I think this could make sense, though most of the operations I see in
> pyarrow.Table
> <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html> (e.g.
> filter and select) are already supported in the Dataframe api. Is there any
> operation in particular that you would like to use?
>
> Andrew
>
> On Sun, Jan 24, 2021 at 7:41 AM Fernando Herrera <
> [email protected]> wrote:
>
>> Thanks Andrew,
>>
>> I did read the examples that you mentioned and I don't think they will
>> help me with what I want to do. I need to create two hash maps from the
>> parquet file to do further comparisons on those maps. In both cases I need
>> to create a set of unique ngrams from strings stored in the parquet file.
>>
>> By the way, would it make sense to create a struct Table similar to the
>> one in pyarrow to collect several Record Batches?
>>
>> Also, how is an object that implements Array <dyn Array> downcasted to
>> other types of Arrays. I'm doing it now using as_any and then down ref to
>> the type I want. But I have to write the type in the code and I want to
>> find a way for it to be done automatically.
>>
>> Thanks,
>> Fernando
>>
>> On Sun, 24 Jan 2021, 12:01 Andrew Lamb, <[email protected]> wrote:
>>
>>> Hi Fernando,
>>>
>>> Keeping the data in memory as `RecordBatch`es sounds like the way to go
>>> if you want it all to be in memory.
>>>
>>> Another way to work in Rust with data from parquet files is to use the
>>> `DataFusion` library; depending on your needs it might save you some time
>>> building up your analytics (e.g. it has aggregations, filtering and sorting
>>> built in).
>>>
>>> Here are some examples of how to use DataFusion with a parquet file
>>> (with the dataframe and the SQL api):
>>>
>>> https://github.com/apache/arrow/blob/master/rust/datafusion/examples/dataframe.rs
>>>
>>> https://github.com/apache/arrow/blob/master/rust/datafusion/examples/parquet_sql.rs
>>>
>>> If you already have RecordBatches you can register an in memory table as
>>> well.
>>>
>>> Hope that helps,
>>> Andrew
>>>
>>>
>>> On Sat, Jan 23, 2021 at 7:33 AM Fernando Herrera <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> A quick question regarding reading a parquet file. What is the best way
>>>> to read a parquet file and keep it in memory to do data analysis?
>>>>
>>>> What I'm doing now is using the record reader from the
>>>> ParquetFileArrowReader and then I read all the record batches from the
>>>> file. I keep the batches in memory in a vector of record batches. This way
>>>> I have access to them to do some aggregations I need from the file.
>>>>
>>>> Is there another way to do this?
>>>>
>>>> Thanks,
>>>> Fernando
>>>>
>>>
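
P.S. To make the two points above a bit more concrete, here is a rough
sketch of what I am doing today with as_any() and a plain loop over the
collected record batches. The column positions and types (an Int64 id in
column 0, a Utf8 name in column 1) and the ngram size are made up for the
example, not my real schema:

    use std::collections::{HashMap, HashSet};

    use arrow::array::{Array, Int64Array, StringArray};
    use arrow::record_batch::RecordBatch;

    /// Character ngrams of a name (trigrams in this example).
    fn ngrams(name: &str, n: usize) -> Vec<String> {
        let chars: Vec<char> = name.chars().collect();
        if chars.len() < n {
            return vec![name.to_string()];
        }
        chars.windows(n).map(|w| w.iter().collect()).collect()
    }

    /// Build the two maps described above from the collected batches:
    /// id -> set of ngrams, and ngram -> set of ids that share it.
    fn build_maps(
        batches: &[RecordBatch],
    ) -> (HashMap<i64, HashSet<String>>, HashMap<String, HashSet<i64>>) {
        let mut ngrams_by_id: HashMap<i64, HashSet<String>> = HashMap::new();
        let mut ids_by_ngram: HashMap<String, HashSet<i64>> = HashMap::new();

        for batch in batches {
            // column() only gives an Arc<dyn Array>, so downcast through
            // as_any() to reach the typed accessors such as value().
            let ids = batch
                .column(0)
                .as_any()
                .downcast_ref::<Int64Array>()
                .expect("column 0 should be an Int64Array");
            let names = batch
                .column(1)
                .as_any()
                .downcast_ref::<StringArray>()
                .expect("column 1 should be a StringArray");

            for row in 0..batch.num_rows() {
                // Only rows that have both an id and a name.
                if ids.is_null(row) || names.is_null(row) {
                    continue;
                }
                let id = ids.value(row);
                for ngram in ngrams(names.value(row), 3) {
                    ngrams_by_id.entry(id).or_default().insert(ngram.clone());
                    ids_by_ngram.entry(ngram).or_default().insert(id);
                }
            }
        }

        (ngrams_by_id, ids_by_ngram)
    }

This is the part where I would hope a Table with a typed column iterator
could remove the two downcasts and the outer loop over the batches.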
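
P.P.S. For reference, this is roughly how I am reading the parquet file
into the vector of record batches at the moment. The path and the batch
size are placeholders, and the exact reader API may differ slightly
between arrow/parquet releases:

    use std::fs::File;
    use std::sync::Arc;

    use arrow::record_batch::RecordBatch;
    use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
    use parquet::file::reader::SerializedFileReader;

    /// Read every record batch of a parquet file into memory.
    fn read_batches(path: &str, batch_size: usize) -> Vec<RecordBatch> {
        let file = File::open(path).expect("file should open");
        let file_reader =
            SerializedFileReader::new(file).expect("valid parquet file");
        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));

        // get_record_reader yields Result<RecordBatch>; every batch is kept
        // in memory so the aggregations can run over all of them later.
        arrow_reader
            .get_record_reader(batch_size)
            .expect("record reader")
            .map(|batch| batch.expect("valid batch"))
            .collect()
    }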
