Hello, I have a question regarding best practices with Apache Arrow. I have a
very large dataset (tens of millions of rows) stored in a partitioned Parquet
dataset on disk. I load this dataset into memory as a pyarrow.Table and drop
all columns except one, which is a map type from integers to floats. This
column represents sparse feature-vector data to be used in an ML context.
Call the number of rows "num_rows". My job is to transform this column into a
2D numpy array of shape ("num_rows" x "num_cols"), where both the row and
column counts are known beforehand. If one of my pyarrow.Table rows looks
like [(1, 3.4), (2, 4.4), (4, 5.4), (6, 6.4)] and "num_cols" = 10, then that
row in the numpy array would look like [0, 3.4, 4.4, 0, 5.4, 0, 6.4, 0, 0, 0],
where each key is the column index and unmapped positions are just 0. The 2D
numpy array is simply every row of the pyarrow.Table transformed in this way.
What is the best, most efficient way to accomplish this, considering I have
tens of millions of rows? Assume I have enough memory to fit the entire
dataset.
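For concreteness, here is a naive sketch of the transformation I'm after (the
column name "features" and the toy data are just placeholders for my real
table). It produces the right result but loops in Python per row, which I
expect to be far too slow at this scale:

    import numpy as np
    import pyarrow as pa

    # Toy stand-in for my real table: a single map<int32, float32> column.
    table = pa.table({
        "features": pa.array(
            [
                [(1, 3.4), (2, 4.4), (4, 5.4), (6, 6.4)],
                [(0, 1.0), (9, 2.0)],
            ],
            type=pa.map_(pa.int32(), pa.float32()),
        )
    })

    num_rows = table.num_rows
    num_cols = 10

    # Naive per-row loop: map cells come back from to_pylist() as lists of
    # (key, value) tuples, and each key is written into its column slot.
    dense = np.zeros((num_rows, num_cols), dtype=np.float32)
    for i, pairs in enumerate(table.column("features").to_pylist()):
        for key, value in pairs:
            dense[i, key] = value

    print(dense[0])  # first row matches the example above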

Note that I can use table.to_pandas() to get a pandas DataFrame and then map
functions over the pandas Series, if that would help in the solution. So far,
however, I have been stumped; df.to_numpy() has not been helpful here.
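In case it clarifies what I mean by mapping over the series, this is roughly
the route I had in mind (again assuming the column is named "features", and
that to_pandas() yields each cell as a list of (key, value) tuples). It still
runs a Python-level loop per row, which is what I'm hoping to avoid:

    import numpy as np

    # `table` is the single-column pyarrow.Table described above.
    df = table.select(["features"]).to_pandas()

    num_cols = 10

    def densify(pairs):
        # One row's sparse data, e.g. [(1, 3.4), (2, 4.4), (4, 5.4), (6, 6.4)]
        row = np.zeros(num_cols, dtype=np.float32)
        for key, value in pairs:
            row[key] = value
        return row

    # Series.map gives a Series of 1D arrays; np.stack assembles the 2D array.
    dense = np.stack(df["features"].map(densify).to_numpy())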
