Hi Amine,
I don't think there is anything in the core Arrow library that helps with this at the moment. The most efficient way to do something like this would probably be custom C/C++ code for the conversion, but I'm not an expert in numpy.
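
That said, if you'd rather stay in Python, something along these lines might work as a starting point. This is an untested sketch, not a benchmarked solution: the function name and arguments are placeholders, and it assumes the map column has no null rows and that each chunk is not a slice of a larger array. The idea is to flatten the map column's keys and values into numpy arrays once per chunk and scatter them into a preallocated dense matrix with fancy indexing, avoiding any per-row Python loop:

    import numpy as np
    import pyarrow as pa

    def map_column_to_dense(table: pa.Table, column: str, num_cols: int) -> np.ndarray:
        """Scatter a map<int, float> column into a dense (num_rows x num_cols) array."""
        dense = np.zeros((table.num_rows, num_cols), dtype=np.float32)
        row_offset = 0
        for chunk in table.column(column).chunks:
            # chunk is a pyarrow.MapArray: offsets delimit each row's entries,
            # keys/items are the flattened column indices and values.
            offsets = chunk.offsets.to_numpy()
            keys = chunk.keys.to_numpy(zero_copy_only=False).astype(np.int64)
            vals = chunk.items.to_numpy(zero_copy_only=False)
            counts = np.diff(offsets)  # number of map entries in each row
            rows = np.repeat(np.arange(len(chunk)) + row_offset, counts)
            dense[rows, keys] = vals   # vectorized scatter into the dense matrix
            row_offset += len(chunk)
        return dense

    # e.g. dense = map_column_to_dense(table, "features", num_cols)

Whether this beats a custom C/C++ kernel would need benchmarking, but with the data already in memory it at least keeps all the heavy lifting in numpy.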
-Micah

On Tue, Nov 24, 2020 at 7:41 PM Amine Boubezari <[email protected]> wrote:
> Hello, I have a question regarding best practices with Apache Arrow. I have
> a very large dataset (tens of millions of rows) stored in a partitioned
> Parquet dataset on disk. I load this dataset into memory into a
> pyarrow.Table and drop all columns except one, which is of type MapType,
> mapping integers to floats. This column represents sparse feature vector
> data to be used in an ML context. Call the number of rows "num_rows". My
> job is to transform this column into a 2D numpy array of shape
> ("num_rows" x "num_cols"), where both rows and cols are known beforehand.
> If one of my pyarrow.Table rows looks like [(1, 3.4), (2, 4.4), (4, 5.4),
> (6, 6.4)] and "num_cols" = 10, then that row in the numpy array would look
> like [0, 3.4, 4.4, 0, 5.4, 0, 6.4, 0, 0, 0], where unmapped values are
> just 0. My 2D numpy array would just be the collection of rows from the
> pyarrow.Table transformed in that way. What is the best, most efficient
> way to accomplish this, considering I have tens of millions of rows?
> Assume I have enough memory to fit the entire dataset.
>
> Note that I can use table.to_pandas() to get a pandas DF and then map
> functions on the pandas series, if that would help in the solution. So far
> I have been stumped, however; df.to_numpy() has not been helpful here.
