Paul Balanca created ARROW-11006:
------------------------------------

             Summary: [Python] Array to_numpy slow compared to Numpy.view
                 Key: ARROW-11006
                 URL: https://issues.apache.org/jira/browse/ARROW-11006
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Paul Balanca
            Assignee: Paul Balanca

The `to_numpy` method is quite slow compared to NumPy slicing and view creation. For instance:

{code:java}
import numpy as np
import pyarrow as pa

N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}

In this run, `to_numpy` is nearly 5x slower than `view`. The benchmark is clearly an extreme case, but the point is that for any operation not available in PyArrow, falling back on NumPy is a good option, and the cost of extracting the data should be as small as possible (there are scenarios where you cannot easily cache this view, so you end up calling `to_numpy` many times).

I believe part of this overhead is due to PyArrow implementing a very generic Pandas conversion and using it even for very simple NumPy-like dense arrays.