On 12/20/21 9:01 AM, Tao Wang wrote:
Hi,
I looked through Arrow's docs about its formats and APIs,
but I am still somewhat confused about Arrow's typical use cases.
As I understand it, the goal of Arrow is to eliminate the (de)serialization
costs between different data analytics systems by giving them a common format.
But it still needs some conversion between the Arrow format and a language's
native format, right? For example, you have to convert Arrow's columnar
format to a row-based C++ format. Or are there use cases where data
analysis is conducted directly on Arrow's format?
Conversion may be required, but the hope is that for many data analytics
applications, if the data can be described by the Arrow format, then no
conversion is needed and data processing can run efficiently on it directly.
Please see the examples [1] and cookbook [2] for analytics demonstrations.
Best,
Tao
Hi Tao,
The documentation is still being updated. For an end user, the Python
documentation [1][2] and the Ballista [3] documentation are probably of most
interest. The original motivation for Arrow was to develop more
efficient data frames that allow for interoperability [4].
Regards,
Benson
[1] https://arrow.apache.org/docs/python/index.html
[2] https://arrow.apache.org/cookbook/py/
[3] https://arrow.apache.org/blog/2021/04/12/ballista-donation/
[4] https://wesmckinney.com/blog/apache-arrow-pandas-internals/