Thanks for the references and explanation! Best, Tao
On Mon, Dec 20, 2021 at 2:26 PM Sasha Krassovsky <[email protected]> wrote: > Hi Tao, > You’re right that there is a cost to convert Arrow’s columnar format to a > row format, but generally data analytics are done on columnar arrays for > performance reasons. Any modern data analytics engine will use some form of > “structure of arrays” (i.e. one array per column, rather than one struct > per row). > > One example of this could be for grouped aggregation, where you group by a > key column (let’s call it A) and sum another column (call it B). This can > be broken up into two steps: group ID mapping and summation. The first > stage touches only column A, as you find which row indices have the same > values and assign them to groups. The summation step then touches only > column B. As you can see, each part of the analysis touches only one column > at a time. Thus if we were to use a row-oriented format we would be loading > data into cache that we weren’t actually using, effectively reducing our > cache size by half! By using a columnar format, we minimize cache misses by > loading more useful data at a time. > > I hope this example helps illustrate why a columnar format is better for > analytical workloads, which is what Arrow is targeting. > > Sasha > > > 19 дек. 2021 г., в 22:02, Tao Wang <[email protected]> написал(а): > > > > Hi, > > > > I looked through Arrow's docs about its formats and APIs. > > > > But I am still somewhat confused about typical usecases of Arrow. > > > > As in my understanding, the goal of Arrow is to eliminate the > (de)serialization costs among different data analytic systems, since it has > the common format. > > > > But, it still needs some data conversion between Arrow format and > language native format, right? For example, you have to convert Arrow > columnar-based format to C++ row-based format. Or is there any usecase to > directly conduct data analysis on Arrow's format? > > > > Best, > > Tao >
