Thanks for the references and explanation!

Best,
Tao

On Mon, Dec 20, 2021 at 2:26 PM Sasha Krassovsky <[email protected]>
wrote:

> Hi Tao,
> You’re right that there is a cost to convert Arrow’s columnar format to a
> row format, but generally data analytics are done on columnar arrays for
> performance reasons. Any modern data analytics engine will use some form of
> “structure of arrays” (i.e. one array per column, rather than one struct
> per row).
>
> One example of this could be for grouped aggregation, where you group by a
> key column (let’s call it A) and sum another column (call it B). This can
> be broken up into two steps: group ID mapping and summation. The first
> stage touches only column A, as you find which row indices have the same
> values and assign them to groups. The summation step then touches only
> column B. As you can see, each part of the analysis touches only one column
> at a time. Thus if we were to use a row-oriented format we would be loading
> data into cache that we weren’t actually using, effectively reducing our
> cache size by half! By using a columnar format, we minimize cache misses by
> loading more useful data at a time.
>
> I hope this example helps illustrate why a columnar format is better for
> analytical workloads, which is what Arrow is targeting.
>
> Sasha
>
> > 19 дек. 2021 г., в 22:02, Tao Wang <[email protected]> написал(а):
> >
> > Hi,
> >
> > I looked through Arrow's docs about its formats and APIs.
> >
> > But I am still somewhat confused about typical usecases of Arrow.
> >
> > As in my understanding, the goal of Arrow is to eliminate the
> (de)serialization costs among different data analytic systems, since it has
> the common format.
> >
> > But, it still needs some data conversion between Arrow format and
> language native format, right? For example, you have to convert Arrow
> columnar-based format to C++ row-based format. Or is there any usecase to
> directly conduct data analysis on Arrow's format?
> >
> > Best,
> > Tao
>

Reply via email to