In a vectorized query system, scalars and conditionals should be
avoided at all costs. That's why it's called "vectorized": it's about
the vectors and not the scalars.
Arrow Arrays (AKA "vectors" in other systems) are the unit of data you
mainly deal with. Data abstraction (in the OOP sense) isn't possible
while also keeping performance: classes like Scalar and DoubleScalar
are not supposed to be instantiated for every scalar in an array when
you're looping. The disadvantage is that your loop now depends on the
type of the array you're dealing with (no data abstraction based on
virtual dispatch).
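
To make the "loop depends on the type" point concrete, here is a
minimal sketch in plain C++ (no Arrow headers). The names TypeId,
Column, SumTyped and Sum are hypothetical stand-ins, not Arrow APIs.
The idea is that the type is inspected once per column, and the inner
loop is a typed loop with no per-element dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, Arrow-free stand-in for a type-erased column.
enum class TypeId { kInt32, kDouble };

struct Column {
  TypeId type;
  const void* data;    // tightly packed values of the type named by `type`
  std::size_t length;  // number of values
};

// Typed inner loop: no virtual calls and no per-element type checks.
template <typename T>
double SumTyped(const T* values, std::size_t n) {
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i) sum += static_cast<double>(values[i]);
  return sum;
}

// The type is inspected once per column; the switch picks the typed loop.
double Sum(const Column& col) {
  assert(col.data != nullptr || col.length == 0);
  switch (col.type) {
    case TypeId::kInt32:
      return SumTyped(static_cast<const std::int32_t*>(col.data), col.length);
    case TypeId::kDouble:
      return SumTyped(static_cast<const double*>(col.data), col.length);
  }
  return 0.0;  // not reached for valid TypeId values
}
```

The cost of the switch is paid once per column rather than once per
value, which is the trade-off described above: more code per type, but
a tight loop for each.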
> Also, is there an efficient way to loop through a slice perhaps by
> incrementing a pointer?
That's the right path. Given a ChunkedArray, this is what you can do:

    const auto &dt = chunked_array->type();
    assert(dt->id() == Type::DOUBLE);
    for (const auto &chunk : chunked_array->chunks()) {
      // each chunk is an arrow::Array
      const ArrayData &data = *chunk->data();
      // buffer 1 is the data buffer
      util::span<const double> raw_values =
          data.GetSpan<double>(1, data.length);
      // ^ all the scalars of the chunk are tightly packed here,
      //   64 bits for every double even if it's logically NULL
    }

If data.IsNull(i), the value of raw_values[i] is unspecified. Depending
on what you're doing with raw_values, you may not have to care.
Compute functions commonly have two different loops: one that handles
nulls and a faster one (without checks in the loop body) that you can
use when data.GetNullCount()==0.
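
A minimal sketch of that two-loop pattern, in plain C++ without Arrow
so it stands on its own. DoubleChunk, IsValid and SumChunk are
hypothetical names. I'm assuming a validity bitmap with one bit per
value, least-significant bit first, where a set bit means "valid"
(which is how the Arrow format lays it out).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one chunk: `values` plays the role of the data
// buffer and `validity` the role of the validity bitmap (bit i set = valid).
struct DoubleChunk {
  std::vector<double> values;
  std::vector<std::uint8_t> validity;  // may be empty when null_count == 0
  std::int64_t null_count = 0;
};

inline bool IsValid(const DoubleChunk& c, std::size_t i) {
  return ((c.validity[i / 8] >> (i % 8)) & 1) != 0;
}

double SumChunk(const DoubleChunk& c) {
  const double* p = c.values.data();
  const std::size_t n = c.values.size();
  double sum = 0.0;
  if (c.null_count == 0) {
    // Fast path: no check in the loop body.
    for (std::size_t i = 0; i < n; ++i) sum += p[i];
  } else {
    // Slow path: the bitmap must cover every value.
    assert(c.validity.size() * 8 >= n);
    for (std::size_t i = 0; i < n; ++i) {
      if (IsValid(c, i)) sum += p[i];
    }
  }
  return sum;
}
```

The branch on null_count happens once per chunk, so the fast loop stays
free of conditionals and is easier for the compiler to vectorize.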
Another trick is to compute on all the values and carry the same
validity bitmap over to the result. This is possible when the operation
treats each value independently of the others.
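
As a sketch of that trick, again in plain C++ rather than real Arrow
code: DoubleColumn, Bitmap and Scale are hypothetical names, and the
shared_ptr is only my way of modelling a bitmap buffer that is shared
between input and output instead of copied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

using Bitmap = std::vector<std::uint8_t>;

// Hypothetical stand-in for a nullable column of doubles.
struct DoubleColumn {
  std::vector<double> values;
  std::shared_ptr<const Bitmap> validity;  // null pointer = no nulls
};

// Multiplies every slot, valid or not, and reuses the input's bitmap.
DoubleColumn Scale(const DoubleColumn& in, double factor) {
  assert(!in.validity || in.validity->size() * 8 >= in.values.size());
  DoubleColumn out;
  out.values.resize(in.values.size());
  const double* src = in.values.data();
  double* dst = out.values.data();
  // No null checks: the result in a null slot is garbage, but the shared
  // bitmap still marks that slot as null, so nobody should read it.
  for (std::size_t i = 0; i < in.values.size(); ++i) dst[i] = src[i] * factor;
  out.validity = in.validity;  // carry the same bitmap, no copy
  return out;
}
```

This is fine for something like floating-point multiplication. For
operations that can fault on arbitrary input (integer division by a
slot that might hold zero, for instance) the null slots need more care.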
Hope this helps. An ultra-generic loop over all possible array types is
not possible without many allocations and branches per array element.
--
Felipe
On Mon, Feb 19, 2024 at 9:23 AM Weston Pace <[email protected]> wrote:
>
> There is no advantage to using a Datum here. The Datum class is mainly
> intended for representing something that might be a Scalar or might be an
> Array.
>
> > Also, is there an efficient way to loop through a slice perhaps by
> > incrementing a pointer?
>
> You will want to cast the Array and avoid Scalar instances entirely. For
> example, if you know there are no nulls in your data then you can use methods
> like `DoubleArray::raw_values` which will give you a `double*`. Since it is
> a chunked array you would also have to deal with indexing and iterating the
> chunks.
>
> There are also some iterator utility classes like
> `arrow::stl::ChunkedArrayIterator` which can be easier to use.
>
> On Mon, Feb 19, 2024 at 3:54 AM Blair Azzopardi <[email protected]> wrote:
>>
>> On 2nd thoughts, the 2nd method could also be done in a single line.
>>
>> auto low3 =
>> arrow::Datum(st_s_low.ValueOrDie()).scalar_as<arrow::DoubleScalar>().value;
>>
>> That said, I'm still keen to hear if there's an advantage to using Datum or
>> without; and on my 2nd question regarding efficiently looping through a
>> slice's values.
>>
>> On Mon, 19 Feb 2024 at 09:24, Blair Azzopardi <[email protected]> wrote:
>>>
>>> Hi
>>>
>>> I'm trying to figure out the optimal way for extracting scalar values from
>>> a table; I've found two ways, using a dynamic cast or using Datum and cast.
>>> Is one better than the other? The advantage of the dynamic cast, seems at
>>> least, to be a one liner.
>>>
>>> auto c_val1 = table.GetColumnByName("Val1");
>>> auto st_c_val1 = c_val1->GetScalar(0);
>>> if (st_c_val1.ok()) {
>>>
>>> // method 1 - via dyn cast
>>> auto val1 =
>>> std::dynamic_pointer_cast<arrow::DoubleScalar>(st_c_val1.ValueOrDie())->value;
>>>
>>> // method 2 - via Datum & cast
>>> arrow::Datum val(st_c_val1.ValueOrDie());
>>> auto val1 = val.scalar_as<arrow::DoubleScalar>().value;
>>> }
>>>
>>> Also, is there an efficient way to loop through a slice perhaps by
>>> incrementing a pointer? I know a chunked array might mean that the
>>> underlying data isn't stored contiguously so perhaps this is tricky to do.
>>> I imagine the compute functions might do this. Otherwise, it feels each
>>> access to a value in memory requires calls to several functions
>>> (GetScalar/ok/ValueOrDie etc).
>>>
>>> Thanks in advance
>>> Blair