Thanks @Weston and @Felipe. This information has been very helpful, and
thank you for the examples too. I completely agree with vectorizing
computations, although ultimately these do end up as loops at the lower
levels (unless there's hardware support, e.g. SIMD/GPU).

@Weston, I managed to iterate over my chunked array as you suggested (I
found some useful examples under the test cases), i.e.

    std::vector<double> values;
    for (auto elem : arrow::stl::Iterate<arrow::DoubleType>(*chunked_array)) {
        if (elem.has_value()) {
            values.push_back(*elem);
        }
    }

@Felipe, I had to adjust your snippet somewhat to get it to work (perhaps
the API is in flux). Eventually I did something like this:

    for (const auto &chunk : chunked_array->chunks()) {
        const auto &data = chunk->data();
        arrow::ArraySpan array_span(*data);
        auto len = array_span.buffers[1].size / static_cast<int64_t>(sizeof(double));
        auto raw_values = array_span.GetSpan<double>(1, len);
        // able to inspect values as raw_values[N]
    }

Now I just need to figure out the best way to do this over multiple columns
(row-wise).

Thanks again!


On Tue, 20 Feb 2024 at 19:51, Felipe Oliveira Carvalho <[email protected]>
wrote:

> In a Vectorized querying system, scalars and conditionals should be
> avoided at all costs. That's why it's called "vectorized" — it's about
> the vectors and not the scalars.
>
> Arrow Arrays (AKA "vectors" in other systems) are the unit of data you
> mainly deal with. Data abstraction (in the OOP sense) isn't possible
> while also keeping performance — classes like Scalar and DoubleScalar
> are not supposed to be instantiated for every scalar in an array when
> you're looping. The disadvantage is that your loop now depends on the
> type of the array you're dealing with (no data abstraction based on
> virtual dispatching).
>
> > Also, is there an efficient way to loop through a slice perhaps by
> incrementing a pointer?
>
> That's the right path. Given a ChunkedArray, this is what you can do:
>
> auto &dt = chunked_array->type();
> assert(dt->id() == Type::DOUBLE);
> for (auto &chunk : chunked_array->chunks()) {
>    // each chunk is an arrow::Array
>    ArrayData &data = chunk->data();
>    util::span<const double> raw_values = data.GetSpan<double>(1); // 1 is the data buffer
>    // ^ all the scalars of the chunk are tightly packed here
>    // 64 bits for every double even if it's logically NULL
> }
>
> If data.IsNull(i), the value of raw_values[i] is undefined; depending
> on what you're doing with the raw_values, you may not have to care.
> Compute functions commonly have two different loops: one that handles
> nulls and a faster one (without checks in the loop body) that you can
> use when data.GetNullCount()==0.
>
> Another trick is to compute on all the values and carry the same
> validity bitmap over to the result. This is possible when the operation
> depends on each value independently of the others.
>
> Hope this helps. The ultra generic loop on all possible array types is
> not possible without many allocations and branches per array element.
>
> --
> Felipe
>
>
>
> On Mon, Feb 19, 2024 at 9:23 AM Weston Pace <[email protected]> wrote:
> >
> > There is no advantage to using a Datum here.  The Datum class is mainly
> intended for representing something that might be a Scalar or might be an
> Array.
> >
> > > Also, is there an efficient way to loop through a slice perhaps by
> incrementing a pointer?
> >
> > You will want to cast the Array and avoid Scalar instances entirely.
> For example, if you know there are no nulls in your data then you can use
> methods like `DoubleArray::raw_values` which will give you a `double*`.
> Since it is a chunked array you would also have to deal with indexing and
> iterating the chunks.
> >
> > There are also some iterator utility classes like
> `arrow::stl::ChunkedArrayIterator` which can be easier to use.
> >
> > On Mon, Feb 19, 2024 at 3:54 AM Blair Azzopardi <[email protected]>
> wrote:
> >>
> >> On second thoughts, the second method could also be done in a single line.
> >>
> >> auto low3 = arrow::Datum(st_s_low.ValueOrDie()).scalar_as<arrow::DoubleScalar>().value;
> >>
> >> That said, I'm still keen to hear whether there's an advantage to using
> Datum or not; and about my second question regarding efficiently looping
> through a slice's values.
> >>
> >> On Mon, 19 Feb 2024 at 09:24, Blair Azzopardi <[email protected]>
> wrote:
> >>>
> >>> Hi
> >>>
> >>> I'm trying to figure out the optimal way to extract scalar values
> from a table; I've found two ways: using a dynamic cast, or using Datum and
> a cast. Is one better than the other? The advantage of the dynamic cast, it
> seems at least, is that it's a one-liner.
> >>>
> >>> auto c_val1 = table.GetColumnByName("Val1");
> >>> auto st_c_val1 = c_val1->GetScalar(0);
> >>> if (st_c_val1.ok()) {
> >>>
> >>>     // method 1 - via dyn cast
> >>>     auto val1 = std::dynamic_pointer_cast<arrow::DoubleScalar>(st_c_val1.ValueOrDie())->value;
> >>>
> >>>     // method 2 - via Datum & cast
> >>>     arrow::Datum val(st_c_val1.ValueOrDie());
> >>>     auto val2 = val.scalar_as<arrow::DoubleScalar>().value;
> >>> }
> >>>
> >>> Also, is there an efficient way to loop through a slice, perhaps by
> incrementing a pointer? I know a chunked array might mean that the
> underlying data isn't stored contiguously, so perhaps this is tricky to do.
> I imagine the compute functions might do this. Otherwise, it feels like each
> access to a value in memory requires calls to several functions
> (GetScalar/ok/ValueOrDie, etc.).
> >>>
> >>> Thanks in advance
> >>> Blair
>