Oh, but the short answer is that I'm using Add, Subtract, Divide,
Multiply, Power, and Absolute -- sometimes with both inputs being
ChunkedArrays, sometimes with one input being a ChunkedArray and the
other being a scalar.
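
For concreteness, here's roughly the shape of those calls (a minimal,
hedged sketch; the variable names are illustrative, not from my actual
code):

    #include <memory>

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    // chunked_a and chunked_b stand in for numeric ChunkedArrays.
    arrow::Status RunOps(const std::shared_ptr<arrow::ChunkedArray>& chunked_a,
                         const std::shared_ptr<arrow::ChunkedArray>& chunked_b) {
      // Both inputs are ChunkedArrays:
      ARROW_ASSIGN_OR_RAISE(
          auto sum, arrow::compute::CallFunction("add", {chunked_a, chunked_b}));

      // One input is a ChunkedArray, the other a scalar:
      ARROW_ASSIGN_OR_RAISE(
          auto squared,
          arrow::compute::CallFunction("power",
                                       {chunked_a, arrow::MakeScalar(2.0)}));

      // Unary case:
      ARROW_ASSIGN_OR_RAISE(
          auto magnitudes, arrow::compute::CallFunction("abs", {chunked_a}));

      return arrow::Status::OK();
    }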

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Mar 10, 2022 at 4:38 PM Aldrin <[email protected]> wrote:

> Hi Niranda!
>
> Sure thing, I've linked to my code. [1] is essentially the function being
> called, and [2] is an example of a wrapper function (more in that file) I
> wrote to reduce boilerplate (to make [1] more readable). But now that I
> look at [2] again--I wrote it before I really knew much about smart
> pointers--I wonder whether some of what I benchmarked is overhead from
> misusing C++ constructs.
>
> Thanks!
>
>
> [1]:
> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
> [2]:
> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]>
> wrote:
>
>> Hi Aldrin,
>>
>> It would be helpful to know what sort of compute operators you are using.
>>
>> On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:
>>
>>> I will work on a reproducible example.
>>>
>>> As a sneak peek, what I was seeing was the following (pasted in gmail,
>>> see [1] for markdown version):
>>>
>>> Table ID       Columns   Rows  Rows (slice)  Slice count  Time (ms) by slice  Time (ms) by table
>>> E-GEOD-100618      415  20631           299           69             644.065                 410
>>> E-GEOD-76312      2152  27120            48          565           25607.927                2953
>>> E-GEOD-106540     2145  24480            45          544           25193.507                3088
>>>
>>> Where "by slice" is total time, summed from running the function on each
>>> slice and "by table" is the time of just running the function on the table
>>> concatenated from each slice.
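>>>
>>> For concreteness, a minimal sketch of the two strategies, assuming
>>> slices is a std::vector<std::shared_ptr<arrow::Table>> of tables with
>>> two numeric columns (illustrative names, not my actual code):
>>>
>>>     // "by slice": run the kernel on each slice; sum the per-slice times
>>>     for (const auto& slice : slices) {
>>>       ARROW_ASSIGN_OR_RAISE(
>>>           auto per_slice,
>>>           arrow::compute::CallFunction(
>>>               "add", {slice->column(0), slice->column(1)}));
>>>     }
>>>
>>>     // "by table": concatenate the slices, then run the kernel once
>>>     ARROW_ASSIGN_OR_RAISE(auto table, arrow::ConcatenateTables(slices));
>>>     ARROW_ASSIGN_OR_RAISE(
>>>         auto whole_table,
>>>         arrow::compute::CallFunction(
>>>             "add", {table->column(0), table->column(1)}));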
>>>
>>> The difference was large (but not *so* large) for ~70 slices (~1.5x);
>>> but for ~550 slices (with ~6x fewer rows per slice and ~5x more columns)
>>> the difference became significant (~10x).
>>>
>>> I will follow up here when I have a reproducible example. I also
>>> started doing this before tensors were available, so I'll try to see how
>>> using them changes performance.
>>>
>>>
>>> [1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a
>>>
>>> Aldrin Montana
>>> Computer Science PhD Student
>>> UC Santa Cruz
>>>
>>>
>>> On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]>
>>> wrote:
>>>
>>>> As far as I know (and my knowledge here may be dated) the compute
>>>> kernels themselves do not do any concurrency.  There are certainly
>>>> compute kernels that could benefit from concurrency in this manner
>>>> (many kernels naively so) and I think things are set up so that, if we
>>>> decide to tackle this feature, we could do so in a systematic way
>>>> (instead of writing something for each kernel).
>>>>
>>>> I believe that kernels, if given a unique kernel context, should be
>>>> thread safe.
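>>>>
>>>> As a rough, untested sketch of that pattern (the names here are
>>>> illustrative):
>>>>
>>>>     #include <memory>
>>>>
>>>>     #include <arrow/compute/api.h>
>>>>
>>>>     // Each worker thread constructs its own ExecContext instead of
>>>>     // sharing one across threads.
>>>>     void Worker(const std::shared_ptr<arrow::ChunkedArray>& input) {
>>>>       arrow::compute::ExecContext ctx;  // thread-local kernel context
>>>>       auto result = arrow::compute::CallFunction("abs", {input}, &ctx);
>>>>     }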
>>>>
>>>> The streaming compute engine, on the other hand, does support
>>>> concurrency.  It is mostly driven by the scanner at the moment (e.g.
>>>> each batch we fetch from the scanner gets a fresh thread task for
>>>> running through the execution plan) but there is some intra-node
>>>> concurrency in the hash join and (I think) the hash aggregate nodes.
>>>> This has been sufficient to saturate cores on the benchmarks we run.
>>>> I know there is ongoing interest in understanding and improving our
>>>> concurrency here.
>>>>
>>>> The scanner supports concurrency.  It will typically fetch multiple
>>>> files at once and, for each file, it will fetch multiple batches at
>>>> once (assuming the file has more than one batch).
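>>>>
>>>> For example (a rough sketch; ScanAll is an illustrative name), the
>>>> scanner's threading is controlled on the scanner builder:
>>>>
>>>>     #include <arrow/dataset/api.h>
>>>>
>>>>     // Scan a dataset with parallel file/batch reads enabled.
>>>>     arrow::Result<std::shared_ptr<arrow::Table>> ScanAll(
>>>>         const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
>>>>       ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
>>>>       ARROW_RETURN_NOT_OK(builder->UseThreads(true));
>>>>       ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
>>>>       return scanner->ToTable();
>>>>     }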
>>>>
>>>> > I see a large difference between the total time to apply compute
>>>> > functions to a single table (concatenated from many small tables)
>>>> > compared to applying compute functions to each sub-table in the
>>>> > composition.
>>>>
>>>> Which one is better?  Can you share a reproducible example?
>>>>
>>>> On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:
>>>> >
>>>> > Hello!
>>>> >
>>>> > I'm wondering if there's any documentation that describes the
>>>> > concurrency/parallelism architecture of the compute API. I'd also be
>>>> > interested in recommended approaches for observing the performance of
>>>> > the threads Arrow uses--should I check processor IDs and infer
>>>> > performance, or are there particular tools that the community uses?
>>>> >
>>>> > Specifically, I am wondering if the concurrency is going to be
>>>> > different when using a ChunkedArray as an input compared to an Array,
>>>> > or for ChunkedArrays with various chunk sizes (1 chunk vs tens or
>>>> > hundreds). I see a large difference between the total time to apply
>>>> > compute functions to a single table (concatenated from many small
>>>> > tables) compared to applying compute functions to each sub-table in
>>>> > the composition. I'm trying to figure out where that difference may
>>>> > come from, and I'm wondering if it's related to parallelism within Arrow.
>>>> >
>>>> > I tried using the GitHub issues and JIRA issues (e.g. [1]) as a way
>>>> > to sleuth out the info, but I couldn't find anything. The pyarrow API
>>>> > seems to have functions I could use to figure it out (cpu_count and
>>>> > set_cpu_count), but that seems like a vague road.
>>>> >
>>>> > [1]: https://issues.apache.org/jira/browse/ARROW-12726
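>>>> >
>>>> > For reference, the closest C++ analogue I've found (if I understand
>>>> > it correctly) is the global CPU thread pool capacity, e.g.:
>>>> >
>>>> >     #include <arrow/util/thread_pool.h>
>>>> >
>>>> >     // Inspect/adjust Arrow's global CPU thread pool; this appears to
>>>> >     // be the C++ counterpart of pyarrow's cpu_count/set_cpu_count.
>>>> >     int capacity = arrow::GetCpuThreadPoolCapacity();
>>>> >     arrow::Status st = arrow::SetCpuThreadPoolCapacity(capacity / 2);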
>>>> >
>>>> >
>>>> > Thank you!
>>>> >
>>>> > Aldrin Montana
>>>> > Computer Science PhD Student
>>>> > UC Santa Cruz
>>>>
>>>
