Or are you slicing column-wise?

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>

On Thu, Mar 10, 2022 at 8:14 PM Niranda Perera <[email protected]> wrote:
From the looks of it, you are trying to calculate variance, mean, etc. over rows, aren't you?

I need to clarify a bit on this statement:

"Where "by slice" is total time, summed from running the function on each slice and "by table" is the time of just running the function on the table concatenated from each slice."

So, I assume you are originally using a `vector<shared_ptr<Table>> slices`. For the former case, you are passing each slice to `MeanAggr::Accumulate`, and for the latter case, you are calling `arrow::Concatenate(slices)` and passing the result as a single table?

On Thu, Mar 10, 2022 at 7:41 PM Aldrin <[email protected]> wrote:

Oh, but the short answer is that I'm using: Add, Subtract, Divide, Multiply, Power, and Absolute. Sometimes with both inputs being ChunkedArrays, sometimes with one input being a ChunkedArray and the other being a scalar.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Thu, Mar 10, 2022 at 4:38 PM Aldrin <[email protected]> wrote:

Hi Niranda!

Sure thing, I've linked to my code. [1] is essentially the function being called, and [2] is an example of a wrapper function (more in that file) I wrote to reduce boilerplate (to make [1] more readable). But, now that I look at [2] again, which I wrote before I really knew much about smart pointers, I wonder if some of what I benchmarked is overhead from misusing C++ structures?

Thanks!

[1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
[2]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]> wrote:

Hi Aldrin,

It would be helpful to know what sort of compute operators you are using.

On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:

I will work on a reproducible example.

As a sneak peek, what I was seeing was the following (pasted in Gmail, see [1] for the markdown version):

Table ID       Columns   Rows   Rows (slice)  Slice count  Time (ms); by slice  Time (ms); by table
E-GEOD-100618      415  20631            299           69              644.065                  410
E-GEOD-76312      2152  27120             48          565            25607.927                 2953
E-GEOD-106540     2145  24480             45          544            25193.507                 3088

Where "by slice" is the total time, summed from running the function on each slice, and "by table" is the time of just running the function on the table concatenated from each slice.

The difference was large (but not *so* large) for ~70 iterations (1.5x); but for ~550 iterations (and 6x fewer rows, 5x more columns) the difference became significant (~10x).

I will follow up here when I have a more reproducible example. I also started doing this before tensors were available, so I'll try to see how that changes performance.

[1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
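For concreteness, a minimal sketch of the two timing paths being compared, assuming the Arrow C++ compute API. The names `MeanBySlice`, `MeanByTable`, and `column_index` are illustrative stand-ins, not the actual `MeanAggr` code linked above, and the "by table" path assumes `arrow::ConcatenateTables` (the table-level counterpart of the `arrow::Concatenate` mentioned earlier) with default options:

    #include <memory>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    namespace cp = arrow::compute;

    // Path 1 ("by slice"): invoke the kernel once per slice, summing the times.
    arrow::Result<arrow::Datum> MeanBySlice(
        const std::vector<std::shared_ptr<arrow::Table>>& slices,
        int column_index) {
      arrow::Datum last;
      for (const auto& slice : slices) {
        // One kernel invocation per slice; per-call overhead is paid each time.
        ARROW_ASSIGN_OR_RAISE(
            last, cp::CallFunction("mean", {slice->column(column_index)}));
      }
      return last;  // A real aggregator would fold the per-slice results together.
    }

    // Path 2 ("by table"): concatenate first, then invoke the kernel once.
    arrow::Result<arrow::Datum> MeanByTable(
        const std::vector<std::shared_ptr<arrow::Table>>& slices,
        int column_index) {
      ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table,
                            arrow::ConcatenateTables(slices));
      return cp::CallFunction("mean", {table->column(column_index)});
    }

Note that with default options `arrow::ConcatenateTables` stitches chunk lists rather than copying values, so the "by table" kernel still sees a ChunkedArray with as many chunks as there were slices; the path differs mainly in invoking the kernel once instead of hundreds of times, which is one place the gap could come from.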
On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]> wrote:

As far as I know (and my knowledge here may be dated) the compute kernels themselves do not do any concurrency. There are certainly compute kernels that could benefit from concurrency in this manner (many kernels naively so) and I think things are set up so that, if we decide to tackle this feature, we could do so in a systematic way (instead of writing something for each kernel).

I believe that kernels, if given a unique kernel context, should be thread safe.

The streaming compute engine, on the other hand, does support concurrency. It is mostly driven by the scanner at the moment (e.g. each batch we fetch from the scanner gets a fresh thread task for running through the execution plan) but there is some intra-node concurrency in the hash join and (I think) the hash aggregate nodes. This has been sufficient to saturate cores on the benchmarks we run. I know there is ongoing interest in understanding and improving our concurrency here.

The scanner supports concurrency. It will typically fetch multiple files at once and, for each file, it will fetch multiple batches at once (assuming the file has more than one batch).

> I see a large difference between the total time to apply compute functions to a single table (concatenated from many small tables) compared to applying compute functions to each sub-table in the composition.

Which one is better? Can you share a reproducible example?
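As a rough illustration of the "unique kernel context" point, a sketch (not from the thread's code) that runs a kernel over each slice on its own thread, giving each thread its own `cp::ExecContext`. The "add" kernel, the scalar operand 1.0, and the name `AddScalarPerSlice` are placeholders for whichever of the operators above is in play:

    #include <memory>
    #include <thread>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    namespace cp = arrow::compute;

    void AddScalarPerSlice(
        const std::vector<std::shared_ptr<arrow::Table>>& slices,
        int column_index) {
      std::vector<std::thread> workers;
      workers.reserve(slices.size());
      for (const auto& slice : slices) {
        workers.emplace_back([slice, column_index]() {
          // A separate ExecContext per thread, per the thread-safety note above.
          cp::ExecContext ctx(arrow::default_memory_pool());
          auto result = cp::CallFunction(
              "add", {slice->column(column_index), arrow::Datum(1.0)}, &ctx);
          // ... check result.status() and hand *result back to the caller ...
        });
      }
      for (auto& worker : workers) worker.join();
    }

Whether this helps for a given workload depends on the kernel and the slice sizes; it is essentially the manual counterpart of the per-batch parallelism the streaming engine drives.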
On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:

Hello!

I'm wondering if there's any documentation that describes the concurrency/parallelism architecture for the compute API. I'd also be interested if there are recommended approaches for observing the performance of threads used by Arrow. Should I try to check a processor ID and infer performance, or are there particular tools that the community uses?

Specifically, I am wondering if the concurrency is going to be different when using a ChunkedArray as an input compared to an Array, or for ChunkedArrays with various chunk sizes (1 chunk vs tens or hundreds). I see a large difference between the total time to apply compute functions to a single table (concatenated from many small tables) compared to applying compute functions to each sub-table in the composition. I'm trying to figure out where that difference may come from and I'm wondering if it's related to parallelism within Arrow.

I tried using the GitHub issues and JIRA issues (e.g. [1]) as a way to sleuth the info, but I couldn't find anything. The pyarrow API seems to have functions I could try and use to figure it out (cpu_count and set_cpu_count), but that seems like a vague road.

[1]: https://issues.apache.org/jira/browse/ARROW-12726

Thank you!

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
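On the cpu_count / set_cpu_count point: those pyarrow functions report and resize Arrow's global CPU thread pool, which is also reachable from C++. A small sketch for checking or pinning the capacity while benchmarking (assuming arrow/util/thread_pool.h is available):

    #include <iostream>

    #include <arrow/status.h>
    #include <arrow/util/thread_pool.h>

    int main() {
      // Capacity of the global CPU thread pool (what pyarrow.cpu_count reports).
      std::cout << "CPU thread pool capacity: "
                << arrow::GetCpuThreadPoolCapacity() << std::endl;

      // Pinning the pool to one thread can help rule Arrow-internal
      // parallelism in or out while timing.
      arrow::Status st = arrow::SetCpuThreadPoolCapacity(1);
      if (!st.ok()) {
        std::cerr << st.ToString() << std::endl;
      }
      return 0;
    }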
