You're correct about the first clarification. I am not (currently) slicing column-wise.
And yes, I am calculating variance, mean, etc. so that I can calculate the t-statistic.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Mar 10, 2022 at 5:16 PM Niranda Perera <[email protected]> wrote:

> Or are you slicing column-wise?


On Thu, Mar 10, 2022 at 8:14 PM Niranda Perera <[email protected]> wrote:

> From the looks of it, you are trying to calculate variance, mean, etc.
> over rows, aren't you?
>
> I need to clarify one statement a bit:
>
> "Where 'by slice' is total time, summed from running the function on
> each slice, and 'by table' is the time of just running the function on
> the table concatenated from each slice."
>
> So, I assume you are originally using a `vector<shared_ptr<Table>>
> slices`. For the former case, you are passing each slice to
> `MeanAggr::Accumulate`, and for the latter case, you are calling
> arrow::Concatenate(slices) and passing the result as a single table?
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
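To make the two timed paths concrete, here is a minimal sketch under the setup Niranda describes, i.e. a `vector<shared_ptr<Table>> slices`. The `MeanAggr` stub and its `Accumulate` signature are assumptions standing in for the real type in statops.cpp; also note that for `Table`s the C++ call is `arrow::ConcatenateTables` (`arrow::Concatenate` is the `Array`-level variant).

```cpp
#include <memory>
#include <vector>

#include <arrow/api.h>

// Stub standing in for the MeanAggr mentioned above; the real
// implementation lives in statops.cpp. It exists here only to make the
// sketch self-contained.
struct MeanAggr {
  void Accumulate(const std::shared_ptr<arrow::Table>& table) { /* ... */ }
};

// "By slice": run the accumulation once per slice; the reported time is
// the sum over all calls.
void AccumulateBySlice(MeanAggr* aggr,
                       const std::vector<std::shared_ptr<arrow::Table>>& slices) {
  for (const auto& slice : slices) {
    aggr->Accumulate(slice);
  }
}

// "By table": concatenate the slices into a single Table, then accumulate
// once over the result.
arrow::Status AccumulateByTable(MeanAggr* aggr,
                                const std::vector<std::shared_ptr<arrow::Table>>& slices) {
  ARROW_ASSIGN_OR_RAISE(auto table, arrow::ConcatenateTables(slices));
  aggr->Accumulate(table);
  return arrow::Status::OK();
}
```

The timings in the table further down compare the total wall time of the loop in the first path against the single call in the second.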
On Thu, Mar 10, 2022 at 7:41 PM Aldrin <[email protected]> wrote:

> Oh, but the short answer is that I'm using: Add, Subtract, Divide,
> Multiply, Power, and Absolute. Sometimes with both inputs being
> ChunkedArrays, sometimes with one input being a ChunkedArray and the
> other being a scalar.
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
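For context, a sketch of what those calls can look like through the generic `arrow::compute::CallFunction` interface, mixing chunked and scalar inputs; the particular expression computed here is illustrative rather than the actual pipeline from the linked code:

```cpp
#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// Computes abs((a - b) ^ 2) elementwise, exercising both input shapes
// described above: ChunkedArray/ChunkedArray and ChunkedArray/scalar.
arrow::Result<arrow::Datum> SketchOps(
    const std::shared_ptr<arrow::ChunkedArray>& a,
    const std::shared_ptr<arrow::ChunkedArray>& b) {
  // Both inputs are ChunkedArrays.
  ARROW_ASSIGN_OR_RAISE(auto diff, cp::CallFunction("subtract", {a, b}));

  // One ChunkedArray input, one scalar input.
  ARROW_ASSIGN_OR_RAISE(auto squared,
                        cp::CallFunction("power", {diff, arrow::Datum(2.0)}));

  return cp::CallFunction("abs", {squared});
}
```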
On Thu, Mar 10, 2022 at 4:38 PM Aldrin <[email protected]> wrote:

> Hi Niranda!
>
> Sure thing, I've linked to my code. [1] is essentially the function
> being called, and [2] is an example of a wrapper function (more in that
> file) I wrote to reduce boilerplate (to make [1] more readable). But,
> now that I look at [2] again, which I wrote before I really knew much
> about smart pointers, I wonder if some of what I benchmarked is
> overhead from misusing C++ structures?
>
> Thanks!
>
> [1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
> [2]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz


On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]> wrote:

> Hi Aldrin,
>
> It would be helpful to know what sort of compute operators you are
> using.


On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:

> I will work on a reproducible example.
>
> As a sneak peek, what I was seeing was the following (pasted in Gmail;
> see [1] for the markdown version):
>
>   Table ID        Columns    Rows   Rows (slice)   Slice count   Time (ms), by slice   Time (ms), by table
>   E-GEOD-100618       415   20631            299            69               644.065                   410
>   E-GEOD-76312       2152   27120             48           565             25607.927                  2953
>   E-GEOD-106540      2145   24480             45           544             25193.507                  3088
>
> where "by slice" is the total time summed from running the function on
> each slice, and "by table" is the time of running the function once on
> the table concatenated from those slices.
>
> The difference was large (but not *so* large) for ~70 slices (1.5x);
> but for ~550 slices (with 6x fewer rows per slice and 5x more columns),
> the difference became significant (~10x).
>
> I will follow up here when I have a more reproducible example. I also
> started doing this before tensors were available, so I'll try to see
> how that changes performance.
>
> [1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz


On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]> wrote:

> As far as I know (and my knowledge here may be dated), the compute
> kernels themselves do not do any concurrency. There are certainly
> compute kernels that could benefit from concurrency in this manner
> (many kernels naively so), and I think things are set up so that, if we
> decide to tackle this feature, we could do so in a systematic way
> (instead of writing something for each kernel).
>
> I believe that kernels, if given a unique kernel context, should be
> thread-safe.
>
> The streaming compute engine, on the other hand, does support
> concurrency. It is mostly driven by the scanner at the moment (e.g.,
> each batch we fetch from the scanner gets a fresh thread task for
> running through the execution plan), but there is some intra-node
> concurrency in the hash join and (I think) the hash aggregate nodes.
> This has been sufficient to saturate cores on the benchmarks we run. I
> know there is ongoing interest in understanding and improving our
> concurrency here.
>
> The scanner supports concurrency. It will typically fetch multiple
> files at once and, for each file, it will fetch multiple batches at
> once (assuming the file has more than one batch).
>
> > I see a large difference between the total time to apply compute
> > functions to a single table (concatenated from many small tables)
> > compared to applying compute functions to each sub-table in the
> > composition.
>
> Which one is better? Can you share a reproducible example?


On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:

> Hello!
>
> I'm wondering if there's any documentation that describes the
> concurrency/parallelism architecture for the compute API. I'd also be
> interested if there are recommended approaches for observing the
> performance of threads used by Arrow--should I try to check a
> processor ID and infer performance, or are there particular tools that
> the community uses?
>
> Specifically, I am wondering if the concurrency is going to be
> different when using a ChunkedArray as an input compared to an Array,
> or for ChunkedArrays with various chunk counts (1 chunk vs. tens or
> hundreds). I see a large difference between the total time to apply
> compute functions to a single table (concatenated from many small
> tables) compared to applying compute functions to each sub-table in
> the composition. I'm trying to figure out where that difference may
> come from, and I'm wondering if it's related to parallelism within
> Arrow.
>
> I tried using the GitHub issues and JIRA issues (e.g. [1]) as a way to
> sleuth the info, but I couldn't find anything. The pyarrow API seems
> to have functions I could try to use to figure it out (cpu_count and
> set_cpu_count), but that seems like a vague road.
>
> [1]: https://issues.apache.org/jira/browse/ARROW-12726
>
> Thank you!
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
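As a concrete reading of the "unique kernel context" caveat in Weston's message above, here is a sketch in which each thread constructs its own `arrow::compute::ExecContext` rather than sharing one; treat this as an illustration of that statement under stated assumptions, not a documented recipe:

```cpp
#include <memory>
#include <thread>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// Applies "abs" to each chunked column on its own thread, giving every
// thread a private ExecContext instead of sharing one across threads.
void AbsPerColumn(const std::vector<std::shared_ptr<arrow::ChunkedArray>>& columns) {
  std::vector<std::thread> workers;
  workers.reserve(columns.size());
  for (const auto& column : columns) {
    workers.emplace_back([column]() {
      cp::ExecContext ctx(arrow::default_memory_pool());  // per-thread context
      auto result = cp::CallFunction("abs", {column}, &ctx);
      if (!result.ok()) {
        // Error handling elided; a real version would surface result.status().
      }
    });
  }
  for (auto& worker : workers) {
    worker.join();
  }
}
```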

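On the cpu_count/set_cpu_count question at the end of the thread: those pyarrow functions correspond to Arrow's process-wide CPU thread pool, which can also be inspected and resized from C++. A minimal sketch (the capacity of 4 is an arbitrary example):

```cpp
#include <iostream>

#include <arrow/status.h>
#include <arrow/util/thread_pool.h>

int main() {
  // The C++ counterparts of pyarrow's cpu_count() / set_cpu_count().
  std::cout << "CPU thread pool capacity: "
            << arrow::GetCpuThreadPoolCapacity() << std::endl;

  // Resize the pool; the returned Status should be checked.
  arrow::Status st = arrow::SetCpuThreadPoolCapacity(4);
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
```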