Hi Niranda!

Sure thing, I've linked to my code below. [1] is essentially the function
being called, and [2] is an example of a wrapper function (there are more in
that file) that I wrote to reduce boilerplate and make [1] more readable.
But now that I look at [2] again, which I wrote before I knew much about
smart pointers, I wonder whether some of what I benchmarked is overhead from
misusing C++ structures.
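
To illustrate what I mean, here is a minimal sketch (not the actual code in
[2]): taking a shared_ptr by value copies it and bumps the atomic refcount
on every call, while a const reference avoids that per-call cost:

    #include <memory>
    #include <vector>

    // Hypothetical example, not the wrapper in [2]. SumByValue copies the
    // shared_ptr (an atomic refcount increment/decrement per call), while
    // SumByRef borrows it with no refcount traffic.
    int SumByValue(std::shared_ptr<std::vector<int>> vals) {
      int total = 0;
      for (int v : *vals) total += v;
      return total;
    }

    int SumByRef(const std::shared_ptr<std::vector<int>>& vals) {
      int total = 0;
      for (int v : *vals) total += v;
      return total;
    }

In a loop over hundreds of slices that overhead is small but nonzero, so
I'll double-check whether my wrappers do this.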

Thanks!


[1]:
https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
[2]:
https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]>
wrote:

> Hi Aldrin,
>
> It would be helpful to know what sort of compute operators you are using.
>
> On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:
>
>> I will work on a reproducible example.
>>
>> As a sneak peek, here is what I was seeing (pasted in gmail; see [1]
>> for a markdown version):
>>
>> Table ID        Columns    Rows   Rows (slice)   Slice count   Time (ms, by slice)   Time (ms, by table)
>> E-GEOD-100618       415   20631            299            69               644.065                   410
>> E-GEOD-76312       2152   27120             48           565             25607.927                  2953
>> E-GEOD-106540      2145   24480             45           544             25193.507                  3088
>>
>> Where "by slice" is total time, summed from running the function on each
>> slice and "by table" is the time of just running the function on the table
>> concatenated from each slice.
>>
>> The difference was large (but not *so* large) for ~70 slices (about
>> 1.5x); but for ~550 slices (with ~6x fewer rows per slice and ~5x more
>> columns) the difference became significant (~10x).
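>>
>> Concretely, the two timings were measured roughly like this (a sketch;
>> Process() is a hypothetical stand-in for the real function in [1]):
>>
>>     #include <chrono>
>>     #include <memory>
>>     #include <vector>
>>     #include <arrow/api.h>
>>
>>     // Hypothetical stand-in for the compute function in [1].
>>     void Process(const std::shared_ptr<arrow::Table>& table);
>>
>>     // "by slice": sum of per-slice times.
>>     double TimeBySlice(
>>         const std::vector<std::shared_ptr<arrow::Table>>& slices) {
>>       double total_ms = 0.0;
>>       for (const auto& slice : slices) {
>>         auto start = std::chrono::steady_clock::now();
>>         Process(slice);
>>         total_ms += std::chrono::duration<double, std::milli>(
>>             std::chrono::steady_clock::now() - start).count();
>>       }
>>       return total_ms;
>>     }
>>
>>     // "by table": one timed call on the concatenated table.
>>     double TimeByTable(
>>         const std::vector<std::shared_ptr<arrow::Table>>& slices) {
>>       auto table = arrow::ConcatenateTables(slices).ValueOrDie();
>>       auto start = std::chrono::steady_clock::now();
>>       Process(table);
>>       return std::chrono::duration<double, std::milli>(
>>           std::chrono::steady_clock::now() - start).count();
>>     }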
>>
>> I will follow up here when I have a reproducible example. I also
>> started doing this before tensors were available, so I'll try to see
>> how that changes performance.
>>
>>
>> [1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>>
>> On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]>
>> wrote:
>>
>>> As far as I know (and my knowledge here may be dated) the compute
>>> kernels themselves do not do any concurrency.  There are certainly
>>> compute kernels that could benefit from concurrency in this manner
>>> (many kernels naively so) and I think things are set up so that, if we
>>> decide to tackle this feature, we could do so in a systematic way
>>> (instead of writing something for each kernel).
>>>
>>> I believe that kernels, if given a unique kernel context, should be
>>> thread safe.
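>>>
>>> For example, a minimal sketch (assuming each worker thread constructs
>>> its own ExecContext before invoking a kernel):
>>>
>>>     #include <thread>
>>>     #include <vector>
>>>     #include <arrow/api.h>
>>>     #include <arrow/compute/api.h>
>>>
>>>     // Sketch: run the "sum" kernel over each chunk on its own thread,
>>>     // giving every thread a private ExecContext.
>>>     void SumChunksInParallel(
>>>         const std::vector<std::shared_ptr<arrow::Array>>& chunks) {
>>>       std::vector<std::thread> workers;
>>>       for (const auto& chunk : chunks) {
>>>         workers.emplace_back([chunk] {
>>>           arrow::compute::ExecContext ctx;  // per-thread context
>>>           auto result =
>>>               arrow::compute::CallFunction("sum", {chunk}, &ctx);
>>>           // ... check result.status() and consume *result as needed
>>>         });
>>>       }
>>>       for (auto& w : workers) w.join();
>>>     }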
>>>
>>> The streaming compute engine, on the other hand, does support
>>> concurrency.  It is mostly driven by the scanner at the moment (e.g.
>>> each batch we fetch from the scanner gets a fresh thread task for
>>> running through the execution plan) but there is some intra-node
>>> concurrency in the hash join and (I think) the hash aggregate nodes.
>>> This has been sufficient to saturate cores on the benchmarks we run.
>>> I know there is ongoing interest in understanding and improving our
>>> concurrency here.
>>>
>>> The scanner supports concurrency.  It will typically fetch multiple
>>> files at once and, for each file, it will fetch multiple batches at
>>> once (assuming the file has more than one batch).
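>>>
>>> For example (a sketch assuming an already-constructed
>>> arrow::dataset::Dataset):
>>>
>>>     #include <arrow/api.h>
>>>     #include <arrow/dataset/api.h>
>>>
>>>     // Sketch: scan a dataset with the scanner's built-in parallelism
>>>     // enabled, materializing everything into one Table.
>>>     arrow::Result<std::shared_ptr<arrow::Table>> ScanAll(
>>>         const std::shared_ptr<arrow::dataset::Dataset>& dataset) {
>>>       ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
>>>       ARROW_RETURN_NOT_OK(builder->UseThreads(true));
>>>       ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
>>>       return scanner->ToTable();
>>>     }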
>>>
>>> > I see a large difference between the total time to apply compute
>>> functions to a single table (concatenated from many small tables) compared
>>> to applying compute functions to each sub-table in the composition.
>>>
>>> Which one is better?  Can you share a reproducible example?
>>>
>>> On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:
>>> >
>>> > Hello!
>>> >
>>> > I'm wondering if there's any documentation that describes the
>>> concurrency/parallelism architecture for the compute API. I'd also be
>>> interested in any recommended approaches for observing the performance
>>> of threads used by Arrow--should I try to check a processor ID and
>>> infer performance, or are there particular tools that the community
>>> uses?
>>> >
>>> > Specifically, I am wondering if the concurrency is going to be
>>> different when using a ChunkedArray as an input compared to an Array or for
>>> ChunkedArrays with various chunk sizes (1 chunk vs tens or hundreds). I see
>>> a large difference between the total time to apply compute functions to a
>>> single table (concatenated from many small tables) compared to applying
>>> compute functions to each sub-table in the composition. I'm trying to
>>> figure out where that difference may come from and I'm wondering if it's
>>> related to parallelism within Arrow.
>>> >
>>> > I tried using the github issues and JIRA issues (e.g. [1]) as a way
>>> to sleuth out the info, but I couldn't find anything. The pyarrow API
>>> seems to have functions I could try to use to figure it out (cpu_count
>>> and set_cpu_count), but that seems like a vague road.
>>> >
>>> > [1]: https://issues.apache.org/jira/browse/ARROW-12726
>>> >
>>> >
>>> > Thank you!
>>> >
>>> > Aldrin Montana
>>> > Computer Science PhD Student
>>> > UC Santa Cruz
>>>
>>
