From the looks of it, you are trying to calculate variance, mean, etc. over rows. Is that right?
I need to clarify a bit on this statement:

"Where 'by slice' is total time, summed from running the function on each
slice and 'by table' is the time of just running the function on the table
concatenated from each slice."

So, I assume you are originally using a `vector<shared_ptr<Table>> slices`.
For the former case, you are passing each slice to `MeanAggr::Accumulate`,
and for the latter case, you are calling `arrow::Concatenate(slices)` and
passing the result as a single table?

On Thu, Mar 10, 2022 at 7:41 PM Aldrin <[email protected]> wrote:

> Oh, but the short answer is that I'm using: Add, Subtract, Divide,
> Multiply, Power, and Absolute. Sometimes with both inputs being
> ChunkedArrays, sometimes with 1 input being a ChunkedArray and the other
> being a scalar.
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Thu, Mar 10, 2022 at 4:38 PM Aldrin <[email protected]> wrote:
>
>> Hi Niranda!
>>
>> Sure thing, I've linked to my code. [1] is essentially the function
>> being called, and [2] is an example of a wrapper function (more in that
>> file) I wrote to reduce boilerplate (to make [1] more readable). But,
>> now that I look at [2] again, which I wrote before I really knew much
>> about smart pointers, I wonder if some of what I benchmarked is
>> overhead from misusing C++ structures?
>>
>> Thanks!
>>
>> [1]:
>> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
>> [2]:
>> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>> On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]>
>> wrote:
>>
>>> Hi Aldrin,
>>>
>>> It would be helpful to know what sort of compute operators you are
>>> using.
>>>
>>> On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:
>>>
>>>> I will work on a reproducible example.
>>>>
>>>> As a sneak peek, what I was seeing was the following (pasted in
>>>> gmail, see [1] for markdown version):
>>>>
>>>> Table ID        Columns   Rows    Rows (slice)   Slice count   Time (ms) total; by slice   Time (ms) total; by table
>>>> E-GEOD-100618       415   20631            299            69                     644.065                         410
>>>> E-GEOD-76312       2152   27120             48           565                   25607.927                        2953
>>>> E-GEOD-106540      2145   24480             45           544                   25193.507                        3088
>>>>
>>>> Where "by slice" is the total time, summed from running the function
>>>> on each slice, and "by table" is the time of just running the
>>>> function on the table concatenated from each slice.
>>>>
>>>> The difference was large (but not *so* large) for ~70 iterations
>>>> (1.5x); but for ~550 iterations (and 6x fewer rows, 5x more columns)
>>>> the difference became significant (~10x).
>>>>
>>>> I will follow up here when I have a more reproducible example. I also
>>>> started doing this before tensors were available, so I'll try to see
>>>> how that changes performance.
>>>>
>>>> [1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a
>>>>
>>>> Aldrin Montana
>>>> Computer Science PhD Student
>>>> UC Santa Cruz
>>>>
>>>> On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]>
>>>> wrote:
>>>>
>>>>> As far as I know (and my knowledge here may be dated), the compute
>>>>> kernels themselves do not do any concurrency. There are certainly
>>>>> compute kernels that could benefit from concurrency in this manner
>>>>> (many kernels naively so), and I think things are set up so that, if
>>>>> we decide to tackle this feature, we could do so in a systematic way
>>>>> (instead of writing something for each kernel).
>>>>>
>>>>> I believe that kernels, if given a unique kernel context, should be
>>>>> thread safe.
>>>>>
>>>>> The streaming compute engine, on the other hand, does support
>>>>> concurrency.
>>>>> It is mostly driven by the scanner at the moment (e.g. each batch
>>>>> we fetch from the scanner gets a fresh thread task for running
>>>>> through the execution plan), but there is some intra-node
>>>>> concurrency in the hash join and (I think) the hash aggregate nodes.
>>>>> This has been sufficient to saturate cores on the benchmarks we run.
>>>>> I know there is ongoing interest in understanding and improving our
>>>>> concurrency here.
>>>>>
>>>>> The scanner supports concurrency. It will typically fetch multiple
>>>>> files at once and, for each file, it will fetch multiple batches at
>>>>> once (assuming the file has more than one batch).
>>>>>
>>>>> > I see a large difference between the total time to apply compute
>>>>> > functions to a single table (concatenated from many small tables)
>>>>> > compared to applying compute functions to each sub-table in the
>>>>> > composition.
>>>>>
>>>>> Which one is better? Can you share a reproducible example?
>>>>>
>>>>> On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:
>>>>> >
>>>>> > Hello!
>>>>> >
>>>>> > I'm wondering if there's any documentation that describes the
>>>>> > concurrency/parallelism architecture for the compute API. I'd also
>>>>> > be interested in recommended approaches for observing the
>>>>> > performance of threads used by Arrow. Should I try to check a
>>>>> > processor ID and infer performance, or are there particular tools
>>>>> > that the community uses?
>>>>> >
>>>>> > Specifically, I am wondering if the concurrency is going to be
>>>>> > different when using a ChunkedArray as an input compared to an
>>>>> > Array, or for ChunkedArrays with various chunk sizes (1 chunk vs.
>>>>> > tens or hundreds). I see a large difference between the total time
>>>>> > to apply compute functions to a single table (concatenated from
>>>>> > many small tables) compared to applying compute functions to each
>>>>> > sub-table in the composition.
>>>>> > I'm trying to figure out where that difference may come from, and
>>>>> > I'm wondering if it's related to parallelism within Arrow.
>>>>> >
>>>>> > I tried using the GitHub issues and JIRA issues (e.g. [1]) as a
>>>>> > way to sleuth the info, but I couldn't find anything. The pyarrow
>>>>> > API seems to have functions I could try to use to figure it out
>>>>> > (cpu_count and set_cpu_count), but that seems like a vague road.
>>>>> >
>>>>> > [1]: https://issues.apache.org/jira/browse/ARROW-12726
>>>>> >
>>>>> > Thank you!
>>>>> >
>>>>> > Aldrin Montana
>>>>> > Computer Science PhD Student
>>>>> > UC Santa Cruz

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
