Or are you slicing column-wise?

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>

On Thu, Mar 10, 2022 at 8:14 PM Niranda Perera <[email protected]> wrote:
From the looks of it, you are trying to calculate variance, mean, etc. over rows, aren't you?

I need to clarify a bit on this statement:

"Where "by slice" is total time, summed from running the function on each slice and "by table" is the time of just running the function on the table concatenated from each slice."

So, I assume you are originally using a `vector<shared_ptr<Table>> slices`. For the former case, you are passing each slice to `MeanAggr::Accumulate`, and for the latter case, you are calling `arrow::Concatenate(slices)` and passing the result as a single table?

On Thu, Mar 10, 2022 at 7:41 PM Aldrin <[email protected]> wrote:

Oh, but the short answer is that I'm using: Add, Subtract, Divide, Multiply, Power, and Absolute. Sometimes with both inputs being ChunkedArrays, sometimes with one input being a ChunkedArray and the other being a scalar.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Thu, Mar 10, 2022 at 4:38 PM Aldrin <[email protected]> wrote:

Hi Niranda!

Sure thing, I've linked to my code. [1] is essentially the function being called, and [2] is an example of a wrapper function (more in that file) I wrote to reduce boilerplate (to make [1] more readable). But, now that I look at [2] again, which I wrote before I really knew much about smart pointers, I wonder if some of what I benchmarked is overhead from misusing C++ structures?

Thanks!

[1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
[2]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]> wrote:

Hi Aldrin,

It would be helpful to know what sort of compute operators you are using.

On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:

I will work on a reproducible example.

As a sneak peek, what I was seeing was the following (pasted in Gmail, see [1] for the markdown version):

Table ID       Columns   Rows   Rows (slice)  Slice count  Time (ms); by slice  Time (ms); by table
E-GEOD-100618      415  20631            299           69              644.065                  410
E-GEOD-76312      2152  27120             48          565            25607.927                 2953
E-GEOD-106540     2145  24480             45          544            25193.507                 3088

Where "by slice" is the total time, summed from running the function on each slice, and "by table" is the time of just running the function on the table concatenated from each slice.

The difference was large (but not *so* large) for ~70 iterations (1.5x); but for ~550 iterations (and 6x fewer rows, 5x more columns) the difference became significant (~10x).

I will follow up here when I have a more reproducible example. I also started doing this before tensors were available, so I'll try to see how that changes performance.

[1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
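For concreteness, a minimal sketch of the two timing paths being compared, assuming the Arrow C++ compute API. The names `MeanBySlice`, `MeanByTable`, and `column_index` are illustrative stand-ins, not the actual `MeanAggr` code linked above, and the "by table" path assumes `arrow::ConcatenateTables` (the table-level counterpart of the `arrow::Concatenate` mentioned earlier) with default options:

    #include <memory>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    namespace cp = arrow::compute;

    // Path 1 ("by slice"): invoke the kernel once per slice, summing the times.
    arrow::Result<arrow::Datum> MeanBySlice(
        const std::vector<std::shared_ptr<arrow::Table>>& slices,
        int column_index) {
      arrow::Datum last;
      for (const auto& slice : slices) {
        // One kernel invocation per slice; per-call overhead is paid each time.
        ARROW_ASSIGN_OR_RAISE(
            last, cp::CallFunction("mean", {slice->column(column_index)}));
      }
      return last;  // A real aggregator would fold the per-slice results together.
    }

    // Path 2 ("by table"): concatenate first, then invoke the kernel once.
    arrow::Result<arrow::Datum> MeanByTable(
        const std::vector<std::shared_ptr<arrow::Table>>& slices,
        int column_index) {
      ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table,
                            arrow::ConcatenateTables(slices));
      return cp::CallFunction("mean", {table->column(column_index)});
    }

Note that with default options `arrow::ConcatenateTables` stitches chunk lists rather than copying values, so the "by table" kernel still sees a ChunkedArray with as many chunks as there were slices; the path differs mainly in invoking the kernel once instead of hundreds of times, which is one place the gap could come from.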
On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]> wrote:

As far as I know (and my knowledge here may be dated) the compute kernels themselves do not do any concurrency. There are certainly compute kernels that could benefit from concurrency in this manner (many kernels naively so) and I think things are set up so that, if we decide to tackle this feature, we could do so in a systematic way (instead of writing something for each kernel).

I believe that kernels, if given a unique kernel context, should be thread safe.

The streaming compute engine, on the other hand, does support concurrency. It is mostly driven by the scanner at the moment (e.g. each batch we fetch from the scanner gets a fresh thread task for running through the execution plan) but there is some intra-node concurrency in the hash join and (I think) the hash aggregate nodes. This has been sufficient to saturate cores on the benchmarks we run. I know there is ongoing interest in understanding and improving our concurrency here.

The scanner supports concurrency. It will typically fetch multiple files at once and, for each file, it will fetch multiple batches at once (assuming the file has more than one batch).

> I see a large difference between the total time to apply compute functions to a single table (concatenated from many small tables) compared to applying compute functions to each sub-table in the composition.

Which one is better? Can you share a reproducible example?
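As a rough illustration of the "unique kernel context" point, a sketch (not from the thread's code) that runs a kernel over each slice on its own thread, giving each thread its own `cp::ExecContext`. The "add" kernel, the scalar operand 1.0, and the name `AddScalarPerSlice` are placeholders for whichever of the operators above is in play:

    #include <memory>
    #include <thread>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    namespace cp = arrow::compute;

    void AddScalarPerSlice(
        const std::vector<std::shared_ptr<arrow::Table>>& slices,
        int column_index) {
      std::vector<std::thread> workers;
      workers.reserve(slices.size());
      for (const auto& slice : slices) {
        workers.emplace_back([slice, column_index]() {
          // A separate ExecContext per thread, per the thread-safety note above.
          cp::ExecContext ctx(arrow::default_memory_pool());
          auto result = cp::CallFunction(
              "add", {slice->column(column_index), arrow::Datum(1.0)}, &ctx);
          // ... check result.status() and hand *result back to the caller ...
        });
      }
      for (auto& worker : workers) worker.join();
    }

Whether this helps for a given workload depends on the kernel and the slice sizes; it is essentially the manual counterpart of the per-batch parallelism the streaming engine drives.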
On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:

Hello!

I'm wondering if there's any documentation that describes the concurrency/parallelism architecture for the compute API. I'd also be interested if there are recommended approaches for observing the performance of threads used by Arrow. Should I try to check a processor ID and infer performance, or are there particular tools that the community uses?

Specifically, I am wondering if the concurrency is going to be different when using a ChunkedArray as an input compared to an Array, or for ChunkedArrays with various chunk sizes (1 chunk vs tens or hundreds). I see a large difference between the total time to apply compute functions to a single table (concatenated from many small tables) compared to applying compute functions to each sub-table in the composition. I'm trying to figure out where that difference may come from and I'm wondering if it's related to parallelism within Arrow.

I tried using the GitHub issues and JIRA issues (e.g. [1]) as a way to sleuth the info, but I couldn't find anything. The pyarrow API seems to have functions I could try and use to figure it out (cpu_count and set_cpu_count), but that seems like a vague road.

[1]: https://issues.apache.org/jira/browse/ARROW-12726

Thank you!

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
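On the cpu_count / set_cpu_count point: those pyarrow functions report and resize Arrow's global CPU thread pool, which is also reachable from C++. A small sketch for checking or pinning the capacity while benchmarking (assuming arrow/util/thread_pool.h is available):

    #include <iostream>

    #include <arrow/status.h>
    #include <arrow/util/thread_pool.h>

    int main() {
      // Capacity of the global CPU thread pool (what pyarrow.cpu_count reports).
      std::cout << "CPU thread pool capacity: "
                << arrow::GetCpuThreadPoolCapacity() << std::endl;

      // Pinning the pool to one thread can help rule Arrow-internal
      // parallelism in or out while timing.
      arrow::Status st = arrow::SetCpuThreadPoolCapacity(1);
      if (!st.ok()) {
        std::cerr << st.ToString() << std::endl;
      }
      return 0;
    }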
