Oh, but the short answer is that I'm using: Add, Subtract, Divide, Multiply, Power, and Absolute. Sometimes both inputs are ChunkedArrays; sometimes one input is a ChunkedArray and the other is a scalar.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
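(For concreteness, a minimal sketch of what those calls look like through the Arrow C++ compute API. This is an illustration, not the code from the thread: the function ScaleAndShift and the constant 2.0 are made up, and "multiply", "add", and "abs" are the registered kernel names for Multiply, Add, and Absolute.)

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    namespace cp = arrow::compute;

    // Datum wraps ChunkedArrays and scalars alike, so the same call site
    // covers both of the input combinations described above.
    arrow::Result<arrow::Datum> ScaleAndShift(
        const std::shared_ptr<arrow::ChunkedArray>& values) {
      // ChunkedArray * scalar
      ARROW_ASSIGN_OR_RAISE(
          auto scaled,
          cp::CallFunction("multiply", {values, arrow::Datum(2.0)}));
      // ChunkedArray + ChunkedArray
      ARROW_ASSIGN_OR_RAISE(
          auto shifted,
          cp::CallFunction("add", {scaled, arrow::Datum(values)}));
      return cp::CallFunction("abs", {shifted});
    }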
On Thu, Mar 10, 2022 at 4:38 PM Aldrin <[email protected]> wrote:

> Hi Niranda!
>
> Sure thing, I've linked to my code. [1] is essentially the function
> being called, and [2] is an example of a wrapper function (more in that
> file) I wrote to reduce boilerplate (to make [1] more readable). But,
> now that I look at [2] again, which I wrote before I really knew much
> about smart pointers, I wonder if some of what I benchmarked is
> overhead from misusing C++ structures?
>
> Thanks!
>
> [1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
> [2]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]> wrote:
>
>> Hi Aldrin,
>>
>> It would be helpful to know what sort of compute operators you are using.
>>
>> On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:
>>
>>> I will work on a reproducible example.
>>>
>>> As a sneak peek, what I was seeing was the following (pasted in
>>> gmail, see [1] for the markdown version):
>>>
>>> Table ID        Columns   Rows   Rows (slice)   Slice count   Time (ms), by slice   Time (ms), by table
>>> E-GEOD-100618       415  20631            299            69               644.065                   410
>>> E-GEOD-76312       2152  27120             48           565             25607.927                  2953
>>> E-GEOD-106540      2145  24480             45           544             25193.507                  3088
>>>
>>> Here "by slice" is the total time, summed from running the function
>>> on each slice, and "by table" is the time of running the function
>>> once on the table concatenated from all of the slices.
>>>
>>> The difference was large (but not *so* large) for ~70 slices (1.5x);
>>> but for ~550 slices (with ~6x fewer rows per slice and ~5x more
>>> columns) the difference became significant (~10x).
>>>
>>> I will follow up here when I have a more reproducible example. I also
>>> started doing this before tensors were available, so I'll try to see
>>> how that changes performance.
>>>
>>> [1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a
>>>
>>> Aldrin Montana
>>> Computer Science PhD Student
>>> UC Santa Cruz
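(A minimal sketch of the two measurement strategies being compared in the table above -- an illustration, not the benchmark code from the thread. It assumes the slices are already in hand as a vector of Tables, and that "column" names a numeric column present in all of them.)

    #include <chrono>
    #include <iostream>
    #include <string>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    namespace cp = arrow::compute;

    // "By slice": run the kernel once per slice and sum the wall times.
    // "By table": concatenate all slices first, then run the kernel once.
    arrow::Status CompareStrategies(
        const std::vector<std::shared_ptr<arrow::Table>>& slices,
        const std::string& column) {
      using clock = std::chrono::steady_clock;

      auto t0 = clock::now();
      for (const auto& slice : slices) {
        // Assumes the column exists in every slice.
        ARROW_ASSIGN_OR_RAISE(
            auto per_slice,
            cp::CallFunction("multiply", {slice->GetColumnByName(column),
                                          arrow::Datum(2.0)}));
      }
      auto by_slice = clock::now() - t0;

      ARROW_ASSIGN_OR_RAISE(auto table, arrow::ConcatenateTables(slices));
      auto t1 = clock::now();
      ARROW_ASSIGN_OR_RAISE(
          auto whole_table,
          cp::CallFunction("multiply", {table->GetColumnByName(column),
                                        arrow::Datum(2.0)}));
      auto by_table = clock::now() - t1;

      std::cout << "by slice: "
                << std::chrono::duration<double, std::milli>(by_slice).count()
                << " ms\n"
                << "by table: "
                << std::chrono::duration<double, std::milli>(by_table).count()
                << " ms\n";
      return arrow::Status::OK();
    }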
>>> On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]> wrote:
>>>
>>>> As far as I know (and my knowledge here may be dated), the compute
>>>> kernels themselves do not do any concurrency. There are certainly
>>>> compute kernels that could benefit from concurrency in this manner
>>>> (many naively so), and I think things are set up so that, if we
>>>> decide to tackle this feature, we could do so in a systematic way
>>>> (instead of writing something for each kernel).
>>>>
>>>> I believe that kernels, if given a unique kernel context, should be
>>>> thread safe.
>>>>
>>>> The streaming compute engine, on the other hand, does support
>>>> concurrency. It is mostly driven by the scanner at the moment (e.g.
>>>> each batch we fetch from the scanner gets a fresh thread task for
>>>> running through the execution plan), but there is some intra-node
>>>> concurrency in the hash join and (I think) the hash aggregate nodes.
>>>> This has been sufficient to saturate cores on the benchmarks we run.
>>>> I know there is ongoing interest in understanding and improving our
>>>> concurrency here.
>>>>
>>>> The scanner supports concurrency. It will typically fetch multiple
>>>> files at once and, for each file, it will fetch multiple batches at
>>>> once (assuming the file has more than one batch).
>>>>
>>>> > I see a large difference between the total time to apply compute
>>>> > functions to a single table (concatenated from many small tables)
>>>> > compared to applying compute functions to each sub-table in the
>>>> > composition.
>>>>
>>>> Which one is better? Can you share a reproducible example?
>>>>
>>>> On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:
>>>> >
>>>> > Hello!
>>>> >
>>>> > I'm wondering if there's any documentation that describes the
>>>> > concurrency/parallelism architecture for the compute API. I'd also
>>>> > be interested if there are recommended approaches for observing the
>>>> > performance of threads used by Arrow -- should I try to check a
>>>> > processor ID and infer performance, or are there particular tools
>>>> > that the community uses?
>>>> >
>>>> > Specifically, I am wondering if the concurrency is going to be
>>>> > different when using a ChunkedArray as an input compared to an
>>>> > Array, or for ChunkedArrays with various chunk sizes (1 chunk vs.
>>>> > tens or hundreds). I see a large difference between the total time
>>>> > to apply compute functions to a single table (concatenated from
>>>> > many small tables) compared to applying compute functions to each
>>>> > sub-table in the composition. I'm trying to figure out where that
>>>> > difference may come from, and I'm wondering if it's related to
>>>> > parallelism within Arrow.
>>>> >
>>>> > I tried using the GitHub issues and JIRA issues (e.g. [1]) to
>>>> > sleuth out the info, but I couldn't find anything. The pyarrow API
>>>> > seems to have functions I could try to use to figure it out
>>>> > (cpu_count and set_cpu_count), but that seems like a vague road.
>>>> >
>>>> > [1]: https://issues.apache.org/jira/browse/ARROW-12726
>>>> >
>>>> > Thank you!
>>>> >
>>>> > Aldrin Montana
>>>> > Computer Science PhD Student
>>>> > UC Santa Cruz
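(Following up on Weston's note above that kernels should be thread safe if given a unique kernel context: a minimal sketch of what that could look like with the C++ API, one arrow::compute::ExecContext per thread. This is an illustration under Weston's stated caveat, not code from the thread; ParallelAbs and the choice of plain std::thread are assumptions.)

    #include <memory>
    #include <thread>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/compute/api.h>

    namespace cp = arrow::compute;

    // Run the same kernel from several threads, giving each thread its
    // own ExecContext rather than sharing one across threads.
    void ParallelAbs(
        const std::vector<std::shared_ptr<arrow::ChunkedArray>>& inputs) {
      std::vector<std::thread> workers;
      workers.reserve(inputs.size());
      for (const auto& input : inputs) {
        workers.emplace_back([input]() {
          // One ExecContext per thread; default memory pool, no executor.
          cp::ExecContext ctx(arrow::default_memory_pool());
          auto result = cp::CallFunction("abs", {input}, &ctx);
          if (!result.ok()) {
            // Handle or log the error in real code.
          }
        });
      }
      for (auto& w : workers) w.join();
    }

As far as I know, the C++ analogues of pyarrow's cpu_count/set_cpu_count mentioned in the original question are arrow::GetCpuThreadPoolCapacity() and arrow::SetCpuThreadPoolCapacity(), which control the process-wide CPU thread pool.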
