Hi Niranda! Sure thing; I've linked to my code below. [1] is essentially the function being called, and [2] is an example of a wrapper function I wrote to reduce boilerplate and make [1] more readable (there are more in that file). But now that I look at [2] again, which I wrote before I knew much about smart pointers, I wonder whether some of what I benchmarked is overhead from misusing C++ structures.
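For context, here is a minimal sketch of the kind of overhead I have in mind (this is not the actual numops.cpp code; Dataset, ByValue, and ByConstRef are made-up names). Passing a shared_ptr by value does an atomic ref-count increment on entry and a decrement on exit, every call, which could add up when a wrapper is invoked once per slice:

#include <cstdio>
#include <memory>

struct Dataset { long rows = 0; };

// By-value copy of the shared_ptr: one atomic increment and one decrement
// per call. Harmless once; potentially measurable in a per-slice loop.
long ByValue(std::shared_ptr<Dataset> data) { return data->rows; }

// Const reference: no ref-count traffic. Fine whenever the callee does not
// need to retain shared ownership past the call.
long ByConstRef(const std::shared_ptr<Dataset>& data) { return data->rows; }

int main() {
  auto data = std::make_shared<Dataset>();
  long total = 0;
  for (int i = 0; i < 1000; ++i) {
    total += ByValue(data);     // 1000 extra increment/decrement pairs
    total += ByConstRef(data);  // no extra atomic operations
  }
  std::printf("%ld\n", total);
  return 0;
}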
Thanks!

[1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
[2]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]> wrote:

> Hi Aldrin,
>
> It would be helpful to know what sort of compute operators you are using.
>
> On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:
>
>> I will work on a reproducible example.
>>
>> As a sneak peek, what I was seeing was the following (pasted in gmail, see [1] for the markdown version):
>>
>> Table ID       Columns   Rows  Rows (slice)  Slice count  Total time (ms; by slice)  Total time (ms; by table)
>> E-GEOD-100618      415  20631           299           69                    644.065                        410
>> E-GEOD-76312      2152  27120            48          565                  25607.927                       2953
>> E-GEOD-106540     2145  24480            45          544                  25193.507                       3088
>>
>> where "by slice" is the total time summed from running the function on each slice, and "by table" is the time from running the function once on the table concatenated from all slices (see the sketch after the quoted thread below).
>>
>> The difference was large (but not *so* large) for ~70 slices (1.5x); but for ~550 slices (with ~6x fewer rows per slice and ~5x more columns) the difference became significant (~10x).
>>
>> I will follow up here when I have a more reproducible example. I also started doing this before tensors were available, so I'll try to see how that changes performance.
>>
>> [1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>> On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]> wrote:
>>
>>> As far as I know (and my knowledge here may be dated), the compute kernels themselves do not do any concurrency. There are certainly compute kernels that could benefit from concurrency in this manner (many kernels naively so), and I think things are set up so that, if we decide to tackle this feature, we could do so in a systematic way (instead of writing something for each kernel).
>>>
>>> I believe that kernels, if given a unique kernel context, should be thread safe.
>>>
>>> The streaming compute engine, on the other hand, does support concurrency. It is mostly driven by the scanner at the moment (e.g. each batch we fetch from the scanner gets a fresh thread task for running through the execution plan), but there is some intra-node concurrency in the hash join and (I think) the hash aggregate nodes. This has been sufficient to saturate cores on the benchmarks we run. I know there is ongoing interest in understanding and improving our concurrency here.
>>>
>>> The scanner supports concurrency. It will typically fetch multiple files at once and, for each file, it will fetch multiple batches at once (assuming the file has more than one batch).
>>>
>>> > I see a large difference between the total time to apply compute functions to a single table (concatenated from many small tables) compared to applying compute functions to each sub-table in the composition.
>>>
>>> Which one is better? Can you share a reproducible example?
>>>
>>> On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:
>>> >
>>> > Hello!
>>> >
>>> > I'm wondering if there's any documentation that describes the concurrency/parallelism architecture for the compute API.
>>> > I'd also be interested in recommended approaches for inspecting the performance of threads used by Arrow--should I try to check a processor ID and infer performance, or are there particular tools that the community uses?
>>> >
>>> > Specifically, I am wondering if the concurrency is going to be different when using a ChunkedArray as an input compared to an Array, or for ChunkedArrays with various chunk sizes (1 chunk vs. tens or hundreds of chunks). I see a large difference between the total time to apply compute functions to a single table (concatenated from many small tables) compared to applying compute functions to each sub-table in the composition. I'm trying to figure out where that difference may come from, and I'm wondering if it's related to parallelism within Arrow.
>>> >
>>> > I tried using the GitHub issues and JIRA issues (e.g. [1]) as a way to sleuth the info, but I couldn't find anything. The pyarrow API seems to have functions I could try to use to figure it out (cpu_count and set_cpu_count), but that seems like a vague road.
>>> >
>>> > [1]: https://issues.apache.org/jira/browse/ARROW-12726
>>> >
>>> > Thank you!
>>> >
>>> > Aldrin Montana
>>> > Computer Science PhD Student
>>> > UC Santa Cruz
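P.S. Here is a minimal, self-contained sketch of the two timing strategies from my table above. Note this is not my actual benchmark: "sum" over a single int64 column stands in for the statops kernels, and the slice shapes are made up.

#include <chrono>
#include <cstdio>
#include <memory>
#include <numeric>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// Time a single compute call over any Datum (Array, ChunkedArray, ...).
double TimeCallMs(const arrow::Datum& input) {
  auto start = std::chrono::steady_clock::now();
  auto result = cp::CallFunction("sum", {input});
  auto stop = std::chrono::steady_clock::now();
  if (!result.ok()) {
    std::fprintf(stderr, "%s\n", result.status().ToString().c_str());
  }
  return std::chrono::duration<double, std::milli>(stop - start).count();
}

arrow::Status RunBenchmark(const std::vector<std::shared_ptr<arrow::Table>>& slices) {
  // "By slice": run the function on each slice and sum the times.
  double by_slice = 0;
  for (const auto& slice : slices) { by_slice += TimeCallMs(slice->column(0)); }

  // "By table": concatenate the slices once (untimed), then run the function
  // on the resulting multi-chunk column.
  ARROW_ASSIGN_OR_RAISE(auto table, arrow::ConcatenateTables(slices));
  double by_table = TimeCallMs(table->column(0));

  std::printf("by slice: %8.3f ms | by table: %8.3f ms\n", by_slice, by_table);
  return arrow::Status::OK();
}

int main() {
  // Stand-in "slices": small single-column tables of int64.
  auto schema = arrow::schema({arrow::field("x", arrow::int64())});
  std::vector<int64_t> values(1000);
  std::iota(values.begin(), values.end(), 0);

  std::vector<std::shared_ptr<arrow::Table>> slices;
  for (int s = 0; s < 64; ++s) {
    arrow::Int64Builder builder;
    if (!builder.AppendValues(values).ok()) { return 1; }
    auto array = builder.Finish().ValueOrDie();
    slices.push_back(arrow::Table::Make(schema, {array}));
  }
  return RunBenchmark(slices).ok() ? 0 : 1;
}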
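Also, regarding Weston's note that kernels should be thread safe when given a unique kernel context: my (possibly wrong) reading is that this corresponds to giving each thread its own cp::ExecContext, roughly like:

#include <thread>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

int main() {
  arrow::Int64Builder builder;
  if (!builder.AppendValues({1, 2, 3, 4, 5}).ok()) { return 1; }
  auto array = builder.Finish().ValueOrDie();

  std::vector<std::thread> workers;
  for (int t = 0; t < 4; ++t) {
    workers.emplace_back([array]() {
      cp::ExecContext ctx;  // unique kernel/exec context for this thread
      auto result = cp::CallFunction("sum", {array}, &ctx);
      (void)result;  // real code would check result.ok()
    });
  }
  for (auto& worker : workers) { worker.join(); }
  return 0;
}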
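Finally, for the cpu_count/set_cpu_count idea from my original message, the C++ counterparts live in arrow/util/thread_pool.h. Pinning the pool to one thread seems like a coarse but easy way to check whether the by-slice vs. by-table gap is parallelism-related:

#include <cstdio>

#include <arrow/util/thread_pool.h>

int main() {
  // Equivalent of pyarrow.cpu_count()
  std::printf("CPU thread pool capacity: %d\n", arrow::GetCpuThreadPoolCapacity());

  // Equivalent of pyarrow.set_cpu_count(1): pin Arrow to a single thread.
  // If the by-slice vs. by-table gap persists single-threaded, the cause is
  // probably per-chunk/per-call overhead rather than parallelism.
  auto status = arrow::SetCpuThreadPoolCapacity(1);
  if (!status.ok()) {
    std::fprintf(stderr, "%s\n", status.ToString().c_str());
    return 1;
  }
  std::printf("capacity now: %d\n", arrow::GetCpuThreadPoolCapacity());
  return 0;
}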
