I will work on a reproducible example. As a sneak peek, here is what I was seeing (pasted in Gmail; see [1] for the markdown version):

Table ID         Columns   Rows  Rows (slice)   Slice count   Time (ms) by slice   Time (ms) by table
E-GEOD-100618        415  20631           299            69              644.065                  410
E-GEOD-76312        2152  27120            48           565            25607.927                 2953
E-GEOD-106540       2145  24480            45           544            25193.507                 3088

Here "by slice" is the total time summed over running the function on each slice, and "by table" is the time to run the function once on the table concatenated from those slices. The difference was large (but not *so* large) for ~70 slices (about 1.5x); for ~550 slices (with roughly 6x fewer rows per slice and 5x more columns) it grew to roughly 8-9x.
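In rough terms, the comparison being timed looks like the sketch below. This is a minimal stand-in rather than the real benchmark: the data are synthetic, the "expr" column name is made up, and pyarrow.compute's sum kernel stands in for the actual function, so the numbers will not match the table above.

    import time

    import pyarrow as pa
    import pyarrow.compute as pc

    # Synthetic stand-ins for the per-slice sub-tables (hypothetical
    # shape, loosely mirroring the E-GEOD-76312 row above).
    n_slices, rows_per_slice = 565, 48
    slices = [
        pa.table({"expr": list(range(rows_per_slice))})
        for _ in range(n_slices)
    ]

    # "By slice": apply the kernel to each sub-table and sum the times.
    start = time.perf_counter()
    for sub_table in slices:
        pc.sum(sub_table.column("expr"))
    by_slice_ms = (time.perf_counter() - start) * 1000

    # "By table": apply the kernel once to the concatenated table. Note
    # that the concatenated column is still chunked (one chunk per
    # slice); table.combine_chunks() would merge the chunks first.
    table = pa.concat_tables(slices)
    start = time.perf_counter()
    pc.sum(table.column("expr"))
    by_table_ms = (time.perf_counter() - start) * 1000

    print(f"by slice: {by_slice_ms:.3f} ms")
    print(f"by table: {by_table_ms:.3f} ms")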
I will follow up here when I have a more reproducible example. I also started doing this before tensors were available, so I'll try to see how using them changes performance. (A sketch of the scanner-level concurrency Weston describes is at the bottom of this message, after the quoted thread.)
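One way to check whether that gap comes from parallelism inside Arrow is to pin the size of Arrow's CPU thread pool with the cpu_count/set_cpu_count functions mentioned in the quoted thread below, then re-run both measurements. A minimal sketch:

    import pyarrow as pa

    # Default size of Arrow's CPU thread pool on this machine.
    print(pa.cpu_count())

    # Pin the pool to a single thread; if the by-table timing then
    # degrades toward the by-slice timing, parallelism is likely part
    # of the difference.
    pa.set_cpu_count(1)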
[1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]> wrote:
> As far as I know (and my knowledge here may be dated), the compute
> kernels themselves do not do any concurrency. There are certainly
> compute kernels that could benefit from concurrency in this manner
> (many kernels naively so), and I think things are set up so that, if
> we decide to tackle this feature, we could do so in a systematic way
> (instead of writing something for each kernel).
>
> I believe that kernels, if given a unique kernel context, should be
> thread safe.
>
> The streaming compute engine, on the other hand, does support
> concurrency. It is mostly driven by the scanner at the moment (e.g.
> each batch we fetch from the scanner gets a fresh thread task for
> running through the execution plan), but there is some intra-node
> concurrency in the hash join and (I think) the hash aggregate nodes.
> This has been sufficient to saturate cores on the benchmarks we run.
> I know there is ongoing interest in understanding and improving our
> concurrency here.
>
> The scanner supports concurrency. It will typically fetch multiple
> files at once and, for each file, it will fetch multiple batches at
> once (assuming the file has more than one batch).
>
> > I see a large difference between the total time to apply compute
> > functions to a single table (concatenated from many small tables)
> > compared to applying compute functions to each sub-table in the
> > composition.
>
> Which one is better? Can you share a reproducible example?
>
> On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:
> >
> > Hello!
> >
> > I'm wondering if there's any documentation that describes the
> > concurrency/parallelism architecture for the compute API. I'd also
> > be interested in whether there are recommended approaches for
> > observing the performance of threads used by Arrow--should I try to
> > check a processor ID and infer performance, or are there particular
> > tools that the community uses?
> >
> > Specifically, I am wondering if the concurrency is going to be
> > different when using a ChunkedArray as an input compared to an
> > Array, or for ChunkedArrays with various chunk sizes (1 chunk vs.
> > tens or hundreds). I see a large difference between the total time
> > to apply compute functions to a single table (concatenated from
> > many small tables) compared to applying compute functions to each
> > sub-table in the composition. I'm trying to figure out where that
> > difference may come from, and I'm wondering if it's related to
> > parallelism within Arrow.
> >
> > I tried using the GitHub issues and JIRA issues (e.g. [1]) to
> > sleuth out the info, but I couldn't find anything. The pyarrow API
> > seems to have functions I could try to use to figure it out
> > (cpu_count and set_cpu_count), but that seems like a vague road.
> >
> > [1]: https://issues.apache.org/jira/browse/ARROW-12726
> >
> > Thank you!
> >
> > Aldrin Montana
> > Computer Science PhD Student
> > UC Santa Cruz
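As referenced above, here is a minimal sketch of the scanner-level concurrency Weston describes, using the pyarrow.dataset API (the directory path and Parquet format are hypothetical):

    import pyarrow.dataset as ds

    # Hypothetical directory of Parquet files. The scanner will
    # typically fetch multiple files, and multiple batches within each
    # file, concurrently.
    dataset = ds.dataset("data/e-geod/", format="parquet")

    # use_threads toggles the scanner's concurrency; passing False is
    # another way to isolate single-threaded behavior.
    table = dataset.to_table(use_threads=True)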
