I definitely think that adding runtime SIMD dispatching for arithmetic functions (and using xsimd to write generic kernels that can be cross-compiled for different SIMD targets, e.g. AVX2, AVX-512, NEON) is a good idea, and hopefully it will be fairly low-hanging fruit (roughly a day or two of work) that makes applications faster across the board.
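[Editor's note: a minimal, hypothetical sketch of the runtime-dispatch pattern described above, in plain Python rather than Arrow's C++. All names here (`detect_simd_features`, the kernel variants) are illustrative stand-ins, not Arrow APIs: the point is only that CPU features are probed once at startup and each kernel name is bound to the best implementation available on the running machine.]

```python
def multiply_scalar(a, b):
    """Portable fallback kernel."""
    return [x * y for x, y in zip(a, b)]

def multiply_avx2(a, b):
    """Stand-in for an AVX2-compiled variant (same semantics)."""
    return [x * y for x, y in zip(a, b)]

def detect_simd_features():
    """Stand-in for a cpuid probe; a real build would query the CPU."""
    return {"sse4.2", "avx2"}

_FEATURES = detect_simd_features()

# Dispatch once at startup; callers always go through `multiply`.
multiply = multiply_avx2 if "avx2" in _FEATURES else multiply_scalar

print(multiply([1, 2, 3], [4, 5, 6]))  # [4, 10, 18]
```

In the real thing, xsimd's architecture-templated kernels would let the same source be compiled once per target and selected through a table like this at load time.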
On Thu, Jun 9, 2022 at 1:45 PM Sasha Krassovsky <[email protected]> wrote:
>
> To add on to Weston's response, the only SIMD that compilers will currently
> generate for the kernels is SSE4.2; they will not generate AVX2, as we have
> not set up the compiler flags to do that. Also, the way the code is written
> does not seem very easy for the compiler to vectorize and needs a bit of
> massaging. I do plan to tackle this at some point after the per-kernel
> overhead work that Weston mentioned is complete.
>
> Sasha Krassovsky
>
> > On Jun 9, 2022, at 8:42 AM, Shawn Yang <[email protected]> wrote:
> >
> > I see, thanks. I'll do more tests and dive into more arrow compute code.
> >
> > Sent from my iPhone
> >
> >> On Jun 9, 2022, at 5:30 PM, Weston Pace <[email protected]> wrote:
> >>
> >>> Hi, do you guys know which functions support vectorized SIMD in arrow
> >>> compute?
> >>
> >> I don't know that anyone has done a fully systematic analysis of which
> >> kernels do and do not support SIMD at the moment. The kernels are still
> >> in flux. There is an active effort to reduce overhead [1], which is the
> >> top priority, as it could possibly have more impact on performance than
> >> SIMD when running expressions involving multiple kernels across multiple
> >> threads.
> >>
> >>> I only found very few functions that support vectorized SIMD:
> >>> ● bloom filter: avx2
> >>> ● key compare: avx2
> >>> ● key hash: avx2
> >>> ● key map: avx2
> >>>
> >>> Does scalar operation support vectorized SIMD?
> >>
> >> A lack of explicit vectorization instructions does not mean a lack of
> >> SIMD support. For many kernels we expect modern compilers to be smart
> >> enough to automatically implement vectorization, as long as the data is
> >> provided in a vectorized fashion (e.g. columnar) and the kernel is
> >> simple enough. For more complex kernels there are options such as
> >> xsimd, but this hasn't yet been very thoroughly explored.
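[Editor's note: a toy illustration of the "simple enough" kernel shape Weston describes, as runnable Python standing in for the C++ loops. A C++ compiler at `-O3` can usually auto-vectorize the branch-free elementwise loop in (a); the data-dependent branch per element in (b) is the kind of shape that defeats straightforward auto-vectorization and "needs a bit of massaging".]

```python
def add_simple(a, b, out):
    # (a) tight, branch-free elementwise add over contiguous columns:
    # the easy case for a compiler's auto-vectorizer
    for i in range(len(a)):
        out[i] = a[i] + b[i]

def add_clamped(a, b, out):
    # (b) a data-dependent branch per element: much harder to vectorize
    # without rewriting it as a branch-free select/min
    for i in range(len(a)):
        s = a[i] + b[i]
        out[i] = s if s < 100 else 100

a, b = [1, 2, 99], [1, 2, 99]
out = [0, 0, 0]
add_simple(a, b, out)
print(out)   # [2, 4, 198]
add_clamped(a, b, out)
print(out)   # [2, 4, 100]
```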
> >> At the moment I'm not aware of anyone writing explicitly vectorized
> >> kernels, as this tends to be rather hardware-specific and have a small
> >> return on investment. Instead, we benchmark regularly and have
> >> micro-optimized certain critical sections (e.g. some of the hash
> >> stuff).
> >>
> >>> I tested with numpy and found arrow is ten times slower:
> >>
> >> The result you posted appears to be 3.5x slower. You might want to
> >> double-check and ensure that Arrow was compiled for the appropriate
> >> architecture (the cmake files are generally good at figuring this
> >> out), but I wouldn't be too surprised if this was the case. Some of
> >> this might be unavoidable. For example, does numpy support null
> >> values? (I don't know for sure, but I seem to recall it does not.)
> >> Some of this might be an inefficiency or overhead problem in
> >> Arrow-C++. It is possible that the add kernel is not being vectorized
> >> correctly by the compiler, but I don't think those numbers alone are
> >> enough proof of that.
> >>
> >> Performance can be quite tricky. It is important for us, but Arrow's
> >> compute functionality is still relatively new compared to numpy, and
> >> work on performance is balanced with work on features.
> >>
> >> [1] https://lists.apache.org/thread/rh10ykcolt0gxydhgv4vxk2m7ktwx5mh
> >>
> >>> On Wed, Jun 8, 2022 at 11:08 PM Shawn Yang <[email protected]>
> >>> wrote:
> >>>
> >>> Hi, do you guys know which functions support vectorized SIMD in arrow
> >>> compute? After a quick look at the arrow compute C++ code, I only
> >>> found very few functions that support vectorized SIMD:
> >>> ● bloom filter: avx2
> >>> ● key compare: avx2
> >>> ● key hash: avx2
> >>> ● key map: avx2
> >>>
> >>> Does scalar operation support vectorized SIMD?
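[Editor's note: a toy sketch, not Arrow internals, of one overhead Weston alludes to that NumPy does not pay. Arrow arrays carry a validity bitmap for nulls, so an elementwise kernel must also combine the inputs' validity into the output's null mask; the lists-of-ints representation below is purely illustrative.]

```python
def multiply_with_nulls(a, a_valid, b, b_valid):
    """Elementwise multiply plus null propagation (toy model)."""
    out = [x * y for x, y in zip(a, b)]
    # an output element is valid only where both inputs are valid
    out_valid = [av & bv for av, bv in zip(a_valid, b_valid)]
    return out, out_valid

vals, valid = multiply_with_nulls([2, 3, 4], [1, 1, 0],
                                  [5, 6, 7], [1, 0, 1])
print(vals, valid)   # [10, 18, 28] [1, 0, 0]
```

Even when no nulls are present, a general-purpose kernel has to account for this path, which is part of why a direct comparison against a null-free NumPy multiply is not quite apples to apples.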
> >>>
> >>> I tested with numpy and found arrow is ten times slower:
> >>>
> >>> def test_multiply(rows=5000000):
> >>>     a = pa.array(list(range(rows, 0, -1)))
> >>>     b = pa.array(list(range(rows)))
> >>>     import pyarrow.compute as pc
> >>>
> >>>     print("arrow multiply took", timeit.timeit(
> >>>         lambda: pc.multiply(a, b), number=3))
> >>>     a = np.array(list(range(rows, 0, -1)))
> >>>     b = np.array(list(range(rows)))
> >>>     print("numpy multiply took", timeit.timeit(
> >>>         lambda: a * b, number=3))
> >>>     # arrow multiply took 0.14826057000000015
> >>>     # numpy multiply took 0.04047051300000071
> >>>
> >>>> On Wed, May 25, 2022 at 10:09 PM Shawn Yang <[email protected]>
> >>>> wrote:
> >>>>
> >>>> I see. The key for the multiple loops is to ensure the data can be
> >>>> held in the L2 cache, so that later calculations can process a batch
> >>>> without reading from main memory, and we can record the exec stats
> >>>> for every batch and do better local task scheduling based on those
> >>>> stats. Thanks a lot. Morsels are new to me; very interesting ideas.
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>>> On May 25, 2022, at 7:23 AM, Weston Pace <[email protected]> wrote:
> >>>>>
> >>>>> There are a few levels of loops. Two at the moment and three in the
> >>>>> future. Some are fused and some are not. What we have right now is
> >>>>> early stages, is not ideal, and there are people investigating and
> >>>>> working on improvements. I can speak a little bit about where we
> >>>>> want to go. An example may be helpful.
> >>>>>
> >>>>> For example, given a filter "x < 100 && x > 0" we have something
> >>>>> like this (this is an approximation of the work done by
> >>>>> arrow::compute::ExecuteScalarExpression and not actual code):
> >>>>>
> >>>>> ```
> >>>>> for batch_of_128k_rows in data:
> >>>>>     auto lt_one_hundred = less_than(batch_of_128k_rows, 100)
> >>>>>     auto gt_zero = greater_than(batch_of_128k_rows, 0)
> >>>>>     auto filter_pred = and(lt_one_hundred, gt_zero)
> >>>>>     consume(filter(batch_of_128k_rows, filter_pred))
> >>>>> ```
> >>>>>
> >>>>> There are two big things we need to fix here. First,
> >>>>> `batch_of_128k_rows` is meant to be some percentage of one thread's
> >>>>> portion of the L3 cache. This is a good unit of parallelism but it
> >>>>> is not ideal for processing because we'd rather use the L2 cache,
> >>>>> since we are making three passes across `batch_of_128k_rows`.
> >>>>> Second, each of those `auto ... =` lines is allocating new memory.
> >>>>> This is not ideal because we'd like to avoid excess allocation if
> >>>>> possible.
> >>>>>
> >>>>> To solve the first problem we are moving towards the "morsel/batch"
> >>>>> model [1]. This means we have two "batch" sizes. The outer batch
> >>>>> (ironically, the morsel) is the largest and is the one used for
> >>>>> determining parallelism. The inner batch should be smaller (sized
> >>>>> based on L2).
> >>>>>
> >>>>> To solve the second problem a number of solutions have been
> >>>>> proposed (thread-local buffer pools, thread-local buffer stacks,
> >>>>> etc.) and we will hopefully adopt one at some point.
> >>>>> So the above code snippet would hopefully become something like:
> >>>>>
> >>>>> ```
> >>>>> thread_local auto lt_one_hundred = allocate_array(l2_sized_batch_size, bool)
> >>>>> thread_local auto gt_zero = allocate_array(l2_sized_batch_size, bool)
> >>>>> thread_local auto filter_pred = allocate_array(l2_sized_batch_size, bool)
> >>>>> for batch_of_128k_rows in data:
> >>>>>     for l2_sized_batch in batch_of_128k_rows:
> >>>>>         less_than(l2_sized_batch, 100, &lt_one_hundred)
> >>>>>         greater_than(l2_sized_batch, 0, &gt_zero)
> >>>>>         and(lt_one_hundred, gt_zero, &filter_pred)
> >>>>>         consume(filter(l2_sized_batch, filter_pred))
> >>>>> ```
> >>>>>
> >>>>> There is still a fair amount of work to do before we get here, but
> >>>>> I hope this gives you some idea of the direction we are headed.
> >>>>>
> >>>>> [1] https://db.in.tum.de/~leis/papers/morsels.pdf
> >>>>>
> >>>>>> On Tue, May 24, 2022 at 6:27 AM Shawn Yang <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi Ian, thank you for your reply, which recaps the history of
> >>>>>> arrow compute. Those links are very valuable for understanding
> >>>>>> arrow compute internals. I took a quick look at those documents
> >>>>>> and will take a deeper look later. I have another question: does
> >>>>>> arrow compute support loop fusion, which executes multiple
> >>>>>> vectorized operands in one loop? This is very common in dataframe
> >>>>>> computing; our engine can extract those expressions into a
> >>>>>> dag/tree. If arrow compute supports loop fusion, the performance
> >>>>>> would be very promising.
> >>>>>>
> >>>>>> Sent from my iPhone
> >>>>>>
> >>>>>>> On May 24, 2022, at 4:49 AM, Ian Cook <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi Shawn,
> >>>>>>>
> >>>>>>> In March of 2021, when major work on the C++ query execution
> >>>>>>> machinery in Arrow was beginning, Wes sent a message [1] to the
> >>>>>>> dev list and linked to a doc [2] with some details about the
> >>>>>>> planned design.
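[Editor's note: the morsel/batch rewrite in Weston's pseudocode above, as a runnable Python toy. The sizes and helper names are illustrative, not Arrow's: an outer "morsel" loop provides the unit of parallelism, an inner loop works on small L2-sized batches, and the three intermediate buffers are allocated once and reused instead of being reallocated per batch.]

```python
MORSEL_SIZE = 8      # stand-in for the L3-sized parallelism unit
BATCH_SIZE = 4       # stand-in for the smaller L2-sized processing unit

def filter_pipeline(data):
    """Evaluate the filter x < 100 && x > 0 over morsels and batches."""
    # preallocated, reused intermediates (thread-local in the real design)
    lt_one_hundred = [False] * BATCH_SIZE
    gt_zero = [False] * BATCH_SIZE
    filter_pred = [False] * BATCH_SIZE
    out = []
    for m in range(0, len(data), MORSEL_SIZE):
        morsel = data[m:m + MORSEL_SIZE]
        for b in range(0, len(morsel), BATCH_SIZE):
            batch = morsel[b:b + BATCH_SIZE]
            # three passes over a batch small enough to stay cache-resident
            for i, x in enumerate(batch):
                lt_one_hundred[i] = x < 100
                gt_zero[i] = x > 0
                filter_pred[i] = lt_one_hundred[i] and gt_zero[i]
            out.extend(x for i, x in enumerate(batch) if filter_pred[i])
    return out

print(filter_pipeline([-5, 1, 50, 100, 200, 99, 0, 3, 7]))  # [1, 50, 99, 3, 7]
```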
> >>>>>>> A few months later Neal sent an update [3] about this work.
> >>>>>>> However, those documents are now somewhat out of date. More
> >>>>>>> recently, Wes shared another update [4] and linked to a doc [5]
> >>>>>>> regarding task execution / control flow / scheduling. However, I
> >>>>>>> think the best source of information is the doc you linked to.
> >>>>>>> The query execution work has proceeded organically with many
> >>>>>>> contributors, and efforts to document the overall design in
> >>>>>>> sufficient detail have not kept pace.
> >>>>>>>
> >>>>>>> Regarding benchmarks: there has been extensive work done using
> >>>>>>> Conbench [6] as part of the Arrow CI infrastructure to benchmark
> >>>>>>> commits, for purposes of avoiding / identifying performance
> >>>>>>> regressions and measuring efforts to improve performance.
> >>>>>>> However, I am not aware of any efforts to produce and publicly
> >>>>>>> share benchmarks for the purpose of comparing performance vs.
> >>>>>>> other query engines.
> >>>>>>>
> >>>>>>> There is a proposal [7] to give the name "Acero" to the Arrow C++
> >>>>>>> compute engine, so in the future you will likely see it referred
> >>>>>>> to by that name. I think that having a clearer name for this will
> >>>>>>> motivate more efforts to write and share more about it.
> >>>>>>>
> >>>>>>> Ian
> >>>>>>>
> >>>>>>> [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
> >>>>>>> [2] https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
> >>>>>>> [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
> >>>>>>> [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
> >>>>>>> [5] https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
> >>>>>>> [6] https://conbench.ursa.dev
> >>>>>>> [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
> >>>>>>>
> >>>>>>>> On Mon, May 23, 2022 at 10:58 AM Shawn Yang
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Hi, I'm considering using arrow compute as an execution kernel
> >>>>>>>> for our distributed dataframe framework. I already read the
> >>>>>>>> great doc https://arrow.apache.org/docs/cpp/compute.html, but it
> >>>>>>>> is a usage doc. Is there any design doc, internals introduction,
> >>>>>>>> or benchmarks for arrow compute, so I can quickly understand how
> >>>>>>>> arrow compute works, what it can do, and what should be done by
> >>>>>>>> it? Or should I just read the code in
> >>>>>>>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
