I definitely think that adding runtime SIMD dispatching for arithmetic
functions (and using xsimd to write generic kernels that can be
cross-compiled for different SIMD targets, e.g. AVX2, AVX-512, NEON) is
a good idea, and hopefully it will be fairly low-hanging fruit (~a day
or two of work) that makes applications faster across the board.
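
To make that concrete, here is a rough sketch (illustrative only, not
actual Arrow code, and the names are made up) of what a generic xsimd
kernel with runtime dispatch could look like:

```
#include <cstddef>
#include <cstdint>
#include "xsimd/xsimd.hpp"

// One generic kernel body; xsimd instantiates it per SIMD target.
struct MultiplyKernel {
  template <class Arch>
  void operator()(Arch, const int32_t* a, const int32_t* b,
                  int32_t* out, std::size_t n) const {
    using batch = xsimd::batch<int32_t, Arch>;
    std::size_t i = 0;
    // Vector body: batch::size lanes per iteration.
    for (; i + batch::size <= n; i += batch::size) {
      auto va = batch::load_unaligned(a + i);
      auto vb = batch::load_unaligned(b + i);
      (va * vb).store_unaligned(out + i);
    }
    // Scalar tail for the leftover elements.
    for (; i < n; ++i) out[i] = a[i] * b[i];
  }
};

void Multiply(const int32_t* a, const int32_t* b, int32_t* out,
              std::size_t n) {
  // xsimd::dispatch detects the CPU at runtime and forwards to the
  // kernel instantiation for the best instruction set in the binary.
  auto dispatched = xsimd::dispatch(MultiplyKernel{});
  dispatched(a, b, out, n);
}
```

The same kernel source would be compiled once per target (SSE4.2,
AVX2, AVX-512, NEON) and the best version picked at runtime.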

On Thu, Jun 9, 2022 at 1:45 PM Sasha Krassovsky
<[email protected]> wrote:
>
> To add on to Weston’s response, the only SIMD the compilers will currently
> generate for the kernels is SSE4.2; they will not generate AVX2 because we
> have not set up the compiler flags for that. Also, the way the code is
> written does not seem easy for the compiler to vectorize and needs a bit of
> massaging. I do plan to tackle this at some point after the per-kernel
> overhead work that Weston mentioned is complete.
>
> Sasha Krassovsky
>
> > On Jun 9, 2022, at 8:42 AM, Shawn Yang <[email protected]> wrote:
> >
> > > I see, thanks. I'll do more tests and dive deeper into the arrow compute code.
> >
> >
> >> On Jun 9, 2022, at 5:30 PM, Weston Pace <[email protected]> wrote:
> >>
> >>>
> >>> Hi, do you guys know which functions support SIMD vectorization in
> >>> arrow compute?
> >>
> >> I don't know that anyone has done a fully systematic analysis of which
> >> kernels do and do not support SIMD at the moment.  The kernels are
> >> still in flux.  There is an active effort to reduce overhead[1], which
> >> is the top priority, as it could have more impact on performance than
> >> SIMD when running expressions involving multiple kernels across
> >> multiple threads.
> >>
> >>> I only found very few functions that support SIMD vectorization:
> >>> ● bloom filter: avx2
> >>> ● key compare: avx2
> >>> ● key hash: avx2
> >>> ● key map: avx2
> >>>
> >>> Do scalar operations support SIMD vectorization?
> >>
> >> A lack of explicit vectorization instructions does not mean a lack of
> >> SIMD support.  For many kernels we expect modern compilers to be smart
> >> enough to vectorize automatically, as long as the data is provided in
> >> a vectorized fashion (e.g. columnar) and the kernel is simple enough.
> >> For more complex kernels there are options such as xsimd, but this
> >> hasn't been thoroughly explored yet.  At the moment I'm not aware of
> >> anyone writing explicitly vectorized kernels, as this tends to be
> >> rather hardware specific and to have a small return on investment.
> >> Instead, we benchmark regularly and have micro-optimized certain
> >> critical sections (e.g. some of the hash stuff).
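> >>
> >> For instance (an illustrative sketch, not an actual Arrow kernel), a
> >> loop shaped like this over contiguous columnar buffers is simple
> >> enough for a modern compiler to auto-vectorize at -O3:
> >>
> >> ```
> >> void AddInt64(const int64_t* a, const int64_t* b,
> >>               int64_t* out, int64_t n) {
> >>   // No branches and unit-stride loads/stores, so the compiler can
> >>   // emit packed SIMD loads, adds, and stores for its target ISA.
> >>   for (int64_t i = 0; i < n; ++i) {
> >>     out[i] = a[i] + b[i];
> >>   }
> >> }
> >> ```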
> >>
> >>> I tested with numpy and found arrow is ten times slower:
> >>
> >> The result you posted appears to be 3.5x slower, not ten times.  You
> >> might want to double-check that Arrow was compiled for the appropriate
> >> architecture (the cmake files are generally good at figuring this
> >> out), but I wouldn't be too surprised if the gap is real.  Some of it
> >> might be unavoidable.  For example, does numpy support null values (I
> >> don't know for sure, but I seem to recall it does not)?  Some of it
> >> might be an inefficiency or overhead problem in Arrow-C++.  It is
> >> possible that the multiply kernel is not being vectorized correctly by
> >> the compiler, but I don't think those numbers alone are enough proof
> >> of that.
> >>
> >> Performance can be quite tricky.  It is important for us but Arrow's
> >> compute functionality is still relatively new compared to numpy and
> >> work on performance is balanced with work on features.
> >>
> >> [1] https://lists.apache.org/thread/rh10ykcolt0gxydhgv4vxk2m7ktwx5mh
> >>
> >>> On Wed, Jun 8, 2022 at 11:08 PM Shawn Yang <[email protected]> 
> >>> wrote:
> >>>
> >>> Hi, do you guys know which functions support SIMD vectorization in
> >>> arrow compute? After a quick look at the arrow compute cpp code, I
> >>> only found very few functions that support SIMD vectorization:
> >>> ● bloom filter: avx2
> >>> ● key compare: avx2
> >>> ● key hash: avx2
> >>> ● key map: avx2
> >>>
> >>> Do scalar operations support SIMD vectorization?
> >>>
> >>> I tested with numpy and found arrow is ten times slower:
> >>>
> >>> import timeit
> >>>
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.compute as pc
> >>>
> >>> def test_multiply(rows=5000000):
> >>>     a = pa.array(list(range(rows, 0, -1)))
> >>>     b = pa.array(list(range(rows)))
> >>>     print("arrow multiply took", timeit.timeit(
> >>>         lambda: pc.multiply(a, b), number=3))
> >>>     a = np.array(list(range(rows, 0, -1)))
> >>>     b = np.array(list(range(rows)))
> >>>     print("numpy multiply took", timeit.timeit(
> >>>         lambda: a * b, number=3))
> >>>     # arrow multiply took 0.14826057000000015
> >>>     # numpy multiply took 0.04047051300000071
> >>>
> >>>
> >>>> On Wed, May 25, 2022 at 10:09 PM Shawn Yang <[email protected]> 
> >>>> wrote:
> >>>>
> >>>> I see, the key to the multiple loop levels is to ensure the data can
> >>>> be held in the L2 cache, so that later calculations can process the
> >>>> batch without reading from main memory, and we can record exec stats
> >>>> for every batch and do better local task scheduling based on those
> >>>> stats. Thanks a lot. Morsels are new to me, very interesting ideas.
> >>>>
> >>>>> On May 25, 2022, at 7:23 AM, Weston Pace <[email protected]> wrote:
> >>>>>
> >>>>> There are a few levels of loops.  Two at the moment and three in the
> >>>>> future.  Some are fused and some are not.  What we have right now is
> >>>>> at an early stage and not ideal, and there are people investigating
> >>>>> and working on improvements.  I can speak a little bit about where
> >>>>> we want to go.  An example may be helpful.
> >>>>>
> >>>>> For example, given a filter "x < 100 && x > 0" we have something like
> >>>>> (this is an approximation of the work done by
> >>>>> arrow::compute::ExecuteScalarExpression and not actual code):
> >>>>>
> >>>>> ```
> >>>>> for batch_of_128k_rows in data:
> >>>>>  auto lt_one_hundred = less_than(batch_of_128k_rows, 100)
> >>>>>  auto gt_zero = greater_than(batch_of_128k_rows, 0)
> >>>>>  auto filter_pred = and(lt_one_hundred, gt_zero)
> >>>>>  consume(filter(batch_of_128k_rows, filter_pred))
> >>>>> ```
> >>>>>
> >>>>> There are two big things we need to fix here.  First,
> >>>>> `batch_of_128k_rows` is meant to be some percentage of one thread's
> >>>>> portion of the L3 cache.  This is a good unit of parallelism but it is
> >>>>> not ideal for processing because we'd rather use the L2 cache since we
> >>>>> are making three passes across `batch_of_128k_rows`.  Second, each of
> >>>>> those `auto ... =` lines is allocating new memory.  This is not ideal
> >>>>> because we'd like to avoid excess allocation if possible.
> >>>>>
> >>>>> To solve the first problem we are moving towards the "morsel/batch"
> >>>>> model[1].  This means we have two "batch" sizes.  The outer batch
> >>>>> (ironically, the morsel) is the largest and is the one used for
> >>>>> determining parallelism.  The inner batch should be smaller (size
> >>>>> based on L2).
> >>>>>
> >>>>> To solve the second problem a number of solutions have been proposed
> >>>>> (thread-local buffer pools, thread-local buffer stacks, etc.) and we
> >>>>> will hopefully adopt one at some point.  So the above code snippet
> >>>>> would hopefully become something like:
> >>>>>
> >>>>> ```
> >>>>> thread_local auto lt_one_hundred = allocate_array(l2_sized_batch_size, bool)
> >>>>> thread_local auto gt_zero = allocate_array(l2_sized_batch_size, bool)
> >>>>> thread_local auto filter_pred = allocate_array(l2_sized_batch_size, bool)
> >>>>> for batch_of_128k_rows in data:
> >>>>>  for l2_sized_batch in batch_of_128k_rows:
> >>>>>      less_than(l2_sized_batch, 100, &lt_one_hundred)
> >>>>>      greater_than(l2_sized_batch, 0, &gt_zero)
> >>>>>      and(lt_one_hundred, gt_zero, &filter_pred)
> >>>>>      consume(filter(l2_sized_batch, filter_pred))
> >>>>> ```
> >>>>>
> >>>>> There is still a fair amount of work to do before we get here but I
> >>>>> hope this gives you some idea of the direction we are headed.
> >>>>>
> >>>>> [1] https://db.in.tum.de/~leis/papers/morsels.pdf
> >>>>>
> >>>>>> On Tue, May 24, 2022 at 6:27 AM Shawn Yang <[email protected]> 
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi Ian, thank you for your reply, which recaps the history of arrow
> >>>>>> compute. Those links are very valuable for understanding arrow
> >>>>>> compute internals. I took a quick look at those documents and will
> >>>>>> take a deeper look later. I have another question: does arrow
> >>>>>> compute support loop fusion, i.e. executing multiple vectorized
> >>>>>> operations in one loop? This is very common in dataframe computing;
> >>>>>> our engine can extract those expressions into a dag/tree. If arrow
> >>>>>> compute supports loop fusion, the performance would be very
> >>>>>> promising.
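> >>>>>>
> >>>>>> To illustrate what I mean by loop fusion (a hypothetical sketch,
> >>>>>> not arrow code):
> >>>>>>
> >>>>>> ```
> >>>>>> // x, tmp, out are arrays of length n.
> >>>>>> // Unfused: one loop per operator, materializing a temporary.
> >>>>>> for (size_t i = 0; i < n; ++i) tmp[i] = x[i] + 1;
> >>>>>> for (size_t i = 0; i < n; ++i) out[i] = tmp[i] * 2;
> >>>>>>
> >>>>>> // Fused: the whole expression (x + 1) * 2 in one pass over x,
> >>>>>> // with no temporary array.
> >>>>>> for (size_t i = 0; i < n; ++i) out[i] = (x[i] + 1) * 2;
> >>>>>> ```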
> >>>>>>
> >>>>>>
> >>>>>>> On May 24, 2022, at 4:49 AM, Ian Cook <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi Shawn,
> >>>>>>>
> >>>>>>> In March of 2021, when major work on the C++ query execution machinery
> >>>>>>> in Arrow was beginning, Wes sent a message [1] to the dev list and
> >>>>>>> linked to a doc [2] with some details about the planned design. A few
> >>>>>>> months later Neal sent an update [3] about this work. However those
> >>>>>>> documents are now somewhat out of date. More recently, Wes shared
> >>>>>>> another update [4] and linked to a doc [5] regarding task execution /
> >>>>>>> control flow / scheduling. However I think the best source of
> >>>>>>> information is the doc you linked to. The query execution work has
> >>>>>>> proceeded organically with many contributors, and efforts to document
> >>>>>>> the overall design in sufficient detail have not kept pace.
> >>>>>>>
> >>>>>>> Regarding benchmarks: There has been extensive work done using
> >>>>>>> Conbench [6] as part of the Arrow CI infrastructure to benchmark
> >>>>>>> commits, for purposes of avoiding / identifying performance
> >>>>>>> regressions and measuring efforts to improve performance. However I am
> >>>>>>> not aware of any efforts to produce and publicly share benchmarks for
> >>>>>>> the purpose of comparing performance vs. other query engines.
> >>>>>>>
> >>>>>>> There is a proposal [7] to give the name "Acero" to the Arrow C++
> >>>>>>> compute engine, so in the future you will likely see it referred to by
> >>>>>>> that name. I think that having a clearer name for this will motivate
> >>>>>>> more efforts to write and share more about it.
> >>>>>>>
> >>>>>>> Ian
> >>>>>>>
> >>>>>>> [1] https://lists.apache.org/thread/n632pmjnb85o49lyxy45f7sgh4cshoc0
> >>>>>>> [2] 
> >>>>>>> https://docs.google.com/document/d/1AyTdLU-RxA-Gsb9EsYnrQrmqPMOYMfPlWwxRi1Is1tQ/
> >>>>>>> [3] https://lists.apache.org/thread/3pmb592zmonz86nmmbjcw08j5tcrfzm1
> >>>>>>> [4] https://lists.apache.org/thread/ltllzpt1r2ch06mv1ngfgdl7wv2tm8xc
> >>>>>>> [5] 
> >>>>>>> https://docs.google.com/document/d/1216CUQZ7u4acZvC2jX7juqqQCXtdXMellk3lRrgP_WY/
> >>>>>>> [6] https://conbench.ursa.dev
> >>>>>>> [7] https://lists.apache.org/thread/7v7vkc005v9343n49b3shvrdn19wdpj1
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Mon, May 23, 2022 at 10:58 AM Shawn Yang 
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Hi, I'm considering using arrow compute as an execution kernel for
> >>>>>>>> our distributed dataframe framework. I already read the great doc:
> >>>>>>>> https://arrow.apache.org/docs/cpp/compute.html, but it is a usage
> >>>>>>>> doc. Is there any design doc, internals introduction, or set of
> >>>>>>>> benchmarks for arrow compute, so I can quickly understand how arrow
> >>>>>>>> compute works, what it can do, and what should be done with it? Or
> >>>>>>>> should I just read the code in
> >>>>>>>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute
> >>>>>>>>
> >>>>>>>>
>
