Re: C++ version of Arrow slower than Python version

Jayjeet Chakraborty Fri, 04 Mar 2022 08:51:26 -0800

I see. I believe I was already building in release mode as I was not
passing the CMAKE_BUILD_TYPE flag (which means it will build in release by
default). I will crosscheck once more. Thanks again for all the help.
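
One way to cross-check the build type is to read it back from the `CMakeCache.txt` that CMake writes into the build directory. The sketch below is only an illustration: the helper name `cmake_build_type` and the sample cache text are made up for this example, and an empty value only means no build type was recorded in the cache, in which case the project's own default applies.

```python
import re

def cmake_build_type(cache_text):
    """Return the CMAKE_BUILD_TYPE value recorded in CMakeCache.txt text.

    Returns None when the entry is absent and "" when it is present but empty.
    """
    # Cache entries look like NAME:TYPE=VALUE, e.g. CMAKE_BUILD_TYPE:STRING=Release
    match = re.search(r"^CMAKE_BUILD_TYPE:[A-Z]+=(.*)$", cache_text, re.MULTILINE)
    return match.group(1).strip() if match else None

# Hypothetical cache contents, used here instead of reading a real file.
sample = """\
# This is the CMakeCache file.
CMAKE_BUILD_TYPE:STRING=Release
CMAKE_CXX_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
"""
print(cmake_build_type(sample))
```

For a real build, the text would come from something like `open("build/CMakeCache.txt").read()`, where `build` is whatever build directory was used.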


On Wed, Mar 2, 2022 at 9:35 AM Niranda Perera <niranda.per...@gmail.com>
wrote:

> I think you should try release build mode!
>
> On Wed, Mar 2, 2022 at 12:21 PM Jayjeet Chakraborty <
> jayjeetchakrabort...@gmail.com> wrote:
>
>> Thanks for all the help, everyone. I was able to follow Niranda's steps
>> and get the same performance in both C++ and Python. But I still don't know
>> which optimizations are essential when compiling Arrow in C++. Can anyone
>> please share some pointers on that? I think documenting the essential C++
>> optimizations in some way will help people in the future. Thanks again.
>>
>> On Tue, Mar 1, 2022 at 3:04 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>>> Does setting UseAsync on the C++ end make a difference?  It's possible
>>> we switched the default to async in python in 6.0.0 but not in C++.
>>>
>>> On Tue, Mar 1, 2022, 11:35 Niranda Perera <niranda.per...@gmail.com>
>>> wrote:
>>>
>>>> Oh, I forgot to mention: I had to set LD_LIBRARY_PATH when running the
>>>> C++ executable:
>>>> LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
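
When the benchmark binary is launched from a Python driver script, the same environment fix can be sketched as below. This is a minimal illustration, not part of the original benchmark: the helper name `env_with_library_path` and the `/opt/conda/lib` path are assumptions made for the example.

```python
import os

def env_with_library_path(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Only add the separator when there is an existing value, so the
    # resulting search path never contains an empty entry.
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    return env

# Hypothetical paths, for illustration only.
env = env_with_library_path("/opt/conda/lib", {"LD_LIBRARY_PATH": "/usr/local/lib"})
print(env["LD_LIBRARY_PATH"])
```

The returned dictionary can then be passed to `subprocess.run(["./dataset_bench"], env=env)`.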
>>>>
>>>> On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <niranda.per...@gmail.com>
>>>> wrote:
>>>>
>>>>> @Jayjeet,
>>>>>
>>>>> I ran your example on my desktop, and I don't see any timing issues
>>>>> there. I used conda to install pyarrow==6.0.0 and compiled with the
>>>>> following command:
>>>>> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include \
>>>>>     -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>>>>>
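
For scripting the build, the same g++ invocation can be assembled as an argument list. The sketch below only mirrors the flags from the command above; the function name `compile_command` and the `/opt/conda` prefix are hypothetical and stand in for `$CONDA_PREFIX`.

```python
import posixpath

def compile_command(source, output, prefix, opt_level="-O3"):
    """Build the g++ argument list for compiling against Arrow under `prefix`."""
    include_dir = posixpath.join(prefix, "include")
    lib_dir = posixpath.join(prefix, "lib")
    return [
        "g++", opt_level, "-std=c++11", source,
        "-I" + include_dir,          # Arrow headers
        "-L" + lib_dir,              # Arrow shared libraries
        "-larrow", "-larrow_dataset", "-lparquet",
        "-o", output,
    ]

# Hypothetical prefix standing in for $CONDA_PREFIX.
cmd = compile_command("dataset_bench.cc", "dataset_bench", "/opt/conda")
print(" ".join(cmd))
```

Keeping the optimization level as a parameter makes it easy to compare, say, `-O2` and `-O3` builds of the benchmark itself.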
>>>>> I also had to del the objects in the Python file, because the process
>>>>> was getting killed after running out of memory (OOM).
>>>>> ```
>>>>> ...
>>>>>     for i in range(10):
>>>>>         s = time.time()
>>>>>         dataset_ = ds.dataset("/home/niranda/flight_dataset", format="parquet")
>>>>>         table = dataset_.to_table(use_threads=False)
>>>>>         e = time.time()
>>>>>         print(e - s)
>>>>>
>>>>>         del table
>>>>>         del dataset_
>>>>>         gc.collect()
>>>>> ```
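
The pattern in the loop above (time one load, then release the result before the next iteration) can be sketched as a small reusable helper. This is an illustration using only the standard library: `time_runs` is a made-up name, and the list-building lambda is a stand-in for the actual dataset read.

```python
import gc
import time

def time_runs(load, repeats=10):
    """Time `load()` `repeats` times, freeing each result before the next run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = load()
        durations.append(time.perf_counter() - start)
        del result    # drop the only reference so the memory can be reclaimed
        gc.collect()  # collect outside the timed region
    return durations

# Stand-in workload; it only allocates a list so the sketch runs anywhere.
durations = time_runs(lambda: list(range(100_000)), repeats=3)
print(len(durations))
```

With PyArrow installed, `load` would be something like `lambda: ds.dataset(path, format="parquet").to_table(use_threads=False)`, matching the calls in the snippet above.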
>>>>>
>>>>> For me, C++ takes about 21s and Python about 22s, which is expected.
>>>>>
>>>>>
>>>>> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <
>>>>> jayjeetchakrabort...@gmail.com> wrote:
>>>>>
>>>>>> Hi Sasha,
>>>>>>
>>>>>> Thanks a lot for replying. I tried -O2 earlier but it didn't help. I
>>>>>> tried it again (when compiling against the PyArrow .so files) and,
>>>>>> unfortunately, it didn't improve the results.
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <
>>>>>> krassovskysa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Jayjeet,
>>>>>>> I noticed that you're not compiling dataset_bench with optimizations
>>>>>>> enabled. I'm not sure how much it will help, but it may be worth adding
>>>>>>> `-O2` to your g++ invocation.
>>>>>>>
>>>>>>> Sasha Krassovsky
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <
>>>>>>> jayjeetchakrabort...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Niranda, David,
>>>>>>>>
>>>>>>>> I ran my benchmarks again with the PyArrow .so libraries, which
>>>>>>>> should be optimized. The PyArrow version was 6.0.1, installed from pip.
>>>>>>>> Here are my new results [1]. The numbers didn't seem to improve. You
>>>>>>>> can check my build config in the Makefile [2]. I created a README [3]
>>>>>>>> to make it easy for you to reproduce on your end. Thanks.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>>>>>> [2]
>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>>>>>> [3]
>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <
>>>>>>>> niranda.per...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Jayjeet,
>>>>>>>>>
>>>>>>>>> Could you try building your cpp project against the arrow.so in
>>>>>>>>> pyarrow installation? It should be in the lib directory in your python
>>>>>>>>> environment.
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>>
>>>>>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
>>>>>>>>> jayjeetchakrabort...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for your reply, David.
>>>>>>>>>>
>>>>>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>>>>>> 3) For C++, Arrow was built from source in release mode. You can
>>>>>>>>>> see the CMake config here [2].
>>>>>>>>>>
>>>>>>>>>> I think I need to test once with Arrow C++ installed from
>>>>>>>>>> packages instead of me building it. That might be the issue.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>>>>>> [2]
>>>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jayjeet
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <lidav...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Jayjeet,
>>>>>>>>>>>
>>>>>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>>>>>> they should be identical if everything is configured the same. (So
>>>>>>>>>>> is the Java API, incidentally.) That's effectively what the Stack
>>>>>>>>>>> Overflow question is saying.
>>>>>>>>>>>
>>>>>>>>>>> What versions of PyArrow and Arrow are you using? Just to check
>>>>>>>>>>> the obvious things, was Arrow compiled with optimizations? And if
>>>>>>>>>>> we want to replicate this, is it possible to get the dataset?
>>>>>>>>>>>
>>>>>>>>>>> -David
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Arrow community,
>>>>>>>>>>>
>>>>>>>>>>> I was working on a class project benchmarking the Apache Arrow
>>>>>>>>>>> Dataset API in different programming languages. I found that, for
>>>>>>>>>>> some reason, the C++ API example is slower than the Python API
>>>>>>>>>>> example. I ran my benchmarks on a 5 GB dataset consisting of 300
>>>>>>>>>>> 16 MB Parquet files. I tried my best to verify that all the
>>>>>>>>>>> parameters are the same in the Python and C++ examples. It would be
>>>>>>>>>>> great to know if someone has made similar observations in the past
>>>>>>>>>>> and if the reason for this is known. I would really like to know
>>>>>>>>>>> more about this phenomenon. You can find the code and the results
>>>>>>>>>>> here [1]. I found a similar issue here [2] but I couldn't understand
>>>>>>>>>>> the exact reason. Thanks a lot for your help.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>>>>>>
>>>>>>>>>>> [2]
>>>>>>>>>>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>>> Ph.D. Student
>>>>>>>>>>> Department of Computer Science and Engineering
>>>>>>>>>>> University of California, Santa Cruz
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>>>>> West Bengal, India
>>>>>>>>>>> M: (+91) 8436500886
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Niranda Perera
>>>>>>>>> https://niranda.dev/
>>>>>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Jayjeet Chakraborty*
>>>>>> CS PhD student
>>>>>> UC Santa Cruz
>>>>>> California, USA
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>
>
>

-- 
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA
