I see. I believe I was already building in release mode: I was not passing the CMAKE_BUILD_TYPE flag, and Arrow's CMake build defaults to release when that flag is omitted. I will cross-check once more. Thanks again for all the help.
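For anyone who wants to do the same cross-check, here is a minimal sketch. It assumes a single-config CMake generator (Makefiles or Ninja) whose build directory contains a CMakeCache.txt with entries of the form NAME:TYPE=VALUE. The helper names are my own and are not part of Arrow or CMake. A None result only means the cache records no explicit value; whether a default then applies depends on the project's CMakeLists.txt.

```python
import re
from pathlib import Path
from typing import Optional

# CMakeCache.txt stores one entry per line as NAME:TYPE=VALUE.
_BUILD_TYPE_ENTRY = re.compile(r"^CMAKE_BUILD_TYPE:[A-Z]+=(.*)$", re.MULTILINE)


def build_type_from_cache(cache_text: str) -> Optional[str]:
    """Return the CMAKE_BUILD_TYPE recorded in the cache text.

    Returns None when the entry is missing or its value is empty.
    """
    match = _BUILD_TYPE_ENTRY.search(cache_text)
    if match is None:
        return None
    value = match.group(1).strip()
    return value or None


def build_type_from_dir(build_dir: str) -> Optional[str]:
    """Read CMakeCache.txt from a build directory (hypothetical helper)."""
    cache_path = Path(build_dir) / "CMakeCache.txt"
    return build_type_from_cache(cache_path.read_text())
```

Calling build_type_from_dir on the Arrow build directory prints whatever value was recorded at configure time, which can then be compared against the intended release setting.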
On Wed, Mar 2, 2022 at 9:35 AM Niranda Perera <niranda.per...@gmail.com> wrote:
> I think you should try release build mode!
>
> On Wed, Mar 2, 2022 at 12:21 PM Jayjeet Chakraborty <jayjeetchakrabort...@gmail.com> wrote:
>
>> Thanks for all the help everyone. I was able to follow Niranda's steps
>> and get the same perf in both C++ and Python. But I still don't know which
>> are essential optimizations for compiling Arrow in C++. Can anyone
>> please share some pointers on that ? I think documenting the essential C++
>> optimizations in some way will help people in the future. Thanks again.
>>
>> On Tue, Mar 1, 2022 at 3:04 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>>> Does setting UseAsync on the C++ end make a difference? It's possible
>>> we switched the default to async in python in 6.0.0 but not in C++.
>>>
>>> On Tue, Mar 1, 2022, 11:35 Niranda Perera <niranda.per...@gmail.com> wrote:
>>>
>>>> Oh, I forgot to mention, had to fix LD_LIBRARY_PATH when running the
>>>> c++ executable.
>>>> LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
>>>>
>>>> On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <niranda.per...@gmail.com> wrote:
>>>>
>>>>> @Jayeet,
>>>>>
>>>>> I ran your example in my desktop, and I don't see any timing issues
>>>>> there. I used conda to install pyarrow==6.0.0
>>>>> I used the following command
>>>>> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>>>>>
>>>>> And I had to del the objects in the python file, because it was
>>>>> getting killed due to OOM.
>>>>> ```
>>>>> ...
>>>>> for i in range(10):
>>>>>     s = time.time()
>>>>>     dataset_ = ds.dataset("/home/niranda/flight_dataset", format="parquet")
>>>>>     table = dataset_.to_table(use_threads=False)
>>>>>     e = time.time()
>>>>>     print(e - s)
>>>>>
>>>>>     del table
>>>>>     del dataset_
>>>>>     gc.collect()
>>>>> ```
>>>>>
>>>>> For me c++ takes around ~21s and python ~22s which is expected.
>>>>>
>>>>> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <jayjeetchakrabort...@gmail.com> wrote:
>>>>>
>>>>>> Hi Sasha,
>>>>>>
>>>>>> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I
>>>>>> tried it again (when compiling with PyArrow SO files) and unfortunately,
>>>>>> it didn't improve the results.
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Jayjeet,
>>>>>>> I noticed that you're not compiling dataset_bench with optimizations
>>>>>>> enabled. I'm not sure how much it will help, but it may be worth adding
>>>>>>> `-O2` to your g++ invocation.
>>>>>>>
>>>>>>> Sasha Krassovsky
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <jayjeetchakrabort...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Niranda, David,
>>>>>>>>
>>>>>>>> I ran my benchmarks again with the PyArrow .SO libraries which
>>>>>>>> should be optimized. PyArrow version was 6.0.1 installed from pip.
>>>>>>>> Here are my new results [1]. Numbers didn't quite seem to improve.
>>>>>>>> You can check my build config in the Makefile [2]. I created a
>>>>>>>> README [3] to make it easy for you to reproduce on your end. Thanks.
>>>>>>>>
>>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>>>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>>>>>> [3] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <niranda.per...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Jayeet,
>>>>>>>>>
>>>>>>>>> Could you try building your cpp project against the arrow.so in
>>>>>>>>> pyarrow installation? It should be in the lib directory in your python
>>>>>>>>> environment.
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>>
>>>>>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <jayjeetchakrabort...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for your reply, David.
>>>>>>>>>>
>>>>>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>>>>>> 3) For C++, Arrow was built from source in release mode. You can
>>>>>>>>>> see the CMake config here [2].
>>>>>>>>>>
>>>>>>>>>> I think I need to test once with Arrow C++ installed from
>>>>>>>>>> packages instead of me building it. That might be the issue.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>>>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jayjeet
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <lidav...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Jayjeet,
>>>>>>>>>>>
>>>>>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>>>>>> they should be identical if everything is configured the same.
>>>>>>>>>>> (So is the Java API, incidentally.)
>>>>>>>>>>> That's effectively what the SO question is saying.
>>>>>>>>>>>
>>>>>>>>>>> What versions of PyArrow and Arrow are you using? Just to check
>>>>>>>>>>> the obvious things, was Arrow compiled with optimizations? And if
>>>>>>>>>>> we want to replicate this, is it possible to get the dataset?
>>>>>>>>>>>
>>>>>>>>>>> -David
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Arrow community,
>>>>>>>>>>>
>>>>>>>>>>> I was working on a class project for benchmarking Apache Arrow
>>>>>>>>>>> Dataset API in different programming languages. I found out that
>>>>>>>>>>> for some reason the C++ API example is slower than the Python API
>>>>>>>>>>> example. I ran my benchmarks on a 5 GB dataset consisting of 300
>>>>>>>>>>> 16MB parquet files. I tried my best to cross verify if all the
>>>>>>>>>>> parameters are similar in the Python and C++ examples. It would be
>>>>>>>>>>> great to know if someone had similar observations in the past and
>>>>>>>>>>> if the reason for this is known. I would really like to know more
>>>>>>>>>>> about this phenomenon. You can find the code and the results here
>>>>>>>>>>> [1]. I found a similar issue here [2] but I couldn't understand
>>>>>>>>>>> the exact reason. Thanks a lot for your help.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>>>>>> [2] https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>>> Ph.D. Student
>>>>>>>>>>> Department of Computer Science and Engineering
>>>>>>>>>>> University of California, Santa Cruz
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>>>>> West Bengal, India
>>>>>>>>>>> M: (+91) 8436500886
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>>>> West Bengal, India
>>>>>>>>>> M: (+91) 8436500886
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Niranda Perera
>>>>>>>>> https://niranda.dev/
>>>>>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>> West Bengal, India
>>>>>>>> M: (+91) 8436500886
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Jayjeet Chakraborty*
>>>>>> CS PhD student
>>>>>> UC Santa Cruz
>>>>>> California, USA
>>>>>>
>>>>>
>>>>> --
>>>>> Niranda Perera
>>>>> https://niranda.dev/
>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>>
>>>> --
>>>> Niranda Perera
>>>> https://niranda.dev/
>>>> @n1r44 <https://twitter.com/N1R44>
>>>>
>>
>> --
>> *Jayjeet Chakraborty*
>> CS PhD student
>> UC Santa Cruz
>> California, USA
>>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>

--
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA