Thanks for all the help, everyone. I was able to follow Niranda's steps and get the same perf in both C++ and Python. But I still don't know which optimizations are essential when compiling Arrow in C++. Can anyone please share some pointers on that? I think documenting the essential C++ optimizations in some way will help people in the future. Thanks again.
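For the archive, Niranda's steps (quoted below) can be sketched as a small Python helper. This is only a sketch, assuming a conda environment with pyarrow 6.0.0 as in that message. The function names and the `/opt/conda` fallback are hypothetical placeholders, not part of Arrow. It reproduces the `-O3` command and the `LD_LIBRARY_PATH` fix from the thread and makes no claim about which optimizations are essential.

```python
import os
import shlex


def compile_command(conda_prefix, source="dataset_bench.cc", output="dataset_bench"):
    """Return the g++ argv quoted in the thread, linking against conda's Arrow."""
    return [
        "g++", "-O3", "-std=c++11", source,
        f"-I{conda_prefix}/include",
        f"-L{conda_prefix}/lib",
        "-larrow", "-larrow_dataset", "-lparquet",
        "-o", output,
    ]


def run_environment(conda_prefix, base_env=None):
    """Return a copy of the environment with conda's lib dir first on LD_LIBRARY_PATH.

    Unlike the one-liner in the thread, this skips the trailing ':' when
    LD_LIBRARY_PATH was previously unset or empty.
    """
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = f"{conda_prefix}/lib"
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{existing}" if existing else lib_dir
    return env


if __name__ == "__main__":
    # "/opt/conda" is a placeholder used only when CONDA_PREFIX is not set.
    prefix = os.environ.get("CONDA_PREFIX", "/opt/conda")
    print(" ".join(shlex.quote(arg) for arg in compile_command(prefix)))
    print("LD_LIBRARY_PATH=" + run_environment(prefix)["LD_LIBRARY_PATH"])
```

The argv list can be passed to `subprocess.run(compile_command(prefix), check=True)`, and the benchmark can then be started with `subprocess.run(["./dataset_bench"], env=run_environment(prefix))`.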

On Tue, Mar 1, 2022 at 3:04 PM Weston Pace <[email protected]> wrote:

> Does setting UseAsync on the C++ end make a difference? It's possible we
> switched the default to async in python in 6.0.0 but not in C++.
>
> On Tue, Mar 1, 2022, 11:35 Niranda Perera <[email protected]> wrote:
>
>> Oh, I forgot to mention, had to fix LD_LIBRARY_PATH when running the c++
>> executable.
>> LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH ./dataset_bench
>>
>> On Tue, Mar 1, 2022 at 4:34 PM Niranda Perera <[email protected]> wrote:
>>
>>> @Jayeet,
>>>
>>> I ran your example in my desktop, and I don't see any timing issues
>>> there. I used conda to install pyarrow==6.0.0
>>> I used the following command
>>> g++ -O3 -std=c++11 dataset_bench.cc -I"$CONDA_PREFIX"/include -L"$CONDA_PREFIX"/lib -larrow -larrow_dataset -lparquet -o dataset_bench
>>>
>>> And I had to del the objects in the python file, because it was getting
>>> killed due to OOM.
>>> ```
>>> ...
>>> for i in range(10):
>>>     s = time.time()
>>>     dataset_ = ds.dataset("/home/niranda/flight_dataset", format="parquet")
>>>     table = dataset_.to_table(use_threads=False)
>>>     e = time.time()
>>>     print(e - s)
>>>
>>>     del table
>>>     del dataset_
>>>     gc.collect()
>>> ```
>>>
>>> For me c++ takes around ~21s and python ~22s which is expected.
>>>
>>> On Tue, Mar 1, 2022 at 2:19 PM Jayjeet Chakraborty <[email protected]> wrote:
>>>
>>>> Hi Sasha,
>>>>
>>>> Thanks a lot for replying. I tried -O2 earlier but it didn't work. I
>>>> tried it again (when compiling with PyArrow SO files) and unfortunately,
>>>> it didn't improve the results.
>>>>
>>>> On Tue, Mar 1, 2022 at 11:14 AM Sasha Krassovsky <[email protected]> wrote:
>>>>
>>>>> Hi Jayjeet,
>>>>> I noticed that you're not compiling dataset_bench with optimizations
>>>>> enabled. I'm not sure how much it will help, but it may be worth adding
>>>>> `-O2` to your g++ invocation.
>>>>>
>>>>> Sasha Krassovsky
>>>>>
>>>>> On Tue, Mar 1, 2022 at 11:11 AM Jayjeet Chakraborty <[email protected]> wrote:
>>>>>
>>>>>> Hi Niranda, David,
>>>>>>
>>>>>> I ran my benchmarks again with the PyArrow .SO libraries which should
>>>>>> be optimized. PyArrow version was 6.0.1 installed from pip. Here are my
>>>>>> new results [1]. Numbers didn't quite seem to improve. You can check my
>>>>>> build config in the Makefile [2]. I created a README [3] to make it easy
>>>>>> for you to reproduce on your end. Thanks.
>>>>>>
>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench/optimized
>>>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/Makefile
>>>>>> [3] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/dataset_bench/README.md
>>>>>>
>>>>>> On Tue, Mar 1, 2022 at 10:04 AM Niranda Perera <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Jayeet,
>>>>>>>
>>>>>>> Could you try building your cpp project against the arrow.so in
>>>>>>> pyarrow installation? It should be in the lib directory in your python
>>>>>>> environment.
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for your reply, David.
>>>>>>>>
>>>>>>>> 1) I used PyArrow 6.0.1 for both C++ and Python.
>>>>>>>> 2) The dataset was deployed using this [1] script.
>>>>>>>> 3) For C++, Arrow was built from source in release mode. You can
>>>>>>>> see the CMake config here [2].
>>>>>>>>
>>>>>>>> I think I need to test once with Arrow C++ installed from packages
>>>>>>>> instead of me building it. That might be the issue.
>>>>>>>>
>>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
>>>>>>>> [2] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jayjeet
>>>>>>>>
>>>>>>>> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Jayjeet,
>>>>>>>>>
>>>>>>>>> That's odd since the Python API is just wrapping the C++ API, so
>>>>>>>>> they should be identical if everything is configured the same. (So is
>>>>>>>>> the Java API, incidentally.) That's effectively what the SO question
>>>>>>>>> is saying.
>>>>>>>>>
>>>>>>>>> What versions of PyArrow and Arrow are you using? Just to check
>>>>>>>>> the obvious things, was Arrow compiled with optimizations? And if we
>>>>>>>>> want to replicate this, is it possible to get the dataset?
>>>>>>>>>
>>>>>>>>> -David
>>>>>>>>>
>>>>>>>>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>>>>>>>>
>>>>>>>>> Hi Arrow community,
>>>>>>>>>
>>>>>>>>> I was working on a class project for benchmarking Apache Arrow
>>>>>>>>> Dataset API in different programming languages. I found out that for
>>>>>>>>> some reason the C++ API example is slower than the Python API example.
>>>>>>>>> I ran my benchmarks on a 5 GB dataset consisting of 300 16MB parquet
>>>>>>>>> files. I tried my best to cross verify if all the parameters are
>>>>>>>>> similar in the Python and C++ examples. It would be great to know if
>>>>>>>>> someone had similar observations in the past and if the reason for
>>>>>>>>> this is known. I would really like to know more about this phenomenon.
>>>>>>>>> You can find the code and the results here [1]. I found a similar
>>>>>>>>> issue here [2] but I couldn't understand the exact reason. Thanks a
>>>>>>>>> lot for your help.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>>>>>>>> [2] https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>> Ph.D. Student
>>>>>>>>> Department of Computer Science and Engineering
>>>>>>>>> University of California, Santa Cruz
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>>> West Bengal, India
>>>>>>>>> M: (+91) 8436500886
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Jayjeet Chakraborty*
>>>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>>>> National Institute Of Technology, Durgapur
>>>>>>>> West Bengal, India
>>>>>>>> M: (+91) 8436500886
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Niranda Perera
>>>>>>> https://niranda.dev/
>>>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Jayjeet Chakraborty*
>>>>>> B.Tech in Computer Sc. and Engineering
>>>>>> National Institute Of Technology, Durgapur
>>>>>> West Bengal, India
>>>>>> M: (+91) 8436500886
>>>>>>
>>>>>
>>>>
>>>> --
>>>> *Jayjeet Chakraborty*
>>>> CS PhD student
>>>> UC Santa Cruz
>>>> California, USA
>>>>
>>>
>>> --
>>> Niranda Perera
>>> https://niranda.dev/
>>> @n1r44 <https://twitter.com/N1R44>
>>>
>>
>> --
>> Niranda Perera
>> https://niranda.dev/
>> @n1r44 <https://twitter.com/N1R44>
>>

--
*Jayjeet Chakraborty*
CS PhD student
UC Santa Cruz
California, USA
