Re: C++ version of Arrow slower than Python version

Niranda Perera Tue, 01 Mar 2022 10:04:07 -0800

Hi Jayjeet,

Could you try building your C++ project against the Arrow shared library
(libarrow.so) that ships with your pyarrow installation? It should be under
the lib directory of your Python environment.
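
One way to try this is sketched below in Python. It asks pyarrow where its
bundled headers and libraries live and assembles a compiler command from
them. pyarrow.get_include(), pyarrow.get_library_dirs() and
pyarrow.get_libraries() are existing pyarrow helpers. The function names,
the g++ flags, and the bench.cc file name are assumptions made for
illustration only. On Linux wheels the bundled libraries carry versioned
names, so pyarrow.create_library_symlinks() may need to be run once before
linking, and an ABI flag such as -D_GLIBCXX_USE_CXX11_ABI=0 may be required
depending on how the wheel was built.

```python
def build_compile_command(source, output, include_dir, library_dirs, libraries):
    """Assemble a g++ command line from explicit paths (hypothetical helper)."""
    # Optimization level and C++ standard are assumptions; adjust as needed.
    cmd = ["g++", "-O3", "-std=c++17", source, "-o", output, f"-I{include_dir}"]
    for directory in library_dirs:
        # -L is used at link time; the rpath lets the binary find the
        # same shared libraries at run time.
        cmd += [f"-L{directory}", f"-Wl,-rpath,{directory}"]
    cmd += [f"-l{name}" for name in libraries]
    return cmd


def pyarrow_compile_command(source, output, extra_libraries=()):
    """Build the command from the paths reported by the installed pyarrow."""
    import pyarrow as pa  # imported here so the helper above has no dependency

    libraries = list(pa.get_libraries()) + list(extra_libraries)
    return build_compile_command(
        source, output, pa.get_include(), pa.get_library_dirs(), libraries
    )


# Example usage (assumes pyarrow is installed and bench.cc exists):
#   import subprocess
#   cmd = pyarrow_compile_command("bench.cc", "bench",
#                                 extra_libraries=["arrow_dataset", "parquet"])
#   subprocess.run(cmd, check=True)
```

If the binary built this way matches the Python timings, the difference
most likely comes from how the source build of Arrow was configured rather
than from the benchmark code.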


Best,

On Tue, Mar 1, 2022 at 12:46 PM Jayjeet Chakraborty <
[email protected]> wrote:

> Thanks for your reply, David.
>
> 1) I used Arrow 6.0.1 for both C++ and Python (PyArrow).
> 2) The dataset was deployed using this [1] script.
> 3) For C++, Arrow was built from source in release mode. You can see the
> CMake config here [2].
>
> I think I need to test once with Arrow C++ installed from packages instead
> of building it myself. That might be the issue.
>
> [1]
> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/blob/main/common/deploy_data.sh
> [2]
> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/cpp
>
> Best,
> Jayjeet
>
>
>
>
> On Tue, Mar 1, 2022 at 5:04 AM David Li <[email protected]> wrote:
>
>> Hi Jayjeet,
>>
>> That's odd since the Python API is just wrapping the C++ API, so they
>> should be identical if everything is configured the same. (So is the Java
>> API, incidentally.) That's effectively what the SO question is saying.
>>
>> What versions of PyArrow and Arrow are you using? Just to check the
>> obvious things, was Arrow compiled with optimizations? And if we want to
>> replicate this, is it possible to get the dataset?
>>
>> -David
>>
>> On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
>>
>> Hi Arrow community,
>>
>> I was working on a class project that benchmarks the Apache Arrow Dataset
>> API in different programming languages. I found that, for some reason,
>> the C++ API example is slower than the Python API example. I ran my
>> benchmarks on a 5 GB dataset consisting of 300 16 MB Parquet files, and I
>> tried my best to cross-verify that all the parameters are the same in the
>> Python and C++ examples. It would be great to know whether anyone has made
>> similar observations in the past and whether the reason is known. You can
>> find the code and the results here [1]. I found a similar issue here [2],
>> but I couldn't understand the exact reason. Thanks a lot for your help.
>>
>>
>> [1]
>> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
>>
>> [2]
>> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
>>
>> Best Regards,
>> *Jayjeet Chakraborty*
>> Ph.D. Student
>> Department of Computer Science and Engineering
>> University of California, Santa Cruz
>>
>> --
>> *Jayjeet Chakraborty*
>> B.Tech in Computer Sc. and Engineering
>> National Institute Of Technology, Durgapur
>> West Bengal, India
>> M: (+91) 8436500886
>>
>>
>>
>
> --
> *Jayjeet Chakraborty*
> B.Tech in Computer Sc. and Engineering
> National Institute Of Technology, Durgapur
> West Bengal, India
> M: (+91) 8436500886
>


-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
