Hi Jayjeet,

That's odd, since the Python API is just a wrapper around the C++ API, so the 
two should perform identically if everything is configured the same way. (The 
Java API wraps the C++ API as well, incidentally.) That's effectively what the 
SO question is saying.

What versions of PyArrow and Arrow are you using? Just to check the obvious 
things, was the Arrow C++ library compiled with optimizations enabled (a 
release build rather than a debug build)? And if we want to replicate this, 
is it possible to get the dataset?
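
In case it helps, here is a minimal sketch for gathering that information on 
the Python side. The helper name describe_arrow_build is made up for this 
example, and which fields pyarrow.cpp_build_info exposes depends on the PyArrow 
release, so each field is read defensively and reported as None when absent.

```python
import importlib


def describe_arrow_build():
    """Collect the PyArrow version and, where exposed, Arrow C++ build details."""
    try:
        pa = importlib.import_module("pyarrow")
    except ImportError:
        # PyArrow is not installed in this environment.
        return {"pyarrow": None}

    info = {"pyarrow": pa.__version__}
    # cpp_build_info describes the Arrow C++ library that PyArrow links against.
    # Assumption: not every release exposes every field listed here, so a
    # missing attribute is recorded as None instead of raising.
    build = getattr(pa, "cpp_build_info", None)
    for field in ("version", "compiler_id", "compiler_flags", "build_type"):
        info["cpp_" + field] = getattr(build, field, None)
    return info


if __name__ == "__main__":
    for key, value in describe_arrow_build().items():
        print(f"{key}: {value}")
```

For the standalone C++ benchmark, the comparable check is the CMAKE_BUILD_TYPE 
used for both Arrow and the benchmark binary; arrow::GetBuildInfo() reports 
similar details from C++.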

-David

On Tue, Mar 1, 2022, at 01:52, Jayjeet Chakraborty wrote:
> Hi Arrow community,
> 
> 
> I was working on a class project benchmarking the Apache Arrow Dataset API in 
> different programming languages. I found that, for some reason, the C++ API 
> example is slower than the Python API example. I ran my benchmarks on a 5 GB 
> dataset consisting of 300 16 MB Parquet files. I tried my best to verify 
> that all the parameters are the same in the Python and C++ examples. It would 
> be great to know if anyone has made similar observations in the past and 
> whether the reason for this is known. I would really like to know more about 
> this phenomenon. You can find the code and the results here [1]. I found a 
> similar issue here [2], but I couldn't understand the exact reason. Thanks a 
> lot for your help.
> 
> 
> 
> [1] 
> https://github.com/JayjeetAtGithub/arrow-flight-benchmark/tree/main/dataset_bench
> 
> [2] 
> https://stackoverflow.com/questions/67856457/reading-parquet-file-is-slower-in-c-than-in-python
> 
> 
> Best Regards,
> *Jayjeet Chakraborty*
> Ph.D. Student
> Department of Computer Science and Engineering
> University of California, Santa Cruz
> 
> -- 
> *Jayjeet Chakraborty*
> B.Tech in Computer Sc. and Engineering
> National Institute Of Technology, Durgapur
> West Bengal, India
> M: (+91) 8436500886
> 
