Python Avro is super slow. I have built a C module that is about 30 times faster. It does both encoding and decoding. I intend to open source it soon. More testers would be helpful then.

Wai Yip

Bruce Mitchener
Friday, January 09, 2015 6:05 AM
Has anyone profiled the Python code or otherwise looked at the performance?
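
For anyone who wants to check, a minimal profiling sketch using the standard library's cProfile might look like the following. Note that `decode_records` and `profile_call` are hypothetical names made up for illustration; `decode_records` is only a stand-in for whatever Avro read or write loop is being measured and is not part of the Avro package.

```python
import cProfile
import io
import pstats


def decode_records(n):
    """Hypothetical stand-in for the Avro read loop being measured."""
    total = 0
    for i in range(n):
        total += len(str(i))  # placeholder per-record work
    return total


def profile_call(func, *args, top=10):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


if __name__ == "__main__":
    result, report = profile_call(decode_records, 100_000)
    print(report)
```

Swapping the body of `decode_records` for the real read loop should show which functions dominate the run time.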

 - Bruce

On Jan 9, 2015, at 8:56 PM, Han JU wrote:

Han JU
Friday, January 09, 2015 5:56 AM
Hi,

Thanks. I've tried this project, and its performance approaches Java/Scala. But it seems to have only read support. We do have lots of use cases where Python programs need to persist datasets.




--
*JU Han*

Data Engineer @ Botify.com

+33 0619608888

Mika Ristimaki
Friday, January 09, 2015 5:39 AM
Hi,

I can’t really comment on why Python Avro is slow, but you could try fastavro.

https://pypi.python.org/pypi/fastavro

-Mika


Han JU
Friday, January 09, 2015 5:32 AM
Hi,

I'm evaluating Avro to replace our CSV-based datasets, and I've noticed a performance problem in the Avro Python bindings. I tested on a 1.8 GB dataset with 5 columns. With Scala (Avro Java bindings), reads and writes are fast (18 s and 44 s), but in Python, for the same file, it took nearly one hour to write and 50 minutes to read.

My code is based on the Avro documentation examples, and the schema is relatively simple. My questions:
  - Is this performance difference a known issue?
  - Is there something I am missing (a special configuration, for example)?
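
A comparison like the one described above could be measured with a small timing harness such as the sketch below. This is not the code that produced the numbers in this message. It uses the standard library's csv module purely as a stand-in workload, and `write_rows`, `read_rows`, `timed`, and `benchmark` are hypothetical names; the Avro writer and reader loops would replace the bodies of the first two.

```python
import csv
import os
import tempfile
import time


def write_rows(path, rows):
    """Stand-in writer: replace the body with the Avro write loop under test."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def read_rows(path):
    """Stand-in reader: replace the body with the Avro read loop under test."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def benchmark(num_rows=10_000, num_cols=5):
    """Write then read num_rows rows and return (row count, write s, read s)."""
    rows = [[str(i)] * num_cols for i in range(num_rows)]
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        _, write_s = timed(write_rows, path, rows)
        count, read_s = timed(read_rows, path)
    finally:
        os.remove(path)  # clean up the temporary file
    return count, write_s, read_s


if __name__ == "__main__":
    count, write_s, read_s = benchmark()
    print(f"{count} rows: write {write_s:.3f}s, read {read_s:.3f}s")
```

Timing the write and read phases separately, on the same file, keeps the Python and Scala numbers comparable.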

I've seen the fastavro project, which is much faster at reading but has no write support. This will prevent us from using Avro, since we have a lot of Python-based programs that need to persist data.

Thanks!
--
*JU Han*

Data Engineer @ Botify.com

+33 0619608888
