Python Avro is super slow. I have built a C module that is about 30
times faster. It does both encoding and decoding. I intend to open
source it soon. More testers would be helpful then.
Wai Yip
Bruce Mitchener <mailto:[email protected]>
Friday, January 09, 2015 6:05 AM
Has anyone profiled the Python code or otherwise looked at the
performance?
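
For anyone who wants to try, a minimal profiling sketch using only the standard library's cProfile and pstats might look like the following. The workload is an assumption: encode_long, write_records and profile_call are hypothetical names, and encode_long is a stand-in that follows the zig-zag varint encoding the Avro specification describes for int and long values. It is not code taken from the avro package, so swap in the real read or write call you want to measure.

```python
import cProfile
import io
import pstats


def encode_long(n):
    """Zig-zag + varint encode a signed 64-bit integer (hypothetical stand-in workload)."""
    n = (n << 1) ^ (n >> 63)  # zig-zag: map signed values onto unsigned ones
    out = bytearray()
    while n & ~0x7F:  # more than 7 bits left: emit a continuation byte
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def write_records(records):
    """Serialize every field of every record into an in-memory buffer."""
    buf = io.BytesIO()
    for rec in records:
        for value in rec:
            buf.write(encode_long(value))
    return buf.getvalue()


def profile_call(func, *args, top=5):
    """Run func under cProfile; return its result and a text report of the top entries."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(top)
    return result, report.getvalue()


# Made-up data: 20,000 records with 5 integer columns each.
records = [(i, -i, i * 1000, 7, 42) for i in range(20000)]
payload, report = profile_call(write_records, records)
print(report)
```

Sorting by cumulative time shows which calls dominate a run; with the real library, replacing write_records with the actual write loop should point at the hot spots.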
- Bruce
On Jan 9, 2015, at 8:56 PM, Han JU <[email protected]> wrote:
Han JU <mailto:[email protected]>
Friday, January 09, 2015 5:56 AM
Hi,
Thanks. I've tried this project, and its performance approaches that of
Java/Scala. But it seems to have only read support. We do have
lots of use cases where Python programs need to persist datasets.
--
*JU Han*
Data Engineer @ Botify.com
+33 0619608888
Mika Ristimaki <mailto:[email protected]>
Friday, January 09, 2015 5:39 AM
Hi,
I can’t really comment on why Python Avro is slow, but you could try fastavro.
https://pypi.python.org/pypi/fastavro
-Mika
Han JU <mailto:[email protected]>
Friday, January 09, 2015 5:32 AM
Hi,
I'm evaluating Avro to replace our CSV-based datasets, and I've noticed a
performance problem in the Avro Python bindings.
Basically, I've tested on a 1.8 GB dataset with 5 columns. With Scala
(Avro Java bindings), reads and writes are fast (18 s, 44 s), but in
Python, for the same file, it took nearly one hour to write and 50
minutes to read ...
My code is based on the Avro documentation examples, and the schema is
relatively simple. My questions:
- Is this performance difference a known issue?
- Is there something I'm missing (say, a special configuration or something)?
I've seen the fastavro project, which is much faster at reading but
has no write support. This will prevent us from using Avro, since we have
a lot of Python-based programs that need to persist data.
Thanks!
--
*JU Han*
Data Engineer @ Botify.com
+33 0619608888