Hi,

I'm evaluating Avro to replace our CSV-based datasets, and I've noticed a performance problem in the Avro Python bindings. I tested on a 1.8 GB dataset with 5 columns. With Scala (using the Avro Java bindings), reads and writes are fast (18s, 44s), but in Python, for the same file, it took nearly one hour to write and 50 minutes to read.
My code is based on the Avro documentation examples, and the schema is relatively simple.

My questions:
- Is this performance difference a known issue?
- Is there something I'm missing (say, a special configuration)?

I've seen the fastavro project, which is much faster at reading, but it has no write support. This will prevent us from using Avro, since we have a lot of Python-based programs that need to persist data.

Thanks!

--
*JU Han*
Data Engineer @ Botify.com
+33 0619608888
