Hi,

We're examining storing our data in Snappy-compressed files. Since we want the data's structure to be self-contained, we tried both Avro and Sequence files (both are splittable, which should make the best use of our cluster).
We tested performance on a 12 GB CSV data file, on a 4-node cluster in our production environment (very strong machines).

Compression

For test simplicity, we created two Hive tables: one Avro-based and one Sequence-based. We then enabled Snappy compression and INSERTed the data from the RAW table (which consists of the 12 GB file). In terms of compression rate, Avro was better: 72% vs. 57%. In both cases there were 45 mappers, and CPU/memory usage was very far from the limit on all machines. Since there was no reduce operator, this created 45 files. Compression took longer for Avro: 1.75 minutes vs. 1.2 minutes for Sequence files.

Decompression

Here we ran this Hive query:

    SELECT COUNT(1) FROM table-name;

This is where the real difference showed: it took Avro *much longer* to perform this (3 minutes vs. 0.5 minutes). This was very surprising: with machines this strong we would expect I/O to be the bottleneck, and since the Avro files are smaller, we expected them to be faster to read and decompress. The number of mappers was similar in both cases (14 vs. 17), and again CPU/memory didn't seem to be exhausted.

Since reading is our most time-critical operation, this issue makes it hard for us to use Avro. Maybe we're doing something wrong. Your input would be much appreciated!

Thanks,
Nir

--
View this message in context: http://apache-avro.679487.n3.nabble.com/Compressed-Avro-vs-compressed-Sequence-unexpected-results-tp4027467.html
Sent from the Avro - Users mailing list archive at Nabble.com.
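For reference, the figures quoted in this message can be turned into compressed sizes and rough scan rates with a few lines of Python. This is only a back-of-the-envelope sketch, and it rests on two assumptions that the message does not state: that "compression rate" means the fraction of space saved (so 72% leaves 28% of the original size), and that 1 GB is 1024 MB. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope arithmetic on the numbers quoted in the message.
# ASSUMPTION: "compression rate" = fraction of space saved, not output/input.
# ASSUMPTION: 1 GB = 1024 MB.

RAW_GB = 12.0  # size of the raw CSV input


def compressed_size_gb(raw_gb, space_saved):
    """Size left after compression, given the fraction of space saved."""
    return raw_gb * (1.0 - space_saved)


def scan_rate_mb_per_s(size_gb, seconds):
    """Compressed MB consumed per second of wall-clock query time."""
    return size_gb * 1024.0 / seconds


avro_gb = compressed_size_gb(RAW_GB, 0.72)
seq_gb = compressed_size_gb(RAW_GB, 0.57)

# Wall-clock times of the SELECT COUNT(1) query, in seconds.
avro_read_s = 3.0 * 60
seq_read_s = 0.5 * 60

# How many times longer the Avro query ran than the Sequence query.
slowdown = avro_read_s / seq_read_s

print(f"Avro:     {avro_gb:.2f} GB, "
      f"{scan_rate_mb_per_s(avro_gb, avro_read_s):.1f} MB/s")
print(f"Sequence: {seq_gb:.2f} GB, "
      f"{scan_rate_mb_per_s(seq_gb, seq_read_s):.1f} MB/s")
print(f"Avro query time / Sequence query time: {slowdown:.1f}")
```

Under these assumptions the Avro output is about 3.4 GB and the Sequence output about 5.2 GB. The scan rates are cluster-wide and include job startup time, so they are only a rough indication of where the time goes.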
