For your Avro files, double-check that Snappy is actually being used: use avro-tools to peek at the metadata in the file, or simply view the head of the file in a text editor. The compression codec used is recorded in the header.
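If avro-tools is not at hand (I believe its getmeta subcommand prints this metadata), the header can also be read directly. The following is only a minimal sketch, assuming the standard Avro object container layout: the magic bytes "Obj" plus 0x01, followed by a metadata map in which the codec is stored under the key avro.codec. The function names are my own and not part of any Avro library, and a missing avro.codec entry is treated as the "null" (uncompressed) codec.

```python
import sys

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag-encoded)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        accum |= (byte[0] & 0x7F) << shift
        if not (byte[0] & 0x80):
            break
        shift += 7
    # Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_header_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def codec_of(stream):
    """Return the codec name, or "null" when the header does not name one."""
    meta = read_header_metadata(stream)
    return meta.get("avro.codec", b"null").decode("utf-8")


if __name__ == "__main__":
    # Usage: python avro_codec.py part-00000.avro [more files...]
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, codec_of(f))
```

Running it over a few of the 45 output files should print "snappy" for each; "null" or "deflate" would mean the Hive compression settings did not take effect for the Avro table.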
Snappy is very fast, so the time to read is most likely dominated by deserialization. Avro will be slower than a trivial deserializer (though more compact), but being many times slower is not expected. I am not entirely sure how Hive's Avro SerDe works -- it is possible there is a performance issue there.

If you were able to get a handful of stack traces (kill -3 or jstack) from the mapper tasks, or profiler output, it would be very insightful.

On 5/23/13 12:42 AM, "nir_zamir" <[email protected]> wrote:

>Hi,
>
>We're examining the storage of our data in Snappy-compressed files.
>Since we want the data's structure to be self contained, we checked it
>with Avro and with Sequence (both are splittable, which should best
>utilize our cluster).
>
>We tested the performance on a 12GB data (CSV) file, and a 4-node
>cluster on production environment (very strong machines).
>
>Compression
>
>What we did here (for test simplicity) is create two Hive tables:
>Avro-based and Sequence-based. Then we enabled Snappy compression and
>INSERTed the data from the RAW table (consisting of the 12GB file).
>
>In terms of compression rate, Avro was better: 72% vs. 57%.
>In both cases there were 45 mappers, and CPU/Mem were very far from
>their limit on all machines.
>Since there was no reduce operator, this created 45 files.
>
>Compression time for Avro took longer: 1.75 minutes vs. 1.2 minutes for
>sequence files.
>
>Decompression
>
>What we did here was this Hive query:
>SELECT COUNT(1) FROM table-name;
>
>Here was the real difference: it took Avro about *75% longer* to
>perform this (3 minutes vs. 0.5 minute).
>This was very surprising since for our strong machines the I/O would be
>expected to be the bottleneck, and since Avro files are smaller, we
>expected them to be faster to decompress.
>The number of mappers in both cases was similar (14 vs. 17) and again,
>CPU/Mem didn't seem to be exhausted.
>Since our most critical time is reading, this issue makes it hard for
>us to be using Avro.
>
>Maybe we're doing something wrong - your input would be much
>appreciated!
>
>Thanks,
>Nir
>
>--
>View this message in context:
>http://apache-avro.679487.n3.nabble.com/Compressed-Avro-vs-compressed-Sequence-unexpected-results-tp4027467.html
>Sent from the Avro - Users mailing list archive at Nabble.com.
