On Oct 28, 2010, at 11:43 AM, Ken Krugler wrote:

> Hi all,
>
> I'd seen past emails from Scott and Doug about using Avro as the data
> format for Hive.
>
> This was back in April/May, and I'm wondering about current state of
> the world.
I use Avro data files readable from both Pig and Hive. Currently this is with our own custom schemas and not general-purpose read/write. A key future benefit is easier sharing of arbitrary data inputs/outputs between Hive, Pig, Java M/R, and anything else.

Note that the Howl project is also attacking this problem. It might make sense to contribute an Avro backend and adapter for that project, so that an Avro data file can be read in as a table in Howl. That would be a while off (~year?) at best if I am the contributor.

> Specifically, what's the recommended approach (& known issues) with
> using Avro files with Hive?
>
> E.g. Scott mentioned that "Avro files should be better performing and
> more compact than sequence files." Has that been proven out?

In general this will be the case. SequenceFiles use Writables. These often don't store data in a binary format as compact as Avro's, and are often composed together to form more complicated Writables. Reading and writing each Writable tends to lead to more fine-grained access to the stream, which is slower. However, one could make a specialized Writable for a specific data type or dataset that is very fast and would out-do what a general-purpose tool like Avro does.

For the most part, in general use, it is the slightly smaller size of Avro that is more likely to be noticed than any performance difference. Compression overhead dwarfs minor performance differences. The goal of serialization performance, in my opinion, is to make your choice of compression the primary factor in performance.

There is currently no equivalent of MapFile or a columnar storage format in Avro (yet).

> He also discussed a minor issue with maps - "Their maps however can
> have any intrinsic type as a key (int, long, string, float, double)."

Represent an arbitrary map as an Avro array of (key, value) tuples. This has no restrictions. In some sense, that is all an Avro map is anyway: a special-case array of (key, value) pairs with a default map data structure in the language API. Some schema metadata might be required to give Hive the hint that it can treat such an array as a map. (There is a small schema sketch at the end of this message.)

> And a more serious issue with unions, though this wouldn't directly
> impact us as we wouldn't be using that feature.

I have dealt with unions by exposing all branches of the union to Hive and/or Pig. Branches that are not taken are null. In some cases I expose an extra field for the branch taken. This does not work in all cases; in particular, recursive schemas can be troublesome. (A sketch of this flattening is also at the end of this message.)

> In our situation, we're trying to get the best of both worlds by
> leveraging Hive for analytics, and Cascading for workflow, so having
> one store in HDFS for both would be a significant win.

Same here, but we replaced Cascading with Pig. We have exposed data to both, but need the ability to create a table in Hive and read it from Pig, and vice-versa. Avro should be a great tool for that, and leave the data open for many more things to access as well. I will be working on completing the Avro PigStorage late this year, and then have a look at Hive early next year.

> Thanks for any input!
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
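Here is a minimal sketch of the array-of-(key, value) idea above, using the generic Avro Java API. The entry record name "LongKeyedEntry" and the long key type are only illustrative; the point is that the key field can be any Avro type, unlike a native Avro map, whose keys are always strings:

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;

  public class LongKeyedMapSketch {
    // A "map" with long keys, modeled as an array of (key, value)
    // records, since native Avro maps only allow string keys.
    static final String SCHEMA_JSON =
        "{\"type\": \"array\", \"items\": {"
      + "   \"type\": \"record\", \"name\": \"LongKeyedEntry\","
      + "   \"fields\": ["
      + "     {\"name\": \"key\",   \"type\": \"long\"},"
      + "     {\"name\": \"value\", \"type\": \"string\"}"
      + "   ]}}";

    public static void main(String[] args) {
      Schema arraySchema = Schema.parse(SCHEMA_JSON);
      Schema entrySchema = arraySchema.getElementType();

      // One (key, value) entry of the "map".
      GenericRecord entry = new GenericData.Record(entrySchema);
      entry.put("key", 42L);
      entry.put("value", "forty-two");

      GenericData.Array<GenericRecord> entries =
          new GenericData.Array<GenericRecord>(1, arraySchema);
      entries.add(entry);
      System.out.println(entries);
    }
  }

A Hive or Pig adapter reading such a file could then present each array of entries as its own map type, given the schema hint mentioned above.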
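And a rough sketch of the union flattening described above. The column names (value_int, value_string, branch) are made up for illustration; the idea is simply one nullable column per union branch, plus an optional column naming the branch taken:

  import java.util.Arrays;

  public class UnionFlattenSketch {
    // Flatten a value of the Avro union ["int", "string"] into one row:
    // one column per branch (untaken branches stay null) plus the
    // name of the branch that was taken.
    public static Object[] flatten(Object unionValue) {
      Integer valueInt = null;    // column for the "int" branch
      String valueString = null;  // column for the "string" branch
      String branch;              // extra column: which branch was taken

      if (unionValue instanceof Integer) {
        valueInt = (Integer) unionValue;
        branch = "int";
      } else {
        valueString = unionValue.toString();
        branch = "string";
      }
      return new Object[] { valueInt, valueString, branch };
    }

    public static void main(String[] args) {
      System.out.println(Arrays.toString(flatten(42)));          // [42, null, int]
      System.out.println(Arrays.toString(flatten("forty-two"))); // [null, forty-two, string]
    }
  }

This falls apart for recursive schemas, where the set of branch columns is unbounded, which is the trouble mentioned above.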
