Hi Scott,
Thanks for the detailed response. It sounds like currently the best
approach is to create custom schemas for Hive that support reading
Avro data files.
Given the current version of Avro, does it work to set the INPUTFORMAT
as AvroInputFormat when creating a table in Hive?
Thanks,
-- Ken
On Nov 1, 2010, at 9:32am, Scott Carey wrote:
On Oct 28, 2010, at 11:43 AM, Ken Krugler wrote:
Hi all,
I'd seen past emails from Scott and Doug about using Avro as the data
format for Hive.
This was back in April/May, and I'm wondering about current state of
the world.
I use Avro data files readable from both Pig and Hive. Currently
this is built around our custom schemas and is not general-purpose
read/write. A key future benefit is easier sharing of arbitrary data
inputs/outputs between Hive, Pig, Java M/R, and anything else. Note
that the Howl project is also attacking this problem. It might make
sense to contribute an Avro backend and adapter for that project, so
that an Avro data file can be read as a table in Howl. That would
be a while off (~a year?) at best if I am the contributor.
Specifically, what's the recommended approach (& known issues) for
using Avro files with Hive?
E.g. Scott mentioned that "Avro files should be better performing and
more compact than sequence files." Has that been proven out?
In general this will be the case. SequenceFiles use Writables.
These often don't store data in a binary format as compact as
Avro's, and are often composed together to form more complicated
Writables. Reading and writing each Writable tends to lead to more
fine-grained access to the stream, which is slower.

However, one could make a specialized Writable for a specific data
type or dataset that is very fast and would outdo what a
general-purpose tool like Avro does. For the most part, in general
use, it is the slightly smaller size of Avro that would be noticed
rather than any performance difference. Compression overhead dwarfs
minor performance differences. The goal of serialization
performance, in my opinion, is to make your choice of compression
the primary factor in overall performance.
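To make the size point concrete, here is a rough sketch (the record
and field names are made up, and it targets the current Avro Java
generic API, so the exact entry points may differ between Avro
releases) that writes the same two-field record once with plain
Writables and once with Avro's binary encoding, then prints the byte
counts. A LongWritable is always 8 bytes, while Avro writes longs as
zig-zag varints:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class SizeComparison {
  public static void main(String[] args) throws Exception {
    // Writable encoding: LongWritable is a fixed 8 bytes regardless of value.
    ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(writableBytes);
    new LongWritable(42L).write(dataOut);
    new Text("hello").write(dataOut);   // vint length + UTF-8 bytes
    dataOut.flush();

    // Avro encoding: a long is a zig-zag varint, so 42 takes a single byte.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    GenericRecord record = new GenericData.Record(schema);
    record.put("id", 42L);
    record.put("name", "hello");

    ByteArrayOutputStream avroBytes = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(avroBytes, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();

    System.out.println("Writable bytes: " + writableBytes.size());  // 8 + 6 = 14
    System.out.println("Avro bytes:     " + avroBytes.size());      // 1 + 6 = 7
  }
}

The gap is small per record, which is why compression, not
serialization, ends up dominating.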
There is no equivalent of MapFile or a columnar storage format in
Avro yet.
He also discussed a minor issue with maps - "Their maps however can
have any intrinsic type as a key (int, long, string, float, double)."
Represent an arbitrary map as an Avro array of (key, value) tuples.
This has no restrictions on the key type. In some sense, that is all
an Avro map is anyway: a special-case array of (key, value) pairs
with a default map data structure in the language API. Some schema
metadata might be required to give Hive the hint that it can treat
such an array as a map.
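As a rough sketch of that layout (the schema, names, and the hint
property are illustrative only, using the Avro Java generic API), an
int-keyed map becomes an array of (key, value) entry records:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class MapAsArrayExample {
  // An int-keyed "map" modeled as an array of (key, value) entry records,
  // since Avro's built-in map type only allows string keys.
  private static final String ENTRY_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Entry\",\"fields\":["
      + "{\"name\":\"key\",\"type\":\"int\"},"
      + "{\"name\":\"value\",\"type\":\"string\"}]}";

  public static void main(String[] args) {
    Schema entrySchema = new Schema.Parser().parse(ENTRY_SCHEMA_JSON);
    Schema arraySchema = Schema.createArray(entrySchema);
    // A made-up property name; some such annotation would be the
    // metadata hint telling Hive to present this array as a map.
    arraySchema.addProp("treatAsMap", "true");

    // One (key, value) entry; the full "map" is just an array of these.
    GenericRecord entry = new GenericData.Record(entrySchema);
    entry.put("key", 7);
    entry.put("value", "seven");

    GenericData.Array<GenericRecord> intKeyedMap =
        new GenericData.Array<GenericRecord>(1, arraySchema);
    intKeyedMap.add(entry);

    System.out.println(arraySchema.toString(true));
  }
}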
And a more serious issue with unions, though this wouldn't directly
impact us as we wouldn't be using that feature.
I have dealt with unions by exposing all branches of the union to
Hive and/or Pig. Branches that are not taken are null. In some
cases I expose an extra field for the branch taken. This does not
work in all cases; in particular, recursive schemas can be
troublesome.
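A rough sketch of that flattening (the helper and column naming are
made up; it uses the generic Avro Java API), producing one column per
branch, nulls for branches not taken, and an extra column recording
which branch was taken:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class UnionFlattener {
  // Expose every branch of a union-typed field as its own column; branches
  // that were not taken come out null, and an extra column records which
  // branch was taken. (Field and column names are illustrative only.)
  static Map<String, Object> flattenUnion(GenericRecord record, String fieldName) {
    Schema unionSchema = record.getSchema().getField(fieldName).schema();
    Object value = record.get(fieldName);
    int taken = GenericData.get().resolveUnion(unionSchema, value);

    Map<String, Object> columns = new LinkedHashMap<String, Object>();
    List<Schema> branches = unionSchema.getTypes();
    for (int i = 0; i < branches.size(); i++) {
      columns.put(fieldName + "_" + branches.get(i).getName(),
          i == taken ? value : null);
    }
    columns.put(fieldName + "_branch", branches.get(taken).getName());
    return columns;
  }
}

Recursive schemas are where this kind of flattening gets troublesome,
since a branch can contain the union again and would need to be
flattened without bound.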
In our situation, we're trying to get the best of both worlds by
leveraging Hive for analytics, and Cascading for workflow, so having
one store in HDFS for both would be a significant win.
Same here, except with Pig in place of Cascading. We have exposed
data to both, but need the ability to create a table in Hive and
read it from Pig, and vice versa. Avro should be a great tool for
that, and it leaves the data open for many more things to access as
well.
I will be working on completing the Avro PigStorage late this year,
and will then have a look at Hive early next year.
Thanks for any input!
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g