On Oct 28, 2010, at 11:43 AM, Ken Krugler wrote:

> Hi all,
>
> I'd seen past emails from Scott and Doug about using Avro as the data
> format for Hive.
>
> This was back in April/May, and I'm wondering about current state of
> the world.
I use Avro data files readable from both Pig and Hive. Currently this is with our own custom schemas and not general-purpose read/write. A key future benefit is easier sharing of arbitrary data inputs/outputs between Hive, Pig, Java M/R, and anything else.

Note that the Howl project is also attacking this problem. It might make sense to contribute an Avro backend and adapter for that project, so that an Avro data file can be read in as a table in Howl. That would be a while off (~year?) at best if I am the contributor.

> Specifically, what's the recommended approach (& known issues) with
> using Avro files with Hive?
>
> E.g. Scott mentioned that "Avro files should be better performing and
> more compact than sequence files." Has that been proven out?

In general this will be the case. SequenceFiles use Writables. These often don't store data in a binary format as compact as Avro's, and are often composed together to form more complicated Writables. Reading and writing each Writable tends to lead to more fine-grained access to the stream, which is slower. However, one could make a specialized Writable for a specific data type or dataset that is very fast and would out-do what a general-purpose tool like Avro does.

For the most part, in general use, it is the slightly smaller size of Avro that is more likely to be noticed than any performance difference. Compression overhead dwarfs minor performance differences. The goal of serialization performance, in my opinion, is to make your choice of compression the primary factor in performance.

There is currently no equivalent of MapFile or a columnar storage format in Avro (yet).

> He also discussed a minor issue with maps - "Their maps however can
> have any intrinsic type as a key (int, long, string, float, double)."

Represent an arbitrary map as an Avro array of (key, value) tuples. This has no restrictions. In some sense, that is all an Avro map is anyway: a special-case array of (key, value) pairs with a default map data structure in the language API. Some schema metadata might be required to give Hive the hint that it can treat such an array as a map. (There is a small schema sketch at the end of this message.)

> And a more serious issue with unions, though this wouldn't directly
> impact us as we wouldn't be using that feature.

I have dealt with unions by exposing all branches of the union to Hive and/or Pig. Branches that are not taken are null. In some cases I expose an extra field for the branch taken. This does not work in all cases; in particular, recursive schemas can be troublesome. (A sketch of this flattening is also at the end of this message.)

> In our situation, we're trying to get the best of both worlds by
> leveraging Hive for analytics, and Cascading for workflow, so having
> one store in HDFS for both would be a significant win.

Same here, but we replaced Cascading with Pig. We have exposed data to both, but need the ability to create a table in Hive and read it from Pig, and vice-versa. Avro should be a great tool for that, and leave the data open for many more things to access as well. I will be working on completing the Avro PigStorage late this year, and then have a look at Hive early next year.

> Thanks for any input!
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
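Here is a minimal sketch of the array-of-(key, value) idea above, using the generic Avro Java API. The entry record name "LongKeyedEntry" and the long key type are only illustrative; the point is that the key field can be any Avro type, unlike a native Avro map, whose keys are always strings:

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;

  public class LongKeyedMapSketch {
    // A "map" with long keys, modeled as an array of (key, value)
    // records, since native Avro maps only allow string keys.
    static final String SCHEMA_JSON =
        "{\"type\": \"array\", \"items\": {"
      + "   \"type\": \"record\", \"name\": \"LongKeyedEntry\","
      + "   \"fields\": ["
      + "     {\"name\": \"key\",   \"type\": \"long\"},"
      + "     {\"name\": \"value\", \"type\": \"string\"}"
      + "   ]}}";

    public static void main(String[] args) {
      Schema arraySchema = Schema.parse(SCHEMA_JSON);
      Schema entrySchema = arraySchema.getElementType();

      // One (key, value) entry of the "map".
      GenericRecord entry = new GenericData.Record(entrySchema);
      entry.put("key", 42L);
      entry.put("value", "forty-two");

      GenericData.Array<GenericRecord> entries =
          new GenericData.Array<GenericRecord>(1, arraySchema);
      entries.add(entry);
      System.out.println(entries);
    }
  }

A Hive or Pig adapter reading such a file could then present each array of entries as its own map type, given the schema hint mentioned above.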
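And a rough sketch of the union flattening described above. The column names (value_int, value_string, branch) are made up for illustration; the idea is simply one nullable column per union branch, plus an optional column naming the branch taken:

  import java.util.Arrays;

  public class UnionFlattenSketch {
    // Flatten a value of the Avro union ["int", "string"] into one row:
    // one column per branch (untaken branches stay null) plus the
    // name of the branch that was taken.
    public static Object[] flatten(Object unionValue) {
      Integer valueInt = null;    // column for the "int" branch
      String valueString = null;  // column for the "string" branch
      String branch;              // extra column: which branch was taken

      if (unionValue instanceof Integer) {
        valueInt = (Integer) unionValue;
        branch = "int";
      } else {
        valueString = unionValue.toString();
        branch = "string";
      }
      return new Object[] { valueInt, valueString, branch };
    }

    public static void main(String[] args) {
      System.out.println(Arrays.toString(flatten(42)));          // [42, null, int]
      System.out.println(Arrays.toString(flatten("forty-two"))); // [null, forty-two, string]
    }
  }

This falls apart for recursive schemas, where the set of branch columns is unbounded, which is the trouble mentioned above.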
