Hi Scott,
Thanks for the detailed response. It sounds like currently the best
approach is to create custom schemas for Hive that support reading
Avro data files.
Given the current version of Avro, does it work to set the INPUTFORMAT
as AvroInputFormat when creating a table in Hive?
Thanks,
-- Ken
On Nov 1, 2010, at 9:32am, Scott Carey wrote:
On Oct 28, 2010, at 11:43 AM, Ken Krugler wrote:
Hi all,
I'd seen past emails from Scott and Doug about using Avro as the data
format for Hive.
This was back in April/May, and I'm wondering about current state of
the world.
I use Avro data files readable from both Pig and Hive. Currently
this is built around our custom schemas and is not general-purpose
read/write. A key future benefit is easier sharing of arbitrary data
inputs/outputs between Hive, Pig, Java M/R, and anything else. Note
that the Howl project is also attacking this problem. It might make
sense to contribute an Avro backend and adapter for that project, so
that an Avro data file can be read as a table in Howl. That would
be a while off (~a year?) at best if I am the contributor.
Specifically, what's the recommended approach (& known issues) for
using Avro files with Hive?
E.g. Scott mentioned that "Avro files should be better performing and
more compact than sequence files." Has that been proven out?
In general this will be the case. SequenceFiles use Writables.
These often don't store data in a binary format as compact as
Avro's, and are often composed together to form more complicated
Writables. Reading and writing each Writable tends to lead to more
fine-grained access to the stream, which is slower.

However, one could make a specialized Writable for a specific data
type or dataset that is very fast and would outdo what a
general-purpose tool like Avro does. For the most part, in general
use, it is the slightly smaller size of Avro that would be noticed
rather than any performance difference. Compression overhead dwarfs
minor performance differences. The goal of serialization
performance, in my opinion, is to make your choice of compression
the primary factor in overall performance.
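To make the size point concrete, here is a rough sketch (the record
and field names are made up, and it targets the current Avro Java
generic API, so the exact entry points may differ between Avro
releases) that writes the same two-field record once with plain
Writables and once with Avro's binary encoding, then prints the byte
counts. A LongWritable is always 8 bytes, while Avro writes longs as
zig-zag varints:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class SizeComparison {
  public static void main(String[] args) throws Exception {
    // Writable encoding: LongWritable is a fixed 8 bytes regardless of value.
    ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(writableBytes);
    new LongWritable(42L).write(dataOut);
    new Text("hello").write(dataOut);   // vint length + UTF-8 bytes
    dataOut.flush();

    // Avro encoding: a long is a zig-zag varint, so 42 takes a single byte.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Example\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    GenericRecord record = new GenericData.Record(schema);
    record.put("id", 42L);
    record.put("name", "hello");

    ByteArrayOutputStream avroBytes = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(avroBytes, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();

    System.out.println("Writable bytes: " + writableBytes.size());  // 8 + 6 = 14
    System.out.println("Avro bytes:     " + avroBytes.size());      // 1 + 6 = 7
  }
}

The gap is small per record, which is why compression, not
serialization, ends up dominating.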
There is no equivalent of MapFile or a columnar storage format in
Avro yet.
He also discussed a minor issue with maps - "Their maps however can
have any intrinsic type as a key (int, long, string, float, double)."
Represent an arbitrary map as an Avro array of (key, value) tuples.
This has no restrictions on the key type. In some sense, that is all
an Avro map is anyway: a special-case array of (key, value) pairs
with a default map data structure in the language API. Some schema
metadata might be required to give Hive the hint that it can treat
such an array as a map.
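As a rough sketch of that layout (the schema, names, and the hint
property are illustrative only, using the Avro Java generic API), an
int-keyed map becomes an array of (key, value) entry records:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class MapAsArrayExample {
  // An int-keyed "map" modeled as an array of (key, value) entry records,
  // since Avro's built-in map type only allows string keys.
  private static final String ENTRY_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"Entry\",\"fields\":["
      + "{\"name\":\"key\",\"type\":\"int\"},"
      + "{\"name\":\"value\",\"type\":\"string\"}]}";

  public static void main(String[] args) {
    Schema entrySchema = new Schema.Parser().parse(ENTRY_SCHEMA_JSON);
    Schema arraySchema = Schema.createArray(entrySchema);
    // A made-up property name; some such annotation would be the
    // metadata hint telling Hive to present this array as a map.
    arraySchema.addProp("treatAsMap", "true");

    // One (key, value) entry; the full "map" is just an array of these.
    GenericRecord entry = new GenericData.Record(entrySchema);
    entry.put("key", 7);
    entry.put("value", "seven");

    GenericData.Array<GenericRecord> intKeyedMap =
        new GenericData.Array<GenericRecord>(1, arraySchema);
    intKeyedMap.add(entry);

    System.out.println(arraySchema.toString(true));
  }
}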
And a more serious issue with unions, though this wouldn't directly
impact us as we wouldn't be using that feature.
I have dealt with unions by exposing all branches of the union to
Hive and/or Pig. Branches that are not taken are null. In some
cases I expose an extra field for the branch taken. This does not
work in all cases; in particular, recursive schemas can be
troublesome.
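A rough sketch of that flattening (the helper and column naming are
made up; it uses the generic Avro Java API), producing one column per
branch, nulls for branches not taken, and an extra column recording
which branch was taken:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class UnionFlattener {
  // Expose every branch of a union-typed field as its own column; branches
  // that were not taken come out null, and an extra column records which
  // branch was taken. (Field and column names are illustrative only.)
  static Map<String, Object> flattenUnion(GenericRecord record, String fieldName) {
    Schema unionSchema = record.getSchema().getField(fieldName).schema();
    Object value = record.get(fieldName);
    int taken = GenericData.get().resolveUnion(unionSchema, value);

    Map<String, Object> columns = new LinkedHashMap<String, Object>();
    List<Schema> branches = unionSchema.getTypes();
    for (int i = 0; i < branches.size(); i++) {
      columns.put(fieldName + "_" + branches.get(i).getName(),
          i == taken ? value : null);
    }
    columns.put(fieldName + "_branch", branches.get(taken).getName());
    return columns;
  }
}

Recursive schemas are where this kind of flattening gets troublesome,
since a branch can contain the union again and would need to be
flattened without bound.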
In our situation, we're trying to get the best of both worlds by
leveraging Hive for analytics, and Cascading for workflow, so having
one store in HDFS for both would be a significant win.
Same here, except with Pig in place of Cascading. We have exposed
data to both, but need the ability to create a table in Hive and
read it from Pig, and vice versa. Avro should be a great tool for
that, and it leaves the data open for many more things to access as
well.
I will be working on completing the Avro PigStorage late this year,
and will then have a look at Hive early next year.
Thanks for any input!
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g