Given the direction HDFS is going with storage locations, as identified in https://issues.apache.org/jira/browse/HDFS-2832 and https://issues.apache.org/jira/secure/attachment/12597860/20130813-HeterogeneousStorage.pdf, is now the right time to toss out some suggestions for the Hive project on incorporating some of these features?

I don't see it being a hard thing. New partitions in a managed table could be targeted to a certain storage type (say SSD), and a very simple built-in command could then set the storage location of partitions older than a certain date to slower storage (say HDD). Basically, the ideal from a user/administrator perspective is managed tables with partition locations spread across various storage tiers, all seamless to the user running a query.

Say I have 5 partitions:

  day='2014-03-03'
  day='2014-03-02'
  day='2014-03-01'
  day='2014-02-28'
  day='2014-02-27'

March 2 and 3 would be assigned to SSD storage, and Feb 27 through March 1 would be on HDD. Then, at some point in the early morning of March 4 (note that March 4's partition automatically goes onto SSD), a command could be run:

  ALTER TABLE mytable SET part_location = /user/hive/fast_data/mytable/day='2014-02-27' WHERE day='2014-02-27'

This command would do a few things:

1. Ensure the new location doesn't already exist (if it does, the command fails; perhaps this is controllable in the command, i.e. copy the data and don't fail if the directory exists).
2. Create the new part_location.
3. Copy the data from the old location to the new location (and verify this step completes).
4. Update the metadata in Hive to point the managed partition location at the new location.
5. Remove the old data/location.

This would provide a simple command that does all the work of moving things. Yes, it would still be manual (i.e. Hive would not automatically age out older partitions), but that is something that should be managed by the admin anyhow: one simple command does all the work, including moving the data and updating the metadata.

Heck, you could even add a feature here.
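The five steps above can be sketched with a toy in-memory stand-in for HDFS and the metastore. This is purely illustrative (the real command would use the HDFS FileSystem API and the metastore client); all names here are hypothetical:

```python
# Illustrative sketch only: plain dicts stand in for HDFS and the Hive
# metastore; a real implementation would call the FileSystem API and the
# metastore thrift client.

def move_partition(fs, metastore, table, part_key, new_location,
                   fail_if_exists=True):
    """Move a managed partition's data to new_location and update metadata.

    fs        -- dict mapping path -> file contents (toy HDFS)
    metastore -- dict mapping (table, part_key) -> location (toy metastore)
    """
    old_location = metastore[(table, part_key)]

    # 1. Ensure the new location doesn't exist (optionally controllable).
    if fail_if_exists and any(p.startswith(new_location) for p in fs):
        raise RuntimeError("target location already exists: " + new_location)

    # 2 + 3. Create the new location and copy the data, verifying the copy.
    copied = {}
    for path, data in fs.items():
        if path.startswith(old_location):
            copied[new_location + path[len(old_location):]] = data
    fs.update(copied)
    assert all(dst in fs for dst in copied), "copy did not complete"

    # 4. Update metadata to point the managed partition at the new location.
    metastore[(table, part_key)] = new_location

    # 5. Remove the old data/location.
    for path in [p for p in fs if p.startswith(old_location)]:
        del fs[path]
```

Note that the metadata update (step 4) only happens after the copy is verified, and the old data is deleted last, so a failure partway through never leaves the metastore pointing at missing data.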
Instead of just copying the files from the old location, the "archival" process could perform an INSERT OVERWRITE-style command: if lots of smaller files had been appended to the original location, it would use MapReduce to reorganize them into larger files for better compression and storage. Maybe add a "WITH DEFRAG" option to the ALTER statement, where WITH DEFRAG triggers a MapReduce job rather than just a file copy. If this MapReduce job fails, obviously the metadata isn't updated and the old data isn't deleted.

Thoughts?
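To make the WITH DEFRAG idea concrete, here is a rough sketch of the planning step: small input files are binned into groups that would each be rewritten as one larger output file of roughly a target size. This is an illustrative assumption, not Hive's actual merge logic, and the function name and threshold are hypothetical:

```python
# Illustrative only: plan how small files would be coalesced during archival.
# A real WITH DEFRAG would launch an INSERT OVERWRITE-style MapReduce job.

def plan_defrag(file_sizes, target_size):
    """Bin input file sizes into groups of roughly target_size bytes.

    Each returned group is a list of input sizes that one task would
    merge into a single larger output file.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

For example, ten 5-byte files with a 25-byte target would collapse into two output files instead of ten, which is the whole point of defragmenting an append-heavy partition before archiving it.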
