Given the direction HDFS is going with storage locations, as identified in https://issues.apache.org/jira/browse/HDFS-2832 and https://issues.apache.org/jira/secure/attachment/12597860/20130813-HeterogeneousStorage.pdf, is now the right time to toss out some suggestions for the Hive project on incorporating some of these features?

I don't see it being a hard thing. New partitions in a managed table could be targeted to a certain storage type (say SSD), and a very simple built-in command could then set the storage location of partitions older than a certain date to slower storage (say HDD). Basically, the ideal from a user/administrator perspective is managed tables with partition locations spread across various storage tiers, all seamless to the user running a query.

Say I have 5 partitions:

  day='2014-03-03'
  day='2014-03-02'
  day='2014-03-01'
  day='2014-02-28'
  day='2014-02-27'

March 2 and 3 would be assigned to SSD storage, and Feb 27 through March 1 would be on HDD. Then, at some point in the early morning of March 4 (note that March 4's partition automatically goes onto SSD), a command could be run:

  ALTER TABLE mytable SET part_location = /user/hive/fast_data/mytable/day='2014-02-27' WHERE day='2014-02-27'

This command would do a few things:

1. Ensure the new location doesn't already exist (if it does, the command fails; perhaps this is controllable in the command, i.e. copy the data and don't fail if the directory exists).
2. Create the new part_location.
3. Copy the data from the old location to the new location (and verify this step completes).
4. Update the metadata in Hive to point the managed partition location at the new location.
5. Remove the old data/location.

This would provide a simple command that does all the work of moving things. Yes, it would still be manual (i.e. Hive would not automatically age out older partitions), but that is something that should be managed by the admin anyhow: one simple command does all the work, including moving the data and updating the metadata.

Heck, you could even add a feature here.
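The five steps above can be sketched with a toy in-memory stand-in for HDFS and the metastore. This is purely illustrative (the real command would use the HDFS FileSystem API and the metastore client); all names here are hypothetical:

```python
# Illustrative sketch only: plain dicts stand in for HDFS and the Hive
# metastore; a real implementation would call the FileSystem API and the
# metastore thrift client.

def move_partition(fs, metastore, table, part_key, new_location,
                   fail_if_exists=True):
    """Move a managed partition's data to new_location and update metadata.

    fs        -- dict mapping path -> file contents (toy HDFS)
    metastore -- dict mapping (table, part_key) -> location (toy metastore)
    """
    old_location = metastore[(table, part_key)]

    # 1. Ensure the new location doesn't exist (optionally controllable).
    if fail_if_exists and any(p.startswith(new_location) for p in fs):
        raise RuntimeError("target location already exists: " + new_location)

    # 2 + 3. Create the new location and copy the data, verifying the copy.
    copied = {}
    for path, data in fs.items():
        if path.startswith(old_location):
            copied[new_location + path[len(old_location):]] = data
    fs.update(copied)
    assert all(dst in fs for dst in copied), "copy did not complete"

    # 4. Update metadata to point the managed partition at the new location.
    metastore[(table, part_key)] = new_location

    # 5. Remove the old data/location.
    for path in [p for p in fs if p.startswith(old_location)]:
        del fs[path]
```

Note that the metadata update (step 4) only happens after the copy is verified, and the old data is deleted last, so a failure partway through never leaves the metastore pointing at missing data.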
Instead of just copying the files from the old location, the "archival" process could perform an INSERT OVERWRITE-style command: if lots of smaller files had been appended to the original location, it would use MapReduce to reorganize them into larger files for better compression and storage. Maybe add a "WITH DEFRAG" option to the ALTER statement, where WITH DEFRAG triggers a MapReduce job rather than just a file copy. If this MapReduce job fails, obviously the metadata isn't updated and the old data isn't deleted.

Thoughts?
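To make the WITH DEFRAG idea concrete, here is a rough sketch of the planning step: small input files are binned into groups that would each be rewritten as one larger output file of roughly a target size. This is an illustrative assumption, not Hive's actual merge logic, and the function name and threshold are hypothetical:

```python
# Illustrative only: plan how small files would be coalesced during archival.
# A real WITH DEFRAG would launch an INSERT OVERWRITE-style MapReduce job.

def plan_defrag(file_sizes, target_size):
    """Bin input file sizes into groups of roughly target_size bytes.

    Each returned group is a list of input sizes that one task would
    merge into a single larger output file.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

For example, ten 5-byte files with a 25-byte target would collapse into two output files instead of ten, which is the whole point of defragmenting an append-heavy partition before archiving it.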
