On Tue, Jan 30, 2018 at 1:16 PM, Ryan Blue <b...@apache.org> wrote:

> Thanks, Owen.
>
> I agree, Iceberg addresses a lot of the problems that you're hitting here.
> It doesn't quite go as far as moving all metadata into the file system. You
> can do that in HDFS and in implementations that support atomic rename, but
> not in S3 (Iceberg has an implementation of the HDFS strategy). For S3 you
> need some way of making commits atomic, for which we are using a metastore
> that is far more light-weight. You could also use a ZooKeeper cluster for
> write-side locking, or maybe there are other clever ideas out there.
>
> Even if Iceberg is agnostic to the commit mechanism, it does almost all of
> what you're suggesting and does it in a way that's faster than the current
> metastore while providing snapshot isolation.
>
> rb
>
> On Mon, Jan 29, 2018 at 9:10 AM, Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> You should really look at what the Netflix guys are doing on Iceberg.
>>
>> https://github.com/Netflix/iceberg
>>
>> They have put a lot of thought into how to efficiently handle tabular
>> data in S3. They put all of the metadata in S3 except for a single link
>> to the name of the table's root metadata file.
>>
>> Other advantages of their design:
>>
>> - Efficient atomic addition and removal of files in S3.
>> - Consistent schema evolution across formats.
>> - More flexible partitioning and bucketing.
>>
>> .. Owen
>>
>> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>>> All,
>>>
>>> I have been bouncing around the earth for a while and have had the
>>> privilege of working at 4-5 places. On arrival, each place was in a
>>> variety of states in their Hadoop journey.
>>>
>>> One large company I was at had a ~200 TB Hadoop cluster. They actually
>>> ran Pig, and their ops group REFUSED to support Hive, even though they
>>> had written thousands of lines of Pig macros to deal with selecting
>>> from a partition, or a Pig script file you would import so you would
>>> know what the columns of the data at location /x/y/z are.
>>>
>>> In another lifetime I was at a shop that used Scalding. Again, lots of
>>> custom effort there with Avro and Parquet, all to do things that Hive
>>> would do out of the box. Again, the biggest challenge was the Thrift
>>> service and metastore.
>>>
>>> In the cloud, many people will use a bootstrap script
>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
>>> or 'msck repair'.
>>>
>>> The "rise of the cloud" has changed us all: the metastore being a
>>> database is a hard paradigm to support. Imagine, for example, that I
>>> wrote data to an S3 bucket with Hive, and another group in my company
>>> requires read-only access to this data for an ephemeral request.
>>> Sharing the data is easy, S3 access can be granted; sharing the
>>> metastore and Thrift services is much more complicated.
>>>
>>> So let's think out of the box:
>>>
>>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last
>>>
>>> Datastax was able to build a platform where the filesystem and the
>>> metastore were baked into Cassandra. Even though an HBase user would
>>> not want that, the novel thing about that approach is that the
>>> metastore was not "some extra thing in a database" that you had to deal
>>> with.
>>>
>>> What I am thinking is that, for a user of S3, the metastore should be
>>> in S3, probably in hidden files inside the warehouse/table
>>> directory(ies).
>>>
>>> Think of it as msck repair "on the fly":
>>> https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html
>>>
>>> The implementation could be something like this:
>>>
>>> On startup, read hive.warehouse.dir and look for "_warehouse". That
>>> would help us locate the databases, in the databases we can locate
>>> tables, and with the tables we can locate partitions.
>>>
>>> This will of course scale horribly across tables with 90000000
>>> partitions, but that would not be our use case. For all the people with
>>> "msck repair" in the bootstrap, it is a much cleaner way of using Hive.
>>>
>>> The implementations could even be "stacked": files first, metastore
>>> lookback second.
>>>
>>> It would also be wise to have a tool available in the CLI, "metastore
>>> <table> toJson", making it drop-dead simple to export the schema
>>> definitions.
>>>
>>> Thoughts?
>
> --
> Ryan Blue
Ryan,

Super great work, by the way. Some of the mechanisms are things that Hive
could do, and in some cases already does. For a long time we have had:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java

"implementations that support atomic rename" is essentially
SymlinkTextInputFormat:

/**
 * This input split wraps the FileSplit generated from
 * TextInputFormat.getSplits(), while setting the original link file path
 * as job input path. This is needed because MapOperator relies on the
 * job input path to lookup correct child operators. The target data file
 * is encapsulated in the wrapped FileSplit.
 */

We already in some cases intercept FileInputFormat.getSplits():

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java

Impala also maintains its own file metadata, which falls out of sync if
the data is edited outside of Impala. I.e., if I "insert into" a partition
from Hive, Impala is unaware and you have to issue "refresh".

Things like write-side locking using ZK are more of an implementation
detail. I agree that it is non-trivial, but scores of people are already
running "REFRESH PARTITION" by hand anyway.

One way to look at it: snapshot isolation is a "super tight" version of
SymlinkTextInputFormat. I.e., I can "atomically" compose a file that
describes what files should be in the table, and during the planning phase
I do not need Hadoop to calculate splits. Rough sketches of both ideas
(the manifest commit, and the "_warehouse" discovery from my original
mail) are below.
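
First, the manifest commit. This is a minimal sketch, not working Hive
code: every class and file name here is made up, and it assumes rename()
is atomic, which holds for HDFS but not for S3 (which is exactly where
Ryan's external commit mechanism comes in):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManifestCommit {

  /** Writer side: publish snapshot `id` by writing the full list of data
   *  files to a temp manifest and renaming it into place. The rename is
   *  the atomic commit point (true on HDFS, not on S3). */
  public static void commit(FileSystem fs, Path tableDir, long id,
      List<Path> dataFiles) throws IOException {
    Path tmp = new Path(tableDir, "_manifest." + id + ".tmp");
    Path done = new Path(tableDir, "_manifest." + id);
    try (Writer w = new OutputStreamWriter(fs.create(tmp, true),
        StandardCharsets.UTF_8)) {
      for (Path p : dataFiles) {
        w.write(p.toString());
        w.write('\n');
      }
    }
    if (!fs.rename(tmp, done)) {
      throw new IOException("snapshot " + id + " was already committed");
    }
  }

  /** Reader side: the newest committed manifest *is* the table. Planning
   *  reads one small file instead of listing directories. */
  public static List<Path> currentFiles(FileSystem fs, Path tableDir)
      throws IOException {
    FileStatus[] all = fs.globStatus(new Path(tableDir, "_manifest.*"));
    Path newest = null;
    long best = -1L;
    if (all != null) {
      for (FileStatus s : all) {
        String name = s.getPath().getName();
        if (name.endsWith(".tmp")) {
          continue; // in-flight writer; not committed, so invisible
        }
        long snap = Long.parseLong(name.substring("_manifest.".length()));
        if (snap > best) {
          best = snap;
          newest = s.getPath();
        }
      }
    }
    List<Path> files = new ArrayList<>();
    if (newest == null) {
      return files; // no committed snapshot yet
    }
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        fs.open(newest), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        files.add(new Path(line));
      }
    }
    return files;
  }
}

A reader planning from currentFiles() gets snapshot isolation for free:
it sees either manifest N or manifest N-1, never a half-written file list.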
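
Second, the "_warehouse" discovery. Again a hypothetical sketch, and
every marker-file name here is invented:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileBackedMetastore {

  /** Walk warehouseDir for marker files: a "_warehouse" marker at the
   *  root, a "_db" marker per database dir, a "_table" marker per table
   *  dir. Returns the table directories found. */
  public static List<Path> listTables(FileSystem fs, Path warehouseDir)
      throws IOException {
    List<Path> tables = new ArrayList<>();
    if (!fs.exists(new Path(warehouseDir, "_warehouse"))) {
      return tables; // not a file-backed warehouse; caller falls back
    }
    for (FileStatus db : fs.listStatus(warehouseDir)) {
      if (!db.isDirectory()
          || !fs.exists(new Path(db.getPath(), "_db"))) {
        continue;
      }
      for (FileStatus tbl : fs.listStatus(db.getPath())) {
        // the "_table" marker could hold the schema as JSON
        if (tbl.isDirectory()
            && fs.exists(new Path(tbl.getPath(), "_table"))) {
          tables.add(tbl.getPath());
        }
      }
    }
    return tables;
  }
}

The "stacked" mode would just try this lookup first and fall back to the
Thrift metastore on a miss. And if the _table marker held the schema as
JSON, the proposed "metastore <table> toJson" would be nearly free.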