On Tue, Jan 30, 2018 at 1:16 PM, Ryan Blue <b...@apache.org> wrote:

> Thanks, Owen.
>
> I agree, Iceberg addresses a lot of the problems that you're hitting here.
> It doesn't quite go as far as moving all metadata into the file system. You
> can do that in HDFS and in implementations that support atomic rename, but
> not in S3 (Iceberg has an implementation of the HDFS strategy). For S3 you
> need some way of making commits atomic, for which we are using a metastore
> that is far more light-weight. You could also use a ZooKeeper cluster for
> write-side locking, or maybe there are other clever ideas out there.
>
> Even if Iceberg is agnostic to the commit mechanism, it does almost all of
> what you're suggesting and does it in a way that's faster than the current
> metastore while providing snapshot isolation.
>
> rb
>
> On Mon, Jan 29, 2018 at 9:10 AM, Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> You should really look at what the Netflix guys are doing on Iceberg.
>>
>> https://github.com/Netflix/iceberg
>>
>> They have put a lot of thought into how to efficiently handle tabular
>> data in S3. They put all of the metadata in S3 except for a single link
>> to the name of the table's root metadata file.
>>
>> Other advantages of their design:
>>
>> - Efficient atomic addition and removal of files in S3.
>> - Consistent schema evolution across formats.
>> - More flexible partitioning and bucketing.
>>
>> .. Owen
>>
>> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>>> All,
>>>
>>> I have been bouncing around the earth for a while and have had the
>>> privilege of working at 4-5 places. On arrival, each place was in a
>>> variety of states in their Hadoop journey.
>>>
>>> One large company I was at had a ~200 TB Hadoop cluster. They actually
>>> ran Pig, and their ops group REFUSED to support Hive, even though they
>>> had written thousands of lines of Pig macros to deal with selecting
>>> from a partition, or a Pig script file you would import so you would
>>> know what the columns of the data at location /x/y/z are.
>>>
>>> In another lifetime I was at a shop that used Scalding. Again, lots of
>>> custom effort there with Avro and Parquet, all to do things that Hive
>>> would do out of the box. Again, the biggest challenge was the Thrift
>>> service and metastore.
>>>
>>> In the cloud, many people will use a bootstrap script
>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
>>> or 'msck repair'.
>>>
>>> The "rise of the cloud" has changed us all: the metastore being a
>>> database is a hard paradigm to support. Imagine, for example, that I
>>> wrote data to an S3 bucket with Hive, and another group in my company
>>> requires read-only access to this data for an ephemeral request.
>>> Sharing the data is easy, S3 access can be granted; sharing the
>>> metastore and Thrift services is much more complicated.
>>>
>>> So let's think out of the box:
>>>
>>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last
>>>
>>> Datastax was able to build a platform where the filesystem and the
>>> metastore were baked into Cassandra. Even though an HBase user would
>>> not want that, the novel thing about that approach is that the
>>> metastore was not "some extra thing in a database" that you had to deal
>>> with.
>>>
>>> What I am thinking is that, for a user of S3, the metastore should be
>>> in S3, probably in hidden files inside the warehouse/table
>>> directory(ies).
>>>
>>> Think of it as msck repair "on the fly":
>>> https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html
>>>
>>> The implementation could be something like this:
>>>
>>> On startup, read hive.warehouse.dir and look for "_warehouse". That
>>> would help us locate the databases, in the databases we can locate
>>> tables, and with the tables we can locate partitions.
>>>
>>> This will of course scale horribly across tables with 90000000
>>> partitions, but that would not be our use case. For all the people with
>>> "msck repair" in the bootstrap, it is a much cleaner way of using Hive.
>>>
>>> The implementations could even be "stacked": files first, metastore
>>> lookback second.
>>>
>>> It would also be wise to have a tool available in the CLI, "metastore
>>> <table> toJson", making it drop-dead simple to export the schema
>>> definitions.
>>>
>>> Thoughts?
>
> --
> Ryan Blue
Ryan,

Super great work, by the way. Some of the mechanisms are things that Hive
could do, and in some cases already does. For a long time we have had:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java

"implementations that support atomic rename" is essentially
SymlinkTextInputFormat:

/**
 * This input split wraps the FileSplit generated from
 * TextInputFormat.getSplits(), while setting the original link file path
 * as job input path. This is needed because MapOperator relies on the
 * job input path to lookup correct child operators. The target data file
 * is encapsulated in the wrapped FileSplit.
 */

We already in some cases intercept FileInputFormat.getSplits():

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java

Impala also maintains its own file metadata, which falls out of sync if
the data is edited outside of Impala. I.e., if I "insert into" a partition
from Hive, Impala is unaware and you have to issue "refresh".

Things like write-side locking using ZK are more of an implementation
detail. I agree that it is non-trivial, but scores of people are already
running "REFRESH PARTITION" by hand anyway.

One way to look at it: snapshot isolation is a "super tight" version of
SymlinkTextInputFormat. I.e., I can "atomically" compose a file that
describes what files should be in the table, and during the planning phase
I do not need Hadoop to calculate splits. Rough sketches of both ideas
(the manifest commit, and the "_warehouse" discovery from my original
mail) are below.
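
First, the manifest commit. This is a minimal sketch, not working Hive
code: every class and file name here is made up, and it assumes rename()
is atomic, which holds for HDFS but not for S3 (which is exactly where
Ryan's external commit mechanism comes in):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ManifestCommit {

  /** Writer side: publish snapshot `id` by writing the full list of data
   *  files to a temp manifest and renaming it into place. The rename is
   *  the atomic commit point (true on HDFS, not on S3). */
  public static void commit(FileSystem fs, Path tableDir, long id,
      List<Path> dataFiles) throws IOException {
    Path tmp = new Path(tableDir, "_manifest." + id + ".tmp");
    Path done = new Path(tableDir, "_manifest." + id);
    try (Writer w = new OutputStreamWriter(fs.create(tmp, true),
        StandardCharsets.UTF_8)) {
      for (Path p : dataFiles) {
        w.write(p.toString());
        w.write('\n');
      }
    }
    if (!fs.rename(tmp, done)) {
      throw new IOException("snapshot " + id + " was already committed");
    }
  }

  /** Reader side: the newest committed manifest *is* the table. Planning
   *  reads one small file instead of listing directories. */
  public static List<Path> currentFiles(FileSystem fs, Path tableDir)
      throws IOException {
    FileStatus[] all = fs.globStatus(new Path(tableDir, "_manifest.*"));
    Path newest = null;
    long best = -1L;
    if (all != null) {
      for (FileStatus s : all) {
        String name = s.getPath().getName();
        if (name.endsWith(".tmp")) {
          continue; // in-flight writer; not committed, so invisible
        }
        long snap = Long.parseLong(name.substring("_manifest.".length()));
        if (snap > best) {
          best = snap;
          newest = s.getPath();
        }
      }
    }
    List<Path> files = new ArrayList<>();
    if (newest == null) {
      return files; // no committed snapshot yet
    }
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        fs.open(newest), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        files.add(new Path(line));
      }
    }
    return files;
  }
}

A reader planning from currentFiles() gets snapshot isolation for free:
it sees either manifest N or manifest N-1, never a half-written file list.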
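
Second, the "_warehouse" discovery. Again a hypothetical sketch, and
every marker-file name here is invented:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileBackedMetastore {

  /** Walk warehouseDir for marker files: a "_warehouse" marker at the
   *  root, a "_db" marker per database dir, a "_table" marker per table
   *  dir. Returns the table directories found. */
  public static List<Path> listTables(FileSystem fs, Path warehouseDir)
      throws IOException {
    List<Path> tables = new ArrayList<>();
    if (!fs.exists(new Path(warehouseDir, "_warehouse"))) {
      return tables; // not a file-backed warehouse; caller falls back
    }
    for (FileStatus db : fs.listStatus(warehouseDir)) {
      if (!db.isDirectory()
          || !fs.exists(new Path(db.getPath(), "_db"))) {
        continue;
      }
      for (FileStatus tbl : fs.listStatus(db.getPath())) {
        // the "_table" marker could hold the schema as JSON
        if (tbl.isDirectory()
            && fs.exists(new Path(tbl.getPath(), "_table"))) {
          tables.add(tbl.getPath());
        }
      }
    }
    return tables;
  }
}

The "stacked" mode would just try this lookup first and fall back to the
Thrift metastore on a miss. And if the _table marker held the schema as
JSON, the proposed "metastore <table> toJson" would be nearly free.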