Thank you, Xuefu, for bringing up this awesome, detailed proposal! It will 
resolve lots of existing pain points for users like me.

In general, I totally agree that improving Flink SQL's completeness would be a 
much better starting point than building 'Hive on Flink', as the Hive community 
has expressed concerns about Flink's SQL incompleteness and lack of proven 
batch performance in https://issues.apache.org/jira/browse/HIVE-10712. 
Improving Flink SQL seems a more natural direction to start with in order to 
achieve the integration.

Xuefu and Timo have laid out a quite clear path of what to tackle next. Given 
that there are already some efforts going on for items 1, 2, 5, 3, 4, and 6 in 
Xuefu's list, shall we:

1. identify the gaps between a) Xuefu's proposal and the discussion results in 
this thread, and b) all the ongoing work/discussions?
2. then create some new top-level JIRA tickets to keep track of the work and 
host more detailed discussions?
It's going to be a great and influential project, and I'd love to participate 
in it to move Flink SQL's adoption and ecosystem even further.

Thanks,
Bowen


> On Oct 12, 2018, at 3:37 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Thank you, very nice. I fully agree with that. 
> 
>> On Oct 11, 2018, at 7:31 PM, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>> 
>> Hi Jörn,
>> 
>> Thanks for your feedback. Yes, I think Hive on Flink makes sense, and in fact 
>> it is one of the two approaches that I named at the beginning of the thread. 
>> As also pointed out there, this isn't mutually exclusive with the work we 
>> proposed inside Flink; they target different user groups and use cases. 
>> Further, what we proposed to do in Flink should be a good showcase that 
>> demonstrates Flink's capabilities in batch processing and convinces the Hive 
>> community of the worth of a new engine. As you might know, the idea 
>> encountered some doubt and resistance. Nevertheless, we do have a solid plan 
>> for Hive on Flink, which we will execute once Flink SQL is in good shape.
>> 
>> I also agree with you that Flink SQL shouldn't be closely coupled with Hive. 
>> While we mentioned Hive in many of the proposed items, most of them are 
>> coupled only in concepts and functionality rather than in code or libraries. 
>> We are taking advantage of the connector framework in Flink. The only 
>> possible exception is support for Hive built-in UDFs, which we may not make 
>> work out of the box, in order to avoid the coupling. We could, for example, 
>> require users to bring in the Hive library and register the UDFs themselves. 
>> This is subject to further discussion.
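>> 
>> To make that concrete, here is a minimal sketch, assuming the user supplies 
>> the Hive jars on the classpath; the wrapper class below is purely 
>> illustrative, not an existing Flink API:
>> 
>>     import org.apache.flink.table.functions.FunctionContext;
>>     import org.apache.flink.table.functions.ScalarFunction;
>>     import org.apache.hadoop.hive.ql.udf.UDFUpper;
>>     import org.apache.hadoop.io.Text;
>> 
>>     // Thin wrapper delegating to Hive's built-in UDFUpper. The Hive
>>     // classes come from the user-provided Hive library, not from Flink.
>>     public class HiveUpperWrapper extends ScalarFunction {
>>         private transient UDFUpper udf;
>> 
>>         @Override
>>         public void open(FunctionContext context) throws Exception {
>>             udf = new UDFUpper();
>>         }
>> 
>>         public String eval(String s) {
>>             Text result = udf.evaluate(s == null ? null : new Text(s));
>>             return result == null ? null : result.toString();
>>         }
>>     }
>> 
>> The user would then register it explicitly, e.g. 
>> tableEnv.registerFunction("hive_upper", new HiveUpperWrapper()), rather than 
>> Flink shipping such wrappers out of the box.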
>> 
>> #11 is about Flink runtime enhancements that are meant to make task failures 
>> more tolerable (so that a job doesn't have to restart from the beginning in 
>> case of a task failure) and to make task scheduling more resource-efficient. 
>> Flink's current design in those two aspects leans more toward stream 
>> processing, which may not be good enough for batch processing. We will 
>> provide a more detailed design when we get to them.
>> 
>> Please let me know if you have further thoughts or feedback.
>> 
>> Thanks,
>> Xuefu
>> 
>> 
>> ------------------------------------------------------------------
>> Sender: Jörn Franke <jornfra...@gmail.com>
>> Sent at: 2018 Oct 11 (Thu) 13:54
>> Recipient: Xuefu <xuef...@alibaba-inc.com>
>> Cc: vino yang <yanghua1...@gmail.com>; Fabian Hueske <fhue...@gmail.com>; dev 
>> <d...@flink.apache.org>; user <user@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>> 
>> Would it maybe make sense to provide Flink as an engine on Hive 
>> ("Flink on Hive")? E.g., to address 4, 5, 6, 8, 9, 10. This could be more 
>> loosely coupled than integrating Hive into all possible Flink core modules 
>> and thus introducing a very tight dependency on Hive in the core.
>> 1, 2, 3 could be achieved via a connector based on the Flink Table API.
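>> 
>> For illustration, such a connector could plug into the existing 
>> ExternalCatalog support in the Table API. A rough sketch, where 
>> HiveExternalCatalog is hypothetical (not an existing class) and the 
>> metastore URI is illustrative:
>> 
>>     // HiveExternalCatalog would implement Flink's existing
>>     // org.apache.flink.table.catalog.ExternalCatalog interface and
>>     // talk to the Hive metastore behind the scenes.
>>     ExternalCatalog hiveCatalog =
>>         new HiveExternalCatalog("thrift://metastore-host:9083");
>>     tableEnv.registerExternalCatalog("hive", hiveCatalog);
>> 
>>     // Hive tables then become addressable from the Table API:
>>     Table orders = tableEnv.scan("hive", "default", "orders");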
>> This is just a proposal to start this endeavour as independent projects (Hive 
>> engine, connector) to avoid too tight a coupling with Flink. Maybe in the 
>> more distant future, if the Hive integration is heavily demanded, one could 
>> then integrate it more tightly if needed. 
>> 
>> What is meant by 11?
>> On Oct 11, 2018, at 5:01 AM, Zhang, Xuefu <xuef...@alibaba-inc.com> wrote:
>> 
>> Hi Fabian/Vino,
>> 
>> Thank you very much for your encouragement and inquiry. Sorry that I didn't 
>> see Fabian's email until I read Vino's response just now. (Somehow Fabian's 
>> went to the spam folder.)
>> 
>> My proposal contains long-term and short-term goals. Nevertheless, the 
>> effort will focus on the following areas, including Fabian's list:
>> 
>> 1. Hive metastore connectivity - This covers both read and write access, 
>> which means Flink can make full use of Hive's metastore as its catalog (at 
>> least for batch, but this can be extended to streaming as well).
>> 2. Metadata compatibility - Objects (databases, tables, partitions, etc.) 
>> created by Hive can be understood by Flink, and the reverse is also true.
>> 3. Data compatibility - Similar to #2, data produced by Hive can be consumed 
>> by Flink and vice versa.
>> 4. Support for Hive UDFs - For all of Hive's native UDFs, Flink either 
>> provides its own implementation or makes Hive's implementation work in 
>> Flink. Further, for user-created UDFs in Hive, Flink SQL should provide a 
>> mechanism that allows users to import them into Flink without any code 
>> changes.
>> 5. Data types - Flink SQL should support all data types that are available 
>> in Hive.
>> 6. SQL language - Flink SQL should support the SQL standard (such as 
>> SQL:2003) with extensions to support Hive's syntax and language features, 
>> around DDL, DML, and SELECT queries.
>> 7. SQL CLI - this is currently being developed in Flink, but more effort is 
>> needed.
>> 8. Server - provide a server that's compatible with Hive's HiveServer2 
>> thrift APIs, such that HiveServer2 users can reuse their existing clients 
>> (such as beeline) but connect to Flink's thrift server instead (see the 
>> sketch after this list).
>> 9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other 
>> applications to use to connect to its thrift server.
>> 10. Support for other user customizations in Hive, such as Hive SerDes, 
>> storage handlers, etc.
>> 11. Better task failure tolerance and task scheduling in the Flink runtime.
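>> 
>> To illustrate #8 and #9, here is a minimal sketch of what client reuse could 
>> look like, assuming a future Flink thrift server that speaks the HiveServer2 
>> protocol (the host, port, and table below are illustrative):
>> 
>>     import java.sql.Connection;
>>     import java.sql.DriverManager;
>>     import java.sql.ResultSet;
>>     import java.sql.Statement;
>> 
>>     // An existing Hive JDBC client connects unchanged; only the URL now
>>     // points at Flink's thrift server instead of HiveServer2.
>>     Connection conn = DriverManager.getConnection(
>>         "jdbc:hive2://flink-thrift-server:10000/default", "user", "");
>>     try (Statement stmt = conn.createStatement()) {
>>         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders");
>>         while (rs.next()) {
>>             System.out.println(rs.getLong(1));
>>         }
>>     }
>> 
>> A beeline user would likewise just point the existing jdbc:hive2:// URL at 
>> the Flink server.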
>> 
>> As you can see, achieving all of this requires significant effort across 
>> all layers of Flink. However, a short-term goal could include only the core 
>> areas (such as 1, 2, 4, 5, 6, 7) or start at a smaller scope (such as #3, 
>> #6).
>> 
>> Please share your further thoughts. If we generally agree that this is the 
>> right direction, I can come up with a formal proposal quickly, and then we 
>> can follow up with broader discussions.
>> 
>> Thanks,
>> Xuefu
>> 
>> 
>> 
>> ------------------------------------------------------------------
>> Sender: vino yang <yanghua1...@gmail.com>
>> Sent at: 2018 Oct 11 (Thu) 09:45
>> Recipient: Fabian Hueske <fhue...@gmail.com>
>> Cc: dev <d...@flink.apache.org>; Xuefu <xuef...@alibaba-inc.com>; user 
>> <user@flink.apache.org>
>> Subject: Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>> 
>> Hi Xuefu,
>> 
>> I appreciate this proposal, and, like Fabian, I think it would be even 
>> better if you could give more details of the plan.
>> 
>> Thanks, vino.
>> 
>> On Wed, Oct 10, 2018 at 5:27 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>> Hi Xuefu,
>> 
>> Welcome to the Flink community, and thanks for starting this discussion! 
>> Better Hive integration would be really great!
>> Can you go into the details of what you are proposing? I can think of a 
>> couple of ways to improve Flink in that regard:
>> 
>> * Support for Hive UDFs
>> * Support for Hive metadata catalog
>> * Support for HiveQL syntax
>> * ???
>> 
>> Best, Fabian
>> 
>> On Tue, Oct 9, 2018 at 7:22 PM, Zhang, Xuefu <xuef...@alibaba-inc.com> 
>> wrote:
>> Hi all,
>> 
>> Along with the community's effort, inside Alibaba we have explored Flink's 
>> potential as an execution engine not just for stream processing but also 
>> for batch processing. We are encouraged by our findings and have initiated 
>> an effort to make Flink's SQL capabilities full-fledged. When comparing 
>> what's available in Flink to the offerings from competing data processing 
>> engines, we identified a major gap in Flink: good integration with the Hive 
>> ecosystem. This is crucial to the success of Flink SQL and batch processing 
>> due to the well-established data ecosystem around Hive. Therefore, we have 
>> done some initial work along this direction, but a lot of effort is still 
>> needed.
>> 
>> We have two strategies in mind. The first one is to make Flink SQL 
>> full-fledged and well-integrated with the Hive ecosystem. This is a similar 
>> approach to what Spark SQL adopted. The second strategy is to make Hive 
>> itself work with Flink, similar to the proposal in [1]. Each approach bears 
>> its pros and cons, but they don't need to be mutually exclusive, with each 
>> targeting different users and use cases. We believe that both will promote 
>> a much greater adoption of Flink beyond stream processing.
>> 
>> We have been focused on the first approach and would like to showcase 
>> Flink's batch and SQL capabilities with Flink SQL. However, we have also 
>> planned to start strategy #2 as a follow-up effort.
>> 
>> I'm completely new to Flink (with a short bio [2] below), though many of my 
>> colleagues here at Alibaba are long-time contributors. Nevertheless, I'd 
>> like to share our thoughts and invite your early feedback. At the same 
>> time, I am working on a detailed proposal for Flink SQL's integration with 
>> the Hive ecosystem, which will also be shared when ready.
>> 
>> While the ideas are simple, each approach will demand significant effort, 
>> more than what we can afford by ourselves. Thus, input and contributions 
>> from the communities are greatly welcome and appreciated.
>> 
>> Regards,
>> 
>> 
>> Xuefu
>> 
>> References:
>> 
>> [1] https://issues.apache.org/jira/browse/HIVE-10712
>> [2] Xuefu Zhang is a long-time open source veteran who has worked or is 
>> working on many projects under the Apache Foundation, of which he is also 
>> an honored member. About 10 years ago he worked on the Hadoop team at 
>> Yahoo, where those projects had just gotten started. Later he worked at 
>> Cloudera, initiating and leading the development of the Hive on Spark 
>> project in the community and across many organizations. Prior to joining 
>> Alibaba, he worked at Uber, where he rolled out Hive on Spark to all of 
>> Uber's SQL-on-Hadoop workload and significantly improved Uber's cluster 
>> efficiency.
>> 
>> 
