Re: [DISCUSS] Flink's supported APIs and Hive query syntax

2022-03-08 Thread Jingsong Li
Thanks all for your discussions.

I'll share my opinion here:

1. Hive SQL and Hive-like SQL are the absolute mainstay of current
batch ETL in China. Hive + Spark (with its Hive-like SQL) + Databricks
also occupy a large share of the market worldwide.

- Unlike OLAP SQL (such as Presto, which is ANSI SQL rather than Hive
SQL), batch ETL runs periodically, which means that a large number
of batch pipelines have already been built; if they need to be
migrated to a new system, migrating the SQL will be extremely costly.

2. Our current Hive dialect is immature and we need to put more effort
into decoupling it from the Flink planner.

Best,
Jingsong

On Tue, Mar 8, 2022 at 4:27 PM Zou Dan  wrote:
>
> Hi Martijn,
> Thanks for bringing this up.
> Hive SQL (used in both Hive & Spark) plays an important role in batch processing;
> it has almost become the de facto standard in batch processing. In our company,
> there are hundreds of thousands of Spark jobs each day.
> IMO, if we want to promote Flink batch, Hive syntax compatibility is a
> crucial part of it.
> Thanks to this feature, we have migrated 800+ Spark jobs to Flink smoothly.
>
> So, I quite agree with putting more effort into Hive syntax compatibility.
>
> Best,
> Dan Zou
>
> On Mar 7, 2022, at 19:23, Martijn Visser wrote:
>
> query
>
>


Re: [DISCUSS] Flink's supported APIs and Hive query syntax

2022-03-08 Thread Zou Dan
Hi Martijn,
Thanks for bringing this up.
Hive SQL (used in both Hive & Spark) plays an important role in batch processing;
it has almost become the de facto standard in batch processing. In our company,
there are hundreds of thousands of Spark jobs each day.
IMO, if we want to promote Flink batch, Hive syntax compatibility is a crucial
part of it.
Thanks to this feature, we have migrated 800+ Spark jobs to Flink smoothly.

So, I quite agree with putting more effort into Hive syntax compatibility.

Best,
Dan Zou

> On Mar 7, 2022, at 19:23, Martijn Visser wrote:
> 
> query



Re: Re: [DISCUSS] Flink's supported APIs and Hive query syntax

2022-03-08 Thread Jark Wu
Hi Martijn,

Thanks for starting this discussion. I think it's great
for the community to reach a consensus on the roadmap
of Hive query syntax.

I agree that the Hive project is not actively developed nowadays.
However, Hive still occupies the majority of the batch market,
and the Hive ecosystem is even more active now. For example,
Apache Kyuubi [1] is a new project that provides a JDBC server
compatible with HiveServer2, and Apache Iceberg [2] and Apache Hudi [3]
mainly use the Hive Metastore as their table catalog.
Spark SQL is 99% compatible with Hive SQL. We have to admit
that Hive is the open-source de facto standard for batch processing.
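
To make the Metastore point concrete, attaching a Flink batch job to an
existing Hive Metastore today looks roughly like the sketch below. This is a
minimal illustration, not a recipe: the catalog name, database, and conf
directory are placeholders, and flink-connector-hive plus the Hive client
jars are assumed to be on the classpath.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.catalog.hive.HiveCatalog;

    public class HiveCatalogSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inBatchMode().build());

            // Register an existing Hive Metastore as a Flink catalog
            // ("myhive", "default" and the conf dir are placeholder values).
            HiveCatalog hiveCatalog =
                    new HiveCatalog("myhive", "default", "/opt/hive/conf");
            tableEnv.registerCatalog("myhive", hiveCatalog);
            tableEnv.useCatalog("myhive");

            // Tables maintained by Hive/Spark (and Iceberg/Hudi tables synced
            // to the same Metastore) become visible to Flink SQL.
            tableEnv.executeSql("SHOW TABLES").print();
        }
    }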

As far as I can see, almost all the companies in China (including
ByteDance, Kuaishou, NetEase, etc.) are using Hive SQL for batch
processing, even when the underlying engine is Spark.
I don't know how batch users could migrate to Flink if Flink
doesn't provide Hive compatibility. IMO, in the short term,
Hive syntax compatibility is the ticket for us to have a seat
at the batch processing table. In the long term, we can drop it and
focus on Flink SQL itself for both batch and stream processing.

Regarding the maintenance concerns you raised, I think those are good
points and they are in the plan. The Hive dialect is already a
pluggable option, and the implementation is located in the
hive-connector module. We still need some work to make the Hive
dialect rely purely on public APIs, and the Hive connector should be
decoupled from the table planner. At that point, we can move the whole
Hive connector into a separate repository (I guess this is also part
of the plan to externalize connectors).
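
For readers who haven't tried it, switching to the Hive dialect is already
just a configuration flip. A minimal sketch of the Table API side, assuming
a HiveCatalog is registered as above and the Hive dependencies are on the
classpath; the SQL client equivalent is "SET table.sql-dialect = hive;":

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.SqlDialect;
    import org.apache.flink.table.api.TableEnvironment;

    public class HiveDialectSketch {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(
                    EnvironmentSettings.newInstance().inBatchMode().build());

            // Parse subsequent statements with the Hive dialect.
            tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE);

            // ... submit Hive-syntax statements here ...

            // Switch back to the default Flink dialect afterwards.
            tableEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        }
    }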

What do you think?

Best,
Jark

[1]:
https://kyuubi.apache.org/docs/latest/overview/kyuubi_vs_thriftserver.html
[2]: https://iceberg.apache.org/docs/latest/spark-configuration/
[3]: https://hudi.apache.org/docs/next/syncing_metastore/

On Tue, 8 Mar 2022 at 11:46, Mang Zhang  wrote:

> Hi Martijn,
>
> Thanks for driving this discussion.
>
> +1 on efforts on more Hive/Spark syntax compatibility. The Hive/Spark
> syntax is the most popular in batch computing. Within our company, many
> users want to use Flink to unify streaming and batch processing, and some
> users have been running it in production for months. We have also integrated
> Flink with our internal remote shuffle service; Flink saves users a lot of
> development and maintenance costs, and user feedback is very good. Enriching
> Flink's ecosystem gives users more choices, so I think pluggable support for
> Hive/Spark dialects is very necessary. We need better designs for future
> multi-source fusion.
>
> Best regards,
>
> Mang Zhang
>
>
> At 2022-03-07 20:52:42, "Jing Zhang"  wrote:
> >Hi Martijn,
> >
> >Thanks for driving this discussion.
> >
> >+1 on efforts on more Hive syntax compatibility.
> >
> >With the efforts on batch processing in recent versions (1.10~1.15), many
> >users have run batch processing jobs based on Flink.
> >In our team, we are trying to migrate most of the existing online batch
> >jobs from Hive/Spark to Flink. We hope this migration does not require
> >users to modify their SQL.
> >Although Hive is not as popular as it used to be, Hive SQL is still alive
> >because many users still use Hive SQL to run Spark jobs.
> >Therefore, compatibility with more Hive syntax is critical to this
> >migration work.
> >
> >Best,
> >Jing Zhang
> >
> >
> >
> >On Mon, Mar 7, 2022 at 19:23, Martijn Visser wrote:
> >
> >> Hi everyone,
> >>
> >> Flink currently has 4 APIs with multiple language support which can be
> >> used to develop applications:
> >>
> >> * DataStream API, both Java and Scala
> >> * Table API, both Java and Scala
> >> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
> >> * Python API
> >>
> >> Since FLIP-152 [1] the Flink SQL support has been extended to also support
> >> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
> >> more syntax compatibility issues.
> >>
> >> I would like to open a discussion on Flink directly supporting the Hive
> >> query syntax. I have some concerns about whether 100% Hive query syntax
> >> compatibility is indeed something that we should aim for in Flink.
> >>
> >> I can understand that having Hive query syntax support in Flink could help
> >> users due to interoperability and being able to migrate. However:
> >>
> >> - Adding full Hive query syntax support will mean that we go from 6 fully
> >> supported API/language combinations to 7. I think we are currently already
> >> struggling with maintaining the existing combinations, let alone one more.
> >> - Apache Hive is/appears to be a project that's not that actively developed
> >> anymore. The last release was made in January 2021. Its popularity is
> >> rapidly declining in Europe and the United States, also due to Hadoop
> >> becoming less popular.
> >> - Related to the previous topic, other software like Snowflake,
> >> 

Re: [DISCUSS] Flink's supported APIs and Hive query syntax

2022-03-07 Thread Jing Ge
Hi,

Thanks Martijn for driving this discussion. Your concerns are very
rational.

We should do our best to keep Flink's development on the right track. I
would suggest discussing it in a vision/goal-oriented way. Since Flink has
a clear vision of unified batch and stream processing, supporting batch
jobs will be one of the critical core features that helps us reach that
vision and lets Flink have an even bigger impact on the industry. I fully
agree with you that we should not focus on the Hive query syntax. Instead,
we should build a plan/schedule to support batch query syntax in service of
that vision. If there is any conflict between Hive query syntax and common
batch query syntax, we should stick with the common batch query syntax. For
any Hive-specific query syntax that is not supported as a common case by
other batch processing engines, we should think very carefully and implement
it as a dialect extension like you suggested, but only when it is a
critical business requirement with broad impact on many use cases. Last
but not least, from an architectural perspective, it is good to have the
capability to support arbitrary syntax via dialects/extensions/plugins. But
it will also require a lot of effort to make that happen. Trade-offs are
always the key. Currently, I have to agree with you again: we should focus
more on the common (batch) cases.


Best regards,
Jing

On Mon, Mar 7, 2022 at 1:53 PM Jing Zhang  wrote:

> Hi Martijn,
>
> Thanks for driving this discussion.
>
> +1 on efforts on more Hive syntax compatibility.
>
> With the efforts on batch processing in recent versions (1.10~1.15), many
> users have run batch processing jobs based on Flink.
> In our team, we are trying to migrate most of the existing online batch
> jobs from Hive/Spark to Flink. We hope this migration does not require
> users to modify their SQL.
> Although Hive is not as popular as it used to be, Hive SQL is still alive
> because many users still use Hive SQL to run Spark jobs.
> Therefore, compatibility with more Hive syntax is critical to this
> migration work.
>
> Best,
> Jing Zhang
>
>
>
> On Mon, Mar 7, 2022 at 19:23, Martijn Visser wrote:
>
>> Hi everyone,
>>
>> Flink currently has 4 APIs with multiple language support which can be
>> used
>> to develop applications:
>>
>> * DataStream API, both Java and Scala
>> * Table API, both Java and Scala
>> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
>> * Python API
>>
>> Since FLIP-152 [1] the Flink SQL support has been extended to also support
>> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
>> more syntax compatibility issues.
>>
>> I would like to open a discussion on Flink directly supporting the Hive
>> query syntax. I have some concerns about whether 100% Hive query syntax
>> compatibility is indeed something that we should aim for in Flink.
>>
>> I can understand that having Hive query syntax support in Flink could help
>> users due to interoperability and being able to migrate. However:
>>
>> - Adding full Hive query syntax support will mean that we go from 6 fully
>> supported API/language combinations to 7. I think we are currently already
>> struggling with maintaining the existing combinations, let alone one more.
>> - Apache Hive is/appears to be a project that's not that actively
>> developed anymore. The last release was made in January 2021. Its
>> popularity is rapidly declining in Europe and the United States, also due
>> to Hadoop becoming less popular.
>> - Related to the previous topic, other software like Snowflake,
>> Trino/Presto, Databricks are becoming more and more popular. If we add
>> full
>> support for the Hive query syntax, then why not add support for Snowflake
>> and the others?
>> - We are supporting Hive versions that are no longer supported by the Hive
>> community and have known security vulnerabilities. This also makes Flink
>> vulnerable to those types of vulnerabilities.
>> - The current Hive implementation relies on a lot of Flink internals,
>> making Flink hard to maintain, adding lots of tech debt, and making
>> things overly complex.
>>
>> From my perspective, I think it would be better to not have Hive query
>> syntax compatibility directly in Flink itself. Of course we should have a
>> proper Hive connector and a proper Hive catalog to make connectivity with
>> Hive (the versions that are still supported by the Hive community) itself
>> possible. Alternatively, if Hive query syntax is so important, it should
>> not rely on internals but be available as a dialect/pluggable option. That
>> could also open up the possibility to add more syntax support for others
>> in
>> the future, but I really think we should just focus on Flink SQL itself.
>> That's already hard enough to maintain and improve on.
>>
>> I'm looking forward to the thoughts of both Developers and Users, so I'm
>> cross-posting to both mailing lists.
>>
>> Best regards,
>>
>> Martijn Visser
>> https://twitter.com/MartijnVisser82
>>
>> [1]
>> 

Re: [DISCUSS] Flink's supported APIs and Hive query syntax

2022-03-07 Thread Jing Zhang
Hi Martijn,

Thanks for driving this discussion.

+1 on efforts on more Hive syntax compatibility.

With the efforts on batch processing in recent versions (1.10~1.15), many
users have run batch processing jobs based on Flink.
In our team, we are trying to migrate most of the existing online batch
jobs from Hive/Spark to Flink. We hope this migration does not require
users to modify their SQL.
Although Hive is not as popular as it used to be, Hive SQL is still alive
because many users still use Hive SQL to run Spark jobs.
Therefore, compatibility with more Hive syntax is critical to this
migration work.

Best,
Jing Zhang



On Mon, Mar 7, 2022 at 19:23, Martijn Visser wrote:

> Hi everyone,
>
> Flink currently has 4 APIs with multiple language support which can be used
> to develop applications:
>
> * DataStream API, both Java and Scala
> * Table API, both Java and Scala
> * Flink SQL, both in Flink query syntax and Hive query syntax (partially)
> * Python API
>
> Since FLIP-152 [1] the Flink SQL support has been extended to also support
> the Hive query syntax. There is now a follow-up FLINK-26360 [2] to address
> more syntax compatibility issues.
>
> I would like to open a discussion on Flink directly supporting the Hive
> query syntax. I have some concerns about whether 100% Hive query syntax
> compatibility is indeed something that we should aim for in Flink.
>
> I can understand that having Hive query syntax support in Flink could help
> users due to interoperability and being able to migrate. However:
>
> - Adding full Hive query syntax support will mean that we go from 6 fully
> supported API/language combinations to 7. I think we are currently already
> struggling with maintaining the existing combinations, let alone one more.
> - Apache Hive is/appears to be a project that's not that actively developed
> anymore. The last release was made in January 2021. Its popularity is
> rapidly declining in Europe and the United States, also due to Hadoop
> becoming less popular.
> - Related to the previous topic, other software like Snowflake,
> Trino/Presto, Databricks are becoming more and more popular. If we add full
> support for the Hive query syntax, then why not add support for Snowflake
> and the others?
> - We are supporting Hive versions that are no longer supported by the Hive
> community and have known security vulnerabilities. This also makes Flink
> vulnerable to those types of vulnerabilities.
> - The current Hive implementation relies on a lot of Flink internals,
> making Flink hard to maintain, adding lots of tech debt, and making
> things overly complex.
>
> From my perspective, I think it would be better to not have Hive query
> syntax compatibility directly in Flink itself. Of course we should have a
> proper Hive connector and a proper Hive catalog to make connectivity with
> Hive (the versions that are still supported by the Hive community) itself
> possible. Alternatively, if Hive query syntax is so important, it should
> not rely on internals but be available as a dialect/pluggable option. That
> could also open up the possibility to add more syntax support for others in
> the future, but I really think we should just focus on Flink SQL itself.
> That's already hard enough to maintain and improve on.
>
> I'm looking forward to the thoughts of both Developers and Users, so I'm
> cross-posting to both mailing lists.
>
> Best regards,
>
> Martijn Visser
> https://twitter.com/MartijnVisser82
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=165227316
> [2] https://issues.apache.org/jira/browse/FLINK-21529
>