Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-01 Thread Bowen Li
nd build information
> in
> > the travis
> >  set +x
> >  echo "-"
> >  echo "Looks like travis-ci is not configured for your fork."
> >  echo "Please setup by swich on 'zeppelin' repository at
> > https://travis-ci.org/profile and travis-ci."
> >  echo "And then make sure 'Build branch updates' option is enabled in
> > the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings.";
> >  echo ""
> >  echo "To trigger CI after setup, you will need ammend your last
> commit
> > with"
> >  echo "git commit --amend"
> >  echo "git push your-remote HEAD --force"
> >  echo ""
> >  echo "See
> >
> http://zeppelin.apache.org/contribution/contributions.html#continuous-integration
> > ."
> >fi
> >
> >exit $RET_CODE
> > else
> >set +x
> >echo "travis_check.py does not exists"
> >exit 1
> > fi
> >
> > Chesnay Schepler wrote on Saturday, June 29, 2019 at 3:17 PM:
> >
> >> Does this imply that a Jenkins job is active as long as the Travis build
> >> runs?
> >>
> >> On 26/06/2019 21:28, Bowen Li wrote:
> >>> Hi,
> >>>
> >>> @Dawid, I think the "long test running" issue I mentioned in the first
> >>> email belongs, as you also said, to "a big effort which is much harder to
> >>> accomplish in a short period of time and may deserve its own separate
> >>> discussion". Thus I didn't include it in what we can do in the foreseeable
> >>> short term.
> >>>
> >>> Besides, I don't think that's the ultimate reason for the lack of build
> >>> resources. Even if the build were shortened to something like 2h, the
> >>> problem I described - no build machine working for 6 or more hours during
> >>> PST daytime - would still happen, because no machine from ASF INFRA's
> >>> pool is allocated to Flink. Having paid close attention to the build
> >>> queue over the past few weekdays, the pattern is pretty clear to me now.
> >>>
> >>> **The ultimate root cause** is that we don't have any **dedicated**
> >>> build resources we can stably rely on. I'm actually OK with waiting a
> >>> long time if build requests are running - that at least means we are
> >>> making progress. But I'm not OK with having no build resources at all.
> >>> A better place to aim at in the short term is to always have at least a
> >>> central pool (say 3 or 5) of machines dedicated to building Flink at any
> >>> time, or maybe to use users' own resources.
> >>>
> >>> @Chesnay @Robert I synced with Jeff offline that Zeppelin community is
> >>> using a Jenkins job to automatically build on users' travis account and
> >>> link the result back to github PR. I guess the Jenkins job would fetch
> >>> latest upstream master and build the PR against it. Jeff has filed
> >>> tickets to learn and get access to the Jenkins infra. It'll be better to
> >>> fully understand it first before judging this approach.
> >>>
> >>> I also heard good things about CircleCI, and ASF INFRA seems to have a
> >> pool
> >>> of build capacity there too. Can be an alternative to consider.
> >>>
> >>>
> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <
> >> dwysakow...@apache.org>
> >>> wrote:
> >>>
> >>>> Sorry to jump in late, but I think Bowen missed the most important
> point
> >>>> from Chesnay's previous message in the summary. The ultimate reason
> for
> >>>> all the problems is that the tests take close to 2 hours to run
> already.
> >>>> I fully support this claim: "Unless people start caring about test
> times
> >>>> before adding them, this issue cannot be solved"
> >>>>
> >>>> This is also another reason why using user's Travis account won't
> help.
> >>>> Every few weeks we reach the user's time limit for a single profile.
> >>>> This makes the user's builds simply fail, u

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-02 Thread Bowen Li
t; "Our rough metrics shows that Flink used over 5800 hours of build time
> > last month. That is equal to EIGHT servers running 24/7 for the ENTIRE
> > MONTH. EIGHT. nonstop.
> > When we discovered this last night, we discussed it some and are going
> > to tune down Flink to allow only five executors maximum. We cannot
> > allow Flink to consume so much of a Foundation shared resource."
> >
> > So yes, we either
> > a) have to heavily reduce our CI usage or
> > b) fund our own, either maintaining it ourselves or donating to Apache.
> >
> > On 02/07/2019 05:11, Bowen Li wrote:
> >> By looking at the git history of the Jenkins script, its core part
> >> was finished in March 2017 (and only two minor update in 2017/2018),
> >> so it's been running for over two years now and feels like Zepplin
> >> community has been quite happy with it. @Jeff Zhang
> >> can you share your insights and user
> >> experience with the Jenkins+Travis approach?
> >>
> >> Things like:
> >>
> >> - has the approach completely solved the resource capacity problem
> >> for Zepplin community? is Zepplin community happy with the result?
> >> - is the whole configuration chain stable (e.g. uptime) enough?
> >> - how often do you need to maintain the Jenkins infra? how many
> >> people are usually involved in maintenance and bug-fixes?
> >>
> >> The downside of this approach seems mostly to be on the maintenance
> >> to me - maintain the script and Jenkins infra.
> >>
> >> ** Having Our Own Travis-CI.com Account **
> >>
> >> Another alternative I've been thinking of is to have our own
> >> travis-ci.com account with paid dedicated resources. Note that
> >> travis-ci.org is the free version and travis-ci.com is the commercial
> >> version. We currently use a shared resource pool managed by the ASF INFRA
> >> team on travis-ci.org, but we have no control over it - we can't see how
> >> it's configured, how many resources are available, how resources are
> >> allocated among Apache projects, etc.
> >> The nice things about having an account on travis-ci.com are:
> >>
> >> - relatively low cost with much better resource guarantee than what
> >> we currently have [1]: $249/month with 5 dedicated concurrency,
> >> $489/month with 10 concurrency
> >> - low maintenance work compared to using Jenkins
> >> - (potentially) no migration cost according to Travis's doc [2]
> >> (pending verification)
> >> - full control over the build capacity/configuration compared to
> >> using ASF INFRA's pool
> >>
> >> I'd be surprised if we as such a vibrant community cannot find and
> >> fund $249*12=$2988 a year in exchange for a much better developer
> >> experience and much higher productivity.
> >>
> >> [1] https://travis-ci.com/plans
> >> [2]
> >>
> https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> >>
> >> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <ches...@apache.org> wrote:
> >>
> >> So yes, the Jenkins job keeps pulling the state from Travis until it
> >> finishes.
> >>
> >> Not sure I'm comfortable with the idea of using Jenkins workers just to
> >> idle for several hours.
> >>
> >> On 29/06/2019 14:56, Jeff Zhang wrote:
> >> > Here's what zeppelin community did, we make a python script to
> >> check the
> >> > build status of pull request.
> >> > Here's script:
> >> > https://github.com/apache/zeppelin/blob/master/travis_check.py
> >> >
> >> > And this is the script we used in Jenkins build job.
> >> >
> >> > if [ -f "travis_check.py" ]; then
> >> >git log -n 1
> >> >STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull
> >> request.*from.*" | sed
> >> > 's/.*GitHub pull request  >> > href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> >> >AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> >> >PR=$(echo $STATUS | awk &#
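
For illustration, the polling idea boils down to something like the rough
sketch below (this is not the actual travis_check.py; the Travis API v3
endpoint, query parameters, and token handling are assumptions made for the
example):

  #!/usr/bin/env bash
  # Hypothetical sketch of a Jenkins-side poller: wait for the latest Travis
  # build of the author's fork/branch and exit with its result.
  # AUTHOR, BRANCH and TRAVIS_TOKEN are placeholders supplied by the job.
  REPO_SLUG="${AUTHOR}%2Fzeppelin"   # URL-encoded "<author>/zeppelin"
  API="https://api.travis-ci.org"

  while true; do
    STATE=$(curl -s -H "Travis-API-Version: 3" \
                 -H "Authorization: token ${TRAVIS_TOKEN}" \
                 "${API}/repo/${REPO_SLUG}/builds?branch.name=${BRANCH}&limit=1" \
            | jq -r '.builds[0].state')
    case "${STATE}" in
      passed) exit 0 ;;                    # Jenkins marks the job green
      failed|errored|canceled) exit 1 ;;   # Jenkins marks the job red
      *) echo "build is ${STATE}, waiting..."; sleep 60 ;;
    esac
  done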

Re: Source Kafka and Sink Hive managed tables via Flink Job

2019-07-03 Thread Bowen Li
Hi Youssef,

You need to provide more background context:

- Which Hive sink are you using? We are working on the official Hive sink
for community and will be released in 1.9. So did you develop yours in
house?
- What do you mean by 1st, 2nd, 3rd window? Do you mean the parallel instances
of the same operator, or do you have 3 windowing operations chained?
- What does your Hive table look like? E.g. is it partitioned or
non-partitioned? If partitioned, how many partitions do you have? is it
writing in static partition or dynamic partition mode? what format? how
large?
- What does your sink do - is each parallelism writing to multiple
partitions or a single partition/table? Is it only appending data or
upserting?

On Wed, Jul 3, 2019 at 1:38 AM Youssef Achbany 
wrote:

> Dear all,
>
> I'm working on a big project and one of the challenges is to read Kafka
> topics and copy them via Hive commands into Hive managed tables in order to
> enable ACID Hive properties.
>
> I tried it but I have an issue with back pressure:
> - The first window read 20,000 events and wrote them into Hive tables
> - The second, third, ... send only 100 events because the write into Hive
> takes more time than the read from a Kafka topic. But writing 100 events or
> 50,000 events takes +/- the same time for Hive.
>
> Has someone already built this source and sink? Could you help on this?
> Or do you have some tips?
> It seems that defining a window size based on the number of events instead of
> time is not possible. Is that true?
>
> Thank you for your help
>
> Youssef
>
> --
> ♻ Be green, keep it on the screen
>


Re: Source Kafka and Sink Hive managed tables via Flink Job

2019-07-03 Thread Bowen Li
BTW,  I'm adding user@ mailing list since this is a user question and
should be asked there.

dev@ mailing list is only for discussions of Flink development. Please see
https://flink.apache.org/community.html#mailing-lists

On Wed, Jul 3, 2019 at 12:34 PM Bowen Li  wrote:

> Hi Youssef,
>
> You need to provide more background context:
>
> - Which Hive sink are you using? We are working on the official Hive sink
> for community and will be released in 1.9. So did you develop yours in
> house?
> - What do you mean by 1st, 2nd, 3rd window? Do you mean the parallel
> instances of the same operator, or do you have 3 windowing operations
> chained?
> - What does your Hive table look like? E.g. is it partitioned or
> non-partitioned? If partitioned, how many partitions do you have? is it
> writing in static partition or dynamic partition mode? what format? how
> large?
> - What does your sink do - is each parallelism writing to multiple
> partitions or a single partition/table? Is it only appending data or
> upserting?
>
> On Wed, Jul 3, 2019 at 1:38 AM Youssef Achbany <
> youssef.achb...@euranova.eu> wrote:
>
>> Dear all,
>>
>> I'm working on a big project and one of the challenges is to read Kafka
>> topics and copy them via Hive commands into Hive managed tables in order to
>> enable ACID Hive properties.
>>
>> I tried it but I have an issue with back pressure:
>> - The first window read 20,000 events and wrote them into Hive tables
>> - The second, third, ... send only 100 events because the write into Hive
>> takes more time than the read from a Kafka topic. But writing 100 events or
>> 50,000 events takes +/- the same time for Hive.
>>
>> Has someone already built this source and sink? Could you help on this?
>> Or do you have some tips?
>> It seems that defining a window size based on the number of events instead
>> of time is not possible. Is that true?
>>
>> Thank you for your help
>>
>> Youssef
>>
>> --
>> ♻ Be green, keep it on the screen
>>
>


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

2019-07-03 Thread Bowen Li
Re: > Are they using their own Travis CI pool, or did they switch to an
entirely different CI service?

I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
currently moving away from ASF's Travis to their own in-house bare-metal
machines at [1] with a custom CI application at [2]. They've seen significant
improvement in both respects - much higher performance and basically no
resource waiting time - a "night-and-day" difference, quoting Wes.

Re: > If we can just switch to our own Travis pool, just for our project,
then this might be something we can do fairly quickly?

I believe so, according to [3] and [4]


[1] https://ci.ursalabs.org/
[2] https://github.com/ursa-labs/ursabot
[3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
[4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com



On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler  wrote:

> Are they using their own Travis CI pool, or did they switch to an
> entirely different CI service?
>
> If we can just switch to our own Travis pool, just for our project, then
> this might be something we can do fairly quickly?
>
> On 03/07/2019 05:55, Bowen Li wrote:
> > I responded in the INFRA ticket [1] that I believe they are using the wrong
> > metric against Flink, and that total build time is a completely different
> > thing from guaranteed build capacity.
> >
> > My response:
> >
> > "As mentioned above, since I started to pay attention to Flink's build
> > queue a few tens of days ago, I'm in Seattle and I saw no build was
> kicking
> > off in PST daytime in weekdays for Flink. Our teammates in China and
> Europe
> > have also reported similar observations. So we need to evaluate how the
> > large total build time came from - if 1) your number and 2) our
> > observations from three locations that cover pretty much a full day, are
> > all true, I **guess** one reason can be that - highly likely the extra
> > build time came from weekends when other Apache projects may be idle and
> > Flink just drains hard its congested queue.
> >
> > Please be aware that we're not complaining about the lack of resources in
> > general; I'm complaining about the lack of **stable, dedicated** resources.
> > An example of the latter: currently, even if no build is in Flink's queue
> > and I submit a request to be the queue head in the PST morning, my build
> > won't even start within 6-8+ hours. That is an absurd amount of waiting
> > time.
> >
> > That is to say, if ASF INFRA decides to adopt a quota system and grants
> > Flink five DEDICATED servers that run all the time, only for Flink, that'll
> > be PERFECT and would totally solve our problem.
> >
> > I feel what's missing in the ASF INFRA's Travis resource pool is some
> level
> > of build capacity SLAs and certainty"
> >
> >
> > Again, I believe these two problems differ in nature: long build time vs.
> > lack of dedicated build resources. That is to say, shortening the build
> > time may or may not relieve the situation. I'm slightly negative on
> > disabling IT cases for PRs - the downside is that we'd risk missing bugs in
> > a PR that the UTs don't catch, which may cost a lot more to fix and may
> > slow down or even block others - but I am open to others' opinions on it.
> >
> > AFAICT from the INFRA ticket [1], donating to ASF INFRA won't be a feasible
> > way to solve our problem, since INFRA's pool is fully shared and they have
> > no control over, or finer insight into, resource allocation to a specific
> > Apache project. As mentioned in [1], Apache Arrow is moving away from the
> > ASF INFRA Travis pool (they are actually surprised Flink hasn't planned to
> > do so). I know that Spark is on its own build infra. If we all agree on
> > funding our own build infra, I'd 

Re: [VOTE] Migrate to sponsored Travis account

2019-07-04 Thread Bowen Li
+1 on approving the migration to our own Travis account. The foreseeable
benefits to the whole community's productivity and iteration speed would be
significant!

I think whether to use Flinkbot or the Travis REST API is an implementation
detail. Once we determine the overall direction, the details can be figured
out.
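
For reference, triggering a build through the Travis REST API (v3) is a single
authenticated POST against a repository's requests endpoint. A minimal sketch,
where the repo slug, branch, and token are placeholders rather than any actual
Flink or Ververica setup:

  # Hypothetical sketch: ask Travis (API v3) to build a given branch of a repo.
  REPO_SLUG="some-org%2Fsome-repo"   # URL-encoded "owner/repo", placeholder
  BRANCH="master"
  curl -s -X POST "https://api.travis-ci.com/repo/${REPO_SLUG}/requests" \
    -H "Travis-API-Version: 3" \
    -H "Content-Type: application/json" \
    -H "Authorization: token ${TRAVIS_TOKEN}" \
    -d "{\"request\": {\"branch\": \"${BRANCH}\"}}"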

The good news is that, from my research on how Arrow and Spark integrate their
own in-house CI services with their GitHub repos, they are both using bots with
the GitHub API. See a typical PR check for those projects at [1] and [2]. Thus,
we are **not alone** on this path.

Specifically for Apache Arrow, they have 'Ursabot', similar to our Flinkbot,
as I shared the link in the discussion. [3] lays out how Ursabot works and
integrates with the GitHub API to trigger builds. I think their documentation
is a bit outdated though - the doc says it cannot report build status back to
GitHub, but from [1] we can see that the build statuses are actually reported.
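
The "report back" half that such bots perform is essentially a commit status
posted through the GitHub Statuses API. A rough sketch, with owner, repo,
commit SHA, build URL, and token all placeholders:

  # Hypothetical sketch: report a CI result for a commit via the GitHub
  # Statuses API; the status then shows up on the corresponding PR.
  curl -s -X POST "https://api.github.com/repos/${OWNER}/${REPO}/statuses/${COMMIT_SHA}" \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"state\": \"success\", \"context\": \"ci/custom-build\", \"target_url\": \"${BUILD_URL}\", \"description\": \"Build passed\"}"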

@Chesnay thanks for taking actions on this. Though I don't have access to
settings of Flink's github repo, I will continue to help push this
initiative in whichever way I can. Wes and Krisztián from Arrow are also
very friendly and helpful, and I can connect you to them to learn their
experience.

[1] https://github.com/apache/arrow/pull/4809
[2] https://github.com/apache/spark/pull/25053
[3] https://github.com/ursa-labs/ursabot#driving-ursabot


On Thu, Jul 4, 2019 at 6:42 AM Hequn Cheng  wrote:

> +1.
>
> And thanks a lot to Chesnay for pushing this.
>
> Best, Hequn
>
> On Thu, Jul 4, 2019 at 8:07 PM Chesnay Schepler 
> wrote:
>
>> Note that the Flinkbot approach isn't that trivial either; we can't
>> _just_ trigger builds for a branch in the apache repo, but would first
>> have to clone the branch/pr into a separate repository (that is owned by
>> the github account that the travis account would be tied to).
>>
>> One roadblock after the next showing up...
>>
>> On 04/07/2019 11:59, Chesnay Schepler wrote:
>> > Small update with mostly bad news:
>> >
>> > INFRA doesn't know whether it is possible, and referred my to Travis
>> > support.
>> > They did point out that it could be problematic in regards to
>> > read/write permissions for the repository.
>> >
>> > From my own findings /so far/ with a test repo/organization, it does
>> > not appear possible to configure the Travis account used for a
>> > specific repository.
>> >
>> > So yeah, if we go down this route we may have to pimp the Flinkbot to
>> > trigger builds through the Travis REST API.
>> >
>> > On 04/07/2019 10:46, Chesnay Schepler wrote:
>> >> I've raised a JIRA
>> >> <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
>> >> inquire whether it would be possible to switch to a different Travis
>> >> account, and if so what steps would need to be taken.
>> >> We need a proper confirmation from INFRA since we are not in full
>> >> control of the flink repository (for example, we cannot access the
>> >> settings page).
>> >>
>> >> If this is indeed possible, Ververica is willing to sponsor a Travis
>> >> account for the Flink project.
>> >> This would provide us with more than enough resources than we need.
>> >>
>> >> Since this makes the project more reliant on resources provided by
>> >> external companies I would like to vote on this.
>> >>
>> >> Please vote on this proposal, as follows:
>> >> [ ] +1, Approve the migration to a Ververica-sponsored Travis
>> >> account, provided that INFRA approves
>> >> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
>> >> account
>> >>
>> >> The vote will be open for at least 24h, and until we have
>> >> confirmation from INFRA. The voting period may be shorter than the
>> >> usual 3 days since our current is effectively not working.
>> >>
>> >> On 04/07/2019 06:51, Bowen Li wrote:
>> >>> Re: > Are they using their own Travis CI pool, or did the switch to
>> >>> an entirely different CI service?
>> >>>
>> >>> I reached out to Wes and Krisztián from Apache Arrow PMC. They are
>> >>> currently moving away from ASF's Travis to their own in-house metal
>> >>> machines at [1] with custom CI application at [2]. They've seen
>> >>> significant improvement w.r.t both much higher performance and
>> >>> basically no resource waiting time, "night-and-

Re: [jira] [Created] (FLINK-13139) Various Hive tests fail on Travis

2019-07-08 Thread Bowen Li
Hi Louis,


Thanks for reporting the issue. The problem is that Hive 2.3.4 is not
compatible with Hadoop 2.4; it requires at least 2.7. I'm not sure yet why the
build succeeded most of the time on Travis CI and only fails occasionally;
maybe it's because the build process somehow has flink-shaded-hadoop-2-uber for
multiple Hadoop 2 versions (2.4, 2.7, etc.) on its classpath.

It's been fixed in FLINK-13134.
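
If anyone wants to verify which Hadoop artifacts actually end up on a module's
classpath, a dependency-tree listing filtered to Hadoop is a quick check (the
module path below is only an example; adjust it to the module under
investigation):

  # Sketch: list Hadoop-related artifacts pulled in by the Hive connector module.
  mvn -pl flink-connectors/flink-connector-hive dependency:tree \
      -Dincludes=org.apache.hadoop,org.apache.flink:flink-shaded-hadoop-2-uber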




On Mon, Jul 8, 2019 at 3:15 AM 不常用邮箱  wrote:

> Hello all:
>
> I meet this problem with can’t import follow class in test file.
> org.apache.flink.table.catalog.hive.HiveCatalog;
> org.apache.flink.table.catalog.hive.HiveTestUtils;
> org.apache.flink.table.catalog.hive.HivePartitionConfig;
> org.apache.flink.table.catalog.hive.HiveCatalogConfig;
> org.apache.flink.table.catalog.hive.HiveDatabaseConfig;
>
> And they even can't import each other. Is some setting wrong with my editor?
> I searched for the problem on Google, which said it may occur when copying
> files.
> Can someone fix this?
>
> Thanks.
> Louis
>
>
>
> --
> Louis
> Email: xu_soft39211...@163.com
>
> > On Jul 8, 2019, at 16:05, Till Rohrmann (JIRA)  wrote:
> >
> > Till Rohrmann created FLINK-13139:
> > -
> >
> > Summary: Various Hive tests fail on Travis
> > Key: FLINK-13139
> > URL: https://issues.apache.org/jira/browse/FLINK-13139
> > Project: Flink
> >  Issue Type: Bug
> >  Components: Connectors / Hive
> >Affects Versions: 1.9.0
> >Reporter: Till Rohrmann
> > Fix For: 1.9.0
> >
> >
> > Various Hive related tests fail on Travis:
> >
> > {code}
> > 06:06:49.654 [ERROR] Errors:
> > 06:06:49.654 [ERROR]   HiveInputFormatTest.createCatalog:66 » Catalog
> Failed to create Hive Metastore...
> > 06:06:49.654 [ERROR]   HiveTableFactoryTest.init:55 » Catalog Failed to
> create Hive Metastore client
> > 06:06:49.654 [ERROR]   HiveTableOutputFormatTest.createCatalog:72 »
> Catalog Failed to create Hive Met...
> > 06:06:49.654 [ERROR]   HiveTableSinkTest.createCatalog:72 » Catalog
> Failed to create Hive Metastore c...
> > 06:06:49.654 [ERROR]   HiveTableSourceTest.createCatalog:67 » Catalog
> Failed to create Hive Metastore...
> > 06:06:49.654 [ERROR]   HiveCatalogGenericMetadataTest.init:49 » Catalog
> Failed to create Hive Metasto...
> > 06:06:49.654 [ERROR]   HiveCatalogHiveMetadataTest.init:55 » Catalog
> Failed to create Hive Metastore ...
> > 06:06:49.654 [ERROR]   HiveGenericUDFTest.testCeil:193->init:387 »
> ExceptionInInitializer
> > 06:06:49.654 [ERROR]   HiveGenericUDFTest.testDecode:160 »
> NoClassDefFound Could not initialize class...
> > 06:06:49.654 [ERROR]
>  HiveSimpleUDFTest.testUDFArray_singleArray:202->init:237 »
> NoClassDefFound Cou...
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFBin:60->init:237 »
> NoClassDefFound Could not initiali...
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFConv:67->init:237 »
> NoClassDefFound Could not initial...
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFJson:85->init:237 »
> NoClassDefFound Could not initial...
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFMinute:126->init:237 »
> ExceptionInInitializer
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFRand:51->init:237 »
> NoClassDefFound Could not initial...
> > 06:06:49.654 [ERROR]
>  HiveSimpleUDFTest.testUDFRegExpExtract:153->init:237 » NoClassDefFound
> Could n...
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFToInteger:188->init:237
> » NoClassDefFound Could not i...
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFUnbase64:166->init:237
> » NoClassDefFound Could not in...
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFUnhex:177->init:237 »
> NoClassDefFound Could not initi...
> > 06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFWeekOfYear:139->init:237
> » NoClassDefFound Could not ...
> > {code}
> >
> > https://api.travis-ci.org/v3/job/555252043/log.txt
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v7.6.3#76005)
>
>


Re: [ANNOUNCE] Rong Rong becomes a Flink committer

2019-07-11 Thread Bowen Li
Congrats, Rong!


On Thu, Jul 11, 2019 at 10:48 AM Oytun Tez  wrote:

> Congratulations Rong!
>
> ---
> Oytun Tez
>
> *M O T A W O R D*
> The World's Fastest Human Translation Platform.
> oy...@motaword.com — www.motaword.com
>
>
> On Thu, Jul 11, 2019 at 1:44 PM Peter Huang 
> wrote:
>
>> Congrats Rong!
>>
>> On Thu, Jul 11, 2019 at 10:40 AM Becket Qin  wrote:
>>
>>> Congrats, Rong!
>>>
>>> On Fri, Jul 12, 2019 at 1:13 AM Xingcan Cui  wrote:
>>>
 Congrats Rong!

 Best,
 Xingcan

 On Jul 11, 2019, at 1:08 PM, Shuyi Chen  wrote:

 Congratulations, Rong!

 On Thu, Jul 11, 2019 at 8:26 AM Yu Li  wrote:

> Congratulations Rong!
>
> Best Regards,
> Yu
>
>
> On Thu, 11 Jul 2019 at 22:54, zhijiang 
> wrote:
>
>> Congratulations Rong!
>>
>> Best,
>> Zhijiang
>>
>> --
>> From:Kurt Young 
> >> Send Time: Thursday, July 11, 2019, 22:54
>> To:Kostas Kloudas 
>> Cc:Jark Wu ; Fabian Hueske ;
>> dev ; user 
>> Subject:Re: [ANNOUNCE] Rong Rong becomes a Flink committer
>>
>> Congratulations Rong!
>>
>> Best,
>> Kurt
>>
>>
>> On Thu, Jul 11, 2019 at 10:53 PM Kostas Kloudas 
>> wrote:
>> Congratulations Rong!
>>
>> On Thu, Jul 11, 2019 at 4:40 PM Jark Wu  wrote:
>> Congratulations Rong Rong!
>> Welcome on board!
>>
>> On Thu, 11 Jul 2019 at 22:25, Fabian Hueske 
>> wrote:
>> Hi everyone,
>>
>> I'm very happy to announce that Rong Rong accepted the offer of the
>> Flink PMC to become a committer of the Flink project.
>>
>> Rong has been contributing to Flink for many years, mainly working on
>> SQL and Yarn security features. He's also frequently helping out on the
>> user@f.a.o mailing lists.
>>
>> Congratulations Rong!
>>
>> Best, Fabian
>> (on behalf of the Flink PMC)
>>
>>
>>



Re: [DISCUSS] Flink project bylaws

2019-07-11 Thread Bowen Li
On Thu, Jul 11, 2019 at 10:38 AM Becket Qin  wrote:

> Thanks everyone for all the comments and feedback. Please see the replies
> below:
>
> 
> Re: Konstantin
>
> > * In addition to a simple "Code Change" we could also add a row for "Code
> > Change requiring a FLIP" with a reference to the FLIP process page. A
> FLIP
> > will have/does have different rules for approvals, etc.
>
>
> Good point. Just added the entry.
>
> ---
> Re: Konstantin
>
> > * For "Code Change" the draft currently requires "one +1 from a committer
> > who has not authored the patch followed by a Lazy approval (not counting
> > the vote of the contributor), moving to lazy majority if a -1 is
> received".
> > In my understanding this means, that a committer always needs a review
> and
> > +1 from another committer. As far as I know this is currently not always
> > the case (often committer authors, contributor reviews & +1s).
> >
>
>
> I think it is worth thinking about how we can make it easy to follow the
> > bylaws e.g. by having more Flink-specific Jira workflows and ticket
> types +
> > corresponding permissions. While this is certainly "Step 2", I believe,
> we
> > really need to make it as easy & transparent as possible, otherwise they
> > will be unintentionally broken.
>
>
> & Re: Till
>
> > For the case of a committer being the author and getting a +1 from a
> > non-committer: I think a committer should know when to ask another
> > committer for feedback or not. Hence, I would not enforce that we
> strictly
> > need a +1 from a committer if the author is a committer but of course
> > encourage it if capacities exist.
>
>
> I am with Robert and Aljoscha on this.
>
> I completely understand the concern here. TBH, in Kafka occasionally
> trivial patches from committers are still merged without following the
> cross-review requirement, but it is rare. That said, I still think an
> additional committer's review makes sense due to the following reasons:
> 1. The bottom line here is that we need to make sure every patch is
> reviewed with a high quality. This is a little difficult to guarantee if
> the review comes from a contributor for many reasons. In some cases, a
> contributor may not have enough knowledge about the project to make a good
> judgement. Also, sometimes the contributors are more eager to get a
> particular issue fixed, so they are willing to lower the review bar.
> 2. One byproduct of such cross review among committers, which I personally
> feel useful, is that it helps gradually form consistent design principles
> and code style. This is because the committers will know how the other
> committers are writing code and learn from each other. So they tend to
> reach some tacit understanding on how things should be done in general.
>
> Another way to think about this is to consider the following two scenarios:
> 1. Reviewing a committer's patch takes a lot of iterations. Then the patch
> needs to be reviewed even if it takes time because there are things
> actually needs to be clarified / changed.
> 2. Reviewing a committer's patch is very smooth and quick, so the patch is
> merged soon. Then reviewing such a patch does not take much time.
>
> Letting another committer review a patch from a committer falls into either
> case 1 or case 2. The best option here is to review the patch, because:
> if it is case 1, the patch actually needs to be reviewed;
> if it is case 2, the review should not take much time anyway.
>
> In contrast, we risk encountering case 1 if we skip the cross-review.
>
> 
> Re: Robert
>
> I replied to your comments in the wiki and made the following modifications
> to resolve some of your comments:
> 1. Added a release manager role section.
> 2. changed the name of "lazy consensus" to "consensus" to align with
> general definition of Apache glossary and other projects.
> 3. review board  -> pull request
>
> -
> Re: Chesnay
>
> The emeritus stuff seems like unnecessary noise.
> >
> As Till mentioned, this is to make sure 2/3 majority is still feasible in
> practice.
>
> There's a bunch of subtle changes in the draft compared to existing
> > "conventions"; we should find a way to highlight these and discuss them
> > one by one.
>
> That is a good suggestion. I am not familiar enough with the current Flink
> convention. Will you help on this? I saw you commented on some part in the
> wiki. Are those complete?
>
> --
>  Re: Aljoscha
>
> How different is this from the Kafka bylaws? I’m asking because I quite
> > like them and wouldn’t mind essentially adopting the Kafka bylaws. I
> mean,
> > it’s open source, and we don’t have to try to re-invent the wheel here.
>
> Ha, you got me on this. The first version of the draft was almost identical
> to Kafka's. But Robert has already caught a few inconsistent places. So it
> might still be worth going through it to make sure we tr

Re: CiBot Update

2019-07-12 Thread Bowen Li
  * only maintains a single comment, updating it for each new build
  * also links in-progress/queued builds, instead of just finished ones.

Want to clarify that the above changes still hold?



On Fri, Jul 12, 2019 at 3:56 PM Chesnay Schepler  wrote:

> Hello all,
>
> on Thursday i pushed an update to the CiBot so that it
>
>   * only maintains a single comment, updating it for each new build
>   * also links in-progress/queued builds, instead of just finished ones.
>
> The update also included a bug that caused the bot to not recognize
> which commits had been verified before, which led to a sharp increase
> in queue times as it repeatedly scheduled builds for the same commit.
> This issue has been fixed, all redundant builds have been removed from
> the queue and all comments have been updated to point to previously
> completed builds.
>
> I apologize for the inconvenience.
>
>


Re: CiBot Update

2019-07-13 Thread Bowen Li
Thanks Chesnay for the update.

A new issue I found is that our bot doesn't seem to update the final CI
status back to github.

E.g. in [1], the CI Report shows "d1aa3f2 : PENDING Build" at the moment,
but the travis build actually passed successfully 14 hours ago [2].

[1] https://github.com/apache/flink/pull/8920#issuecomment-510405859
[2] https://travis-ci.com/flink-ci/flink/builds/119001147



On Fri, Jul 12, 2019 at 11:00 PM Chesnay Schepler 
wrote:

> Yes.
>
> On 13/07/2019 01:56, Bowen Li wrote:
> >* only maintains a single comment, updating it for each new build
> >* also links in-progress/queued builds, instead of just finished ones.
> >
> > Want to clarify that the above changes still hold?
> >
> >
> >
> > On Fri, Jul 12, 2019 at 3:56 PM Chesnay Schepler 
> wrote:
> >
> >> Hello all,
> >>
> >> on Thursday i pushed an update to the CiBot so that it
> >>
> >>* only maintains a single comment, updating it for each new build
> >>* also links in-progress/queued builds, instead of just finished
> ones.
> >>
> >> The update also included a bug that caused the bot to not recognize
> >> which commits had been verified before, which led to a sharp increase
> >> in queue times as it repeatedly scheduled builds for the same commit.
> >> This issue has been fixed, all redundant builds have been removed from
> >> the queue and all comments have been updated to point to previously
> >> completed builds.
> >>
> >> I apologize for the inconvenience.
> >>
> >>
>
>


Re: flink-python failed on Travis

2019-07-17 Thread Bowen Li
Hi Dian,

Is there any update on this? It seems to have been failing for a day.



On Tue, Jul 16, 2019 at 9:35 PM Dian Fu  wrote:

> Thanks for reporting this issue. I will take a look at it.
>
> > On Jul 17, 2019, at 11:50 AM, Danny Chan wrote:
> >
> > I have the same issue ~~
> >
> > Best,
> > Danny Chan
> > On Jul 17, 2019, 11:21 AM +0800, Haibo Sun wrote:
> >> Hi, folks
> >>
> >>
> >> I noticed that all of the Travis tests reported the following failure.
> Is anyone working on this issue?
> >>
> >>
> >> ___ summary
> 
> >> ERROR: py27: InvocationError for command
> /home/travis/build/flink-ci/flink/flink-python/dev/.conda/bin/python3.7 -m
> virtualenv --no-download --python
> /home/travis/build/flink-ci/flink/flink-python/dev/.conda/envs/2.7/bin/python2.7
> py27 (exited with code 1)
> >> py33: commands succeeded
> >> ERROR: py34: InvocationError for command
> /home/travis/build/flink-ci/flink/flink-python/dev/.conda/bin/python3.7 -m
> virtualenv --no-download --python
> /home/travis/build/flink-ci/flink/flink-python/dev/.conda/envs/3.4/bin/python3.4
> py34 (exited with code 100)
> >> py35: commands succeeded
> >> py36: commands succeeded
> >> py37: commands succeeded
> >> tox checks... [FAILED]
> >> PYTHON exited with EXIT CODE: 1.
> >> Trying to KILL watchdog (12990).
> >>
> >>
> >> Best,
> >> Haibo
>
>


Re: [ANNOUNCE] JIRA permissions changed: Only committers can assign somebody to a ticket

2019-07-18 Thread Bowen Li
Shall we announce this in the user ML too? Users who are used to assigning
tickets to themselves should also be aware of this change.

On Thu, Jul 18, 2019 at 3:06 AM Robert Metzger  wrote:

> Hi all,
>
> The permissions for the FLINK Jira project have been changed [1], to *only
> allow committers and PMC members to assign somebody to a Jira ticket.*
>
> Anybody with a Jira account can be assigned to a ticket. There is no need
> for "Contributor" permissions.
>
> This has been discussed in this mailing list thread [2]. More information
> on the contribution process is available on the Flink website [3].
> The goal of this change is to ensure that discussions happen in the JIRA
> ticket, before implementation work has started.
> I'm encouraging all committers to monitor the Jira tickets created in
> "their" components, drive discussions to a consensus and then assign
> somebody to the ticket (indicating that this change has been agreed upon
> and that somebody will review and merge it).
>
> Best,
> Robert
>
>
>
> [1] https://issues.apache.org/jira/browse/INFRA-18644
> [2]
>
> https://lists.apache.org/thread.html/b39e01d636cffa74c85b2f7405a25ec63a38d47eb6e0133d22873478@%3Cdev.flink.apache.org%3E
> [3] https://flink.apache.org/contributing/contribute-code.html
>


Re: [ANNOUNCE] Jiangjie (Becket) Qin has been added as a committer to the Flink project

2019-07-18 Thread Bowen Li
Congrats, Jiangjie!

On Thu, Jul 18, 2019 at 11:07 AM Shuyi Chen  wrote:

> Congrats!
>
> On Thu, Jul 18, 2019 at 10:21 AM Thomas Weise  wrote:
>
> > Congrats!
> >
> >
> > On Thu, Jul 18, 2019 at 9:58 AM Richard Deurwaarder 
> > wrote:
> >
> > > Congrats Becket! :)
> > >
> > > Richard
> > >
> > > On Thu, Jul 18, 2019 at 5:52 PM Xuefu Z  wrote:
> > >
> > > > Congratulation, Becket! At least you're able to assign JIRAs now!
> > > >
> > > > On Thu, Jul 18, 2019 at 8:22 AM Rong Rong 
> wrote:
> > > >
> > > > > Congratulations Becket!
> > > > >
> > > > > --
> > > > > Rong
> > > > >
> > > > > On Thu, Jul 18, 2019 at 7:05 AM Xingcan Cui 
> > > wrote:
> > > > >
> > > > > > Congrats Becket!
> > > > > >
> > > > > > Best,
> > > > > > Xingcan
> > > > > >
> > > > > > On Thu, Jul 18, 2019, 07:17 Dian Fu 
> wrote:
> > > > > >
> > > > > > > Congrats Becket!
> > > > > > >
> > > > > > > > On Jul 18, 2019, at 6:42 PM, Danny Chan wrote:
> > > > > > > >
> > > > > > > >> Congratulations!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Danny Chan
> > > > > > > > On Jul 18, 2019, 6:29 PM +0800, Haibo Sun wrote:
> > > > > > > >> Congratulations Becket!
> > > > > > > >> Best,
> > > > > > > >> Haibo
> > > > > > > >> On 2019-07-18 17:51:06, "Hequn Cheng" wrote:
> > > > > > > >>> Congratulations Becket!
> > > > > > > >>>
> > > > > > > >>> Best, Hequn
> > > > > > > >>>
> > > > > > > >>> On Thu, Jul 18, 2019 at 5:34 PM vino yang <
> > > yanghua1...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >>>
> > > > > > >  Congratulations!
> > > > > > > 
> > > > > > >  Best,
> > > > > > >  Vino
> > > > > > > 
> > > > > > >  Yun Gao wrote on Thursday, July 18, 2019 at 5:31 PM:
> > > > > > > 
> > > > > > > > Congratulations!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Yun
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > --
> > > > > > > > From:Kostas Kloudas 
> > > > > > > > Send Time:2019 Jul. 18 (Thu.) 17:30
> > > > > > > > To:dev 
> > > > > > > > Subject:Re: [ANNOUNCE] Jiangjie (Becket) Qin has been
> added
> > > as
> > > > a
> > > > > > >  committer
> > > > > > > > to the Flink project
> > > > > > > >
> > > > > > > > Congratulations Becket!
> > > > > > > >
> > > > > > > > Kostas
> > > > > > > >
> > > > > > > > On Thu, Jul 18, 2019 at 11:21 AM Guowei Ma <
> > > > guowei@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Congrats Becket!
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Guowei
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Terry Wang wrote on Thursday, July 18, 2019 at 5:17 PM:
> > > > > > > >>
> > > > > > > >>> Congratulations Becket!
> > > > > > > >>>
> > > > > > >  On Jul 18, 2019, at 5:09 PM, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> > > > > > > 
> > > > > > >  Congratulations Becket! Good to have you onboard!
> > > > > > > 
> > > > > > >  On 18/07/2019 10:56, Till Rohrmann wrote:
> > > > > > > > Congrats Becket!
> > > > > > > >
> > > > > > > > On Thu, Jul 18, 2019 at 10:52 AM Jeff Zhang <
> > > > > zjf...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Congratulations Becket!
> > > > > > > >>
> > > > > > > >> Xu Forward wrote on Thursday, July 18, 2019 at 4:39 PM:
> > > > > > > >>
> > > > > > > >>> Congratulations Becket! Well deserved.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Cheers,
> > > > > > > >>>
> > > > > > > >>> forward
> > > > > > > >>>
> > > > > > > >>> Kurt Young wrote on Thursday, July 18, 2019 at 4:20 PM:
> > > > > > > >>>
> > > > > > >  Congrats Becket!
> > > > > > > 
> > > > > > >  Best,
> > > > > > >  Kurt
> > > > > > > 
> > > > > > > 
> > > > > > >  On Thu, Jul 18, 2019 at 4:12 PM JingsongLee <
> > > > > > > >> lzljs3620...@aliyun.com
> > > > > > >  .invalid>
> > > > > > >  wrote:
> > > > > > > 
> > > > > > > > Congratulations Becket!
> > > > > > > >
> > > > > > > > Best, Jingsong Lee
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > --
> > > > > > > > From:Congxian Qiu 
> > > > > > > > Send Time: Thursday, July 18, 2019, 16:09
> > > > > > > > To:dev@flink.apache.org 
> > > > > > > > Subject:Re: [ANNOUNCE] Jiangjie (Becket) Qin has
> > been
> > > > > added
> > > > > > >  as a
> > > > > > >  committer
> > > > > > > > to the Flink project
> > > > > > > >
> > > > > > > > Congratulations Becket! Well deserved.
> > > > > > > >

[ANNOUNCE] Seattle Flink Meetup at Uber on 8/22

2019-08-12 Thread Bowen Li
Hi All !

Join our next Seattle Flink Meetup at Uber Seattle, featuring talks of
[Flink + Kappa+ @ Uber] and [Flink + Pulsar for streaming-first, unified
data processing].

- TALK #1: Moving from Lambda and Kappa Architectures to Kappa+ with Flink
at Uber
- TALK #2: When Apache Pulsar meets Apache Flink

Checkout event details and RSVP at
https://www.meetup.com/seattle-flink/events/263782233/ . See you soon!

Bowen


Re: [DISCUSS] Repository split

2019-08-12 Thread Bowen Li
-1 for rushing to the conclusion that we need to split the repo before
exhausting our efforts to improve the current build/CI mechanism. Besides all
the build system issues mentioned above (no incremental builds, no
flexibility to build only docs or subsets of components), it's hard to keep
configurations (like code style, permissions, etc.) consistent between repos.

IMHO, one area where we can further improve build performance is the CI bot.
From my experience, a few simple but effective changes we can make are 1)
cancel the previous build when a new commit is submitted (this seems to have
been fixed 10 days ago [1]), and 2) cancel the previous build when the PR is
closed, whether merged or abandoned. And there are many more to come.
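
As a rough illustration of change 2), canceling a stale build through the
Travis API (v3) is a single POST; the build id and token below are placeholders
(the bot would look the id up from the builds it previously scheduled for the
PR):

  # Hypothetical sketch: cancel a running Travis build for a closed/updated PR.
  curl -s -X POST "https://api.travis-ci.com/build/${BUILD_ID}/cancel" \
    -H "Travis-API-Version: 3" \
    -H "Authorization: token ${TRAVIS_TOKEN}"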

Though I like the soft split approach Stephan raised slightly better than
the hard split, I hope that's not the ultimate approach either, **unless
really no better way presents itself**, because it still seems to me that
we are trying to identify dependency graphs **manually** just to make up
for the shortcomings of the build tool. Gradle is surely capable of doing
that, as people mentioned, and I have used that capability before. I looked
into Maven previously but didn't get far due to the lack of good
documentation, and thus I'm not sure whether Maven is "modern" enough for
that task. Hopefully we won't need to reinvent the wheel the hard way just
for the sake of compensating for Maven.

[1]
https://github.com/flink-ci/ci-bot/commit/82bb83fd997fac97405fd956d758af100b0f289c



On Mon, Aug 12, 2019 at 7:44 AM Arvid Heise  wrote:

> I split small and medium-sized repositories in several projects for various
> reasons. In general, the more mature a project, the fewer pain after the
> split. If interfaces are somewhat stable, it's naturally easier to work in
> a distributed manner.
>
> However, projects should be split for the right reasons. Robert pointed the
> most important out: growth of somewhat individual communities. Another
> reason would be that we actually want to force better coverage inside the
> modules (for example, adding tests to the core modules when e2e fail).
> Another reason is to actually slow down development: Make sure that a new
> API endpoint is well-crafted before adding the implementation in some
> module. API changes will occur less, when devs have to adopt it throughout
> several modules and feel the pain of users. Sometimes API changes will
> actually become more visible through separate projects.
> One issue that would be addressed that I currently have is reduced
> complexity while onboarding.
>
> In contrast, other issues can be solved without splitting the repository
> and sacrificing development speed: build times can be lowered with
> company-wide build caches (https://gradle.com/ , also for maven, although
> I
> know only the gradle version).
>
> I think that I have not enough experience with the project yet to cast a
> vote. I made good experiences in the past with splitting (although it takes
> time to pay off), but I see many valid points raised.
>
> I do have a strong opinion on reducing build times though and would be
> avail to explore that, but that sounds like a separate discussion to me.
>
> Best,
>
> Arvid
>
> On Mon, Aug 12, 2019 at 4:26 PM Robert Metzger 
> wrote:
>
> > Thanks a lot for starting the discussion Chesnay!
> >
> >
> > I would like to throw in another aspect into the discussion: What if we
> > consider this repo split as a first step towards making connectors,
> machine
> > learning, gelly, table/SQL? independent projects within the ASF, with
> their
> > own mailing lists, committers and JIRA?
> >
> >
> > Of course, we would not establish the new repos as new projects
> > immediately, but after we have found good boundaries between the projects
> > (interfaces, tests, documentation, communities) (6-24 months)
> >
> >
> > Each project (or repo initially) would create separate releases, and
> depend
> > on stable versions.
> >
> > This allows each project to come up with their own release cadence.
> >
> >
> > Also, the projects could establish their own processes. A connectors
> > project would probably have more turnover in terms of new connector
> > contributions, so something like a “connector incubator” would make
> sense?
> > A “young” machine learning project might benefit from a monthly release
> > model initially.
> >
> > I see this as a way of establishing different standards based on the
> > requirements of each project (the concern of double standards has been
> > voiced)
> >
> >
> > With a clearer “separation of concerns”, the connector project would
> report
> > bugs to upstream Flink, they would fix & test it. In the current setup,
> the
> > bug might just be validated through the connector test. A split would
> force
> > upstream Flink to have a proper test in place.
> >
> >
> > To some extend, Flink is already a project that contains different
> > sub-communities, working on the core, table api or machine learning.
> >
> > Maybe Flink’s growth (from a development per

Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer

2019-08-14 Thread Bowen Li
Congratulations Andrey!

On Wed, Aug 14, 2019 at 10:18 PM Rong Rong  wrote:

> Congratulations Andrey!
>
> On Wed, Aug 14, 2019 at 10:14 PM chaojianok  wrote:
>
> > Congratulations Andrey!
> > At 2019-08-14 21:26:37, "Till Rohrmann"  wrote:
> > >Hi everyone,
> > >
> > >I'm very happy to announce that Andrey Zagrebin accepted the offer of
> the
> > >Flink PMC to become a committer of the Flink project.
> > >
> > >Andrey has been an active community member for more than 15 months. He
> has
> > >helped shaping numerous features such as State TTL, FRocksDB release,
> > >Shuffle service abstraction, FLIP-1, result partition management and
> > >various fixes/improvements. He's also frequently helping out on the
> > >user@f.a.o mailing lists.
> > >
> > >Congratulations Andrey!
> > >
> > >Best, Till
> > >(on behalf of the Flink PMC)
> >
>


Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-15 Thread Bowen Li
-1 for RC2.

I found a bug, https://issues.apache.org/jira/browse/FLINK-13741, and I
think it's a blocker. The bug means that currently users who call
`tEnv.listUserDefinedFunctions()` in the Table API or `show functions;` through
SQL are not able to see Flink's built-in functions.

I'm preparing a fix right now.

Bowen


On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai 
wrote:

> Thanks for all the test efforts, verifications and votes so far.
>
> So far, things are looking good, but we still require one more PMC binding
> vote for this RC to be the official release, so I would like to extend the
> vote time for 1 more day, until *Aug. 16th 17:00 CET*.
>
> In the meantime, the release notes for 1.9.0 had only just been finalized
> [1], and could use a few more eyes before closing the vote.
> Any help with checking if anything else should be mentioned there regarding
> breaking changes / known shortcomings would be appreciated.
>
> Cheers,
> Gordon
>
> [1] https://github.com/apache/flink/pull/9438
>
> On Thu, Aug 15, 2019 at 3:58 PM Kurt Young  wrote:
>
> > Great, then I have no other comments on legal check.
> >
> > Best,
> > Kurt
> >
> >
> > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler 
> > wrote:
> >
> > > The licensing items aren't a problem; we don't care about Flink modules
> > > in NOTICE files, and we don't have to update the source-release
> > > licensing since we don't have a pre-built version of the WebUI in the
> > > source.
> > >
> > > On 15/08/2019 15:22, Kurt Young wrote:
> > > > After going through the licenses, I found 2 suspicions but not sure
> if
> > > they
> > > > are
> > > > valid or not.
> > > >
> > > > 1. flink-state-processing-api is packaged in to flink-dist jar, but
> not
> > > > included in
> > > > NOTICE-binary file (the one under the root directory) like other
> > modules.
> > > > 2. flink-runtime-web distributed some JavaScript dependencies through
> > > source
> > > > codes, the licenses and NOTICE file were only updated inside the
> module
> > > of
> > > > flink-runtime-web, but not the NOTICE file and licenses directory
> which
> > > > under
> > > > the  root directory.
> > > >
> > > > Another minor issue I just found is:
> > > > FLINK-13558 tries to include table examples to flink-dist, but I
> cannot
> > > > find it in
> > > > the binary distribution of RC2.
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young  wrote:
> > > >
> > > >> Hi Gordon & Timo,
> > > >>
> > > >> Thanks for the feedback, and I agree with it. I will document this
> in
> > > the
> > > >> release notes.
> > > >>
> > > >> Best,
> > > >> Kurt
> > > >>
> > > >>
> > > >> On Thu, Aug 15, 2019 at 6:14 PM Tzu-Li (Gordon) Tai <
> > > tzuli...@apache.org>
> > > >> wrote:
> > > >>
> > > >>> Hi Kurt,
> > > >>>
> > > >>> With the same argument as before, given that it is mentioned in the
> > > >>> release
> > > >>> announcement that it is a preview feature, I would not block this
> > > release
> > > >>> because of it.
> > > >>> Nevertheless, it would be important to mention this explicitly in
> the
> > > >>> release notes [1].
> > > >>>
> > > >>> Regards,
> > > >>> Gordon
> > > >>>
> > > >>> [1] https://github.com/apache/flink/pull/9438
> > > >>>
> > > >>> On Thu, Aug 15, 2019 at 11:29 AM Timo Walther 
> > > wrote:
> > > >>>
> > >  Hi Kurt,
> > > 
> > >  I agree that this is a serious bug. However, I would not block the
> > >  release because of this. As you said, there is a workaround and
> the
> > >  `execute()` works in the most common case of a single execution.
> We
> > > can
> > >  fix this in a minor release shortly after.
> > > 
> > >  What do others think?
> > > 
> > >  Regards,
> > >  Timo
> > > 
> > > 
> > >  Am 15.08.19 um 11:23 schrieb Kurt Young:
> > > > HI,
> > > >
> > > > We just find a serious bug around blink planner:
> > > > https://issues.apache.org/jira/browse/FLINK-13708
> > > > When user reused the table environment instance, and call
> `execute`
> > >  method
> > > > multiple times for
> > > > different sql, the later call will trigger the earlier ones to be
> > > > re-executed.
> > > >
> > > > It's a serious bug but seems we also have a work around, which is
> > > >>> never
> > > > reuse the table environment
> > > > object. I'm not sure if we should treat this one as blocker issue
> > of
> > >  1.9.0.
> > > > What's your opinion?
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Thu, Aug 15, 2019 at 2:01 PM Gary Yao 
> > wrote:
> > > >
> > > >> +1 (non-binding)
> > > >>
> > > >> Jepsen test suite passed 10 times consecutively
> > > >>
> > > >> On Wed, Aug 14, 2019 at 5:31 PM Aljoscha Krettek <
> > > >>> aljos...@apache.org>
> > > >> wrote:
> > > >>
> > > >>> +1
> > > >>>
> > > >>> I did some testing on a Google Cloud Dataproc cluster (it gives
> >

Re: [VOTE] Apache Flink Release 1.9.0, release candidate #2

2019-08-15 Thread Bowen Li
Hi Jark,

Thanks for letting me know that it's been like this in previous releases.
Though I don't think that's the right behavior, it can be discussed for a
later release. Thus I retract my -1 for RC2.

Bowen


On Thu, Aug 15, 2019 at 7:49 PM Jark Wu  wrote:

> Hi Bowen,
>
> Thanks for reporting this.
> However, I don't think this is an issue. IMO, it is by design.
> The `tEnv.listUserDefinedFunctions()` in Table API and `show functions;` in
> SQL CLI are intended to return only the registered UDFs, not including
> built-in functions.
> This is also the behavior in previous versions.
>
> Best,
> Jark
>
> On Fri, 16 Aug 2019 at 06:52, Bowen Li  wrote:
>
> > -1 for RC2.
> >
> > I found a bug, https://issues.apache.org/jira/browse/FLINK-13741, and I
> > think it's a blocker. The bug means that currently users who call
> > `tEnv.listUserDefinedFunctions()` in the Table API or `show functions;`
> > through SQL are not able to see Flink's built-in functions.
> >
> > I'm preparing a fix right now.
> >
> > Bowen
> >
> >
> > On Thu, Aug 15, 2019 at 8:55 AM Tzu-Li (Gordon) Tai  >
> > wrote:
> >
> > > Thanks for all the test efforts, verifications and votes so far.
> > >
> > > So far, things are looking good, but we still require one more PMC
> > binding
> > > vote for this RC to be the official release, so I would like to extend
> > the
> > > vote time for 1 more day, until *Aug. 16th 17:00 CET*.
> > >
> > > In the meantime, the release notes for 1.9.0 had only just been
> finalized
> > > [1], and could use a few more eyes before closing the vote.
> > > Any help with checking if anything else should be mentioned there
> > regarding
> > > breaking changes / known shortcomings would be appreciated.
> > >
> > > Cheers,
> > > Gordon
> > >
> > > [1] https://github.com/apache/flink/pull/9438
> > >
> > > On Thu, Aug 15, 2019 at 3:58 PM Kurt Young  wrote:
> > >
> > > > Great, then I have no other comments on legal check.
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Thu, Aug 15, 2019 at 9:56 PM Chesnay Schepler  >
> > > > wrote:
> > > >
> > > > > The licensing items aren't a problem; we don't care about Flink
> > modules
> > > > > in NOTICE files, and we don't have to update the source-release
> > > > > licensing since we don't have a pre-built version of the WebUI in
> the
> > > > > source.
> > > > >
> > > > > On 15/08/2019 15:22, Kurt Young wrote:
> > > > > > After going through the licenses, I found 2 suspicions but not
> sure
> > > if
> > > > > they
> > > > > > are
> > > > > > valid or not.
> > > > > >
> > > > > > 1. flink-state-processing-api is packaged in to flink-dist jar,
> but
> > > not
> > > > > > included in
> > > > > > NOTICE-binary file (the one under the root directory) like other
> > > > modules.
> > > > > > 2. flink-runtime-web distributed some JavaScript dependencies
> > through
> > > > > source
> > > > > > codes, the licenses and NOTICE file were only updated inside the
> > > module
> > > > > of
> > > > > > flink-runtime-web, but not the NOTICE file and licenses directory
> > > which
> > > > > > under
> > > > > > the  root directory.
> > > > > >
> > > > > > Another minor issue I just found is:
> > > > > > FLINK-13558 tries to include table examples to flink-dist, but I
> > > cannot
> > > > > > find it in
> > > > > > the binary distribution of RC2.
> > > > > >
> > > > > > Best,
> > > > > > Kurt
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 15, 2019 at 6:19 PM Kurt Young 
> > wrote:
> > > > > >
> > > > > >> Hi Gordon & Timo,
> > > > > >>
> > > > > >> Thanks for the feedback, and I agree with it. I will document
> this
> > > in
> > > > > the
> > > > > >> release notes.
> > > > > >>
> > > > > >> Best,
> > >

Re: [DISCUSS] Reducing build times

2019-08-16 Thread Bowen Li
+1 to Till's points on #2 and #5, especially the potential non-disruptive,
gradual migration approach if we decide to go that route.

To add on, I want to point out that we can actually start with the
flink-shaded project [1], which is a perfect candidate for a PoC. It's much
smaller, totally isolated from and not interfering with the flink project
[2], and it actually covers most of our practical feature requirements for
a build tool - all of which makes it an ideal experimental field.

[1] https://github.com/apache/flink-shaded
[2] https://github.com/apache/flink


On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann  wrote:

> For the sake of keeping the discussion focused and not cluttering the
> discussion thread I would suggest to split the detailed reporting for
> reusing JVMs to a separate thread and cross linking it from here.
>
> Cheers,
> Till
>
> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler 
> wrote:
>
> > Update:
> >
> > TL;DR: table-planner is a good candidate for enabling fork reuse right
> > away, while flink-tests has the potential for huge savings, but we have
> > to figure out some issues first.
> >
> >
> > Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >
> > 4/8 profiles failed.
> >
> > No speedup in libraries, python, blink_planner, 7 minutes saved in
> > libraries (table-planner).
> >
> > The kafka and connectors profiles both fail in kafka tests due to
> > producer leaks, and no speed up could be confirmed so far:
> >
> > java.lang.AssertionError: Detected producer leak. Thread name:
> > kafka-producer-network-thread | producer-239
> > at org.junit.Assert.fail(Assert.java:88)
> > at
> >
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> > at
> >
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> >
> >
> > The tests profile failed due to various errors in migration tests:
> >
> > junit.framework.AssertionFailedError: Did not see the expected
> accumulator
> > results within time limit.
> > at
> >
> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> >
> > *However*, a normal tests run takes 40 minutes, while this one above
> > failed after 19 minutes and is only missing the migration tests (which
> > currently need 6-7 minutes). So we could save somewhere between 15 to 20
> > minutes here.
> >
> >
> > Finally, the misc profiles fails in YARN:
> >
> > java.lang.AssertionError
> > at org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> >
> > No significant speedup could be observed in other modules; for
> > flink-yarn-tests we can maybe get a minute or 2 out of it.
> >
> > On 16/08/2019 10:43, Chesnay Schepler wrote:
> > > There appears to be a general agreement that 1) should be looked into;
> > > I've setup a branch with fork reuse being enabled for all tests; will
> > > report back the results.
> > >
> > > On 15/08/2019 09:38, Chesnay Schepler wrote:
> > >> Hello everyone,
> > >>
> > >> improving our build times is a hot topic at the moment so let's
> > >> discuss the different ways how they could be reduced.
> > >>
> > >>
> > >>Current state:
> > >>
> > >> First up, let's look at some numbers:
> > >>
> > >> 1 full build currently consumes 5h of build time total ("total
> > >> time"), and in the ideal case takes about 1h20m ("run time") to
> > >> complete from start to finish. The run time may fluctuate of course
> > >> depending on the current Travis load. This applies both to builds on
> > >> the Apache and flink-ci Travis.
> > >>
> > >> At the time of writing, the current queue time for PR jobs (reminder:
> > >> running on flink-ci) is about 30 minutes (which basically means that
> > >> we are processing builds at the rate that they come in), however we
> > >> are in an admittedly quiet period right now.
> > >> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> > >> everyone was scrambling to get their changes merged in time for the
> > >> feature freeze.
> > >>
> > >> (Note: Recently optimizations where added to ci-bot where pending
> > >> builds are canceled if a new commit was pushed to the PR or the PR
> > >> was closed, which should prove especially useful during the rush
> > >> hours we see before feature-freezes.)
> > >>
> > >>
> > >>Past approaches
> > >>
> > >> Over the years we have done rather few things to improve this
> > >> situation (hence our current predicament).
> > >>
> > >> Beyond the sporadic speedup of some tests, the only notable reduction
> > >> in total build times was the introduction of cron jobs, which
> > >> consolidated the per-commit matrix from 4 configurations (different
> > >> scala/hadoop versions) to 1.
> > >>
> > >> The separation into multiple build profiles was only a work-around
> > >> for the

[DISCUSS] Upgrade kinesis connector to Apache 2.0 License and include it in official release

2019-08-19 Thread Bowen Li
Hi all,

A while back we discussed upgrading the flink-connector-kinesis module to
the Apache 2.0 license so that its jar can be published as part of official
Flink releases. Given our large user base running Flink with
Kinesis/DynamoDB streams, this would free users from building and maintaining
the module themselves, and improve the user and developer experience. A ticket
was created [1] but has been idle, mainly because Apache 2.0-licensed releases
of some AWS libraries were not yet available at the time.

As of today, all of flink-connector-kinesis's AWS dependencies have
been updated to the Apache 2.0 license and released. They include:

- aws-java-sdk-kinesis
- aws-java-sdk-sts
- amazon-kinesis-client
- amazon-kinesis-producer (Apache 2.0 from 0.13.1, released 18 days ago) [2]
- dynamodb-streams-kinesis-adapter (Apache 2.0 from 1.5.0, released 7 days
ago) [3]

Therefore, I'd suggest we kick off the initiative and aim for release 1.10,
which is roughly 3 months away, leaving us plenty of time to finish.
According to @Dyana 's comment in the ticket [1], it seems the work involves
large chunks of changes split into multiple parts rather than simply upgrading
lib versions, so we can further break the JIRA down into sub-tasks to limit the
scope of each change and ease code review.

@Dyana would you still be interested in taking ownership and driving the
effort forward?

Thanks,
Bowen

[1] https://issues.apache.org/jira/browse/FLINK-12847
[2] https://github.com/awslabs/amazon-kinesis-producer/releases
[3] https://github.com/awslabs/dynamodb-streams-kinesis-adapter/releases


Re: [DISCUSS] Upgrade kinesis connector to Apache 2.0 License and include it in official release

2019-08-20 Thread Bowen Li
@Stephan @Becket the kinesis connector currently uses KCL 1.9. Extensive
changes would be needed to switch to KCL 2.x. I agree with Dyana that, since
KCL 1.x has also been relicensed to Apache 2.0, we can just focus on upgrading
to a newer KCL 1.x minor version for now.

On Tue, Aug 20, 2019 at 7:52 AM Dyana Rose  wrote:

> ok great,
>
> that's done, the PR is rebased and squashed on top of master and is running
> through Travis
>
> https://github.com/apache/flink/pull/9494
>
> Dyana
>
> On Tue, 20 Aug 2019 at 15:32, Tzu-Li (Gordon) Tai 
> wrote:
>
> > Hi Dyana,
> >
> > Regarding your question on the Chinese docs:
> > Since the Chinese counterparts for the Kinesis connector documentation
> > isn't translated yet (see docs/dev/connectors/kinesis.zh.md), for now
> you
> > can simply just sync whatever changes you made to the English doc to the
> > Chinese one as well.
> >
> > Cheers,
> > Gordon
> >
>


Re: [VOTE] Apache Flink 1.9.0, release candidate #3

2019-08-20 Thread Bowen Li
+1 non-binding

- built from source with default profile
- manually ran SQL and Table API tests for Flink's metadata integration
with Hive Metastore in a local cluster (a sketch of a similar local setup
follows below)
- manually ran SQL tests for batch capability with the Blink planner and Hive
integration (source/sink/UDF) in a local cluster
- file formats covered: CSV, ORC, Parquet
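
For anyone who wants to do a similar local check, here is a minimal sketch of
registering a HiveCatalog from the Table API. The catalog name, default
database, Hive conf dir, and Hive version below are placeholder values, and
the exact setup may differ per environment; please treat the Hive integration
docs as the authoritative reference.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveCatalogSmokeTest {
    public static void main(String[] args) {
        // Blink planner in batch mode, matching the batch tests mentioned above.
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build());

        // Placeholder values: catalog name, default database, Hive conf dir, Hive version.
        HiveCatalog hive = new HiveCatalog("myhive", "default", "/opt/hive-conf", "2.3.4");
        tableEnv.registerCatalog("myhive", hive);
        tableEnv.useCatalog("myhive");

        // Query a table whose metadata lives in the Hive Metastore.
        Table result = tableEnv.sqlQuery("SELECT count(*) FROM some_hive_table");
    }
}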


On Tue, Aug 20, 2019 at 10:23 PM Gary Yao  wrote:

> +1 (non-binding)
>
> Reran Jepsen tests 10 times.
>
> On Wed, Aug 21, 2019 at 5:35 AM vino yang  wrote:
>
> > +1 (non-binding)
> >
> > - checkout source code and build successfully
> > - started a local cluster and ran some example jobs successfully
> > - verified signatures and hashes
> > - checked release notes and post
> >
> > Best,
> > Vino
> >
> > Stephan Ewen  于2019年8月21日周三 上午4:20写道:
> >
> > > +1 (binding)
> > >
> > >  - Downloaded the binary release tarball
> > >  - started a standalone cluster with four nodes
> > >  - ran some examples through the Web UI
> > >  - checked the logs
> > >  - created a project from the Java quickstarts maven archetype
> > >  - ran a multi-stage DataSet job in batch mode
> > >  - killed as TaskManager and verified correct restart behavior,
> including
> > > failover region backtracking
> > >
> > >
> > > I found a few issues, and a common theme here is confusing error
> > reporting
> > > and logging.
> > >
> > > (1) When testing batch failover and killing a TaskManager, the job
> > reports
> > > as the failure cause "org.apache.flink.util.FlinkException: The
> assigned
> > > slot 6d0e469d55a2630871f43ad0f89c786c_0 was removed."
> > > I think that is a pretty bad error message, as a user I don't know
> > what
> > > that means. Some internal book keeping thing?
> > > You need to know a lot about Flink to understand that this means
> > > "TaskManager failure".
> > > https://issues.apache.org/jira/browse/FLINK-13805
> > > I would not block the release on this, but think this should get
> > pretty
> > > urgent attention.
> > >
> > > (2) The Metric Fetcher floods the log with error messages when a
> > > TaskManager is lost.
> > >  There are many exceptions being logged by the Metrics Fetcher due
> to
> > > not reaching the TM any more.
> > >  This pollutes the log and drowns out the original exception and
> the
> > > meaningful logs from the scheduler/execution graph.
> > >  https://issues.apache.org/jira/browse/FLINK-13806
> > >  Again, I would not block the release on this, but think this
> should
> > > get pretty urgent attention.
> > >
> > > (3) If you put "web.submit.enable: false" into the configuration, the
> web
> > > UI will still display the "SubmitJob" page, but errors will
> > > continuously pop up, stating "Unable to load requested file /jars."
> > > https://issues.apache.org/jira/browse/FLINK-13799
> > >
> > > (4) REST endpoint logs ERROR level messages when selecting the
> > > "Checkpoints" tab for batch jobs. That does not seem correct.
> > >  https://issues.apache.org/jira/browse/FLINK-13795
> > >
> > > Best,
> > > Stephan
> > >
> > >
> > >
> > >
> > > On Tue, Aug 20, 2019 at 11:32 AM Tzu-Li (Gordon) Tai <
> > tzuli...@apache.org>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Legal checks:
> > > > - verified signatures and hashes
> > > > - New bundled Javascript dependencies for flink-runtime-web are
> > correctly
> > > > reflected under licenses-binary and NOTICE file.
> > > > - locally built from source (Scala 2.12, without Hadoop)
> > > > - No missing artifacts in staging repo
> > > > - No binaries in source release
> > > >
> > > > Functional checks:
> > > > - Quickstart working (both in IDE + job submission)
> > > > - Simple State Processor API program that performs offline key schema
> > > > migration (RocksDB backend). Generated savepoint is valid to restore
> > > from.
> > > > - All E2E tests pass locally
> > > > - Didn’t notice any issues with the new WebUI
> > > >
> > > > Cheers,
> > > > Gordon
> > > >
> > > > On Tue, Aug 20, 2019 at 3:53 AM Zili Chen 
> > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - build from source: OK(8u212)
> > > > > - check local setup tutorial works as expected
> > > > >
> > > > > Best,
> > > > > tison.
> > > > >
> > > > >
> > > > > Yu Li  于2019年8月20日周二 上午8:24写道:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > - checked release notes: OK
> > > > > > - checked sums and signatures: OK
> > > > > > - repository appears to contain all expected artifacts
> > > > > > - source release
> > > > > >  - contains no binaries: OK
> > > > > >  - contains no 1.9-SNAPSHOT references: OK
> > > > > >  - build from source: OK (8u102)
> > > > > > - binary release
> > > > > >  - no examples appear to be missing
> > > > > >  - started a cluster; WebUI reachable, example ran
> successfully
> > > > > > - checked README.md file and found nothing unexpected
> > > > > >
> > > > > > Best Regards,
> > > > > > Yu
> > > > > >
> > > > > >
> > > > > > On Tue, 20 Aug 2019 at 01:16, Tzu

[DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-08-27 Thread Bowen Li
Hi folks,

I'd like to kick off a discussion on reworking Flink's FunctionCatalog.
It will be critically helpful for improving function usability in SQL.

https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing

In short, it:
- adds support for precise function reference with fully/partially
qualified name
- redefines function resolution order for ambiguous function reference
- adds support for Hive's rich built-in functions (support for Hive user
defined functions was already added in 1.9.0)
- clarifies the concept of temporary functions

Would love to hear your thoughts.

Bowen


Re: [DISCUSS] Flink Python User-Defined Function for Table API

2019-08-27 Thread Bowen Li
Hi Jincheng and Dian,

Sorry for being late to the party. I took a glance at the proposal; it LGTM
in general, and I left only a couple of comments.

Thanks,
Bowen


On Mon, Aug 26, 2019 at 8:05 PM Dian Fu  wrote:

> Hi Jincheng,
>
> Thanks! It works.
>
> Thanks,
> Dian
>
> > 在 2019年8月27日,上午10:55,jincheng sun  写道:
> >
> > Hi Dian, can you check if you have edit access? :)
> >
> >
> > Dian Fu  于2019年8月26日周一 上午10:52写道:
> >
> >> Hi Jincheng,
> >>
> >> Appreciated for the kind tips and offering of help. Definitely need it!
> >> Could you grant me write permission for confluence? My Id: Dian Fu
> >>
> >> Thanks,
> >> Dian
> >>
> >>> 在 2019年8月26日,上午9:53,jincheng sun  写道:
> >>>
> >>> Thanks for your feedback Hequn & Dian.
> >>>
> >>> Dian, I am glad to see that you want help to create the FLIP!
> >>> Everyone will have first time, and I am very willing to help you
> complete
> >>> your first FLIP creation. Here some tips:
> >>>
> >>> - First I'll give your account write permission for confluence.
> >>> - Before create the FLIP, please have look at the FLIP Template [1],
> >> (It's
> >>> better to know more about FLIP by reading [2])
> >>> - Create Flink Python UDFs related JIRAs after completing the VOTE of
> >>> FLIP.(I think you also can bring up the VOTE thread, if you want! )
> >>>
> >>> Any problems you encounter during this period,feel free to tell me that
> >> we
> >>> can solve them together. :)
> >>>
> >>> Best,
> >>> Jincheng
> >>>
> >>>
> >>>
> >>>
> >>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template
> >>> [2]
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>
> >>>
> >>> Hequn Cheng  于2019年8月23日周五 上午11:54写道:
> >>>
>  +1 for starting the vote.
> 
>  Thanks Jincheng a lot for the discussion.
> 
>  Best, Hequn
> 
>  On Fri, Aug 23, 2019 at 10:06 AM Dian Fu 
> wrote:
> 
> > Hi Jincheng,
> >
> > +1 to start the FLIP create and VOTE on this feature. I'm willing to
> >> help
> > on the FLIP create if you don't mind. As I haven't created a FLIP
> >> before,
> > it will be great if you could help on this. :)
> >
> > Regards,
> > Dian
> >
> >> 在 2019年8月22日,下午11:41,jincheng sun  写道:
> >>
> >> Hi all,
> >>
> >> Thanks a lot for your feedback. If there are no more suggestions and
> >> comments, I think it's better to  initiate a vote to create a FLIP
> for
> >> Apache Flink Python UDFs.
> >> What do you think?
> >>
> >> Best, Jincheng
> >>
> >> jincheng sun  于2019年8月15日周四 上午12:54写道:
> >>
> >>> Hi Thomas,
> >>>
> >>> Thanks for your confirmation and the very important reminder about
> > bundle
> >>> processing.
> >>>
> >>> I have had add the description about how to perform bundle
> processing
> > from
> >>> the perspective of checkpoint and watermark. Feel free to leave
> > comments if
> >>> there are anything not describe clearly.
> >>>
> >>> Best,
> >>> Jincheng
> >>>
> >>>
> >>> Dian Fu  于2019年8月14日周三 上午10:08写道:
> >>>
>  Hi Thomas,
> 
>  Thanks a lot the suggestions.
> 
>  Regarding to bundle processing, there is a section "Checkpoint"[1]
> >> in
> > the
>  design doc which talks about how to handle the checkpoint.
>  However, I think you are right that we should talk more about it,
>  such
> > as
>  what's bundle processing, how it affects the checkpoint and
>  watermark,
> > how
>  to handle the checkpoint and watermark, etc.
> 
>  [1]
> 
> >
> 
> >>
> https://docs.google.com/document/d/1WpTyCXAQh8Jr2yWfz7MWCD2-lou05QaQFb810ZvTefY/edit#heading=h.urladt565yo3
>  <
> 
> >
> 
> >>
> https://docs.google.com/document/d/1WpTyCXAQh8Jr2yWfz7MWCD2-lou05QaQFb810ZvTefY/edit#heading=h.urladt565yo3
> >
> 
>  Regards,
>  Dian
> 
> > 在 2019年8月14日,上午1:01,Thomas Weise  写道:
> >
> > Hi Jincheng,
> >
> > Thanks for putting this together. The proposal is very detailed,
>  thorough
> > and for me as a Beam Flink runner contributor easy to understand
> :)
> >
> > One thing that you should probably detail more is the bundle
>  processing. It
> > is critically important for performance that multiple elements
> are
> > processed in a bundle. The default bundle size in the Flink
> runner
>  is
>  1s or
> > 1000 elements, whichever comes first. And for streaming, you can
>  find
>  the
> > logic necessary to align the bundle processing with watermarks
> and
> > checkpointing here:
> >
> 
> >
> 
> >>
> https://github.com/apache/beam/blob/release-2.14.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-08-29 Thread Bowen Li
Thanks everyone for the feedback.

I have updated the document accordingly. Here's a summary of the changes:

- clarify the concept of temporary functions, to facilitate deciding the
function resolution order
- provide two options to support Hive built-in functions, with the 2nd one
being preferred
- add detailed prototype code for FunctionCatalog#lookupFunction(name)
- move the section "rename existing FunctionCatalog APIs in favor of
temporary functions" out of the scope of the FLIP
- add another reasonable limitation for function resolution: we do not
consider resolving overloaded functions - those with the same name but
different params. (It's still valid to have a single function with
overloaded eval() methods)

Please take another look.
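
For readers who don't have access to the doc, here is a rough, illustrative
Java sketch (not the actual prototype code in the doc) of the ambiguous-name
resolution order as proposed at this point of the discussion - temporary
functions, then built-in functions, then catalog functions in the current
catalog/database. All type and method names below are made up for the example.

import java.util.Map;
import java.util.Optional;

// Illustrative only: lookup order for an ambiguous (unqualified) function name.
class FunctionResolutionSketch {

    private final Map<String, Object> temporaryFunctions;      // session-scoped, per user
    private final Map<String, Object> builtInFunctions;        // Flink built-in functions
    private final Map<String, Object> currentCatalogFunctions; // functions in current catalog/db

    FunctionResolutionSketch(Map<String, Object> temporaryFunctions,
                             Map<String, Object> builtInFunctions,
                             Map<String, Object> currentCatalogFunctions) {
        this.temporaryFunctions = temporaryFunctions;
        this.builtInFunctions = builtInFunctions;
        this.currentCatalogFunctions = currentCatalogFunctions;
    }

    Optional<Object> lookupFunction(String name) {
        if (temporaryFunctions.containsKey(name)) {
            return Optional.of(temporaryFunctions.get(name));          // 1. temporary functions
        }
        if (builtInFunctions.containsKey(name)) {
            return Optional.of(builtInFunctions.get(name));            // 2. built-in functions
        }
        return Optional.ofNullable(currentCatalogFunctions.get(name)); // 3. catalog functions
    }
}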

Thanks,
Bowen

On Tue, Aug 27, 2019 at 11:49 AM Bowen Li  wrote:

> Hi folks,
>
> I'd like to kick off a discussion on reworking Flink's FunctionCatalog.
> It's critically helpful to improve function usability in SQL.
>
>
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>
> In short, it:
> - adds support for precise function reference with fully/partially
> qualified name
> - redefines function resolution order for ambiguous function reference
> - adds support for Hive's rich built-in functions (support for Hive user
> defined functions was already added in 1.9.0)
> - clarifies the concept of temporary functions
>
> Would love to hear your thoughts.
>
> Bowen
>


[ANNOUNCE] Kinesis connector becomes part of Flink releases

2019-08-30 Thread Bowen Li
Hi all,

I'm glad to announce that, as PR #9494 was merged today,
flink-connector-kinesis is officially under the Apache 2.0 license in the
master branch, and its artifact will be deployed to Maven central as part of
Flink releases starting from Flink 1.10.0. Users can then use the artifact off
the shelf and no longer have to build and maintain it on their own.
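
To illustrate what this means for users, a minimal sketch of consuming a
Kinesis stream with the released connector artifact might look like the
following once 1.10.0 is out. The stream name, region, and initial position
below are placeholders; please check the Kinesis connector documentation for
the authoritative configuration keys and artifact coordinates.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class KinesisConsumerExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "us-west-2");          // placeholder region
        consumerConfig.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST"); // start from latest records

        env.addSource(new FlinkKinesisConsumer<>(
                        "my-kinesis-stream",        // placeholder stream name
                        new SimpleStringSchema(),
                        consumerConfig))
           .print();

        env.execute("Kinesis consumer example");
    }
}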

It brings a much better user experience to our large AWS customer base by
making their work simpler, smoother, and more productive!

Thanks everyone who participated in coding and review to drive this
initiative forward.

Cheers,
Bowen


Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-03 Thread Bowen Li
Hi Kurt,

Re: > What I want to propose is we can merge #3 and #4, make them both under
>"catalog" concept, by extending catalog function to make it have ability to
>have built-in catalog functions. Some benefits I can see from this
approach:
>1. We don't have to introduce new concept like external built-in functions.
>Actually I don't see a full story about how to treat a built-in functions,
and it
>seems a little bit disrupt with catalog. As a result, you have to make
some restriction
>like "hive built-in functions can only be used when current catalog is
hive catalog".

Yes, I've unified #3 and #4 but it seems I didn't update some part of the
doc. I've modified those sections, and they are up to date now.

In short, built-in functions of external systems are now defined as a
special kind of catalog function in Flink, and are handled by Flink as
follows:
- An external built-in function must be associated with a catalog for the
purpose of decoupling flink-table and external systems.
- It always resides in front of catalog functions in ambiguous function
reference order, just like in its own external system
- It is a special catalog function that doesn’t have a schema/database
namespace
- It goes through the same instantiation logic as other user-defined catalog
functions in the external system

Please take another look at the doc, and let me know if you have more
questions.


On Tue, Sep 3, 2019 at 7:28 AM Timo Walther  wrote:

> Hi Kurt,
>
> it should not affect the functions and operations we currently have in
> SQL. It just categorizes the available built-in functions. It is kind of
> an orthogonal concept to the catalog API but built-in functions deserve
> this special kind of treatment. CatalogFunction still fits perfectly in
> there because the regular catalog object resolution logic is not
> affected. So tables and functions are resolved in the same way but with
> built-in functions that have priority as in the original design.
>
> Regards,
> Timo
>
>
> On 03.09.19 15:26, Kurt Young wrote:
> > Does this only affect the functions and operations we currently have in
> SQL
> > and
> > have no effect on tables, right? Looks like this is an orthogonal concept
> > with Catalog?
> > If the answer are both yes, then the catalog function will be a weird
> > concept?
> >
> > Best,
> > Kurt
> >
> >
> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan  wrote:
> >
> >> The way you proposed are basically the same as what Calcite does, I
> think
> >> we are in the same line.
> >>
> >> Best,
> >> Danny Chan
> >> 在 2019年9月3日 +0800 PM7:57,Timo Walther ,写道:
> >>> This sounds exactly as the module approach I mentioned, no?
> >>>
> >>> Regards,
> >>> Timo
> >>>
> >>> On 03.09.19 13:42, Danny Chan wrote:
> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
> >> refactoring to make our function usage more user friendly.
> >>>> For the topic of how to organize the builtin operators and operators
> >> of Hive, here is a solution from Apache Calcite, the Calcite way is to
> make
> >> every dialect operators a “Library”, user can specify which libraries
> they
> >> want to use for a sql query. The builtin operators always comes as the
> >> first class objects and the others are used from the order they appears.
> >> Maybe you can take a reference.
> >>>> [1]
> >>
> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
> >>>> Best,
> >>>> Danny Chan
> >>>> 在 2019年8月28日 +0800 AM2:50,Bowen Li ,写道:
> >>>>> Hi folks,
> >>>>>
> >>>>> I'd like to kick off a discussion on reworking Flink's
> >> FunctionCatalog.
> >>>>> It's critically helpful to improve function usability in SQL.
> >>>>>
> >>>>>
> >>
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
> >>>>> In short, it:
> >>>>> - adds support for precise function reference with fully/partially
> >>>>> qualified name
> >>>>> - redefines function resolution order for ambiguous function
> >> reference
> >>>>> - adds support for Hive's rich built-in functions (support for Hive
> >> user
> >>>>> defined functions was already added in 1.9.0)
> >>>>> - clarifies the concept of temporary functions
> >>>>>
> >>>>> Would love to hear your thoughts.
> >>>>>
> >>>>> Bowen
> >>>
>
>


Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-03 Thread Bowen Li
Hi Jingsong,

Re> 1.Hive built-in functions is an intermediate solution. So we should
> not introduce interfaces to influence the framework. To make
> Flink itself more powerful, we should implement the functions
> we need to add.

Yes, please see the doc.

Re> 2.Non-flink built-in functions are easy for users to change their
> behavior. If we support some flink built-in functions in the
> future but act differently from non-flink built-in, this will lead to
> changes in user behavior.

There's no such concept as "external built-in functions" any more. Built-in
functions of external systems will be treated as special catalog functions.

Re> Another question is, does this fallback include all
> hive built-in functions? As far as I know, some hive functions
> have some hacky. If possible, can we start with a white list?
> Once we implement some functions to flink built-in, we can
> also update the whitelist.

Yes, that's something we thought of too. I don't think it's super critical
to the scope of this FLIP, thus I'd like to leave it to future efforts as a
nice-to-have feature.


On Tue, Sep 3, 2019 at 1:37 PM Bowen Li  wrote:

> Hi Kurt,
>
> Re: > What I want to propose is we can merge #3 and #4, make them both
> under
> >"catalog" concept, by extending catalog function to make it have ability
> to
> >have built-in catalog functions. Some benefits I can see from this
> approach:
> >1. We don't have to introduce new concept like external built-in
> functions.
> >Actually I don't see a full story about how to treat a built-in
> functions, and it
> >seems a little bit disrupt with catalog. As a result, you have to make
> some restriction
> >like "hive built-in functions can only be used when current catalog is
> hive catalog".
>
> Yes, I've unified #3 and #4 but it seems I didn't update some part of the
> doc. I've modified those sections, and they are up to date now.
>
> In short, now built-in function of external systems are defined as a
> special kind of catalog function in Flink, and handled by Flink as
> following:
> - An external built-in function must be associated with a catalog for the
> purpose of decoupling flink-table and external systems.
> - It always resides in front of catalog functions in ambiguous function
> reference order, just like in its own external system
> - It is a special catalog function that doesn’t have a schema/database
> namespace
> - It goes thru the same instantiation logic as other user defined catalog
> functions in the external system
>
> Please take another look at the doc, and let me know if you have more
> questions.
>
>
> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther  wrote:
>
>> Hi Kurt,
>>
>> it should not affect the functions and operations we currently have in
>> SQL. It just categorizes the available built-in functions. It is kind of
>> an orthogonal concept to the catalog API but built-in functions deserve
>> this special kind of treatment. CatalogFunction still fits perfectly in
>> there because the regular catalog object resolution logic is not
>> affected. So tables and functions are resolved in the same way but with
>> built-in functions that have priority as in the original design.
>>
>> Regards,
>> Timo
>>
>>
>> On 03.09.19 15:26, Kurt Young wrote:
>> > Does this only affect the functions and operations we currently have in
>> SQL
>> > and
>> > have no effect on tables, right? Looks like this is an orthogonal
>> concept
>> > with Catalog?
>> > If the answer are both yes, then the catalog function will be a weird
>> > concept?
>> >
>> > Best,
>> > Kurt
>> >
>> >
>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan  wrote:
>> >
>> >> The way you proposed are basically the same as what Calcite does, I
>> think
>> >> we are in the same line.
>> >>
>> >> Best,
>> >> Danny Chan
>> >> 在 2019年9月3日 +0800 PM7:57,Timo Walther ,写道:
>> >>> This sounds exactly as the module approach I mentioned, no?
>> >>>
>> >>> Regards,
>> >>> Timo
>> >>>
>> >>> On 03.09.19 13:42, Danny Chan wrote:
>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
>> >> refactoring to make our function usage more user friendly.
>> >>>> For the topic of how to organize the builtin operators and operators
>> >> of Hive, here is a solution from Apache Calcite, the Calcite way is to
>> make
>&

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-03 Thread Bowen Li
Hi Timo,

Re> 1) We should not have the restriction "hive built-in functions can only
> be used when current catalog is hive catalog". Switching a catalog
> should only have implications on the cat.db.object resolution but not
> functions. It would be quite convinient for users to use Hive built-ins
> even if they use a Confluent schema registry or just the in-memory
catalog.

There might be a misunderstanding here.

First of all, Hive built-in functions are not part of Flink's built-in
functions; they are catalog functions. Thus, if the current catalog is not a
HiveCatalog but, say, a schema registry catalog, an ambiguous function
reference just shouldn't be resolved against a different catalog.

Second, Hive built-in functions can potentially be referenced across
catalogs, but they don't have a db namespace and we currently don't have
a SQL syntax for that. It can be enabled once such a SQL syntax is defined,
e.g. "catalog::function", but that is out of scope for this FLIP.

2) I would propose to have separate concepts for catalog and built-in
functions. In particular it would be nice to modularize built-in
functions. Some built-in functions are very crucial (like AS, CAST,
MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
we add more experimental functions in the future or function for some
special application area (Geo functions, ML functions). A data platform
team might not want to make every built-in function available. Or a
function module like ML functions is in a different Maven module.

I think this is orthogonal to this FLIP, especially since we don't have
"external built-in functions" anymore and the built-in function
category currently remains untouched.

But just to share some thoughts on the proposal, I'm not sure about it:
- I don't know if any other databases handle built-in functions like that.
Maybe you can give some examples? IMHO, built-in functions are system info
and should be deterministic, not dependent on loaded libraries. Geo
functions should either be built-in already or just be library functions,
and library functions can be adapted to the catalog APIs or exposed via some
other syntax
- I don't know if all the use cases hold, and many can be achieved by other
approaches too. E.g. experimental functions can be handled well by
documentation, annotations, etc.
- the proposal basically introduces a concept like a pluggable built-in
function catalog, despite the already existing catalog APIs
- it brings even more complicated scenarios into the design. E.g. how do
you handle built-in functions that live in different modules but under
different names?

In short, I'm not sure the proposal really holds up, and it looks like
overkill to me. I'd rather not go down that route. Related discussion can be
on its own thread.

3) Following the suggestion above, we can have a separate discovery
mechanism for built-in functions. Instead of just going through a static
list like in BuiltInFunctionDefinitions, a platform team should be able
to select function modules like
catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
HiveFunctions) or via service discovery;

Same as above. I'll leave it to its own thread.

re > 3) Dawid and I discussed the resulution order again. I agree with Kurt
> that we should unify built-in function (external or internal) under a
> common layer. However, the resolution order should be:
>   1. built-in functions
>   2. temporary functions
>   3. regular catalog resolution logic
> Otherwise a temporary function could cause clashes with Flink's built-in
> functions. If you take a look at other vendors, like SQL Server they
> also do not allow to overwrite built-in functions.

"I agree with Kurt that we should unify built-in function (external or
internal) under a common layer." <- I don't think this is what Kurt means.
Kurt and I are in favor of unifying built-in functions of external systems
and catalog functions. Was that a typo?

Besides, I'm not sure about the resolution order you proposed. Temporary
functions have the lifespan of a session and are only visible to the
session owner; they are unique to each user, and users create them on
purpose to take the highest priority in order to override system behavior
(built-in functions in this case).

In your proposal, why would users name a temporary function the same as a
built-in function at all? Since an ambiguous function reference to that name
would always resolve to the built-in function, creating a same-named temp
function would be meaningless in the end.


On Tue, Sep 3, 2019 at 1:44 PM Bowen Li  wrote:

> Hi Jingsong,
>
> Re> 1.Hive built-in functions is an intermediate solution. So we should
> > not introduce interfaces to influence the framework. To make
> > Flink itself more powerful, we should implement the functions
> > we need to add.
>
> Y

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-03 Thread Bowen Li
Hi all,

Thanks for the feedback. Just a kind reminder that the [Proposal] section
in the Google doc has been updated; please take a look first and let me know
if you have more questions.

On Tue, Sep 3, 2019 at 4:57 PM Bowen Li  wrote:

> Hi Timo,
>
> Re> 1) We should not have the restriction "hive built-in functions can
> only
> > be used when current catalog is hive catalog". Switching a catalog
> > should only have implications on the cat.db.object resolution but not
> > functions. It would be quite convinient for users to use Hive built-ins
> > even if they use a Confluent schema registry or just the in-memory
> catalog.
>
> There might be a misunderstanding here.
>
> First of all, Hive built-in functions are not part of Flink built-in
> functions, they are catalog functions, thus if the current catalog is not a
> HiveCatalog but, say, a schema registry catalog, ambiguous functions
> reference just shouldn't be resolved to a different catalog.
>
> Second, Hive built-in functions can potentially be referenced across
> catalog, but it doesn't have db namespace and we currently just don't have
> a SQL syntax for it. It can be enabled when such a SQL syntax is defined,
> e.g. "catalog::function", but it's out of scope of this FLIP.
>
> 2) I would propose to have separate concepts for catalog and built-in
> functions. In particular it would be nice to modularize built-in
> functions. Some built-in functions are very crucial (like AS, CAST,
> MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
> we add more experimental functions in the future or function for some
> special application area (Geo functions, ML functions). A data platform
> team might not want to make every built-in function available. Or a
> function module like ML functions is in a different Maven module.
>
> I think this is orthogonal to this FLIP, especially we don't have the
> "external built-in functions" anymore and currently the built-in function
> category remains untouched.
>
> But just to share some thoughts on the proposal, I'm not sure about it:
> - I don't know if any other databases handle built-in functions like that.
> Maybe you can give some examples? IMHO, built-in functions are system info
> and should be deterministic, not depending on loaded libraries. Geo
> functions should be either built-in already or just libraries functions,
> and library functions can be adapted to catalog APIs or of some other
> syntax to use
> - I don't know if all use cases stand, and many can be achieved by other
> approaches too. E.g. experimental functions can be taken good care of by
> documentations, annotations, etc
> - the proposal basically introduces some concept like a pluggable built-in
> function catalog, despite the already existing catalog APIs
> - it brings in even more complicated scenarios to the design. E.g. how do
> you handle built-in functions in different modules but different names?
>
> In short, I'm not sure if it really stands and it looks like an overkill
> to me. I'd rather not go to that route. Related discussion can be on its
> own thread.
>
> 3) Following the suggestion above, we can have a separate discovery
> mechanism for built-in functions. Instead of just going through a static
> list like in BuiltInFunctionDefinitions, a platform team should be able
> to select function modules like
> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> HiveFunctions) or via service discovery;
>
> Same as above. I'll leave it to its own thread.
>
> re > 3) Dawid and I discussed the resulution order again. I agree with
> Kurt
> > that we should unify built-in function (external or internal) under a
> > common layer. However, the resolution order should be:
> >   1. built-in functions
> >   2. temporary functions
> >   3. regular catalog resolution logic
> > Otherwise a temporary function could cause clashes with Flink's built-in
> > functions. If you take a look at other vendors, like SQL Server they
> > also do not allow to overwrite built-in functions.
>
> ”I agree with Kurt that we should unify built-in function (external or
> internal) under a common layer.“ <- I don't think this is what Kurt means.
> Kurt and I are in favor of unifying built-in functions of external systems
> and catalog functions. Did you type a mistake?
>
> Besides, I'm not sure about the resolution order you proposed. Temporary
> functions have a lifespan over a session and are only visible to the
> session owner, they are unique to each user, and users create them on
> purpose to be the highest priority in order to overwrite system in

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-03 Thread Bowen Li
Hi,

I agree with Xuefu that the controversy is mainly around two
points. My thoughts on them:

1) Determinism of referencing Hive built-in functions. We can either remove
Hive built-in functions from ambiguous function resolution and require
users to use a special syntax for their qualified names, or add a config flag
to the catalog constructor/yaml for turning Hive built-in functions on and off,
with the flag set to 'false' by default and proper docs added to help users
make their decision.

2) Flink temp functions vs. Flink built-in functions in the ambiguous function
resolution order. We believe Flink temp functions should precede Flink
built-in functions, and I have presented my reasons. In case we
cannot reach an agreement, I propose forbidding users from registering temp
functions with the same name as a built-in function, like MySQL's approach,
for the moment. It won't raise any performance concern, since built-in
functions are all in memory, so the cost of a name check is trivial. (A toy
sketch of such a registration-time check follows below.)
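
To make option 2 concrete, here is a toy sketch of such a registration-time
guard. The class, method, and exception names are made up for illustration;
the real check would live wherever temporary functions get registered.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy sketch: forbid temporary functions that would shadow a built-in function name.
class TemporaryFunctionRegistrySketch {

    private final Set<String> builtInFunctionNames;             // e.g. "cast", "concat", ...
    private final Map<String, Object> temporaryFunctions = new HashMap<>();

    TemporaryFunctionRegistrySketch(Set<String> builtInFunctionNames) {
        this.builtInFunctionNames = builtInFunctionNames;
    }

    void registerTemporaryFunction(String name, Object function) {
        // In-memory name check; the cost is trivial compared to the registration itself.
        if (builtInFunctionNames.contains(name.toLowerCase())) {
            throw new IllegalArgumentException(
                    "Temporary function '" + name + "' would shadow a built-in function, which is not allowed.");
        }
        temporaryFunctions.put(name.toLowerCase(), function);
    }
}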


On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z  wrote:

> From what I have seen, there are a couple of focal disagreements:
>
> 1. Resolution order: temp function --> flink built-in function --> catalog
> function vs flink built-in function --> temp function -> catalog function.
> 2. "External" built-in functions: how to treat built-in functions in
> external system and how users reference them
>
> For #1, I agree with Bowen that temp function needs to be at the highest
> priority because that's how a user might overwrite a built-in function
> without referencing a persistent, overwriting catalog function with a fully
> qualified name. Putting built-in functions at the highest priority
> eliminates that usage.
>
> For #2, I saw a general agreement on referencing "external" built-in
> functions such as those in Hive needs to be explicit and deterministic even
> though different approaches are proposed. To limit the scope and simply the
> usage, it seems making sense to me to introduce special syntax for user  to
> explicitly reference an external built-in function such as hive1::sqrt or
> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API call
> hive1.getFunction(ObjectPath functionName) where the database name is
> absent for bulit-in functions available in that catalog hive1. I understand
> that Bowen's original proposal was trying to avoid this, but this could
> turn out to be a clean and simple solution.
>
> (Timo's modular approach is great way to "expand" Flink's built-in function
> set, which seems orthogonal and complementary to this, which could be
> tackled in further future work.)
>
> I'd be happy to hear further thoughts on the two points.
>
> Thanks,
> Xuefu
>
> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young  wrote:
>
> > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the
> > same
> > as Bowen's. But after thinking about it, I'm currently lean to Timo's
> > suggestion.
> >
> > The reason is backward compatibility. If we follow Bowen's approach,
> let's
> > say we
> > first find function in Flink's built-in functions, and then hive's
> > built-in. For example, `foo`
> > is not supported by Flink, but hive has such built-in function. So user
> > will have hive's
> > behavior for function `foo`. And in next release, Flink realize this is a
> > very popular function
> > and add it into Flink's built-in functions, but with different behavior
> as
> > hive's. So in next
> > release, the behavior changes.
> >
> > With Timo's approach, IIUC user have to tell the framework explicitly
> what
> > kind of
> > built-in functions he would like to use. He can just tell framework to
> > abandon Flink's built-in
> > functions, and use hive's instead. User can only choose between them, but
> > not use
> > them at the same time. I think this approach is more predictable.
> >
> > Best,
> > Kurt
> >
> >
> > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li  wrote:
> >
> > > Hi all,
> > >
> > > Thanks for the feedback. Just a kindly reminder that the [Proposal]
> > section
> > > in the google doc was updated, please take a look first and let me know
> > if
> > > you have more questions.
> > >
> > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li  wrote:
> > >
> > > > Hi Timo,
> > > >
> > > > Re> 1) We should not have the restriction "hive built-in functions
> can
> > > > only
> > > > > be used when

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-04 Thread Bowen Li
or. (
> http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html)
>
> Because of lack of standard, it's perfectly fine for Flink to define
> whatever it sees appropriate. Thus, your proposal (no overwriting and must
> have DB as holder) is one option. The advantage is simplicity, The
> downside
> is the deviation from Hive, which is popular and de facto standard in big
> data world.
>
> However, I don't think we have to follow Hive. More importantly, we need a
> consensus. I have no objection if your proposal is generally agreed upon.
>
> Thanks,
> Xuefu
>
> On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz 
> 
> wrote:
>
> Hi all,
>
> Just an opinion on the built-in <> temporary functions resolution and
> NAMING issue. I think we should not allow overriding the built-in
> functions, as this may pose serious issues and to be honest is rather
> not feasible and would require major rework. What happens if a user
> wants to override CAST? Calls to that function are generated at
> different layers of the stack that unfortunately does not always go
> through the Catalog API (at least yet). Moreover from what I've checked
> no other systems allow overriding the built-in functions. All the
> systems I've checked so far register temporary functions in a
> database/schema (either special database for temporary functions, or
> just current database). What I would suggest is to always register
> temporary functions with a 3 part identifier. The same way as tables,
> views etc. This effectively means you cannot override built-in
> functions. With such approach it is natural that the temporary functions
> end up a step lower in the resolution order:
>
> 1. built-in functions (1 part, maybe 2? - this is still under discussion)
>
> 2. temporary functions (always 3 part path)
>
> 3. catalog functions (always 3 part path)
>
> Let me know what do you think.
>
> Best,
>
> Dawid
>
> On 04/09/2019 06:13, Bowen Li wrote:
>
> Hi,
>
> I agree with Xuefu that the main controversial points are mainly the
>
> two
>
> places. My thoughts on them:
>
> 1) Determinism of referencing Hive built-in functions. We can either
>
> remove
>
> Hive built-in functions from ambiguous function resolution and require
> users to use special syntax for their qualified names, or add a config
>
> flag
>
> to catalog constructor/yaml for turning on and off Hive built-in
>
> functions
>
> with the flag set to 'false' by default and proper doc added to help
>
> users
>
> make their decisions.
>
> 2) Flink temp functions v.s. Flink built-in functions in ambiguous
>
> function
>
> resolution order. We believe Flink temp functions should precede Flink
> built-in functions, and I have presented my reasons. Just in case if we
> cannot reach an agreement, I propose forbid users registering temp
> functions in the same name as a built-in function, like MySQL's
>
> approach,
>
> for the moment. It won't have any performance concern, since built-in
> functions are all in memory and thus cost of a name check will be
>
> really
>
> trivial.
>
>
> On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z 
>  wrote:
>
>  From what I have seen, there are a couple of focal disagreements:
>
> 1. Resolution order: temp function --> flink built-in function -->
>
> catalog
>
> function vs flink built-in function --> temp function -> catalog
>
> function.
>
> 2. "External" built-in functions: how to treat built-in functions in
> external system and how users reference them
>
> For #1, I agree with Bowen that temp function needs to be at the
>
> highest
>
> priority because that's how a user might overwrite a built-in function
> without referencing a persistent, overwriting catalog function with a
>
> fully
>
> qualified name. Putting built-in functions at the highest priority
> eliminates that usage.
>
> For #2, I saw a general agreement on referencing "external" built-in
> functions such as those in Hive needs to be explicit and deterministic
>
> even
>
> though different approaches are proposed. To limit the scope and
>
> simply
>
> the
>
> usage, it seems making sense to me to introduce special syntax for
>
> user  to
>
> explicitly reference an external built-in function such as hive1::sqrt
>
> or
>
> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API
>
> call
>
> hive1.getFunction(ObjectPath functionName) where the database name is
> absent for bulit-in functions available in that catalog hive1. I
>
> understand
>
> that

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-04 Thread Bowen Li
Maybe Xuefu missed my email. Please let me know what your thoughts are on
the summary; if there's still major controversy, I can take time to
re-evaluate that part.


On Wed, Sep 4, 2019 at 2:25 PM Xuefu Z  wrote:

> Thank all for the sharing thoughts. I think we have gathered some useful
> initial feedback from this long discussion with a couple of focal points
> sticking out.
>
>  We will go back to do more research and adapt our proposal. Once it's
> ready, we will ask for a new round of review. If there is any disagreement,
> we will start a new discussion thread on each rather than having a mega
> discussion like this.
>
> Thanks to everyone for participating.
>
> Regards,
> Xuefu
>
>
> On Thu, Sep 5, 2019 at 2:52 AM Bowen Li  wrote:
>
> > Let me try to summarize and conclude the long thread so far:
> >
> > 1. For order of temp function v.s. built-in function:
> >
> > I think Dawid's point that temp function should be of fully qualified
> path
> > is a better reasoning to back the newly proposed order, and i agree we
> > don't need to follow Hive/Spark.
> >
> > However, I'd rather not change fundamentals of temporary functions in
> this
> > FLIP. It belongs to a bigger story of how temporary objects should be
> > redefined and be handled uniformly - currently temporary tables and views
> > (those registered from TableEnv#registerTable()) behave different than
> what
> > Dawid propose for temp functions, and we need a FLIP to just unify their
> > APIs and behaviors.
> >
> > I agree that backward compatibility is not an issue w.r.t Jark's points.
> >
> > ***Seems we do have consensus that it's acceptable to prevent users
> > registering a temp function in the same name as a built-in function. To
> > help us move forward, I'd like to propose setting such a restraint on
> temp
> > functions in this FLIP to simplify the design and avoid disputes.*** It
> > will also leave rooms for improvements in the future.
> >
> >
> > 2. For Hive built-in function:
> >
> > Thanks Timo for providing the Presto and Postgres examples. I feel
> modular
> > built-in functions can be a good fit for the geo and ml example as a
> native
> > Flink extension, but not sure if it fits well with external integrations.
> > Anyway, I think modular built-in functions is a bigger story and can be
> on
> > its own thread too, and our proposal doesn't prevent Flink from doing
> that
> > in the future.
> >
> > ***Seems we have consensus that users should be able to use built-in
> > functions of Hive or other external systems in SQL explicitly and
> > deterministically regardless of Flink built-in functions and the
> potential
> > modular built-in functions, via some new syntax like "mycat::func"? If
> so,
> > I'd like to propose removing Hive built-in functions from ambiguous
> > function resolution order, and empower users with such a syntax. This way
> > we sacrifice a little convenience for certainty***
> >
> >
> > What do you think?
> >
> > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz 
> > wrote:
> >
> > > Hi,
> > >
> > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just
> > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they
> > are
> > > very inconsistent in that manner (spark being way worse on that).
> > >
> > > Hive:
> > >
> > > You cannot overwrite all the built-in functions. I could overwrite most
> > of
> > > the functions I tried e.g. length, e, pi, round, rtrim, but there are
> > > functions I cannot overwrite e.g. CAST, ARRAY I get:
> > >
> > >
> > > *ParseException line 1:29 cannot recognize input near 'array' 'AS'
> *
> > >
> > > What is interesting is that I cannot ovewrite *array*, but I can
> ovewrite
> > > *map* or *struct*. Though hive behaves reasonable well if I manage to
> > > overwrite a function. When I drop the temporary function the native
> > > function is still available.
> > >
> > > Spark:
> > >
> > > Spark's behavior imho is super bad.
> > >
> > > Theoretically I could overwrite all functions. I was able e.g. to
> > > overwrite CAST function. I had to use though CREATE OR REPLACE
> TEMPORARY
> > > FUNCTION syntax. Otherwise I get an exception that a function already
> > > exists. However when I used the CAST function in a query it used the

Re: [DISCUSS] Contribute Pulsar Flink connector back to Flink

2019-09-05 Thread Bowen Li
Hi,

I think having a Pulsar connector in Flink can be of mutual benefit to
both communities.

Another perspective is that the Pulsar connector would be the 1st streaming
connector that integrates with Flink's metadata management system and Catalog
APIs. It'll be cool to see how the integration turns out and whether we need
to improve the Flink Catalog stack, which is currently in beta, to cater to
streaming sources/sinks. Thus I'm in favor of merging the Pulsar connector
into Flink 1.10.

I'd suggest submitting smaller-sized PRs, e.g. maybe one for basic
source/sink functionality and another for schema and catalog integration,
just to make them easier to review.

It doesn't seem to hurt to wait for FLIP-27. But I don't think FLIP-27
should be a blocker in case it cannot make its way into 1.10 or
doesn't leave a reasonable amount of time for committers to review or for the
Pulsar connector to fully adapt to the new interfaces.

Bowen



On Thu, Sep 5, 2019 at 3:21 AM Becket Qin  wrote:

> Hi Till,
>
> You are right. It all depends on when the new source interface is going to
> be ready. Personally I think it would be there in about a month or so. But
> I could be too optimistic. It would also be good to hear what do Aljoscha
> and Stephan think as they are also involved in FLIP-27.
>
> In general I think we should have Pulsar connector in Flink 1.10,
> preferably with the new source interface. We can also check it in right now
> with old source interface, but I suspect few users will use it before the
> next official release. Therefore, it seems reasonable to wait a little bit
> to see whether we can jump to the new source interface. As long as we make
> sure Flink 1.10 has it, waiting a little bit doesn't seem to hurt much.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Sep 5, 2019 at 3:59 PM Till Rohrmann  wrote:
>
> > Hi everyone,
> >
> > I'm wondering what the problem would be if we committed the Pulsar
> > connector before the new source interface is ready. If I understood it
> > correctly, then we need to support the old source interface anyway for
> the
> > existing connectors. By checking it in early I could see the benefit that
> > our users could start using the connector earlier. Moreover, it would
> > prevent that the Pulsar integration is being delayed in case that the
> > source interface should be delayed. The only downside I see is the extra
> > review effort and potential fixes which might be irrelevant for the new
> > source interface implementation. I guess it mainly depends on how certain
> > we are when the new source interface will be ready.
> >
> > Cheers,
> > Till
> >
> > On Thu, Sep 5, 2019 at 8:56 AM Becket Qin  wrote:
> >
> > > Hi Sijie and Yijie,
> > >
> > > Thanks for sharing your thoughts.
> > >
> > > Just want to have some update on FLIP-27. Although the FLIP wiki and
> > > discussion thread has been quiet for some time, a few committer /
> > > contributors in Flink community were actually prototyping the entire
> > thing.
> > > We have made some good progress there but want to update the FLIP wiki
> > > after the entire thing is verified to work in case there are some last
> > > minute surprise in the implementation. I don't have an exact ETA yet,
> > but I
> > > guess it is going to be within a month or so.
> > >
> > > I am happy to review the current Flink Pulsar connector and see if it
> > would
> > > fit in FLIP-27. It would be good to avoid the case that we checked in
> the
> > > Pulsar connector with some review efforts and shortly after that the
> new
> > > Source interface is ready.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Thu, Sep 5, 2019 at 8:39 AM Yijie Shen 
> > > wrote:
> > >
> > > > Thanks for all the feedback and suggestions!
> > > >
> > > > As Sijie said, the goal of the connector has always been to provide
> > > > users with the latest features of both systems as soon as possible.
> We
> > > > propose to contribute the connector to Flink and hope to get more
> > > > suggestions and feedback from Flink experts to ensure the high
> quality
> > > > of the connector.
> > > >
> > > > For FLIP-27, we noticed its existence at the beginning of reworking
> > > > the connector implementation based on Flink 1.9; we also wanted to
> > > > build a connector that supports both batch and stream computing based
> > > > on it.
> > > > However, it has been inactive for some time, so we decided to provide
> > > > a connector with most of the new features, such as the new type
> system
> > > > and the new catalog API first. We will pay attention to the progress
> > > > of FLIP-27 continually and incorporate it with the connector as soon
> > > > as possible.
> > > >
> > > > Regarding the test status of the connector, we are following the
> other
> > > > connectors' test in Flink repository and aimed to provide throughout
> > > > tests as we could. We are also happy to hear suggestions and
> > > > supervision from the Flink community to improve the stability and
> >

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-08 Thread Bowen Li
you shadow only function 'func' in database 'db' in current catalog?
>3. This point is still under discussion, but was mentioned a few
>times, that maybe we want to enable syntax cat.func for "external built-in
>functions". How would that affect statement from previous point? Would
>'db.func' shadow "external built-in function" in 'db' catalog or user
>functions as in point 2? Or maybe both?
>4. Lastly in fact to summarize the previous points. Assuming 2/3-part
>paths. Would the function resolution be actually as follows?:
>   1. temporary functions (1-part path)
>   2. built-in functions
>   3. temporary functions (2-part path)
>   4. 2-part catalog functions a.k.a. "external built-in functions"
>   (cat + func) - this is still under discussion, if we want that in the 
> other
>   focal point
>   5. temporary functions (3-part path)
>   6. 3-part catalog functions a.k.a. user functions
>
> I would be really grateful if you could explain me those questions, thanks.
>
> BTW, Thank you all for a healthy discussion.
>
> Best,
>
> Dawid
> On 04/09/2019 23:25, Xuefu Z wrote:
>
> Thanks all for sharing your thoughts. I think we have gathered some useful
> initial feedback from this long discussion with a couple of focal points
> sticking out.
>
>  We will go back to do more research and adapt our proposal. Once it's
> ready, we will ask for a new round of review. If there is any disagreement,
> we will start a new discussion thread on each rather than having a mega
> discussion like this.
>
> Thanks to everyone for participating.
>
> Regards,
> Xuefu
>
>
> On Thu, Sep 5, 2019 at 2:52 AM Bowen Li  
>
>
>  wrote:
>
>
> Let me try to summarize and conclude the long thread so far:
>
> 1. For order of temp function v.s. built-in function:
>
> I think Dawid's point that temp functions should have a fully qualified path
> is better reasoning to back the newly proposed order, and I agree we
> don't need to follow Hive/Spark.
>
> However, I'd rather not change fundamentals of temporary functions in this
> FLIP. It belongs to a bigger story of how temporary objects should be
> redefined and handled uniformly - currently temporary tables and views
> (those registered from TableEnv#registerTable()) behave differently from what
> Dawid proposes for temp functions, and we need a FLIP just to unify their
> APIs and behaviors.
>
> I agree that backward compatibility is not an issue w.r.t Jark's points.
>
> ***Seems we do have consensus that it's acceptable to prevent users
> registering a temp function in the same name as a built-in function. To
> help us move forward, I'd like to propose setting such a restraint on temp
> functions in this FLIP to simplify the design and avoid disputes.*** It
> will also leave room for improvements in the future.
>
>
> 2. For Hive built-in function:
>
> Thanks Timo for providing the Presto and Postgres examples. I feel modular
> built-in functions can be a good fit for the geo and ml example as a native
> Flink extension, but not sure if it fits well with external integrations.
> Anyway, I think modular built-in functions is a bigger story and can be on
> its own thread too, and our proposal doesn't prevent Flink from doing that
> in the future.
>
> ***Seems we have consensus that users should be able to use built-in
> functions of Hive or other external systems in SQL explicitly and
> deterministically regardless of Flink built-in functions and the potential
> modular built-in functions, via some new syntax like "mycat::func"? If so,
> I'd like to propose removing Hive built-in functions from the ambiguous
> function resolution order, and empowering users with such a syntax. This way
> we sacrifice a little convenience for certainty.***
>
>
> What do you think?
>
> On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz  
>
>
> 
> wrote:
>
>
> Hi,
>
> Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just
> performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they
>
> are
>
> very inconsistent in that manner (spark being way worse on that).
>
> Hive:
>
> You cannot overwrite all the built-in functions. I could overwrite most
>
> of
>
> the functions I tried e.g. length, e, pi, round, rtrim, but there are
> functions I cannot overwrite, e.g. CAST, ARRAY. I get:
>
>
> *ParseException line 1:29 cannot recognize input near 'array' 'AS' *
>
> What is interesting is that I cannot overwrite *array*,

[DISCUSS] modular built-in functions

2019-09-09 Thread Bowen Li
Hi all,

During the discussion of how to support Hive built-in functions in Flink in
FLIP-57 [1], an idea of "modular built-in functions" was brought up with
examples of "Extension" in Postgres [2] and "Plugin" in Presto [3]. Thus
I'd like to kick off a discussion to see if we should adopt such an
approach.

I'll try to summarize the basics of the idea (a rough code sketch follows the list):
- functions from modules (e.g. Geo, ML) can be loaded into Flink as
built-in functions
- modules can be configured with order, discovered using SPI or set via
code like "catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
HiveFunctions)"
- built-in functions from external systems, like Hive, can be packaged
into such a module
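
To make the summary above more concrete, here is a minimal, purely
illustrative sketch; the FunctionModule interface, the module classes and
setFunctionModules(...) are hypothetical names for the proposal, not existing
Flink APIs:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical module contract: a module contributes a set of names to the
// built-in function namespace.
interface FunctionModule {
    List<String> listFunctions();
}

class CoreFunctions implements FunctionModule {
    public List<String> listFunctions() { return Arrays.asList("concat", "rand"); }
}

class GeoFunctions implements FunctionModule {
    public List<String> listFunctions() { return Arrays.asList("st_contains"); }
}

class HiveFunctions implements FunctionModule {
    public List<String> listFunctions() { return Arrays.asList("concat", "get_json_object"); }
}

class FunctionModules {
    // Modules are kept in the order they were configured; same-named functions
    // (e.g. "concat" above) must be disambiguated by that order.
    private final List<FunctionModule> modules = new ArrayList<>();

    void setFunctionModules(FunctionModule... configured) {
        modules.clear();
        modules.addAll(Arrays.asList(configured));
    }

    // The first module in the configured order that declares the name wins.
    String resolve(String functionName) {
        for (FunctionModule module : modules) {
            if (module.listFunctions().contains(functionName)) {
                return module.getClass().getSimpleName() + "#" + functionName;
            }
        }
        return null;
    }
}

With setFunctionModules(new CoreFunctions(), new HiveFunctions()), "concat"
resolves to the Flink implementation; reversing the order would silently pick
the Hive one, which is exactly the name-collision concern raised below.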

I took time and researched Presto Plugin and Postgres Extension, and here
are some of my findings.

Presto:
- "Presto's Catalog associated with a connector, and a catalog only
contains schemas and references a data source via a connector." [4] A
Presto catalog doesn't have the concept of catalog functions, thus all
Presto functions don't have namespaces. Neither does Presto have function
DDL [5].
    - Plugins are not specific to functions - "Plugins can provide
additional Connectors, Types, Functions, and System Access Control" [6]
- Thus, I feel a Plugin in Presto acts more as a "catalog" which is
similar to catalogs in Flink. Since all Presto functions don't have
namespaces, it probably can be seen as a built-in function module.

Postgres:
- Postgres extension is always installed to a schema, not the entire
cluster. There's a "schema_name" param in extension creation DDL - "The
name of the schema in which to install the extension's objects, given that
the extension allows its contents to be relocated. The named schema must
already exist. If not specified, and the extension's control file does not
specify a schema either, the current default object creation schema is
used." [7]  Thus it also acts as "catalog" for schema, and thus functions
in extension are not built-in functions to Postgres.

Therefore, I feel the examples are not exactly the "built-in function
modules" that were brought up, but feel free to correct me if I'm wrong.

Going back to the idea itself: while it seems to be a simpler concept and
design in some ways, I have two concerns:
1. The major one is still on name resolution - how to deal with name
collisions?
    - Not allowing duplicated names won't work for Hive built-in functions,
as many of them share names with Flink's, so we must allow modules
containing same-named functions to be registered
    - One assumption of this approach seems to be that, given modules are
specified in order, functions from modules can be overridden according to the
order?
    - If so, how can users reference a function that is overridden in the
above case (e.g. I may want to switch KMEANS between modules ML1 and ML2
with different implementations)?
 - If it's supported, it seems we still need some new syntax?
 - If it's not supported, that seems to be a major limitation for
users
2. The minor one is that allowing built-in functions from external systems to be
accessed within Flink so widely can bring performance issues to users' jobs
    - Unlike the potential native Flink Geo or ML functions, built-in
functions from external systems come with a pretty big performance penalty
in Flink due to data conversions and a different invocation mechanism.
Supporting Hive built-in functions is mainly for simplifying migration from
Hive. I'm not sure it makes sense when a user job has nothing to do with
Hive data but unintentionally ends up using Hive built-in functions without
knowing it's penalized on performance. Though docs can help to some extent,
not all users really read docs in detail.

An alternative is to treat "function modules" as catalogs.
- For Flink native function modules like Geo or ML, they can be discovered
and registered automatically at runtime with a predefined catalog name, like
"ml" or "ml1", which should be unique. Their functions are considered
built-in functions of that catalog, and can be referenced in some new syntax
like "catalog::func", e.g. "ml::kmeans" and "ml1::kmeans".
- For built-in functions from external systems (e.g. Hive), they have to be
referenced either as "catalog::func" to make sure users are explicitly
expecting those external functions, or as complementary built-in functions
to Flink if a config "enable_hive_built_in_functions" in HiveCatalog is
turned on.

Either approach seems to have its own benefits, and I'm open for discussion
and would like to hear others' opinions and use cases where a specific
solution is required.

Thanks,
Bowen


[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html
[2] https://www.postgresql.org/docs/10/extend-extensions.html
[3] https://prestodb.github.io/docs/current/develop/functions.html
[4]
https://prestodb.github.io/docs/current/overview/concepts.html#data-sources

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-11 Thread Bowen Li
Hi,

Thanks @Fabian @Dawid and everyone else for sharing your thoughts!

First, I'd like to take Hive built-in functions out of this FLIP to keep
our original scope and avoid controversy around a potential modular
approach. I will remove Hive built-in functions from the google doc.

Then the focus of debate is mainly the function resolution order and the temp
function namespace, which are somewhat related. I roughly summarized this
thread, and currently we are debating two approaches (sketched in code after
the summary), with preferences from the following people:

Option 1:
Proposal: temp functions will be of 1-part path (function name only),
and can override built-in functions. The ambiguous function resolution
order is thus 1) temp functions 2) built-in functions 3) catalog functions
in the current catalog/database
Votes: Xuefu, Bowen, Fabian, Jark

Option 2:
Proposal: temp functions will be of 3-part path (with catalog,
database, and function name), and temp functions cannot override built-in
functions. The ambiguous function resolution order is thus 1) built-in
functions 2) temp functions (in 3-part path) 3) catalog functions in the
current catalog/database
Votes:  Dawid, Timo
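
For illustration, a minimal sketch of option 1's lookup order for an
ambiguous (not fully qualified) function reference; the class and method names
are invented for this example. Option 2 would check built-in functions before
temporary ones and key temporary functions by a 3-part path instead.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: sketches option 1's ambiguous function resolution order.
class AmbiguousFunctionResolver {
    // temp functions registered under a 1-part name (function name only)
    private final Map<String, String> temporaryFunctions = new HashMap<>();
    // Flink built-in functions
    private final Map<String, String> builtInFunctions = new HashMap<>();
    // catalog functions in the current catalog/database
    private final Map<String, String> currentCatalogDbFunctions = new HashMap<>();

    Optional<String> resolve(String name) {
        // 1) temporary functions (they may shadow built-ins under option 1)
        if (temporaryFunctions.containsKey(name)) {
            return Optional.of(temporaryFunctions.get(name));
        }
        // 2) built-in functions
        if (builtInFunctions.containsKey(name)) {
            return Optional.of(builtInFunctions.get(name));
        }
        // 3) catalog functions in the current catalog/database
        return Optional.ofNullable(currentCatalogDbFunctions.get(name));
    }
}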


Do you think we need a separate voting thread on the two options in the
community, or are we able to conclude from the above summary?



On Wed, Sep 11, 2019 at 8:09 AM Dawid Wysakowicz 
wrote:

> Hi Fabian,
> Thank you for your response.
> Regarding the temporary function, just wanted to clarify one thing: the
> 3-part identifier does not mean the user always has to provide the catalog
> & database explicitly, the same way a user does not have to provide them
> when e.g. creating a permanent table or view. It does mean, though, that
> functions are always stored within a database, the same way as all the
> permanent objects and other temporary objects (tables, views). If not given
> explicitly, the current catalog & database would be used, both in the create
> statement and when using the function.
>
> Point taken, though, that your preference would be to support overriding
> built-in functions.
>
> Best,
> Dawid
>
> On Wed, 11 Sep 2019, 21:14 Fabian Hueske,  wrote:
>
> > Hi all,
> >
> > I'd like to add my opinion on this topic as well ;-)
> >
> > In general, I think overriding built-in function with temp functions has
> a
> > couple of benefits but also a few challenges:
> >
> > * Users can reimplement the behavior of a built-in function of a different
> > system, e.g., for backward compatibility after a migration.
> > * I don't think that "accidental" overrides and surprising semantics are
> an
> > issue or dangerous. The user registered the temp function in the same
> > session and should therefore be aware of the changed semantics.
> > * I see that not all built-in functions can be overridden, like the CAST
> > example that Dawid gave. However, I think these should be a small
> fraction
> > and such functions could be blacklisted. Sure, that's not super
> consistent,
> > but should (IMO) not be a big issue in practice.
> > * Temp functions should be easy to use. Requiring a 3-part addressing
> makes
> > them a lot less user friendly, IMO. Users need to think about what
> catalog
> > and db to choose when registering them. Also using a temp function in a
> > query becomes less convenient. Moreover, I agree with Bowen's concerns
> that
> > a 3-part addressing scheme reduces the temporal appearance of the
> function.
> >
> > From the three possible solutions, my preference order is
> > 1) 1-part address with override of built-in
> > 2) 1-part address without override of built-in
> > 3) 3-part address
> >
> > Regarding the issue of external built-in functions, I don't think that
> > Timo's proposal of modules is fully orthogonal to this discussion.
> > A Hive function module could be an alternative to offering Hive functions
> > as part of Hive's catalog.
> > From a user's point of view, I think that modules would be a "cleaner"
> > integration ("Why do I need a Hive catalog if all I want to do is apply a
> > Hive function on a Kafka table?").
> > However, the module approach clearly has the problem of dealing with
> > same-named functions in different modules (e.g., a Hive function and a
> > Flink built-in function).
> > The catalog approach has the benefit that functions can be addressed like
> > hiveCat::func (or a similar path).
> >
> > I'm not sure what's the best solution here.
> >
> > Cheers,
> > Fabian
> >
> >
> > Am Mo., 9. Sept. 2019 um 06:30 Uhr schrieb Bowen Li  >:
> >
> > > Hi,
>

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-13 Thread Bowen Li
Hi Fabian,

Yes, I agree 1-part/no-override is the least favorable, thus I didn't
include it as a voting option; the discussion is mainly between
1-part/override built-in and 3-part/no-override built-in.

Re > However, it means that temp functions are differently treated than
other db objects.
IMO, the treatment difference results from the fact that functions are a
bit different from other objects - Flink doesn't have any other built-in
objects (tables, views) except functions.

Cheers,
Bowen


[DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

2019-09-17 Thread Bowen Li
Hi devs,

We'd like to kick off a conversation on "FLIP-68:  Extend Core Table System
with Modular Plugins" [1].

The modular approach was raised in the discussion of how to support Hive
built-in functions in FLIP-57 [2]. As we discussed and looked deeper, we
think it’s a good opportunity to broaden the design and the corresponding
problem it aims to solve. The motivation is to expand Flink’s core table
system and enable users to do customizations by writing pluggable modules.

There are two aspects of the motivation:
1. Empower users to write code and do customized development for the Flink
table core
2. Enable users to integrate Flink with the cores and built-in objects of other
systems, so users can seamlessly reuse what they are familiar with from other
SQL systems as core and built-in objects of the Flink table system

Please take a look; feedback is welcome.

Bowen

[1]
https://docs.google.com/document/d/17CPMpMbPDjvM4selUVEfh_tqUK_oV0TODAUA9dfHakc/edit?usp=sharing
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html


Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-18 Thread Bowen Li
ers should be able to override all catalog objects consistently
> >> according
> >> > to FLIP-64 (Support for Temporary Objects in Table module). If
> functions
> >> > are treated completely different, we need more code and special cases.
> >> From
> >> > an implementation perspective, this topic only affects the lookup
> logic
> >> > which is rather low implementation effort which is why I would like to
> >> > clarify the remaining items. As you said, we have a slight consenus on
> >> > overriding built-in functions; we should also strive for reaching
> >> consensus
> >> > on the remaining topics.
> >> > >>
> >> > >> @Dawid: I like your idea as it ensures registering catalog objects
> >> > consistent and the overriding of built-in functions more explicit.
> >> > >>
> >> > >> Thanks,
> >> > >> Timo
> >> > >>
> >> > >>
> >> > >> On 17.09.19 11:59, kai wang wrote:
> >> > >>> hi, everyone
> >> > >>> I think this flip is very meaningful. it supports functions that
> >> can be
> >> > >>> shared by different catalogs and dbs, reducing the duplication of
> >> > functions.
> >> > >>>
> >> > >>> Our group based on flink's sql parser module implements create
> >> function
> >> > >>> feature, stores the parsed function metadata and schema into
> mysql,
> >> and
> >> > >>> also customizes the catalog, customizes sql-client to support
> custom
> >> > >>> schemas and functions. Loaded, but the function is currently
> global,
> >> > and is
> >> > >>> not subdivided according to catalog and db.
> >> > >>>
> >> > >>> In addition, I very much hope to participate in the development of
> >> this
> >> > >>> flip, I have been paying attention to the community, but found it
> is
> >> > more
> >> > >>> difficult to join.
> >> > >>> thank you.
> >> > >>>
> >> > >>> Xuefu Z  于2019年9月17日周二 上午11:19写道:
> >> > >>>
> >> > >>>> Thanks to Timo and Dawid for sharing their thoughts.
> >> > >>>>
> >> > >>>> It seems to me that there is a general consensus on having temp
> >> > functions
> >> > >>>> that have no namespaces and overwrite built-in functions. (As a
> >> side
> >> > note
> >> > >>>> for comparability, the current user defined functions are all
> >> > temporary and
> >> > >>>> having no namespaces.)
> >> > >>>>
> >> > >>>> Nevertheless, I can also see the merit of having namespaced temp
> >> > functions
> >> > >>>> that can overwrite functions defined in a specific cat/db.
> However,
> >> > this
> >> > >>>> idea appears orthogonal to the former and can be added
> >> incrementally.
> >> > >>>>
> >> > >>>> How about we first implement non-namespaced temp functions now
> and
> >> > leave
> >> > >>>> the door open for namespaced ones for later releases as the
> >> > requirement
> >> > >>>> might become more crystal? This also helps shorten the debate and
> >> > allow us
> >> > >>>> to make some progress along this direction.
> >> > >>>>
> >> > >>>> As to Dawid's idea of having a dedicated cat/db to host the
> >> temporary
> >> > temp
> >> > >>>> functions that don't have namespaces, my only concern is the
> >> special
> >> > >>>> treatment for a cat/db, which makes code less clean, as evident
> in
> >> > treating
> >> > >>>> the built-in catalog currently.
> >> > >>>>
> >> > >>>> Thanks,
> >> > >>>> Xuefiu
> >> > >>>>
> >> > >>>> On Mon, Sep 16, 2019 at 5:07 PM Dawid Wysakowicz <
> >> > >>>> wysakowicz.da...@gmail.com>
> >> > >>>> wrote:
> >> > >>>>
> >> > >>>>> Hi,
&

Re: [VOTE] Improve TableFactory to add Context

2020-02-05 Thread Bowen Li
+1, LGTM

On Tue, Feb 4, 2020 at 11:28 PM Jark Wu  wrote:

> +1 form my side.
> Thanks for driving this.
>
> Btw, could you also attach a JIRA issue with the changes described in it,
> so that users can find the issue through the mailing list in the future.
>
> Best,
> Jark
>
> On Wed, 5 Feb 2020 at 13:38, Kurt Young  wrote:
>
> > +1 from my side.
> >
> > Best,
> > Kurt
> >
> >
> > On Wed, Feb 5, 2020 at 10:59 AM Jingsong Li 
> > wrote:
> >
> > > Hi all,
> > >
> > > Interface updated.
> > > Please re-vote.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Tue, Feb 4, 2020 at 1:28 PM Jingsong Li 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to start the vote for the improvement of
> > > > TableFactory, which was discussed and
> > > > reached a consensus in the discussion thread [1].
> > > >
> > > > The vote will be open for at least 72 hours. I'll try to close it
> > > > unless there is an objection or not enough votes.
> > > >
> > > > [1]
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Improve-TableFactory-td36647.html
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
>


Re: [DISCUSS] FLIP-92: JDBC catalog and Postgres catalog

2020-02-17 Thread Bowen Li
Hi all,

If there are no more comments, I would like to kick off a vote for this FLIP
[1].

FYI, the FLIP number has changed to 93 since there was a race condition in
taking 92.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-93%3A+JDBC+catalog+and+Postgres+catalog

On Wed, Jan 22, 2020 at 11:05 AM Bowen Li  wrote:

> Hi Flavio,
>
> First, this is a generic question on how flink-jdbc is set up, not
> specific to the JDBC catalog, thus it is better suited to its own thread.
>
> But to quickly answer your question, you need to see where the
> incompatibility is. There may be incompatibility in 1) the JDBC drivers and 2)
> the databases. 1) is fairly stable and backward-compatible. 2) normally has
> to do with your queries, not the driver.
>
>
>
> On Tue, Jan 21, 2020 at 3:21 PM Flavio Pompermaier 
> wrote:
>
>> Hi all,
>> I'm happy to see a lot of interest in easing the integration with JDBC
>> data
>> sources. Maybe this could be a rare situation (not in my experience
>> however..) but what if I have to connect to the same type of source (e.g.
>> MySQL) with 2 incompatible versions...? How can I load the 2 (or more)
>> connector jars without causing conflicts?
>>
>> Il Mar 14 Gen 2020, 23:32 Bowen Li  ha scritto:
>>
>> > Hi devs,
>> >
>> > I've updated the wiki according to feedbacks. Please take another look.
>> >
>> > Thanks!
>> >
>> >
>> > On Fri, Jan 10, 2020 at 2:24 PM Bowen Li  wrote:
>> >
>> > > Thanks everyone for the prompt feedback. Please see my response below.
>> > >
>> > > > In Postgres, the TIME/TIMESTAMP WITH TIME ZONE has the
>> > > java.time.Instant semantic, and should be mapped to Flink's
>> > TIME/TIMESTAMP
>> > > WITH LOCAL TIME ZONE
>> > >
>> > > Zhenghua, you are right that pg's 'timestamp with timezone' should be
>> > > translated into flink's 'timestamp with local timezone'. I don't find
>> > 'time
>> > > with (local) timezone' though, so we may not support that type from
>> pg in
>> > > Flink.
>> > >
>> > > > I suggest that the parameters can be completely consistent with the
>> > > JDBCTableSource / JDBCTableSink. If you take a look to JDBC api:
>> > > "DriverManager.getConnection".
>> > > That allow "default db, username, pwd" things optional. They can
>> included
>> > > in URL. Of course JDBC api also allows establishing connections to
>> > > different databases in a db instance. So I think we don't need
>> provide a
>> > > "base_url", we can just provide a real "url". To be consistent with
>> JDBC
>> > > api.
>> > >
>> > > Jingsong, what I'm saying is a builder can be added on demand later if
>> > > there's enough user requesting it, and doesn't need to be a core part
>> of
>> > > the FLIP.
>> > >
>> > > Besides, unfortunately Postgres doesn't allow changing databases via
>> > JDBC.
>> > >
>> > > JDBC provides different connecting options as you mentioned, but I'd
>> like
>> > > to keep our design and API simple and having to handle extra parsing
>> > logic.
>> > > And it doesn't shut the door for what you proposed as a future effort.
>> > >
>> > > > Since the PostgreSQL does not have catalog but schema under
>> database,
>> > > why not mapping the PG-database to Flink catalog and PG-schema to
>> Flink
>> > > database
>> > >
>> > > Danny, because 1) there are frequent use cases where users want to
>> switch
>> > > databases or referencing objects across databases in a pg instance 2)
>> > > schema is an optional namespace layer in pg, it always has a default
>> > value
>> > > ("public") and can be invisible to users if they'd like to as shown in
>> > the
>> > > FLIP 3) as you mentioned it is specific to postgres, and I don't feel
>> > it's
>> > > necessary to map Postgres substantially different than others DBMSs
>> with
>> > > additional complexity
>> > >
>> > > >'base_url' configuration: We are following the configuration format
>> > > guideline [1] which suggest to use dash (-) instead of underline (_).
>> And
>> > > I'm a little confused the

Re: [ANNOUNCE] Jingsong Lee becomes a Flink committer

2020-02-21 Thread Bowen Li
Congrats, Jingsong!

On Fri, Feb 21, 2020 at 7:28 AM Till Rohrmann  wrote:

> Congratulations Jingsong!
>
> Cheers,
> Till
>
> On Fri, Feb 21, 2020 at 4:03 PM Yun Gao  wrote:
>
>>   Congratulations Jingsong!
>>
>>Best,
>>Yun
>>
>> --
>> From:Jingsong Li 
>> Send Time:2020 Feb. 21 (Fri.) 21:42
>> To:Hequn Cheng 
>> Cc:Yang Wang ; Zhijiang <
>> wangzhijiang...@aliyun.com>; Zhenghua Gao ; godfrey he
>> ; dev ; user <
>> u...@flink.apache.org>
>> Subject:Re: [ANNOUNCE] Jingsong Lee becomes a Flink committer
>>
>> Thanks everyone~
>>
>> It's my pleasure to be part of the community. I hope I can make a better
>> contribution in future.
>>
>> Best,
>> Jingsong Lee
>>
>> On Fri, Feb 21, 2020 at 2:48 PM Hequn Cheng  wrote:
>> Congratulations Jingsong! Well deserved.
>>
>> Best,
>> Hequn
>>
>> On Fri, Feb 21, 2020 at 2:42 PM Yang Wang  wrote:
>> Congratulations!Jingsong. Well deserved.
>>
>>
>> Best,
>> Yang
>>
>> Zhijiang  于2020年2月21日周五 下午1:18写道:
>> Congrats Jingsong! Welcome on board!
>>
>> Best,
>> Zhijiang
>>
>> --
>> From:Zhenghua Gao 
>> Send Time:2020 Feb. 21 (Fri.) 12:49
>> To:godfrey he 
>> Cc:dev ; user 
>> Subject:Re: [ANNOUNCE] Jingsong Lee becomes a Flink committer
>>
>> Congrats Jingsong!
>>
>>
>> *Best Regards,*
>> *Zhenghua Gao*
>>
>>
>> On Fri, Feb 21, 2020 at 11:59 AM godfrey he  wrote:
>> Congrats Jingsong! Well deserved.
>>
>> Best,
>> godfrey
>>
>> Jeff Zhang  于2020年2月21日周五 上午11:49写道:
>> Congratulations!Jingsong. You deserve it
>>
>> wenlong.lwl  于2020年2月21日周五 上午11:43写道:
>> Congrats Jingsong!
>>
>> On Fri, 21 Feb 2020 at 11:41, Dian Fu  wrote:
>>
>> > Congrats Jingsong!
>> >
>> > > 在 2020年2月21日,上午11:39,Jark Wu  写道:
>> > >
>> > > Congratulations Jingsong! Well deserved.
>> > >
>> > > Best,
>> > > Jark
>> > >
>> > > On Fri, 21 Feb 2020 at 11:32, zoudan  wrote:
>> > >
>> > >> Congratulations! Jingsong
>> > >>
>> > >>
>> > >> Best,
>> > >> Dan Zou
>> > >>
>> >
>> >
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>>
>> --
>> Best, Jingsong Lee
>>
>>
>>


[VOTE] FLIP-93: JDBC catalog and Postgres catalog

2020-02-27 Thread Bowen Li
Hi all,

I'd like to kick off the vote for FLIP-93 [1] to add JDBC catalog and
Postgres catalog.

The vote will last for at least 72 hours, following the consensus voting
protocol.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-93%3A+JDBC+catalog+and+Postgres+catalog

Discussion thread:
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-92-JDBC-catalog-and-Postgres-catalog-td36505.html


Re: Creating TemporalTable based on Catalog table in SQL Client

2020-03-03 Thread Bowen Li
Hi Gyula,

What line 622 (the link you shared) does is not registering catalogs, but
setting an already registered catalog as the current one. As you can see
from the method and its comment, catalogs are loaded first before any
tables in yaml are registered, so you should be able to achieve what you
described.

Bowen

On Tue, Mar 3, 2020 at 5:16 AM Gyula Fóra  wrote:

> Hi all!
>
> I was testing the TemporalTable functionality in the SQL client while using
> the Hive Catalog and I ran into the following problem.
>
> I have a table created in the Hive catalog and I want to create a temporal
> table over it.
>
> As we cannot create temporal tables in SQL directly I have to define it in
> the environment yaml file. Unfortunately it seems to be impossible to
> reference a table only present in the catalog (not in the yaml) as catalogs
> are loaded only after creating the temporal table (see
>
> https://github.com/apache/flink/blob/master/flink-table/flink-sql-client/src/main/java/org/apache/flink/table/client/gateway/local/ExecutionContext.java#L622
> )
>
> I am wondering if it would make sense to set the catalogs before all else
> or if that would cause some other problems.
>
> What do you think?
> Gyula
>


Re: Creating TemporalTable based on Catalog table in SQL Client

2020-03-04 Thread Bowen Li
You would need to reference the table with a fully qualified name that
includes the catalog and database.

On Wed, Mar 4, 2020 at 02:17 Gyula Fóra  wrote:

> I guess it will only work now if you specify the catalog name too when
> referencing the table.
>
>
> On Wed, Mar 4, 2020 at 11:15 AM Gyula Fóra  wrote:
>
> > You are right but still if the default catalog is something else and
> > that's the one containing the table then it still wont work currently.
> >
> > Gyula
> >
> > On Wed, Mar 4, 2020 at 5:08 AM Bowen Li  wrote:
> >
> >> Hi Gyula,
> >>
> >> What line 622 (the link you shared) does is not registering catalogs,
> but
> >> setting an already registered catalog as the current one. As you can see
> >> from the method and its comment, catalogs are loaded first before any
> >> tables in yaml are registered, so you should be able to achieve what you
> >> described.
> >>
> >> Bowen
> >>
> >> On Tue, Mar 3, 2020 at 5:16 AM Gyula Fóra  wrote:
> >>
> >> > Hi all!
> >> >
> >> > I was testing the TemporalTable functionality in the SQL client while
> >> using
> >> > the Hive Catalog and I ran into the following problem.
> >> >
> >> > I have a table created in the Hive catalog and I want to create a
> >> temporal
> >> > table over it.
> >> >
> >> > As we cannot create temporal tables in SQL directly I have to define
> it
> >> in
> >> > the environment yaml file. Unfortunately it seems to be impossible to
> >> > reference a table only present in the catalog (not in the yaml) as
> >> catalogs
> >> > are loaded only after creating the temporal table (see
> >> >
> >> >
> >>
> https://github.com/apache/flink/blob/master/flink-table/flink-sql-client/src/main/java/org/apache/flink/table/client/gateway/local/ExecutionContext.java#L622
> >> > )
> >> >
> >> > I am wondering if it would make sense to set the catalogs before all
> >> else
> >> > or if that would cause some other problems.
> >> >
> >> > What do you think?
> >> > Gyula
> >> >
> >>
> >
>


Re: [DISCUSS] Introduce flink-connector-hive-xx modules

2020-03-04 Thread Bowen Li
Thanks, Jingsong, for bringing this up. We've received lots of feedback in
the past few months that the complexity involved in different Hive versions
has been quite painful for users to start with. So it's great to step
forward and deal with this issue.

Before getting to a decision, can you please explain:

1) why you proposed segregating hive versions into the 5 ranges above?
2) what different Hive features are supported in the 5 ranges?
3) have you tested whether the proposed corresponding Flink module
will be fully compatible with each Hive version range?

Thanks,
Bowen



On Wed, Mar 4, 2020 at 1:00 AM Jingsong Lee  wrote:

> Hi all,
>
> I'd like to propose introduce flink-connector-hive-xx modules.
>
> We have documented the dependencies detailed information[2]. But still has
> some inconvenient:
> - Too many versions, users need to pick one version from 8 versions.
> - Too many versions, It's not friendly to our developers either, because
> there's a problem/exception, we need to look at eight different versions of
> hive client code, which are often various.
> - Too many jars, for example, users need to download 4+ jars for Hive 1.x
> from various places.
>
> We have discussed in [1] and [2], but unfortunately, we can not achieve an
> agreement.
>
> For improving this, I'd like to introduce few flink-connector-hive-xx
> modules in flink-connectors, module contains all the dependencies related
> to hive. And only support lower hive metastore versions:
> - "flink-connector-hive-1.2" to support hive 1.0.0 - 1.2.2
> - "flink-connector-hive-2.0" to support hive 2.0.0 - 2.0.1
> - "flink-connector-hive-2.2" to support hive 2.1.0 - 2.2.0
> - "flink-connector-hive-2.3" to support hive 2.3.0 - 2.3.6
> - "flink-connector-hive-3.1" to support hive 3.0.0 - 3.1.2
>
> Users can choose one and download to flink/lib. It includes all hive
> things.
>
> I try to use a single module to deploy multiple versions, but I can not
> find a suitable way, because different modules require different versions
> and different dependencies.
>
> What do you think?
>
> [1]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-have-separate-Flink-distributions-with-built-in-Hive-dependencies-td35918.html
> [2]
>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-109-Improve-Hive-dependencies-out-of-box-experience-td38290.html
>
> Best,
> Jingsong Lee
>


Re: [VOTE] FLIP-93: JDBC catalog and Postgres catalog

2020-03-04 Thread Bowen Li
I'm glad to announce that the voting of FLIP-93 has passed, with 7 +1  (3
binding: Jingsong, Kurt, Jark, 4 non-binding: Benchao, zoudan, Terry,
Leonard) and no -1.

Thanks everyone for participating!

Cheers,
Bowen

On Mon, Mar 2, 2020 at 7:33 AM Leonard Xu  wrote:

> +1 (non-binding).
>
>  Very useful feature, especially for ETL. It will make connecting to
> existing DB systems easier.
>
> Best,
> Leonard
>
> > 在 2020年3月2日,21:58,Jark Wu  写道:
> >
> > +1 from my side.
> >
> > Best,
> > Jark
> >
> > On Mon, 2 Mar 2020 at 21:40, Kurt Young  wrote:
> >
> >> +1
> >>
> >> Best,
> >> Kurt
> >>
> >>
> >> On Mon, Mar 2, 2020 at 5:32 PM Jingsong Lee 
> >> wrote:
> >>
> >>> +1 from my side.
> >>>
> >>> Best,
> >>> Jingsong Lee
> >>>
> >>> On Mon, Mar 2, 2020 at 11:06 AM Terry Wang  wrote:
> >>>
> >>>> +1 (non-binding).
> >>>> With this feature, we can more easily interact traditional database in
> >>>> flink.
> >>>>
> >>>> Best,
> >>>> Terry Wang
> >>>>
> >>>>
> >>>>
> >>>>> 2020年3月1日 18:33,zoudan  写道:
> >>>>>
> >>>>> +1 (non-binding)
> >>>>>
> >>>>> Best,
> >>>>> Dan Zou
> >>>>>
> >>>>>
> >>>>>> 在 2020年2月28日,02:38,Bowen Li  写道:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I'd like to kick off the vote for FLIP-93 [1] to add JDBC catalog
> >> and
> >>>>>> Postgres catalog.
> >>>>>>
> >>>>>> The vote will last for at least 72 hours, following the consensus
> >>> voting
> >>>>>> protocol.
> >>>>>>
> >>>>>> [1]
> >>>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-93%3A+JDBC+catalog+and+Postgres+catalog
> >>>>>>
> >>>>>> Discussion thread:
> >>>>>>
> >>>>
> >>>
> >>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-92-JDBC-catalog-and-Postgres-catalog-td36505.html
> >>>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>> Best, Jingsong Lee
> >>>
> >>
>
>


Re: [DISCUSS] Introduce flink-connector-hive-xx modules

2020-03-04 Thread Bowen Li
Thanks Jingsong for your explanation! I'm +1 for this initiative.

According to your description, I think it makes sense to incorporate
support of Hive 2.2 into that of 2.0/2.1, reducing the number of ranges to
4.

A couple minor followup questions:
1) will there be a base module like "flink-connector-hive-base" which holds
all the common logic of these proposed modules and is compiled into the
uber jar of "flink-connector-hive-xxx"?
2) according to my observation, it's more common to set the version in
module name to be the lowest version that this module supports, e.g. for
Hive 1.0.0 - 1.2.2, the module name can be "flink-connector-hive-1.0"
rather than "flink-connector-hive-1.2"


On Wed, Mar 4, 2020 at 10:20 PM Jingsong Li  wrote:

> Thanks Bowen for involving.
>
> > why you proposed segregating hive versions into the 5 ranges above? &
> what different Hive features are supported in the 5 ranges?
>
> For only higher client dependencies version support lower hive metastore
> versions:
> - Hive 1.0.0 - 1.2.2, thrift change is OK, only hive date column stats, we
> can throw exception for the unsupported feature.
> - Hive 2.0 and Hive 2.1, primary key support and alter_partition api
> change.
> - Hive 2.2 no thrift change.
> - Hive 2.3 change many things, lots of thrift change.
> - Hive 3+, not null. unique, timestamp, so many things.
>
> All these things can be found in hive_metastore.thrift.
>
> I think I can try do more effort in implementation to use Hive 2.2 to
> support Hive 2.0. So the range size will be 4.
>
> > have you tested that whether the proposed corresponding Flink module will
> be fully compatible with each Hive version range?
>
> Yes, I have done some tests, not really for "fully", but it is a technical
> judgment.
>
> Best,
> Jingsong Lee
>
> On Thu, Mar 5, 2020 at 1:17 PM Bowen Li  wrote:
>
> > Thanks, Jingsong, for bringing this up. We've received lots of feedbacks
> in
> > the past few months that the complexity involved in different Hive
> versions
> > has been quite painful for users to start with. So it's great to step
> > forward and deal with such issue.
> >
> > Before getting on a decision, can you please explain:
> >
> > 1) why you proposed segregating hive versions into the 5 ranges above?
> > 2) what different Hive features are supported in the 5 ranges?
> > 3) have you tested that whether the proposed corresponding Flink module
> > will be fully compatible with each Hive version range?
> >
> > Thanks,
> > Bowen
> >
> >
> >
> > On Wed, Mar 4, 2020 at 1:00 AM Jingsong Lee 
> > wrote:
> >
> > > Hi all,
> > >
> > > I'd like to propose introduce flink-connector-hive-xx modules.
> > >
> > > We have documented the dependencies detailed information[2]. But still
> > has
> > > some inconvenient:
> > > - Too many versions, users need to pick one version from 8 versions.
> > > - Too many versions, It's not friendly to our developers either,
> because
> > > there's a problem/exception, we need to look at eight different
> versions
> > of
> > > hive client code, which are often various.
> > > - Too many jars, for example, users need to download 4+ jars for Hive
> 1.x
> > > from various places.
> > >
> > > We have discussed in [1] and [2], but unfortunately, we can not achieve
> > an
> > > agreement.
> > >
> > > For improving this, I'd like to introduce few flink-connector-hive-xx
> > > modules in flink-connectors, module contains all the dependencies
> related
> > > to hive. And only support lower hive metastore versions:
> > > - "flink-connector-hive-1.2" to support hive 1.0.0 - 1.2.2
> > > - "flink-connector-hive-2.0" to support hive 2.0.0 - 2.0.1
> > > - "flink-connector-hive-2.2" to support hive 2.1.0 - 2.2.0
> > > - "flink-connector-hive-2.3" to support hive 2.3.0 - 2.3.6
> > > - "flink-connector-hive-3.1" to support hive 3.0.0 - 3.1.2
> > >
> > > Users can choose one and download to flink/lib. It includes all hive
> > > things.
> > >
> > > I try to use a single module to deploy multiple versions, but I can not
> > > find a suitable way, because different modules require different
> versions
> > > and different dependencies.
> > >
> > > What do you think?
> > >
> > > [1]
> > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-have-separate-Flink-distributions-with-built-in-Hive-dependencies-td35918.html
> > > [2]
> > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-109-Improve-Hive-dependencies-out-of-box-experience-td38290.html
> > >
> > > Best,
> > > Jingsong Lee
> > >
> >
>
>
> --
> Best, Jingsong Lee
>


Re: [DISCUSS] Introduce flink-connector-hive-xx modules

2020-03-05 Thread Bowen Li
> I have some hesitation, because the actual version number can better
reflect the actual dependency. For example, if the user also knows the
field hiveVersion[1]. He may enter the wrong hiveVersion because of the
name, or he may have the wrong expectation for the hive built-in functions.

Sorry, I'm not sure if my proposal was understood correctly.

What I'm saying is: your original proposal suggested, to take an example,
naming the module "flink-connector-hive-1.2" to support Hive 1.0.0 -
1.2.2, i.e. a name including the highest Hive version it supports. I'm
suggesting to name it "flink-connector-hive-1.0", a name including the
lowest Hive version it supports.

What do you think?



On Wed, Mar 4, 2020 at 11:14 PM Jingsong Li  wrote:

> Hi Bowen, thanks for your reply.
>
> > will there be a base module like "flink-connector-hive-base" which holds
> all the common logic of these proposed modules
>
> Maybe we don't need, their implementation is only "pom.xml". Different
> versions have different dependencies.
>
> > it's more common to set the version in module name to be the lowest
> version that this module supports
>
> I have some hesitation, because the actual version number can better
> reflect the actual dependency. For example, if the user also knows the
> field hiveVersion[1]. He may enter the wrong hiveVersion because of the
> name, or he may have the wrong expectation for the hive built-in functions.
>
> [1] https://github.com/apache/flink/pull/11304
>
> Best,
> Jingsong Lee
>
> On Thu, Mar 5, 2020 at 2:34 PM Bowen Li  wrote:
>
> > Thanks Jingsong for your explanation! I'm +1 for this initiative.
> >
> > According to your description, I think it makes sense to incorporate
> > support of Hive 2.2 to that of 2.0/2.1 and reducing the number of ranges
> to
> > 4.
> >
> > A couple minor followup questions:
> > 1) will there be a base module like "flink-connector-hive-base" which
> holds
> > all the common logic of these proposed modules and is compiled into the
> > uber jar of "flink-connector-hive-xxx"?
> > 2) according to my observation, it's more common to set the version in
> > module name to be the lowest version that this module supports, e.g. for
> > Hive 1.0.0 - 1.2.2, the module name can be "flink-connector-hive-1.0"
> > rather than "flink-connector-hive-1.2"
> >
> >
> > On Wed, Mar 4, 2020 at 10:20 PM Jingsong Li 
> > wrote:
> >
> > > Thanks Bowen for involving.
> > >
> > > > why you proposed segregating hive versions into the 5 ranges above? &
> > > what different Hive features are supported in the 5 ranges?
> > >
> > > For only higher client dependencies version support lower hive
> metastore
> > > versions:
> > > - Hive 1.0.0 - 1.2.2, thrift change is OK, only hive date column stats,
> > we
> > > can throw exception for the unsupported feature.
> > > - Hive 2.0 and Hive 2.1, primary key support and alter_partition api
> > > change.
> > > - Hive 2.2 no thrift change.
> > > - Hive 2.3 change many things, lots of thrift change.
> > > - Hive 3+, not null. unique, timestamp, so many things.
> > >
> > > All these things can be found in hive_metastore.thrift.
> > >
> > > I think I can try do more effort in implementation to use Hive 2.2 to
> > > support Hive 2.0. So the range size will be 4.
> > >
> > > > have you tested that whether the proposed corresponding Flink module
> > will
> > > be fully compatible with each Hive version range?
> > >
> > > Yes, I have done some tests, not really for "fully", but it is a
> > technical
> > > judgment.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Thu, Mar 5, 2020 at 1:17 PM Bowen Li  wrote:
> > >
> > > > Thanks, Jingsong, for bringing this up. We've received lots of
> > feedbacks
> > > in
> > > > the past few months that the complexity involved in different Hive
> > > versions
> > > > has been quite painful for users to start with. So it's great to step
> > > > forward and deal with such issue.
> > > >
> > > > Before getting on a decision, can you please explain:
> > > >
> > > > 1) why you proposed segregating hive versions into the 5 ranges
> above?
> > > > 2) what different Hive features are supported in the 5 ranges?
> > > > 3) have you tested that whether the proposed 

Re: [DISCUSS] Introduce flink-connector-hive-xx modules

2020-03-05 Thread Bowen Li
Hi Jingsong,

I think I misunderstood you. So your argument is that, to support hive
1.0.0 - 1.2.2, we are actually using Hive 1.2.2 and thus we name the flink
module as "flink-connector-hive-1.2", right? It makes sense to me now.

+1 for this change.

Cheers,
Bowen

On Thu, Mar 5, 2020 at 6:53 PM Jingsong Li  wrote:

> Hi Bowen,
>
> My idea is to directly provide the really dependent version, such as hive
> 1.2.2, our jar name is hive 1.2.2, so that users can directly and clearly
> know the version. As for which metastore is supported, we can guide it in
> the document, otherwise, write 1.0, and the result version is indeed 1.2.2,
> which will make users have wrong expectations.
>
> Another, maybe 2.3.6 can support 2.0-2.2 after some efforts.
>
> Best,
> Jingsong Lee
>
> On Fri, Mar 6, 2020 at 1:00 AM Bowen Li  wrote:
>
> > > I have some hesitation, because the actual version number can better
> > reflect the actual dependency. For example, if the user also knows the
> > field hiveVersion[1]. He may enter the wrong hiveVersion because of the
> > name, or he may have the wrong expectation for the hive built-in
> functions.
> >
> > Sorry, I'm not sure if my proposal is understood correctly.
> >
> > What I'm saying is, in your original proposal, taking an example,
> suggested
> > naming the module as "flink-connector-hive-1.2" to support hive 1.0.0 -
> > 1.2.2, a name including the highest Hive version it supports. I'm
> > suggesting to name it "flink-connector-hive-1.0", a name including the
> > lowest Hive version it supports.
> >
> > What do you think?
> >
> >
> >
> > On Wed, Mar 4, 2020 at 11:14 PM Jingsong Li 
> > wrote:
> >
> > > Hi Bowen, thanks for your reply.
> > >
> > > > will there be a base module like "flink-connector-hive-base" which
> > holds
> > > all the common logic of these proposed modules
> > >
> > > Maybe we don't need, their implementation is only "pom.xml". Different
> > > versions have different dependencies.
> > >
> > > > it's more common to set the version in module name to be the lowest
> > > version that this module supports
> > >
> > > I have some hesitation, because the actual version number can better
> > > reflect the actual dependency. For example, if the user also knows the
> > > field hiveVersion[1]. He may enter the wrong hiveVersion because of the
> > > name, or he may have the wrong expectation for the hive built-in
> > functions.
> > >
> > > [1] https://github.com/apache/flink/pull/11304
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Thu, Mar 5, 2020 at 2:34 PM Bowen Li  wrote:
> > >
> > > > Thanks Jingsong for your explanation! I'm +1 for this initiative.
> > > >
> > > > According to your description, I think it makes sense to incorporate
> > > > support of Hive 2.2 to that of 2.0/2.1 and reducing the number of
> > ranges
> > > to
> > > > 4.
> > > >
> > > > A couple minor followup questions:
> > > > 1) will there be a base module like "flink-connector-hive-base" which
> > > holds
> > > > all the common logic of these proposed modules and is compiled into
> the
> > > > uber jar of "flink-connector-hive-xxx"?
> > > > 2) according to my observation, it's more common to set the version
> in
> > > > module name to be the lowest version that this module supports, e.g.
> > for
> > > > Hive 1.0.0 - 1.2.2, the module name can be "flink-connector-hive-1.0"
> > > > rather than "flink-connector-hive-1.2"
> > > >
> > > >
> > > > On Wed, Mar 4, 2020 at 10:20 PM Jingsong Li 
> > > > wrote:
> > > >
> > > > > Thanks Bowen for involving.
> > > > >
> > > > > > why you proposed segregating hive versions into the 5 ranges
> > above? &
> > > > > what different Hive features are supported in the 5 ranges?
> > > > >
> > > > > For only higher client dependencies version support lower hive
> > > metastore
> > > > > versions:
> > > > > - Hive 1.0.0 - 1.2.2, thrift change is OK, only hive date column
> > stats,
> > > > we
> > > > > can throw exception for the unsupported feature.
> > > > > - Hive 2.0 and Hive 2.1, prima

Re: [DISCUSS]FLIP-113: Support SQL and planner hints

2020-03-10 Thread Bowen Li
Thanks Danny for kicking off the effort.

The root cause of too much manual work is that Flink DDL has mixed 3 types of
params together and doesn't handle each of them very well. Below is how I
categorize them and the corresponding solutions in my mind:

- type 1: Metadata of external data, like external endpoint/url,
username/pwd, schemas, formats.

Such metadata are mostly already accessible in the external system as long as
endpoints and credentials are provided. Flink can get them thru catalogs, but
we haven't had many catalogs yet and thus Flink just hasn't been able to
leverage that. So the solution should be building more catalogs. Such
params should be part of a Flink table DDL/definition, and not overridable
by any means.


- type 2: Runtime params, like jdbc connector's fetch size, elasticsearch
connector's bulk flush size.

Such params don't affect query results, but affect how results are produced
(e.g. fast or slow, aka performance) - they are essentially execution and
implementation details. They change often in exploration or development
stages, but not quite so frequently in well-defined long-running pipelines.
They should always have default values and can be missing in a query. They
can be part of a table DDL/definition, but should also be replaceable in a
query - *this is what table "hints" in FLIP-113 should cover*.


- type 3: Semantic params, like kafka connector's start offset.

Such params affect query results - the semantics. They'd better be expressed
as filter conditions in the WHERE clause that can be pushed down. They change
almost every time a query starts and have nothing to do with metadata, thus
they should not be part of the table definition/DDL, nor be persisted in
catalogs. If they must be, users should create views to keep such params
around (note this is different from variable substitution).


Take Flink-Kafka as an example. Once we get these params right, here're the
steps users need to do to develop and run a Flink job (a rough sketch in code
follows below):
- configure a Flink ConfluentSchemaRegistry with url, username, and password
- run "SELECT * FROM mykafka WHERE offset > 12pm yesterday" (simplified
timestamp) in SQL CLI; Flink automatically retrieves all metadata of
schema, file format, etc. and starts the job
- users want to make the job read the Kafka topic faster, so it goes as "SELECT
* FROM mykafka /* faster_read_key=value*/ WHERE offset > 12pm yesterday"
- done and satisfied, users submit it to production
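
A rough sketch of those steps in code, under heavy assumptions: the
"confluent_catalog" catalog and the "faster.read.key" hint key are
hypothetical, the PROPERTIES(...) hint syntax is only what is being proposed
in this thread, and filtering on an "offset" column is likewise part of the
proposal rather than current Flink behavior.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaParamsSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().build());

        // type 1: metadata comes from a (hypothetical) schema-registry-backed
        // catalog, so endpoints/credentials/schemas are not retyped per query
        tEnv.useCatalog("confluent_catalog");

        // type 3: semantic params expressed as a filter in the query itself
        Table fromYesterday = tEnv.sqlQuery(
                "SELECT * FROM mykafka WHERE `offset` > TIMESTAMP '2020-03-09 12:00:00'");

        // type 2: runtime params tweaked per query via the proposed hint syntax,
        // without touching the table definition in the catalog
        Table tuned = tEnv.sqlQuery(
                "SELECT * FROM mykafka /*+ PROPERTIES('faster.read.key'='value') */ "
                        + "WHERE `offset` > TIMESTAMP '2020-03-09 12:00:00'");
    }
}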


Regarding "CREATE TABLE t LIKE with (k1=v1, k2=v2), I think it's a
nice-to-have feature, but not a strategically critical, long-term solution,
because
1) It may seem promising at the current stage to solve the
too-much-manual-work problem, but that's only because Flink hasn't
leveraged catalogs well and handled the 3 types of params above properly.
Once we get the params types right, the LIKE syntax won't be that
important, and will be just an easier way to create tables without retyping
long fields like username and pwd.
2) Note that only some rare type of catalog can store k-v property pair, so
table created this way often cannot be persisted. In the foreseeable
future, such catalog will only be HiveCatalog, and not everyone has a Hive
metastore. To be honest, without persistence, recreating tables every time
this way is still a lot of keyboard typing.

Cheers,
Bowen

On Tue, Mar 10, 2020 at 8:07 PM Kurt Young  wrote:

> If a specific connector want to have such parameter and read if out of
> configuration, then that's fine.
> If we are talking about a configuration for all kinds of sources, I would
> be super careful about that.
> It's true it can solve maybe 80% cases, but it will also make the left 20%
> feels weird.
>
> Best,
> Kurt
>
>
> On Wed, Mar 11, 2020 at 11:00 AM Jark Wu  wrote:
>
> > Hi Kurt,
> >
> > #3 Regarding to global offset:
> > I'm not saying to use the global configuration to override connector
> > properties by the planner.
> > But the connector should take this configuration and translate into their
> > client API.
> > AFAIK, almost all the message queues support earliest and latest and a
> > timestamp value as start point.
> > So we can support 3 options for this configuration: "earliest", "latest"
> > and a timestamp string value.
> > Of course, this can't solve 100% cases, but I guess can sovle 80% or 90%
> > cases.
> > And the remaining cases can be resolved by LIKE syntax which I guess is
> not
> > very common cases.
> >
> > Best,
> > Jark
> >
> >
> > On Wed, 11 Mar 2020 at 10:33, Kurt Young  wrote:
> >
> > > Good to have such lovely discussions. I also want to share some of my
> > > opinions.
> > >
> > > #1 Regarding to error handling: I also think ignore invalid hints would
> > be
> > > dangerous, maybe
> > > the simplest solution is just throw an exception.
> > >
> > > #2 Regarding to property replacement: I don't think we should
> constraint
> > > ourself to
> > > the meaning of the word "hint", and forbidden it modifying any
> properties
> > > which can effect
> > > query results. IMO `PROPERTIES` is one 

Re: [DISCUSS]FLIP-113: Support SQL and planner hints

2020-03-11 Thread Bowen Li
it is true that our DDL is not standard compliant by using the WITH
> >> clause. Nevertheless, we aim for not diverging too much and the LIKE
> >> clause is an example of that. It will solve things like overwriting
> >> WATERMARKs, add additional/modifying properties and inherit schema.
> >>
> >> Bowen is right that Flink's DDL is mixing 3 types definition together.
> >> We are not the first ones that try to solve this. There is also the SQL
> >> MED standard [1] that tried to tackle this problem. I think it was not
> >> considered when designing the current DDL.
> >>
> >> Currently, I see 3 options for handling Kafka offsets. I will give some
> >> examples and look forward to feedback here:
> >>
> >> *Option 1* Runtime and semantic parms as part of the query
> >>
> >> `SELECT * FROM MyTable('offset'=123)`
> >>
> >> Pros:
> >> - Easy to add
> >> - Parameters are part of the main query
> >> - No complicated hinting syntax
> >>
> >> Cons:
> >> - Not SQL compliant
> >>
> >> *Option 2* Use metadata in query
> >>
> >> `CREATE TABLE MyTable (id INT, offset AS SYSTEM_METADATA('offset'))`
> >>
> >> `SELECT * FROM MyTable WHERE offset > TIMESTAMP '2012-12-12 12:34:22'`
> >>
> >> Pros:
> >> - SQL compliant in the query
> >> - Access of metadata in the DDL which is required anyway
> >> - Regular pushdown rules apply
> >>
> >> Cons:
> >> - Users need to add an additional column in the DDL
> >>
> >> *Option 3*: Use hints for properties
> >>
> >> `
> >> SELECT *
> >> FROM MyTable /*+ PROPERTIES('offset'=123) */
> >> `
> >>
> >> Pros:
> >> - Easy to add
> >>
> >> Cons:
> >> - Parameters are not part of the main query
> >> - Cryptic syntax for new users
> >> - Not standard compliant.
> >>
> >> If we go with this option, I would suggest to make it available in a
> >> separate map and don't mix it with statically defined properties. Such
> >> that the factory can decide which properties have the right to be
> >> overwritten by the hints:
> >> TableSourceFactory.Context.getQueryHints(): ReadableConfig
> >>
> >> Regards,
> >> Timo
> >>
> >> [1] https://en.wikipedia.org/wiki/SQL/MED
> >>
> >> Currently I see 3 options as a
> >>
> >>
> >> On 11.03.20 07:21, Danny Chan wrote:
> >>> Thanks Bowen ~
> >>>
> >>> I agree we should somehow categorize our connector parameters.
> >>>
> >>> For type1, I’m already preparing a solution like the Confluent schema
> registry + Avro schema inference thing, so this may not be a problem in the
> near future.
> >>>
> >>> For type3, I have some questions:
> >>>
> >>>> "SELECT * FROM mykafka WHERE offset > 12pm yesterday”
> >>>
> >>> Where does the offset column come from, a virtual column from the
> table schema, you said that
> >>>
> >>>> They change
> >>> almost every time a query starts and have nothing to do with metadata,
> thus
> >>> should not be part of table definition/DDL
> >>>
> >>> But why you can reference it in the query, I’m confused for that, can
> you elaborate a little ?
> >>>
> >>> Best,
> >>> Danny Chan
> >>> 在 2020年3月11日 +0800 PM12:52,Bowen Li ,写道:
> >>>> Thanks Danny for kicking off the effort
> >>>>
> >>>> The root cause of too much manual work is Flink DDL has mixed 3 types
> of
> >>>> params together and doesn't handle each of them very well. Below are
> how I
> >>>> categorize them and corresponding solutions in my mind:
> >>>>
> >>>> - type 1: Metadata of external data, like external endpoint/url,
> >>>> username/pwd, schemas, formats.
> >>>>
> >>>> Such metadata are mostly already accessible in external system as
> long as
> >>>> endpoints and credentials are provided. Flink can get it thru
> catalogs, but
> >>>> we haven't had many catalogs yet and thus Flink just hasn't been able
> to
> >>>> leverage that. So the solution should be building more catalogs. Such
> >>>> params should be part of a Flink table 

Re: FLIP-117: HBase catalog

2020-03-16 Thread Bowen Li
Hi,

I think the core of the JIRA right now is to investigate whether catalogs of
schemaless systems like HBase and Elasticsearch bring practical value to
users. I haven't used these SQL connectors before, and thus don't have much
to say in this case. Can anyone describe how it would work? Maybe @Yu
or @Zheng can chime in.

w.r.t. unsupported operation exceptions, they should be thrown in targeted
getters (e.g. getView(), getFunction()). General listing APIs like
listView(), listFunction() should not throw them but just return empty
results, for the sake of not breaking the user's SQL experience (see the
sketch below). To dedup code, such common implementations can be moved to
AbstractCatalog to make the APIs look cleaner. I recall that there was an
intention to refactor the catalog API signatures, but I haven't kept up with it.
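
A rough sketch of that convention for a schemaless-store catalog; the method
names only loosely mirror Flink's Catalog interface, and the exact signatures
(ObjectPath, CatalogBaseTable, checked exceptions) are deliberately omitted:

import java.util.Collections;
import java.util.List;

// Illustrative sketch: listing APIs stay quiet, targeted getters are explicit.
public class SchemalessCatalogSketch {

    // Listing APIs return empty results so generic "show views/functions"
    // tooling keeps working instead of failing.
    public List<String> listViews(String databaseName) {
        return Collections.emptyList();
    }

    public List<String> listFunctions(String databaseName) {
        return Collections.emptyList();
    }

    // Targeted getters surface the limitation explicitly.
    public Object getView(String databaseName, String viewName) {
        throw new UnsupportedOperationException("views are not supported by this catalog");
    }

    public Object getFunction(String databaseName, String functionName) {
        throw new UnsupportedOperationException("functions are not supported by this catalog");
    }
}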

Bowen

On Sun, Mar 15, 2020 at 10:19 PM Jingsong Li  wrote:

> Thanks Flavio for driving. Personally I am +1 for integrating HBase tables.
>
> I start a new topic for discussion. It is related but not the core of this
> FLIP.
> In the FLIP, I can see:
> - Does HBase support the concept of partitions..? I don't think so..
> - Does HBase support functions? I don't think so..
> - Does HBase support statistics? I don't think so..
> - Does HBase support views? I don't think so..
>
> And in JDBC catalog [1]. There are lots of UnsupportedOperationExceptions
> too.
> And maybe for confluent catalog, UnsupportedOperationExceptions come again.
> Lots of UnsupportedOperationExceptions looks unhappy to this catalog api...
> So can we do some refactor to catalog api? I can see a lot of catalogs
> just need provide table information without partitions, functions,
> statistics, views...
>
> CC: @Dawid Wysakowicz  @Bowen Li
> 
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-93%3A+JDBC+catalog+and+Postgres+catalog
>
> Best,
> Jingsong Lee
>
> On Sat, Mar 14, 2020 at 7:36 AM Flavio Pompermaier 
> wrote:
>
>> Hello everybody,
>> I started a new FLIP to discuss about an HBaseCatalog implementation[1]
>> after the opening of the relative issue by Bowen [2].
>> I drafted a very simple version of the FLIP just to discuss about the
>> critical points (in red) in order to decide how to proceed.
>>
>> Best,
>> Flavio
>>
>> [1]
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-117%3A+HBase+catalog
>> [2] https://issues.apache.org/jira/browse/FLINK-16575
>>
>
>
> --
> Best, Jingsong Lee
>


Re: [DISCUSS] Introduce TableFactory for StatefulSequenceSource

2020-03-20 Thread Bowen Li
+1.

I would suggest taking a step even further and seeing what users really need
to test/try/play with the Table API and Flink SQL. Besides this one, here are
some more sources and sinks that I have developed or used previously to
facilitate building Flink table/SQL pipelines.


   1. random input data source
  - should generate random data at a specified rate according to schema
  - purposes
 - test Flink pipeline and data can end up in external storage
 correctly
 - stress test Flink sink as well as tuning up external storage
  2. print data sink
  - should print data in row format in console
  - purposes
 - make it easier to test Flink SQL job e2e in IDE
 - test Flink pipeline and ensure output data format/value is
 correct
  3. no output data sink
  - just swallow output data without doing anything
  - purpose
 - evaluate and tune performance of Flink source and the whole
  pipeline. Users don't need to worry about sink back pressure

These may be taken into consideration all together as an effort to lower
the threshold of running Flink SQL/table API, and facilitate users' daily
work.
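
For illustration, here is a rough sketch of what DDLs for such built-in testing
connectors could look like. The connector names and property keys below are
hypothetical placeholders just to make the idea concrete, not an agreed design:

CREATE TABLE random_source (
  user_id BIGINT,
  amount DOUBLE,
  ts TIMESTAMP(3)
) WITH (
  'connector.type' = 'random',        -- hypothetical name for the random data source
  'rows-per-second' = '1000'          -- hypothetical rate option
);

CREATE TABLE console_sink (
  user_id BIGINT,
  cnt BIGINT
) WITH (
  'connector.type' = 'print'          -- hypothetical name for the print/console sink
);

CREATE TABLE no_output_sink (
  user_id BIGINT,
  cnt BIGINT
) WITH (
  'connector.type' = 'blackhole'      -- hypothetical name for the no-output sink
);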

Cheers,
Bowen


On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li  wrote:

> Hi all,
>
> I heard some users complain that table is difficult to test. Now with SQL
> client, users are more and more inclined to use it to test rather than
> program.
> The most common example is Kafka source. If users need to test their SQL
> output and checkpoint, they need to:
>
> - 1. Launch a standalone Kafka and create a Kafka topic.
> - 2. Write a program, mock input records, and produce records to the Kafka
> topic.
> - 3. Then test in Flink.
>
> The step 1 and 2 are annoying, although this test is E2E.
>
> Then I found StatefulSequenceSource. It is very good because it already deals
> with checkpointing, so it fits the checkpoint mechanism well. Usually,
> users have checkpointing turned on in production.
>
> With computed columns, users can easily create a sequence source DDL similar
> to a Kafka DDL. Then they can test inside Flink, without needing to launch other
> things.
>
> Have you consider this? What do you think?
>
> CC: @Aljoscha Krettek  the author
> of StatefulSequenceSource.
>
> Best,
> Jingsong Lee
>


Re: [DISCUSS] Introduce TableFactory for StatefulSequenceSource

2020-03-23 Thread Bowen Li
gt; > > 3.blackhole sink
> > > > - very useful for high performance testing of Flink
> > > > - I've also run into users trying UDF to output, not sink, so they
> need
> > > > this sink as well.
> > > >
> > > > DDL:
> > > > CREATE TABLE blackhole_table (
> > > > ...
> > > > ) WITH (
> > > > 'connector.type' = 'blackhole'
> > > > )
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu 
> > wrote:
> > > >
> > > > > Thanks Jingsong for bringing up this discussion. +1 to this
> > proposal. I
> > > > > think Bowen's proposal makes much sense to me.
> > > > >
> > > > > This is also a painful problem for PyFlink users. Currently there
> is
> > no
> > > > > built-in easy-to-use table source/sink and it requires users to
> > write a
> > > > lot
> > > > > of code to trying out PyFlink. This is especially painful for new
> > users
> > > > who
> > > > > are not familiar with PyFlink/Flink. I have also encountered the
> > > tedious
> > > > > process Bowen encountered, e.g. writing random source connector,
> > print
> > > > sink
> > > > > and also blackhole print sink as there are no built-in ones to use.
> > > > >
> > > > > Regards,
> > > > > Dian
> > > > >
> > > > > > 在 2020年3月22日,上午11:24,Jark Wu  写道:
> > > > > >
> > > > > > +1 to Bowen's proposal. I also saw many requirements on such
> > built-in
> > > > > > connectors.
> > > > > >
> > > > > > I will leave some my thoughts here:
> > > > > >
> > > > > >> 1. datagen source (random source)
> > > > > > I think we can merge the functionality of sequence-source into
> > random
> > > > > source
> > > > > > to allow users to custom their data values.
> > > > > > Flink can generate random data according to the field types,
> users
> > > > > > can customize their values to be more domain specific, e.g.
> > > > > > 'field.user'='User_[1-9]{0,1}'
> > > > > > This will be similar to kafka-datagen-connect[1].
> > > > > >
> > > > > >> 2. console sink (print sink)
> > > > > > This will be very useful in production debugging, to easily
> output
> > an
> > > > > > intermediate view or result view to a `.out` file.
> > > > > > So that we can look into the data representation, or check dirty
> > > data.
> > > > > > This should be out-of-box without manually DDL registration.
> > > > > >
> > > > > >> 3. blackhole sink (no output sink)
> > > > > > This is very useful for high performance testing of Flink, to
> > > measure
> > > > > the
> > > > > > throughput of the whole pipeline without sink.
> > > > > > Presto also provides this as a built-in connector [2].
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > > [1]:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > > > >
> > > > > >
> > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li 
> > wrote:
> > > > > >
> > > > > >> +1.
> > > > > >>
> > > > > >> I would suggest to take a step even further and see what users
> > > really
> > > > > need
> > > > > >> to test/try/play with table API and Flink SQL. Besides this one,
> > > > here're
> > > > > >> some more sources and sinks that I have developed or used
> > previously
> > > > to
> > > > > >> facilitate building Flink table/SQL pipelines.
> > > > > >>
> > > > > >>
> > > > > >>  

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-19 Thread Bowen Li
> > >> >> >> built-in
> > > >> >> >>>>>> function or global temp function. (In absence of the
> special
> > > >> >> >>> namespace,
> > > >> >> >>>>> the
> > > >> >> >>>>>> resolution order is the same as in #2.)
> > > >> >> >>>>>>
> > > >> >> >>>>>> My personal preference is #1, given the unknown use case
> and
> > > >> >> >>> introduced
> > > >> >> >>>>>> complexity for #2 and #3. However, #2 is an acceptable
> > > >> alternative.
> > > >> >> >>>> Thus,
> > > >> >> >>>>>> my votes are:
> > > >> >> >>>>>>
> > > >> >> >>>>>> +1 for #1
> > > >> >> >>>>>> +0 for #2
> > > >> >> >>>>>> -1 for #3
> > > >> >> >>>>>>
> > > >> >> >>>>>> Everyone, please cast your vote (in above format please!),
> > or
> > > >> let
> > > >> >> >> me
> > > >> >> >>>> know
> > > >> >> >>>>>> if you have more questions or other candidates.
> > > >> >> >>>>>>
> > > >> >> >>>>>> Thanks,
> > > >> >> >>>>>> Xuefu
> > > >> >> >>>>>>
> > > >> >> >>>>>>
> > > >> >> >>>>>>
> > > >> >> >>>>>>
> > > >> >> >>>>>>
> > > >> >> >>>>>>
> > > >> >> >>>>>>
> > > >> >> >>>>>> On Wed, Sep 18, 2019 at 6:42 AM Aljoscha Krettek <
> > > >> >> >>> aljos...@apache.org>
> > > >> >> >>>>>> wrote:
> > > >> >> >>>>>>
> > > >> >> >>>>>>> Hi,
> > > >> >> >>>>>>>
> > > >> >> >>>>>>> I think this discussion and the one for FLIP-64 are very
> > > >> >> >> connected.
> > > >> >> >>>> To
> > > >> >> >>>>>>> resolve the differences, think we have to think about the
> > > basic
> > > >> >> >>>>>> principles
> > > >> >> >>>>>>> and find consensus there. The basic questions I see are:
> > > >> >> >>>>>>>
> > > >> >> >>>&g

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-19 Thread Bowen Li
Another reason I prefer "CREATE TEMPORARY BUILTIN FUNCTION" over "ALTER
BUILTIN FUNCTION xxx TEMPORARILY" is - what if users want to drop the
temporary built-in function in the same session? With the former one, they
can run something like "DROP TEMPORARY BUILTIN FUNCTION"; With the latter
one, I'm not sure how users can "restore" the original builtin function
easily from an "altered" function without introducing further nonstandard
SQL syntax.
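
Just to make the comparison concrete, the two candidate syntaxes would look roughly
like this. Neither is standard SQL, and the keywords and the example class name are
only placeholders that are still under discussion:

-- preferred: explicitly create and drop a temporary function that shadows a built-in one
CREATE TEMPORARY BUILTIN FUNCTION concat AS 'com.example.MyConcat';
DROP TEMPORARY BUILTIN FUNCTION concat;

-- alternative: temporarily "alter" the built-in function; there is no obvious,
-- standard way to restore the original built-in function afterwards
ALTER BUILTIN FUNCTION concat TEMPORARILY AS 'com.example.MyConcat';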

Also please pardon me as I realized using net vote counts may not be a good idea... I'm
trying to fit this vote into cases listed in Flink Bylaw [1].

From the following result, the majority seems to be #2 too as it has the
most approval so far and doesn't have strong "-1".

#1: 3 (+1), 1 (0), 4 (-1)
#2: 4 (0), 3 (+1), 1 (+0.5)
   * Dawid -1/0 depending on keyword
#3: 2 (+1), 3 (-1), 3 (0)

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026

On Thu, Sep 19, 2019 at 10:30 AM Bowen Li  wrote:

> Hi,
>
> Thanks everyone for your votes. I summarized the result as following:
>
> #1:3 (+1), 1 (0), 4(-1) - net: -1
> #2:4(0), 2 (+1), 1(+0.5)  - net: +2.5
> Dawid -1/0 depending on keyword
> #3:2(+1), 3(-1), 3(0)   - net: -1
>
> Given the result, I'd like to change my vote for #2 from 0 to +1, to make
> it a stronger case with net +3.5. So the votes so far are:
>
> #1:3 (+1), 1 (0), 4(-1) - net: -1
> #2:4(0), 3 (+1), 1(+0.5)  - net: +3.5
> Dawid -1/0 depending on keyword
> #3:2(+1), 3(-1), 3(0)   - net: -1
>
> What do you think? Do you think we can conclude with this result? Or would
> you like to take it as a formal FLIP vote with 3 days voting period?
>
> BTW, I'd prefer "CREATE TEMPORARY BUILTIN FUNCTION" over "ALTER BUILTIN
> FUNCTION xxx TEMPORARILY" because
> 1. the syntax is more consistent with "CREATE FUNCTION" and "CREATE
> TEMPORARY FUNCTION"
> 2. "ALTER BUILTIN FUNCTION xxx TEMPORARILY" implies it alters a built-in
> function but it actually doesn't, the logic only creates a temp function
> with higher priority than that built-in function in ambiguous resolution
> order; and it would behave inconsistently with "ALTER FUNCTION".
>
>
>
> On Thu, Sep 19, 2019 at 2:58 AM Fabian Hueske  wrote:
>
>> I agree, it's very similar from the implementation point of view and the
>> implications.
>>
>> IMO, the difference is mostly on the mental model for the user.
>> Instead of having a special class of temporary functions that have
>> precedence over builtin functions it suggests to temporarily change
>> built-in functions.
>>
>> Fabian
>>
>> Am Do., 19. Sept. 2019 um 11:52 Uhr schrieb Kurt Young > >:
>>
>> > Hi Fabian,
>> >
>> > I think it's almost the same with #2 with different keyword:
>> >
>> > CREATE TEMPORARY BUILTIN FUNCTION xxx
>> >
>> > Best,
>> > Kurt
>> >
>> >
>> > On Thu, Sep 19, 2019 at 5:50 PM Fabian Hueske 
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > I thought about it a bit more and think that there is some good value
>> in
>> > my
>> > > last proposal.
>> > >
>> > > A lot of complexity comes from the fact that we want to allow
>> overriding
>> > > built-in functions which are differently addressed as other functions
>> > (and
>> > > db objects).
>> > > We could just have "CREATE TEMPORARY FUNCTION" do exactly the same
>> thing
>> > as
>> > > "CREATE FUNCTION" and treat both functions exactly the same except
>> that:
>> > > 1) temp functions disappear at the end of the session
>> > > 2) temp function are resolved before other functions
>> > >
>> > > This would be Dawid's proposal from the beginning of this thread (in
>> case
>> > > you still remember... ;-) )
>> > >
>> > > Temporarily overriding built-in functions would be supported with an
>> > > explicit command like
>> > >
>> > > ALTER BUILTIN FUNCTION xxx TEMPORARILY AS ...
>> > >
>> > > This would also address the concerns about accidentally changing the
>> > > semantics of built-in functions.
>> > > IMO, it can't get much more explicit than the above command.
>> > >
>> > > Sorry for bringing up a new option in the middle of the discussion,
>> but
>> > as
>> > > I said, I think 

Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

2019-09-19 Thread Bowen Li
Thanks everyone for your feedback. I've converted it to a FLIP wiki [1].

Please take another look. If there's no more concerns, I'd like to start a
voting thread for it.

Thanks

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Modular+Plugins




On Tue, Sep 17, 2019 at 11:25 AM Bowen Li  wrote:

> Hi devs,
>
> We'd like to kick off a conversation on "FLIP-68:  Extend Core Table
> System with Modular Plugins" [1].
>
> The modular approach was raised in discussion of how to support Hive
> built-in functions in FLIP-57 [2]. As we discussed and looked deeper, we
> think it’s a good opportunity to broaden the design and the corresponding
> problem it aims to solve. The motivation is to expand Flink’s core table
> system and enable users to do customizations by writing pluggable modules.
>
> There are two aspects of the motivation:
> 1. Empower users to write code and do customized development for Flink
> table core
> 2. Enable users to integrate Flink with cores and built-in objects of
> other systems, so users can reuse what they are familiar with in other SQL
> systems seamlessly as core and built-ins of Flink table
>
> Please take a look, and feedbacks are welcome.
>
> Bowen
>
> [1]
> https://docs.google.com/document/d/17CPMpMbPDjvM4selUVEfh_tqUK_oV0TODAUA9dfHakc/edit?usp=sharing
> [2]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html
>


Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-20 Thread Bowen Li
"SYSTEM" sounds good to me. FYI, this FLIP only impacts low level of the
SQL function stack and won't actually involve any DDL, thus I will just
document the decision and we should keep it in mind when it's time to
implement the DDLs.

I'm in the process of updating the FLIP to reflect changes required for
option #2, will send a new version for review soon.



On Fri, Sep 20, 2019 at 4:02 PM Dawid Wysakowicz 
wrote:

> I also like the 'System' keyword. I think we can assume we reached
> consensus on this topic.
>
> On Sat, 21 Sep 2019, 06:37 Xuefu Z,  wrote:
>
> > +1 for using the keyword "SYSTEM". Thanks to Timo for chiming in!
> >
> > --Xuefu
> >
> > On Fri, Sep 20, 2019 at 3:28 PM Timo Walther  wrote:
> >
> > > Hi everyone,
> > >
> > > sorry, for the late replay. I give also +1 for option #2. Thus, I guess
> > > we have a clear winner.
> > >
> > > I would also like to find a better keyword/syntax for this statement.
> > > Esp. the BUILTIN keyword can confuse people, because it could be
> written
> > > as BUILTIN, BUILDIN, BUILT_IN, or BUILD_IN. And we would need to
> > > introduce a new reserved keyword in the parser which affects also
> > > non-DDL queries. How about:
> > >
> > > CREATE TEMPORARY SYSTEM FUNCTION xxx
> > >
> > > The SYSTEM keyword is already a reserved keyword and in FLIP-66 we are
> > > discussing to prefix some of the function with a SYSTEM_ prefix like
> > > SYSTEM_WATERMARK. Also SQL defines syntax like "FOR SYSTEM_TIME AS OF".
> > >
> > > What do you think?
> > >
> > > Thanks,
> > > Timo
> > >
> > >
> > > On 20.09.19 05:45, Bowen Li wrote:
> > > > Another reason I prefer "CREATE TEMPORARY BUILTIN FUNCTION" over
> "ALTER
> > > > BUILTIN FUNCTION xxx TEMPORARILY" is - what if users want to drop the
> > > > temporary built-in function in the same session? With the former one,
> > > they
> > > > can run something like "DROP TEMPORARY BUILTIN FUNCTION"; With the
> > latter
> > > > one, I'm not sure how users can "restore" the original builtin
> function
> > > > easily from an "altered" function without introducing further
> > nonstandard
> > > > SQL syntax.
> > > >
> > > > Also please pardon me as I realized using net may not be a good
> idea...
> > > I'm
> > > > trying to fit this vote into cases listed in Flink Bylaw [1].
> > > >
> > > > >From the following result, the majority seems to be #2 too as it has
> > the
> > > > most approval so far and doesn't have strong "-1".
> > > >
> > > > #1:3 (+1), 1 (0), 4(-1)
> > > > #2:4(0), 3 (+1), 1(+0.5)
> > > > * Dawid -1/0 depending on keyword
> > > > #3:2(+1), 3(-1), 3(0)
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026
> > > >
> > > > On Thu, Sep 19, 2019 at 10:30 AM Bowen Li 
> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> Thanks everyone for your votes. I summarized the result as
> following:
> > > >>
> > > >> #1:3 (+1), 1 (0), 4(-1) - net: -1
> > > >> #2:4(0), 2 (+1), 1(+0.5)  - net: +2.5
> > > >>  Dawid -1/0 depending on keyword
> > > >> #3:2(+1), 3(-1), 3(0)   - net: -1
> > > >>
> > > >> Given the result, I'd like to change my vote for #2 from 0 to +1, to
> > > make
> > > >> it a stronger case with net +3.5. So the votes so far are:
> > > >>
> > > >> #1:3 (+1), 1 (0), 4(-1) - net: -1
> > > >> #2:4(0), 3 (+1), 1(+0.5)  - net: +3.5
> > > >>  Dawid -1/0 depending on keyword
> > > >> #3:2(+1), 3(-1), 3(0)   - net: -1
> > > >>
> > > >> What do you think? Do you think we can conclude with this result? Or
> > > would
> > > >> you like to take it as a formal FLIP vote with 3 days voting period?
> > > >>
> > > >> BTW, I'd prefer "CREATE TEMPORARY BUILTIN FUNCTION" over "ALTER
> > BUILTIN
> > > >> FUNCTION xxx TEMPORARILY" because
> > > >> 1. the syntax is more consistent with "CREATE FUNCTION&quo

[VOTE] FLIP-68: Extend Core Table System with Modular Plugins

2019-09-23 Thread Bowen Li
Hi all,

I'd like to start a vote for FLIP-68 [1], since there's no more concern in
the discussion thread [2]

The vote will be open for minimum 3 days till 5:30pm UTC, Sep 26.

Thanks,
Bowen

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Modular+Plugins
[2] https://www.mail-archive.com/dev@flink.apache.org/msg29894.html


[VOTE] FLIP-57: Rework FunctionCatalog

2019-09-23 Thread Bowen Li
Hi all,

I'd like to start a voting thread for FLIP-57 [1], which we've reached
consensus in [2].

This voting will be open for minimum 3 days till 6:30pm UTC, Sep 26.

Thanks,
Bowen

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog
[2]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html#a32613


Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

2019-09-23 Thread Bowen Li
Thanks all for your input!

I've updated FLIP-57 accordingly. To summarize the changes:

   - introduced a new concept of "temporary system functions", which have no
   namespace and override built-in functions
   - repositioned "temporary functions" to be those with namespaces that
   override catalog functions
   - updated FunctionCatalog APIs
   - redefined the ambiguous function resolution order to be:

   1. temporary system functions
   2. built-in functions
   3. temporary functions, of the current catalog/db
   4. catalog functions, in the current catalog/db
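
To illustrate the resolution order above with a sketch (the DDL, function names and
class names below are only illustrative; the FLIP itself does not define any DDL, and
the current catalog/db is assumed to be mycat/mydb):

-- shadows the built-in function of the same name (step 1 wins over step 2)
CREATE TEMPORARY SYSTEM FUNCTION concat AS 'com.example.MyConcat';

-- shadows a catalog function of the same name in mycat.mydb (step 3 wins over step 4)
CREATE TEMPORARY FUNCTION mycat.mydb.func1 AS 'com.example.Func1';

SELECT concat(a, b), func1(c) FROM MyTable;
-- concat resolves to the temporary system function,
-- func1 resolves to the temporary catalog function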

Since we've reached consensus on several most critical pieces of the FLIP,
I've started a separate voting thread on it.

Cheers,
Bowen


Re: [DISCUSS] FLIP-63: Rework table partition support

2019-09-23 Thread Bowen Li
Hi Jingsong,

Thanks for driving this effort!

Besides a few further comments on Catalog APIs that I just left, it LGTM.

Not sure why, but the voting thread in Gmail shows up in the same thread as
the discussion. After addressing all the comments, could you start a new,
separate thread to let other people be aware of it?

Thanks,
Bowen

On Mon, Sep 23, 2019 at 1:25 AM JingsongLee 
wrote:

>  Thanks for your discussion on google document.
> Comments addressed and added FileSystem connector chapter, and introduce
> code prototype for file system connector to unify flink file system and
> hive connectors.
>
> Looking forward to your feedbacks. Thank you.
>
> Best,
> Jingsong Lee
>
>
> --
> From:JingsongLee 
> Send Time:2019年9月18日(星期三) 09:45
> To:Kurt Young ; dev 
> Subject:Re: [DISCUSS] FLIP-63: Rework table partition support
>
> Thanks for your reply and google doc comments. It has been discussed
>  for two weeks now. I will start a vote thread.
>
> Best,
> Jingsong Lee
>
>
> --
> From:Kurt Young 
> Send Time:2019年9月16日(星期一) 15:55
> To:dev 
> Cc:JingsongLee 
> Subject:Re: [DISCUSS] FLIP-63: Rework table partition support
>
> +1 to this feature, I left some comments on google doc.
>
> Another comment is I think we should do some reorganize about the content
> when you converting this to a cwiki page. I will have some offline
> discussion
> with you.
>
> Since this feature seems to be a fairly big efforts, so I suggest we can
> settle
> down the design doc ASAP and start vote process.
> Best,
> Kurt
>
>
> On Thu, Sep 12, 2019 at 12:43 PM Biao Liu  wrote:
> Hi Jingsong,
>
>  Thanks for explaining. It looks cool!
>
>  Thanks,
>  Biao /'bɪ.aʊ/
>
>
>
>  On Wed, 11 Sep 2019 at 11:37, JingsongLee  .invalid>
>  wrote:
>
>  > Hi biao, thanks for your feedbacks:
>  >
>  > Actually, the runtime source partition of runtime is similar to split,
>  > which concerns data reading, parallelism and fault tolerance, all the
>  > runtime concepts.
>  > While table partition is only a virtual concept. Users are more likely
> to
>  > choose which partition to read and which partition to write. Users can
>  > manage their partitions.
>  > One is physical implementation correlation, the other is logical concept
>  > correlation.
>  > So I think they are two completely different things.
>  >
>  > About [2], The main problem is that how to write data to a catalog file
>  > system in stream mode, it is a general problem and has little to do with
>  > partition.
>  >
>  > [2]
>  >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-notifyOnMaster-for-notifyCheckpointComplete-td32769.html
>  >
>  > Best,
>  > Jingsong Lee
>  >
>  >
>  > --
>  > From:Biao Liu 
>  > Send Time:2019年9月10日(星期二) 14:57
>  > To:dev ; JingsongLee 
>  > Subject:Re: [DISCUSS] FLIP-63: Rework table partition support
>  >
>  > Hi Jingsong,
>  >
>  > Thank you for bringing this discussion. Since I don't have much
> experience
>  > of Flink table/SQL, I'll ask some questions from runtime or engine
>  > perspective.
>  >
>  > > ... where we describe how to partition support in flink and how to
>  > integrate to hive partition.
>  >
>  > FLIP-27 [1] introduces "partition" concept officially. The changes of
>  > FLIP-27 are not only about source interface but also about the whole
>  > infrastructure.
>  > Have you ever thought how to integrate your proposal with these changes?
>  > Or you just want to support "partition" in table layer, there will be no
>  > requirement of underlying infrastructure?
>  >
>  > I have seen a discussion [2] that seems be a requirement of
> infrastructure
>  > to support your proposal. So I have some concerns there might be some
>  > conflicts between this proposal and FLIP-27.
>  >
>  > 1.
>  >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
>  > 2.
>  >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Support-notifyOnMaster-for-notifyCheckpointComplete-td32769.html
>  >
>  > Thanks,
>  > Biao /'bɪ.aʊ/
>  >
>  >
>  >
>  > On Fri, 6 Sep 2019 at 13:22, JingsongLee  .invalid>
>  > wrote:
>  > Hi everyone, thank you for your comments. Mail name was updated
>  >  and streaming-related concepts were added.
>  >
>  >  We would like to start a discussion thread on "FLIP-63: Rework table
>  >  partition support"(Design doc: [1]), where we describe how to partition
>  >  support in flink and how to integrate to hive partition.
>  >
>  >  This FLIP addresses:
>  > - Introduce whole story about partition support.
>  > - Introduce and discuss DDL of partition support.
>  > - Introduce static and dynamic partition insert.
>  > - Introduce partition pruning
>  > - Introduce dynamic partition implementation
>  > - Introduce FileFormatSink to deal with s

Re: [DISCUSS] FLIP 69 - Flink SQL DDL Enhancement

2019-09-23 Thread Bowen Li
Hi Terry,

Thanks for driving the effort! I left some comments in the doc.

AFAIU, the biggest motivation is to support DDLs in the SQL parser so that both
Table API and SQL CLI can share the same stack, despite the fact that SQL CLI
already supports some commands itself. However, I don't see details on how SQL CLI
would migrate to and depend on the SQL parser, and how Table API and SQL CLI would
actually share the SQL parser. I'm not sure yet how much work that will take; I
just want to double check that you didn't include these details because you
estimate them to be very trivial?


On Mon, Sep 16, 2019 at 1:46 AM Terry Wang  wrote:

> Hi everyone,
>
> In flink 1.9, we have introduced some awesome features such as complete
> catalog support[1] and sql ddl support[2]. These features have been a
> critical integration for Flink to be able to manage data and metadata like
> a classic RDBMS and make developers more easy to construct their
> real-time/off-line warehouse or sth similar base on flink.
>
> But there is still a lack of support on how Flink SQL DDL to manage
> metadata and data like classic RDBMS such as `alter table rename` and so on.
>
> So I’d like to kick off a discussion on enhancing Flink Sql Ddls:
>
> https://docs.google.com/document/d/1mhZmx1h2ecfL0x8OzYD1n-nVRn4yE7pwk4jGed4k7kc/edit?usp=sharing
> <
> https://docs.google.com/document/d/1mhZmx1h2ecfL0x8OzYD1n-nVRn4yE7pwk4jGed4k7kc/edit?usp=sharing
> >
>
> In short, it:
> - Add Catalog DDL enhancement support:  show catalogs / describe
> catalog / use catalog
> - Add Database DDL enhancement support:  show databases / create
> database / drop database/ alter database
> - Add Table DDL enhancement support:show tables/ describe
> table / alter table
> - Add Function DDL enhancement support: show functions/ create
> function /drop function
>
> Looking forward to your opinions.
>
> Best,
> Terry Wang
>
>
>
> [1]: https://issues.apache.org/jira/browse/FLINK-11275
> [2]: https://issues.apache.org/jira/browse/FLINK-10232
>  


[COMMITTER] repo locked due to synchronization issues

2019-09-23 Thread Bowen Li
Hi committers,

Recently I've run into a repo issue multiple times on different days. When I
tried to push a commit to master, git reported the following error:

```
remote: This repository has been locked due to synchronization issues:
remote:  - /x1/gitbox/broken/flink.txt exists due to a previous error, and
prevents pushes.
remote: This could either be a benign issue, or the repositories could be
out of sync.
remote: Please contact us...@infra.apache.org to have infrastructure
resolve the issue.
remote:
To https://gitbox.apache.org/repos/asf/flink.git
 ! [remote rejected]   master -> master (pre-receive hook declined)
error: failed to push some refs to '
https://gitbox.apache.org/repos/asf/flink.git'
```

This is quite a new issue that didn't come till two or three weeks ago. I
researched online with no luck. I also reported it to ASF INFRA [1] but
their suggested solution doesn't work.

The issue however usually goes away the next morning in PST, so I assume
someone from a different timezone in Asia or Europe fixes it somehow? Has
anyone run into it before? How did you fix it?

Thanks,
Bowen

[1] https://issues.apache.org/jira/projects/INFRA/issues/INFRA-18992


Re: [VOTE] FLIP-63: Rework table partition support

2019-09-24 Thread Bowen Li
+1. Thanks, Jingsong!

Bowen

On Tue, Sep 24, 2019 at 4:38 AM Terry Wang  wrote:

> +1, Overall looks good.
>
> Best,
> Terry Wang
>
>
>
> > 在 2019年9月24日,下午5:02,Kurt Young  写道:
> >
> > +1 from my side. Some implementation details could be revisited
> > again during code reviewing.
> >
> > Best,
> > Kurt
> >
> >
> > On Tue, Sep 24, 2019 at 3:14 PM Jingsong Li 
> wrote:
> >
> >> Just to clarify:
> >>
> >> FLIP wiki:
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-63%3A+Rework+table+partition+support
> >>
> >>
> >> Discussion thread:
> >>
> >>
> >>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-63-Rework-table-partition-support-td32770.html
> >>
> >>
> >> Google Doc:
> >>
> >>
> https://docs.google.com/document/d/15R3vZ1R_pAHcvJkRx_CWleXgl08WL3k_ZpnWSdzP7GY/edit?usp=sharing
> >>
> >> Best,
> >> Jingsong Lee
> >>
> >> On Tue, Sep 24, 2019 at 11:43 AM Jingsong Lee 
> >> wrote:
> >>
> >>> Thank you for your reminder.
> >>> Updated.
> >>>
> >>> Best,
> >>> Jingsong Lee
> >>>
> >>> On Tue, Sep 24, 2019 at 11:36 AM Kurt Young  wrote:
> >>>
>  Looks like the wiki is not aligned with latest google doc, could
>  you update it first?
> 
>  Best,
>  Kurt
> 
> 
>  On Tue, Sep 24, 2019 at 10:19 AM Jingsong Lee <
> lzljs3620...@apache.org>
>  wrote:
> 
> > Hi Flink devs, after another round of discussion.
> >
> > I would like to re-start the voting for FLIP-63
> > Rework table partition support.
> >
> > FLIP wiki:
> > <
> >
> 
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics
> >>
> > <
> >
> 
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-51%3A+Rework+of+the+Expression+Design
> >>
> >
> >
> 
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-63%3A+Rework+table+partition+support
> >
> > Discussion thread:
> > <
> >
> 
> >>
> https://lists.apache.org/thread.html/65078bad6e047578d502e1e5d92026f13fd9648725f5b74ed330@%3Cdev.flink.apache.org%3E
> >>
> > <
> >
> 
> >>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-51-Rework-of-the-Expression-Design-td31653.html
> >>
> >
> >
> 
> >>
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-63-Rework-table-partition-support-td32770.html
> >
> > Google Doc:
> > <
> >
> 
> >>
> https://docs.google.com/document/d/1yFDyquMo_-VZ59vyhaMshpPtg7p87b9IYdAtMXv5XmM/edit?usp=sharing
> >>
> >
> >
> 
> >>
> https://docs.google.com/document/d/15R3vZ1R_pAHcvJkRx_CWleXgl08WL3k_ZpnWSdzP7GY/edit?usp=sharing
> >
> > Thanks,
> >
> > Best,
> > Jingsong Lee
> >
> 
> >>>
> >>>
> >>> --
> >>> Best, Jingsong Lee
> >>>
> >>
> >>
> >> --
> >> Best, Jingsong Lee
> >>
>
>


Re: [COMMITTER] repo locked due to synchronization issues

2019-09-24 Thread Bowen Li
Thanks everyone for sharing your practices! The problem seems to be gone as
usual and I am able to push to ASF repo.

Verified by ASF INFRA [1], it's indeed caused by the mixed push problem
that Fabian said. Quote from INFRA, "This issue can sometimes occur when
people commit conflicting branches at the same time on gitbox vs github, so
we recommend that projects stick with one or the other for commits."

Though I'm alright with pushing to GitHub, can we have a single, standard
way of pushing commits to the ASF repo? Right now we don't seem to have such a
standard way according to the wiki [2]. The standardization helps to not only
avoid the issue mentioned above, but also eradicate problems where, IIRC,
some committers used to forget to reformat commit messages or squash a PR's
commits when merging PRs from the GitHub UI.

That said, I wonder if we can get consensus on pushing commits only to the
ASF gitbox repo and disabling committers' write access to the GitHub mirror?

[1] https://issues.apache.org/jira/browse/INFRA-18992
[2] https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests

On Tue, Sep 24, 2019 at 4:44 AM Hequn Cheng  wrote:

> I met the same problem. Pushing to the GitHub repo directly works fine and
> it seems will resync the two repos.
>
> Best, Hequn
>
> On Tue, Sep 24, 2019 at 4:59 PM Fabian Hueske  wrote:
>
> > Maybe it's a mix of pushing to the ASF repository and Github mirrors?
> > I'm only pushing to the ASF repositories (although not that frequently
> > anymore...).
> >
> > Cheers, Fabian
> >
> > Am Di., 24. Sept. 2019 um 10:50 Uhr schrieb Till Rohrmann <
> > trohrm...@apache.org>:
> >
> > > Pushing directly to Github also works for me without a problem.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Sep 24, 2019 at 10:28 AM Jark Wu  wrote:
> > >
> > > > Hi Bowen,
> > > >
> > > > I have also encountered this problem. I don't know how to fix this.
> > > > But pushing to GitHub repo always works for me.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Tue, 24 Sep 2019 at 06:05, Bowen Li  wrote:
> > > >
> > > > > Hi committers,
> > > > >
> > > > > Recently I've run a repo issue multiple times in different days.
> > When I
> > > > > tried to push a commit to master, git reports the following error:
> > > > >
> > > > > ```
> > > > > remote: This repository has been locked due to synchronization
> > issues:
> > > > > remote:  - /x1/gitbox/broken/flink.txt exists due to a previous
> > error,
> > > > and
> > > > > prevents pushes.
> > > > > remote: This could either be a benign issue, or the repositories
> > could
> > > be
> > > > > out of sync.
> > > > > remote: Please contact us...@infra.apache.org to have
> infrastructure
> > > > > resolve the issue.
> > > > > remote:
> > > > > To https://gitbox.apache.org/repos/asf/flink.git
> > > > >  ! [remote rejected]   master -> master (pre-receive hook
> > declined)
> > > > > error: failed to push some refs to '
> > > > > https://gitbox.apache.org/repos/asf/flink.git'
> > > > > ```
> > > > >
> > > > > This is quite a new issue that didn't come till two or three weeks
> > > ago. I
> > > > > researched online with no luck. I also reported it to ASF INFRA [1]
> > but
> > > > > their suggested solution doesn't work.
> > > > >
> > > > > The issue however usually goes away the next morning in PST, so I
> > > assume
> > > > > someone from a different timezone in Asia or Europe fixes it
> somehow?
> > > Has
> > > > > anyone run into it before? How did you fix it?
> > > > >
> > > > > Thanks,
> > > > > Bowen
> > > > >
> > > > > [1]
> https://issues.apache.org/jira/projects/INFRA/issues/INFRA-18992
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP 69 - Flink SQL DDL Enhancement

2019-09-24 Thread Bowen Li
BTW, will there be a "CREATE/DROP CATALOG" DDL?

Though it's not SQL standard, I can see it'll be useful and handy for our
end users in many cases.
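
Just to sketch what I have in mind (the syntax and property keys below are purely
hypothetical; nothing has been formally proposed yet):

CREATE CATALOG my_hive WITH (
  'type' = 'hive',                        -- hypothetical property keys/values
  'hive-conf-dir' = '/path/to/hive/conf'
);

DROP CATALOG my_hive;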

On Mon, Sep 23, 2019 at 12:28 PM Bowen Li  wrote:

> Hi Terry,
>
> Thanks for driving the effort! I left some comments in the doc.
>
> AFAIU, the biggest motivation is to support DDLs in sql parser so that
> both Table API and SQL CLI can share the stack, despite that SQL CLI has
> already supported some commands itself. However, I don't see details on how
> SQL CLI would migrate and depend on sql parser, and how Table API and SQL
> CLI would actually share SQL parser. I'm not sure yet how much work that
> will take, just want to double check that you didn't include them because
> they are very trivial according to your estimate?
>
>
> On Mon, Sep 16, 2019 at 1:46 AM Terry Wang  wrote:
>
>> Hi everyone,
>>
>> In flink 1.9, we have introduced some awesome features such as complete
>> catalog support[1] and sql ddl support[2]. These features have been a
>> critical integration for Flink to be able to manage data and metadata like
>> a classic RDBMS and make developers more easy to construct their
>> real-time/off-line warehouse or sth similar base on flink.
>>
>> But there is still a lack of support on how Flink SQL DDL to manage
>> metadata and data like classic RDBMS such as `alter table rename` and so on.
>>
>> So I’d like to kick off a discussion on enhancing Flink Sql Ddls:
>>
>> https://docs.google.com/document/d/1mhZmx1h2ecfL0x8OzYD1n-nVRn4yE7pwk4jGed4k7kc/edit?usp=sharing
>> <
>> https://docs.google.com/document/d/1mhZmx1h2ecfL0x8OzYD1n-nVRn4yE7pwk4jGed4k7kc/edit?usp=sharing
>> >
>>
>> In short, it:
>> - Add Catalog DDL enhancement support:  show catalogs / describe
>> catalog / use catalog
> >> - Add Database DDL enhancement support:  show databases / create
>> database / drop database/ alter database
>> - Add Table DDL enhancement support:show tables/ describe
>> table / alter table
>> - Add Function DDL enhancement support: show functions/ create
>> function /drop function
>>
>> Looking forward to your opinions.
>>
>> Best,
>> Terry Wang
>>
>>
>>
> >> [1]: https://issues.apache.org/jira/browse/FLINK-11275
> >> [2]: https://issues.apache.org/jira/browse/FLINK-10232
>
>


Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-09-24 Thread Bowen Li
Hi Dawid,

Re 1): I agree making it easy for users to run experiments is important.
However, I'm not sure allowing users to register temp functions in
nonexistent catalog/db is the optimal way. It seems a bit hacky, and breaks
the current contract between Flink and users that a catalog/db must be valid
in order to be operated on.

How about we instead focus on making it convenient to create catalogs?
Users actually can already do it with ease via program or SQL CLI yaml file
for an in-memory catalog which has neither extra dependency nor external
connections. What we can further improve is DDL for catalogs, and I raised
it in discussion of [FLIP 69 - Flink SQL DDL Enhancement] driven by Terry
now.

In that case, if users would like to experiment via SQL, they can easily
create an in memory catalog/database using DDL, then play with temp
functions.

Re 2): Regarding the assumption, IIUIC, the function's ObjectIdentifier has not been
resolved when the call stack reaches FunctionCatalog#lookupFunction(), but has only
been parsed?

I agree keeping ObjectIdentifier as-is would be good. I'm ok with the
suggested classes, though making ObjectIdentifier a subclass of
FunctionIdentifier seems a bit counterintuitive.

Another potentially simpler way is:

```
// in class FunctionLookup
class Result {
    Optional<ObjectIdentifier> getObjectIdentifier() { ... }
    String getName() { ... }
    // ...
}
```

WDYT?



On Tue, Sep 24, 2019 at 3:41 PM Dawid Wysakowicz 
wrote:

> Hi,
> I really like the flip and think it clarifies important aspects of the
> system.
>
> I have two, I hope small suggestions, which will not take much time to
> agree on.
>
> 1. Could we follow the MySQL approach in regards to the existence of cat/db
> for temporary functions? That means not to check it, so e.g. it's possible
> to create a temporary function in a database that does not exist. I think
> it's really useful e.g in cases when user wants to perform experiments but
> does not have access to the db yet or temporarily does not have connection
> to a catalog.
> 2. Could we not change the ObjectIdentifier? Could we not loosen the
> requirements for all catalog objects such as tables, views, types just for
> the functions? It's really important later on from e.g the serializability
> perspective. The important aspect of the ObjectIdentifier is that we know
> that it has been resolved. The suggested changes break that assumption.
>
> What do you think about adding an interface FunctionIdentifier {
>
> String getName();
>
> /**
>   Return 3-part identifier. Empty in case of a built-in function.
> */
> Optional getObjectIdentifier()
> }
>
> class ObjectIdentifier implements FunctionIdentifier {
> Optional getObjectIdentifier() {
>  return Optional.of(this);
> }
> }
>
> class SystemFunctionIdentifier implements FunctionIdentifier {...}
>
> WDYT?
>
> On Wed, 25 Sep 2019, 04:50 Xuefu Z,  wrote:
>
> > +1. LGTM
> >
> > On Tue, Sep 24, 2019 at 6:09 AM Terry Wang  wrote:
> >
> > > +1
> > >
> > > Best,
> > > Terry Wang
> > >
> > >
> > >
> > > > 在 2019年9月24日,上午10:42,Kurt Young  写道:
> > > >
> > > > +1
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Tue, Sep 24, 2019 at 2:30 AM Bowen Li 
> wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> I'd like to start a voting thread for FLIP-57 [1], which we've
> reached
> > > >> consensus in [2].
> > > >>
> > > >> This voting will be open for minimum 3 days till 6:30pm UTC, Sep 26.
> > > >>
> > > >> Thanks,
> > > >> Bowen
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog
> > > >> [2]
> > > >>
> > > >>
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html#a32613
> > > >>
> > >
> > >
> >
> > --
> > Xuefu Zhang
> >
> > "In Honey We Trust!"
> >
>


Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-09-24 Thread Bowen Li
Sorry, I missed some parts of the solution. The complete alternative is the
following, basically having separate APIs in FunctionLookup for ambiguous
and precise function lookup, since the planner is able to tell which API to call
with parsed queries, and having a unified result:

```
class FunctionLookup {

    Optional<Result> lookupAmbiguousFunction(String name);

    Optional<Result> lookupPreciseFunction(ObjectIdentifier oi);

    class Result {
        Optional<ObjectIdentifier> getObjectIdentifier() { ... }
        String getName() { ... }
        // ...
    }
}
```

Thanks,
Bowen


On Tue, Sep 24, 2019 at 9:42 PM Bowen Li  wrote:

> Hi Dawid,
>
> Re 1): I agree making it easy for users to run experiments is important.
> However, I'm not sure allowing users to register temp functions in
> nonexistent catalog/db is the optimal way. It seems a bit hacky, and breaks
> the current contract between Flink and users that catalog/db must be valid
> in order to operate on.
>
> How about we instead focus on making it convenient to create catalogs?
> Users actually can already do it with ease via program or SQL CLI yaml file
> for an in-memory catalog which has neither extra dependency nor external
> connections. What we can further improve is DDL for catalogs, and I raised
> it in discussion of [FLIP 69 - Flink SQL DDL Enhancement] driven by Terry
> now.
>
> In that case, if users would like to experiment via SQL, they can easily
> create an in memory catalog/database using DDL, then play with temp
> functions.
>
> Re 2): For the assumption, IIUIC, function ObjectIdentifier has not been
> resolved when stack call reaches FunctionCatalog#lookupFunction(), but only
> been parsed?
>
> I agree keeping ObjectIdentifier as-is would be good. I'm ok with the
> suggested classes, though making ObjectIdentifier a subclass of
> FunctionIdentifier seem a bit counter intuitive.
>
> Another potentially simpler way is:
>
> ```
> // in class FunctionLookup
> class Result {
> Optional  getObjectIdentifier() { ... }
> String getName() { ... }
> ...
> }
> ```
>
> WDYT?
>
>
>
> On Tue, Sep 24, 2019 at 3:41 PM Dawid Wysakowicz <
> wysakowicz.da...@gmail.com> wrote:
>
>> Hi,
>> I really like the flip and think it clarifies important aspects of the
>> system.
>>
>> I have two, I hope small suggestions, which will not take much time to
>> agree on.
>>
>> 1. Could we follow the MySQL approach in regards to the existence of
>> cat/db
>> for temporary functions? That means not to check it, so e.g. it's possible
>> to create a temporary function in a database that does not exist. I think
>> it's really useful e.g in cases when user wants to perform experiments but
>> does not have access to the db yet or temporarily does not have connection
>> to a catalog.
>> 2. Could we not change the ObjectIdentifier? Could we not loosen the
>> requirements for all catalog objects such as tables, views, types just for
>> the functions? It's really important later on from e.g the serializability
>> perspective. The important aspect of the ObjectIdentifier is that we know
>> that it has been resolved. The suggested changes break that assumption.
>>
>> What do you think about adding an interface FunctionIdentifier {
>>
>> String getName();
>>
>> /**
>>   Return 3-part identifier. Empty in case of a built-in function.
>> */
>> Optional getObjectIdentifier()
>> }
>>
>> class ObjectIdentifier implements FunctionIdentifier {
>> Optional getObjectIdentifier() {
>>  return Optional.of(this);
>> }
>> }
>>
>> class SystemFunctionIdentifier implements FunctionIdentifier {...}
>>
>> WDYT?
>>
>> On Wed, 25 Sep 2019, 04:50 Xuefu Z,  wrote:
>>
>> > +1. LGTM
>> >
>> > On Tue, Sep 24, 2019 at 6:09 AM Terry Wang  wrote:
>> >
>> > > +1
>> > >
>> > > Best,
>> > > Terry Wang
>> > >
>> > >
>> > >
>> > > > 在 2019年9月24日,上午10:42,Kurt Young  写道:
>> > > >
>> > > > +1
>> > > >
>> > > > Best,
>> > > > Kurt
>> > > >
>> > > >
>> > > > On Tue, Sep 24, 2019 at 2:30 AM Bowen Li 
>> wrote:
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> I'd like to start a voting thread for FLIP-57 [1], which we've
>> reached
>> > > >> consensus in [2].
>> > > >>
>> > > >> This voting will be open for minimum 3 days till 6:30pm UTC, Sep
>> 26.
>> > > >>
>> > > >> Thanks,
>> > > >> Bowen
>> > > >>
>> > > >> [1]
>> > > >>
>> > > >>
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog
>> > > >> [2]
>> > > >>
>> > > >>
>> > >
>> >
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html#a32613
>> > > >>
>> > >
>> > >
>> >
>> > --
>> > Xuefu Zhang
>> >
>> > "In Honey We Trust!"
>> >
>>
>


Re: [VOTE] FLIP-68: Extend Core Table System with Modular Plugins

2019-09-25 Thread Bowen Li
Hi,

I'd like to withdraw the vote for the moment. From offline feedback I got,
the community currently lacks the bandwidth to review and vote on this
FLIP. I'd hold back this effort a little bit.

On Tue, Sep 24, 2019 at 3:26 PM Xuefu Z  wrote:

> +1, LGTM
>
> On Mon, Sep 23, 2019 at 10:26 AM Bowen Li  wrote:
>
> > Hi all,
> >
> > I'd like to start a vote for FLIP-68 [1], since there's no more concern
> in
> > the discussion thread [2]
> >
> > The vote will be open for minimum 3 days till 5:30pm UTC, Sep 26.
> >
> > Thanks,
> > Bowen
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Modular+Plugins
> > [2] https://www.mail-archive.com/dev@flink.apache.org/msg29894.html
> >
>
>
> --
> Xuefu Zhang
>
> "In Honey We Trust!"
>


Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-09-25 Thread Bowen Li
Re 1) As described in the FLIP, a temp function lookup will first make sure
the db exists. If the db doesn't exist, a lazy drop is triggered to remove
that temp function.

I agree Hive doesn't handle it consistently, and we are not copying Hive.

IMHO, allowing registering temp functions in nonexistent catalog/db is
hacky and problematic. For instance, "SHOW FUNCTIONS" would list system
functions and functions in the current catalog/db, since users cannot
designate a nonexistent catalog/db as current ones, how can they list
functions in nonexistent catalog/db? They may end up never knowing what
temp functions they've created unless trying out with queries or we
introducing some more nonstandard SQL statements. The same applies to other
temp objects like temp tables.

Re 2) A standalone FunctionIdentifier sounds good to me

On Wed, Sep 25, 2019 at 4:46 AM Dawid Wysakowicz 
wrote:

> Ad. 1
> I wouldn't say it is hacky.
> Moreover, how do you want ensure that the dB always exists when a temporary
> object is used?( in this particular case function). Do you want to query
> for the database existence whenever e.g a temporary function is used? I
> think important aspect here is that the database can be dropped from
> external system, not just flink or a different flink session.
>
> E.g in case of hive, you cannot create a temporary table in a database that
> does not exist, that's true. But if you create a temporary table in a
> database and drop that database from a different session, you can still
> query the previously created temporary table from the original session. It
> does not sound like a consistent behaviour to me. Why don't we make this
> behaviour of not binding a temporary objects to the lifetime of a database
> explicit as part of the temporary objects contract? In the end they exist
> in different layers. Permanent objects & databases in a catalog (in case of
> hive megastore) whereas temporary objects in flink sessions. That's also
> true for the original hive client. The temporary objects live in the hive
> client whereas databases are created in the metastore.
>
> Ad.2
> I'm open for suggestions here. The one thing I wanted to achieve here is so
> that we do not change the contract of ObjectIdentifier. One important thing
> to remember here is that we need the function identifier to be part of the
> FunctionDefinition object and not only as the result of the function
> lookup. At some point we want to be able to store QueryOperations in the
> catalogs. They can contain function calls within which we need to have the
> identifier.
>
> I agree my initial suggestion is over complicated. How about we have just
> the FunctionIdentifier as top level class without making the
> ObjectIdentifier extend from it? I think it's pretty much the same what you
> suggested. The only difference is that it would be a top level class with a
> more descriptive name.
>
>
> On Wed, 25 Sep 2019, 13:57 Bowen Li,  wrote:
>
> > Sorry, I missed some parts of the solution. The complete alternative is
> the
> > following, basically having separate APIs in FunctionLookup for ambiguous
> > and precise function lookup since planner is able to tell which API to
> call
> > with parsed queries, and have a unified result:
> >
> > ```
> > class FunctionLookup {
> >
> > Optional lookupAmbiguousFunction(String name);
> >
> >
> > Optional lookupPreciseFunction(ObjectIdentifier oi);
> >
> >
> > class Result {
> > Optional  getObjectIdentifier() { ... }
> > String getName() { ... }
> > // ...
> > }
> >
> > }
> > ```
> >
> > Thanks,
> > Bowen
> >
> >
> > On Tue, Sep 24, 2019 at 9:42 PM Bowen Li  wrote:
> >
> > > Hi Dawid,
> > >
> > > Re 1): I agree making it easy for users to run experiments is
> important.
> > > However, I'm not sure allowing users to register temp functions in
> > > nonexistent catalog/db is the optimal way. It seems a bit hacky, and
> > breaks
> > > the current contract between Flink and users that catalog/db must be
> > valid
> > > in order to operate on.
> > >
> > > How about we instead focus on making it convenient to create catalogs?
> > > Users actually can already do it with ease via program or SQL CLI yaml
> > file
> > > for an in-memory catalog which has neither extra dependency nor
> external
> > > connections. What we can further improve is DDL for catalogs, and I
> > raised
> > > it in discussion of [FLIP 69 - Flink SQL DDL Enhancement] driven by
> Terry
> > > now.
>

Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-09-27 Thread Bowen Li
@Dawid, do you have any other concerns? If not, I hope we can close the
voting.


On Thu, Sep 26, 2019 at 8:14 PM Rui Li  wrote:

> I'm not sure how much benefit #1 can bring us. If users just want to try
> out temporary functions, they can create temporary system functions which
> don't require a catalog/DB. IIUC, the main reason why we allow temporary
> catalog function is to let users override permanent catalog functions.
> Therefore a temporary function in a non-existing catalog won't serve that
> purpose. Besides, each session is provided with a default catalog and DB.
> So even if users simply want to create some catalog functions they can
> forget about after the session, wouldn't the default catalog/DB be enough
> for such experiments?
>
> On Thu, Sep 26, 2019 at 4:38 AM Bowen Li  wrote:
>
> > Re 1) As described in the FLIP, a temp function lookup will first make
> sure
> > the db exists. If the db doesn't exist, a lazy drop is triggered to
> remove
> > that temp function.
> >
> > I agree Hive doesn't handle it consistently, and we are not copying Hive.
> >
> > IMHO, allowing registering temp functions in nonexistent catalog/db is
> > hacky and problematic. For instance, "SHOW FUNCTIONS" would list system
> > functions and functions in the current catalog/db, since users cannot
> > designate a nonexistent catalog/db as current ones, how can they list
> > functions in nonexistent catalog/db? They may end up never knowing what
> > temp functions they've created unless trying out with queries or we
> > introducing some more nonstandard SQL statements. The same applies to
> other
> > temp objects like temp tables.
> >
> > Re 2) A standalone FunctionIdentifier sounds good to me
> >
> > On Wed, Sep 25, 2019 at 4:46 AM Dawid Wysakowicz <
> > wysakowicz.da...@gmail.com>
> > wrote:
> >
> > > Ad. 1
> > > I wouldn't say it is hacky.
> > > Moreover, how do you want ensure that the dB always exists when a
> > temporary
> > > object is used?( in this particular case function). Do you want to
> query
> > > for the database existence whenever e.g a temporary function is used? I
> > > think important aspect here is that the database can be dropped from
> > > external system, not just flink or a different flink session.
> > >
> > > E.g in case of hive, you cannot create a temporary table in a database
> > that
> > > does not exist, that's true. But if you create a temporary table in a
> > > database and drop that database from a different session, you can still
> > > query the previously created temporary table from the original session.
> > It
> > > does not sound like a consistent behaviour to me. Why don't we make
> this
> > > behaviour of not binding a temporary objects to the lifetime of a
> > database
> > > explicit as part of the temporary objects contract? In the end they
> exist
> > > in different layers. Permanent objects & databases in a catalog (in
> case
> > of
> > > hive megastore) whereas temporary objects in flink sessions. That's
> also
> > > true for the original hive client. The temporary objects live in the
> hive
> > > client whereas databases are created in the metastore.
> > >
> > > Ad.2
> > > I'm open for suggestions here. The one thing I wanted to achieve here
> is
> > so
> > > that we do not change the contract of ObjectIdentifier. One important
> > thing
> > > to remember here is that we need the function identifier to be part of
> > the
> > > FunctionDefinition object and not only as the result of the function
> > > lookup. At some point we want to be able to store QueryOperations in
> the
> > > catalogs. They can contain function calls within which we need to have
> > the
> > > identifier.
> > >
> > > I agree my initial suggestion is over complicated. How about we have
> just
> > > the FunctionIdentifier as top level class without making the
> > > ObjectIdentifier extend from it? I think it's pretty much the same what
> > you
> > > suggested. The only difference is that it would be a top level class
> > with a
> > > more descriptive name.
> > >
> > >
> > > On Wed, 25 Sep 2019, 13:57 Bowen Li,  wrote:
> > >
> > > > Sorry, I missed some parts of the solution. The complete alternative
> is
> > > the
> > > > following, basically having separate APIs in FunctionLookup for
>

Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-09-30 Thread Bowen Li
Hi,

I think the points above are valid, and we can adopt the suggestions.

To elaborate a bit on the new SQL syntax, it would imply that, unlike "SHOW
FUNCTIONS" which only returns function names, "SHOW ALL [TEMPORARY]
FUNCTIONS" would return functions' fully qualified names with catalog and
db names.
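
For example, the behavior could roughly look like this (the output below is only a
sketch of the idea, not a finalized format, and the function names are made up):

SHOW FUNCTIONS;
-- returns plain names: concat, my_temp_func, my_catalog_func, ...

SHOW ALL FUNCTIONS;
-- returns fully qualified names across catalogs/databases, plus system functions:
-- concat, mycat.mydb.my_catalog_func, othercat.db1.some_func, ...

SHOW ALL TEMPORARY FUNCTIONS;
-- returns fully qualified names of temporary functions only:
-- mycat.mydb.my_temp_func, ...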



On Mon, Sep 30, 2019 at 6:38 AM Timo Walther  wrote:

> Hi all,
>
> I support Fabian's arguments. In my opinion, temporary objects should
> just be an additional layer on top of the regular catalog/database
> lookup logic. Thus, a temporary table or function has always highest
> precedence and should be stable within the local session. Otherwise it
> could magically disappear while someone else is performing modifications
> in the catalog.
>
> Furthermore, this feature is very useful for prototyping as users can
> simply express that a catalog/database is present even through they
> might not have access to it currently.
>
> Regards,
> Timo
>
>
> On 30.09.19 14:57, Fabian Hueske wrote:
> > Hi all,
> >
> > Sorry for the late reply.
> >
> > I think it might lead to confusing situations if temporary functions (or
> > any temporary db objects for that matter) are bound to the life cycle of
> an
> > (external) db/catalog.
> > Imagine a situation where you create a temp function in a db in an
> > external catalog and use it, but at some point it does not work anymore
> > because someone else dropped the database from the external catalog.
> > Shouldn't temporary objects be only controlled by the owner of a session?
> >
> > I agree that creating temp objects in non-existing db/catalogs sounds a
> bit
> > strange, but IMO the opposite (the db/catalog must exist for a temp
> > function to be created/exist) can have significant implications like the
> > one I described.
> > I think it would be quite easy for users to understand that temporary
> > objects are solely owned by them (and their session).
> > The problem of listing temporary objects could be solved by adding an ALL
> > [TEMPORARY] clause:
> >
> > SHOW ALL FUNCTIONS; could show all functions regardless of the
> > catalog/database including temporary functions.
> > SHOW ALL TEMPORARY FUNCTIONS; could show all temporary functions
> regardless
> > of the catalog/database.
> >
> > Best,
> > Fabian
> >
> > Am Sa., 28. Sept. 2019 um 02:21 Uhr schrieb Bowen Li <
> bowenl...@gmail.com>:
> >
> >> @Dawid, do you have any other concerns? If not, I hope we can close the
> >> voting.
> >>
> >>
> >> On Thu, Sep 26, 2019 at 8:14 PM Rui Li  wrote:
> >>
> >>> I'm not sure how much benefit #1 can bring us. If users just want to
> try
> >>> out temporary functions, they can create temporary system functions
> which
> >>> don't require a catalog/DB. IIUC, the main reason why we allow
> temporary
> >>> catalog function is to let users override permanent catalog functions.
> >>> Therefore a temporary function in a non-existing catalog won't serve
> that
> >>> purpose. Besides, each session is provided with a default catalog and
> DB.
> >>> So even if users simply want to create some catalog functions they can
> >>> forget about after the session, wouldn't the default catalog/DB be
> enough
> >>> for such experiments?
> >>>
> >>> On Thu, Sep 26, 2019 at 4:38 AM Bowen Li  wrote:
> >>>
> >>>> Re 1) As described in the FLIP, a temp function lookup will first make
> >>> sure
> >>>> the db exists. If the db doesn't exist, a lazy drop is triggered to
> >>> remove
> >>>> that temp function.
> >>>>
> >>>> I agree Hive doesn't handle it consistently, and we are not copying
> >> Hive.
> >>>> IMHO, allowing registering temp functions in nonexistent catalog/db is
> >>>> hacky and problematic. For instance, "SHOW FUNCTIONS" would list
> system
> >>>> functions and functions in the current catalog/db, since users cannot
> >>>> designate a nonexistent catalog/db as current ones, how can they list
> >>>> functions in a nonexistent catalog/db? They may end up never knowing
> >>>> what temp functions they've created unless they try them out with
> >>>> queries or we introduce some more nonstandard SQL statements. The same
> >>>> applies to other temp objects like temp tables.
> >>>>
>

Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-09-30 Thread Bowen Li
Hi all,

I've updated the FLIP wiki with the following changes:

- Lifespans of temp functions are not tied to those of catalogs and
databases. Users can create temp functions even though the catalogs/dbs in
their fully qualified names don't even exist.
- some new SQL commands
  - "SHOW FUNCTIONS" - list names of temp and non-temp system/built-in
functions, and names of temp and catalog functions in the current catalog
and db
  - "SHOW ALL FUNCTIONS" - list names of temp and non-temp system/built-in
functions, and fully qualified names of temp and catalog functions in all
catalogs and dbs
  - "SHOW ALL TEMPORARY FUNCTIONS" - list fully qualified names of temp
functions in all catalogs and dbs
  - "SHOW ALL TEMPORARY SYSTEM FUNCTIONS" - list names of all temp system
functions

Let me know if you have any questions.

It seems we have resolved all concerns. If there are no more, I'd like to
close the vote by this time tomorrow.

Cheers,
Bowen

On Mon, Sep 30, 2019 at 11:59 AM Bowen Li  wrote:

> Hi,
>
> I think above are some valid points, and we can adopt the suggestions.
>
> To elaborate a bit on the new SQL syntax, it would imply that, unlike
> "SHOW FUNCTION" which only return function names, "SHOW ALL [TEMPORARY]
> FUNCTIONS" would return functions' fully qualified names with catalog and
> db names.
>
>
>
> On Mon, Sep 30, 2019 at 6:38 AM Timo Walther  wrote:
>
>> Hi all,
>>
>> I support Fabian's arguments. In my opinion, temporary objects should
>> just be an additional layer on top of the regular catalog/database
>> lookup logic. Thus, a temporary table or function has always highest
>> precedence and should be stable within the local session. Otherwise it
>> could magically disappear while someone else is performing modifications
>> in the catalog.
>>
>> Furthermore, this feature is very useful for prototyping as users can
>> simply express that a catalog/database is present even through they
>> might not have access to it currently.
>>
>> Regards,
>> Timo
>>
>>
>> On 30.09.19 14:57, Fabian Hueske wrote:
>> > Hi all,
>> >
>> > Sorry for the late reply.
>> >
>> > I think it might lead to confusing situations if temporary functions (or
>> > any temporary db objects for that matter) are bound to the life cycle
>> of an
>> > (external) db/catalog.
>> > Imaging a situation where you create a temp function in a db in an
>> external
>> > catalog and use it but at some point it does not work anymore because
>> some
>> > other dropped the database from the external catalog.
>> > Shouldn't temporary objects be only controlled by the owner of a
>> session?
>> >
>> > I agree that creating temp objects in non-existing db/catalogs sounds a
>> bit
>> > strange, but IMO the opposite (the db/catalog must exist for a temp
>> > function to be created/exist) can have significant implications like the
>> > one I described.
>> > I think it would be quite easy for users to understand that temporary
>> > objects are solely owned by them (and their session).
>> > The problem of listing temporary objects could be solved by adding a ALL
>> > [TEMPORARY] clause:
>> >
>> > SHOW ALL FUNCTIONS; could show all functions regardless of the
>> > catalog/database including temporary functions.
>> > SHOW ALL TEMPORARY FUNCTIONS; could show all temporary functions
>> regardless
>> > of the catalog/database.
>> >
>> > Best,
>> > Fabian
>> >
>> > Am Sa., 28. Sept. 2019 um 02:21 Uhr schrieb Bowen Li <
>> bowenl...@gmail.com>:
>> >
>> >> @Dawid, do you have any other concerns? If not, I hope we can close the
>> >> voting.
>> >>
>> >>
>> >> On Thu, Sep 26, 2019 at 8:14 PM Rui Li  wrote:
>> >>
>> >>> I'm not sure how much benefit #1 can bring us. If users just want to
>> try
>> >>> out temporary functions, they can create temporary system functions
>> which
>> >>> don't require a catalog/DB. IIUC, the main reason why we allow
>> temporary
>> >>> catalog function is to let users override permanent catalog functions.
>> >>> Therefore a temporary function in a non-existing catalog won't serve
>> that
>> >>> purpose. Besides, each session is provided with a default catalog and
>> DB.
>> >>> So even if users simply want to create some catalog functions they can
>> &

Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

2019-09-30 Thread Bowen Li
Hi Timo,

Re 1) I agree. I renamed the title to "Extend Core Table System with
Pluggable Modules" and all internal references accordingly.

Re 2) First, I'll rename the API to useModules(). The design doesn't forbid
users to call useModules() multiple times. Objects in modules are loaded on
demand instead of eagerly, so there won't be any inconsistency. Users have to
be fully aware of the consequences of resetting modules, as that might mean
that some objects can no longer be referenced or that the resolution order of
some objects changes.
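
To make the resolution-order point concrete, here's a minimal sketch (the
useModules() signature, module names, and the CONCAT example are purely
illustrative, not the final API):

// minimal sketch, assuming the proposed useModules(...) API (not final)
TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.newInstance().build());
tEnv.useModules("core", "hive");   // a function like CONCAT resolves from "core" first
tEnv.useModules("hive", "core");   // resetting the module list flips the resolution order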

Re 3) Yes, we'd leave that to users.

Another approach can be to have a non-optional "Core" module for all
objects that cannot be overridden, like the "CAST" and "AS" functions, and have
an optional "ExtendedCore" module for other replaceable built-in objects.
"Core" should always be positioned first in the module list.

I'm fine with either solution.

Re 4) It may sound like a nice-to-have advanced feature for 1.10, but we can
certainly discuss it fully for the sake of feature completeness.

Unlike other configs, the order of modules would matter in Flink, which
implies the LOAD/UNLOAD commands would not be symmetric with respect to
position. IIUYC, LOAD MODULE 'x' would be interpreted as appending x to the
end of the module list, and UNLOAD MODULE 'x' would be interpreted as removing
x from any position in the list?

I'm thinking of the following list of commands:

SHOW MODULES - list modules in order
LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] - load the module and append
it to the end of the module list
UNLOAD MODULE 'hive' - remove the module from the module list, with the other
modules keeping their relative positions
USE MODULES 'x' 'y' 'z' (wondering whether the parser can take "'x' 'y' 'z'"?),
or USE MODULES 'x,y,z' - reorder the module list completely


Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-10-01 Thread Bowen Li
Hi Dawid,

Thanks for bringing the suggestions up. I was prototyping yesterday and
ran into exactly the places you pointed out.

For CallExpression and UnresolvedCallExpression, I've added them to
FLIP-57. We will replace ObjectIdentifier with FunctionIdentifier and mark
that as a breaking change.

For FunctionIdentifier, the suggested changes LGTM. I just want to bring up
an issue on naming. It seems to me that how we currently name function
categories is a bit unclear and confusing, which is reflected in the suggested
APIs - in the FunctionIdentifier you laid out, "builtin function" would include
builtin functions and temporary system functions, as we are kind of using
"system" and "built-in" interchangeably, and "catalog function" would include
catalog functions and temporary functions. I can currently see two approaches
to make it clearer to users.

1) Simplify FunctionIdentifier to be the following. As it's internal, we
add comments and explanations for devs on which cases the APIs support.
However, I feel this approach would conflict somewhat with what you want to
achieve in terms of API clarity.

@Internal
class FunctionIdentifier {
  // for built-in functions and temporary system functions
  public FunctionIdentifier of(String name) {  }
  // for temporary functions and catalog functions
  public FunctionIdentifier of(ObjectIdentifier identifier) {  }
  public Optional<String> getFunctionName() {  }
  public Optional<ObjectIdentifier> getObjectIdentifier() {  }
}

2) We can rename our function categories as follows so there'll be mainly
just two categories of functions, "system functions" and "catalog
functions", either of which can have temporary ones:

  - built-in functions -> officially rename to "system functions" and note
to users that "system" and "built-in" can be used interchangeably. We
prefer "system" because that's the keyword we decided to use in the DDL that
creates their temporary peers ("CREATE TEMPORARY SYSTEM FUNCTION")
  - temporary system functions
  - catalog functions
  - temporary functions -> rename to "temporary catalog functions"

@Internal
class FunctionIdentifier {
  // for temporary/non-temporary system functions
  public FunctionIdentifier ofSystemFunction(String name) {  }
  // for temporary/non-temporary catalog functions
  public FunctionIdentifier ofCatalogFunction(ObjectIdentifier identifier) {  }
  public Optional<String> getSystemFunctionName() {  }
  public Optional<ObjectIdentifier> getCatalogFunctionIdentifier() {  }
}

WDYT?


On Tue, Oct 1, 2019 at 5:48 AM Fabian Hueske  wrote:

> Thanks for the summary Bowen!
>
> Looks good to me.
>
> Cheers,
> Fabian
>
> Am Mo., 30. Sept. 2019 um 23:24 Uhr schrieb Bowen Li  >:
>
> > Hi all,
> >
> > I've updated the FLIP wiki with the following changes:
> >
> > - Lifespan of temp functions are not tied to those of catalogs and
> > databases. Users can create temp functions even though catalogs/dbs in
> > their fully qualified names don't even exist.
> > - some new SQL commands
> > - "SHOW FUNCTIONS" - list names of temp and non-temp system/built-in
> > functions, and names of temp and catalog functions in the current catalog
> > and db
> > - "SHOW ALL FUNCTIONS" - list names of temp and non-temp system/built
> > functions, and fully qualified names of temp and catalog functions in all
> > catalogs and dbs
> > - "SHOW ALL TEMPORARY FUNCTIONS" - list fully qualified names of temp
> > functions in all catalog and db
> > - "SHOW ALL TEMPORARY SYSTEM FUNCTIONS" - list names of all temp
> system
> > functions
> >
> > Let me know if you have any questions.
> >
> > Seems we have resolved all concerns. If there's no more ones, I'd like to
> > close the vote by this time tomorrow.
> >
> > Cheers,
> > Bowen
> >
> > On Mon, Sep 30, 2019 at 11:59 AM Bowen Li  wrote:
> >
> > > Hi,
> > >
> > > I think above are some valid points, and we can adopt the suggestions.
> > >
> > > To elaborate a bit on the new SQL syntax, it would imply that, unlike
> > > "SHOW FUNCTION" which only return function names, "SHOW ALL [TEMPORARY]
> > > FUNCTIONS" would return functions' fully qualified names with catalog
> and
> > > db names.
> > >
> > >
> > >
> > > On Mon, Sep 30, 2019 at 6:38 AM Timo Walther 
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I support Fabian's arguments. In my opinion, temporary objects should
> > >> just be an additional layer on top of the regular catalog/database

Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

2019-10-01 Thread Bowen Li
Hi Timo, Dawid,

I've added the suggested SQL and the related changes to the TableEnvironment
API and other classes to the google doc (a rough sketch below). I also removed
"USE MODULE" and its APIs. I'll update the FLIP wiki once we have a consensus.
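
For reference, the TableEnvironment additions currently look roughly like
this (a sketch following Timo's suggestions below; exact signatures are not
final):

// sketch only - the methods that would be added to TableEnvironment
public interface TableEnvironment {
    void loadModule(Module module);    // counterpart of LOAD MODULE 'x' [WITH (...)]
    void unloadModule(String name);    // counterpart of UNLOAD MODULE 'x'
    List<String> listModules();        // counterpart of SHOW MODULES
}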

W.r.t. the descriptor approach, my gut feeling is similar to Dawid's. Besides,
I feel a yaml file would be a better solution for persisting the serializable
state of an environment, as the file itself is already in a serializable
format. Though the yaml file only serves the SQL CLI at this moment, we may be
able to extend its reach to the Table API and allow users to load/offload a
TableEnvironment from/to yaml files, with something like "TableEnvironment
tEnv = TableEnvironment.loadFromYaml()" and
"tEnv.offloadToYaml()" to restore and persist state, and try to
make the yaml file more expressive.


On Tue, Oct 1, 2019 at 6:47 AM Dawid Wysakowicz 
wrote:

> Hi Timo, Bowen,
>
> Unfortunately I did not have enough time to go through all the
> suggestions in details so I can not comment on the whole FLIP.
>
> I just wanted to give my opinion on the "descriptor approach in
> loadModule" part. I am not sure if we need it here. We might be
> overthinking this a bit. It definitely makes sense for objects like
> TableSource/TableSink etc. as they are logical definitions that nearly
> always have to be persisted in a Catalog. I'm not sure if we really need
> the same for a whole session. If we need a resume session feature, the
> way to go would probably be to keep the session in memory on the server
> side. I fear we will never be able to serialize the whole session
> entirely (temporary objects, objects derived from DataStream etc.)
>
> I think it is ok to use instances for objects like Catalogs or Modules
> and have an overlay on top of that that can create instances from
> properties.
>
> Best,
>
> Dawid
>
> On 01/10/2019 11:28, Timo Walther wrote:
> > Hi Bowen,
> >
> > thanks for your response.
> >
> > Re 2) I also don't have a better approach for this issue. It is
> > similar to changing the general TableConfig between two statements. It
> > would be good to add your explanation to the design document.
> >
> > Re 3) It would be interesting to know about which "core" functions we
> > are actually talking about. Also for the overriding built-in functions
> > that we discussed in the other FLIP. But I'm fine with leaving it to
> > the user for now. How about we just introduce loadModule(),
> > unloadModule() methods instead of useModules()? This would ensure that
> > users don't forget to add the core module when adding an additional
> > module and they need to explicitly call "unloadModule('core')".
> >
> > Re 4) Every table environment feature should also be designed with SQL
> > statements in mind to verify the concept. SQL is also more popular
> > that Java/Scala API or YAML file. I would like to add it to 1.10 for
> > marking the feature as complete.
> >
> > SHOW MODULES -> sounds good to me, we should add a listModules():
> > List method to table environment
> >
> > LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] --> we should add a
> > loadModule() method to table environment
> >
> > UNLOAD MODULE 'hive' --> we should add a unloadModule() method to
> > table environment
> >
> > I would not introduce `USE MODULES 'x' 'y' 'z'` for simplicity and
> > concise API. Users need to load the module anyway with properties.
> > They can also load them "in order" immediately. CREATE TABLE can also
> > not create multiple tables but only one at a time in that order.
> >
> > One thing that came to my mind, shall we use a descriptor approach for
> > loadModule()? The past has shown that passing instances causes
> > problems when persisting objects. That's why we also want to get rid
> > of registerTableSource. I could image that users might want to persist
> > a table environment's state for later use in the future. Even though
> > this is future work, we should already keep such use cases in mind
> > when adding new API methods. What do you think?
> >
> > Regards,
> > Timo
> >
> >
> > On 30.09.19 23:17, Bowen Li wrote:
> >> Hi Timo,
> >>
> >> Re 1) I agree. I renamed the title to "Extend Core Table System with
> >> Pluggable Modules" and all internal references
> >>
> >> Re 2) First, I'll rename the API to useModules(). The design doesn't
> >> forbid
> >> users to call useModules() multi times. Object

Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

2019-10-01 Thread Bowen Li
If something like the yaml file is the way to go to achieve such a goal, we
would cover that with the current design.

On Tue, Oct 1, 2019 at 12:05 Bowen Li  wrote:

> Hi Timo, Dawid,
>
> I've added the suggested SQL and related changes to TableEnvironment API
> and other classes to the google doc. Also removed "USE MODULE" and its
> APIs. Will update FLIP wiki once we have a consensus.
>
> W.r.t. descriptor approach, my gut feeling is similar to Dawid's. Besides,
> I feel yaml file would be a better solution to persist serializable state
> of an environment as the file itself is in serializable format already.
> Though yaml file only serves SQL CLI at this moment, we may be able to
> extend its reach to Table API and allow users to load/offload a
> TableEnvironment from/to yaml files, as something like "TableEnvironment
> tEnv = TableEnvironment.loadFromYaml()" and
> "tEnv.offloadToYaml()" to restore and persist state, and try to
> make yaml file more expressive.
>
>
> On Tue, Oct 1, 2019 at 6:47 AM Dawid Wysakowicz 
> wrote:
>
>> Hi Timo, Bowen,
>>
>> Unfortunately I did not have enough time to go through all the
>> suggestions in details so I can not comment on the whole FLIP.
>>
>> I just wanted to give my opinion on the "descriptor approach in
>> loadModule" part. I am not sure if we need it here. We might be
>> overthinking this a bit. It definitely makes sense for objects like
>> TableSource/TableSink etc. as they are logical definitions that nearly
>> always have to be persisted in a Catalog. I'm not sure if we really need
>> the same for a whole session. If we need a resume session feature, the
>> way to go would probably be to keep the session in memory on the server
>> side. I fear we will never be able to serialize the whole session
>> entirely (temporary objects, objects derived from DataStream etc.)
>>
>> I think it is ok to use instances for objects like Catalogs or Modules
>> and have an overlay on top of that that can create instances from
>> properties.
>>
>> Best,
>>
>> Dawid
>>
>> On 01/10/2019 11:28, Timo Walther wrote:
>> > Hi Bowen,
>> >
>> > thanks for your response.
>> >
>> > Re 2) I also don't have a better approach for this issue. It is
>> > similar to changing the general TableConfig between two statements. It
>> > would be good to add your explanation to the design document.
>> >
>> > Re 3) It would be interesting to know about which "core" functions we
>> > are actually talking about. Also for the overriding built-in functions
>> > that we discussed in the other FLIP. But I'm fine with leaving it to
>> > the user for now. How about we just introduce loadModule(),
>> > unloadModule() methods instead of useModules()? This would ensure that
>> > users don't forget to add the core module when adding an additional
>> > module and they need to explicitly call "unloadModule('core')".
>> >
>> > Re 4) Every table environment feature should also be designed with SQL
>> > statements in mind to verify the concept. SQL is also more popular
>> > that Java/Scala API or YAML file. I would like to add it to 1.10 for
>> > marking the feature as complete.
>> >
>> > SHOW MODULES -> sounds good to me, we should add a listModules():
>> > List method to table environment
>> >
>> > LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] --> we should add a
>> > loadModule() method to table environment
>> >
>> > UNLOAD MODULE 'hive' --> we should add a unloadModule() method to
>> > table environment
>> >
>> > I would not introduce `USE MODULES 'x' 'y' 'z'` for simplicity and
>> > concise API. Users need to load the module anyway with properties.
>> > They can also load them "in order" immediately. CREATE TABLE can also
>> > not create multiple tables but only one at a time in that order.
>> >
>> > One thing that came to my mind, shall we use a descriptor approach for
>> > loadModule()? The past has shown that passing instances causes
>> > problems when persisting objects. That's why we also want to get rid
>> > of registerTableSource. I could image that users might want to persist
>> > a table environment's state for later use in the future. Even though
>> > this is future work, we should already keep such use cases in mind
>> > whe

Re: [ANNOUNCE] Progress of Apache Flink 1.10 #1

2019-10-01 Thread Bowen Li
Thanks Yu and Gary for the detailed summary and update!

On Fri, Sep 27, 2019 at 6:54 AM Yu Li  wrote:

> Hi community,
>
> Since we are now more than one month into the Flink 1.10 release cycle, we
> thought it would be adequate to give a progress update. Below we have
> included a list of the ongoing efforts that we are aware of, together with
> a brief summary of their state. As always, the list is not meant to be
> exhaustive. If you are working on something that is not included here, feel
> free to use this thread to share your progress.
>
> Note that because we are still relatively at the beginning of the release
> cycle, most of the progress is limited to FLIPs that are accepted or being
> voted on.
>
> - Improving Flink’s build system & CI
> - Repository Split [1]
> - Discussed on the ML but consensus to split the repository was not
> reached.
> - Reduce Build Time [2]
> - Discussion is ongoing. Currently, using Azure Pipelines and
> Gradle are being evaluated.
>
> - Support Java 11 [3]
> - Implementation is in progress (18/21 subtasks resolved)
>
> - Table API improvements
> - FLIP-54 Evolve ConfigOption and Configuration [4]
> - Under discussion.
> - FLIP-59 Enable Execution Configuration from Configuration Object [5]
> - Under discussion.
> - Full Data Type Support in Planner [6]
> - Implementation in progress.
> - FLIP-66 Support Time Attribute in SQL DDL [7]
> - FLIP voting.
> - FLIP-70 Support Computed Column [8]
> - Under discussion.
> - FLIP-63 Rework Table Partition Support [9]
> - FLIP voting
> - FLIP-51 Rework of Expression Design [10]
> - FLIP accepted, implementation in progress.
> - FLIP-55 Introduction of a TableAPI Java Expression DSL [11]
> - Under discussion.
> - FLIP-64 Support for Temporary Objects in Table Module [12]
> - Under discussion.
> - FLIP-65 New Type Inference for Table API UDFs
> - Under discussion.
>
> - Hive compatibility completion (DDL/UDF) to support full Hive integration
> - FLIP-57 Rework FunctionCatalog [13]
> - FLIP voting
> - FLIP-68 Extend Core Table System with Modular Plugins [14]
> - FLIP voting was initiated [15] but temporarily withdrawn due to
> lack of community bandwidth.
>
> - Finer grained resource management
> - FLIP-49: Unified Memory Configuration for TaskExecutors [16]
> - FLIP accepted. Implementation is in progress.
> - FLIP-53: Fine Grained Operator Resource Management [17]
> - FLIP accepted. Implementation details are under discussion.
> - FLIP-56: Dynamic Slot Allocation [18]
> - FLIP accepted. Implementation not started yet.
>
> - Finish scheduler re-architecture [19]
> - Implementation is in progress.
>
> - FLIP-27: Refactor Source Interface [20]
> -  FLIP accepted. Implementation is in progress.
>
> - Executor/Client refactoring [21]
>- Discussion already reached consensus
>- FLIP is coming. A PoC implementation is also ready.
>
> - FLIP-36 Support Interactive Programming [22]
> - Reviewing FLIP-67, which changes the intermediate result management
> in runtime, which is what FLIP-36 will be built on top of.
>
> - FLIP-58: Flink Python User-Defined Stateless Function for Table [23]
> - Implementation is in progress (3/15 subtask resolved).
> - Python environment and dependency management under discussion
>
> - FLIP-50: Spill-able Heap Keyed State Backend [24]
> - FLIP was accepted. Implementation is in progress.
>
> - RocksDB Backend Memory Control [25]
> - Verified capping memory usage through Write Buffer Manager [26] works
> in production.
> - New RocksDB version TBD, 5.18.3/6.2.2 has performance regression [27]
> compared to the currently used version 5.17.2.
> - FLIP of MemoryManager interface for reserving memory to be opened.
>
> - Unaligned Checkpoints [28]
> - Design under discussion.
> - FLIP document is under development and will be released shortly
>
> - Separate framework and user class loader in per-job mode [29]
> - Pull request is being reviewed.
>
> - Active Kubernetes Integration [30]
> - PoC completed. More details need to be discussed before updating the
> PRs.
>
> - FLIP-39 Flink ML pipeline and ML libs [31]
> - ML pipeline API PRs (FLINK-13339) have been opened and are being
> reviewed.
> - Algorithms are waiting for the new ML pipeline API to be merged.
>
> - Add vertex subtask log url on WebUI [32]
> - This makes it easier for users of the WebUI to access the logs of the
> TaskManager that executes a specific subtask.
> - A pull request is opened and currently being reviewed.
>
> As a reminder, the feature freeze is targeted to be at the end of November.
> This leaves us with approximately another 2 months of development time. We
> will send another announcement later in the release cycle to make the date
> of the feature freeze offi

Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-10-02 Thread Bowen Li
Introducing a new term "path" to APIs like
"getShortPath(Identifier)/getLongPath(Identifier)" would be confusing to
users, thus I feel "getSimpleName/getIdentifier" is fine.

To summarize the discussion results:

   - categorize functions along 2 dimensions - system vs. catalog, non-temp
   vs. temp - which gives us 4 combinations
   - definition of FunctionIdentifier

@PublicEvolving
class FunctionIdentifier {

    String name;

    ObjectIdentifier oi;

    // for temporary/non-temporary system function
    public FunctionIdentifier of(String name) {  }
    // for temporary/non-temporary catalog function
    public FunctionIdentifier of(ObjectIdentifier identifier) {  }

    Optional<ObjectIdentifier> getIdentifier() {}

    Optional<String> getSimpleName() {}
}
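
For illustration, usage of the two factory methods would look roughly like
this (a hypothetical example; it assumes the existing
ObjectIdentifier.of(catalog, database, name) factory, and the function names
are made up):

FunctionIdentifier sys = FunctionIdentifier.of("my_temp_system_func");
FunctionIdentifier cat = FunctionIdentifier.of(
    ObjectIdentifier.of("mycatalog", "mydb", "my_catalog_func"));
sys.getSimpleName();   // Optional.of("my_temp_system_func")
cat.getIdentifier();   // Optional of the full mycatalog.mydb.my_catalog_func identifier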


I've updated them in the FLIP wiki. Please take a final look. I'll close the
voting if no other concerns are raised within 24 hours.

Cheers

On Wed, Oct 2, 2019 at 4:54 AM Dawid Wysakowicz 
wrote:

> Hi,
>
> I very much agree with Xuefu's summary of the two points, especially on
> the "functionIdentifier doesn't need to reflect the categories".
>
> For the factory methods I think methods of should be enough:
>
>   // for temporary/non-temporary system function
> public FunctionIdentifier of(String name) {  }
>   // for temporary/non-temporary catalog function
> public FunctionIdentifier of(ObjectIdentifier identifier){  }
>
> In case of the getters I did not like the method name `getName` in the
> original proposal, as in my opinion it could imply that it can return
> also just the name part of an ObjectIdentifier, which should not be the
> case.
>
> I'm fine with getSimpleName/getIdentifier, but want to throw in few
> other suggestions:
>
> * getShortPath(Identifier)/getLongPath(Identifier),
>
> * getSystemPath(Identifier)/getCatalogPath(Identifier)
>
> +1 to any of the 3 options.
>
> One additional thing the FunctionIdentifier should be a PublicEvolving
> class, as it is part of a PublicEvolving APIs e.g. CallExpression, which
> user might need to access e.g. in a filter pushdown.
>
> I also support the Xuefu's suggestion not to support the "ALL" keyword
> in the "SHOW [TEMPORARY] FUNCTIONS" statement, but as the exact design
> of it  is not part of the FLIP-57, we do not need to agree on that in
> this thread.
>
> Overall I think after updating the FLIP with the outcome of the
> discussion I vote +1 for it.
>
> Best,
>
> Dawid
>
>
> On 02/10/2019 00:21, Xuefu Z wrote:
> > Here are some of my thoughts on the minor debates above:
> >
> > 1. +1 for 4 categories of functions. They are categorized along two
> > dimensions of binary values: X: *temporary* vs non-temporary
> (persistent);
> > Y: *system* vs non-system (so said catalog).
> > 2. In my opinion, class functionIdentifier doesn't really need to reflect
> > the categories of the functions. Instead, we should decouple them to make
> > the API more stable. Thus, my suggestion is:
> >
> > @Internal
> > class FunctionIdentifier {
> >   // for temporary/non-temporary system function
> > public FunctionIdentifier ofSimpleName(String name) {  }
> >   // for temporary/non-temporary catalog function
> > public FunctionIdentifier ofIdentifier(ObjectIdentifier
> > identifier){  }
> > public Optional getSimpleName() {  }
> > public Optional getIdentifier() {  }
> > }
> > 3. DDLs -- I don't think we need "ALL" keyword. The grammar can just be:
> >
> > SHOW [TEMPORARY] [SYSTEM] FUNCTIONS.
> >
> > When either keyword is missing, "ALL" is implied along that dimension. We
> > should always limit the search to the system function catalog and the
> > current catalog/DB. I don't see a need of listing functions across
> > different catalogs and databases. (It can be added later if that arises.)
> >
> > Thanks,
> > Xuefu
> >
> > On Tue, Oct 1, 2019 at 11:12 AM Bowen Li  wrote:
> >
> >> Hi Dawid,
> >>
> >> Thanks for bringing the suggestions up. I was prototyping yesterday and
> >> found out those places exactly as what you suggested.
> >>
> >> For CallExpression and UnresolvedCallExpression, I've added them to
> >> FLIP-57. We will replace ObjectIdentifier with FunctionIdentifier and
> mark
> >> that as a breaking change
> >>
> >> For FunctionIdentifier, the suggested changes LGTM. Just want to bring
> up
> >> an issue on naming. It seems to me how we now name functions categories
> is
> >> a bit unclear 

Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-10-03 Thread Bowen Li
I'm glad to announce that the community has accepted the design of FLIP-57,
and we are moving forward with implementing it.

Thanks everyone!

On Wed, Oct 2, 2019 at 11:01 AM Bowen Li  wrote:

> Introducing a new term "path" to APIs like
> "getShortPath(Identifier)/getLongPath(Identifier)" would be confusing to
> users, thus I feel "getSimpleName/getIdentifier" is fine.
>
> To summarize the discussion result.
>
>- categorize functions into 2 dimensions - system v.s. catalog,
>non-temp v.s. temp - and that give us 4 combinations
>- definition of FunctionIdentifier
>
>  @PublicEvolving
>
> Class FunctionIdentifier {
>
> String name;
>
> ObjectIdentifier oi;
>
> // for temporary/non-temporary system function
> public FunctionIdentifier of(String name) {  }
> // for temporary/non-temporary catalog function
> public FunctionIdentifier of(ObjectIdentifier identifier) {  }
>
>
> Optional getIdentifier() {}
>
> Optional getSimpleName() {}
>
> }
>
>
> I've updated them to FLIP wiki. Please take a final look. I'll close the
> voting if there's no other concern raised within 24 hours.
>
> Cheers
>
> On Wed, Oct 2, 2019 at 4:54 AM Dawid Wysakowicz 
> wrote:
>
>> Hi,
>>
>> I very much agree with Xuefu's summary of the two points, especially on
>> the "functionIdentifier doesn't need to reflect the categories".
>>
>> For the factory methods I think methods of should be enough:
>>
>>   // for temporary/non-temporary system function
>> public FunctionIdentifier of(String name) {  }
>>   // for temporary/non-temporary catalog function
>> public FunctionIdentifier of(ObjectIdentifier identifier){  }
>>
>> In case of the getters I did not like the method name `getName` in the
>> original proposal, as in my opinion it could imply that it can return
>> also just the name part of an ObjectIdentifier, which should not be the
>> case.
>>
>> I'm fine with getSimpleName/getIdentifier, but want to throw in few
>> other suggestions:
>>
>> * getShortPath(Identifier)/getLongPath(Identifier),
>>
>> * getSystemPath(Identifier)/getCatalogPath(Identifier)
>>
>> +1 to any of the 3 options.
>>
>> One additional thing the FunctionIdentifier should be a PublicEvolving
>> class, as it is part of a PublicEvolving APIs e.g. CallExpression, which
>> user might need to access e.g. in a filter pushdown.
>>
>> I also support the Xuefu's suggestion not to support the "ALL" keyword
>> in the "SHOW [TEMPORARY] FUNCTIONS" statement, but as the exact design
>> of it  is not part of the FLIP-57, we do not need to agree on that in
>> this thread.
>>
>> Overall I think after updating the FLIP with the outcome of the
>> discussion I vote +1 for it.
>>
>> Best,
>>
>> Dawid
>>
>>
>> On 02/10/2019 00:21, Xuefu Z wrote:
>> > Here are some of my thoughts on the minor debates above:
>> >
>> > 1. +1 for 4 categories of functions. They are categorized along two
>> > dimensions of binary values: X: *temporary* vs non-temporary
>> (persistent);
>> > Y: *system* vs non-system (so said catalog).
>> > 2. In my opinion, class functionIdentifier doesn't really need to
>> reflect
>> > the categories of the functions. Instead, we should decouple them to
>> make
>> > the API more stable. Thus, my suggestion is:
>> >
>> > @Internal
>> > class FunctionIdentifier {
>> >   // for temporary/non-temporary system function
>> > public FunctionIdentifier ofSimpleName(String name) {  }
>> >   // for temporary/non-temporary catalog function
>> > public FunctionIdentifier ofIdentifier(ObjectIdentifier
>> > identifier){  }
>> > public Optional getSimpleName() {  }
>> > public Optional getIdentifier() {  }
>> > }
>> > 3. DDLs -- I don't think we need "ALL" keyword. The grammar can just be:
>> >
>> > SHOW [TEMPORARY] [SYSTEM] FUNCTIONS.
>> >
>> > When either keyword is missing, "ALL" is implied along that dimension.
>> We
>> > should always limit the search to the system function catalog and the
>> > current catalog/DB. I don't see a need of listing functions across
>> > different catalogs and databases. (It can be added later if that
>> arises.)
>> >
>> > Thanks,
>> > Xuefu
>> >
>> &

Re: [VOTE] FLIP-57: Rework FunctionCatalog

2019-10-06 Thread Bowen Li
Hi Aljoscha, Timo

Thanks for the reminder. I've updated the details in the FLIP wiki, and will
kick off a voting thread.

On Fri, Oct 4, 2019 at 1:51 PM Timo Walther  wrote:

> Hi,
>
> I agree with Aljoscha. It is not transparent to me which votes are
> binding to the current status of the FLIP.
>
> Some other minor comments from my side:
>
> - We don't need to deprecate methods in FunctionCatalog. This class is
> internal. We can simply change the method signatures.
> - `String name` is missing in the FunctionIdentifier code example; can
> we call FunctionIdentifier.getSimpleName() just
> FunctionIdentifier.getName()?
> - Add the methods that we discussed to the example:  `of(String)`,
> `of(ObjectIdentifier)`
>
> Other than that, I'm happy to give my +1 to this proposal.
>
> Thanks for the productive discussion,
> Timo
>
>
> On 04.10.19 13:29, Aljoscha Krettek wrote:
> > Hi,
> >
> > I see there was quite some discussion and changes on the FLIP after this
> VOTE was started. I would suggest to start a new voting thread on the
> current state of the FLIP (keeping in mind that a FLIP vote needs at least
> three committer/PMC votes).
> >
> > For the future, we should probably keep discussion to the [DISCUSS]
> thread and use the vote thread only for voting.
> >
> > Best,
> > Aljoscha
> >
> >> On 3. Oct 2019, at 21:17, Bowen Li  wrote:
> >>
> >> I'm glad to announce that the community has accepted the design of
> FLIP-57,
> >> and we are moving forward to implementing it.
> >>
> >> Thanks everyone!
> >>
> >> On Wed, Oct 2, 2019 at 11:01 AM Bowen Li  wrote:
> >>
> >>> Introducing a new term "path" to APIs like
> >>> "getShortPath(Identifier)/getLongPath(Identifier)" would be confusing
> to
> >>> users, thus I feel "getSimpleName/getIdentifier" is fine.
> >>>
> >>> To summarize the discussion result.
> >>>
> >>>- categorize functions into 2 dimensions - system v.s. catalog,
> >>>non-temp v.s. temp - and that give us 4 combinations
> >>>- definition of FunctionIdentifier
> >>>
> >>>  @PublicEvolving
> >>>
> >>> Class FunctionIdentifier {
> >>>
> >>> String name;
> >>>
> >>> ObjectIdentifier oi;
> >>>
> >>> // for temporary/non-temporary system function
> >>> public FunctionIdentifier of(String name) {  }
> >>> // for temporary/non-temporary catalog function
> >>> public FunctionIdentifier of(ObjectIdentifier identifier) {  }
> >>>
> >>>
> >>> Optional getIdentifier() {}
> >>>
> >>> Optional getSimpleName() {}
> >>>
> >>> }
> >>>
> >>>
> >>> I've updated them to FLIP wiki. Please take a final look. I'll close
> the
> >>> voting if there's no other concern raised within 24 hours.
> >>>
> >>> Cheers
> >>>
> >>> On Wed, Oct 2, 2019 at 4:54 AM Dawid Wysakowicz <
> dwysakow...@apache.org>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I very much agree with Xuefu's summary of the two points, especially
> on
> >>>> the "functionIdentifier doesn't need to reflect the categories".
> >>>>
> >>>> For the factory methods I think methods of should be enough:
> >>>>
> >>>>   // for temporary/non-temporary system function
> >>>> public FunctionIdentifier of(String name) {  }
> >>>>   // for temporary/non-temporary catalog function
> >>>> public FunctionIdentifier of(ObjectIdentifier identifier){  }
> >>>>
> >>>> In case of the getters I did not like the method name `getName` in the
> >>>> original proposal, as in my opinion it could imply that it can return
> >>>> also just the name part of an ObjectIdentifier, which should not be
> the
> >>>> case.
> >>>>
> >>>> I'm fine with getSimpleName/getIdentifier, but want to throw in few
> >>>> other suggestions:
> >>>>
> >>>> * getShortPath(Identifier)/getLongPath(Identifier),
> >>>>
> >>>> * getSystemPath(Identifier)/getCatalogPath(Identifier)
> >>>>
> >>>> +1 to

[VOTE] FLIP-57: Rework FunctionCatalog, latest updated

2019-10-06 Thread Bowen Li
Hi all,

I'd like to start a new voting thread for FLIP-57 [1] on its latest status,
superseding the earlier vote in [2], as we've reached consensus in [2] and [3].

This vote will be open for a minimum of 3 days, till 6:45am UTC, Oct 10.

Thanks,
Bowen

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog
[2] https://www.mail-archive.com/dev@flink.apache.org/msg30180.html
[3]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html#a32613


Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

2019-10-09 Thread Bowen Li
Thanks everyone for your review.

After discussing with Timo and Dawid offline, as well as incorporating
feedback from Xuefu and Jark on the mailing list, I decided to make a few
critical changes to the proposal.

- renamed the keyword "type" to "kind". The community plans to have the
"type" keyword in yaml/descriptors refer to data types exclusively in the
near future, and we should cater to that change in our design
- allowed specifying names for modules to simplify and unify the module
loading/unloading syntax between the programmatic API and SQL. Here are the
proposed changes:
SQL:
 LOAD MODULE "name" WITH ("kind"="xxx" [, (properties)])
 UNLOAD MODULE "name";
Table:
 tEnv.loadModule("name", new Xxx(properties));
 tEnv.unloadModule("name");

I have completely updated the google doc [1]. Please take another look, and
let me know if you have any other questions. Thanks!

[1]
https://docs.google.com/document/d/17CPMpMbPDjvM4selUVEfh_tqUK_oV0TODAUA9dfHakc/edit#


On Tue, Oct 8, 2019 at 6:26 AM Jark Wu  wrote:

> Hi Bowen,
>
> Thanks for the proposal. I have two thoughts:
>
> 1) Regarding to "loadModule", how about
> tableEnv.loadModule("xxx" [, propertiesMap]);
> tableEnv.unloadModule(“xxx”);
>
> This makes the API similar to SQL. IMO, instance of Module is not needed
> and verbose as parameter.
> And this makes it easier to load a simple module without any additional
> properties, e.g. tEnv.loadModule("GEO"), tEnv.unloadModule("GEO")
>
> 2) In current design, the module interface only defines function metadata,
> but no implementations.
> I'm wondering how to call/map the implementations in runtime? Am I missing
> something?
>
> Besides, I left some minor comments in the doc.
>
> Best,
> Jark
>
>
> On Sat, 5 Oct 2019 at 08:42, Xuefu Z  wrote:
>
> > I agree with Timo that the new table APIs need to be consistent. I'd go
> > further that an name (or id) is needed for module definition in YAML
> file.
> > In the current design, name is skipped and type has binary meanings.
> >
> > Thanks,
> > Xuefu
> >
> > On Fri, Oct 4, 2019 at 5:24 AM Timo Walther  wrote:
> >
> > > Hi everyone,
> > >
> > > first, I was also questioning my proposal. But Bowen's proposal of
> > > `tEnv.offloadToYaml()` would not work with the current
> design
> > > because we don't know how to serialize a catalog or module into
> > > properties. Currently, there is no converter from instance to
> > > properties. It is a one way conversion. We can add a `toProperties`
> > > method to both Catalog and Module class in the future to solve this.
> > > Solving the table environment serializability can be future work.
> > >
> > > However, I find the current proposal for the TableEnvironment methods
> is
> > > contradicting:
> > >
> > > tableEnv.loadModule(new Yyy());
> > > tableEnv.unloadModule(“Xxx”);
> > >
> > > The loading is specified programmatically whereas the unloading
> requires
> > > a string that is not specified in the module itself. But is defined in
> > > the factory according to the current design.
> > >
> > > SQL does it more consistently. There, the name `xxx` is used when
> > > loading and unloading the module:
> > >
> > > LOAD MODULE 'xxx' [WITH ('prop'='myProp', ...)]
> > > UNLOAD MODULE 'xxx’
> > >
> > > How about:
> > >
> > > tableEnv.loadModule("xxx", new Yyy());
> > > tableEnv.unloadModule(“xxx”);
> > >
> > > This would be similar to the catalog interfaces. The name is not part
> of
> > > the instance itself.
> > >
> > > What do you think?
> > >
> > > Thanks,
> > > Timo
> > >
> > >
> > >
> > >
> > > On 01.10.19 21:17, Bowen Li wrote:
> > > > If something like the yaml file is the way to go and achieve such
> > > > motivation, we would cover that with current design.
> > > >
> > > > On Tue, Oct 1, 2019 at 12:05 Bowen Li  wrote:
> > > >
> > > >> Hi Timo, Dawid,
> > > >>
> > > >> I've added the suggested SQL and related changes to TableEnvironment
> > API
> > > >> and other classes to the google doc. Also removed "USE MODULE" and
> its
> > > >> APIs. Will update FLIP wiki once we have a consensus.
> > > >>
> > &

Re: [DISCUSS] FLIP-64: Support for Temporary Objects in Table module

2019-10-09 Thread Bowen Li
Hi Dawid,

+1 for the proposed changes

On Wed, Oct 9, 2019 at 12:15 PM Dawid Wysakowicz 
wrote:

> Sorry for a very delayed response.
>
> @Kurt Yes, this is the goal to have a function created like new
> Function(...) also be wrapped into CatalogFunction. This would have to
> be though a temporary function as we cannot represent that as a set of
> properties. Similar to the createTemporaryView(DataStream stream).
>
> As for the ConnectTableDescriptor I agree this is very similar to
> CatalogTable. I am not sure though if we should get rid of it. In the
> end I see it as a builder for a CatalogTable, which is a slightly more
> internal API, but we might revisit that some time in the future if we
> find that it makes more sense.
>
> @All I updated the FLIP page with some more details from the outcome of
> the discussions around FLIP-57. Please take a look. I would like to
> start a vote on this FLIP as soon as the vote on FLIP-57 goes through.
>
> Best,
>
> Dawid
>
>
> On 19/09/2019 09:24, Kurt Young wrote:
> > IIUC it's good to see that both serializable (tables description from
> DDL)
> > and unserializable (tables with DataStream underneath) tables are treated
> > unify with CatalogTable.
> >
> > Can I also assume functions that either come from a function class (from
> > DDL)
> > or function objects (newed by user) will also treated unify with
> > CatalogFunction?
> >
> > This will greatly simplify and unify current API level concepts and
> design.
> >
> > And it seems only one thing left, how do we deal with
> > ConnectTableDescriptor?
> > It's actually very similar with serializable CatalogTable, both carry
> some
> > text
> > properties which even are the same. Is there any chance we can further
> unify
> > this to CatalogTable?
> >
> > object
> > Best,
> > Kurt
> >
> >
> > On Thu, Sep 19, 2019 at 3:13 PM Jark Wu  wrote:
> >
> >> Thanks Dawid for the design doc.
> >>
> >> In general, I’m +1 to the FLIP.
> >>
> >>
> >> +1 to the single-string and parse way to express object path.
> >>
> >> +1 to deprecate registerTableSink & registerTableSource.
> >> But I would suggest to provide an easy way to register a custom
> >> source/sink before we drop them (this is another story).
> >> Currently, it’s not easy to implement a custom connector descriptor.
> >>
> >> Best,
> >> Jark
> >>
> >>
> >>> 在 2019年9月19日,11:37,Dawid Wysakowicz  写道:
> >>>
> >>> Hi JingsongLee,
> >>> From my understanding they can. Underneath they will be CatalogTables.
> >> The
> >>> difference is the lifetime of the tables. Plus some of the user facing
> >>> interfaces cannot be persisted e.g. datastream. Therefore we must have
> a
> >>> separate methods for that. In the end the temporary tables are held in
> >>> memory as CatalogTables.
> >>> Best,
> >>> Dawid
> >>>
> >>> On Thu, 19 Sep 2019, 10:08 JingsongLee,  >> .invalid>
> >>> wrote:
> >>>
>  Hi dawid:
>  Can temporary tables achieve the same capabilities as catalog table?
>  like statistics: CatalogTableStatistics, CatalogColumnStatistics,
>  PartitionStatistics
>  like partition support: we have added some catalog equivalent
> interfaces
>  on TableSource/TableSink: getPartitions, getPartitionFieldNames
>  Maybe it's not a good idea to add these interfaces to
>  TableSource/TableSink. What do you think?
> 
>  Best,
>  Jingsong Lee
> 
> 
>  --
>  From:Kurt Young 
>  Send Time:2019年9月18日(星期三) 17:54
>  To:dev 
>  Subject:Re: [DISCUSS] FLIP-64: Support for Temporary Objects in Table
>  module
> 
>  Hi all,
> 
>  Sorry to join this party late. Big +1 to this flip, especially for the
>  dropping
>  "registerTableSink & registerTableSource" part. These are indeed
> legacy
>  and we should try to unify them through CatalogTable after we
> introduce
>  the concept of Catalog.
> 
>  From my understanding, what we can registered should all be metadata,
>  TableSource/TableSink should only be the one who is responsible to do
>  the real work, i.e. reading and writing data according to the schema
> and
>  other information like computed column, partition, .e.g.
> 
>  Best,
>  Kurt
> 
> 
>  On Wed, Sep 18, 2019 at 5:14 PM JingsongLee   .invalid>
>  wrote:
> 
> > After some development and thinking, I have a general understanding.
> > +1 to registering a source/sink does not fit into the SQL world.
> > I am OK to have a deprecated registerTemporarySource/Sink to
> compatible
> > with old ways.
> >
> > Best,
> > Jingsong Lee
> >
> >
> > --
> > From:Timo Walther 
> > Send Time:2019年9月17日(星期二) 08:00
> > To:dev 
> > Subject:Re: [DISCUSS] FLIP-64: Support for Temporary Objects in Table
> > module
> >
> > Hi Dawid,
> >
> > thanks for the design 

Re: [VOTE] FLIP-57: Rework FunctionCatalog, latest updated

2019-10-15 Thread Bowen Li
Hi all,

I hereby announce the FLIP has passed with 6 +1 votes, 4 binding (Dawid,
Timo, Aljoscha, Jark) and 2 non-binding (Xuefu, Jingsong).

Thanks for your review and participation!



On Thu, Oct 10, 2019 at 1:08 AM Jingsong Li  wrote:

> +1
>
> Best,
> Jingsong Lee
>
> On Thu, Oct 10, 2019 at 3:38 PM Jark Wu  wrote:
>
> > +1
> >
> > Thanks,
> > Jark
> >
> > On Wed, 9 Oct 2019 at 01:03, Xuefu Z  wrote:
> >
> > > +1
> > >
> > > On Tue, Oct 8, 2019 at 7:00 AM Aljoscha Krettek 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > > On 8. Oct 2019, at 15:35, Timo Walther  wrote:
> > > > >
> > > > > +1
> > > > >
> > > > > Thanks for driving these efforts,
> > > > > Timo
> > > > >
> > > > > On 07.10.19 10:10, Dawid Wysakowicz wrote:
> > > > >> +1 for the FLIP.
> > > > >>
> > > > >> Best,
> > > > >>
> > > > >> Dawid
> > > > >>
> > > > >> On 07/10/2019 08:45, Bowen Li wrote:
> > > > >>> Hi all,
> > > > >>>
> > > > >>> I'd like to start a new voting thread for FLIP-57 [1] on its
> latest
> > > > status
> > > > >>> despite [2], and we've reached consensus in [2] and [3].
> > > > >>>
> > > > >>> This voting will be open for minimum 3 days till 6:45am UTC, Oct
> > 10.
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Bowen
> > > > >>>
> > > > >>> [1]
> > > > >>>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog
> > > > >>> [2]
> > https://www.mail-archive.com/dev@flink.apache.org/msg30180.html
> > > > >>> [3]
> > > > >>>
> > > >
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html#a32613
> > > > >>>
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Xuefu Zhang
> > >
> > > "In Honey We Trust!"
> > >
> >
>
>
> --
> Best, Jingsong Lee
>


Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

2019-10-15 Thread Bowen Li
 general this problem is unsolved
> >>>> for now, also Kafka tables could clash if you read from two Kafka
> >>>> clusters with different versions.
> >>>>
> >>>> Regards,
> >>>> Timo
> >>>>
> >>>>
> >>>> On 10.10.19 08:01, Jark Wu wrote:
> >>>>> Hi Xuefu,
> >>>>>
> >>>>> If there is only one instance per type, then what's the "name" used
> >> for?
> >>>>> Could we remove it and only keep "type" or "kind" to identify
> modules?
> >>>>>
> >>>>> Best,
> >>>>> Jark
> >>>>>
> >>>>> On Thu, 10 Oct 2019 at 11:21, Xuefu Z  wrote:
> >>>>>
> >>>>>> Jark has a good point. However, I think validation logic can put in
> >>>> place
> >>>>>> to restrict one instance per type. Maybe the doc needs to be
> specific
> >> on
> >>>>>> this.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Xuefu
> >>>>>>
> >>>>>> On Wed, Oct 9, 2019 at 7:41 PM Jark Wu  wrote:
> >>>>>>
> >>>>>>> Thanks Bowen for the updating.
> >>>>>>>
> >>>>>>> I have some different opinions on the change.
> >>>>>>> IIUC, in the previous design, the "name" is also the "id" or "type"
> >> to
> >>>>>>> identify which module to load. Which means we can only load one
> >>>> instance
> >>>>>> of
> >>>>>>> a module.
> >>>>>>> In the new design, the "name" is just an alias to the module
> >> instance,
> >>>>>> the
> >>>>>>> "kind" is used to identify modules. Which means we can load
> different
> >>>>>>> instances of a module.
> >>>>>>> However, what's the "name" or alias used for? Do we need to support
> >>>>>> loading
> >>>>>>> different instances of a module? From my point of view, it brings
> >> more
> >>>>>>> complexity and confusion.
> >>>>>>> For example, if we load a "hive121" which uses HiveModule with
> >> version
> >>>>>>> 1.2.1 and load a "hive234" which uses HiveModule with version
> 2.3.4,
> >>>> then
> >>>>>>> how to solve the class conflict problem?
> >>>>>>>
> >>>>>>> IMO, a module can only be load once in a session, so "name" maybe
> >>>>>> useless.
> >>>>>>> So my proposal is similar to the previous one, but only change
> "name"
> >>>> to
> >>>>>>> "kind".
> >>>>>>>
> >>>>>>>   SQL:
> >>>>>>> LOAD MODULE "kind" [WITH (properties)];
> >>>>>>> UNLOAD MODULE "kind";
> >>>>>>>Table:
> >>>>>>> tEnv.loadModule("kind" [, properties]);
> >>>>>>> tEnv.unloadModule("kind");
> >>>>>>>
> >>>>>>> What do you think?
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, 9 Oct 2019 at 20:38, Bowen Li  wrote:
> >>>>>>>
> >>>>>>>> Thanks everyone for your review.
> >>>>>>>>
> >>>>>>>> After discussing with Timo and Dawid offline, as well as
> >> incorporating
> >>>>>>>> feedback from Xuefu and Jark on mailing list, I decided to make a
> >> few
> >>>>>>>> critical changes to the proposal.
> >>>>>>>>
> >>>>>>>> - renamed the keyword "type" to "kind". The community has plan to
> >> have
> >>>>>>

[VOTE] FLIP-68: Extend Core Table System with Modular Plugins

2019-10-15 Thread Bowen Li
Hi all,

I'd like to kick off a voting thread for FLIP-68: Extend Core Table System
with Modular Plugins [1], as we have reached consensus in [2].

The voting period will be open for at least 72 hours, ending at 5pm Oct 18,
UTC.

Thanks,
Bowen

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Modular+Plugins
[2] https://www.mail-archive.com/dev@flink.apache.org/msg29894.html


Re: [VOTE] FLIP-64: Support for Temporary Objects in Table module

2019-10-15 Thread Bowen Li
+1

On Tue, Oct 15, 2019 at 5:09 AM Jark Wu  wrote:

> +1 from my side.
>
> Cheers,
> Jark
>
> On Tue, 15 Oct 2019 at 19:11, vino yang  wrote:
>
> > +1
> >
> > Best,
> > Vino
> >
> > Aljoscha Krettek  于2019年10月15日周二 下午4:31写道:
> >
> > > +1
> > >
> > > Best,
> > > Aljoscha
> > >
> > > > On 14. Oct 2019, at 14:55, Kurt Young  wrote:
> > > >
> > > > +1
> > > >
> > > > Best,
> > > > Kurt
> > > >
> > > >
> > > > On Fri, Oct 11, 2019 at 1:39 PM Dawid Wysakowicz <
> > dwysakow...@apache.org
> > > >
> > > > wrote:
> > > >
> > > >> Hi everyone,
> > > >> I would like to start a vote on FLIP-64. The discussion seems to
> have
> > > >> reached an agreement.
> > > >>
> > > >> Please vote for the following design document:
> > > >>
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-64%3A+Support+for+Temporary+Objects+in+Table+module
> > > >>
> > > >>
> > > >> The discussion can be found at:
> > > >>
> > > >>
> > > >>
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-64-Support-for-Temporary-Objects-in-Table-module-td32684.html
> > > >>
> > > >>
> > > >> This voting will be open for at least 72 hours. I'll try to close it
> > on
> > > >> 2019-10-16 14:00 UTC, unless there is an objection or not enough
> > votes.
> > > >>
> > > >> Best,
> > > >>
> > > >> Dawid
> > > >>
> > > >>
> > > >>
> > >
> > >
> >
>


Re: [Discussion] FLIP-79 Flink Function DDL Support

2019-10-15 Thread Bowen Li
Hi Zhenqiu,

Thanks for taking on this effort!

A couple questions:
- Though this FLIP is about function DDL, can we also think about how the
created functions would be mapped to CatalogFunction, and see if we need to
modify the CatalogFunction interface? Syntax changes need to be backed by the
backend.
- Can we define a clearer, smaller scope targeting Flink 1.10 among all
the proposed changes? The current overall scope seems quite wide, and
it may be unrealistic to get everything in a single release, or even a
couple. However, I believe the most common user story can be something as
simple as "being able to create and persist a Java class-based UDF and use
it later in queries" (a rough sketch below), which would add great value for
most Flink users and is achievable in 1.10.
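
To illustrate that user story, here's a minimal sketch; the DDL syntax and
all names are purely illustrative, pending what FLIP-79 finally settles on:

// a plain Java UDF that a user would want to persist and reuse
import org.apache.flink.table.functions.ScalarFunction;

public class MyUpper extends ScalarFunction {
    public String eval(String s) {
        return s == null ? null : s.toUpperCase();
    }
}

// and later, in another session, something along the lines of:
//   CREATE FUNCTION mydb.my_upper AS 'com.example.MyUpper';
//   SELECT my_upper(name) FROM users;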

Bowen

On Sun, Oct 13, 2019 at 10:46 PM Peter Huang 
wrote:

> Dear Community,
>
> FLIP-79 Flink Function DDL Support
> <
> https://docs.google.com/document/d/16kkHlis80s61ifnIahCj-0IEdy5NJ1z-vGEJd_JuLog/edit#
> >
>
> This proposal aims to support function DDL with the consideration of SQL
> syntax, language compliance, and advanced external UDF lib registration.
> The Flink DDL was initially proposed and discussed in the design
> <
> https://docs.google.com/document/d/1TTP-GCC8wSsibJaSUyFZ_5NBAHYEB1FVmPpP7RgDGBA/edit#heading=h.wpsqidkaaoil
> >
> [1] by Shuyi Chen and Timo. As the initial discussion mainly focused on
> table, type, and view, FLIP-69 [2] extends it with a more detailed
> discussion of DDL for catalog, database, and function. Originally, the
> function DDL was under the scope of FLIP-69. After some discussion
>  with the community, we
> found that there are several ongoing efforts, such as FLIP-64 [3], FLIP-65
> [4], and FLIP-78 [5]. As they will directly impact the SQL syntax of
> function DDL, this proposal aims to describe the problem clearly, take the
> existing work into consideration, and make sure the design aligns with the
> ongoing API changes for temporary objects and type inference for UDFs
> defined in different languages.
>
> The FLIP outlines the requirements from related work, and proposes a SQL
> syntax to meet those requirements. The corresponding implementation is also
> discussed. Please kindly review and give feedback.
>
>
> Best Regards
> Peter Huang
>


Re: [VOTE] FLIP-68: Extend Core Table System with Modular Plugins

2019-10-15 Thread Bowen Li
Sorry, please ignore this thread, as the FLIP's name should be "Extend Core
Table System with Pluggable Modules".

On Tue, Oct 15, 2019 at 9:59 AM Bowen Li  wrote:

> Hi all,
>
> I'd like to kick off a voting thread for FLIP-68: Extend Core Table System
> with Modular Plugins [1], as we have reached consensus in [2].
>
> The voting period will be open for at least 72 hours, ending at 5pm Oct
> 18, UTC.
>
> Thanks,
> Bowen
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Modular+Plugins
> [2] https://www.mail-archive.com/dev@flink.apache.org/msg29894.html
>
>


[VOTE] FLIP-68: Extend Core Table System with Pluggable Modules

2019-10-15 Thread Bowen Li
Hi all,

I'd like to kick off a voting thread for FLIP-68: Extend Core Table System
with Pluggable Modules [1], as we have reached consensus in [2].

The voting period will be open for at least 72 hours, ending at 7pm Oct 18
UTC.

Thanks,
Bowen

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Pluggable+Modules
[2] https://www.mail-archive.com/dev@flink.apache.org/msg29894.html


Re: [VOTE] Drop Python 2 support for 1.10

2019-10-15 Thread Bowen Li
+1

On Sun, Oct 13, 2019 at 10:54 PM Hequn Cheng  wrote:

> +1
>
> Thanks a lot for driving this, Dian!
>
> On Mon, Oct 14, 2019 at 1:46 PM jincheng sun 
> wrote:
>
> > +1
> >
> > > On Mon, Oct 14, 2019 at 1:21 PM, Dian Fu  wrote:
> >
> > > Hi all,
> > >
> > > I would like to start the vote for "Drop Python 2 support for 1.10",
> > which
> > > is discussed and reached a consensus in the discussion thread[1].
> > >
> > > The vote will be open for at least 72 hours. Unless there is an
> > objection,
> > > I will try to close it by Oct 17, 2019 18:00 UTC if we have received
> > > sufficient votes.
> > >
> > > Regards,
> > > Dian
> > >
> > > [1]
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Drop-Python-2-support-for-1-10-td33824.html
> > > <
> > >
> >
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Drop-Python-2-support-for-1-10-td33824.html
> > > >
> >
>


Re: [DISCUSS] FLIP-72: Introduce Pulsar Connector

2019-10-16 Thread Bowen Li
Hi Yijie,

Per the discussion, maybe you can move the pulsar source to the 'future
work' section of the FLIP for now?

Besides, the FLIP seems quite rough at the moment, and I'd recommend adding
more details.

A few questions, mainly regarding the proposed pulsar catalog:

   - Can you provide some background on the pulsar schema registry and how
   it works?
   - The proposed design of the pulsar catalog is very vague right now; can
   you share some details of how a pulsar catalog would work internally? E.g.
  - which APIs does it support exactly? E.g. I see from your prototype
  that table creation is supported but not alteration.
  - is it going to connect to a pulsar schema registry via an HTTP
  client or a pulsar client, etc.?
  - will it be able to handle multiple versions of pulsar, or just one?
  How is compatibility handled between different Flink-Pulsar versions?
  - will it support only reading from the pulsar schema registry, or both
  read/write? Will it work end-to-end in Flink SQL for users to create and
  manipulate a pulsar table such as "CREATE TABLE t WITH
  PROPERTIES(type=pulsar)" and "DROP TABLE t" (see the sketch after this
  list)?
  - Is a pulsar topic always going to be a non-partitioned table? How is a
  partitioned topic mapped to a Flink table?
   - How to map Flink's catalog/database namespace to pulsar's multi-tenant
   namespaces? I'm not very familiar with how multi-tenancy works in pulsar,
   and some background context/use cases may help here too. E.g.
  - can a pulsar client/consumer/producer serve multiple tenants at the
  same time?
  - how does authentication work in pulsar's multi-tenancy and the
  catalog? Asking since I didn't see username/password configs in the
  proposed pulsar catalog.
  - the FLIP seems to propose mapping a pulsar cluster and
  'tenant/namespace' respectively to Flink's 'catalog' and 'database'. I
  wonder whether that fully makes sense, or whether we should actually map
  "tenant" to "catalog" and "namespace" to "database"?

Cheers,
Bowen

On Fri, Sep 20, 2019 at 1:16 AM Yijie Shen 
wrote:

> Hi everyone,
>
> Per discussion in the previous thread
> <
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Contribute-Pulsar-Flink-connector-back-to-Flink-tc32538.html
> >,
> I have created FLIP-72 to kick off a more detailed discussion on the Flink
> Pulsar connector:
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-72%3A+Introduce+Pulsar+Connector
>
> In short, the connector has the following features:
>
>-
>
>Pulsar as a streaming source with exactly-once guarantee.
>-
>
>Sink streaming results to Pulsar with at-least-once semantics.
>-
>
>Built upon Flink's new Table API type system (FLIP-37
><
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-37%3A+Rework+of+the+Table+API+Type+System
> >
>), and can automatically (de)serialize messages with the help of Pulsar
>schema.
>-
>
>Integrates with Flink's new Catalog API (FLIP-30
><
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs
> >),
>which enables the use of Pulsar topics as tables in Table API as well as
>SQL client.
>
>
>
> https://docs.google.com/document/d/1rES79eKhkJxrRfQp1b3u8LB2aPaq-6JaDHDPJIA8kMY/edit#heading=h.28v5v23yeq1u
>
>
> Would love to hear your thoughts on this.
>
> Best,
> Yijie
>


Re: [VOTE] FLIP-68: Extend Core Table System with Pluggable Modules

2019-10-17 Thread Bowen Li
Thanks for pointing them out, Dawid. I've gone over the overall doc again
and corrected the above typos.

- ModuleManager#listFunctions() returns Set
- ModuleManager holds a LinkedHashMap to keep loaded
modules in order
- ModuleFactory#createModule(Map) and returns Module
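
For readability, here is a rough, illustrative sketch of the corrected
signatures, assuming the generic parameters (which did not survive in the
plain text above) were Set<String>, LinkedHashMap<String, Module>, and
Map<String, String>; the method bodies are only a sketch, not the actual
implementation.

    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    import org.apache.flink.table.factories.TableFactory;

    interface Module {
        // Names of the functions provided by this module.
        Set<String> listFunctions();
    }

    interface ModuleFactory extends TableFactory {
        // Takes only the properties, with no module-name parameter.
        Module createModule(Map<String, String> properties);
    }

    class ModuleManager {
        // A LinkedHashMap keeps modules in the order they were loaded, and
        // its keys allow unloading a module by name.
        private final LinkedHashMap<String, Module> loadedModules = new LinkedHashMap<>();

        Set<String> listFunctions() {
            Set<String> names = new HashSet<>();
            for (Module module : loadedModules.values()) {
                names.addAll(module.listFunctions());
            }
            return names;
        }
    }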


On Thu, Oct 17, 2019 at 2:27 AM Dawid Wysakowicz 
wrote:

> Hi all,
>
> Generally I'm fine with the design. Before I cast my +1 I wanted to
> clarify one thing. Is the module name in ModuleFactory#createModule
> necessary? Can't it be just?:
>
> interface ModuleFactory extends TableFactory {
>Module createModule(Map properties);
> }
>
> The name under which the module was registered should not affect the
> implementation of the module as far as I can tell. Could we remove this
> parameter from the method?
>
> I also spotted a few "bugs" in the design, but they do not affect the
> outcome of the design, as they are either just artifacts of refactoring the
> FLIP or affect only the internal implementation:
>
>- there is a typo in the ModuleFactory#createModule return type. It
>should be Module instead of Plugin
>- the return type of ModuleManager:listFunctions() should be
>Set instead of Set>, right?
>- we cannot use list to store the modules in ModuleManager if I am not
>mistaken. We need to store them in a Map to e.g. be able to unload the
>modules by its name.
>
> Best,
>
> Dawid
> On 17/10/2019 04:16, Jark Wu wrote:
>
> +1
>
> Thanks,
> Jark
>
> On Thu, 17 Oct 2019 at 04:44, Peter Huang  
> 
> wrote:
>
>
> +1 Thanks
>
> On Wed, Oct 16, 2019 at 12:48 PM Xuefu Z  
>  wrote:
>
>
> +1 (non-binding)
>
> On Wed, Oct 16, 2019 at 2:26 AM Timo Walther  
>  wrote:
>
>
> +1
>
> Thanks,
> Timo
>
>
> On 15.10.19 20:50, Bowen Li wrote:
>
> Hi all,
>
> I'd like to kick off a voting thread for FLIP-68: Extend Core Table
>
> System
>
> with Pluggable Modules [1], as we have reached consensus in [2].
>
> The voting period will be open for at least 72 hours, ending at 7pm
>
> Oct
>
> 18
>
> UTC.
>
> Thanks,
> Bowen
>
> [1]
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Pluggable+Modules
>
> [2] https://www.mail-archive.com/dev@flink.apache.org/msg29894.html
>
> --
> Xuefu Zhang
>
> "In Honey We Trust!"
>
>
>


Re: [VOTE] FLIP-68: Extend Core Table System with Pluggable Modules

2019-10-18 Thread Bowen Li
Thanks Dawid and everyone.

I'm hereby glad to announce that we have unanimously approved this FLIP
with 5 +1 votes, 3 binding (Timo, Jark, Dawid) and 2 non-binding (Xuefu,
Peter), and no -1.

This FLIP shall move to the implementation phase and will target Flink 1.10.


On Fri, Oct 18, 2019 at 1:29 AM Dawid Wysakowicz 
wrote:

> Thank you Bowen for the update. Great to hear we can have just
>
> ModuleFactory#createModule(Map)
>
> +1 for the FLIP. Nice design BTW ;)
>
> Best,
>
> Dawid
>
>
> On 17/10/2019 18:36, Bowen Li wrote:
> > Thanks for pointing them out, Dawid. I've gone over the overall doc again
> > and corrected the above typos.
> >
> > - ModuleManager#listFunctions() returns Set
> > - ModuleManager holds a LinkedHashMap to keep loaded
> > modules in order
> > - ModuleFactory#createModule(Map) and returns Module
> >
> >
> > On Thu, Oct 17, 2019 at 2:27 AM Dawid Wysakowicz  >
> > wrote:
> >
> >> Hi all,
> >>
> >> Generally I'm fine with the design. Before I cast my +1 I wanted to
> >> clarify one thing. Is the module name in ModuleFactory#createModule
> >> necessary? Can't it be just?:
> >>
> >> interface ModuleFactory extends TableFactory {
> >>Module createModule(Map properties);
> >> }
> >>
> >> The name under which the module was registered should not affect the
> >> implementation of the module as far as I can tell. Could we remove this
> >> parameter from the method?
> >>
> >> I also spotted a few "bugs" in the design, but they do not affect the
> >> outcome of the design, as they are either just artifacts of refactoring
> the
> >> FLIP or affect only the internal implementation:
> >>
> >>- there is a typo in the ModuleFactory#createModule return type. It
> >>should be Module instead of Plugin
> >>- the return type of ModuleManager:listFunctions() should be
> >>Set instead of Set>, right?
> >>- we cannot use list to store the modules in ModuleManager if I am
> not
> >>mistaken. We need to store them in a Map to e.g. be able to unload
> the
> >>modules by its name.
> >>
> >> Best,
> >>
> >> Dawid
> >> On 17/10/2019 04:16, Jark Wu wrote:
> >>
> >> +1
> >>
> >> Thanks,
> >> Jark
> >>
> >> On Thu, 17 Oct 2019 at 04:44, Peter Huang 
> 
> >> wrote:
> >>
> >>
> >> +1 Thanks
> >>
> >> On Wed, Oct 16, 2019 at 12:48 PM Xuefu Z  <
> usxu...@gmail.com> wrote:
> >>
> >>
> >> +1 (non-binding)
> >>
> >> On Wed, Oct 16, 2019 at 2:26 AM Timo Walther  <
> twal...@apache.org> wrote:
> >>
> >>
> >> +1
> >>
> >> Thanks,
> >> Timo
> >>
> >>
> >> On 15.10.19 20:50, Bowen Li wrote:
> >>
> >> Hi all,
> >>
> >> I'd like to kick off a voting thread for FLIP-68: Extend Core Table
> >>
> >> System
> >>
> >> with Pluggable Modules [1], as we have reached consensus in [2].
> >>
> >> The voting period will be open for at least 72 hours, ending at 7pm
> >>
> >> Oct
> >>
> >> 18
> >>
> >> UTC.
> >>
> >> Thanks,
> >> Bowen
> >>
> >> [1]
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+System+with+Pluggable+Modules
> >>
> >> [2] https://www.mail-archive.com/dev@flink.apache.org/msg29894.html
> >>
> >> --
> >> Xuefu Zhang
> >>
> >> "In Honey We Trust!"
> >>
> >>
> >>
>
>


Re: [ANNOUNCE] Becket Qin joins the Flink PMC

2019-10-29 Thread Bowen Li
Congrats Becket!

On Tue, Oct 29, 2019 at 06:32 Till Rohrmann  wrote:

> Congrats Becket :-)
>
> On Tue, Oct 29, 2019 at 10:27 AM Yang Wang  wrote:
>
> > Congratulations Becket :)
> >
> > Best,
> > Yang
> >
> > > On Tue, Oct 29, 2019 at 4:31 PM, Vijay Bhaskar  wrote:
> >
> > > Congratulations Becket
> > >
> > > Regards
> > > Bhaskar
> > >
> > > On Tue, Oct 29, 2019 at 1:53 PM Danny Chan 
> wrote:
> > >
> > > > Congratulations :)
> > > >
> > > > Best,
> > > > Danny Chan
> > > > On Oct 29, 2019 at 4:14 PM +0800, dev@flink.apache.org wrote:
> > > > >
> > > > > Congratulations :)
> > > >
> > >
> >
>

