Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-17 Thread Jungtaek Lim
2020 at 3:06 PM Jungtaek Lim wrote: > Hi German, > > option 1 isn't about "deleting" the old files, as your input directory may > be accessed by multiple queries. Kafka centralizes the maintenance of input > data hence possible to apply retention without problem. >

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread Jungtaek Lim
compatible? > > How I see it, I think It would be interesting to have a retention period > to delete old files and/or the possibility of indicating an offset > (Timestamp). It would be very "similar" to how we do it with kafka. > > WDYT? > > On Thu, 30 Jul 2020 at 2

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Jungtaek Lim
+1 (non-binding, I guess) Thanks for raising the issue and sorting it out! On Fri, Jul 31, 2020 at 6:47 AM Holden Karau wrote: > Hi Spark Developers, > > After the discussion of the proposal to amend Spark committer guidelines, > it appears folks are generally in agreement on policy clarificati

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread Jungtaek Lim
d to consider is the listing cost. is there > any way we can avoid listing the entire base directory and then filtering > out the new files. if the data is organized as partitions using date, will > it help to list only those partitions where new files were added? > > > On Thu, J
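A minimal sketch of the partition-pruned listing idea raised in the question above. It assumes a hypothetical layout in which the base directory holds `date=YYYY-MM-DD` subdirectories. The function name `list_new_files` and the layout are illustrative assumptions, not how Spark's file stream source lists files.

```python
import os
from datetime import date


def list_new_files(base_dir, since):
    """List files only under date partitions on or after `since`.

    Assumes a hypothetical layout: base_dir/date=YYYY-MM-DD/<files>.
    Entries that are not date partitions are skipped.
    """
    found = []
    for entry in sorted(os.listdir(base_dir)):
        if not entry.startswith("date="):
            continue
        try:
            partition_date = date.fromisoformat(entry[len("date="):])
        except ValueError:
            continue  # name does not carry a valid date
        if partition_date < since:
            continue  # older partition: its contents are never listed
        partition_path = os.path.join(base_dir, entry)
        if not os.path.isdir(partition_path):
            continue
        for name in sorted(os.listdir(partition_path)):
            found.append(os.path.join(partition_path, name))
    return found
```

The base directory is still listed once, but the per-file listing cost is paid only for partitions that can contain new files.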

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-29 Thread Jungtaek Lim
bump, is there any interest in this topic? On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim wrote: > (Just to add rationalization, you can refer to the original mail thread on > dev@ list to see efforts on addressing problems in file stream source / > sink - > https://lists.apache.org

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-19 Thread Jungtaek Lim
at 6:18 AM Jungtaek Lim wrote: > Hi devs, > > As I have been going through the various issues on metadata log growing, > it's not only the issue of sink, but also the issue of source. > Unlike sink metadata log which entries should be available to the readers, > the sour

[DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-19 Thread Jungtaek Lim
timestamp, which Spark will read from such timestamp and forward order. This doesn't cover all use cases of "latestFirst", but "latestFirst" doesn't seem to be natural with the concept of SS (think about watermark), I'd prefer to support alternatives instead of struggling with "latestFirst". Would like to hear your opinions. Thanks, Jungtaek Lim (HeartSaVioR)
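The "start from a timestamp and read forward" alternative mentioned above can be sketched roughly as follows. This is only an illustration using file modification times on a local directory. The helper `files_from_timestamp` is a hypothetical name and is not an existing Spark option or API.

```python
import os


def files_from_timestamp(directory, start_ts):
    """Return (mtime, path) pairs with mtime >= start_ts, oldest first.

    `start_ts` is a POSIX timestamp in seconds. Files older than it are
    ignored, so nothing needs to be remembered about them.
    """
    selected = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        mtime = os.path.getmtime(path)
        if mtime >= start_ts:
            selected.append((mtime, path))
    selected.sort()  # forward order: oldest eligible file first
    return selected
```

Reading in forward order keeps the processing order aligned with event time, which is the concern the message raises about "latestFirst" and watermarks.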

Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Jungtaek Lim
For me merge script worked for python 2.7, but I got some trouble with the encoding issue (probably from contributor's name) so now I use the merge script with virtualenv & python 3.7.7. "python3" would be OK for me as well as it doesn't break virtualenv with python 3. On Sat, Jul 18, 2020 at 6:1

Re: [DISCUSS] -1s and commits

2020-07-16 Thread Jungtaek Lim
On Fri, Jul 17, 2020 at 8:06 AM Holden Karau wrote: > > > On Thu, Jul 16, 2020 at 3:34 PM Jungtaek Lim > wrote: > >> I agree with Wenchen that there are different topics. >> > I agree. I mentioned it in my postscript because I wanted to provide the > context

Re: [DISCUSS] -1s and commits

2020-07-16 Thread Jungtaek Lim
I agree with Wenchen that there are different topics. The policy of veto is obvious, as ASF doc describes it with explicitly saying non-overridable per project. In any way, the approach of resolving the situation should lead to voters withdrawing their vetoes. There's nothing to interpret differen

Re: Welcoming some new Apache Spark committers

2020-07-15 Thread Jungtaek Lim
wrote: >> >>> >>> Congratulations ! >>> >>> Regards, >>> Mridul >>> >>> On Tue, Jul 14, 2020 at 12:37 PM Matei Zaharia >>> wrote: >>> >>>> Hi all, >>>> >>>> The Spark PMC recently voted t

Re: [DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-07-12 Thread Jungtaek Lim
Just submitted the patch: https://github.com/apache/spark/pull/29077 On Tue, Jun 16, 2020 at 3:40 PM Jungtaek Lim wrote: > Bump this again. I filed SPARK-31985 [1] and plan to submit a PR in a > couple of days if there's no voice on the reason we should keep it. > > 1. https://

Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-09 Thread Jungtaek Lim
As a side note, I've raised patches for addressing two frequent flaky tests, CliSuite [1] and HiveSessionImplSuite [2]. Hope this helps to mitigate the situation. 1. https://github.com/apache/spark/pull/29036 2. https://github.com/apache/spark/pull/29039 On Thu, Jul 9, 2020 at 11:51 AM Hyukjin Kw

Re: m2 cache issues in Jenkins?

2020-07-06 Thread Jungtaek Lim
at 5:35 AM Jungtaek Lim wrote: > Could this be a flaky or persistent issue? It failed with Scala gendoc but > it didn't fail with the part the PR modified. It ran from worker-05. > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125121/consoleFull > >

Re: m2 cache issues in Jenkins?

2020-07-06 Thread Jungtaek Lim
>>> Shane, can we remove .m2 in worker machine 4? >>> >>> On Fri, Jul 3, 2020 at 8:18 AM, Jungtaek Lim >>> wrote: >>> >>>> Looks like Jenkins service itself becomes unstable. It took >>>> considerable time to just open the test report f

Re: m2 cache issues in Jenkins?

2020-07-02 Thread Jungtaek Lim
Looks like Jenkins service itself becomes unstable. It took considerable time to just open the test report for a specific build, and Jenkins doesn't pick the request on rebuild (retest this, please) in Github comment. On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon wrote: > Ah, okay. Actually there

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-01 Thread Jungtaek Lim
help. > >> > >> > >> > >> https://issues.apache.org/jira/browse/SPARK-32136 > >> > >> > >> > >> Thanks, > >> > >> Jason. > >> > >> > >> > >> From: Jungtaek Lim >

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-30 Thread Jungtaek Lim
>> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰 wrote: >> >>> I volunteer to be a release manager of 3.0.1, if nobody is working on >>> this. >>> >>> >>> -- Original Message -- >>> *From:* "Gengliang Wang"; >

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Jungtaek Lim
Does this count only "new features" (probably major), or also count "improvements"? I'm aware of a couple of improvements which should be ideally included in the next release, but if this counts only major new features then don't feel they should be listed. On Tue, Jun 30, 2020 at 1:32 AM Holden K

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-06-26 Thread Jungtaek Lim
coder in Spark 3.0.0 pulls the schema from serializer, which removes the problem. The remaining question is, would we like to fix it in 2.4.x? On Tue, May 26, 2020 at 2:54 PM Jungtaek Lim wrote: > I meant how to interpret Java Beans in Spark are not consistently defined. > > Unlike you

Re: Handling user-facing metadata issues on file stream source & sink

2020-06-25 Thread Jungtaek Lim
which was throwing OOME. 1. https://github.com/apache/spark/pull/28904 On Sun, Jun 14, 2020 at 4:14 PM Jungtaek Lim wrote: > Bump again - hope to get some traction because these issues are either > long-standing problems or noticeable improvements (each PR has numbers/UI > graph to s

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Jungtaek Lim
+1 on a 3.0.1 soon. Probably it would be nice if some Scala experts can take a look at https://issues.apache.org/jira/browse/SPARK-32051 and include the fix into 3.0.1 if possible. Looks like APIs designed to work with Scala 2.11 & Java bring ambiguity in Scala 2.12 & Java. On Wed, Jun 24, 2020 a

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Jungtaek Lim
Great, thanks all for your efforts on the huge step forward! On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon wrote: > Yay! > > On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan wrote: > >> Great job everyone ! Congratulations :-) >> >> Regards, >> Mridul >> >> On Thu, Jun 18, 2020 at 10:21 AM Reynold

Re: [DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-06-15 Thread Jungtaek Lim
Bump this again. I filed SPARK-31985 [1] and plan to submit a PR in a couple of days if there's no voice on the reason we should keep it. 1. https://issues.apache.org/jira/browse/SPARK-31985 On Thu, May 21, 2020 at 8:54 AM Jungtaek Lim wrote: > Let me share the effect on remo

Re: Handling user-facing metadata issues on file stream source & sink

2020-06-14 Thread Jungtaek Lim
m/apache/spark/pull/28422 2. https://github.com/apache/spark/pull/28363 3. https://github.com/apache/spark/pull/27620 4. https://github.com/apache/spark/pull/27649 5. https://github.com/apache/spark/pull/27694 On Fri, May 22, 2020 at 12:50 PM Jungtaek Lim wrote: > Worth noting that I got s

Re: Revisiting the idea of a Spark 2.5 transitional release

2020-06-12 Thread Jungtaek Lim
ong-standing issue or the feature has been provided for a long time in competitive products. Thanks, Jungtaek Lim (HeartSaVioR) 1. http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Spark-2-5-release-td27963.html#a27979 On Sat, Jun 13, 2020 at 10:13 AM Ryan Blue wrote: > +1 for a

Re: [vote] Apache Spark 3.0 RC3

2020-06-07 Thread Jungtaek Lim
I'm seeing the effort of including the correctness issue SPARK-28067 [1] to 3.0.0 via SPARK-31894 [2]. That doesn't seem to be a regression so technically doesn't block the release, so while it'd be good to weigh its worth (it requires some SS users to discard the state so might bring less frighten

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-25 Thread Jungtaek Lim
are seeing is that it's not checking the property names, just using > ordering, in your reproducer. That seems different? > > On Sun, May 24, 2020 at 3:00 AM Jungtaek Lim > wrote: > > > > OK I just went through the change, and the change breaks bunch of > existi

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-24 Thread Jungtaek Lim
eaking change and the difference would be confusing if we don't explain it enough. Any thoughts? On Mon, May 11, 2020 at 1:36 PM Jungtaek Lim wrote: > First case is not tied to the batch / streaming as Encoders.bean simply > fails when inferring schema. > > Second case is tied to

Re: Handling user-facing metadata issues on file stream source & sink

2020-05-21 Thread Jungtaek Lim
Worth noting that I got similar questions from the local community as well. These reporters didn't encounter an edge case; they encountered a critical issue in the normal running of a streaming query. On Fri, May 8, 2020 at 4:49 PM Jungtaek Lim wrote: > (bump to expose the discu

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-21 Thread Jungtaek Lim
Looks like there are newly discovered blocker issues. * https://issues.apache.org/jira/browse/SPARK-31786 * https://issues.apache.org/jira/browse/SPARK-31761 (not yet marked as blocker but according to JIRA comment it's a regression issue as well as correctness issue IMHO) Let's collect the l

Re: [DISCUSS] "complete" streaming output mode

2020-05-21 Thread Jungtaek Lim
> has to be added to maintain the mode. > > I mean, I would want all pipelines that I build to work magically without > me having to put any thought into it, but then I feel most people in this > email list would be out of jobs. These are typical considerations that you >

Re: [DISCUSS] "complete" streaming output mode

2020-05-20 Thread Jungtaek Lim
e to drop complete mode. But before then it's more important to build a consensus that complete mode is only used for few use case (we need to collect these use cases of course) and the cost of maintenance exceeds the benefit. For sure I'm open for disagreement. Thanks, Jungtaek Lim (Hear

Re: [DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-05-20 Thread Jungtaek Lim
about compatibility, etc. while it never be used in production. On Tue, May 19, 2020 at 1:14 PM Jungtaek Lim wrote: > Hi devs, > > during experiment on complete mode I realized we left some incomplete code > parts on supporting aggregation for continuous mode. (shuffle & coalesce) >

[DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-05-18 Thread Jungtaek Lim
ct anyone is working on this). The functionality is undocumented (as the work was only done partially) and continuous mode is experimental so I don't feel risks to get rid of the part. What do you think? If it makes sense then I'll raise a PR to get rid of the incomplete codes. T

[DISCUSS] "complete" streaming output mode

2020-05-18 Thread Jungtaek Lim
n make a consensus on the viewpoint of complete mode and drop supporting it if we agree with. Would like to hear everyone's opinions. It would be great if someone brings the valid cases where complete mode is being used in production. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://issues.apache

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-18 Thread Jungtaek Lim
Looks like the priority of SPARK-31706 [1] is incorrectly marked - it sounds like a blocker, as SPARK-26785 [2] / SPARK-26956 [3] dropped the feature of "update" on streaming output mode (as a result) and SPARK-31706 restores it. SPARK-31706 is not yet resolved, which may be valid reason to roll a

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Jungtaek Lim
the parameter (even if it > is hidden) > > On Tue, May 12, 2020 at 12:46 PM Ryan Blue wrote: > >> +1 for the approach Jungtaek suggests. That will avoid needing to support >> behavior that is not well understood with minimal changes. >> >> On Tue, May

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Jungtaek Lim
Before I forget, we'd better not forget to change the doc, as create table doc looks to represent current syntax which will be incorrect later. On Tue, May 12, 2020 at 5:32 PM Jungtaek Lim wrote: > It's not only for end users, but also for us. Spark itself uses the config >

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-12 Thread Jungtaek Lim
wrote: > >> I'm all for getting the unified syntax into master. The only issue >> appears to be whether or not to pass the presence of the EXTERNAL keyword >> through to a catalog in v2. Maybe it's time to start a discuss thread for >> that issue so we're

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Jungtaek Lim
Btw another wondering here is, is it good to retain the flag on master as an intermediate step? Wouldn't it be better for us to start "unified create table syntax" from scratch? On Tue, May 12, 2020 at 6:50 AM Jungtaek Lim wrote: > I'm sorry, but I have to agree with Ry

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-11 Thread Jungtaek Lim
. >> >> Unless we plan to NOT support the behavior >> when spark.sql.legacy.createHiveTableByDefault.enabled is disabled, we >> should not ship Spark 3.0 with SPARK-30098. Otherwise, we will have to deal >> with this problem for years to come. >> >> On Mon,

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-10 Thread Jungtaek Lim
only relying on the sequence of the columns while matching row with schema, then it could be affected.) On Mon, May 11, 2020 at 1:24 PM Wenchen Fan wrote: > is it a problem only for streaming or it affects batch queries as well? > > On Fri, May 8, 2020 at 11:42 PM Jungtaek Lim > wr

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-05-10 Thread Jungtaek Lim
or EXTERNAL is > specified. This gives us more time to think about how to do it in 3.1. > > If you have other ideas, please reply to this thread. > > Thanks, > Wenchen > > On Thu, Mar 26, 2020 at 7:28 AM Jungtaek Lim > wrote: > >> Thanks, filed SPARK-31257 >>

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-08 Thread Jungtaek Lim
, 2020 at 5:50 PM Wenchen Fan wrote: > Can you give some simple examples to demonstrate the problem? I think the > inconsistency would bring problems but don't know how. > > On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim > wrote: > >> (bump to expose the discussion

Re: Handling user-facing metadata issues on file stream source & sink

2020-05-08 Thread Jungtaek Lim
(bump to expose the discussion to more readers) On Mon, May 4, 2020 at 5:45 PM Jungtaek Lim wrote: > Hi devs, > > I'm seeing more and more structured streaming end users encountered the > metadata issues on file stream source and sink. They have been known-issues > an

Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-08 Thread Jungtaek Lim
(bump to expose the discussion to more readers) On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim wrote: > Hi devs, > > There're couple of issues being reported on the user@ mailing list which > results in being affected by inconsistent schema on Encoders.bean. > > 1. Ty

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-05-07 Thread Jungtaek Lim
I don't see any new features/functions for these blockers. For SPARK-31257 (which is filed and marked as a blocker from me), I agree unifying create table syntax shouldn't be a blocker for Spark 3.0.0, as that is a new feature, but even if we put the proposal aside, the problem remains the same and I

Handling user-facing metadata issues on file stream source & sink

2020-05-04 Thread Jungtaek Lim
know there're couple of alternatives, but I don't think starter would start from there. End users may just try to find alternatives - not alternative of data source, but alternative of streaming processing framework. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://lists.apache.org/thread.

Inconsistent schema on Encoders.bean (reported issues from user@)

2020-05-04 Thread Jungtaek Lim
want to at least document the ideal form of the bean Spark expects. Would like to hear opinions on this. Thanks, Jungtaek Lim (HeartSaVioR) 1. https://lists.apache.org/thread.html/r8f8e680e02955cdf05b4dd34c60a9868288fd10a03f1b1b8627f3d84%40%3Cuser.spark.apache.org%3E 2. http://mail-archives.apach

Re: InferFiltersFromConstraints logical optimization rule and Optimizer.defaultBatches?

2020-04-14 Thread Jungtaek Lim
Please correct me if I'm missing something. At a glance, your statements look correct if I understand correctly. I guess it might be simply missed, but it sounds as pretty trivial one as only a line can be removed safely which won't affect anything. (filterNot should be retained even we remove the

Re: Automatic PR labeling

2020-04-13 Thread Jungtaek Lim
Nice addition, looks pretty good! On Tue, Apr 14, 2020 at 1:17 AM Xiao Li wrote: > Looks great! > > Thanks for making this happen. This is pretty helpful. > > Xiao > > On Sun, Apr 12, 2020 at 11:52 PM Hyukjin Kwon wrote: > >> Okay, now it started to work. Let's see if it works well! >> >> 2020

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Jungtaek Lim
, 2020 at 10:01 AM Xiao Li wrote: > >> Only the low-risk or high-value bug fixes, and the documentation changes >> are allowed to merge to branch-3.0. I expect all the committers are >> following the same rules like what we did in the previous releases. >> >> Xiao >

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Jungtaek Lim
hesitate to test the RC1 (see how many people have tested RC1 in this thread), as they probably need to test the same with RC2. On Thu, Apr 9, 2020 at 5:50 PM Jungtaek Lim wrote: > I went through some manual tests for the new features of Structured > Streaming in Spark 3.0.0. (Please let m

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Jungtaek Lim
I went through some manual tests for the new features of Structured Streaming in Spark 3.0.0. (Please let me know if there're more features we'd like to test manually.) * file source cleanup - both "archive" and "delete" work. Query fails as expected when the input directory is the output direct

Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-02 Thread Jungtaek Lim
On Fri, Apr 3, 2020 at 12:31 AM Sean Owen wrote: > On Wed, Apr 1, 2020 at 10:28 PM Jungtaek Lim > wrote: > > The definition of "latest version" would matter, especially there's a > time we prepare minor+ version release. > > > > For example, lots of p

Re: [DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Jungtaek Lim
>>> an Improvement applies to; it just isn't that useful. We aren't >>>> generally going to back-port improvements anyway. >>>> >>>> Even for bugs, we don't really need to know that a bug in master >>>> affects 2.4.5, 2.4.4, 2.4.3

[DISCUSS] filling affected versions on JIRA issue

2020-04-01 Thread Jungtaek Lim
and in worse case (there's no such UT) we should do E2E manual verification which I would give up. There should have some balance/threshold, and the balance should be the thing the community has a consensus. Would like to hear everyone's voice on this. Thanks, Jungtaek Lim (HeartSaVioR)

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-03-31 Thread Jungtaek Lim
-1 (non-binding) I filed SPARK-31257 as a blocker, and now others start to agree that it's a critical issue which should be dealt with before releasing Spark 3.0. Please refer to recent comments in https://github.com/apache/spark/pull/28026 It won't delay the release pretty much, as we can either revert

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-30 Thread Jungtaek Lim
set it to "false" and deal with it. WDYT? On Tue, Mar 31, 2020 at 7:48 AM Jungtaek Lim wrote: > I'm not sure I understand the direction of resolution. I'm not saying it's > just a confusion - it's "ambiguous" and "indeterministic". >

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-30 Thread Jungtaek Lim
e: > >> I don't have a dog in this race, but: Would it be OK to ship 3.0 with >> some release notes and/or prominent documentation calling out this issue, >> and then fixing it in 3.0.1? >> >> On Sat, Mar 28, 2020 at 8:45 PM Jungtaek Lim < >> kabhwan.open

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-28 Thread Jungtaek Lim
Sat, Mar 28, 2020 at 11:51 AM, Sean Owen wrote: > >> I'm also curious - there no open blockers for 3.0 but I know a few are >> still floating around open to revert changes. What is the status there? >> From my field of view I'm not aware of other blocking i

Re: Release Manager's official `branch-3.0` Assessment?

2020-03-27 Thread Jungtaek Lim
entrate these things. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Mar 25, 2020 at 1:52 PM Xiao Li wrote: > Let us try to finish the remaining major blockers in the next few days. > For example, https://issues.apache.org/jira/browse/SPARK-31085 > > +1 to cut the RC even if we still have the b

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-25 Thread Jungtaek Lim
bes > this and doesn't appear to be done. > > On Wed, Mar 25, 2020 at 4:03 PM Jungtaek Lim > wrote: > >> UPDATE: Sorry I just missed the PR ( >> https://github.com/apache/spark/pull/28026). I still think it'd be nice >> to avoid recycling the JIRA iss

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-25 Thread Jungtaek Lim
UPDATE: Sorry I just missed the PR ( https://github.com/apache/spark/pull/28026). I still think it'd be nice to avoid recycling the JIRA issue which was resolved before. Shall we have a new JIRA issue with linking to SPARK-30098, and set proper priority? On Thu, Mar 26, 2020 at 7:59 AM Jun

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-25 Thread Jungtaek Lim
Would it be better to prioritize this to make sure the change is included in Spark 3.0? (Maybe filing an issue and set as a blocker) Looks like there's consensus that SPARK-30098 brought ambiguous issue which should be fixed (though the consideration of severity seems to be different), and once we

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-19 Thread Jungtaek Lim
Anything would be OK if the create table DDL provides a "clear way" to expect the table provider "before" they run the query. Great news that it doesn't require major rework - looking forward to the PR. Thanks again to jump in and sort this out. - Jungtaek Lim (HeartSaVi

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
if Hive specific clauses are being used. Yes as I said earlier it may make end users' query to be changed, but better than uncertain. Btw, if the main purpose to add native syntax and change it by default is to discontinue supporting Hive create table rule sooner, simply dropping rule 2 with

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
their query fits and they don't need to spend a lot > of time understanding the subtle difference between these 2 syntaxes. > > On Wed, Mar 18, 2020 at 7:01 PM Jungtaek Lim > wrote: > >> A bit correction: the example I provided for vice versa is not really a >> corr

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
A bit correction: the example I provided for vice versa is not really a correct case for vice versa. It's actually same case (intended to use rule 2 which is not default) but different result. On Wed, Mar 18, 2020 at 7:22 PM Jungtaek Lim wrote: > My concern is that although we simp

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
> fields? >> >> >> On Wed, Mar 18, 2020 at 4:38 PM Wenchen Fan wrote: >> >>> I think the general guideline is to promote Spark's own CREATE TABLE >>> syntax instead of the Hive one. Previously these two rules are mutually >>> exclusive

Re: [DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-18 Thread Jungtaek Lim
just writes CREATE TABLE without USING or ROW > FORMAT or STORED AS, does it matter what table we create? Internally the > parser rules conflict and we pick the native syntax depending on the rule > order. But the user-facing behavior looks fine. > > CREATE EXTERNAL TABLE is a prob

[DISCUSS] Resolve ambiguous parser rule between two "create table"s

2020-03-15 Thread Jungtaek Lim
ry if they intend to create Hive table. (Given we will also provide legacy option I'm feeling this is acceptable.) 2. Define "ROW FORMAT" or "STORED AS" as mandatory one. pros. Less invasive for existing queries. cons. Less intuitive, because they have been optional and now be

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-03-12 Thread Jungtaek Lim
Context.jsonRDD >>> - SQLContext.load >>> - SQLContext.jdbc >>> >>> If you think these APIs should not be added back, let me know and we can >>> discuss the items further. In general, I think we should provide more >>> evidences and discuss

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-03-07 Thread Jungtaek Lim
+1 for Sean as well. Moreover, as I added a voice on previous thread, if we want to be strict with retaining public API, what we really need to do along with this is having similar level or stricter of policy for adding public API. If we don't apply the policy symmetrically, problems would go wors

Re: Breaking API changes in Spark 3.0

2020-02-19 Thread Jungtaek Lim
e balance on this to avoid restricting ourselves too much, but I feel there's no balance now - most things are just going through PRs without discussion. It would be ideal we have time to consider on this. On Thu, Feb 20, 2020 at 8:50 AM Jungtaek Lim wrote: > Apache Spark 2.0 was release

Re: Breaking API changes in Spark 3.0

2020-02-19 Thread Jungtaek Lim
Apache Spark 2.0 was released in July 2016. Assuming the project has been trying the best to follow the semantic versioning, it is "more than three years" to wait for the breaking changes. What the community misses to address necessary breaking changes would be going to be technical debts for anoth

Re: Request to document the direct relationship between other configurations

2020-02-13 Thread Jungtaek Lim
address in this thread. > > Shall we conclude this thread by deciding to document the direct > relationship between configurations preferably in one prevailing style? > > > On Fri, Feb 14, 2020 at 11:36 AM, Jungtaek Lim > wrote: > >> Even spark.dynamicAllocation.* doesn't f

Re: Request to document the direct relationship between other configurations

2020-02-13 Thread Jungtaek Lim
's do it as our final goal. Otherwise, let's > simplify it to reduce the overhead rather than having a policy for the > mid-term specifically. > > > On Thu, Feb 13, 2020 at 12:24 PM, Jungtaek Lim > wrote: > >> I tend to agree that there should be a time to make thing be consi

Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Jungtaek Lim
+1 Thanks for the proposal. Looks very reasonable to me. On Thu, Feb 13, 2020 at 10:53 AM Hyukjin Kwon wrote: > +1. > > On Thu, Feb 13, 2020 at 9:30 AM, Gengliang Wang > wrote: > >> +1, this is really helpful. We should make the SQL configurations >> consistent and more readable. >> >> On Wed, Feb 12,

Re: Request to document the direct relationship between other configurations

2020-02-12 Thread Jungtaek Lim
ts on here. Could I >>>>>> ask what you guys think about this in general? >>>>>> >>>>>> On Wed, Feb 12, 2020 at 12:02 PM, Hyukjin Kwon wrote: >>>>>> >>>>>>> To do that, we should explicitly document such st

Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Jungtaek Lim
k setting 'spark.eventLog.rolling.maxFileSize' > automatically enables rolling. Then, they realise the log is not rolling > later after the file > size becomes bigger. > > > On Wed, Feb 12, 2020 at 10:47 AM, Jungtaek Lim > wrote: > >> I'm sorry if I miss somethi
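The pitfall described above (setting only the max file size and expecting rolling to start) can be illustrated with a small check over a plain configuration dictionary. This sketch assumes that rolling is controlled by `spark.eventLog.rolling.enabled` and that it is off unless set to true, as the message implies. The two helper functions are hypothetical and are not part of Spark.

```python
ROLLING_ENABLED = "spark.eventLog.rolling.enabled"
ROLLING_MAX_SIZE = "spark.eventLog.rolling.maxFileSize"


def rolling_effective(conf):
    """True only when rolling is explicitly switched on in `conf`."""
    return conf.get(ROLLING_ENABLED, "false").lower() == "true"


def dependency_warnings(conf):
    """Report settings that have no effect without their parent config."""
    warnings = []
    if ROLLING_MAX_SIZE in conf and not rolling_effective(conf):
        warnings.append(
            f"{ROLLING_MAX_SIZE} has no effect unless {ROLLING_ENABLED} is true"
        )
    return warnings
```

Documenting the relationship in the description of the dependent configuration serves the same purpose as this check, without requiring any tooling.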

Re: Request to document the direct relationship between other configurations

2020-02-11 Thread Jungtaek Lim
ations should have redundant part of the doc. More redundant if the condition is nested. I agree this is the good step of "be kind" but less pragmatic. I'd be happy to follow the consensus we would make in this thread. Appreciate more voices. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, F

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Jungtaek Lim
Nice work, Dongjoon! Thanks for the huge efforts on sorting out with correctness things as well. On Tue, Feb 11, 2020 at 12:40 PM Wenchen Fan wrote: > Great Job, Dongjoon! > > On Mon, Feb 10, 2020 at 4:18 PM Hyukjin Kwon wrote: > >> Thanks Dongjoon! >> >> On Sun, Feb 9, 2020 at 10:49 AM, Takeshi Yamam

Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-15 Thread Jungtaek Lim
Once we decided to cancel the RC1, what about including SPARK-29450 ( https://github.com/apache/spark/pull/27209) into RC2? SPARK-29450 was merged into master, and Xiao figured out it fixed a regression, long lasting one (broken at 2.3.0). The link refers the PR for 2.4 branch. Thanks, Jungtaek

Re: Release Apache Spark 2.4.5

2020-01-05 Thread Jungtaek Lim
+1 to have another Spark 2.4 release, as Spark 2.4.4 was released 4 months ago and there's a release window for this. On Mon, Jan 6, 2020 at 12:38 PM Hyukjin Kwon wrote: > Yeah, I think it's nice to have another maintenance release given Spark > 3.0 timeline. > > On Mon, Jan 6, 2020 at 7:58 AM, Dongjo

Re: Patch to produce messages with null body using console producer

2019-12-27 Thread Jungtaek Lim
You seem to have hit the wrong mailing list - please send to the Kafka dev mailing list. On Fri, Dec 27, 2019 at 8:10 PM jelmer wrote: > Hi folks, > > A while back I opened a pull request ( > https://github.com/apache/kafka/pull/7567 ) that makes it possible to > produce messages with a null body using the

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-24 Thread Jungtaek Lim
to get reviewed and merged later? Happy Holiday! Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro wrote: > Looks nice, happy holiday, all! > > Bests, > Takeshi > > On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun > wrote: > >> +1

Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview2

2019-12-24 Thread Jungtaek Lim
Great work, Yuming! Happy Holidays. On Wed, Dec 25, 2019 at 9:08 AM Dongjoon Hyun wrote: > Indeed! Thank you again, Yuming and all. > > Bests, > Dongjoon. > > > On Tue, Dec 24, 2019 at 13:38 Takeshi Yamamuro > wrote: > >> Great work, Yuming! >> >> Bests, >> Takeshi >> >> On Wed, Dec 25, 2019 at

Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Jungtaek Lim
If I understand correctly, you'll just want to package your implementation with your preference of project manager (maven, sbt, etc.) which registers your dialect implementation into JdbcDialects, and pass the jar and let end users load the jar. That will automatically do everything and they can us

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Jungtaek Lim
ll/26845) On Thu, Dec 12, 2019 at 3:53 AM Nicholas Chammas wrote: > Is this something that would be exposed/relevant to the Python API? Or is > this just for people implementing their own Spark data source? > > On Wed, Dec 11, 2019 at 12:35 AM Jungtaek Lim < > kabhwan.opensou...

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Jungtaek Lim
Nice, thanks for the answer! I'll craft a PR soon. Thanks again. On Thu, Dec 12, 2019 at 3:32 AM Ryan Blue wrote: > Sounds good to me, too. > > On Wed, Dec 11, 2019 at 1:18 AM Jungtaek Lim > wrote: > >> Thanks for the quick response, Wenchen! >> >> I'

Re: [DISCUSS] Add close() on DataWriter interface

2019-12-11 Thread Jungtaek Lim
me > for DataWriter. > > On Wed, Dec 11, 2019 at 1:35 PM Jungtaek Lim > wrote: > >> Hi devs, >> >> I'd like to propose to add close() on DataWriter explicitly, which is the >> place for resource cleanup. >> >> The rationalization of the propo

[DISCUSS] Add close() on DataWriter interface

2019-12-10 Thread Jungtaek Lim
tible changes in Spark 3.0, so I feel it may not matter. Would love to hear your thoughts. Thanks in advance, Jungtaek Lim (HeartSaVioR)

Re: DataSourceWriter V2 Api questions

2019-12-06 Thread Jungtaek Lim
> There are 2 open questions we need to answer: > 1. How to make sure all tasks are launched at the same time to implement > 2PC? barrier execution? > 2. To reach "eventually consistent", we must retry the job until success. > How shall we guarantee the job retry? > >

Re: Query regarding stateless aggregations

2019-11-28 Thread Jungtaek Lim
ing on how the input is broken down to multiple batches. By the definition of ground rule, streaming aggregation is required to be stateful. Thanks, Jungtaek Lim (HeartSaVioR) On Thu, Nov 28, 2019 at 9:17 PM Chitral Verma wrote: > Hi Devs, > I have a query regarding stateless agg

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
t mitigate the issue heavily... so please treat my idea as > rough idea just for possible optimization.) > > > > But again that's very rough idea, and it won't make sense if the > expected output is not acceptable as representation. > > > > -Jungtaek Lim (HeartSaV

Re: Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
wait... Hmm... Looks like I missed the another point of optimization here which might mitigate the issue heavily... so please treat my idea as rough idea just for possible optimization.) But again that's very rough idea, and it won't make sense if the expected output is not acceptable as representat

Loose the requirement of "median" of the SQL metrics

2019-11-27 Thread Jungtaek Lim
is 100, and replace sorting 100 elements with sorting 10 elements 11 times. The difference would be bigger if the number of tasks is bigger. Just a rough idea so any feedbacks are appreciated. Thanks, Jungtaek Lim (HeartSaVioR)

Re: Does StreamingSymmetricHashJoinExec work with watermark? I don't think so

2019-11-14 Thread Jungtaek Lim
Jacek, would you mind if I ask for the query to reproduce? Not sure I get you without having the example of "not working". Thanks, Jungtaek Lim (HeartSaVioR) On Tue, Nov 12, 2019 at 12:04 AM Jacek Laskowski wrote: > Hi, > > I think watermark does not work for StreamingSy
