Observed consistent test failure in master (ParquetIOSuite)

2022-06-27 Thread Jungtaek Lim
Hi,

I just observed a test failure in ParquetIOSuite which I can consistently
reproduce with IntelliJ. I haven't had a chance to run the test with Maven/sbt yet.

I filed SPARK-39622 <https://issues.apache.org/jira/browse/SPARK-39622> for
this failure.

It'd be awesome if someone with context could look into this soon.

Thanks!
Jungtaek Lim (HeartSaVioR)


Re: Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-14 Thread Jungtaek Lim
+1 (non-binding)

Checked signature and checksum. Confirmed SPARK-39412 is resolved. Built
source tgz with JDK 11.

Thanks Max for driving the efforts of this huge release!

On Tue, Jun 14, 2022 at 2:51 PM huaxin gao  wrote:

> +1 (non-binding)
>
> On Mon, Jun 13, 2022 at 10:47 PM Kent Yao  wrote:
>
>> +1, non-binding
>>
>> > Xiao Li wrote on Tue, Jun 14, 2022 at 13:11:
>> >
>> > +1
>> >
>> > Xiao
>> >
>> > beliefer wrote on Mon, Jun 13, 2022 at 20:04:
>> >>
>> >> +1 AFAIK, no blocking issues now.
>> >> Glad to hear to release 3.3.0 !
>> >>
>> >>
>> >> On 2022-06-14 09:38:35, "Ruifeng Zheng" wrote:
>> >>
>> >> +1 (non-binding)
>> >>
>> >> Maxim, thank you for driving this release!
>> >>
>> >> thanks,
>> >> ruifeng
>> >>
>> >>
>> >>
>> >> -- Original Message --
>> >> From: "Chao Sun";
>> >> Sent: Tuesday, June 14, 2022, 8:45 AM
>> >> To: "Cheng Su";
>> >> Cc: "L. C. Hsieh"; "dev";
>> >> Subject: Re: [VOTE] Release Spark 3.3.0 (RC6)
>> >>
>> >> +1 (non-binding)
>> >>
>> >> Thanks,
>> >> Chao
>> >>
>> >> On Mon, Jun 13, 2022 at 5:37 PM Cheng Su 
>> wrote:
>> >>>
>> >>> +1 (non-binding).
>> >>>
>> >>>
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Cheng Su
>> >>>
>> >>>
>> >>>
>> >>> From: L. C. Hsieh 
>> >>> Date: Monday, June 13, 2022 at 5:13 PM
>> >>> To: dev 
>> >>> Subject: Re: [VOTE] Release Spark 3.3.0 (RC6)
>> >>>
>> >>> +1
>> >>>
>> >>> On Mon, Jun 13, 2022 at 5:07 PM Holden Karau 
>> wrote:
>> >>> >
>> >>> > +1
>> >>> >
>> >>> > On Mon, Jun 13, 2022 at 4:51 PM Yuming Wang 
>> wrote:
>> >>> >>
>> >>> >> +1 (non-binding)
>> >>> >>
>> >>> >> On Tue, Jun 14, 2022 at 7:41 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>> >>> >>>
>> >>> >>> +1
>> >>> >>>
>> >>> >>> Thanks,
>> >>> >>> Dongjoon.
>> >>> >>>
>> >>> >>> On Mon, Jun 13, 2022 at 3:54 PM Chris Nauroth <
>> cnaur...@apache.org> wrote:
>> >>> 
>> >>>  +1 (non-binding)
>> >>> 
>> >>>  I repeated all checks I described for RC5:
>> >>> 
>> >>>  https://lists.apache.org/thread/ksoxmozgz7q728mnxl6c2z7ncmo87vls
>> >>> 
>> >>>  Maxim, thank you for your dedication on these release candidates.
>> >>> 
>> >>>  Chris Nauroth
>> >>> 
>> >>> 
>> >>>  On Mon, Jun 13, 2022 at 3:21 PM Mridul Muralidharan <
>> mri...@gmail.com> wrote:
>> >>> >
>> >>> >
>> >>> > +1
>> >>> >
>> >>> > Signatures, digests, etc check out fine.
>> >>> > Checked out tag and build/tested with -Pyarn -Pmesos
>> -Pkubernetes
>> >>> >
>> >>> > The test "SPARK-33084: Add jar support Ivy URI in SQL" in
>> sql.SQLQuerySuite fails; but other than that, rest looks good.
>> >>> >
>> >>> > Regards,
>> >>> > Mridul
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Mon, Jun 13, 2022 at 4:25 PM Tom Graves
>>  wrote:
>> >>> >>
>> >>> >> +1
>> >>> >>
>> >>> >> Tom
>> >>> >>
>> >>> >> On Thursday, June 9, 2022, 11:27:50 PM CDT, Maxim Gekk <
>> maxim.g...@databricks.com.invalid> wrote:
>> >>> >>
>> >>> >>
>> >>> >> Please vote on releasing the following candidate as Apache
>> Spark version 3.3.0.
>> >>> >>
>> >>> >> The vote is open until 11:59pm Pacific time June 14th and
>> passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >>> >>
>> >>> >> [ ] +1 Release this package as Apache Spark 3.3.0
>> >>> >> [ ] -1 Do not release this package because ...
>> >>> >>
>> >>> >> To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >>> >>
>> >>> >> The tag to be voted on is v3.3.0-rc6 (commit
>> f74867bddfbcdd4d08076db36851e88b15e66556):
>> >>> >> https://github.com/apache/spark/tree/v3.3.0-rc6
>> >>> >>
>> >>> >> The release files, including signatures, digests, etc. can be
>> found at:
>> >>> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-bin/
>> >>> >>
>> >>> >> Signatures used for Spark RCs can be found in this file:
>> >>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>> >>
>> >>> >> The staging repository for this release can be found at:
>> >>> >>
>> https://repository.apache.org/content/repositories/orgapachespark-1407
>> >>> >>
>> >>> >> The documentation corresponding to this release can be found
>> at:
>> >>> >> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc6-docs/
>> >>> >>
>> >>> >> The list of bug fixes going into 3.3.0 can be found at the
>> following URL:
>> >>> >>
>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>> >>> >>
>> >>> >> This release is using the release script of the tag v3.3.0-rc6.
>> >>> >>
>> >>> >>
>> >>> >> FAQ
>> >>> >>
>> >>> >> =
>> >>> >> How can I help test this release?
>> >>> >> =
>> >>> >> If you are a Spark user, you can help us test this release by
>> taking
>> >>> >> an existing Spark workload and running on this release
>> 

Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-08 Thread Jungtaek Lim
Apologies for the late participation.

I'm sorry, but -1 (non-binding) from me.

Unfortunately, I found a major user-facing issue which seriously hurts UX
for Kafka data source usage.

In some cases, the Kafka data source can throw IllegalStateException when
failOnDataLoss=true, a condition which is bound to the state of the Kafka
topic (not a Spark issue). With a recent change in Spark,
IllegalStateException is now bound to "internal error", and Spark gives
incorrect guidance to end users, telling them that Spark has a bug and
encouraging them to file a JIRA ticket, which is simply wrong.

Previously, the Kafka data source provided an error message with the context
of why it failed and how to work around it. I feel this is a serious
regression in UX.

Please look into https://issues.apache.org/jira/browse/SPARK-39412 for more
details.
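
For reference, a minimal sketch of the kind of query that hits this path
(the broker address and topic name below are hypothetical, just for
illustration); with failOnDataLoss=true, data loss on the Kafka side
surfaces as a query failure, which is exactly where the misleading
"internal error" guidance now shows up:

    // Scala sketch, assuming a SparkSession `spark` and the
    // spark-sql-kafka-0-10 connector on the classpath.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .option("failOnDataLoss", "true")                 // fail the query on data loss (the default)
      .load()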


On Wed, Jun 8, 2022 at 3:40 PM Hyukjin Kwon  wrote:

> Okay. Thankfully the binary release is fine per
> https://github.com/apache/spark/blob/v3.3.0-rc5/dev/create-release/release-build.sh#L268
> .
> The source package (and GitHub tag) has 3.3.0.dev0, and the binary package
> has 3.3.0. Technically this is not a blocker now because PyPI upload will
> be able to be made correctly.
> I lowered the priority to critical. I switched my -1 to 0.
>
> On Wed, 8 Jun 2022 at 15:17, Hyukjin Kwon  wrote:
>
>> Arrrgh  .. I am very sorry that I found this problem late.
>> RC 5 does not have the correct version of PySpark, see
>> https://github.com/apache/spark/blob/v3.3.0-rc5/python/pyspark/version.py#L19
>> I think the release script was broken because the version now has 'str'
>> type, see
>> https://github.com/apache/spark/blob/v3.3.0-rc5/dev/create-release/release-tag.sh#L88
>> I filed a JIRA at https://issues.apache.org/jira/browse/SPARK-39411
>>
>> -1 from me
>>
>>
>>
>> On Wed, 8 Jun 2022 at 13:16, Cheng Pan  wrote:
>>
>>> +1 (non-binding)
>>>
>>> * Verified SPARK-39313 has been addressed [1]
>>> * Passed integration test w/ Apache Kyuubi (Incubating)[2]
>>>
>>> [1] https://github.com/housepower/spark-clickhouse-connector/pull/123
>>> [2] https://github.com/apache/incubator-kyuubi/pull/2817
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>> On Wed, Jun 8, 2022 at 7:04 AM Chris Nauroth 
>>> wrote:
>>> >
>>> > +1 (non-binding)
>>> >
>>> > * Verified all checksums.
>>> > * Verified all signatures.
>>> > * Built from source, with multiple profiles, to full success, for Java
>>> 11 and Scala 2.13:
>>> > * build/mvn -Phadoop-3 -Phadoop-cloud -Phive-thriftserver
>>> -Pkubernetes -Pscala-2.13 -Psparkr -Pyarn -DskipTests clean package
>>> > * Tests passed.
>>> > * Ran several examples successfully:
>>> > * bin/spark-submit --class org.apache.spark.examples.SparkPi
>>> examples/jars/spark-examples_2.12-3.3.0.jar
>>> > * bin/spark-submit --class
>>> org.apache.spark.examples.sql.hive.SparkHiveExample
>>> examples/jars/spark-examples_2.12-3.3.0.jar
>>> > * bin/spark-submit
>>> examples/src/main/python/streaming/network_wordcount.py localhost 
>>> > * Tested some of the issues that blocked prior release candidates:
>>> > * bin/spark-sql -e 'SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT
>>> true) t(x) UNION SELECT 1 AS a;'
>>> > * bin/spark-sql -e "select date '2018-11-17' > 1"
>>> > * SPARK-39293 ArrayAggregate fix
>>> >
>>> > Chris Nauroth
>>> >
>>> >
>>> > On Tue, Jun 7, 2022 at 1:30 PM Cheng Su 
>>> wrote:
>>> >>
>>> >> +1 (non-binding). Built and ran some internal test for Spark SQL.
>>> >>
>>> >>
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Cheng Su
>>> >>
>>> >>
>>> >>
>>> >> From: L. C. Hsieh 
>>> >> Date: Tuesday, June 7, 2022 at 1:23 PM
>>> >> To: dev 
>>> >> Subject: Re: [VOTE] Release Spark 3.3.0 (RC5)
>>> >>
>>> >> +1
>>> >>
>>> >> Liang-Chi
>>> >>
>>> >> On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang 
>>> wrote:
>>> >> >
>>> >> > +1 (non-binding)
>>> >> >
>>> >> > Gengliang
>>> >> >
>>> >> > On Tue, Jun 7, 2022 at 12:24 PM Thomas Graves 
>>> wrote:
>>> >> >>
>>> >> >> +1
>>> >> >>
>>> >> >> Tom Graves
>>> >> >>
>>> >> >> On Sat, Jun 4, 2022 at 9:50 AM Maxim Gekk
>>> >> >>  wrote:
>>> >> >> >
>>> >> >> > Please vote on releasing the following candidate as Apache Spark
>>> version 3.3.0.
>>> >> >> >
>>> >> >> > The vote is open until 11:59pm Pacific time June 8th and passes
>>> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >> >> >
>>> >> >> > [ ] +1 Release this package as Apache Spark 3.3.0
>>> >> >> > [ ] -1 Do not release this package because ...
>>> >> >> >
>>> >> >> > To learn more about Apache Spark, please see
>>> http://spark.apache.org/
>>> >> >> >
>>> >> >> > The tag to be voted on is v3.3.0-rc5 (commit
>>> 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
>>> >> >> > https://github.com/apache/spark/tree/v3.3.0-rc5
>>> >> >> >
>>> >> >> > The release files, including signatures, digests, etc. can be
>>> found at:
>>> >> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
>>> >> >> >
>>> >> >> > Signatures used for Spark RCs can 

Re: SIGMOD System Award for Apache Spark

2022-05-12 Thread Jungtaek Lim
Congrats Spark community!

On Fri, May 13, 2022 at 10:40 AM Qian Sun  wrote:

> Congratulations !!!
>
> On May 13, 2022, at 3:44 AM, Matei Zaharia wrote:
>
> Hi all,
>
> We recently found out that Apache Spark received the SIGMOD System Award this
> year, given by SIGMOD (the ACM’s data management research organization) to
> impactful real-world and research systems. This puts Spark in good company
> with some very impressive previous recipients. This award is
> really an achievement by the whole community, so I wanted to say congrats
> to everyone who contributes to Spark, whether through code, issue reports,
> docs, or other means.
>
> Matei
>
>
>


Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-23 Thread Jungtaek Lim
If it requires a Kafka broker update, we should not simply bump the version
of the Kafka client. We should probably at least provide separate artifacts.

We should not force an upgrade of another component just because we want to
upgrade the dependency. At least it should not happen in a minor version of
Spark. Hopefully that's not the case here.

There's a belief that Kafka client-broker compatibility holds both backwards
and forwards. That is true in many cases (it's something Kafka excels at),
but there seem to be cases where it does not hold for a specific
configuration and a specific Kafka broker setup, e.g. KIP-679.

The less compatible config is going to be turned on by default in 3.0 and
will not work correctly with specific Kafka broker setups. So it is us who
break things for those setups, and my point is how much responsibility we
should take to guide end users and avoid the frustration.

On Wed, Mar 23, 2022 at 9:41 PM, Sean Owen wrote:

> Well, yes, but if it requires a Kafka server-side update, it does, and
> that is out of scope for us to document.
> It is important that we document if and how (if we know) the client update
> will impact existing Kafka installations (does it require a server-side
> update or not?), and document the change itself for sure along with any
> Spark-side migration notes.
>
> On Fri, Mar 18, 2022 at 8:47 PM Jungtaek Lim 
> wrote:
>
>> The thing is, it is “us” who upgrades Kafka client and makes possible
>> divergence between client and broker in end users’ production env.
>>
>> Someone can claim that end users can downgrade the kafka-client artifact
>> when building their app so that the version can be matched, but we don’t
>> test anything against downgrading kafka-client version for kafka connector.
>> That sounds to me we defer our work to end users.
>>
>> It sounds to me “someone” should refer to us, and then it is no longer a
>> matter of “help”. It is a matter of “responsibility”, as you said.
>>
>> On Fri, Mar 18, 2022 at 10:15 PM, Sean Owen wrote:
>>
>>> I think we can assume that someone upgrading Kafka will be responsible
>>> for thinking through the breaking changes. We can help by listing anything
>>> we know could affect Spark-Kafka usage and calling those out in a release
>>> note, for sure. I don't think we need to get into items that would affect
>>> Kafka usage itself; focus on the connector-related issues.
>>>
>>> On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> CORRECTION: in option 2, we enumerate KIPs which may bring
>>>> incompatibility with older brokers (not all KIPs).
>>>>
>>>> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Hi dev,
>>>>>
>>>>> I would like to initiate the discussion about how to deal with the
>>>>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>>>>> 3.3.
>>>>>
>>>>> We didn't care much about the upgrade of Kafka dependency since our
>>>>> belief on Kafka client has been that the new Kafka client version should
>>>>> have no compatibility issues with older brokers. Based on semantic
>>>>> versioning, upgrading major versions rings an alarm for me.
>>>>>
>>>>> I haven't gone through changes that happened between versions, but
>>>>> found one KIP (KIP-679
>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
>>>>> which may not work with older brokers with specific setup. (It's described
>>>>> in the "Compatibility, Deprecation, and Migration Plan" section of the 
>>>>> KIP).
>>>>>
>>>>> This may not be problematic for the users who upgrade both client and
>>>>> broker altogether, but end users of Spark may be unlikely the case.
>>>>> Computation engines are relatively easier to upgrade. Storage systems
>>>>> aren't. End users would think the components are independent.
>>>>>
>>>>> I looked through the notable changes in the Kafka doc, and it does
>>>>> mention this KIP, but it just says the default config has changed and
>>>>> doesn't mention about the impacts. There is a link to
>>>>> KIP, that said, everyone needs to read through the KIP wiki page for
>>>>> details.
>>>>>
>>>>> Based on the context, what would b

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-23 Thread Jungtaek Lim
Bumping to try to gather more voices before taking action. For now, I see
two voices, for options 2 & 5 (the latter being similar to option 2, but in
the release note rather than the migration guide).

On Fri, Mar 18, 2022 at 7:15 PM Jungtaek Lim 
wrote:

> CORRECTION: in option 2, we enumerate KIPs which may bring incompatibility
> with older brokers (not all KIPs).
>
> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim 
> wrote:
>
>> Hi dev,
>>
>> I would like to initiate the discussion about how to deal with the
>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>> 3.3.
>>
>> We didn't care much about the upgrade of Kafka dependency since our
>> belief on Kafka client has been that the new Kafka client version should
>> have no compatibility issues with older brokers. Based on semantic
>> versioning, upgrading major versions rings an alarm for me.
>>
>> I haven't gone through changes that happened between versions, but found
>> one KIP (KIP-679
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
>> which may not work with older brokers with specific setup. (It's described
>> in the "Compatibility, Deprecation, and Migration Plan" section of the KIP).
>>
>> This may not be problematic for the users who upgrade both client and
>> broker altogether, but end users of Spark may be unlikely the case.
>> Computation engines are relatively easier to upgrade. Storage systems
>> aren't. End users would think the components are independent.
>>
>> I looked through the notable changes in the Kafka doc, and it does
>> mention this KIP, but it just says the default config has changed and
>> doesn't mention about the impacts. There is a link to
>> KIP, that said, everyone needs to read through the KIP wiki page for
>> details.
>>
>> Based on the context, what would be the best way to notice end users for
>> the major version upgrade of Kafka? I can imagine several options
>> including...
>>
>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>> the noticeable changes in the Kafka doc in the migration guide.
>> 2. Do 1 & spend more effort to read through all KIPs and check
>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>> KIPs (or even summarize) in the migration guide.
>> 3. Do 2 & actively override the default configs to be compatible with
>> older versions if the change of the default configs in Kafka 3.0 is
>> backward incompatible. End users should set these configs explicitly to
>> override them back.
>> 4. Do not care. End users can indicate the upgrade in the release note,
>> and we expect end users to actively check the notable changes (& KIPs) from
>> Kafka doc.
>> 5. Options not described above...
>>
>> Please take a look and provide your voice on this.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. Probably this would be applied to all non-bugfix versions of
>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>> for minor versions, though.
>>
>


Re: bazel and external/

2022-03-22 Thread Jungtaek Lim
We might want to keep this open for several more days to see more voices or
support here, since it has only been 4 days, including a weekend.

But I'm happy with the change (+1 from me), given that the new name makes
much more sense than before. For me, this is a pure improvement even without
the bazel context.

On Tue, Mar 22, 2022 at 7:02 PM Alkis Evlogimenos <
alkis.evlogime...@databricks.com> wrote:

> The PR is updated to rename the directory to `connectors`. If there are no
> other objections can we merge it?
>
> On Mon, Mar 21, 2022 at 1:42 PM Alkis Evlogimenos <
> alkis.evlogime...@databricks.com> wrote:
>
>> Unless there are objections, I will update the PR tonight to rename
>> `external` to `connectors`.
>>
>> On Mon, Mar 21, 2022 at 12:36 PM Wenchen Fan  wrote:
>>
>>> How about renaming it to `connectors` if docker is the only exception
>>> and will be moved out?
>>>
>>> On Sat, Mar 19, 2022 at 6:18 PM Alkis Evlogimenos
>>>  wrote:
>>>
>>>> It looks like renaming the directory and moving components can be
>>>> separate steps. If there is consensus that connectors will move out, should
>>>> the directory be named misc for everything else until there is some
>>>> direction for the remaining modules?
>>>>
>>>> On Fri, 18 Mar 2022 at 03:03 Jungtaek Lim 
>>>> wrote:
>>>>
>>>>> Avro reader is technically a connector. We eventually called data
>>>>> source implementation "connector" as well; the package name in the 
>>>>> catalyst
>>>>> represents it.
>>>>>
>>>>> Docker is something I'm not sure fits with the name "external". It
>>>>> probably deserves a top level directory now, since we start to release an
>>>>> official docker image. That does not seem to be an experimental one.
>>>>>
>>>>> Except Docker, all modules in the external directory are "sort of"
>>>>> connectors. Ganglia metric sink is an exception, but it is still a kind of
>>>>> connector for Dropwizard.
>>>>> (It might be interesting to see how many users are still using
>>>>> kinesis-asl and ganglia-lgpl modules. We have had almost no updates for
>>>>> DStream for several years.)
>>>>>
>>>>> If we agree with my proposal for docker, remaining is going to be
>>>>> effectively a rename. I don't have a strong opinion, just wanted to avoid
>>>>> the external directory to become/remain miscellaneous one.
>>>>>
>>>>> On Fri, Mar 18, 2022 at 10:04 AM Sean Owen  wrote:
>>>>>
>>>>>> I sympathize, but might be less change to just rename the dir. There
>>>>>> is more in there like the avro reader; it's kind of miscellaneous. I 
>>>>>> think
>>>>>> we might want fewer rather than more top level dirs.
>>>>>>
>>>>>> On Thu, Mar 17, 2022 at 7:33 PM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> We seem to just focus on how to avoid the conflict with the name
>>>>>>> "external" used in bazel. Since we consider the possibility of renaming,
>>>>>>> why not revisit the modules "external" contains?
>>>>>>>
>>>>>>> Looks like kinds of the modules external directory contains are 1)
>>>>>>> Docker 2) Connectors 3) Sink on Dropwizard metrics (only ganglia here, 
>>>>>>> and
>>>>>>> it seems to be just that Ganglia is LGPL)
>>>>>>>
>>>>>>> Would it make sense if each kind deserves a top directory? We can
>>>>>>> probably give better generalized names, and as a side-effect we will no
>>>>>>> longer have "external".
>>>>>>>
>>>>>>> On Fri, Mar 18, 2022 at 5:45 AM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you for posting this, Alkis.
>>>>>>>>
>>>>>>>> Before the question (1) and (2), I'm curious if the Apache Spark
>>>>>>>> community has other downstreams using Bazel.
>>>>>>>>
>>>>>>>> To All. If there are some Bazel users with Apache Spark code, could
>>>>>>>> 

Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
The thing is, it is “us” who upgrade the Kafka client and make divergence
between client and broker possible in end users’ production environments.

Someone could claim that end users can downgrade the kafka-client artifact
when building their app so that the versions match, but we don’t test
anything against a downgraded kafka-client version for the Kafka connector.
That sounds to me like deferring our work to end users.

It sounds to me that “someone” should refer to us, and then it is no longer
a matter of “help”; it is a matter of “responsibility”, as you said.

On Fri, Mar 18, 2022 at 10:15 PM, Sean Owen wrote:

> I think we can assume that someone upgrading Kafka will be responsible for
> thinking through the breaking changes. We can help by listing anything we
> know could affect Spark-Kafka usage and calling those out in a release
> note, for sure. I don't think we need to get into items that would affect
> Kafka usage itself; focus on the connector-related issues.
>
> On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim 
> wrote:
>
>> CORRECTION: in option 2, we enumerate KIPs which may bring
>> incompatibility with older brokers (not all KIPs).
>>
>> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi dev,
>>>
>>> I would like to initiate the discussion about how to deal with the
>>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>>> 3.3.
>>>
>>> We didn't care much about the upgrade of Kafka dependency since our
>>> belief on Kafka client has been that the new Kafka client version should
>>> have no compatibility issues with older brokers. Based on semantic
>>> versioning, upgrading major versions rings an alarm for me.
>>>
>>> I haven't gone through changes that happened between versions, but found
>>> one KIP (KIP-679
>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
>>> which may not work with older brokers with specific setup. (It's described
>>> in the "Compatibility, Deprecation, and Migration Plan" section of the KIP).
>>>
>>> This may not be problematic for the users who upgrade both client and
>>> broker altogether, but end users of Spark may be unlikely the case.
>>> Computation engines are relatively easier to upgrade. Storage systems
>>> aren't. End users would think the components are independent.
>>>
>>> I looked through the notable changes in the Kafka doc, and it does
>>> mention this KIP, but it just says the default config has changed and
>>> doesn't mention about the impacts. There is a link to
>>> KIP, that said, everyone needs to read through the KIP wiki page for
>>> details.
>>>
>>> Based on the context, what would be the best way to notice end users for
>>> the major version upgrade of Kafka? I can imagine several options
>>> including...
>>>
>>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>>> the noticeable changes in the Kafka doc in the migration guide.
>>> 2. Do 1 & spend more effort to read through all KIPs and check
>>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>>> KIPs (or even summarize) in the migration guide.
>>> 3. Do 2 & actively override the default configs to be compatible with
>>> older versions if the change of the default configs in Kafka 3.0 is
>>> backward incompatible. End users should set these configs explicitly to
>>> override them back.
>>> 4. Do not care. End users can indicate the upgrade in the release note,
>>> and we expect end users to actively check the notable changes (& KIPs) from
>>> Kafka doc.
>>> 5. Options not described above...
>>>
>>> Please take a look and provide your voice on this.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> ps. Probably this would be applied to all non-bugfix versions of
>>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>>> for minor versions, though.
>>>
>>


Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
As always, I hope the discussion stays focused on the topic. Let's avoid
getting side-tracked. Please consider the mail thread as the full context,
and feel free to ask me if any information is missing for you to provide a
voice.

Thanks for the voice in the previous mail, btw!

On Fri, Mar 18, 2022 at 9:41 PM, Gabor Somogyi wrote:

> I've just read the related PR and seems like the situation is not so black
> and white as I've presumed purely from tech point of view...
>
> On Fri, 18 Mar 2022, 12:44 Gabor Somogyi, 
> wrote:
>
>> Hi Jungtaek,
>>
>> I've taken a deeper look at the issue and here are my findings.
>> As far as I'm concerned there are basically 2 ways with some minor
>> decorations:
>> * We care
>> * We don't care
>>
>> I'm pretty sure users are clever enough but setting the expectation that
>> all users are tracking Kafka KIPs one-by-one would be a hard requirement.
>> This implies that I would vote on the "We care" point, the only question
>> is how.
>>
>> Unless we have a specific reason for point 3 I wouldn't override default
>> configs. The reason behind is simple.
>> Kafka has its strategic direction and going against it w/o good reason
>> is rarely a good idea (maybe we have such but that would be said out).
>>
>> I think when Kafka version upgrade happens we engineers are having a look
>> whether all the changes in the new version
>> are backward compatible or not so point 2 makes sense to me. Honestly I'm
>> drinking coffee with some of the Kafka devs
>> so I've never ever read through all the KIPs between releases because
>> they've told what's important to check :)
>>
>> Seems like my Kafka Spark compatibility gist is out-of-date so maybe I
>> need to invest some time to resurrect it:
>> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9
>>
>> Hope my thoughts are helpful!
>>
>> BR,
>> G
>>
>>
>> On Fri, Mar 18, 2022 at 11:15 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> CORRECTION: in option 2, we enumerate KIPs which may bring
>>> incompatibility with older brokers (not all KIPs).
>>>
>>> On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> I would like to initiate the discussion about how to deal with the
>>>> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
>>>> 3.3.
>>>>
>>>> We didn't care much about the upgrade of Kafka dependency since our
>>>> belief on Kafka client has been that the new Kafka client version should
>>>> have no compatibility issues with older brokers. Based on semantic
>>>> versioning, upgrading major versions rings an alarm for me.
>>>>
>>>> I haven't gone through changes that happened between versions, but
>>>> found one KIP (KIP-679
>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
>>>> which may not work with older brokers with specific setup. (It's described
>>>> in the "Compatibility, Deprecation, and Migration Plan" section of the 
>>>> KIP).
>>>>
>>>> This may not be problematic for the users who upgrade both client and
>>>> broker altogether, but end users of Spark may be unlikely the case.
>>>> Computation engines are relatively easier to upgrade. Storage systems
>>>> aren't. End users would think the components are independent.
>>>>
>>>> I looked through the notable changes in the Kafka doc, and it does
>>>> mention this KIP, but it just says the default config has changed and
>>>> doesn't mention about the impacts. There is a link to
>>>> KIP, that said, everyone needs to read through the KIP wiki page for
>>>> details.
>>>>
>>>> Based on the context, what would be the best way to notice end users
>>>> for the major version upgrade of Kafka? I can imagine several options
>>>> including...
>>>>
>>>> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
>>>> the noticeable changes in the Kafka doc in the migration guide.
>>>> 2. Do 1 & spend more effort to read through all KIPs and check
>>>> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
>>>> KIPs (or even summarize) in the migration guide.
>>>> 3. Do 2 & actively override the default configs to be compatible with
>>>> older versions if the change of the default configs in Kafka 3.0 is
>>>> backward incompatible. End users should set these configs explicitly to
>>>> override them back.
>>>> 4. Do not care. End users can indicate the upgrade in the release note,
>>>> and we expect end users to actively check the notable changes (& KIPs) from
>>>> Kafka doc.
>>>> 5. Options not described above...
>>>>
>>>> Please take a look and provide your voice on this.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> ps. Probably this would be applied to all non-bugfix versions of
>>>> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
>>>> for minor versions, though.
>>>>
>>>


Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
CORRECTION: in option 2, we enumerate KIPs which may bring incompatibility
with older brokers (not all KIPs).

On Fri, Mar 18, 2022 at 7:12 PM Jungtaek Lim 
wrote:

> Hi dev,
>
> I would like to initiate the discussion about how to deal with the
> migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
> 3.3.
>
> We didn't care much about the upgrade of Kafka dependency since our belief
> on Kafka client has been that the new Kafka client version should have no
> compatibility issues with older brokers. Based on semantic versioning,
> upgrading major versions rings an alarm for me.
>
> I haven't gone through changes that happened between versions, but found
> one KIP (KIP-679
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
> which may not work with older brokers with specific setup. (It's described
> in the "Compatibility, Deprecation, and Migration Plan" section of the KIP).
>
> This may not be problematic for the users who upgrade both client and
> broker altogether, but end users of Spark may be unlikely the case.
> Computation engines are relatively easier to upgrade. Storage systems
> aren't. End users would think the components are independent.
>
> I looked through the notable changes in the Kafka doc, and it does mention
> this KIP, but it just says the default config has changed and doesn't
> mention about the impacts. There is a link to KIP, that said, everyone
> needs to read through the KIP wiki page for details.
>
> Based on the context, what would be the best way to notice end users for
> the major version upgrade of Kafka? I can imagine several options
> including...
>
> 1. Explicitly mention that Spark 3.3 upgrades Kafka to 3.1 with linking
> the noticeable changes in the Kafka doc in the migration guide.
> 2. Do 1 & spend more effort to read through all KIPs and check
> "Compatibility, Deprecation, and Migration Plan" section, and enumerate all
> KIPs (or even summarize) in the migration guide.
> 3. Do 2 & actively override the default configs to be compatible with
> older versions if the change of the default configs in Kafka 3.0 is
> backward incompatible. End users should set these configs explicitly to
> override them back.
> 4. Do not care. End users can indicate the upgrade in the release note,
> and we expect end users to actively check the notable changes (& KIPs) from
> Kafka doc.
> 5. Options not described above...
>
> Please take a look and provide your voice on this.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> ps. Probably this would be applied to all non-bugfix versions of
> dependency upgrades. We may still want to be pragmatic, e.g. pass-through
> for minor versions, though.
>


[DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-18 Thread Jungtaek Lim
Hi dev,

I would like to initiate the discussion about how to deal with the
migration guide on upgrading Kafka to 3.1 (from 2.8.1) in upcoming Spark
3.3.

We haven't cared much about upgrading the Kafka dependency, since our belief
about the Kafka client has been that a new Kafka client version should have
no compatibility issues with older brokers. Based on semantic versioning,
a major version upgrade rings an alarm for me.

I haven't gone through all the changes that happened between versions, but
found one KIP (KIP-679
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-679%3A+Producer+will+enable+the+strongest+delivery+guarantee+by+default>)
which may not work with older brokers with a specific setup. (It's described
in the "Compatibility, Deprecation, and Migration Plan" section of the KIP.)

This may not be problematic for users who upgrade both client and broker
together, but that is unlikely to be the case for end users of Spark.
Computation engines are relatively easy to upgrade. Storage systems aren't.
End users would think the components are independent.

I looked through the notable changes in the Kafka doc, and it does mention
this KIP, but it just says the default config has changed and doesn't
mention the impacts. There is a link to the KIP; that said, everyone needs
to read through the KIP wiki page for details.

Based on the context, what would be the best way to notify end users about
the major version upgrade of Kafka? I can imagine several options,
including...

1. Explicitly mention in the migration guide that Spark 3.3 upgrades Kafka to
3.1, linking the notable changes in the Kafka doc.
2. Do 1 & spend more effort to read through all KIPs, check their
"Compatibility, Deprecation, and Migration Plan" sections, and enumerate
(or even summarize) the relevant KIPs in the migration guide.
3. Do 2 & actively override the default configs to be compatible with older
versions where the change of defaults in Kafka 3.0 is backward incompatible.
End users should set these configs explicitly if they want to override them
back. (A sketch of such an override follows this list.)
4. Do nothing. End users can notice the upgrade in the release note, and we
expect them to actively check the notable changes (& KIPs) in the Kafka doc.
5. Options not described above...
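
As a concrete illustration of option 3: the config change from KIP-679 is
the producer default enable.idempotence=true. Through the Kafka connector's
"kafka."-prefixed option pass-through, the override back to the old behavior
would look roughly like the sketch below, whether we apply it inside Spark
or document it for end users (the broker address, topic and checkpoint
location are hypothetical):

    // Scala sketch, assuming a streaming DataFrame `df` with the columns
    // expected by the Kafka sink (e.g. a string `value` column).
    df.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // hypothetical broker
      .option("topic", "events")                         // hypothetical topic
      .option("checkpointLocation", "/tmp/ckpt")         // hypothetical checkpoint dir
      .option("kafka.enable.idempotence", "false")       // roll back the KIP-679 default for old brokers
      .start()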

Please take a look and provide your voice on this.

Thanks,
Jungtaek Lim (HeartSaVioR)

ps. Probably this would apply to all non-bugfix dependency upgrades. We may
still want to be pragmatic, though, e.g. a pass-through for minor versions.


Re: bazel and external/

2022-03-17 Thread Jungtaek Lim
The Avro reader is technically a connector. We eventually came to call data
source implementations "connectors" as well; the package name in catalyst
reflects it.

Docker is something I'm not sure fits the name "external". It probably
deserves a top-level directory now, since we are starting to release an
official Docker image. That does not seem to be an experimental one.

Except for Docker, all modules in the external directory are "sort of"
connectors. The Ganglia metric sink is an exception, but it is still a kind
of connector for Dropwizard.
(It might be interesting to see how many users are still using the
kinesis-asl and ganglia-lgpl modules. We have had almost no updates to
DStream for several years.)

If we agree with my proposal for Docker, the rest is effectively a rename. I
don't have a strong opinion; I just wanted to avoid the external directory
becoming/remaining a miscellaneous one.

On Fri, Mar 18, 2022 at 10:04 AM Sean Owen  wrote:

> I sympathize, but might be less change to just rename the dir. There is
> more in there like the avro reader; it's kind of miscellaneous. I think we
> might want fewer rather than more top level dirs.
>
> On Thu, Mar 17, 2022 at 7:33 PM Jungtaek Lim 
> wrote:
>
>> We seem to just focus on how to avoid the conflict with the name
>> "external" used in bazel. Since we consider the possibility of renaming,
>> why not revisit the modules "external" contains?
>>
>> Looks like kinds of the modules external directory contains are 1) Docker
>> 2) Connectors 3) Sink on Dropwizard metrics (only ganglia here, and it
>> seems to be just that Ganglia is LGPL)
>>
>> Would it make sense if each kind deserves a top directory? We can
>> probably give better generalized names, and as a side-effect we will no
>> longer have "external".
>>
>> On Fri, Mar 18, 2022 at 5:45 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for posting this, Alkis.
>>>
>>> Before the question (1) and (2), I'm curious if the Apache Spark
>>> community has other downstreams using Bazel.
>>>
>>> To All. If there are some Bazel users with Apache Spark code, could you
>>> share your practice? If you are using renaming, what is your renamed
>>> directory name?
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, Mar 17, 2022 at 11:56 AM Alkis Evlogimenos
>>>  wrote:
>>>
>>>> AFAIK there is not. `external` has been baked in bazel since the
>>>> beginning and there is no plan from bazel devs to attempt to fix this
>>>> <https://github.com/bazelbuild/bazel/issues/4508#issuecomment-724055371>
>>>> .
>>>>
>>>> On Thu, Mar 17, 2022 at 7:52 PM Sean Owen  wrote:
>>>>
>>>>> Just checking - there is no way to tell bazel to look somewhere else
>>>>> for whatever 'external' means to it?
>>>>> It's a kinda big ugly change but it's not a functional change. If
>>>>> anything it might break some downstream builds that rely on the current
>>>>> structure too. But such is life for developers? I don't have a strong
>>>>> reason we can't.
>>>>>
>>>>> On Thu, Mar 17, 2022 at 1:47 PM Alkis Evlogimenos
>>>>>  wrote:
>>>>>
>>>>>> Hi Spark devs.
>>>>>>
>>>>>> The Apache Spark repo has a top level external/ directory. This is a
>>>>>> reserved name for the bazel build system and it causes all sorts of
>>>>>> problems: some can be worked around and some cannot (for some details on
>>>>>> one that cannot see
>>>>>> https://github.com/hedronvision/bazel-compile-commands-extractor/issues/30
>>>>>> ).
>>>>>>
>>>>>> Some forks of Apache Spark use bazel as a build system. It would be
>>>>>> nice if we can make this change in Apache Spark without resorting to
>>>>>> complex renames/merges whenever changes are pulled from upstream.
>>>>>>
>>>>>> As such I proposed to rename external/ directory to want to rename
>>>>>> the external/ directory to something else [SPARK-38569
>>>>>> <https://issues.apache.org/jira/browse/SPARK-38569>]. I also sent a
>>>>>> tentative [PR-35874 <https://github.com/apache/spark/pull/35874>]
>>>>>> that renames external/ to vendor/.
>>>>>>
>>>>>> My questions to you are:
>>>>>> 1. Are there any objections to renaming external to X?
>>>>>> 2. Is vendor a good new name for external?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>


Re: bazel and external/

2022-03-17 Thread Jungtaek Lim
We seem to focus only on how to avoid the conflict with the name "external"
used by bazel. Since we are considering renaming, why not revisit the
modules "external" contains?

It looks like the kinds of modules the external directory contains are 1)
Docker, 2) connectors, and 3) sinks for Dropwizard metrics (only Ganglia
here, and it seems to be there only because Ganglia is LGPL).

Would it make sense for each kind to get its own top-level directory? We
could probably come up with better, more general names, and as a side effect
we would no longer have "external".

On Fri, Mar 18, 2022 at 5:45 AM Dongjoon Hyun 
wrote:

> Thank you for posting this, Alkis.
>
> Before the question (1) and (2), I'm curious if the Apache Spark community
> has other downstreams using Bazel.
>
> To All. If there are some Bazel users with Apache Spark code, could you
> share your practice? If you are using renaming, what is your renamed
> directory name?
>
> Dongjoon.
>
>
> On Thu, Mar 17, 2022 at 11:56 AM Alkis Evlogimenos
>  wrote:
>
>> AFAIK there is not. `external` has been baked in bazel since the
>> beginning and there is no plan from bazel devs to attempt to fix this
>> .
>>
>> On Thu, Mar 17, 2022 at 7:52 PM Sean Owen  wrote:
>>
>>> Just checking - there is no way to tell bazel to look somewhere else for
>>> whatever 'external' means to it?
>>> It's a kinda big ugly change but it's not a functional change. If
>>> anything it might break some downstream builds that rely on the current
>>> structure too. But such is life for developers? I don't have a strong
>>> reason we can't.
>>>
>>> On Thu, Mar 17, 2022 at 1:47 PM Alkis Evlogimenos
>>>  wrote:
>>>
 Hi Spark devs.

 The Apache Spark repo has a top level external/ directory. This is a
 reserved name for the bazel build system and it causes all sorts of
 problems: some can be worked around and some cannot (for some details on
 one that cannot see
 https://github.com/hedronvision/bazel-compile-commands-extractor/issues/30
 ).

 Some forks of Apache Spark use bazel as a build system. It would be
 nice if we can make this change in Apache Spark without resorting to
 complex renames/merges whenever changes are pulled from upstream.

 As such I proposed to rename external/ directory to want to rename the
 external/ directory to something else [SPARK-38569
 ]. I also sent a
 tentative [PR-35874 ] that
 renames external/ to vendor/.

 My questions to you are:
 1. Are there any objections to renaming external to X?
 2. Is vendor a good new name for external?

 Cheers,

>>>


Re: Apache Spark 3.3 Release

2022-03-03 Thread Jungtaek Lim
Thanks Maxim for volunteering to drive the release! I support the plan
(March 15th) to perform a release branch cut.

Btw, would we be open to changes for critical/blocker issues after the
release branch cut? I have a blocker JIRA ticket whose PR is open for
review, but it needs some time to gain traction as well as to go through
actual reviews. My guess is yes, but I'd like to confirm.

On Fri, Mar 4, 2022 at 4:20 AM Dongjoon Hyun 
wrote:

> Thank you, Max, for volunteering for Apache Spark 3.3 release manager.
>
> Ya, I'm also +1 for the original plan.
>
> Dongjoon
>
> On Thu, Mar 3, 2022 at 10:52 AM Mridul Muralidharan 
> wrote:
>
>>
>> Agree with Sean, code freeze by mid March sounds good.
>>
>> Regards,
>> Mridul
>>
>> On Thu, Mar 3, 2022 at 12:47 PM Sean Owen  wrote:
>>
>>> I think it's fine to pursue the existing plan - code freeze in two weeks
>>> and try to close off key remaining issues. Final release pending on how
>>> those go, and testing, but fine to get the ball rolling.
>>>
>>> On Thu, Mar 3, 2022 at 12:45 PM Maxim Gekk
>>>  wrote:
>>>
 Hello All,

 I would like to bring on the table the theme about the new Spark
 release 3.3. According to the public schedule at
 https://spark.apache.org/versioning-policy.html, we planned to start
 the code freeze and release branch cut on March 15th, 2022. Since this date
 is coming soon, I would like to take your attention on the topic and gather
 objections that you might have.

 Bellow is the list of ongoing and active SPIPs:

 Spark SQL:
 - [SPARK-31357] DataSourceV2: Catalog API for view metadata
 - [SPARK-35801] Row-level operations in Data Source V2
 - [SPARK-37166] Storage Partitioned Join

 Spark Core:
 - [SPARK-20624] Add better handling for node shutdown
 - [SPARK-25299] Use remote storage for persisting shuffle data

 PySpark:
 - [SPARK-26413] RDD Arrow Support in Spark Core and PySpark

 Kubernetes:
 - [SPARK-36057] Support Customized Kubernetes Schedulers

 Probably, we should finish if there are any remaining works for Spark
 3.3, and switch to QA mode, cut a branch and keep everything on track. I
 would like to volunteer to help drive this process.

 Best regards,
 Max Gekk

>>>


Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Jungtaek Lim
If the ASF wants to do it, INFRA could probably handle it for all projects,
like the ASF code of conduct recently being exposed on the right side of all
ASF GitHub repos.

On Wed, Dec 15, 2021 at 11:49 PM Sean Owen  wrote:

> It might imply that this is a way to fund Spark alone, and it isn't.
> Probably no big deal either way but maybe not worth it. It won't be a
> mystery how to find and fund the ASF for the few orgs that want to, as
> compared to a small project
>
> On Wed, Dec 15, 2021, 8:34 AM Maciej  wrote:
>
>> Hi All,
>>
>> Just wondering ‒ would it make sense to add .github/FUNDING.yml with
>> custom link pointing to one (or both) of these:
>>
>>- https://www.apache.org/foundation/sponsorship.html
>>- https://www.apache.org/foundation/contributing.html
>>
>>
>> --
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>>
>>


Re: [Proposal] Deprecate Trigger.Once and replace with Trigger.AvailableNow

2021-12-12 Thread Jungtaek Lim
Friendly reminder: I'll submit the proposed change if no objection is raised
this week.

On Wed, Dec 8, 2021 at 4:16 PM Jungtaek Lim 
wrote:

> Hi dev,
>
> I would like to hear voices about deprecating Trigger.Once, and replacing
> it with Trigger.AvailableNow [1] in Structured Streaming.
>
> Rationalization:
>
> The expected behavior of Trigger.Once is like reading all available data
> after the last trigger and processing them. This holds true when the last
> run was gracefully terminated, but there are cases streaming queries to not
> be terminated gracefully. There is a possibility the last run may write the
> offset (WAL) for the new batch before termination, then a new run of
> Trigger.Once only processes the data which was built in the latest
> unfinished batch, and doesn't process new data.
>
> The behavior is not deterministic from the users' point of view, as end
> users wouldn't know whether the last run wrote the offset or not, unless
> they look into the query's checkpoint by themselves.
>
> While Trigger.AvailableNow came to solve the scalability issue on
> Trigger.Once, it also ensures that it tries to process all available data
> at the point of time it is triggered, which consistently works as expected
> behavior of Trigger.Once.
>
> Proposed Plan:
>
> - Deprecate Trigger.Once in Apache Spark 3.3
> - Leave guidance to migrate to Trigger.AvailableNow in migration guide
> - Replace all usages of Trigger.Once with Trigger.AvailableNow, except the
> test cases of Trigger.Once itself
>
> Please review the proposal and share your voice on this.
>
> Thanks!
> Jungtaek Lim
>
> 1. https://issues.apache.org/jira/browse/SPARK-36533
>


[Proposal] Deprecate Trigger.Once and replace with Trigger.AvailableNow

2021-12-07 Thread Jungtaek Lim
Hi dev,

I would like to hear voices about deprecating Trigger.Once, and replacing
it with Trigger.AvailableNow [1] in Structured Streaming.

Rationalization:

The expected behavior of Trigger.Once is to read all data available since
the last trigger and process it. This holds true when the last run was
gracefully terminated, but there are cases where streaming queries are not
terminated gracefully. The last run may have written the offset (WAL) for a
new batch before termination; in that case a new run with Trigger.Once only
processes the data planned for that latest unfinished batch and doesn't
process new data.

The behavior is not deterministic from the users' point of view, as end
users wouldn't know whether the last run wrote the offset or not unless
they look into the query's checkpoint themselves.

While Trigger.AvailableNow was introduced to solve the scalability issue of
Trigger.Once, it also ensures that the query tries to process all data
available at the point in time it is triggered, which consistently matches
the expected behavior of Trigger.Once.
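
For illustration, the migration is a one-line change in user code; a minimal
sketch (the sink format, output path and checkpoint location below are
hypothetical):

    // Scala sketch, assuming a streaming DataFrame `df`.
    import org.apache.spark.sql.streaming.Trigger

    df.writeStream
      .format("parquet")
      .option("path", "/data/out")                  // hypothetical output path
      .option("checkpointLocation", "/data/ckpt")   // hypothetical checkpoint location
      .trigger(Trigger.AvailableNow())              // previously: .trigger(Trigger.Once())
      .start()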

Proposed Plan:

- Deprecate Trigger.Once in Apache Spark 3.3
- Leave guidance to migrate to Trigger.AvailableNow in migration guide
- Replace all usages of Trigger.Once with Trigger.AvailableNow, except the
test cases of Trigger.Once itself

Please review the proposal and share your voice on this.

Thanks!
Jungtaek Lim

1. https://issues.apache.org/jira/browse/SPARK-36533


Re: Time for Spark 3.2.1?

2021-12-07 Thread Jungtaek Lim
+1 for both releases and the time!

On Wed, Dec 8, 2021 at 3:46 PM Mridul Muralidharan  wrote:

>
> +1 for maintenance release, and also +1 for doing this in Jan !
>
> Thanks,
> Mridul
>
> On Tue, Dec 7, 2021 at 11:41 PM Gengliang Wang  wrote:
>
>> +1 for new maintenance releases for all 3.x branches as well.
>>
>> On Wed, Dec 8, 2021 at 8:19 AM Hyukjin Kwon  wrote:
>>
>>> SGTM!
>>>
>>> On Wed, 8 Dec 2021 at 09:07, huaxin gao  wrote:
>>>
 I prefer to start rolling the release in January if there is no need to
 publish it sooner :)

 On Tue, Dec 7, 2021 at 3:59 PM Hyukjin Kwon 
 wrote:

> Oh BTW, I realised that it's a holiday season soon this month
> including Christmas and new year.
> Shall we maybe start rolling the release around next January? I would
> leave it to @huaxin gao  :-).
>
> On Wed, 8 Dec 2021 at 06:19, Dongjoon Hyun 
> wrote:
>
>> +1 for new releases.
>>
>> Dongjoon.
>>
>> On Mon, Dec 6, 2021 at 8:51 PM Wenchen Fan 
>> wrote:
>>
>>> +1 to make new maintenance releases for all 3.x branches.
>>>
>>> On Tue, Dec 7, 2021 at 8:57 AM Sean Owen  wrote:
>>>
 Always fine by me if someone wants to roll a release.

 It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a
 new release of those wouldn't hurt either, if any of our release 
 managers
 have the time or inclination. 3.0.x is reaching unofficial end-of-life
 around now anyway.


 On Mon, Dec 6, 2021 at 6:55 PM Hyukjin Kwon 
 wrote:

> Hi all,
>
> It's been two months since Spark 3.2.0 release, and we have
> resolved many bug fixes and regressions. What do you guys think about
> rolling Spark 3.2.1 release?
>
> cc @huaxin gao  FYI who I happened to
> overhear that is interested in rolling the maintenance release :-).
>



Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Jungtaek Lim
Thanks for all the hard work you have been doing, Shane!

On Tue, Dec 7, 2021 at 2:17 PM Nick Pentreath 
wrote:

> Wow! end of an era
>
> Thanks so much to you Shane for all you work over 10 (!!) years. And to
> Amplab also!
>
> Farewell Spark Jenkins!
>
> N
>
> On Tue, Dec 7, 2021 at 6:49 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Farewell to Jenkins and its classic weather forecast build status icons:
>>
>> [image: health-80plus.png][image: health-60to79.png][image:
>> health-40to59.png][image: health-20to39.png][image: health-00to19.png]
>>
>> And thank you Shane for all the help over these years.
>>
>> Will you be nuking all the Jenkins-related code in the repo after the
>> 23rd?
>>
>> On Mon, Dec 6, 2021 at 3:02 PM shane knapp ☠  wrote:
>>
>>> hey everyone!
>>>
>>> after a marathon run of nearly a decade, we're finally going to be
>>> shutting down {amp|rise}lab jenkins at the end of this month...
>>>
>>> the earliest snapshot i could find is from 2013 with builds for spark
>>> 0.7:
>>>
>>> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>>>
>>> it's been a hell of a run, and i'm gonna miss randomly tweaking the
>>> build system, but technology has moved on and running a dedicated set of
>>> servers for just one open source project is just too expensive for us here
>>> at uc berkeley.
>>>
>>> if there's interest, i'll fire up a zoom session and all y'alls can
>>> watch me type the final command:
>>>
>>> systemctl stop jenkins
>>>
>>> feeling bittersweet,
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>


Re: Update Spark 3.3 release window?

2021-10-28 Thread Jungtaek Lim
+1 for mid-March 2022.

+1 for EOL 2.x as well. I guess we did it already according to
Dongjoon's quote from the Spark website.

On Fri, Oct 29, 2021 at 3:49 AM Dongjoon Hyun 
wrote:

> +1 for mid March for Spark 3.3.
>
> For 2.4, our document already mentioned its EOL like
>
> " For example, 2.4.0 was released in November 2nd 2018 and had been
> maintained for 31 months until 2.4.8 was released on May 2021. 2.4.8 is the
> last release and no more 2.4.x releases should be expected even for bug
> fixes."
>
> Do we need somthing more explicit?
>
> Anyway, I'm +1 for that too if needed.
>
> Dongjoon
>
> On Thu, Oct 28, 2021 at 8:07 AM Gengliang Wang  wrote:
>
>> +1, Mid-March 2022 sounds good.
>>
>>
>> Gengliang
>>
>> On Thu, Oct 28, 2021 at 10:54 PM Tom Graves 
>> wrote:
>>
>>> +1 for updating, mid march sounds good.  I'm also fine with EOL 2.x.
>>>
>>> Tom
>>>
>>> On Thursday, October 28, 2021, 09:37:00 AM CDT, Mridul Muralidharan <
>>> mri...@gmail.com> wrote:
>>>
>>>
>>>
>>> +1 to EOL 2.x
>>> Mid march sounds like a good placeholder for 3.3.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Wed, Oct 27, 2021 at 10:38 PM Sean Owen  wrote:
>>>
>>> Seems fine to me - as good a placeholder as anything.
>>> Would that be about time to call 2.x end-of-life?
>>>
>>> On Wed, Oct 27, 2021 at 9:36 PM Hyukjin Kwon 
>>> wrote:
>>>
>>> Hi all,
>>>
>>> Spark 3.2. is out. Shall we update the release window
>>> https://spark.apache.org/versioning-policy.html?
>>> I am thinking of Mid March 2022 (5 months after the 3.2 release) for
>>> code freeze and onward.
>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Jungtaek Lim
Thanks to Gengliang for driving this huge release!

On Wed, Oct 20, 2021 at 1:50 AM Dongjoon Hyun 
wrote:

> Thank you so much, Gengliang and all!
>
> Dongjoon.
>
> On Tue, Oct 19, 2021 at 8:48 AM Xiao Li  wrote:
>
>> Thank you, Gengliang!
>>
>> Congrats to our community and all the contributors!
>>
>> Xiao
>>
>> Henrik Peng wrote on Tue, Oct 19, 2021 at 8:26 AM:
>>
>>> Congrats and thanks!
>>>
>>>
>>> Gengliang Wang wrote on Tue, Oct 19, 2021 at 10:16 PM:
>>>
 Hi all,

 Apache Spark 3.2.0 is the third release of the 3.x line. With
 tremendous contribution from the open-source community, this release
 managed to resolve in excess of 1,700 Jira tickets.

 We'd like to thank our contributors and users for their contributions
 and early feedback to this release. This release would not have been
 possible without you.

 To download Spark 3.2.0, head over to the download page:
 https://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-3-2-0.html

>>>


Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-06-24 Thread Jungtaek Lim
Meta question: this doesn't target Spark 3.2, right? Many folks have been
working on the branch cut for Spark 3.2, so they might be less active in
jumping into new feature proposals right now.

On Fri, Jun 25, 2021 at 9:00 AM Holden Karau  wrote:

> I took an initial look at the PRs this morning and I’ll go through the
> design doc in more detail but I think these features look great. It’s
> especially important with the CA regulation changes to make this easier for
> folks to implement.
>
> On Thu, Jun 24, 2021 at 4:54 PM Anton Okolnychyi 
> wrote:
>
>> Hey everyone,
>>
>> I'd like to start a discussion on adding support for executing row-level
>> operations such as DELETE, UPDATE, MERGE for v2 tables (SPARK-35801). The
>> execution should be the same across data sources and the best way to do
>> that is to implement it in Spark.
>>
>> Right now, Spark can only parse and to some extent analyze DELETE,
>> UPDATE, MERGE commands. Data sources that support row-level changes have to
>> build custom Spark extensions to execute such statements. The goal of this
>> effort is to come up with a flexible and easy-to-use API that will work
>> across data sources.
>>
>> Design doc:
>>
>> https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60/
>>
>> PR for handling DELETE statements:
>> https://github.com/apache/spark/pull/33008
>>
>> Any feedback is more than welcome.
>>
>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>
>> - Anton
>>
>>
>>
>>
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-21 Thread Jungtaek Lim
+1 (non-binding) Thanks for your efforts!

On Mon, Jun 21, 2021 at 2:40 PM Kent Yao  wrote:

> +1 (non-binding)
>
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark .*
> *spark-authorizer A Spark
> SQL extension which provides SQL Standard Authorization for **Apache
> Spark .*
> *spark-postgres  A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.*
> *itatchi A** library t**hat
> brings useful functions from various modern database management systems to 
> **Apache
> Spark .*
>
>
>
> On 06/21/2021 13:40,Hyukjin Kwon
>  wrote:
>
> +1
>
> 2021년 6월 21일 (월) 오후 2:19, Dongjoon Hyun 님이 작성:
>
>> +1
>>
>> Thank you, Yi.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Sat, Jun 19, 2021 at 6:57 PM Yuming Wang  wrote:
>>
>>> +1
>>>
>>> Tested a batch of production query with Thrift Server.
>>>
>>> On Sat, Jun 19, 2021 at 3:04 PM Mridul Muralidharan 
>>> wrote:
>>>

 +1

 Signatures, digests, etc check out fine.
 Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Pmesos
 -Pkubernetes

 Regards,
 Mridul

 PS: Might be related to some quirk of my local env - the first test run
 (after clean + package) usually fails for me (typically for hive tests) -
 with a second run succeeding : this is not specific to this RC though.

 On Fri, Jun 18, 2021 at 6:14 PM Liang-Chi Hsieh 
 wrote:

> +1. Docs looks good. Binary looks good.
>
> Ran simple test and some tpcds queries.
>
> Thanks for working on this!
>
>
> wuyi wrote
> > Please vote on releasing the following candidate as Apache Spark
> version
> > 3.0.3.
> >
> > The vote is open until Jun 21th 3AM (PST) and passes if a majority
> +1 PMC
> > votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.0.3
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> https://spark.apache.org/
> >
> > The tag to be voted on is v3.0.3-rc1 (commit
> > 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8):
> > https://github.com/apache/spark/tree/v3.0.3-rc1
> >
> > The release files, including signatures, digests, etc. can be found
> at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> >
> https://repository.apache.org/content/repositories/orgapachespark-1386/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-docs/
> >
> > The list of bug fixes going into 3.0.3 can be found at the following
> URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12349723
> >
> > This release is using the release script of the tag v3.0.3-rc1.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate,
> then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the
> Java/Scala
> > you can add the staging repository to your projects resolvers and
> test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with a out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.0.3?
> > ===
> >
> > The current list of open tickets targeted at 3.0.3 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> > Version/s" = 3.0.3
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==

Re: Apache Spark 3.0.3 Release?

2021-06-09 Thread Jungtaek Lim
Late +1 Thanks!

On Thu, Jun 10, 2021 at 12:06 PM Yi Wu  wrote:

> Thanks all, I'll start the RC soon.
>
> On Wed, Jun 9, 2021 at 7:07 PM Gengliang Wang  wrote:
>
>> +1, thanks Yi
>>
>> Gengliang Wang
>>
>>
>>
>>
>> On Jun 9, 2021, at 6:03 PM, 郑瑞峰  wrote:
>>
>> +1, thanks Yi
>>
>>
>>


Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-02 Thread Jungtaek Lim
Nice! Thanks Dongjoon for your amazing efforts!

On Wed, Jun 2, 2021 at 2:59 PM Liang-Chi Hsieh  wrote:

> Thank you, Dongjoon!
>
>
>
> Takeshi Yamamuro wrote
> > Thank you, Dongjoon!
> >
> > On Wed, Jun 2, 2021 at 2:29 PM Xiao Li 
>
> > lixiao@
>
> >  wrote:
> >
> >> Thank you!
> >>
> >> Xiao
> >>
> >> On Tue, Jun 1, 2021 at 9:29 PM Hyukjin Kwon 
>
> > gurwls223@
>
> >  wrote:
> >>
> >>> awesome!
> >>>
> >>> 2021년 6월 2일 (수) 오전 9:59, Dongjoon Hyun 
>
> > dongjoon.hyun@
>
> > 님이 작성:
> >>>
>  We are happy to announce the availability of Spark 3.1.2!
> 
>  Spark 3.1.2 is a maintenance release containing stability fixes. This
>  release is based on the branch-3.1 maintenance branch of Spark. We
>  strongly
>  recommend all 3.1 users to upgrade to this stable release.
> 
>  To download Spark 3.1.2, head over to the download page:
>  https://spark.apache.org/downloads.html
> 
>  To view the release notes:
>  https://spark.apache.org/releases/spark-release-3-1-2.html
> 
>  We would like to acknowledge all community members for contributing to
>  this
>  release. This release would not have been possible without you.
> 
>  Dongjoon Hyun
> 
> >>>
> >>
> >> --
> >>
> >>
> >
> > --
> > ---
> > Takeshi Yamamuro
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Apache Spark 3.1.2 Release?

2021-05-18 Thread Jungtaek Lim
Late +1 here as well, thanks for volunteering!

2021년 5월 19일 (수) 오전 11:24, 郑瑞峰 님이 작성:

> late +1. thanks Dongjoon!
>
>
> -- 原始邮件 --
> *发件人:* "Dongjoon Hyun" ;
> *发送时间:* 2021年5月19日(星期三) 凌晨1:29
> *收件人:* "Wenchen Fan";
> *抄送:* "Xiao Li";"Kent Yao";"John
> Zhuge";"Hyukjin Kwon";"Holden
> Karau";"Takeshi Yamamuro" >;"dev";"Yuming Wang";
> *主题:* Re: Apache Spark 3.1.2 Release?
>
> Thank you all! I'll start to prepare.
>
> Bests,
> Dongjoon.
>
> On Tue, May 18, 2021 at 12:53 AM Wenchen Fan  wrote:
>
>> +1, thanks!
>>
>> On Tue, May 18, 2021 at 1:37 PM Xiao Li  wrote:
>>
>>> +1 Thanks, Dongjoon!
>>>
>>> Xiao
>>>
>>>
>>>
>>> On Mon, May 17, 2021 at 8:45 PM Kent Yao  wrote:
>>>
 +1. thanks Dongjoon

 *Kent Yao *
 @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
 *a spark enthusiast*
 *kyuubi is a
 unified multi-tenant JDBC interface for large-scale data processing and
 analytics, built on top of Apache Spark .*
 *spark-authorizer A Spark
 SQL extension which provides SQL Standard Authorization for **Apache
 Spark .*
 *spark-postgres  A library
 for reading data from and transferring data to Postgres / Greenplum with
 Spark SQL and DataFrames, 10~100x faster.*
 *itatchi A** library t**hat
 brings useful functions from various modern database management systems to 
 **Apache
 Spark .*



 On 05/18/2021 10:57,John Zhuge 
 wrote:

 +1, thanks Dongjoon!

 On Mon, May 17, 2021 at 7:50 PM Yuming Wang  wrote:

> +1.
>
> On Tue, May 18, 2021 at 9:06 AM Hyukjin Kwon 
> wrote:
>
>> +1 thanks for driving me
>>
>> On Tue, 18 May 2021, 09:33 Holden Karau, 
>> wrote:
>>
>>> +1 and thanks for volunteering to be the RM :)
>>>
>>> On Mon, May 17, 2021 at 4:09 PM Takeshi Yamamuro <
>>> linguin@gmail.com> wrote:
>>>
 Thank you, Dongjoon~ sgtm, too.

 On Tue, May 18, 2021 at 7:34 AM Cheng Su 
 wrote:

> +1 for a new release, thanks Dongjoon!
>
> Cheng Su
>
> On 5/17/21, 2:44 PM, "Liang-Chi Hsieh"  wrote:
>
> +1 sounds good. Thanks Dongjoon for volunteering on this!
>
>
> Liang-Chi
>
>
> Dongjoon Hyun-2 wrote
> > Hi, All.
> >
> > Since Apache Spark 3.1.1 tag creation (Feb 21),
> > new 172 patches including 9 correctness patches and 4 K8s
> patches arrived
> > at branch-3.1.
> >
> > Shall we make a new release, Apache Spark 3.1.2, as the
> second release at
> > 3.1 line?
> > I'd like to volunteer for the release manager for Apache
> Spark 3.1.2.
> > I'm thinking about starting the first RC next week.
> >
> > $ git log --oneline v3.1.1..HEAD | wc -l
> >  172
> >
> > # Known correctness issues
> > SPARK-34534 New protocol FetchShuffleBlocks in
> OneForOneBlockFetcher
> > lead to data loss or correctness
> > SPARK-34545 PySpark Python UDF return inconsistent
> results when
> > applying 2 UDFs with different return type to 2 columns
> together
> > SPARK-34681 Full outer shuffled hash join when building
> left side
> > produces wrong result
> > SPARK-34719 fail if the view query has duplicated column
> names
> > SPARK-34794 Nested higher-order functions broken in DSL
> > SPARK-34829 transform_values return identical values
> when it's used
> > with udf that returns reference type
> > SPARK-34833 Apply right-padding correctly for correlated
> subqueries
> > SPARK-35381 Fix lambda variable name issues in nested
> DataFrame
> > functions in R APIs
> > SPARK-35382 Fix lambda variable name issues in nested
> DataFrame
> > functions in Python APIs
> >
> > # Notable K8s patches since K8s GA
> > SPARK-34674Close SparkContext after the Main method has
> finished
> > SPARK-34948Add ownerReference to executor configmap to
> fix leakages
> > SPARK-34820add apt-update before gnupg install
> > SPARK-34361In case of downscaling avoid killing of
> executors already
> > known by the scheduler backend in the pod allocator
> 

Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-18 Thread Jungtaek Lim
Thanks for the huge efforts on driving the release!

On Tue, May 18, 2021 at 4:53 PM Wenchen Fan  wrote:

> Thank you, Liang-Chi!
>
> On Tue, May 18, 2021 at 1:32 PM Dongjoon Hyun 
> wrote:
>
>> Finally! Thank you, Liang-Chi.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Mon, May 17, 2021 at 10:14 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Thank you for the release work, Liang-Chi~
>>>
>>> On Tue, May 18, 2021 at 2:11 PM Hyukjin Kwon 
>>> wrote:
>>>
 Yay!

 2021년 5월 18일 (화) 오후 12:57, Liang-Chi Hsieh 님이 작성:

> We are happy to announce the availability of Spark 2.4.8!
>
> Spark 2.4.8 is a maintenance release containing stability,
> correctness, and
> security fixes.
> This release is based on the branch-2.4 maintenance branch of Spark. We
> strongly recommend all 2.4 users to upgrade to this stable release.
>
> To download Spark 2.4.8, head over to the download page:
> http://spark.apache.org/downloads.html
>
> Note that you might need to clear your browser cache or to use
> `Private`/`Incognito` mode according to your browsers.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-4-8.html
>
> We would like to acknowledge all community members for contributing to
> this
> release. This release would not have been possible without you.
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: [DISCUSS] Add RocksDB StateStore

2021-04-27 Thread Jungtaek Lim
I think adding the RocksDB state store to sql/core directly would be OK.
Personally, I also voted "either way is fine with me" regarding a RocksDB
state store implementation in the Spark ecosystem. That overall stance hasn't
changed, but I'd like to point out that the risk is now considerably lower
than before, given that we can leverage Databricks' RocksDB state store
implementation.

I feel there were two major reasons to add the RocksDB state store to an
external module:

1. stability

Databricks' RocksDB state store implementation has been supported for years,
so it won't require more time to incubate. We may want to review it
thoughtfully to ensure the open-sourced proposal fits Apache Spark and still
retains its stability, but this is much better than the previously considered
candidates, which may not have been tested in production for years.

That makes me think we don't have to put it into the external module and
treat it as experimental.

2. dependency

From Yuanjian's mail, the JNI library is the only dependency, which seems
fine to add by default. We already have LevelDB as one of the core
dependencies and aren't too concerned about a JNI library dependency. Someone
might even find that there are outstanding benefits to replacing LevelDB with
RocksDB, and then RocksDB could become one of the core dependencies.
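
For end users, opting into the new provider should then be a one-line
configuration change, roughly like the sketch below (the provider class name
is an assumption based on this proposal and may differ once the code is
merged; the config key itself already exists today):

// Hypothetical sketch: selecting a RocksDB-backed state store provider for
// Structured Streaming queries. The class name below assumes the donated
// provider lands in sql/core under that name.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rocksdb-state-store-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()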

On Tue, Apr 27, 2021 at 6:41 PM Yuanjian Li  wrote:

> Hi all,
>
> Following the latest comments in SPARK-34198
> , Databricks decided
> to donate the commercial implementation of the RocksDBStateStore. Compared
> with the original decision, there’s only one topic we want to raise again
> for discussion: can we directly add the RockDBStateStoreProvider in the
> sql/core module? This suggestion based on the following reasons:
>
>1.
>
>The RocksDBStateStore aims to solve the problem of the original
>HDFSBasedStateStore, which is built-in.
>2.
>
>End users can conveniently set the config to use the new
>implementation.
>3.
>
>We can set the RocksDB one as the default one in the future.
>
>
> For the consideration of the dependency, I also checked the rocksdbjni we
> might introduce. As a JNI package
> ,
> it should not have any dependency conflicts with Apache Spark.
>
> Any suggestions are welcome!
>
> Best,
>
> Yuanjian
>
> Reynold Xin  于2021年2月14日周日 上午6:54写道:
>
>> Late +1
>>
>>
>> On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh 
>> wrote:
>>
>>> Hi devs,
>>>
>>> Thanks for all the inputs. I think overall there are positive inputs in
>>> Spark community about having RocksDB state store as external module. Then
>>> let's go forward with this direction and to improve structured streaming. I
>>> will keep update to the JIRA SPARK-34198.
>>>
>>> Thanks all again for the inputs and discussion.
>>>
>>> Liang-Chi Hsieh
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> - To
>>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>


Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-13 Thread Jungtaek Lim
+1 (non-binding)

signature OK, extracting tgz files OK, build source without running tests
OK.

On Tue, Apr 13, 2021 at 5:02 PM Herman van Hovell 
wrote:

> +1
>
> On Tue, Apr 13, 2021 at 2:40 AM sarutak  wrote:
>
>> +1 (non-binding)
>>
>> > +1
>> >
>> > On Tue, 13 Apr 2021, 02:58 Sean Owen,  wrote:
>> >
>> >> +1 same result as last RC for me.
>> >>
>> >> On Mon, Apr 12, 2021, 12:53 AM Liang-Chi Hsieh 
>> >> wrote:
>> >>
>> >>> Please vote on releasing the following candidate as Apache Spark
>> >>> version
>> >>> 2.4.8.
>> >>>
>> >>> The vote is open until Apr 15th at 9AM PST and passes if a
>> >>> majority +1 PMC
>> >>> votes are cast, with a minimum of 3 +1 votes.
>> >>>
>> >>> [ ] +1 Release this package as Apache Spark 2.4.8
>> >>> [ ] -1 Do not release this package because ...
>> >>>
>> >>> To learn more about Apache Spark, please see
>> >>> http://spark.apache.org/
>> >>>
>> >>> There are currently no issues targeting 2.4.8 (try project = SPARK
>> >>> AND
>> >>> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In
>> >>> Progress"))
>> >>>
>> >>> The tag to be voted on is v2.4.8-rc2 (commit
>> >>> a0ab27ca6b46b8e5a7ae8bb91e30546082fc551c):
>> >>> https://github.com/apache/spark/tree/v2.4.8-rc2
>> >>>
>> >>> The release files, including signatures, digests, etc. can be
>> >>> found at:
>> >>> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-bin/
>> >>>
>> >>> Signatures used for Spark RCs can be found in this file:
>> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >>>
>> >>> The staging repository for this release can be found at:
>> >>>
>> >>
>> > https://repository.apache.org/content/repositories/orgapachespark-1373/
>> >>>
>> >>> The documentation corresponding to this release can be found at:
>> >>> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-docs/
>> >>>
>> >>> The list of bug fixes going into 2.4.8 can be found at the
>> >>> following URL:
>> >>> https://s.apache.org/spark-v2.4.8-rc2
>> >>>
>> >>> This release is using the release script of the tag v2.4.8-rc2.
>> >>>
>> >>> FAQ
>> >>>
>> >>> =
>> >>> How can I help test this release?
>> >>> =
>> >>>
>> >>> If you are a Spark user, you can help us test this release by
>> >>> taking
>> >>> an existing Spark workload and running on this release candidate,
>> >>> then
>> >>> reporting any regressions.
>> >>>
>> >>> If you're working in PySpark you can set up a virtual env and
>> >>> install
>> >>> the current RC and see if anything important breaks, in the
>> >>> Java/Scala
>> >>> you can add the staging repository to your projects resolvers and
>> >>> test
>> >>> with the RC (make sure to clean up the artifact cache before/after
>> >>> so
>> >>> you don't end up building with an out of date RC going forward).
>> >>>
>> >>> ===
>> >>> What should happen to JIRA tickets still targeting 2.4.8?
>> >>> ===
>> >>>
>> >>> The current list of open tickets targeted at 2.4.8 can be found
>> >>> at:
>> >>> https://issues.apache.org/jira/projects/SPARK and search for
>> >>> "Target
>> >>> Version/s" = 2.4.8
>> >>>
>> >>> Committers should look at those and triage. Extremely important
>> >>> bug
>> >>> fixes, documentation, and API tweaks that impact compatibility
>> >>> should
>> >>> be worked on immediately. Everything else please retarget to an
>> >>> appropriate release.
>> >>>
>> >>> ==
>> >>> But my bug isn't fixed?
>> >>> ==
>> >>>
>> >>> In order to make timely releases, we will typically not hold the
>> >>> release unless the bug in question is a regression from the
>> >>> previous
>> >>> release. That being said, if there is something which is a
>> >>> regression
>> >>> that has not been correctly targeted please ping me or a committer
>> >>> to
>> >>> help target the issue.
>> >>>
>> >>> --
>> >>> Sent from:
>> >>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>> >>>
>> >>>
>> >>
>> > -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Jungtaek Lim
Congrats all!

2021년 3월 27일 (토) 오전 6:56, Liang-Chi Hsieh 님이 작성:

> Congrats! Welcome!
>
>
> Matei Zaharia wrote
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers. Please join
> me
> > in welcoming them to their new role! Our new committers are:
> >
> > - Maciej Szymkiewicz (contributor to PySpark)
> > - Max Gekk (contributor to Spark SQL)
> > - Kent Yao (contributor to Spark SQL)
> > - Attila Zsolt Piros (contributor to decommissioning and Spark on
> > Kubernetes)
> > - Yi Wu (contributor to Spark Core and SQL)
> > - Gabor Somogyi (contributor to Streaming and security)
> >
> > All six of them contributed to Spark 3.1 and we’re very excited to have
> > them join as committers.
> >
> > Matei and the Spark PMC
> > -
> > To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Checkpointing in Spark Structured Streaming

2021-03-22 Thread Jungtaek Lim
One more thing I missed: the commit metadata for batch N must be written
"after" all other parts of the checkpoint for batch N have been successfully
written.

So you seem to be looking for a way to do an asynchronous commit in a "custom
state store provider" - as I commented before, the commit is tied to the task
lifecycle, which means you're no longer able to fail the task once you make
the state store commit asynchronous. There's a rough idea for how to do it,
but it would require a Spark code change in the checkpoint commit phase - the
driver would need to check the status of the state store commits for all
stateful partitions and write the commit only when all of them are
successful. In detail, the implementation could be somewhat complicated.
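
To make the rough idea a bit more concrete, the driver-side check could
conceptually look like the sketch below. This is hypothetical only - the
types and the commitBatch helper are made up for illustration and are not
Spark APIs:

// Hypothetical coordinator sketch (not a Spark API): the driver collects the
// async state store commit results per stateful partition and writes the
// commit log entry for batch N only after every partition reports success.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

case class PartitionCommit(partitionId: Int, batchId: Long, success: Boolean)

def commitBatch(batchId: Long, pending: Seq[Future[PartitionCommit]]): Boolean = {
  // Wait for all asynchronous state store commits belonging to this batch.
  val results = Await.result(Future.sequence(pending), 5.minutes)
  // Only when every partition succeeded is it safe to write the commit metadata.
  results.forall(r => r.batchId == batchId && r.success)
}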

On Tue, Mar 23, 2021 at 5:56 AM Rohit Agrawal  wrote:

> Thank you for the reply. For our use case, it's okay to not have
> exactly-once semantics. Given this use case of not needing exactly-once
> a) Is there any negative implications if one were to use a custom state
> store provider which asynchronously committed under the hood
> b) Is there any other option to achieve this without using a custom state
> store provider ?
>
> Rohit
>
> On Mon, Mar 22, 2021 at 4:09 PM Jungtaek Lim 
> wrote:
>
>> I see some points making async checkpoint be tricky to add in
>> micro-batch; one example is "end to end exactly-once", as the commit phase
>> in sink for the batch N can be run "after" the batch N + 1 has been started
>> and write for batch N + 1 can happen before committing batch N. state store
>> checkpoint is tied to task lifecycle instead of checkpoint phase, which is
>> also tricky to make it be async.
>>
>> There may be still some spots to optimize on checkpointing though, one
>> example is SPARK-34383 [1]. I've figured out it helps to reduce latency on
>> checkpointing with object stores by 300+ ms per batch.
>>
>> Btw, even though S3 is now strongly consistent, it doesn't mean it's HDFS
>> compatible as default implementation of SS checkpoint requires. Atomic
>> rename is not supported, as well as rename isn't just a change on metadata
>> (read from S3 and write to S3 again). Performance would be sub-optimal, and
>> Spark no longer be able to prevent concurrent streaming queries trying to
>> update to the same checkpoint which might possibly mess up the checkpoint.
>> You need to make sure there's only one streaming query running against a
>> specific checkpoint.
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-34383
>>
>> On Tue, Mar 23, 2021 at 1:55 AM Rohit Agrawal  wrote:
>>
>>> Hi,
>>>
>>> I have been experimenting with the Continuous mode and the Micro batch
>>> mode in Spark Structured Streaming. When enabling checkpoint to S3 instead
>>> of the local File System we see that Continuous mode has no change in
>>> latency (expected due to async checkpointing) however the Micro-batch mode
>>> experiences degradation likely due to sync checkpointing.
>>>
>>> Is there any way to get async checkpointing in the micro-batching mode
>>> as well to improve latency. Could that be done with custom checkpointing
>>> logic ? Any pointers / experiences in that direction would be helpful.
>>>
>>


Re: Checkpointing in Spark Structured Streaming

2021-03-22 Thread Jungtaek Lim
I see a few points that make async checkpointing tricky to add in micro-batch
mode; one example is "end-to-end exactly-once": the sink's commit phase for
batch N can run "after" batch N + 1 has started, so writes for batch N + 1
can happen before batch N is committed. The state store checkpoint is also
tied to the task lifecycle instead of the checkpoint phase, which likewise
makes it tricky to run asynchronously.

There may still be some spots to optimize in checkpointing, though; one
example is SPARK-34383 [1]. I've found it reduces checkpointing latency with
object stores by 300+ ms per batch.

Btw, even though S3 is now strongly consistent, that doesn't make it
HDFS-compatible, which the default implementation of the SS checkpoint
requires. Atomic rename is not supported, and rename isn't just a metadata
change (it reads from S3 and writes back to S3). Performance would be
sub-optimal, and Spark would no longer be able to prevent concurrent
streaming queries from updating the same checkpoint, which could corrupt it.
You need to make sure there's only one streaming query running against a
specific checkpoint.

1. https://issues.apache.org/jira/browse/SPARK-34383
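
For clarity, the checkpoint location in question is just the writeStream
option shown below; whether it points at HDFS-compatible storage or an object
store like S3 is what drives the trade-offs above (the s3a path and "df" are
placeholders for a real bucket and a streaming DataFrame):

// Sketch only: a micro-batch query checkpointing to an object store path.
// The caveats above (no atomic rename, no protection against concurrent
// queries reusing the same checkpoint) apply to such paths.
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/my-query")
  .start()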

On Tue, Mar 23, 2021 at 1:55 AM Rohit Agrawal  wrote:

> Hi,
>
> I have been experimenting with the Continuous mode and the Micro batch
> mode in Spark Structured Streaming. When enabling checkpoint to S3 instead
> of the local File System we see that Continuous mode has no change in
> latency (expected due to async checkpointing) however the Micro-batch mode
> experiences degradation likely due to sync checkpointing.
>
> Is there any way to get async checkpointing in the micro-batching mode as
> well to improve latency. Could that be done with custom checkpointing logic
> ? Any pointers / experiences in that direction would be helpful.
>


Re: Determine global watermark via StreamingQueryProgress eventTime watermark String

2021-03-16 Thread Jungtaek Lim
There was a similar question (but another approach) and I've explained the
current status a bit.

https://lists.apache.org/thread.html/r89a61a10df71ccac132ce5d50b8fe405635753db7fa2aeb79f82fb77%40%3Cuser.spark.apache.org%3E

I guess this would answer your question as well. At least for now, Spark
doesn't expose the current watermark of a specific micro-batch at the user
level; it's abstracted away. I'm also not sure that knowing the exact global
watermark "outside" of the query could affect the running query.

If there's strong demand, we could probably consider adding a function that
provides the current watermark. That said, I guess producing dropped events
via a side output (if that's not too hard to do) is something we'd favor over
exposing the current watermark and letting users handle late events
themselves.
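
For reference, the value that is currently exposed can be read from the
(stable) StreamingQueryListener as in the minimal sketch below; note that it
reflects the watermark reported for the finished micro-batch, not necessarily
the one the next batch will use ("spark" is assumed to be an active
SparkSession):

// Minimal sketch: logging the reported watermark per micro-batch. This only
// surfaces what Spark already reports; it does not change which events get
// dropped.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // "watermark" is an ISO-8601 timestamp string, e.g. 2021-03-16T00:00:00.000Z
    val watermark = Option(event.progress.eventTime.get("watermark"))
    println(s"batch=${event.progress.batchId} watermark=$watermark")
  }
})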


On Wed, Mar 17, 2021 at 1:20 AM dwichman 
wrote:

> Hi Spark Developers,
>
> Is it possible to reliably determine the current global watermark that is
> being used in a streaming query via StreamingQueryProgress.onQueryProgress
> eventTime watermark String?
>
>
> https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/streaming/StreamingQueryProgress.html
>
> The intention would be to precede the query watermark function with
> something like a map function and compare event times with the assumed
> global watermark to determine if the event will be dropped (i.e. too late).
>
> If StreamingQueryProgress.onQueryProgress eventTime watermark does not
> accurately reflect the current global watermark, is there another way to
> reliably determine it?
>
> Thanks for your help.
>
> -Derek
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Observable Metrics on Spark Datasets

2021-03-16 Thread Jungtaek Lim
Please follow up on the discussion in the original PR:
https://github.com/apache/spark/pull/26127

For batch queries, Dataset.observe() relies on the query execution listener,
which is an "unstable" API - that's why we decided not to add an example for
the batch query. For streaming queries, it relies on the streaming query
listener, which is a stable API. That said, personally I'd consider the new
API a fit for streaming queries first, and then see whether it fits batch
queries as well.

If we find Dataset.observe() to be useful enough for batch queries, we'd
probably be better off discussing how to provide these metrics through a
stable API (so that Scala users could leverage it), and looking at PySpark
later. That looks like the first thing to do once we have a consensus on the
usefulness of observable metrics for batch queries.
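
For the streaming case, the stable path today looks roughly like the sketch
below; the observation name "stats" and the metric columns are placeholders,
and "spark" / "inputStream" are assumed to be an active SparkSession and a
streaming Dataset with a long-typed "value" column:

// Rough sketch: consuming observed metrics of a streaming query through the
// stable StreamingQueryListener API.
import org.apache.spark.sql.functions.{count, sum}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import spark.implicits._

val observed = inputStream.observe("stats",
  count($"value").as("rows"), sum($"value").as("total"))

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(e: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(e: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(e: QueryProgressEvent): Unit = {
    Option(e.progress.observedMetrics.get("stats")).foreach { row =>
      println(s"rows=${row.getAs[Long]("rows")} total=${row.getAs[Long]("total")}")
    }
  }
})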


On Tue, Mar 16, 2021 at 4:17 PM Enrico Minack 
wrote:

> I am focusing on batch mode, not streaming mode. I would argue that
> Dataset.observe() is equally useful for large batch processing. If you
> need some motivating use cases, please let me know.
>
> Anyhow, the documentation of observe states it works for both, batch and
> streaming. And in batch mode, the helper class Observation helps reducing
> code and avoiding repetition.
>
> The PySpark implementation of the Observation class can implement *all*
> methods by merely calling into their JVM counterpart, where the locking,
> listening, registration and un-registration happens. I think this qualifies
> as: "all the logic happens in the JVM". All that is transferred to Python
> is a row's data. No listeners needed.
>
> Enrico
>
>
>
> Am 16.03.21 um 00:13 schrieb Jungtaek Lim:
>
> If I remember correctly, the major audience of the "observe" API is
> Structured Streaming, micro-batch mode. From the example, the abstraction
> in 2 isn't something working with Structured Streaming. It could be still
> done with callback, but it remains the question how much complexity is
> hidden from abstraction.
>
> I see you're focusing on PySpark - I'm not sure whether there's intention
> on not exposing query listener / streaming query listener to PySpark, but
> if there's some valid reason to do so, I'm not sure we do like to expose
> them to PySpark in any way. 2 isn't making sense to me with PySpark - how
> do you ensure all the logic is happening in the JVM and you can leverage
> these values from PySpark?
> (I see there's support for listeners with DStream in PySpark, so there
> might be reasons not to add the same for SQL/SS. Probably a lesson learned?)
>
>
> On Mon, Mar 15, 2021 at 6:59 PM Enrico Minack 
> wrote:
>
>> Hi Spark-Devs,
>>
>> the observable metrics that have been added to the Dataset API in 3.0.0
>> are a great improvement over the Accumulator APIs that seem to have much
>> weaker guarantees. I have two questions regarding follow-up contributions:
>>
>> *1. Add observe to Python **DataFrame*
>>
>> As I can see from master branch, there is no equivalent in the Python
>> API. Is this something planned to happen, or is it missing because there
>> are not listeners in PySpark which renders this method useless in Python. I
>> would be happy to contribute here.
>>
>>
>> *2. Add Observation class to simplify result access *
>>
>> The Dataset.observe method requires users to register listeners
>> <https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#observe(name:String,expr:org.apache.spark.sql.Column,exprs:org.apache.spark.sql.Column*):org.apache.spark.sql.Dataset[T]>
>> with QueryExecutionListener or StreamingQUeryListener to obtain the
>> result. I think for simple setups, this could be hidden behind a common
>> helper class here called Observation, which reduces the usage of observe
>> to three lines of code:
>>
>> // our Dataset (this does not count as a line of code)
>> val df = Seq((1, "a"), (2, "b"), (4, "c"), (8, "d")).toDF("id", "value")
>> // define the observation we want to make
>> val observation = Observation("stats", count($"id"), sum($"id"))
>> // add the observation to the Dataset and execute an action on it
>> val cnt = df.observe(observation).count()
>> // retrieve the result
>> assert(observation.get === Row(4, 15))
>>
>> The Observation class can handle the registration and de-registration of
>> the listener, as well as properly accessing the result across thread
>> boundaries.
>>
>> With *2.*, the observe method can be added to PySpark without
>> introducing listeners there at all. All the logic is happening in the JVM.
>>
>> Thanks for your thoughts on this.
>>
>> Regards,
>> Enrico
>>
>


Re: Observable Metrics on Spark Datasets

2021-03-15 Thread Jungtaek Lim
If I remember correctly, the major audience of the "observe" API is
Structured Streaming in micro-batch mode. From the example, the abstraction
in 2 isn't something that works with Structured Streaming. It could still be
done with a callback, but the question remains how much complexity the
abstraction actually hides.

I see you're focusing on PySpark - I'm not sure whether there was an
intention behind not exposing the query listener / streaming query listener
to PySpark, but if there's some valid reason for it, I'm not sure we'd want
to expose them to PySpark in any form. 2 doesn't make sense to me for PySpark
- how do you ensure all the logic happens in the JVM while still being able
to leverage these values from PySpark?
(I see there's support for listeners with DStream in PySpark, so there might
be reasons not to add the same for SQL/SS. Probably a lesson learned?)


On Mon, Mar 15, 2021 at 6:59 PM Enrico Minack 
wrote:

> Hi Spark-Devs,
>
> the observable metrics that have been added to the Dataset API in 3.0.0
> are a great improvement over the Accumulator APIs that seem to have much
> weaker guarantees. I have two questions regarding follow-up contributions:
>
> *1. Add observe to Python **DataFrame*
>
> As I can see from master branch, there is no equivalent in the Python API.
> Is this something planned to happen, or is it missing because there are not
> listeners in PySpark which renders this method useless in Python. I would
> be happy to contribute here.
>
>
> *2. Add Observation class to simplify result access *
>
> The Dataset.observe method requires users to register listeners
> 
> with QueryExecutionListener or StreamingQUeryListener to obtain the
> result. I think for simple setups, this could be hidden behind a common
> helper class here called Observation, which reduces the usage of observe
> to three lines of code:
>
> // our Dataset (this does not count as a line of code)
> val df = Seq((1, "a"), (2, "b"), (4, "c"), (8, "d")).toDF("id", "value")
> // define the observation we want to make
> val observation = Observation("stats", count($"id"), sum($"id"))
> // add the observation to the Dataset and execute an action on it
> val cnt = df.observe(observation).count()
> // retrieve the result
> assert(observation.get === Row(4, 15))
>
> The Observation class can handle the registration and de-registration of
> the listener, as well as properly accessing the result across thread
> boundaries.
>
> With *2.*, the observe method can be added to PySpark without introducing
> listeners there at all. All the logic is happening in the JVM.
>
> Thanks for your thoughts on this.
>
> Regards,
> Enrico
>


Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-11 Thread Jungtaek Lim
+1 (non-binding) Excellent description on SPIP doc! Thanks for the amazing
effort!

On Wed, Mar 10, 2021 at 3:19 AM Liang-Chi Hsieh  wrote:

>
> +1 (non-binding).
>
> Thanks for the work!
>
>
> Erik Krogen wrote
> > +1 from me (non-binding)
> >
> > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao 
>
> > huaxin.gao11@
>
> >  wrote:
> >
> >> +1 (non-binding)
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Property spark.sql.streaming.minBatchesToRetain

2021-03-09 Thread Jungtaek Lim
That property decides how many log files (a log file is created per batch
per type - types such as offsets, commits, etc.) to retain in the checkpoint.

Unless you're struggling with a small-files problem in the checkpoint, you
shouldn't need to tune the value. I guess that's why the configuration is
marked as "internal", meaning only some admins need to know about it.
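
If someone really does need to tune it (e.g. to keep fewer metadata files
around), it is set like any other SQL conf; a minimal sketch:

// Sketch: lowering the number of retained batches of checkpoint metadata.
// Only worth touching if small checkpoint files are actually a problem;
// the default is 100.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("retain-fewer-batches")
  .config("spark.sql.streaming.minBatchesToRetain", "20")
  .getOrCreate()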

On Wed, Mar 10, 2021 at 3:58 AM German Schiavon 
wrote:

> Hey Maxim,
>
> ok! I didn't see them.
>
> Is this property documented somewhere?
>
> Thanks!
>
> On Tue, 9 Mar 2021 at 13:57, Maxim Gekk  wrote:
>
>> Hi German,
>>
>> It is used at least at:
>> 1.
>> https://github.com/apache/spark/blob/a093d6feefb0e086d19c86ae53bf92df12ccf2fa/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompactibleFileStreamLog.scala#L56
>> 2.
>> https://github.com/apache/spark/blob/e7e016192f882cfb430d706c2099e58e1bcc014c/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L84
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Tue, Mar 9, 2021 at 3:27 PM German Schiavon 
>> wrote:
>>
>>> Hello all,
>>>
>>> I wanted to ask if this property is still active? I can't find it in the
>>> doc https://spark.apache.org/docs/latest/configuration.html or anywhere
>>> in the code(only in Tests).
>>>
>>> If so, should we remove it?
>>>
>>> val MIN_BATCHES_TO_RETAIN = buildConf("spark.sql.streaming.minBatchesToRetain")
>>>   .internal()
>>>   .doc("The minimum number of batches that must be retained and made recoverable.")
>>>   .version("2.1.1")
>>>   .intConf
>>>   .createWithDefault(100)
>>>
>>>


Re: using accumulators in (MicroBatch) InputPartitionReader

2021-03-07 Thread Jungtaek Lim
I'm not sure about the accumulator approach; one possible approach that might
work (DISCLAIMER: a random thought) would be employing an RPC endpoint on the
driver side which receives such information from executors and acts as a
coordinator.

Beware that Spark's RPC implementation is package-private, so you may need to
resort to some hacks (declaring your code in the same package) and deal with
breakage across versions, as Spark won't guarantee backward compatibility. If
you could achieve something similar with only some lightweight dependencies
that introduce no conflicts, then I guess that would work as well.

On Thu, Mar 4, 2021 at 11:41 PM kordex  wrote:

> I tried to create a data source, however our use case is bit hard as
> we do only know the available offsets within the tasks, not on the
> driver. I therefore planned to use accumulators in the
> InputPartitionReader but they seem not to work.
>
> Example accumulation is done here
>
> https://github.com/kortemik/spark-source/blob/master/src/main/java/com/teragrep/pth06/ArchiveMicroBatchInputPartitionReader.java#L118
>
> I get on the task logs that the System.out.println() are called, so it
> can not be that the flow itself is broken, but the accumulators seem
> to work only while on the driver as on the logs at the
> https://github.com/kortemik/spark-source/tree/master
>
> Is it intentional that the accumulators do not work within the data source?
>
> One might ask why all this so I give brief explanation. We use gzipped
> files as the storage blobs and it's unknown prior to execution how
> many records they contain. Of course this can be mitigated by
> decompressing the files on the driver and then sending the offsets
> through to executors but it's a double effort. The aim however was to
> decompress them only once by doing a forward-lookup into the data and
> use accumulator to inform the driver that there is stuff available for
> the next batch as well or that the file is done and driver needs to
> pull the next one to keep executors busy.
>
> Any advices are welcome.
>
> Kind regards,
> -Mikko Kortelainen
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Jungtaek Lim
Thanks Hyukjin for driving the huge release, and thanks everyone for
contributing to the release!

On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:

> Great work, Hyukjin !
>
> Bests,
> Angers
>
> Wenchen Fan  于2021年3月3日周三 下午5:02写道:
>
>> Great work and congrats!
>>
>> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>>
>>> Congrats, all!
>>>
>>> Bests,
>>> *Kent Yao *
>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>>> *a spark enthusiast*
>>> *kyuubi is a
>>> unified multi-tenant JDBC interface for large-scale data processing and
>>> analytics, built on top of Apache Spark .*
>>> *spark-authorizer A Spark
>>> SQL extension which provides SQL Standard Authorization for **Apache
>>> Spark .*
>>> *spark-postgres  A library
>>> for reading data from and transferring data to Postgres / Greenplum with
>>> Spark SQL and DataFrames, 10~100x faster.*
>>> *spark-func-extras A
>>> library that brings excellent and useful functions from various modern
>>> database management systems to Apache Spark .*
>>>
>>>
>>>
>>> On 03/3/2021 15:11,Takeshi Yamamuro
>>>  wrote:
>>>
>>> Great work and Congrats, all!
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
>>> wrote:
>>>

 Thanks Hyukjin and congratulations everyone on the release !

 Regards,
 Mridul

 On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:

> Great work, Hyukjin!
>
> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
> wrote:
>
>> We are excited to announce Spark 3.1.1 today.
>>
>> Apache Spark 3.1.1 is the second release of the 3.x line. This
>> release adds
>> Python type annotations and Python dependency management support as
>> part of Project Zen.
>> Other major updates include improved ANSI SQL compliance support,
>> history server support
>> in structured streaming, the general availability (GA) of Kubernetes
>> and node decommissioning
>> in Kubernetes and Standalone. In addition, this release continues to
>> focus on usability, stability,
>> and polish while resolving around 1500 tickets.
>>
>> We'd like to thank our contributors and users for their contributions
>> and early feedback to
>> this release. This release would not have been possible without you.
>>
>> To download Spark 3.1.1, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>
>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>>


Re: Please take a look at the draft of the Spark 3.1.1 release notes

2021-02-27 Thread Jungtaek Lim
Thanks Hyukjin! I've only looked into the SS part, and added a comment.
Otherwise it looks great!

On Sat, Feb 27, 2021 at 7:12 PM Dongjoon Hyun 
wrote:

> Thank you for sharing, Hyukjin!
>
> Dongjoon.
>
> On Sat, Feb 27, 2021 at 12:36 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I am preparing to publish and announce Spark 3.1.1.
>> This is the draft of the release note, and I plan to edit a bit more and
>> use it as the final release note.
>> Please take a look and let me know if I missed any major changes or
>> something.
>>
>>
>> https://docs.google.com/document/d/1x6zzgRsZ4u1DgUh1XpGzX914CZbsHeRYpbqZ-PV6wdQ/edit?usp=sharing
>>
>> Thanks.
>>
>


Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-22 Thread Jungtaek Lim
+1 (non-binding)

Verified signatures. Only a few commits added after RC2 which don't seem to
change the SS behavior, so I'd carry over my +1 from RC2.

On Mon, Feb 22, 2021 at 3:57 PM Hyukjin Kwon  wrote:

> Starting with my +1 (binding).
>
> 2021년 2월 22일 (월) 오후 3:56, Hyukjin Kwon 님이 작성:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.1.
>>
>> The vote is open until February 24th 11PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.1.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.1.1-rc3 (commit
>> 1d550c4e90275ab418b9161925049239227f3dc9):
>> https://github.com/apache/spark/tree/v3.1.1-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> 
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1367
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>>
>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>> https://s.apache.org/41kf2
>>
>> This release is using the release script of the tag v3.1.1-rc3.
>>
>> FAQ
>>
>> ===
>> What happened to 3.1.0?
>> ===
>>
>> There was a technical issue during Apache Spark 3.1.0 preparation, and it
>> was discussed and decided to skip 3.1.0.
>> Please see
>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>> more details.
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
>> "
>> and see if anything important breaks.
>> In the Java/Scala, you can add the staging repository to your projects
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.1.1?
>> ===
>>
>> The current list of open tickets targeted at 3.1.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.1.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>


Re: Please use Jekyll via "bundle exec" from now on

2021-02-18 Thread Jungtaek Lim
Nice fix. Thanks!

On Thu, Feb 18, 2021 at 7:13 PM Hyukjin Kwon  wrote:

> Thanks Attlila for fixing and sharing this.
>
> 2021년 2월 18일 (목) 오후 6:17, Attila Zsolt Piros 님이
> 작성:
>
>> Hello everybody,
>>
>> To pin the exact same version of Jekyll across all the contributors, Ruby
>> Bundler is introduced.
>> This way the differences in the generated documentation, which were
>> caused by using different Jekyll versions, are avoided.
>>
>> Regarding its usage this simply means an extra prefix "*bundle exec*"
>> must be added to call Jekyll, so:
>> - instead of "jekyll build" please use "*bundle exec jekyll build*"
>> - instead of "jekyll serve --watch" please use "*bundle exec jekyll
>> serve --watch*"
>>
>> If you are using an earlier Ruby version please install Bundler by "*gem
>> install bundler*" (as of Ruby 2.6 Bundler is part of core Ruby).
>>
>> This applies to both "apache/spark" and "apache/spark-website"
>> repositories where all the documentations are updated accordingly.
>>
>> For details please check the PRs introducing this:
>> - https://github.com/apache/spark/pull/31559
>> - https://github.com/apache/spark-website/pull/303
>>
>> Sincerely,
>> Attila Piros
>>
>


Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-18 Thread Jungtaek Lim
(Actually, the real-world case was resolved somehow, and I don't want to
point at an already-fixed one. I just want to make sure that what I think is
correct and is considered the "consensus".)

Just consider a simple case: someone files two different JIRA issues for new
features and assigns both to him/herself, without sharing anything about the
ongoing effort they have made. (So you have no idea whether that person just
filed two JIRA issues without "any" progress and keeps them in a backlog.)
The features are not unique to that person and are likely something others
could be working on in parallel.

That said, committers can explicitly signal "I'm working on this, so please
refrain from making redundant efforts" by assigning the issue, which is
effectively similar to the comment "I'm working on this". Unfortunately, this
only works when the feature is something only the person who filed the JIRA
issue is working on. The occasional opposite case isn't always a matter of
ignoring the "I'm working on this" signal; there are also coincidences where
two different individuals/teams are working on exactly the same thing at the
same time.

My concern is that "assignment" might be considered much stronger than just
commenting "I'm working on this" - it's like saying "Regardless of your
current progress, I started working on this, so don't consider your effort
proposable. You should have filed the JIRA issue before I filed one." Is it
possible for contributors to do the same? I guess not.

The other problem is multiple assignments in parallel. I don't want to assume
that anyone over-uses the power of assignment, but technically it's entirely
possible for someone to file JIRA issues from their backlog - work that might
only get done in a couple of months or so - and assign them to him/herself,
which effectively blocks others from working on or proposing the same thing.
I consider this preemptive, which seems bad and even unfair.

On Fri, Feb 19, 2021 at 12:14 AM Sean Owen  wrote:

> I think it's OK to raise particular instances. It's hard for me to
> evaluate further in the abstract.
>
> I don't think we use Assignee much at all, except to kinda give credit
> when something is done. No piece of code or work can be solely owned by one
> person; this is just ASF policy.
>
> I think we've seen the occasional opposite case too: someone starts
> working on an issue, and then someone else also starts working on it with a
> competing fix or change.
>
> These are ultimately issues of communication. If an issue is pretty
> stalled, and you have a proposal, nothing wrong with just going ahead with
> a proposal. There may be no disagreement. It might result in the
> other person joining your PR. As I say, not sure if there's a deeper issue
> than that if even this hasn't been tried?
>
> On Mon, Feb 15, 2021 at 8:35 PM Jungtaek Lim 
> wrote:
>
>> Thanks for the input, Hyukjin!
>>
>> I have been keeping my own policy among all discussions I have raised - I
>> would provide the hypothetical example closer to the actual one and avoid
>> pointing out directly. The main purpose of the discussion is to ensure our
>> policy / consensus makes sense, no more. I can provide a more detailed
>> explanation if someone feels the explanation wasn't sufficient to
>> understand.
>>
>> Probably this discussion could play as a "reminder" to every committers
>> if similar discussion was raised before and it succeeded to build
>> consensus. If there's some point we don't build consensus yet, it'd be a
>> good time to discuss further. I don't know what exactly was the discussion
>> and the result so what is new here, but I guess this might be a duplicated
>> one as you say similar issue.
>>
>>
>>
>> On Tue, Feb 16, 2021 at 11:09 AM Hyukjin Kwon 
>> wrote:
>>
>>> I remember I raised a similar issue a long time ago in the dev mailing
>>> list. I agree that setting no assignee makes sense in most of the cases,
>>> and also think we share similar thoughts about the assignee on
>>> umbrella JIRAs, followup tasks, the case when it's clear with a design doc,
>>> etc.
>>> It makes me think that the actual issue by setting an assignee happens
>>> rarely, and it is an issue to several specific cases that would need a look
>>> case-by-case.
>>> Were there specific cases that made you concerned?
>>>
>>>
>>> 2021년 2월 15일 (월) 오전 8:58, Jungtaek Lim 님이
>>> 작성:
>>>
>>>> Hi devs,
>>>>
>>>> I'd like to raise a discussion and hear voices on the "assignee"
>>>> practice on committers which may lead issues on preemption.
>>>>
>>>> I feel this is t

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-15 Thread Jungtaek Lim
Thanks for the input, Hyukjin!

I have been keeping my own policy across all the discussions I have raised -
I provide a hypothetical example close to the actual one and avoid pointing
at anyone directly. The main purpose of the discussion is to ensure our
policy / consensus makes sense, no more. I can provide a more detailed
explanation if someone feels the explanation wasn't sufficient to understand.

Probably this discussion could serve as a "reminder" to all committers if a
similar discussion was raised before and succeeded in building consensus. If
there's some point on which we haven't built consensus yet, it'd be a good
time to discuss it further. I don't know exactly what that discussion and its
result were, so I'm not sure what is new here, but I guess this might be a
duplicate, as you say it was a similar issue.



On Tue, Feb 16, 2021 at 11:09 AM Hyukjin Kwon  wrote:

> I remember I raised a similar issue a long time ago in the dev mailing
> list. I agree that setting no assignee makes sense in most of the cases,
> and also think we share similar thoughts about the assignee on
> umbrella JIRAs, followup tasks, the case when it's clear with a design doc,
> etc.
> It makes me think that the actual issue by setting an assignee happens
> rarely, and it is an issue to several specific cases that would need a look
> case-by-case.
> Were there specific cases that made you concerned?
>
>
> 2021년 2월 15일 (월) 오전 8:58, Jungtaek Lim 님이
> 작성:
>
>> Hi devs,
>>
>> I'd like to raise a discussion and hear voices on the "assignee" practice
>> on committers which may lead issues on preemption.
>>
>> I feel this is the one of major unfairnesses between contributors and
>> committers if used improperly, especially when someone assigns themselves
>> with multiple JIRA issues.
>>
>> Let's say there're features A and B, which may take a month for each (or
>> require design doc) - both are individual major features, not subtasks or
>> some sort of "follow-up".
>>
>> Technically, committers can file two JIRA issues and assign both of
>> issues, "without actually doing no progress", and implicitly ensure no one
>> works on these issues for a couple of months. Even just a plan on backlog
>> can prevent others from taking up.
>>
>> I don't think this is fair with contributors, because contributors don't
>> tend to file an JIRA issue unless they made a lot of progress. (I'd like to
>> remind you, competition from contributor's position is quite tense and
>> stressful.) Say they already spent a month working on it and testing it in
>> production. They feel ready and visit JIRA, and realize the JIRA issue was
>> made and assigned to someone, while there's no progress on the JIRA issue.
>> No idea how much progress "someone" makes. They "might" ask about the
>> progress, but nothing will change if "someone" simply says "I'm still
>> working on this" (with even 1% of progress). Isn't this actually against
>> the reason we don't allow setting assignee to contributor?
>>
>> For sure, assigning the issue would make sense if the issue is a subtask
>> or follow-up, or the issue made explicit progress like design doc is being
>> put. In other cases I don't see any reason assigning the issue explicitly.
>> Someone may say to contributors, just leave a comment "I'm working on it",
>> but isn't it also something committers can also do when they are "actually"
>> working?
>>
>> I think committers should have no advantage on the possible competition
>> on contribution, and setting assignee without explicit progress makes me
>> worried.
>> To make it fair, either we should allow contributors to assign them or
>> don't allow committers to assign them unless extreme cases - they can still
>> use the approach contributors do.
>> (Again I'd feel OK to assign if there's a design doc proving that they
>> really spent non-trivial effort already. My point is preempting JIRA issues
>> with only sketched ideas or even just rationalizations.)
>>
>> Would like to hear everyone's voices.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. better yet, probably it's better then to restrict something
>> explicitly if we sincerely respect the underlying culture on the statement
>> "In case several people contributed, prefer to assign to the more ‘junior’,
>> non-committer contributor".
>>
>>
>>


[DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-14 Thread Jungtaek Lim
Hi devs,

I'd like to raise a discussion and hear voices on the "assignee" practice
among committers, which may lead to issues of preemption.

I feel this is one of the major unfairnesses between contributors and
committers when used improperly, especially when someone assigns multiple
JIRA issues to themselves.

Let's say there are features A and B, each of which may take a month (or
require a design doc) - both are individual major features, not subtasks or
some sort of "follow-up".

Technically, committers can file two JIRA issues and assign both to
themselves, "without actually making any progress", and implicitly ensure no
one works on these issues for a couple of months. Even just a plan in the
backlog can prevent others from taking them up.

I don't think this is fair to contributors, because contributors don't tend
to file a JIRA issue unless they have made a lot of progress. (I'd like to
remind you, competition from a contributor's position is quite tense and
stressful.) Say they already spent a month working on a feature and testing
it in production. They feel ready and visit JIRA, only to realize the JIRA
issue was already filed and assigned to someone, while there's no progress on
it. There's no telling how much progress "someone" has made. They "might" ask
about the progress, but nothing will change if "someone" simply says "I'm
still working on this" (with even 1% of progress). Isn't this actually
against the reason we don't allow setting contributors as assignees?

For sure, assigning the issue would make sense if the issue is a subtask or
follow-up, or if the issue has made explicit progress, like a design doc
being posted. In other cases I don't see any reason to assign the issue
explicitly. Someone may tell contributors to just leave a comment saying "I'm
working on it", but isn't that also something committers can do when they are
"actually" working?

I think committers should have no advantage in the possible competition over
contributions, and setting an assignee without explicit progress worries me.
To make it fair, we should either allow contributors to assign themselves, or
not allow committers to assign themselves except in extreme cases - they can
still use the approach contributors do.
(Again, I'd feel OK with assigning if there's a design doc proving that they
have already spent non-trivial effort. My point is about preempting JIRA
issues with only sketched ideas or even just rationalizations.)

Would like to hear everyone's voices.

Thanks,
Jungtaek Lim (HeartSaVioR)

ps. Better yet, it would probably be better to restrict this explicitly if we
sincerely respect the underlying culture behind the statement "In case
several people contributed, prefer to assign to the more ‘junior’,
non-committer contributor".


Re: [VOTE] Release Spark 3.1.1 (RC2)

2021-02-09 Thread Jungtaek Lim
+1 (non-binding)

* verified signatures
* built a custom distribution with the kubernetes & hadoop-cloud profiles
enabled
* built a custom docker image from the distribution
* ran the "rate to kafka" & "kafka to kafka" applications on a k8s cluster
(local k3s)

Thanks for driving the release!

Jungtaek Lim (HeartSaVioR)

On Tue, Feb 9, 2021 at 4:41 PM Cheng Su  wrote:

> +1 for this release candidate.
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
> *From: *郑瑞峰 
> *Date: *Monday, February 8, 2021 at 10:58 PM
> *To: *Gengliang Wang , Sean Owen 
> *Cc: *gurwls223 , Yuming Wang ,
> dev 
> *Subject: *回复: [VOTE] Release Spark 3.1.1 (RC2)
>
>
>
> +1 (non-binding)
>
>
>
> Thank you, Hyukjin
>
>
>
>
>
> -- 原始邮件 --
>
> *发件人**:* "Gengliang Wang" ;
>
> *发送时间**:* 2021年2月9日(星期二) 中午1:50
>
> *收件人**:* "Sean Owen";
>
> *抄送**:* "Hyukjin Kwon";"Yuming Wang" >;"dev";
>
> *主题**:* Re: [VOTE] Release Spark 3.1.1 (RC2)
>
>
>
> +1
>
>
>
> On Tue, Feb 9, 2021 at 1:39 PM Sean Owen  wrote:
>
> Same result as last time for me, +1. Tested with Java 11.
>
> I fixed the two issues without assignee; one was WontFix though.
>
>
>
> On Mon, Feb 8, 2021 at 7:43 PM Hyukjin Kwon  wrote:
>
> Let's set the assignees properly then. Shouldn't be a problem for the
> release.
>
>
>
> On Tue, 9 Feb 2021, 10:40 Yuming Wang,  wrote:
>
>
>
> Many tickets do not have correct assignee:
>
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20in%20(3.1.0%2C%203.1.1)%20AND%20(assignee%20is%20EMPTY%20or%20assignee%20%3D%20apachespark)
> <https://issues.apache.org/jira/issues/?jql=project%20=%20SPARK%20AND%20status%20in%20(Resolved,%20Closed)%20AND%20fixVersion%20in%20(3.1.0,%203.1.1)%20AND%20(assignee%20is%20EMPTY%20or%20assignee%20=%20apachespark)>
>
>
>
>
>
> On Tue, Feb 9, 2021 at 9:05 AM Hyukjin Kwon  wrote:
>
> +1 (binding) from myself too.
>
>
>
> 2021년 2월 9일 (화) 오전 9:28, Kent Yao 님이 작성:
>
>
>
> +1
>
>
>
> *Kent Yao *
>
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>
> *a spark enthusiast*
>
> *kyuubi <https://github.com/yaooqinn/kyuubi>**is a unified
> multi-tenant JDBC interface for large-scale data processing and analytics,
> built on top of **Apache Spark <http://spark.apache.org/>**.*
> *spark-authorizer <https://github.com/yaooqinn/spark-authorizer>**A Spark
> SQL extension which provides SQL Standard Authorization for **Apache
> Spark <http://spark.apache.org/>**.*
> *spark-postgres <https://github.com/yaooqinn/spark-postgres> **A library
> for reading data from and transferring data to Postgres / Greenplum with
> Spark SQL and DataFrames, 10~100x faster.*
> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras>**A
> library that brings excellent and useful functions from various modern
> database management systems to **Apache Spark <http://spark.apache.org/>*
> *.*
>
>
>
>
>
> On 02/9/2021 08:24,Hyukjin Kwon 
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.1.1.
>
>
>
> The vote is open until February 15th 5PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> Note that it is 7 days this time because it is a holiday season in several
> countries including South Korea (where I live), China etc., and I would
> like to make sure people do not miss it because it is a holiday season.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.1.1
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
>
> The tag to be voted on is v3.1.1-rc2 (commit
> cf0115ac2d60070399af481b14566f33d22ec45e):
>
> https://github.com/apache/spark/tree/v3.1.1-rc2
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1365
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-docs/
>
>
>
> The list of bug fixes going into 3.1.1 can be found at the fo

Re: [DISCUSS] Add RocksDB StateStore

2021-02-08 Thread Jungtaek Lim
+1 to add it, whether under sql-core or as an external module.

My rationale:

* The discussion thread and the voices here show strong demand for adding a
RocksDB state store out of the box.
* There is no out-of-the-box workaround for the huge-state problem, while
direct competitors among streaming frameworks have provided this for years.
* Maintenance cost is the major concern when evaluating whether to add
something, but it doesn't apply here, as contributors/committers from various
companies are willing to contribute.
* The Apache Bahir project is no longer actively maintained - the last
release was in September 2019, based on Spark 2.4.0. We can no longer easily
say "let's add it to Bahir instead".
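
For context on how a provider stays opt-in regardless of which module it
lands in: the state store backend is already pluggable through a single
config. A minimal sketch in Scala, assuming the provider class name below -
it is a hypothetical placeholder for whatever class the proposal ends up
adding, used only for illustration:

    import org.apache.spark.sql.SparkSession

    // The state store backend is chosen via this config; a RocksDB-backed
    // provider would simply be another value here. The class name below is
    // a hypothetical placeholder, not an existing class.
    val spark = SparkSession.builder()
      .appName("rocksdb-state-store-sketch")
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()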



On Tue, Feb 9, 2021 at 3:22 PM Gabor Somogyi 
wrote:

> +1 adding it any way.
>
> On Mon, 8 Feb 2021, 21:54 Holden Karau,  wrote:
>
>> +1 for an external module.
>>
>> On Mon, Feb 8, 2021 at 11:51 AM Cheng Su  wrote:
>>
>>> +1 for (2) adding to external module.
>>>
>>> I think this feature is useful and popular in practice, and option 2 is
>>> not conflict with previous concern for dependency.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Cheng Su
>>>
>>>
>>>
>>> *From: *Dongjoon Hyun 
>>> *Date: *Monday, February 8, 2021 at 10:39 AM
>>> *To: *Jacek Laskowski 
>>> *Cc: *Liang-Chi Hsieh , dev 
>>> *Subject: *Re: [DISCUSS] Add RocksDB StateStore
>>>
>>>
>>>
>>> Thank you, Liang-chi and all.
>>>
>>>
>>>
>>> +1 for (2) external module design because it can deliver the new feature
>>> in a safe way.
>>>
>>>
>>>
>>> Bests,
>>>
>>> Dongjoon
>>>
>>>
>>>
>>> On Mon, Feb 8, 2021 at 9:00 AM Jacek Laskowski  wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I'm "okay to add RocksDB StateStore as external module". See no reason
>>> not to.
>>>
>>>
>>> Pozdrawiam,
>>>
>>> Jacek Laskowski
>>>
>>> 
>>>
>>> https://about.me/JacekLaskowski
>>>
>>> "The Internals Of" Online Books 
>>>
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>>
>>> 
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Feb 2, 2021 at 9:32 AM Liang-Chi Hsieh  wrote:
>>>
>>> Hi devs,
>>>
>>> In Spark structured streaming, we need state store for state management
>>> for
>>> stateful operators such streaming aggregates, joins, etc. We have one and
>>> only one state store implementation now. It is in-memory hashmap which
>>> was
>>> backed up in HDFS complaint file system at the end of every micro-batch.
>>>
>>> As it basically uses in-memory map to store states, memory consumption
>>> is a
>>> serious issue and state store size is limited by the size of the executor
>>> memory. Moreover, state store using more memory means it may impact the
>>> performance of task execution that requires memory too.
>>>
>>> Internally we see more streaming applications that requires large state
>>> in
>>> stateful operations. For such requirements, we need a StateStore not
>>> rely on
>>> memory to store states.
>>>
>>> This seems to be also true externally as several other major streaming
>>> frameworks already use RocksDB for state management. RocksDB is an
>>> embedded
>>> DB and streaming engines can use it to store state instead of memory
>>> storage.
>>>
>>> So seems to me, it is proven to be good choice for large state usage. But
>>> Spark SS still lacks of a built-in state store for the requirement.
>>>
>>> Previously there was one attempt SPARK-28120 to add RocksDB StateStore
>>> into
>>> Spark SS. IIUC, it was pushed back due to two concerns: extra code
>>> maintenance cost and it introduces RocksDB dependency.
>>>
>>> For the first concern, as more users require to use the feature, it
>>> should
>>> be highly used code in SS and more developers will look at it. For second
>>> one, we propose (SPARK-34198) to add it as an external module to relieve
>>> the
>>> dependency concern.
>>>
>>> Because it was pushed back previously, I'm going to raise this
>>> discussion to
>>> know what people think about it now, in advance of submitting any code.
>>>
>>> I think there might be some possible opinions:
>>>
>>> 1. okay to add RocksDB StateStore into sql core module
>>> 2. not okay for 1, but okay to add RocksDB StateStore as external module
>>> 3. either 1 or 2 is okay
>>> 4. not okay to add RocksDB StateStore, no matter into sql core or as
>>> external module
>>>
>>> Please let us know if you have some thoughts.
>>>
>>> Thank you.
>>>
>>> Liang-Chi Hsieh
>>>
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: [VOTE] Release Spark 3.1.1 (RC1)

2021-01-18 Thread Jungtaek Lim
+1 (non-binding)

* verified signatures and shas for all files (there's a glitch which I'll
describe below)
* built the source (DISCLAIMER: didn't run tests), made a custom
distribution, and built a docker image based on the distribution
  - used profiles: kubernetes, hadoop-3.2, hadoop-cloud
* ran some SS PySpark queries (Rate to Kafka, Kafka to Kafka) with Spark on
k8s (used MinIO - s3 compatible - as checkpoint location)
  - for Kafka reader, tested both approaches: newer (offset via admin
client) and older (offset via consumer)
* ran simple batch query with magic committer against MinIO storage &
dynamic volume provisioning (with NFS)
* verified DataStreamReader.table & DataStreamWriter.toTable work in PySpark
(which verifies the Scala API as well)
* ran stateful SS test queries and checked the new additions to the SS UI
(state store & watermark information)

A glitch from verifying the shas: the file format of the sha512 differs
between the source tgz and the other artifacts. My tool succeeded with the
others and failed with the source tgz, though I confirmed the sha itself is
the same. Not a blocker, but it would be ideal to make it consistent.

Thanks for driving the release process!

On Tue, Jan 19, 2021 at 2:25 PM Yuming Wang  wrote:

> +1.
>
> On Tue, Jan 19, 2021 at 7:54 AM Hyukjin Kwon  wrote:
>
>> I forgot to say :). I'll start with my +1.
>>
>> On Mon, 18 Jan 2021, 21:06 Hyukjin Kwon,  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.1.1.
>>>
>>> The vote is open until January 22nd 4PM PST and passes if a majority +1
>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.1.1-rc1 (commit
>>> 53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):
>>> https://github.com/apache/spark/tree/v3.1.1-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1364
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/
>>>
>>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>>> https://s.apache.org/41kf2
>>>
>>> This release is using the release script of the tag v3.1.1-rc1.
>>>
>>> FAQ
>>>
>>> ===
>>> What happened to 3.1.0?
>>> ===
>>>
>>> There was a technical issue during Apache Spark 3.1.0 preparation, and
>>> it was discussed and decided to skip 3.1.0.
>>> Please see
>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html
>>> for more details.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC via "pip install
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz
>>> "
>>> and see if anything important breaks.
>>> In the Java/Scala, you can add the staging repository to your projects
>>> resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.1.1?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.1.1 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.1.1
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>


Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-06 Thread Jungtaek Lim
No worries about the accident. We're human beings, and everyone can make a
mistake. Let's wait and see the response to INFRA-21266.

Just my 2 cents: I'm actually leaning toward skipping 3.1.0 and starting the
release process for 3.1.1, as anyone could end up "rushing" verification of
3.1.0. Since we're already biased by the fact that the release is already
available, the RC might not be tested intensively and extensively.
I'm also OK with continuing verification of RC1 and considering it the
official 3.1.0 once the vote passes, if we are sure we can do the normal
release candidate QA without bias. (By bias I mean skipping some
verifications people normally do, or deferring serious bugs as "later" based
on the expectation that 3.1.1 will come pretty soon.)

On Thu, Jan 7, 2021 at 6:56 AM Hyukjin Kwon  wrote:

> Thanks Dongjoon, Sean and Tom. I just thought that we could have some more
> bug fixes or some changes if RC1 passes as a regular release due to the
> relatively fewer RCs.
> I agree that if this RC passes, it's just that an RC passed normally per
> the regular process, and there's nothing wrong here. By right, there
> shouldn't be any special treatment or difference in 3.1.1.
> I more meant a practical point that we might happen to face some more bug
> fixes or breaking changes (of course as an exception) that happens
> sometimes.
>
>
> 2021년 1월 7일 (목) 오전 6:44, Tom Graves 님이 작성:
>
>> I think it makes sense to wait and see what they say on INFRA-21266
>> .
>>
>> In the mean time hopefully people can start testing it and if no other
>> problems found and vote passes can stay published.  It seems like the 2
>> issues above wouldn't be blockers in my opinion and could be handled in a
>> 3.1.1 but others can chime too.
>>
>> If we find other issues with it in testing and they can't revert in
>> INFRA-21266 - I assume we handle by putting some documentation out there
>> telling people not to use it and we go to 3.1.1.
>>
>> One thing I didn't follow was the comment: "release 3.1.1 fast that
>> exceptionally allows a bit of breaking changes" - what do you mean by that?
>>
>> if there is anything we can add to our release process documentation to
>> prevent in the future that would be great as well.
>>
>> Tom
>>
>> On Wednesday, January 6, 2021, 03:07:26 PM CST, Hyukjin Kwon <
>> gurwls...@gmail.com> wrote:
>>
>>
>> Yes, it was my mistake. I faced the same issue as INFRA-20651
>> , and it is worse in
>> my case because I misunderstood that RC and releases are separately
>> released out.
>> Right after this, I filed an INFRA JIRA to revert this at INFRA-21266
>> . We can wait and see
>> how it goes.
>>
>> Though, I know it’s impossible to remove by right. It is possible to
>> overwrite but it will affect people who already have it in their cache.
>> I am thinkthing two options:
>>
>>- Skip 3.1.0 and release 3.1.1 right away since the release isn’t
>>officially out to the main Apache repo/mirrors but only one of the
>>downstream channels. We can just say that there was something wrong during
>>the 3.1.0 release so it became 3.1.1 right away.
>>
>>
>>- Release 3.1.0 out, of course, based on the vote results here. We
>>could release 3.1.1 fast that exceptionally allows a bit of breaking
>>changes with properly documenting it in a release note and migration 
>> guide.
>>
>> I would appreciate it if I could hear other people' opinions.
>>
>> Thanks.
>>
>>
>>
>>
>>


Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-05 Thread Jungtaek Lim
There's an issue, SPARK-33635 [1], reported for a performance regression in
Kafka reads between Spark 2.4 and 3.0, which sounds like a blocker. I'll mark
it as a blocker unless anyone has a different opinion.

1. https://issues.apache.org/jira/browse/SPARK-33635

On Wed, Jan 6, 2021 at 9:01 AM Hyukjin Kwon  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.1.0.
>
> The vote is open until January 8th 4PM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.1.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.1.0-rc1 (commit
> 97340c1e34cfd84de445b6b7545cfa466a1baaf6):
> https://github.com/apache/spark/tree/v3.1.0-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1363/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/
>
> The list of bug fixes going into 3.1.0 can be found at the following URL:
> https://s.apache.org/ldzzl
>
> This release is using the release script of the tag v3.1.0-rc1.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC via "pip install
> https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-bin/pyspark-3.1.0.tar.gz
> "
> and see if anything important breaks.
> In the Java/Scala, you can add the staging repository to your projects
> resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.1.0?
> ===
>
> The current list of open tickets targeted at 3.1.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.1.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>


Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-11-28 Thread Jungtaek Lim
To make this clear, what Arun meant in the old PR is that watermark and
output mode are not relevant here. It's limited to append mode in any case
when we only deal with the watermark. So in this phase we don't (and
shouldn't) bring output mode into the topic and make things complicated,
unless we really have a solid plan to introduce retraction.

On Fri, Nov 27, 2020 at 12:08 PM Yuanjian Li  wrote:

> Nice blog! Thanks for sharing, Etienne!
>
> Let's try to raise this discussion again after the 3.1 release. I do think
> more committers/contributors had realized the issue of global watermark per
> SPARK-24634 <https://issues.apache.org/jira/browse/SPARK-24634> and
> SPARK-33259 <https://issues.apache.org/jira/browse/SPARK-33259>.
>
> Leaving some thoughts on my end:
> 1. Compatibility: The per-operation watermark should be compatible with
> the original global one when there are no multi-aggregations.
> 2. Versioning: If we need to change checkpoints' format, versioning info
> should be added for the first time.
> 3. Fix more things together: We'd better fix more issues(e.g.
> per-operation output mode for multi-aggregations) together, which would
> require versioning changes in the same Spark version.
>
> Best,
> Yuanjian
>
>
> Etienne Chauchot  于2020年11月26日周四 下午5:29写道:
>
>> Hi,
>>
>> Regarding this subject I wrote a blog article that gives details about
>> the watermark architecture proposal that was discussed in the design doc
>> and in the PR:
>>
>>
>> https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html
>>
>> Best
>>
>> Etienne
>> On 29/09/2020 03:24, Yuanjian Li wrote:
>>
>> Thanks for the great discussion!
>>
>> Also interested in this feature and did some investigation before. As
>> Arun mentioned, similar to the "update" mode, the "complete" mode also
>> needs more design. We might need an operation level output mode for the
>> complete mode support. That is to say, if we use "complete" mode for every
>> aggregation operators, the wrong result will return.
>>
>> SPARK-26655 would be a good start, which only considers about "append"
>> mode. Maybe we need more discussion on the watermark interface. I will take
>> a close look at the doc and PR. Hope we will have the first version with
>> limitations and fix/remove them gradually.
>>
>> Best,
>> Yuanjian
>>
>> Jungtaek Lim  于2020年9月26日周六 上午10:31写道:
>>
>>> Thanks Etienne! Yeah I forgot to say nice talking with you again. And
>>> sorry I forgot to send the reply (was in draft).
>>>
>>> Regarding investment in SS, well, unfortunately I don't know - I'm just
>>> an individual. There might be various reasons to do so, most probably
>>> "priority" among the stuff. There's not much I could change.
>>>
>>> I agree the workaround is sub-optimal, but unless I see sufficient
>>> support in the community I probably couldn't make it go forward. I'll just
>>> say there's an elephant in the room - as the project goes forward for more
>>> than 10 years, backward compatibility is a top priority concern in the
>>> project, even across the major versions along the features/APIs. It is
>>> great for end users to migrate the version easily, but also blocks devs to
>>> fix the bad design once it ships. I'm the one complaining about these
>>> issues in the dev list, and I don't see willingness to correct them.
>>>
>>>
>>> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot 
>>> wrote:
>>>
>>>> Hi Jungtaek Lim,
>>>>
>>>> Nice to hear from you again since last time we talked :) and congrats
>>>> on becoming a Spark committer in the meantime ! (if I'm not mistaking you
>>>> were not at the time)
>>>>
>>>> I totally agree with what you're saying on merging structural parts of
>>>> Spark without having a broader consensus. What I don't understand is why
>>>> there is not more investment in SS. Especially because in another thread
>>>> the community is discussing about deprecating the regular DStream streaming
>>>> framework.
>>>>
>>>> Is the orientation of Spark now mostly batch ?
>>>>
>>>> PS: yeah I saw your update on the doc when I took a look at 3.0 preview
>>>> 2 searching for this particular feature. And regarding the workaround, I'm
>>>> not sure it meets my needs as it will add delays and also may mess up with
>>>> watermarks.
>>>>
>>>> B

Re: Seeking committers' help to review on SS PR

2020-11-27 Thread Jungtaek Lim
Btw, there are two more PRs which got an LGTM from an SS contributor but have
failed to get attention from committers. They're 6+ months old. Could you
help review these as well, or do you all think 6 months of waiting plus an
LGTM from an SS contributor is enough to go ahead?

https://github.com/apache/spark/pull/27649
https://github.com/apache/spark/pull/28363

These are under 100 lines of changes each, and not invasive.

On Sat, Nov 28, 2020 at 11:34 AM Jungtaek Lim 
wrote:

> Thanks for providing valuable feedback. Appreciate it. Sorry I haven't had
> time to reply to this in time (was OoO this week).
>
> I'm also in favor of "review then commit", I haven't been a "perfect" guy
> making no mistake (probably that justifies me as a human being), hence the
> review process is a critical one I really would like to go through in any
> way. The problem is, it's less likely I could get attention for my SS PRs
> from "committers". Hopefully there are a couple of contributors in the SS
> area, so getting reviewed by itself is a bit easier than before, but in any
> way I couldn't get a finalized review and go merging.
>
> This PR is actually an easy case, small enough and not invasive. I have
> planned major features most likely to bring design changes, which cannot go
> without attention from committers. I'd probably need committers for both
> design & code review - I'm a bit worried that such major change could be
> achieved with current situation.
> (This may not be resolved even with the time limit... If the community has
> a faith to me then I could bravely just go with time limit, but, is this
> something I can avoid ending up being blamed?)
>
> Probably the better resolution is filling enough SS committers. There
> should be some ways to do this, existing committers expert of SS area come
> back (which isn't something we can control), committers expert of other
> area jump in and help the area (which isn't also something we can control),
> or inviting new committer(s) for SS area. Anything would be fine for me.
>
>
> On Tue, Nov 24, 2020 at 2:46 AM Sean Owen  wrote:
>
>> Yes, agree, and that time limit is probably a lot shorter than 1.5 years.
>> I think these ultimately come down to judgment, and am affirming the
>> judgment that this amounts to 'reviewed'.
>>
>> On Mon, Nov 23, 2020 at 11:40 AM Ryan Blue  wrote:
>>
>>> I'll go take a look.
>>>
>>> While I would generally agree with Sean that it would be appropriate in
>>> this case to commit, I'm very hesitant to set that precedent. I'd prefer to
>>> stick with "review then commit" and, if needed, relax that constraint for
>>> parts of the project that can't get reviewers for a certain period of time.
>>> We did that in another community where there weren't many reviewers and we
>>> wanted to get more people involved, but we put a time limit on it and set
>>> expectations to prevent any perception of abuse. I would support doing that
>>> in SS.
>>>
>>> Thanks for being so patient on that PR. I'm sorry that you had to wait
>>> so long.
>>>
>>> On Mon, Nov 23, 2020 at 7:11 AM Sean Owen  wrote:
>>>
>>>> I don't see any objections on that thread. You're a committer and have
>>>> reviews from other knowledgeable people in this area. Do you have any
>>>> reason to believe it's controversial, like, changes semantics or APIs? Were
>>>> there related discussions elsewhere that expressed any concern?
>>>>
>>>> From a glance, OK it's introducing a new idea of state schema and
>>>> validation; would it conflict with any other possible approaches, have any
>>>> limits if this is enshrined as supported functionality? There's always some
>>>> cost to introducing yet more code to support, but, this doesn't look
>>>> intrusive or large.
>>>>
>>>> The "don't review your own PR" idea isn't hard-and-fast. I don't think
>>>> anyone needs to block for anything like this long if you have other capable
>>>> reviews and you are a committer, if you don't see that it impacts other
>>>> code meaningfully in a way that really demands review from others, and in
>>>> good faith judge that it is worthwhile. I think you are the one de facto
>>>> expert on that code and indeed you can't block yourself for 1.5 years or
>>>> else nothing substantial would happen.
>>>>
>>>>
>>>>
>>>> On Mon, Nov 23, 2020 at 1:18 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>

Re: Seeking committers' help to review on SS PR

2020-11-27 Thread Jungtaek Lim
Thanks for providing valuable feedback - I appreciate it. Sorry I haven't had
time to reply in a timely manner (I was OoO this week).

I'm also in favor of "review then commit". I haven't been a "perfect" guy who
makes no mistakes (probably that justifies me as a human being), hence the
review process is a critical one I really would like to go through in any
case. The problem is that it's unlikely I can get attention on my SS PRs from
"committers". Fortunately there are a couple of contributors in the SS area,
so getting reviewed at all is a bit easier than before, but even so I
couldn't get a finalized review and proceed to merge.

This PR is actually an easy case: small enough and not invasive. I have
planned major features which will most likely bring design changes, and which
cannot go forward without attention from committers. I'd probably need
committers for both design & code review - I'm a bit worried whether such
major changes can be achieved in the current situation.
(This may not be resolved even with the time limit... If the community has
faith in me then I could bravely just go with the time limit, but is this
something where I can avoid ending up being blamed?)

Probably the better resolution is having enough SS committers. There are a
few ways this could happen: existing committers who are experts in the SS
area come back (which isn't something we can control), committers who are
experts in other areas jump in and help the area (which also isn't something
we can control), or new committer(s) are invited for the SS area. Any of
these would be fine for me.


On Tue, Nov 24, 2020 at 2:46 AM Sean Owen  wrote:

> Yes, agree, and that time limit is probably a lot shorter than 1.5 years.
> I think these ultimately come down to judgment, and am affirming the
> judgment that this amounts to 'reviewed'.
>
> On Mon, Nov 23, 2020 at 11:40 AM Ryan Blue  wrote:
>
>> I'll go take a look.
>>
>> While I would generally agree with Sean that it would be appropriate in
>> this case to commit, I'm very hesitant to set that precedent. I'd prefer to
>> stick with "review then commit" and, if needed, relax that constraint for
>> parts of the project that can't get reviewers for a certain period of time.
>> We did that in another community where there weren't many reviewers and we
>> wanted to get more people involved, but we put a time limit on it and set
>> expectations to prevent any perception of abuse. I would support doing that
>> in SS.
>>
>> Thanks for being so patient on that PR. I'm sorry that you had to wait so
>> long.
>>
>> On Mon, Nov 23, 2020 at 7:11 AM Sean Owen  wrote:
>>
>>> I don't see any objections on that thread. You're a committer and have
>>> reviews from other knowledgeable people in this area. Do you have any
>>> reason to believe it's controversial, like, changes semantics or APIs? Were
>>> there related discussions elsewhere that expressed any concern?
>>>
>>> From a glance, OK it's introducing a new idea of state schema and
>>> validation; would it conflict with any other possible approaches, have any
>>> limits if this is enshrined as supported functionality? There's always some
>>> cost to introducing yet more code to support, but, this doesn't look
>>> intrusive or large.
>>>
>>> The "don't review your own PR" idea isn't hard-and-fast. I don't think
>>> anyone needs to block for anything like this long if you have other capable
>>> reviews and you are a committer, if you don't see that it impacts other
>>> code meaningfully in a way that really demands review from others, and in
>>> good faith judge that it is worthwhile. I think you are the one de facto
>>> expert on that code and indeed you can't block yourself for 1.5 years or
>>> else nothing substantial would happen.
>>>
>>>
>>>
>>> On Mon, Nov 23, 2020 at 1:18 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> I have been struggling to find reviewers who are committers, to get my
>>>> PR [1] for SPARK-27237 [2] reviewed. The PR was submitted on Mar. 2019 (1.5
>>>> years ago), and somehow it got two approvals from contributors working on
>>>> the SS area, but still doesn't get any committers' traction to review.
>>>> (I can review others' SS PRs and I'm trying to unblock other SS area
>>>> contributors, but I can't self review my SS PRs. Not sure it's technically
>>>> possible, but fully sure it's not encouraged.)
>>>>
>>>> Could I please ask help to unblock this before feature freeze for Spark
>>>> 3.1 is happening? Subm

Re: [SS] full outer stream-stream join

2020-11-22 Thread Jungtaek Lim
Adding the rationale here: my request for raising the thread on the dev
mailing list is to figure out possible reasons why full outer join wasn't
included at the time left/right outer join was added.

This is rather historical knowledge, so I have no idea about it. Most likely
only a limited number of folks can answer, and I hope we can get some
historical information.

Note that I don't object to the change. I just wanted to make sure we don't
miss something.
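
For reference while reading the quoted proposal below, here is a rough,
self-contained sketch (in Scala) of the symmetric logic it describes. The
types are simplified placeholders of my own, not Spark's actual StateStore or
join implementation:

    object FullOuterJoinSketch {
      // Simplified stand-ins; NOT Spark's real state store / row types.
      type Row = Map[String, Any]
      final class Buffered(val row: Row, var matched: Boolean)

      final class SideState(keyOf: Row => Any) {
        private var buffered = Vector.empty[Buffered]

        def put(row: Row, matched: Boolean): Unit =
          buffered :+= new Buffered(row, matched)

        def matchesFor(row: Row): Seq[Buffered] =
          buffered.filter(b => keyOf(b.row) == keyOf(row))

        // Remove and return everything below the watermark.
        def evict(isLate: Row => Boolean): Seq[Buffered] = {
          val (evicted, kept) = buffered.partition(b => isLate(b.row))
          buffered = kept
          evicted
        }
      }

      // One input row on "my" side: join against the other side's state,
      // mark the matched rows, and always buffer the input row.
      def processInput(row: Row, mySide: SideState, otherSide: SideState): Seq[Row] = {
        val matches = otherSide.matchesFor(row)
        matches.foreach(_.matched = true)
        mySide.put(row, matched = matches.nonEmpty)
        matches.map(m => row ++ m.row)
      }

      // On watermark advance, FULL OUTER additionally emits the never-matched
      // evicted rows from BOTH sides (left outer + right outer behavior); the
      // missing side's columns would be null-padded downstream.
      def onWatermarkAdvance(side: SideState, isLate: Row => Boolean): Seq[Row] =
        side.evict(isLate).collect { case b if !b.matched => b.row }
    }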

On Sat, Nov 21, 2020 at 4:15 AM real-cheng-su 
wrote:

> Hi,
>
> Stream-stream join in spark structured streaming right now supports INNER,
> LEFT OUTER, RIGHT OUTER and LEFT SEMI join type. But it does not support
> FULL OUTER join and we are working on to add it in
> https://github.com/apache/spark/pull/30395 .
>
> Given LEFT OUTER and RIGHT OUTER stream-stream join is supported, the code
> needed for FULL OUTER join is actually quite straightforward:
>
> * For left side input row, check if there's a match on right side state
> store. if there's a match, output the joined row, o.w. output nothing. Put
> the row in left side state store.
> * For right side input row, check if there's a match on left side state
> store. if there's a match, output the joined row, o.w. output nothing. Put
> the row in right side state store.
> * State store eviction: evict rows from left/right side state store below
> watermark, and output rows never matched before (a combination of left
> outer
> and right outer join).
>
> Given FULL OUTER join consumes same amount of space in state store,
> compared
> with INNER/LEFT OUTER/RIGH OUTER join, and pretty easy to add. I don’t see
> any issues from system perspective that FULL OUTER join should not be
> added.
>
> I am wondering is there any major blocker to add FULL OUTER stream-stream
> join? Asking in dev mailing list in case we miss anything besides PR review
> participation, thanks.
>
> Cheng Su
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Seeking committers' help to review on SS PR

2020-11-22 Thread Jungtaek Lim
Hi devs,

I have been struggling to find reviewers who are committers to get my PR [1]
for SPARK-27237 [2] reviewed. The PR was submitted in March 2019 (1.5 years
ago), and it got two approvals from contributors working in the SS area, but
it still doesn't get any traction from committers for review.
(I can review others' SS PRs and I'm trying to unblock other SS-area
contributors, but I can't self-review my own SS PRs. I'm not sure whether
that's technically possible, but I'm fully sure it's not encouraged.)

Could I please ask for help to unblock this before the feature freeze for
Spark 3.1 happens? Having submitted it 1.5 years ago and then continuing to
struggle to get it into Spark 3.2 (another half a year) doesn't make sense to
me.

In addition, is there a way to unblock me to work on meaningful features
instead of being stuck with small improvements? I have things in my backlog,
but I'd rather not continue struggling with new PRs.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://github.com/apache/spark/pull/24173
2. https://issues.apache.org/jira/browse/SPARK-27237


Re: [DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
I see some voices saying the topic isn't clear enough to understand. Let me
elaborate a bit more.

1. There are multiple reviewers reviewing the PR. (Say, A, B, C, D.)
2. A and B leave review comments on the PR, but no one explicitly indicates
that these review comments are their final ones.
3. The author of the PR addresses the review comments.
4. C checks that the review comments from A and B are addressed, and merges
the PR. In parallel (or a bit later), A tries to check whether the review
comments were addressed (or could even provide more review comments
afterwards), and realizes the PR is already merged.

Saying it again, there's "technically" nothing incorrect here. Let me give
another example of what I called a "trade-off".

1. There are multiple reviewers reviewing the PR. (Say, A, B, C, D.)
2. A and B leave review comments on the PR, but no one explicitly indicates
that these review comments are their final ones.
3. The author of the PR addresses the review comments.
4. C checks that the review comments from A and B are addressed, and asks A
and B to confirm whether they have any further review comments, with the
condition that the PR will be merged a few days later if there's no further
feedback.
5. If A and B confirm, or don't provide new feedback within that period, C
merges the PR. If A or B provides new feedback, go back to 3 and reset the
clock.

This is what we tend to express with a comment like "@A @B I'll leave this a
few more days to see if anyone has further comments. Otherwise I'll merge
this.".

I see both approaches used across various PRs, so this isn't really something
I want to blame anyone for. I just want us to think about which would be the
ideal approach to prefer.


On Sat, Nov 14, 2020 at 3:46 PM Jungtaek Lim 
wrote:

> Oh sorry that was gone with flame (please just consider it as my fault)
> and I just removed all comments.
>
> Btw, when I always initiate discussions, I really do love to start
> discussion "without" specific instances which tend to go blaming each
> other. I understand it's not easy to discuss without taking examples, but
> I'll try to explain the situation on my best instead. Please let me know if
> there's some ambiguous or unclear thing to think about.
>
> On Sat, Nov 14, 2020 at 3:41 PM Sean Owen  wrote:
>
>> I am sure you are referring to some specific instances but I have not
>> followed enough to know what they are. Can you point them out? I think that
>> is most productive for everyone to understand.
>>
>> On Fri, Nov 13, 2020 at 10:16 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi devs,
>>>
>>> I know this is a super sensitive topic and at a risk of flame, but just
>>> like to try this. My apologies first.
>>> Assuming we all know about the ASF policy about code commit and I don't
>>> see Spark project has any explicit BYLAWS, it's technically possible to do
>>> anything for committers to do during merging.
>>>
>>> Sometimes this goes a bit depressing for reviewers, regardless of the
>>> intention, when merger makes a judgement by oneself to merge while the
>>> reviewers are still in the review phase. I observed the practice is used
>>> frequently, under the fact that we have post-review to address further
>>> comments later.
>>>
>>> I know about the concern that it's sometimes blocking unintentionally if
>>> we require merger to gather consensus about the merge from reviewers, but
>>> we also have some other practice holding on merging for a couple of days
>>> and noticing to reviewers whether they have further comments or not, which
>>> is I think a good trade-off.
>>>
>>> Exclude the cases where we're in release blocker mode, wouldn't we be
>>> hurt too much if we ask merger to respect the practice on noticing to
>>> reviewers that merging will be happen soon and waiting a day or so? I feel
>>> the post-review is opening the possibility for reviewers late on the party
>>> to review later, but it's over-used if it is leveraged as a judgement that
>>> merger can merge at any time and reviewers can still continue reviewing.
>>> Reviewers would feel broken flow - that is not the same experience with
>>> having more time to finalize reviewing before merging.
>>>
>>> Again I know it's super hard to reconsider the ongoing practice while
>>> the project has gone for the long way (10 years), but just wanted to hear
>>> the voices about this.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>


Re: [DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
Oh, sorry - those were dropped to avoid a flame war (please just consider it
my fault); I just removed all such comments.

Btw, whenever I initiate discussions, I really do love to start the
discussion "without" specific instances, which tend to turn into blaming each
other. I understand it's not easy to discuss without examples, but I'll try
to explain the situation as best I can instead. Please let me know if there's
anything ambiguous or unclear to think about.

On Sat, Nov 14, 2020 at 3:41 PM Sean Owen  wrote:

> I am sure you are referring to some specific instances but I have not
> followed enough to know what they are. Can you point them out? I think that
> is most productive for everyone to understand.
>
> On Fri, Nov 13, 2020 at 10:16 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Hi devs,
>>
>> I know this is a super sensitive topic and at a risk of flame, but just
>> like to try this. My apologies first.
>> Assuming we all know about the ASF policy about code commit and I don't
>> see Spark project has any explicit BYLAWS, it's technically possible to do
>> anything for committers to do during merging.
>>
>> Sometimes this goes a bit depressing for reviewers, regardless of the
>> intention, when merger makes a judgement by oneself to merge while the
>> reviewers are still in the review phase. I observed the practice is used
>> frequently, under the fact that we have post-review to address further
>> comments later.
>>
>> I know about the concern that it's sometimes blocking unintentionally if
>> we require merger to gather consensus about the merge from reviewers, but
>> we also have some other practice holding on merging for a couple of days
>> and noticing to reviewers whether they have further comments or not, which
>> is I think a good trade-off.
>>
>> Exclude the cases where we're in release blocker mode, wouldn't we be
>> hurt too much if we ask merger to respect the practice on noticing to
>> reviewers that merging will be happen soon and waiting a day or so? I feel
>> the post-review is opening the possibility for reviewers late on the party
>> to review later, but it's over-used if it is leveraged as a judgement that
>> merger can merge at any time and reviewers can still continue reviewing.
>> Reviewers would feel broken flow - that is not the same experience with
>> having more time to finalize reviewing before merging.
>>
>> Again I know it's super hard to reconsider the ongoing practice while the
>> project has gone for the long way (10 years), but just wanted to hear the
>> voices about this.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>


[DISCUSS] Review/merge phase, and post-review

2020-11-13 Thread Jungtaek Lim
Hi devs,

I know this is a super sensitive topic and at risk of a flame war, but I'd
just like to try this. My apologies first.
Assuming we all know the ASF policy on code commits, and given that I don't
see the Spark project having any explicit bylaws, it's technically possible
for committers to do anything during merging.

Sometimes this gets a bit depressing for reviewers, regardless of the
intention, when the merger makes a judgement on their own to merge while the
reviewers are still in the review phase. I've observed this practice being
used frequently, on the basis that we have post-review to address further
comments later.

I know about the concern that requiring the merger to gather consensus from
reviewers about the merge can block things unintentionally, but we also have
another practice of holding off on merging for a couple of days and asking
reviewers whether they have further comments or not, which I think is a good
trade-off.

Excluding the cases where we're in release-blocker mode, would we be hurt
that much if we asked the merger to respect the practice of notifying
reviewers that merging will happen soon and waiting a day or so? I feel that
post-review opens the possibility for reviewers who are late to the party to
review afterwards, but it's over-used if it's leveraged as a judgement that
the merger can merge at any time while reviewers can still continue
reviewing. Reviewers feel their flow is broken - that is not the same
experience as having more time to finalize the review before merging.

Again, I know it's super hard to reconsider an ongoing practice when the
project has come such a long way (10 years), but I just wanted to hear voices
about this.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-08 Thread Jungtaek Lim
After the check logic was introduced in Spark 3.0, there's another related
issue I addressed in Spark 3.1, SPARK-24634 [1].

Before SPARK-24634, there was no way to know how many rows were discarded for
being late, or even whether there were any late rows at all. That said, the
issue has been a correctness issue "silently" impacting the result.
SPARK-24634 will provide the overall number of late rows in the streaming
listener, as well as the number of late rows "per operator" in the SQL UI
graph. So end users are no longer "blindly" impacted.

Even so, I'd agree that it's pretty hard to construct a query which avoids
correctness issues and still does chained stateful operations. I see two
separate JIRA issues reporting the same correctness behavior, meaning this is
already impacting end users' queries. (Even more end users may not notice the
impact, as SPARK-24634 isn't released yet.)
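
For illustration of what "chained stateful operators" means here, one shape
is a stream-stream join followed by a windowed aggregation, both depending on
the single global watermark. A rough Scala sketch; the sources and column
names are made up, and whether a given shape trips the proposed check depends
on the final implementation:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("chained-stateful").getOrCreate()
    import spark.implicits._

    val impressions = spark.readStream.format("rate").load()
      .select($"timestamp".as("impressionTime"), $"value".as("adId"))
      .withWatermark("impressionTime", "10 seconds")

    val clicks = spark.readStream.format("rate").load()
      .select($"timestamp".as("clickTime"), $"value".as("adId"))
      .withWatermark("clickTime", "10 seconds")

    val joined = impressions.join(clicks, Seq("adId"))        // stateful operator #1

    joined
      .groupBy(window($"clickTime", "1 minute"))              // stateful operator #2
      .count()
      .writeStream.format("console").outputMode("append").start()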

So overall I'm +1 on preventing such queries up front. This change could
possibly break some user queries, but I'd suspect those queries may already
be suffering from correctness issues without anyone noticing.

For sure, a better approach would be dropping the global watermark and
implementing per-operator watermarks properly. This is just a workaround, but
fixing the watermark would require a major effort.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24634


On Sat, Nov 7, 2020 at 3:59 PM Liang-Chi Hsieh  wrote:

> Hi devs,
>
> In Spark structured streaming, chained stateful operators possibly produces
> incorrect results under the global watermark. SPARK-33259
> (https://issues.apache.org/jira/browse/SPARK-33259) has an example
> demostrating what the correctness issue could be.
>
> Currently we don't prevent users running such queries. Because the possible
> correctness in chained stateful operators in streaming query is not
> straightforward for users. From users perspective, it will possibly be
> considered as a Spark bug like SPARK-33259. It is also possible the worse
> case, users are not aware of the correctness issue and use wrong results.
>
> IMO, it is better to disable such queries and let users choose to run the
> query if they understand there is such risk, instead of implicitly running
> the query and let users to find out correctness issue by themselves.
>
> I would like to propose to disable the streaming query with possible
> correctness issue in chained stateful operators. The behavior can be
> controlled by a SQL config, so if users understand the risk and still want
> to run the query, they can disable the check.
>
> In the PR (https://github.com/apache/spark/pull/30210), the concern I got
> for now is, this changes current behavior and by default it will break some
> existing streaming queries. But I think it is pretty easy to disable the
> check with the new config. In the PR currently there is no objection but
> suggestion to hear more voices. Please let me know if you have some
> thoughts.
>
> Thanks.
> Liang-Chi Hsieh
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-25 Thread Jungtaek Lim
Yeah, I'm in favor of failing fast if things are not working out as end users
intended. Spark should only fall back when it makes no difference other than
some sort of performance (like whole-stage codegen). This fallback brings
behavioral differences, which should be considered a bug.

I'll file an issue and raise a PR soon. Thanks for providing your voices!

On Sat, Oct 24, 2020 at 2:03 AM Ryan Blue  wrote:

> I agree. If the user configures an invalid catalog, it should fail and
> propagate the exception. Running with a catalog other than the one the user
> requested is incorrect.
>
> On Fri, Oct 23, 2020 at 5:24 AM Russell Spitzer 
> wrote:
>
>> I was convinced that we should probably just fail, but if that is too
>> much of a change, then logging the exception is also acceptable.
>>
>> On Thu, Oct 22, 2020, 10:32 PM Jungtaek Lim 
>> wrote:
>>
>>> Hi devs,
>>>
>>> I got another report regarding configuring v2 session catalog, when
>>> Spark fails to instantiate the configured catalog. For now, it just simply
>>> logs error message without exception information, and silently uses the
>>> default session catalog.
>>>
>>>
>>> https://github.com/apache/spark/blob/3819d39607392aa968595e3d97b84fedf83d08d9/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala#L75-L95
>>>
>>> IMO, as the user intentionally provides the session catalog, it
>>> shouldn't fail back and just throw the exception. Otherwise (if we still
>>> want to do the failback), we need to add the exception information in the
>>> error log message at least.
>>>
>>> Would like to hear the voices.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


[DISCUSS] preferred behavior when fails to instantiate configured v2 session catalog

2020-10-22 Thread Jungtaek Lim
Hi devs,

I got another report regarding configuring the v2 session catalog, for the
case when Spark fails to instantiate the configured catalog. For now, it
simply logs an error message without the exception information, and silently
uses the default session catalog.

https://github.com/apache/spark/blob/3819d39607392aa968595e3d97b84fedf83d08d9/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogManager.scala#L75-L95

IMO, as the user intentionally provides the session catalog, it shouldn't
fall back and should just throw the exception. Otherwise (if we still want to
do the fallback), we need to at least add the exception information to the
error log message.
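
For context, this is the setup being discussed - the user explicitly replaces
the session catalog, so a typo or missing jar here currently degrades to the
built-in catalog with only an error log. A minimal Scala sketch; the catalog
class name is a made-up placeholder:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("custom-session-catalog")
      // Replaces the v2 session catalog; the class name is hypothetical.
      .config("spark.sql.catalog.spark_catalog", "com.example.MyV2SessionCatalog")
      .getOrCreate()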

Would like to hear the voices.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
> If you just want to save typing the catalog name when writing table
names, you can set your custom catalog as the default catalog (See
SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is used
to extend the v1 session catalog, not replace it.

I'm sorry, but I don't get this.

The custom session catalog I use for V2_SESSION_CATALOG_IMPLEMENTATION is
intended to go through a specific (v2) provider first, and to fall back to
Spark's catalog if the table doesn't exist in that catalog. If this is not
the design intention of V2_SESSION_CATALOG_IMPLEMENTATION then OK (it should
probably be documented somewhere), but the implementation doesn't receive any
method calls, so it's a no-op even if it is only designed to extend the V1
session catalog.

My understanding is that V1 commands leverage
sparkSession.sessionState.catalog, which doesn't seem to know about the
extended session catalog. It just uses ExternalCatalog, which sticks to the
Spark built-in. That said, the functionality is only partially working. Is
this something we should fix for Spark 3.0.2/3.1.0, or is it better to
disable the feature until we ensure it works for all commands?
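
For concreteness, a rough sketch of the alternative suggested above -
registering the custom catalog under its own name and making it the default,
instead of replacing spark_catalog (the catalog name and implementation class
are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("default-catalog")
      // Register the catalog under its own name (class is hypothetical) ...
      .config("spark.sql.catalog.my_catalog", "com.example.MyV2Catalog")
      // ... and make unqualified identifiers resolve against it.
      .config("spark.sql.defaultCatalog", "my_catalog")
      .getOrCreate()

    // Unqualified names now go to my_catalog rather than spark_catalog.
    spark.sql("DROP TABLE IF EXISTS some_db.some_table")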


On Thu, Oct 8, 2020 at 1:31 AM Ryan Blue  wrote:

> I disagree that this is “by design”. An operation like DROP TABLE should
> use a v2 drop plan if the table is v2.
>
> If a v2 table is loaded or created using a v2 catalog it should also be
> dropped that way. Otherwise, the v2 catalog is not notified when the table
> is dropped and can’t perform other necessary updates, like invalidating
> caches or dropping state outside of Hive. V2 tables should always use the
> v2 API, and I’m not aware of a design where that wasn’t the case.
>
> I’d also say that for DROP TABLE in particular, all calls could use the
> v2 catalog. We may not want to do this until we are confident as Wenchen
> said, but this would be the simpler solution. The v2 catalog can delegate
> to the old session catalog, after all.
>
> On Wed, Oct 7, 2020 at 3:48 AM Wenchen Fan  wrote:
>
>> If you just want to save typing the catalog name when writing table
>> names, you can set your custom catalog as the default catalog (See
>> SQLConf.DEFAULT_CATALOG). SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION is
>> used to extend the v1 session catalog, not replace it.
>>
>> On Wed, Oct 7, 2020 at 5:36 PM Jungtaek Lim 
>> wrote:
>>
>>> If it's by design and not prepared, then IMHO replacing the default
>>> session catalog is better to be restricted until things are sorted out, as
>>> it gives pretty much confusion and has known bugs. Actually there's another
>>> bug/limitation on default session catalog on the length of identifier,
>>> so things that work with custom catalog no longer work when it replaces
>>> default session catalog.
>>>
>>> On Wed, Oct 7, 2020 at 6:05 PM Wenchen Fan  wrote:
>>>
>>>> Ah, this is by design. V1 tables should still go through the v1 session
>>>> catalog. I think we can remove this restriction when we are confident about
>>>> the new v2 DDL commands that work with v2 catalog APIs.
>>>>
>>>> On Wed, Oct 7, 2020 at 5:00 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> My case is DROP TABLE and DROP TABLE supports both v1 and v2 (as it
>>>>> simply works when I use custom catalog without replacing the default
>>>>> catalog).
>>>>>
>>>>> It just fails on v2 when the "default catalog" is replaced (say I
>>>>> replace 'spark_catalog'), because TempViewOrV1Table is providing value 
>>>>> even
>>>>> with v2 table, and then the catalyst goes with v1 exec. I guess all
>>>>> commands leveraging TempViewOrV1Table to determine whether the table is v1
>>>>> vs v2 would all suffer from this issue.
>>>>>
>>>>> On Wed, Oct 7, 2020 at 5:45 PM Wenchen Fan 
>>>>> wrote:
>>>>>
>>>>>> Not all the DDL commands support v2 catalog APIs (e.g. CREATE TABLE
>>>>>> LIKE), so it's possible that some commands still go through the v1 
>>>>>> session
>>>>>> catalog although you configured a custom v2 session catalog.
>>>>>>
>>>>>> Can you create JIRA tickets if you hit any DDL commands that don't
>>>>>> support v2 catalog? We should fix them.
>>>>>>
>>>>>> On Wed, Oct 7, 2020 at 9:15 AM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> The logical plan for the parsed statement is gett

Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
If it's by design and not ready yet, then IMHO replacing the default
session catalog should be restricted until things are sorted out, as it
causes a lot of confusion and has known bugs. Actually, there's another
bug/limitation of the default session catalog on the length of the
identifier, so things that work with a custom catalog no longer work once
it replaces the default session catalog.

On Wed, Oct 7, 2020 at 6:05 PM Wenchen Fan  wrote:

> Ah, this is by design. V1 tables should still go through the v1 session
> catalog. I think we can remove this restriction when we are confident about
> the new v2 DDL commands that work with v2 catalog APIs.
>
> On Wed, Oct 7, 2020 at 5:00 PM Jungtaek Lim 
> wrote:
>
>> My case is DROP TABLE and DROP TABLE supports both v1 and v2 (as it
>> simply works when I use custom catalog without replacing the default
>> catalog).
>>
>> It just fails on v2 when the "default catalog" is replaced (say I replace
>> 'spark_catalog'), because TempViewOrV1Table is providing value even with v2
>> table, and then the catalyst goes with v1 exec. I guess all commands
>> leveraging TempViewOrV1Table to determine whether the table is v1 vs v2
>> would all suffer from this issue.
>>
>> On Wed, Oct 7, 2020 at 5:45 PM Wenchen Fan  wrote:
>>
>>> Not all the DDL commands support v2 catalog APIs (e.g. CREATE TABLE
>>> LIKE), so it's possible that some commands still go through the v1 session
>>> catalog although you configured a custom v2 session catalog.
>>>
>>> Can you create JIRA tickets if you hit any DDL commands that don't
>>> support v2 catalog? We should fix them.
>>>
>>> On Wed, Oct 7, 2020 at 9:15 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> The logical plan for the parsed statement is getting converted either
>>>> for old one or v2, and for the former one it keeps using an external
>>>> catalog (Hive) - so replacing default session catalog with custom one and
>>>> trying to use it like it is in external catalog doesn't work, which
>>>> destroys the purpose of replacing the default session catalog.
>>>>
>>>> Btw I see one approach: in TempViewOrV1Table, if it matches
>>>> with SessionCatalogAndIdentifier where the catalog is TableCatalog, call
>>>> loadTable in catalog and see whether it's V1 table or not. Not sure it's a
>>>> viable approach though, as it requires loading a table during resolution of
>>>> the table identifier.
>>>>
>>>> On Wed, Oct 7, 2020 at 10:04 AM Ryan Blue  wrote:
>>>>
>>>>> I've hit this with `DROP TABLE` commands that should be passed to a
>>>>> registered v2 session catalog, but are handled by v1. I think that's the
>>>>> only case we hit in our downstream test suites, but we haven't been
>>>>> exploring the use of a session catalog for fallback. We use v2 for
>>>>> everything now, which avoids the problem and comes with multi-catalog
>>>>> support.
>>>>>
>>>>> On Tue, Oct 6, 2020 at 5:55 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi devs,
>>>>>>
>>>>>> I'm not sure whether it's addressed in Spark 3.1, but at least from
>>>>>> Spark 3.0.1, many SQL DDL statements don't seem to go through the custom
>>>>>> catalog when I replace default catalog with custom catalog and only 
>>>>>> provide
>>>>>> 'dbName.tableName' as table identifier.
>>>>>>
>>>>>> I'm not an expert in this area, but after skimming the code I feel
>>>>>> TempViewOrV1Table looks to be broken for the case, as it can still be a 
>>>>>> V2
>>>>>> table. Classifying the table identifier to either V2 table or "temp view 
>>>>>> or
>>>>>> v1 table" looks to be mandatory, as former and latter have different code
>>>>>> paths and different catalog interfaces.
>>>>>>
>>>>>> That sounds to me as being stuck and the only "clear" approach seems
>>>>>> to disallow default catalog with custom one. Am I missing something?
>>>>>>
>>>>>> Thanks,
>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>


Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-07 Thread Jungtaek Lim
My case is DROP TABLE, and DROP TABLE supports both v1 and v2 (it simply
works when I use a custom catalog without replacing the default catalog).

It just fails for v2 tables when the "default catalog" is replaced (say I
replace 'spark_catalog'), because TempViewOrV1Table returns a value even
for a v2 table, and then the catalyst goes down the v1 exec path. I guess
all commands leveraging TempViewOrV1Table to determine whether the table is
v1 or v2 would suffer from this issue.

On Wed, Oct 7, 2020 at 5:45 PM Wenchen Fan  wrote:

> Not all the DDL commands support v2 catalog APIs (e.g. CREATE TABLE LIKE),
> so it's possible that some commands still go through the v1 session catalog
> although you configured a custom v2 session catalog.
>
> Can you create JIRA tickets if you hit any DDL commands that don't support
> v2 catalog? We should fix them.
>
> On Wed, Oct 7, 2020 at 9:15 AM Jungtaek Lim 
> wrote:
>
>> The logical plan for the parsed statement is getting converted either for
>> old one or v2, and for the former one it keeps using an external catalog
>> (Hive) - so replacing default session catalog with custom one and trying to
>> use it like it is in external catalog doesn't work, which destroys the
>> purpose of replacing the default session catalog.
>>
>> Btw I see one approach: in TempViewOrV1Table, if it matches
>> with SessionCatalogAndIdentifier where the catalog is TableCatalog, call
>> loadTable in catalog and see whether it's V1 table or not. Not sure it's a
>> viable approach though, as it requires loading a table during resolution of
>> the table identifier.
>>
>> On Wed, Oct 7, 2020 at 10:04 AM Ryan Blue  wrote:
>>
>>> I've hit this with `DROP TABLE` commands that should be passed to a
>>> registered v2 session catalog, but are handled by v1. I think that's the
>>> only case we hit in our downstream test suites, but we haven't been
>>> exploring the use of a session catalog for fallback. We use v2 for
>>> everything now, which avoids the problem and comes with multi-catalog
>>> support.
>>>
>>> On Tue, Oct 6, 2020 at 5:55 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> I'm not sure whether it's addressed in Spark 3.1, but at least from
>>>> Spark 3.0.1, many SQL DDL statements don't seem to go through the custom
>>>> catalog when I replace default catalog with custom catalog and only provide
>>>> 'dbName.tableName' as table identifier.
>>>>
>>>> I'm not an expert in this area, but after skimming the code I feel
>>>> TempViewOrV1Table looks to be broken for the case, as it can still be a V2
>>>> table. Classifying the table identifier to either V2 table or "temp view or
>>>> v1 table" looks to be mandatory, as former and latter have different code
>>>> paths and different catalog interfaces.
>>>>
>>>> That sounds to me as being stuck and the only "clear" approach seems to
>>>> disallow default catalog with custom one. Am I missing something?
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>


Re: SQL DDL statements with replacing default catalog with custom catalog

2020-10-06 Thread Jungtaek Lim
The logical plan for the parsed statement is converted to either the old
(v1) plan or the v2 one, and the former keeps using the external catalog
(Hive) - so replacing the default session catalog with a custom one and
expecting it to be used in place of the external catalog doesn't work,
which defeats the purpose of replacing the default session catalog.

Btw I see one approach: in TempViewOrV1Table, if the identifier matches
SessionCatalogAndIdentifier where the catalog is a TableCatalog, call
loadTable on the catalog and check whether the result is a V1 table or not.
Not sure it's a viable approach though, as it requires loading a table
during resolution of the table identifier.
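
A rough sketch of that idea (names are simplified and error handling is
elided; isV1 stands in for however we would detect the v1 wrapper):

    import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier, Table, TableCatalog}

    object TableResolutionSketch {
      // Decide v1 vs v2 by actually loading the table from the (session) catalog.
      def resolvesToV1(
          catalog: CatalogPlugin,
          ident: Identifier,
          isV1: Table => Boolean): Boolean = catalog match {
        case tc: TableCatalog =>
          // The downside mentioned above: this loads the table during
          // identifier resolution, before choosing the v1 or v2 code path.
          isV1(tc.loadTable(ident))
        case _ =>
          true  // no v2 TableCatalog involved, keep the existing v1 behavior
      }
    }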

On Wed, Oct 7, 2020 at 10:04 AM Ryan Blue  wrote:

> I've hit this with `DROP TABLE` commands that should be passed to a
> registered v2 session catalog, but are handled by v1. I think that's the
> only case we hit in our downstream test suites, but we haven't been
> exploring the use of a session catalog for fallback. We use v2 for
> everything now, which avoids the problem and comes with multi-catalog
> support.
>
> On Tue, Oct 6, 2020 at 5:55 PM Jungtaek Lim 
> wrote:
>
>> Hi devs,
>>
>> I'm not sure whether it's addressed in Spark 3.1, but at least from Spark
>> 3.0.1, many SQL DDL statements don't seem to go through the custom catalog
>> when I replace default catalog with custom catalog and only provide
>> 'dbName.tableName' as table identifier.
>>
>> I'm not an expert in this area, but after skimming the code I feel
>> TempViewOrV1Table looks to be broken for the case, as it can still be a V2
>> table. Classifying the table identifier to either V2 table or "temp view or
>> v1 table" looks to be mandatory, as former and latter have different code
>> paths and different catalog interfaces.
>>
>> That sounds to me as being stuck and the only "clear" approach seems to
>> disallow default catalog with custom one. Am I missing something?
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


SQL DDL statements with replacing default catalog with custom catalog

2020-10-06 Thread Jungtaek Lim
Hi devs,

I'm not sure whether it's addressed in Spark 3.1, but at least as of Spark
3.0.1, many SQL DDL statements don't seem to go through the custom catalog
when I replace the default catalog with a custom catalog and only provide
'dbName.tableName' as the table identifier.

I'm not an expert in this area, but after skimming the code I feel
TempViewOrV1Table looks broken for this case, as the table can still be a
V2 table. Classifying the table identifier as either a V2 table or a "temp
view or v1 table" looks to be mandatory, as the former and the latter have
different code paths and different catalog interfaces.

That makes it sound like we're stuck, and the only "clean" approach seems
to be disallowing replacement of the default catalog with a custom one. Am
I missing something?

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-09-27 Thread Jungtaek Lim
bump to see if anyone is interested in or concerned about this.

On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim 
wrote:

> Bump this again.
>
> On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Bump again.
>>
>> Unlike file stream sink which has lots of limitations and many of us have
>> been suggesting alternatives, file stream source is the only way if end
>> users want to read the data from files. No alternative unless they
>> introduce another ETL & storage (probably Kafka).
>>
>> On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi German,
>>>
>>> option 1 isn't about "deleting" the old files, as your input directory
>>> may be accessed by multiple queries. Kafka centralizes the maintenance of
>>> input data hence possible to apply retention without problem.
>>> option 1 is more about "hiding" the old files being read, so that end
>>> users "may" be able to delete the files once they ensure "all queries
>>> accessing the input directory" don't see the old files.
>>>
>>> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon <
>>> gschiavonsp...@gmail.com> wrote:
>>>
>>>> HI Jungtaek,
>>>>
>>>> I have a question, aren't both approaches compatible?
>>>>
>>>> How I see it, I think It would be interesting to have a retention
>>>> period to delete old files and/or the possibility of indicating an offset
>>>> (Timestamp). It would be very "similar" to how we do it with kafka.
>>>>
>>>> WDYT?
>>>>
>>>> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> (I'd like to keep the discussion thread focusing on the specific topic
>>>>> - let's initiate another discussion threads on different topics.)
>>>>>
>>>>> Thanks for the input. I'd like to emphasize that the point in
>>>>> discussion is the "latestFirst" option - the rationalization starts from
>>>>> growing metadata log issues. I hope your input is picking option 2, but
>>>>> could you please make clear your input represents OK to "replace" the
>>>>> "latestFirst" option with "starting from timestamp"?
>>>>>
>>>>>
>>>>> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal <
>>>>> vikram.agra...@gmail.com> wrote:
>>>>>
>>>>>> If we compare file-stream source with other streaming sources such as
>>>>>> Kafka, the current behavior is indeed incomplete.  Starting the streaming
>>>>>> from a custom offset/particular point of time is something that is 
>>>>>> missing.
>>>>>> Typically filestream sources don't have auto-deletion of the older
>>>>>> data/files. In kafka we can define the retention period. So even if we 
>>>>>> use
>>>>>> "Earliest" we won't end up reading from the time when the Kafka topic was
>>>>>> created. On the other hand, streaming sources can hold very old files. 
>>>>>> It's
>>>>>> very valid use-cases to read the bulk of the old files using a batch job
>>>>>> until a particular timestamp. And then use streaming jobs for real-time
>>>>>> updates.
>>>>>>
>>>>>> So having support where we can specify a timestamp. and we would
>>>>>> consider files created post that timestamp can be useful.
>>>>>>
>>>>>> Another concern which we need to consider is the listing cost. is
>>>>>> there any way we can avoid listing the entire base directory and then
>>>>>> filtering out the new files. if the data is organized as partitions using
>>>>>> date, will it help to list only those partitions where new files were
>>>>>> added?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> bump, is there any interest on this topic?
>>>>>>>
>>>>>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>&g

Re: Output mode in Structured Streaming and DSv1 sink/DSv2 table

2020-09-27 Thread Jungtaek Lim
bump to see if anyone is interested in or concerned about this

On Sun, Sep 20, 2020 at 1:59 PM Jungtaek Lim 
wrote:

> Hi devs,
>
> We have a capability check in DSv2 defining which operations can be done
> against the data source both read and write. The concept was brought in
> DSv2, so it's not weird for DSv1 to don't have a concept.
>
> In SS the problem arises - if I understand correctly, we would like to
> couple the output mode in the query and the output table. That said,
> complete mode should enforce the output table to truncate the content.
> Update mode should enforce the output table to "upsert" or "delete and
> append" the content.
>
> Nothing has been done against the DSv1 sink - Spark doesn't enforce
> anything and works as append mode, though the query still respects the
> output mode on stateful operations.
>
> I understand we don't want to make end users surprised on broken
> compatibility, but shouldn't it be an "temporary" "exceptional" case
> and DSv2 never does it again? I'm seeing many built-in data sources being
> migrated to DSv2 with the exception of "do nothing for update/truncate",
> which simply destruct the rationalization on capability.
>
> In addition, they don't add TRUNCATE in capability but add
> SupportsTruncate in WriteBuilder, which is weird. It works as of now
> because SS misses checking capability on the writer side (I guess it only
> checks STREAMING_WRITE), but once we check capability in first place,
> things will break.
> (I'm looking into adding a writer plan in SS before analyzer, and check
> capability there.)
>
> What would be our best fix on this issue? Would we leave the
> responsibility of handling "truncate" on the data source (so do nothing is
> fine if it's intended), and just add TRUNCATE to the capability? (That
> should be documented in its data source description though.) Or drop the
> support on truncate if the data source is unable to truncate? (Foreach and
> Kafka output tables will be unable to apply complete mode afterwards.)
>
> Looking forward to hear everyone's thoughts.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-25 Thread Jungtaek Lim
Thanks Etienne! Yeah I forgot to say nice talking with you again. And sorry
I forgot to send the reply (was in draft).

Regarding investment in SS, well, unfortunately I don't know - I'm just an
individual. There might be various reasons for it, most probably "priority"
among other things. There's not much I could change.

I agree the workaround is sub-optimal, but unless I see sufficient support
in the community I probably couldn't move it forward. I'll just say there's
an elephant in the room - as the project has been going for more than 10
years, backward compatibility is a top priority concern in the project,
even across major versions for features/APIs. It is great that end users
can migrate versions easily, but it also blocks devs from fixing bad
designs once they ship. I'm the one complaining about these issues on the
dev list, and I don't see willingness to correct them.


On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot 
wrote:

> Hi Jungtaek Lim,
>
> Nice to hear from you again since last time we talked :) and congrats on
> becoming a Spark committer in the meantime ! (if I'm not mistaking you were
> not at the time)
>
> I totally agree with what you're saying on merging structural parts of
> Spark without having a broader consensus. What I don't understand is why
> there is not more investment in SS. Especially because in another thread
> the community is discussing about deprecating the regular DStream streaming
> framework.
>
> Is the orientation of Spark now mostly batch ?
>
> PS: yeah I saw your update on the doc when I took a look at 3.0 preview 2
> searching for this particular feature. And regarding the workaround, I'm
> not sure it meets my needs as it will add delays and also may mess up with
> watermarks.
>
> Best
>
> Etienne Chauchot
>
>
> On 04/09/2020 08:06, Jungtaek Lim wrote:
>
> Unfortunately I don't see enough active committers working on Structured
> Streaming; I don't expect major features/improvements can be brought in
> this situation.
>
> Technically I can review and merge the PR on major improvements in SS, but
> that depends on how huge the proposal is changing. If the proposal brings
> conceptual change, being reviewed by a committer wouldn't still be enough.
>
> So that's not due to the fact we think it's worthless. (That might be only
> me though.) I'd understand as there's not much investment on SS. There's
> also a known workaround for multiple aggregations (I've documented in the
> SS guide doc, in "Limitation of global watermark" section), though I
> totally agree the workaround is bad.
>
> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot 
> wrote:
>
>> Hi all,
>>
>> I'm also very interested in this feature but the PR is open since January
>> 2019 and was not updated. It raised a design discussion around watermarks
>> and a design doc was written (
>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>> We also commented this design but no matter what it seems that the subject
>> is still stale.
>>
>> Is there any interest in the community in delivering this feature or is
>> it considered worthless ? If the latter, can you explain why ?
>>
>> Best
>>
>> Etienne
>> On 22/05/2019 03:38, 张万新 wrote:
>>
>> Thanks, I'll check it out.
>>
>> Arun Mahadevan  于 2019年5月21日周二 01:31写道:
>>
>>> Heres the proposal for supporting it in "append" mode -
>>> https://github.com/apache/spark/pull/23576. You could see if it
>>> addresses your requirement and post your feedback in the PR.
>>> For "update" mode its going to be much harder to support this without
>>> first adding support for "retractions", otherwise we would end up with
>>> wrong results.
>>>
>>> - Arun
>>>
>>>
>>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi 
>>> wrote:
>>>
>>>> There is PR for this but not yet merged.
>>>>
>>>> On Mon, May 20, 2019 at 10:13 AM 张万新  wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I'd like to know what's the root reason why multiple aggregations on
>>>>> streaming dataframe is not allowed since it's a very useful feature, and
>>>>> flink has supported it for a long time.
>>>>>
>>>>> Thanks.
>>>>>
>>>>


Output mode in Structured Streaming and DSv1 sink/DSv2 table

2020-09-19 Thread Jungtaek Lim
Hi devs,

We have a capability check in DSv2 defining which operations can be done
against the data source, for both read and write. The concept was
introduced in DSv2, so it's not weird that DSv1 doesn't have it.

In SS the problem arises - if I understand correctly, we would like to
couple the output mode of the query with the output table. That said,
complete mode should require the output table to truncate its content, and
update mode should require the output table to "upsert" or "delete and
append" the content.

Nothing is enforced for the DSv1 sink - Spark doesn't enforce anything and
behaves as append mode, though the query still respects the output mode for
stateful operations.

I understand we don't want to surprise end users with broken compatibility,
but shouldn't it be a "temporary", "exceptional" case that DSv2 never
repeats? I'm seeing many built-in data sources being migrated to DSv2 with
the exception of "do nothing for update/truncate", which simply undermines
the rationale of the capability.

In addition, they don't add TRUNCATE to the capabilities but do add
SupportsTruncate to the WriteBuilder, which is weird. It works as of now
because SS misses the capability check on the writer side (I guess it only
checks STREAMING_WRITE), but once we check capabilities in the first place,
things will break.
(I'm looking into adding a writer plan in SS before the analyzer, and
checking capabilities there.)
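
To illustrate the mismatch, this is a rough sketch of the shape I'd expect
from a sink that handles complete mode properly (simplified; the build
methods are elided):

    import java.util
    import org.apache.spark.sql.connector.catalog.TableCapability
    import org.apache.spark.sql.connector.write.{SupportsTruncate, WriteBuilder}

    // If the WriteBuilder mixes in SupportsTruncate, the Table should also
    // advertise TRUNCATE alongside STREAMING_WRITE in capabilities();
    // otherwise a proper capability check on the writer side will break.
    class TruncatingWriteBuilder extends WriteBuilder with SupportsTruncate {
      private var doTruncate = false

      override def truncate(): WriteBuilder = {
        doTruncate = true  // honor the request instead of silently doing nothing
        this
      }
      // buildForStreaming() would then produce a StreamingWrite that actually
      // replaces the existing data when doTruncate is true (elided here).
    }

    object SinkCapabilitiesSketch {
      // What Table.capabilities() should return for such a sink.
      val capabilities: util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.STREAMING_WRITE, TableCapability.TRUNCATE)
    }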

What would be our best fix for this issue? Would we leave the
responsibility of handling "truncate" to the data source (so doing nothing
is fine if it's intended), and just add TRUNCATE to the capabilities? (That
should be documented in the data source description though.) Or drop
support for truncate if the data source is unable to truncate? (Foreach and
Kafka output tables would be unable to apply complete mode afterwards.)

Looking forward to hearing everyone's thoughts.

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread Jungtaek Lim
Yeah I realized there's a proposal for push-based shuffle, and I agree that
may unblock the architectural issue on true-streaming. (The root concern of
the continuous mode has been that it doesn't fit with the architecture of
Spark, and probably push-based shuffle could persuade me.)

I guess push-based shuffle is not the only blocker to making continuous
mode stateful (all of the micro-batch assumptions are broken in that mode,
like the global watermark, distributed checkpointing without stopping every
task, etc.), but even just repartitioning (probably easier to achieve)
would still be a good improvement for continuous mode. If someone promises
to look into the improvement after the push-based shuffle lands, I agree
that is a good reason to keep continuous mode in place.

On Tue, Sep 15, 2020 at 11:02 PM Joseph Torres 
wrote:

> It's worth noting that the push-based shuffle SPIP currently in progress
> addresses a substantial blocker in the area. If you remember when we
> removed the half-finished stateful query support, the lack of that
> functionality and the challenge of implementing it is basically why it was
> half-finished. I can't make a hard commitment, but I do plan to take a look
> at how easy it would be to build continuous shuffle support on top of the
> SPIP once it's in, and continuous mode is gonna be a lot more useful if
> most (all?) queries can run using it.
>
> On Tue, Sep 15, 2020 at 6:37 AM Sean Owen  wrote:
>
>> I think we certainly can't remove it without deprecation and a few
>> releases. If there were big problems with it that weren't getting
>> fixed, sure maybe, but lack of interest in reviewing minor changes
>> isn't necessarily a bad sign. By the same logic you'd delete graphx
>> long ago.
>>
>> Anecdotally, yes there are people using it that I know of at least,
>> but I wouldn't know a lot of them.
>> I think the question is, is it causing a problem, like a lot of
>> maintenance? doesn't sound like it.
>>
>> On Tue, Sep 15, 2020 at 8:19 AM Jungtaek Lim
>>  wrote:
>> >
>> > Probably it would depend on the meaning of "experimental". My
>> understanding of "experimental" is more likely "incubation", which may be
>> graduated finally, or may be retired.
>> >
>> > To be clear, I'm evaluating the continuous mode as "candidate to
>> retire", unless there are actual use cases in production and at least a
>> couple of community members volunteer to maintain it. As far as I see the
>> activity in a year, there's no interest for the continuous mode in
>> community members. I can refer to at least three PRs which suffered to find
>> reviewers (around 1 year) and closed on inactivity. No improvements/bug
>> fixes except trivials. It doesn't seem to get some traction - few questions
>> in SO, a few posts in google search results which were all posted around
>> the date when continuous mode was introduced. Though I would be convinced
>> if someone could provide meaningful numbers of actual use cases.
>> >
>> > If the answer really has to be taken between un-experimental or not
>> (which says retirement is not an option), I'd rather vote to leave as
>> experimental, so I just keep forgetting about it. Actually it bothers
>> sometimes even if the change is done in micro-batch side (so that's not a
>> zero cost to maintain), but still better than officially supporting it.
>> >
>> >
>> > On Tue, Sep 15, 2020 at 9:08 PM Sean Owen  wrote:
>> >>
>> >> If you're suggesting making it un-Experimental, probably yes, as it is
>> >> de facto not going to change much I expect.
>> >> If you're saying remove it, probably not? I don't see that it's
>> >> anywhere near deprecated, and not sure it's unmaintained - obviously
>> >> tests etc still have to keep passing.
>> >>
>> >> On Mon, Sep 14, 2020 at 11:34 PM Jungtaek Lim
>> >>  wrote:
>> >> >
>> >> > Hi devs,
>> >> >
>> >> > It was Spark 2.3 in Feb 2018 which introduced continuous mode in
>> Structured Streaming as "experimental".
>> >> >
>> >> > Now we are here at 2.5 years after its release - I feel it would be
>> a good time to evaluate the mode, whether the mode has been widely used or
>> not, and the mode has been making progress, as the mode is "experimental".
>> >> >
>> >> > At least from the surface I don't see any active effort for
>> continuous mode around the community - the last major effort was stateful
>> operation which was incomplete and I removed th

Re: [DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-15 Thread Jungtaek Lim
Probably it would depend on the meaning of "experimental". My understanding
of "experimental" is closer to "incubation", which may eventually be
graduated, or may be retired.

To be clear, I'm evaluating continuous mode as a "candidate to retire",
unless there are actual use cases in production and at least a couple of
community members volunteer to maintain it. As far as I can see from the
activity over the past year, there's no interest in continuous mode among
community members. I can point to at least three PRs which struggled to
find reviewers (for around a year) and were closed due to inactivity. No
improvements/bug fixes except trivial ones. It doesn't seem to get much
traction - few questions on SO, and a few posts in Google search results
which were all published around the date when continuous mode was
introduced. Though I would be convinced if someone could provide meaningful
numbers on actual use cases.

If the answer really has to be a choice between making it un-experimental
or not (which implies retirement is not an option), I'd rather vote to
leave it as experimental, so I can just keep forgetting about it. Actually
it sometimes gets in the way even when a change is on the micro-batch side
(so it's not zero cost to maintain), but that's still better than
officially supporting it.


On Tue, Sep 15, 2020 at 9:08 PM Sean Owen  wrote:

> If you're suggesting making it un-Experimental, probably yes, as it is
> de facto not going to change much I expect.
> If you're saying remove it, probably not? I don't see that it's
> anywhere near deprecated, and not sure it's unmaintained - obviously
> tests etc still have to keep passing.
>
> On Mon, Sep 14, 2020 at 11:34 PM Jungtaek Lim
>  wrote:
> >
> > Hi devs,
> >
> > It was Spark 2.3 in Feb 2018 which introduced continuous mode in
> Structured Streaming as "experimental".
> >
> > Now we are here at 2.5 years after its release - I feel it would be a
> good time to evaluate the mode, whether the mode has been widely used or
> not, and the mode has been making progress, as the mode is "experimental".
> >
> > At least from the surface I don't see any active effort for continuous
> mode around the community - the last major effort was stateful operation
> which was incomplete and I removed that. There were some couples of bug
> reports as well as fixes more than a year ago and almost nothing has been
> handled. (A trivial bugfix PR has been merged recently but that's all.) The
> new features introduced to the Structured Streaming (at least observable
> metrics, SS UI) don't apply to continuous mode, and no one made "support
> continuous mode" as a hard requirement on passing review in these PRs.
> >
> > I have no idea how many companies are using the mode in production
> (please add the voice if someone has statistics about this) but I don't see
> any bug reports recently, and see only a few questions in SO, which makes
> me think about cost on maintenance.
> >
> > I know there's a mood to avoid discontinue support as possible, but it
> sounds weird to keep something as "unmaintained", especially it's still
> "experimental" and main authors are no more active enough to promise
> maintenance/improvement on the module. Thoughts?
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
>


[DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-14 Thread Jungtaek Lim
Hi devs,

It was Spark 2.3 in Feb 2018 which introduced continuous mode in Structured
Streaming as "experimental".

Now we are 2.5 years after its release - I feel it would be a good time to
evaluate the mode: whether it has been widely used or not, and whether it
has been making progress, given that the mode is "experimental".

At least on the surface I don't see any active effort on continuous mode
around the community - the last major effort was stateful operation
support, which was incomplete and which I removed. There were a couple of
bug reports as well as fixes more than a year ago, and almost nothing has
been handled. (A trivial bugfix PR has been merged recently, but that's
all.) The new features introduced to Structured Streaming (at least
observable metrics and the SS UI) don't apply to continuous mode, and no
one made "supports continuous mode" a hard requirement for passing review
in these PRs.

I have no idea how many companies are using the mode in production (please
add your voice if you have statistics about this), but I don't see any
recent bug reports and see only a few questions on SO, which makes me think
about the cost of maintenance.

I know there's a mood to avoid discontinuing support where possible, but it
sounds weird to keep something around as "unmaintained", especially when
it's still "experimental" and the main authors are no longer active enough
to promise maintenance/improvement of the module. Thoughts?

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-04 Thread Jungtaek Lim
Unfortunately I don't see enough active committers working on Structured
Streaming; I don't expect major features/improvements to land in this
situation.

Technically I can review and merge a PR for a major improvement in SS, but
that depends on how much the proposal changes. If the proposal brings a
conceptual change, being reviewed by a single committer still wouldn't be
enough.

So it's not that we think it's worthless. (That might be only me though.)
I'd read it as there not being much investment in SS. There's also a known
workaround for multiple aggregations (I've documented it in the SS guide
doc, in the "Limitation of global watermark" section), though I totally
agree the workaround is bad.
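
For context, this is the kind of workaround meant here (not necessarily
word-for-word what the guide describes): chain two streaming queries
through intermediate storage. Paths, schema, and formats below are
placeholders.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object TwoStageAggregationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("two-stage-agg").getOrCreate()
        import spark.implicits._

        val input = spark.readStream
          .format("json")
          .schema("ts TIMESTAMP, key STRING, value LONG")
          .load("/data/input")

        // Query 1: first aggregation, persisted to an intermediate location.
        input.withWatermark("ts", "10 minutes")
          .groupBy(window($"ts", "5 minutes"), $"key")
          .agg(sum($"value").as("partial_sum"))
          .writeStream
          .format("parquet")
          .option("checkpointLocation", "/chk/stage1")
          .start("/data/stage1")

        // Query 2: second aggregation over the intermediate output.
        spark.readStream
          .schema("window STRUCT<start: TIMESTAMP, end: TIMESTAMP>, key STRING, partial_sum LONG")
          .parquet("/data/stage1")
          .groupBy($"window")
          .agg(sum($"partial_sum").as("total"))
          .writeStream
          .format("console")
          .outputMode("complete")
          .option("checkpointLocation", "/chk/stage2")
          .start()

        spark.streams.awaitAnyTermination()
      }
    }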

On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot 
wrote:

> Hi all,
>
> I'm also very interested in this feature but the PR is open since January
> 2019 and was not updated. It raised a design discussion around watermarks
> and a design doc was written (
> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
> We also commented this design but no matter what it seems that the subject
> is still stale.
>
> Is there any interest in the community in delivering this feature or is it
> considered worthless ? If the latter, can you explain why ?
>
> Best
>
> Etienne
> On 22/05/2019 03:38, 张万新 wrote:
>
> Thanks, I'll check it out.
>
> Arun Mahadevan  于 2019年5月21日周二 01:31写道:
>
>> Heres the proposal for supporting it in "append" mode -
>> https://github.com/apache/spark/pull/23576. You could see if it
>> addresses your requirement and post your feedback in the PR.
>> For "update" mode its going to be much harder to support this without
>> first adding support for "retractions", otherwise we would end up with
>> wrong results.
>>
>> - Arun
>>
>>
>> On Mon, 20 May 2019 at 01:34, Gabor Somogyi 
>> wrote:
>>
>>> There is PR for this but not yet merged.
>>>
>>> On Mon, May 20, 2019 at 10:13 AM 张万新  wrote:
>>>
 Hi there,

 I'd like to know what's the root reason why multiple aggregations on
 streaming dataframe is not allowed since it's a very useful feature, and
 flink has supported it for a long time.

 Thanks.

>>>


Re: [VOTE] Release Spark 3.0.1 (RC3)

2020-08-29 Thread Jungtaek Lim
+1 (non-binding)

- Confirmed all combinations of Jenkins build passed for RC3.

spark-branch-3.0-test-maven-hadoop-2.7-hive-1.2 actually failed for the
commit, but the build passed again on the next commit, which was only a doc
change, so it was just due to a flaky test.

- Downloaded and verified all asc and sha512 files.

- Checked no blocker issues exist on 3.0.1.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Sat, Aug 29, 2020 at 11:28 AM Sean Owen  wrote:

> +1 from me. Same result as the last RC. I did see this test failure
> but I think it was transient; unless anyone else sees it.
>
> - SPARK-9757 Persist Parquet relation with decimal column *** FAILED ***
>   spark-submit returned with exit code 137.
>   Command line: './bin/spark-submit' '--class'
> 'org.apache.spark.sql.hive.SPARK_9757' '--name' 'SparkSQLConfTest'
> '--master' 'local-cluster[2,1,1024]' '--conf' 'spark.ui.enabled=false'
> '--conf' 'spark.master.rest.enabled=false' '--driver-java-options'
> '-Dderby.system.durability=test'
>
> 'file:/home/srowen/spark-3.0.1/sql/hive/target/tmp/spark-8ebab30d-7402-4ab4-9a1a-4f0a562c4e78/testJar-1598659287705.jar'
>
>
> On Fri, Aug 28, 2020 at 9:02 AM 郑瑞峰  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 3.0.1.
> >
> > The vote is open until Sep 2nd at 9AM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.0.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > There are currently no issues targeting 3.0.1 (try project = SPARK AND
> "Target Version/s" = "3.0.1" AND status in (Open, Reopened, "In Progress"))
> >
> > The tag to be voted on is v3.0.1-rc3 (commit
> dc04bf53fe821b7a07f817966c6c173f3b3788c6):
> > https://github.com/apache/spark/tree/v3.0.1-rc3
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.1-rc3-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1357/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.1-rc3-docs/
> >
> > The list of bug fixes going into 3.0.1 can be found at the following URL:
> > https://s.apache.org/q9g2d
> >
> > This release is using the release script of the tag v3.0.1-rc3.
> >
> > FAQ
> >
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.0.1?
> > ===
> >
> > The current list of open tickets targeted at 3.0.1 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.1
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-25 Thread Jungtaek Lim
Bump this again.

On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim 
wrote:

> Bump again.
>
> Unlike file stream sink which has lots of limitations and many of us have
> been suggesting alternatives, file stream source is the only way if end
> users want to read the data from files. No alternative unless they
> introduce another ETL & storage (probably Kafka).
>
> On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim 
> wrote:
>
>> Hi German,
>>
>> option 1 isn't about "deleting" the old files, as your input directory
>> may be accessed by multiple queries. Kafka centralizes the maintenance of
>> input data hence possible to apply retention without problem.
>> option 1 is more about "hiding" the old files being read, so that end
>> users "may" be able to delete the files once they ensure "all queries
>> accessing the input directory" don't see the old files.
>>
>> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon 
>> wrote:
>>
>>> HI Jungtaek,
>>>
>>> I have a question, aren't both approaches compatible?
>>>
>>> How I see it, I think It would be interesting to have a retention period
>>> to delete old files and/or the possibility of indicating an offset
>>> (Timestamp). It would be very "similar" to how we do it with kafka.
>>>
>>> WDYT?
>>>
>>> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim 
>>> wrote:
>>>
>>>> (I'd like to keep the discussion thread focusing on the specific topic
>>>> - let's initiate another discussion threads on different topics.)
>>>>
>>>> Thanks for the input. I'd like to emphasize that the point in
>>>> discussion is the "latestFirst" option - the rationalization starts from
>>>> growing metadata log issues. I hope your input is picking option 2, but
>>>> could you please make clear your input represents OK to "replace" the
>>>> "latestFirst" option with "starting from timestamp"?
>>>>
>>>>
>>>> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal <
>>>> vikram.agra...@gmail.com> wrote:
>>>>
>>>>> If we compare file-stream source with other streaming sources such as
>>>>> Kafka, the current behavior is indeed incomplete.  Starting the streaming
>>>>> from a custom offset/particular point of time is something that is 
>>>>> missing.
>>>>> Typically filestream sources don't have auto-deletion of the older
>>>>> data/files. In kafka we can define the retention period. So even if we use
>>>>> "Earliest" we won't end up reading from the time when the Kafka topic was
>>>>> created. On the other hand, streaming sources can hold very old files. 
>>>>> It's
>>>>> very valid use-cases to read the bulk of the old files using a batch job
>>>>> until a particular timestamp. And then use streaming jobs for real-time
>>>>> updates.
>>>>>
>>>>> So having support where we can specify a timestamp. and we would
>>>>> consider files created post that timestamp can be useful.
>>>>>
>>>>> Another concern which we need to consider is the listing cost. is
>>>>> there any way we can avoid listing the entire base directory and then
>>>>> filtering out the new files. if the data is organized as partitions using
>>>>> date, will it help to list only those partitions where new files were
>>>>> added?
>>>>>
>>>>>
>>>>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> bump, is there any interest on this topic?
>>>>>>
>>>>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> (Just to add rationalization, you can refer the original mail thread
>>>>>>> on dev@ list to see efforts on addressing problems in file stream
>>>>>>> source / sink -
>>>>>>> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
>>>>>>> )
>>>>>>>
>>>>>>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
>>>>>>> kabhwan.opensou...@gmail.com> wrote

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-17 Thread Jungtaek Lim
Bump again.

Unlike the file stream sink, which has lots of limitations and for which
many of us have been suggesting alternatives, the file stream source is the
only way for end users to read data from files as a stream. There is no
alternative unless they introduce another ETL pipeline & storage (probably
Kafka).

On Fri, Jul 31, 2020 at 3:06 PM Jungtaek Lim 
wrote:

> Hi German,
>
> option 1 isn't about "deleting" the old files, as your input directory may
> be accessed by multiple queries. Kafka centralizes the maintenance of input
> data hence possible to apply retention without problem.
> option 1 is more about "hiding" the old files being read, so that end
> users "may" be able to delete the files once they ensure "all queries
> accessing the input directory" don't see the old files.
>
> On Fri, Jul 31, 2020 at 2:57 PM German Schiavon 
> wrote:
>
>> HI Jungtaek,
>>
>> I have a question, aren't both approaches compatible?
>>
>> How I see it, I think It would be interesting to have a retention period
>> to delete old files and/or the possibility of indicating an offset
>> (Timestamp). It would be very "similar" to how we do it with kafka.
>>
>> WDYT?
>>
>> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim 
>> wrote:
>>
>>> (I'd like to keep the discussion thread focusing on the specific topic -
>>> let's initiate another discussion threads on different topics.)
>>>
>>> Thanks for the input. I'd like to emphasize that the point in discussion
>>> is the "latestFirst" option - the rationalization starts from
>>> growing metadata log issues. I hope your input is picking option 2, but
>>> could you please make clear your input represents OK to "replace" the
>>> "latestFirst" option with "starting from timestamp"?
>>>
>>>
>>> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal 
>>> wrote:
>>>
>>>> If we compare file-stream source with other streaming sources such as
>>>> Kafka, the current behavior is indeed incomplete.  Starting the streaming
>>>> from a custom offset/particular point of time is something that is missing.
>>>> Typically filestream sources don't have auto-deletion of the older
>>>> data/files. In kafka we can define the retention period. So even if we use
>>>> "Earliest" we won't end up reading from the time when the Kafka topic was
>>>> created. On the other hand, streaming sources can hold very old files. It's
>>>> very valid use-cases to read the bulk of the old files using a batch job
>>>> until a particular timestamp. And then use streaming jobs for real-time
>>>> updates.
>>>>
>>>> So having support where we can specify a timestamp. and we would
>>>> consider files created post that timestamp can be useful.
>>>>
>>>> Another concern which we need to consider is the listing cost. is there
>>>> any way we can avoid listing the entire base directory and then filtering
>>>> out the new files. if the data is organized as partitions using date, will
>>>> it help to list only those partitions where new files were added?
>>>>
>>>>
>>>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> bump, is there any interest on this topic?
>>>>>
>>>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> (Just to add rationalization, you can refer the original mail thread
>>>>>> on dev@ list to see efforts on addressing problems in file stream
>>>>>> source / sink -
>>>>>> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
>>>>>> )
>>>>>>
>>>>>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
>>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi devs,
>>>>>>>
>>>>>>> As I have been going through the various issues on metadata log
>>>>>>> growing, it's not only the issue of sink, but also the issue of source.
>>>>>>> Unlike sink metadata log which entries should be available to the
>>>>>>> readers, the source metadata log is only for the streaming query 
>>>>>>> starting
>>>

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-31 Thread Jungtaek Lim
Hi German,

Option 1 isn't about "deleting" the old files, as your input directory may
be accessed by multiple queries. Kafka centralizes the maintenance of the
input data, hence it's possible to apply retention without problems.
Option 1 is more about "hiding" the old files from being read, so that end
users "may" be able to delete the files once they ensure that "all queries
accessing the input directory" no longer see the old files.
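
For reference, a minimal sketch of the existing file stream source options
being discussed (path and schema are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    object FileSourceOptionsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("latestFirst-sketch").getOrCreate()
        val schema = new StructType().add("id", LongType).add("payload", StringType)

        val df = spark.readStream
          .format("json")
          .schema(schema)
          .option("latestFirst", "true")  // the option in question: process newest files first
          .option("maxFileAge", "7d")     // existing age-based filter on input files
          .load("/data/input")

        df.writeStream
          .format("console")
          .option("checkpointLocation", "/chk/file-source")
          .start()
          .awaitTermination()
      }
    }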

On Fri, Jul 31, 2020 at 2:57 PM German Schiavon 
wrote:

> HI Jungtaek,
>
> I have a question, aren't both approaches compatible?
>
> How I see it, I think It would be interesting to have a retention period
> to delete old files and/or the possibility of indicating an offset
> (Timestamp). It would be very "similar" to how we do it with kafka.
>
> WDYT?
>
> On Thu, 30 Jul 2020 at 23:51, Jungtaek Lim 
> wrote:
>
>> (I'd like to keep the discussion thread focusing on the specific topic -
>> let's initiate another discussion threads on different topics.)
>>
>> Thanks for the input. I'd like to emphasize that the point in discussion
>> is the "latestFirst" option - the rationalization starts from
>> growing metadata log issues. I hope your input is picking option 2, but
>> could you please make clear your input represents OK to "replace" the
>> "latestFirst" option with "starting from timestamp"?
>>
>>
>> On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal 
>> wrote:
>>
>>> If we compare file-stream source with other streaming sources such as
>>> Kafka, the current behavior is indeed incomplete.  Starting the streaming
>>> from a custom offset/particular point of time is something that is missing.
>>> Typically filestream sources don't have auto-deletion of the older
>>> data/files. In kafka we can define the retention period. So even if we use
>>> "Earliest" we won't end up reading from the time when the Kafka topic was
>>> created. On the other hand, streaming sources can hold very old files. It's
>>> very valid use-cases to read the bulk of the old files using a batch job
>>> until a particular timestamp. And then use streaming jobs for real-time
>>> updates.
>>>
>>> So having support where we can specify a timestamp. and we would
>>> consider files created post that timestamp can be useful.
>>>
>>> Another concern which we need to consider is the listing cost. is there
>>> any way we can avoid listing the entire base directory and then filtering
>>> out the new files. if the data is organized as partitions using date, will
>>> it help to list only those partitions where new files were added?
>>>
>>>
>>> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> bump, is there any interest on this topic?
>>>>
>>>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> (Just to add rationalization, you can refer the original mail thread
>>>>> on dev@ list to see efforts on addressing problems in file stream
>>>>> source / sink -
>>>>> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
>>>>> )
>>>>>
>>>>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Hi devs,
>>>>>>
>>>>>> As I have been going through the various issues on metadata log
>>>>>> growing, it's not only the issue of sink, but also the issue of source.
>>>>>> Unlike sink metadata log which entries should be available to the
>>>>>> readers, the source metadata log is only for the streaming query starting
>>>>>> from the checkpoint, hence in theory it should only memorize about
>>>>>> minimal entries which prevent processing multiple times on the same file.
>>>>>>
>>>>>> This is not applied to the file stream source, and I think it's
>>>>>> because of the existence of the "latestFirst" option which I haven't seen
>>>>>> from any sources. The option works as reading files in "backward" order,
>>>>>> which means Spark can read the oldest file and latest file together in a
>>>>>> micro-batch, which ends up having to memorize all files previo

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Jungtaek Lim
+1 (non-binding, I guess)

Thanks for raising the issue and sorting it out!

On Fri, Jul 31, 2020 at 6:47 AM Holden Karau  wrote:

> Hi Spark Developers,
>
> After the discussion of the proposal to amend Spark committer guidelines,
> it appears folks are generally in agreement on policy clarifications. (See
> https://lists.apache.org/thread.html/r6706e977fda2c474a7f24775c933c2f46ea19afbfafb03c90f6972ba%40%3Cdev.spark.apache.org%3E,
> as well as some on the private@ list for PMC.) Therefore, I am calling
> for a majority VOTE, which will last at least 72 hours. See the ASF voting
> rules for procedural changes at
> https://www.apache.org/foundation/voting.html.
>
> The proposal is to add a new section entitled “When to Commit” to the
> Spark committer guidelines, currently at
> https://spark.apache.org/committers.html.
>
> ** START OF CHANGE **
>
> PRs shall not be merged during active, on-topic discussion unless they
> address issues such as critical security fixes of a public vulnerability.
> Under extenuating circumstances, PRs may be merged during active, off-topic
> discussion and the discussion directed to a more appropriate venue. Time
> should be given prior to merging for those involved with the conversation
> to explain if they believe they are on-topic.
>
> Lazy consensus requires giving time for discussion to settle while
> understanding that people may not be working on Spark as their full-time
> job and may take holidays. It is believed that by doing this, we can limit
> how often people feel the need to exercise their veto.
>
> All -1s with justification merit discussion.  A -1 from a non-committer
> can be overridden only with input from multiple committers, and suitable
> time must be offered for any committer to raise concerns. A -1 from a
> committer who cannot be reached requires a consensus vote of the PMC under
> ASF voting rules to determine the next steps within the ASF guidelines for
> code vetoes ( https://www.apache.org/foundation/voting.html ).
>
> These policies serve to reiterate the core principle that code must not be
> merged with a pending veto or before a consensus has been reached (lazy or
> otherwise).
>
> It is the PMC’s hope that vetoes continue to be infrequent, and when they
> occur, that all parties will take the time to build consensus prior to
> additional feature work.
>
> Being a committer means exercising your judgement while working in a
> community of people with diverse views. There is nothing wrong in getting a
> second (or third or fourth) opinion when you are uncertain. Thank you for
> your dedication to the Spark project; it is appreciated by the developers
> and users of Spark.
>
> It is hoped that these guidelines do not slow down development; rather, by
> removing some of the uncertainty, the goal is to make it easier for us to
> reach consensus. If you have ideas on how to improve these guidelines or
> other Spark project operating procedures, you should reach out on the dev@
> list to start the discussion.
>
> ** END OF CHANGE TEXT **
>
> I want to thank everyone who has been involved with the discussion leading
> to this proposal and those of you who take the time to vote on this. I look
> forward to our continued collaboration in building Apache Spark.
>
> I believe we share the goal of creating a welcoming community around the
> project. On a personal note, it is my belief that consistently applying
> this policy around commits can help to make a more accessible and welcoming
> community.
>
> Kind Regards,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread Jungtaek Lim
(I'd like to keep the discussion thread focusing on the specific topic -
let's initiate another discussion threads on different topics.)

Thanks for the input. I'd like to emphasize that the point under discussion
is the "latestFirst" option - the rationale starts from the growing
metadata log issue. I read your input as picking option 2, but could you
please confirm that you are OK with "replacing" the "latestFirst" option
with "starting from a timestamp"?


On Thu, Jul 30, 2020 at 4:48 PM vikram agrawal 
wrote:

> If we compare the file stream source with other streaming sources such as
> Kafka, the current behavior is indeed incomplete. Starting the stream from
> a custom offset / a particular point in time is something that is missing.
> Typically file stream sources don't have auto-deletion of older data/files.
> In Kafka we can define a retention period, so even if we use "earliest" we
> won't end up reading from the time the Kafka topic was created. On the
> other hand, a file stream source can hold very old files. It is a very
> valid use case to read the bulk of the old files with a batch job up to a
> particular timestamp, and then use a streaming job for real-time updates.
>
> So having support where we can specify a timestamp, and only consider
> files created after that timestamp, would be useful.
>
> Another concern we need to consider is the listing cost. Is there any way
> we can avoid listing the entire base directory and then filtering out the
> new files? If the data is organized into date partitions, would it help to
> list only those partitions where new files were added?
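
A hedged sketch of one partial mitigation for the listing concern above (the
schema and path layout are placeholders): the file stream source accepts glob
paths, so a query can be scoped to specific date partitions, although this only
narrows the listing rather than changing how the source lists files in general.

    import org.apache.spark.sql.SparkSession

    object PartitionScopedListingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("partition-scoped-listing").getOrCreate()

        val df = spark.readStream
          .format("parquet")
          .schema("id LONG, ts TIMESTAMP")      // placeholder schema
          // Glob over a date-partitioned layout so only recent partitions are listed:
          .load("/data/events/date=2020-07-*")  // placeholder path layout

        df.writeStream.format("console").start().awaitTermination()
      }
    }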
>
>
> On Thu, Jul 30, 2020 at 11:22 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> bump, is there any interest on this topic?
>>
>> On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> (Just to add rationalization, you can refer the original mail thread on
>>> dev@ list to see efforts on addressing problems in file stream source /
>>> sink -
>>> https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
>>> )
>>>
>>> On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> As I have been going through the various issues on metadata log
>>>> growing, it's not only the issue of sink, but also the issue of source.
>>>> Unlike sink metadata log which entries should be available to the
>>>> readers, the source metadata log is only for the streaming query starting
>>>> from the checkpoint, hence in theory it should only memorize about
>>>> minimal entries which prevent processing multiple times on the same file.
>>>>
>>>> This is not applied to the file stream source, and I think it's because
>>>> of the existence of the "latestFirst" option which I haven't seen from any
>>>> sources. The option works as reading files in "backward" order, which means
>>>> Spark can read the oldest file and latest file together in a micro-batch,
>>>> which ends up having to memorize all files previously read. The option can
>>>> be changed during query restart, so even if the query is started with
>>>> "latestFirst" being false, it's not safe to apply the logic of minimizing
>>>> entries to memorize, as the option can be changed to true and then we'll
>>>> read files again.
>>>>
>>>> I'm seeing two approaches here:
>>>>
>>>> 1) apply "retention" - unlike "maxFileAge", the option would apply to
>>>> latestFirst as well. That said, if the retention is set to 7 days, the
>>>> files older than 7 days would never be read in any way. With this approach
>>>> we can at least get rid of entries which are older than retention. The
>>>> issue is how to play nicely with existing "maxFileAge", as it also plays
>>>> similar with the retention, though it's being ignored when latestFirst is
>>>> turned on. (Change the semantic of "maxFileAge" vs leave it to "soft
>>>> retention" and introduce another option.)
>>>>
>>>> (This approach is being proposed under SPARK-17604, and PR is available
>>>> - https://github.com/apache/spark/pull/28422)
>>>>
>>>> 2) replace "latestFirst" option with alternatives, which no longer read
>>>> in "backward" order - this doesn't say we have to read all files to move
>>>> forward. As we do with Kafka, start offset can be provided, ideally as a
>>>> timestamp, which Spark will read from such timestamp and forward order.
>>>> This doesn't cover all use cases of "latestFirst", but "latestFirst"
>>>> doesn't seem to be natural with the concept of SS (think about watermark),
>>>> I'd prefer to support alternatives instead of struggling with 
>>>> "latestFirst".
>>>>
>>>> Would like to hear your opinions.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>


Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-19 Thread Jungtaek Lim
(Just to add rationalization, you can refer the original mail thread on dev@
list to see efforts on addressing problems in file stream source / sink -
https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E
)

On Mon, Jul 20, 2020 at 6:18 AM Jungtaek Lim 
wrote:

> Hi devs,
>
> As I have been going through the various issues on metadata log growing,
> it's not only the issue of sink, but also the issue of source.
> Unlike sink metadata log which entries should be available to the readers,
> the source metadata log is only for the streaming query starting
> from the checkpoint, hence in theory it should only memorize about minimal
> entries which prevent processing multiple times on the same file.
>
> This is not applied to the file stream source, and I think it's because of
> the existence of the "latestFirst" option which I haven't seen from any
> sources. The option works as reading files in "backward" order, which means
> Spark can read the oldest file and latest file together in a micro-batch,
> which ends up having to memorize all files previously read. The option can
> be changed during query restart, so even if the query is started with
> "latestFirst" being false, it's not safe to apply the logic of minimizing
> entries to memorize, as the option can be changed to true and then we'll
> read files again.
>
> I'm seeing two approaches here:
>
> 1) apply "retention" - unlike "maxFileAge", the option would apply to
> latestFirst as well. That said, if the retention is set to 7 days, the
> files older than 7 days would never be read in any way. With this approach
> we can at least get rid of entries which are older than retention. The
> issue is how to play nicely with existing "maxFileAge", as it also plays
> similar with the retention, though it's being ignored when latestFirst is
> turned on. (Change the semantic of "maxFileAge" vs leave it to "soft
> retention" and introduce another option.)
>
> (This approach is being proposed under SPARK-17604, and PR is available -
> https://github.com/apache/spark/pull/28422)
>
> 2) replace "latestFirst" option with alternatives, which no longer read in
> "backward" order - this doesn't say we have to read all files to move
> forward. As we do with Kafka, start offset can be provided, ideally as a
> timestamp, which Spark will read from such timestamp and forward order.
> This doesn't cover all use cases of "latestFirst", but "latestFirst"
> doesn't seem to be natural with the concept of SS (think about watermark),
> I'd prefer to support alternatives instead of struggling with "latestFirst".
>
> Would like to hear your opinions.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>


[DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-19 Thread Jungtaek Lim
Hi devs,

While going through the various issues around metadata log growth, I
realized it's not only an issue for the sink, but also for the source.
Unlike the sink metadata log, whose entries must remain available to
readers, the source metadata log exists only for the streaming query
restarting from the checkpoint; in theory it should therefore memorize only
the minimal set of entries needed to prevent processing the same file
multiple times.

This does not hold for the file stream source, and I think that's because
of the "latestFirst" option, which I haven't seen in any other source. The
option reads files in "backward" order, which means Spark can read the
oldest file and the latest file together in one micro-batch, and so it ends
up having to memorize all files previously read. The option can also be
changed on query restart, so even if the query starts with "latestFirst"
set to false, it's not safe to apply the logic of minimizing memorized
entries, as the option could be changed to true and we would then read
files again.

I'm seeing two approaches here:

1) Apply "retention" - unlike "maxFileAge", this option would apply when
latestFirst is enabled as well. That is, if the retention is set to 7 days,
files older than 7 days would never be read in any way. With this approach
we can at least get rid of entries which are older than the retention
period. The issue is how to play nicely with the existing "maxFileAge",
which behaves similarly to a retention but is ignored when latestFirst is
turned on. (Either change the semantics of "maxFileAge", or leave it as a
"soft retention" and introduce another option.)

(This approach is being proposed under SPARK-17604, and a PR is available -
https://github.com/apache/spark/pull/28422)

2) Replace the "latestFirst" option with alternatives which no longer read
in "backward" order - this doesn't mean we have to read all files to move
forward. As we do with Kafka, a start offset could be provided, ideally as
a timestamp, from which Spark would read onward in forward order. This
doesn't cover all use cases of "latestFirst", but "latestFirst" doesn't sit
naturally with the concepts of Structured Streaming (think about
watermarks), so I'd prefer to support alternatives instead of struggling
with "latestFirst".

Would like to hear your opinions.

Thanks,
Jungtaek Lim (HeartSaVioR)
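
A minimal Scala sketch of how the options discussed above are set on the file
stream source today ("latestFirst", "maxFileAge" and "maxFilesPerTrigger" are
existing options; the "retention" option from approach 1 is only a proposal and
is shown as a comment; schema and paths are placeholders):

    import org.apache.spark.sql.SparkSession

    object FileSourceOptionsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("file-source-options").getOrCreate()

        val df = spark.readStream
          .format("json")
          .schema("id LONG, ts TIMESTAMP")  // placeholder schema
          .option("latestFirst", "true")    // read newest files first
          .option("maxFileAge", "7d")       // per the discussion, ignored when latestFirst is on
          .option("maxFilesPerTrigger", "100")
          // Proposed in approach 1 (hypothetical, per SPARK-17604 / PR 28422):
          // .option("retention", "7d")
          .load("/data/input")              // placeholder path

        df.writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/checkpoint-file-source") // placeholder
          .start()
          .awaitTermination()
      }
    }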


Re: Use /usr/bin/env python3 in scripts?

2020-07-17 Thread Jungtaek Lim
For me the merge script worked with Python 2.7, but I ran into an encoding
issue (probably from a contributor's name), so now I run the merge script
in a virtualenv with Python 3.7.7.

"python3" would be OK for me as well, as it doesn't break a virtualenv
based on Python 3.

On Sat, Jul 18, 2020 at 6:13 AM Driesprong, Fokko 
wrote:

> +1 I'm in favor of using python3
>
> Cheers, Fokko
>
> Op vr 17 jul. 2020 om 19:49 schreef Sean Owen :
>
>> Yeah I figured it's a best practice, so I'll raise a PR unless
>> somebody tells me not to. This is about build scripts, not Pyspark
>> itself, and half the scripts already specify python3.
>>
>> On Fri, Jul 17, 2020 at 12:36 PM Oli McCormack  wrote:
>> >
>> > [Warning: not spark+python specific information]
>> >
>> > It's recommended that you should explicitly call out python3 in a case
>> like this (see PEP-0394, and SO). Your environment is typical: python is
>> often a pointer to python2 for tooling compatibility reasons (other tools
>> or scripts that expect they're going to get python2 when they call python),
>> and you should use python3 to use the new version. What python points to
>> will change over time, so it's recommended to use python2 if explicitly
>> depending on that.
>> >
>> > More generally: It's common/recommended to use a virtual environment +
>> explicitly stated versions of Python and dependencies, rather than system
>> Python, so that python means exactly what you intend it to. I know very
>> little about the Spark python dev stack and how challenging it may be to do
>> this, so please take this with a dose of naiveté.
>> >
>> > - Oli
>> >
>> >
>> > On Fri, Jul 17, 2020 at 9:58 AM Sean Owen  wrote:
>> >>
>> >> So, we are on Python 3 entirely now right?
>> >> It might be just my local Mac env, but "/usr/bin/env python" uses
>> >> Python 2 on my mac.
>> >> Some scripts write "/usr/bin/env python3" now. Should that be the case
>> >> in all scripts?
>> >> Right now the merge script doesn't work for me b/c it was just updated
>> >> to be Python 3 only.
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] -1s and commits

2020-07-16 Thread Jungtaek Lim
On Fri, Jul 17, 2020 at 8:06 AM Holden Karau  wrote:

>
>
> On Thu, Jul 16, 2020 at 3:34 PM Jungtaek Lim 
> wrote:
>
>> I agree with Wenchen that there are different topics.
>>
> I agree. I mentioned it in my postscript because I wanted to provide the
> context around why I felt extra strongly about the importance of folks
> following the explicit rules. It makes it feel like there is a double
> standard, which is something I am not ok with. We've lost good community
> participants over the appearance of a double standard, and I don’t think it
> has to be intentional for us to lose people over this.
>
>>
>> The policy of veto is obvious, as ASF doc describes it with explicitly
>> saying non-overridable per project. In any way, the approach of resolving
>> the situation should lead to voters withdrawing their vetoes. There's
>> nothing to interpret differently, though I tend to agree with Sean's
>> perspective, the one who vetoed the PR should indicate themselves as a
>> blocker and follow up the PR actively on resolution of the veto (higher
>> priority). Absence on a couple of days shouldn't matter, but what if it's
>> going to be a couple of weeks, or even a couple of months, especially
>> without any prior notices?
>>
>> "The unanimous opinion from the different folks I talked with is that
>> any technical disagreement should be considered before moving on."
>>
>> This is not just for dealing with -1, but just for all of the reviews. +1
>> vote from any of committer without -1 vote from any qualified people (I
>> guess it tends to be also interpreted as committer as well) is just a
>> technical perspective. In practice, consensus should be made between
>> reviewers to produce final approval on the PR. If there're non-trivial
>> unresolved comments, merger has to stop and try to resolve these comments
>> either letting the author to deal with them, or confirming from reviewers
>> these comments are OK to defer/skip/take back. Obviously disagreements
>> between reviewers must be sorted out.
>>
> So I believe, for example, if there is a new community member who places a
> -1 they may not understand the significance of it.  In those cases I always
> seek to explain first, but I believe it's important to note that there is
> some bar for what qualified is in terms of veto.
>
> It seems like we can agree that committers in our project are well above
> the bar of a qualified voter for any code related discussion, and
> certainly, there could be other community members who are qualified in
> certain areas to veto as well.
>
>>
>> PRs with someone asking to wait for certain days - how long the PR has
>> been blocked by? That request may be the sign that they are voluntarily
>> finding their time to review in detail, so it's not a bad thing. I think
>> the real concern is that some days of silence in PR may lead to lose focus
>> and let the PR be stale; but just a couple of days before certain days you
>> can kindly remind, and after certain days you can go ahead, and they would
>> go with post-hoc reviews if they have something to bring up. I don't see
>> any issue here.
>>
> We're talking about PRs open for more than a month and PRs that have been
> open for almost a quarter. It's fine to ask for time to review, but I
> believe there is a reasonable expectation that if someone cares about an
> area of review they should be more active than taking a look once a month.
>

I believe I am one of the people in the Spark community who has struggled
most to get their PRs reviewed over the last two years. (Many of my PRs took
more than a year to get reviewed and merged, and I still have a bunch of PRs
that have been open for over a year, which I rebase and re-test
periodically.) I feel that frustration keenly. One thing I want to make
clear is that by "certain days" I don't mean months - I mean finding time in
the next week or so.


> This reminds me of the old way Spark said no to PRs by just ignoring them
> which I think we agree is not the way we want to handle PRs anymore.
>

I agree that showing disagreement might be better than ignoring, since at
least it moves things forward, but it's not that comfortable to voice
disagreement and open a debate. It's not easy to get everyone to do that
instead of simply ignoring. Not an easy problem.


>
>> On Thu, Jul 16, 2020 at 4:29 PM Wenchen Fan  wrote:
>>
>>> It looks like there are two topics:
>>> 1. PRs with -1
>>> 2. PRs with someone asking to wait for certain days.
>>>
>>> Holden, it seems you are hitting 2? I think 2 can be problematic if
>>> there are people who keep asking to wait, and block the PR indefinitely.

Re: [DISCUSS] -1s and commits

2020-07-16 Thread Jungtaek Lim
I agree with Wenchen that there are different topics.

The policy on vetoes is clear, as the ASF doc describes it while explicitly
stating that vetoes are non-overridable per project. Either way, resolving
the situation should lead to the voters withdrawing their vetoes. There's
nothing to interpret differently, though I tend to agree with Sean's
perspective: the one who vetoed the PR should treat themselves as the
blocker and actively follow up on the PR to get the veto resolved (as a
higher priority). An absence of a couple of days shouldn't matter, but what
if it's going to be a couple of weeks, or even a couple of months,
especially without any prior notice?

"The unanimous opinion from the different folks I talked with is that any
technical disagreement should be considered before moving on."

This applies not just to dealing with a -1, but to all reviews. A +1 vote
from any committer, without a -1 from any qualified person (which I guess
also tends to be interpreted as a committer), is just one technical
perspective. In practice, consensus should be reached among reviewers to
produce the final approval on the PR. If there are non-trivial unresolved
comments, the merger has to stop and try to resolve them, either by letting
the author deal with them or by confirming with the reviewers that those
comments are OK to defer, skip, or withdraw. Obviously, disagreements
between reviewers must be sorted out.

As for PRs where someone asks to wait for certain days - how long has the
PR been blocked? That request may be a sign that they are voluntarily
finding time to review in detail, so it's not a bad thing. I think the real
concern is that some days of silence on a PR may lead to losing focus and
letting the PR go stale; but you can kindly remind them a couple of days
before the deadline, go ahead after it passes, and they can follow up with
a post-hoc review if they have something to bring up. I don't see any issue
here.

On Thu, Jul 16, 2020 at 4:29 PM Wenchen Fan  wrote:

> It looks like there are two topics:
> 1. PRs with -1
> 2. PRs with someone asking to wait for certain days.
>
> Holden, it seems you are hitting 2? I think 2 can be problematic if there
> are people who keep asking to wait, and block the PR indefinitely. But if
> it's only asked once, this seems OK. BTW, since it's not a -1, so
> policy-wise we can still merge the PR if we think the PR already gets
> sufficient review.
>
> On Thu, Jul 16, 2020 at 4:33 AM Sean Owen  wrote:
>
>> I agree with all that, and would be surprised if anyone here objects
>> to any of that in principle. In practice, I'm sure it doesn't end up
>> that way sometimes, even in good faith. That is, I would not be
>> surprised if the parties involved don't even see the disconnect.
>>
>> What are the specific examples? for private@ if necessary, but, I
>> think dev@ could be fine because we're discussing general patterns
>> from specific examples, for everyone to learn from. This isn't
>> necessarily about individuals. (Heck, maybe I've gotten it wrong
>> myself)
>>
>> One general principle I'd add: we are probably getting more
>> conservative about big changes over time as Spark enters the long
>> plateau of maturity. See the discussion about breaking changes in 3.0,
>> or comments about waiting for review. That argues even more against
>> proceeding against raised issues.
>>
>> On the flip side, we have to be constructive. I like the idea of
>> proposing alternatives. Can you achieve this goal by doing Y instead
>> of X?  I also think there's a burden on the objector to provide a
>> rationale, certainly, but also drive a resolution. That could also
>> mean standing firm on the objection but calling in other reviewers and
>> being willing to accede to a majority. Put another way: someone who
>> objects and never really follows up with a path to consensus about
>> compromise or rejection isn't really objecting correctly. We can VOTE
>> if needed, but, if someone objected and didn't follow up and I
>> couldn't find anyone else backing it up and thought I'd addressed the
>> objection, I'd consider it resolved and proceed.
>>
>>
>>
>> On Wed, Jul 15, 2020 at 3:18 PM Holden Karau 
>> wrote:
>> >
>> > Hi Spark Development Community,
>> >
>> >
>> > Since Spark 3 has shipped I've been going through some of the PRs and
>> I've noticed some PRs have been merged with pending -1s with technical
>> reasons, including those from committers. I'm bringing this up because I
>> believe we, as PMC, committers and contributors, do not currently have a
>> consistent understanding, and I believe we should develop one. The
>> foundation level guidance is at
>> https://www.apache.org/foundation/voting.html#votes-on-code-modification.
>> >
>> >
>> >
>> > It is my belief that we should not be merging code with -1s from active
>> committers, -1 is a very strong signal and generally a sign that we should
>> step back and try and build consensus around the change. Looking at how the
>> httpd project (e.g. the 

Re: Welcoming some new Apache Spark committers

2020-07-15 Thread Jungtaek Lim
Thanks everyone for the warm welcome and all the kind words! Looking
forward to collaborate with you all.

On Wed, Jul 15, 2020 at 4:17 PM Herman van Hovell 
wrote:

> Congratulations!
>
> On Wed, Jul 15, 2020 at 9:00 AM angers.zhu  wrote:
>
>> Congratulations !
>>
>> angers.zhu
>> angers@gmail.com
>>
>>
>> On 07/15/2020 14:53,Wenchen Fan
>>  wrote:
>>
>> Congrats and welcome!
>>
>> On Wed, Jul 15, 2020 at 2:18 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Congratulations !
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Tue, Jul 14, 2020 at 12:37 PM Matei Zaharia 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> The Spark PMC recently voted to add several new committers. Please join
>>>> me in welcoming them to their new roles! The new committers are:
>>>>
>>>> - Huaxin Gao
>>>> - Jungtaek Lim
>>>> - Dilip Biswal
>>>>
>>>> All three of them contributed to Spark 3.0 and we’re excited to have
>>>> them join the project.
>>>>
>>>> Matei and the Spark PMC
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>


Re: [DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-07-12 Thread Jungtaek Lim
Just submitted the patch: https://github.com/apache/spark/pull/29077

On Tue, Jun 16, 2020 at 3:40 PM Jungtaek Lim 
wrote:

> Bump this again. I filed SPARK-31985 [1] and plan to submit a PR in a
> couple of days if there's no voice on the reason we should keep it.
>
> 1. https://issues.apache.org/jira/browse/SPARK-31985
>
> On Thu, May 21, 2020 at 8:54 AM Jungtaek Lim 
> wrote:
>
>> Let me share the effect on removing the incomplete and undocumented code
>> path. I manually tried out removing the code path and here's the change.
>>
>>
>> https://github.com/HeartSaVioR/spark/commit/aa53e9b1b33c0b8aec37704ad290b42ffb2962d8
>>
>> 1,120 lines deleted without hurting any existing streaming tests, except
>> a suite I removed as well since it only tests that code path. There is no
>> documentation to update, as the feature was never publicized. This also
>> removes some rules which apply only to continuous mode and which threw
>> exceptions stating that "continuous mode is only available for map-like
>> operations"; removing them would, in turn, simplify the usage of
>> continuous mode.
>>
>> Also worth noting that I had to remove the code path manually instead of
>> reverting, because the code path had been changed to reflect the DSv2
>> changes. What does this mean? We have to keep "updating" the code path and
>> worrying about compatibility, etc., while it is never used in production.
>>
>> On Tue, May 19, 2020 at 1:14 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi devs,
>>>
>>> during experiment on complete mode I realized we left some incomplete
>>> code parts on supporting aggregation for continuous mode. (shuffle &
>>> coalesce)
>>>
>>> The work had been occurred around first half of 2018 and stopped, and no
>>> work has been done for around 2 years (so I don't expect anyone is working
>>> on this). The functionality is undocumented (as the work was only done
>>> partially) and continuous mode is experimental, so I don't see much risk
>>> in getting rid of that part.
>>>
>>> What do you think? If it makes sense then I'll raise a PR to get rid of
>>> the incomplete codes.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> ps. I kind of feel the same about continuous mode itself - it has
>>> remained experimental for over two years, and while it lacks lots of
>>> essential things (backpressure, epoch synchronization, query progress,
>>> etc.), there has been no additional major work for over a year. We have
>>> ended up excluding continuous mode from new streaming features like
>>> observable metrics, the streaming UI, etc., because they are built on top
>>> of functionality that continuous mode lacks.
>>>
>>> Unlike the incomplete code path, I'm not strongly against keeping this,
>>> as it has been documented and someone might use the feature in
>>> production. I still think it's ideal to retire the feature smoothly.
>>> (Please chime in with your use case if you run it in production.)
>>>
>>> ps2. It feels like "feature branch" looks to be a thing to consider for
>>> efforts on one big feature.
>>>
>>


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-09 Thread Jungtaek Lim
As a side note, I've raised patches for addressing two frequent flaky
tests, CliSuite [1] and HiveSessionImplSuite [2]. Hope this helps to
mitigate the situation.

1. https://github.com/apache/spark/pull/29036
2. https://github.com/apache/spark/pull/29039

On Thu, Jul 9, 2020 at 11:51 AM Hyukjin Kwon  wrote:

> Thanks Shane!
>
> BTW, it's getting serious .. e.g)
> https://github.com/apache/spark/pull/28969.
> The tests could not pass in 7 days .. Hopefully restarting the machines
> will make the current situation better :-)
>
> Separately, I am working on a PR to run the Spark tests in Github Actions.
> We could hopefully use Github Actions and Jenkins together meanwhile.
>
>
> On Thu, Jul 9, 2020 at 1:07 AM, shane knapp ☠ wrote:
>
>> this will be happening tomorrow...  today is Meeting Hell Day[tm].
>>
>> On Tue, Jul 7, 2020 at 1:59 PM shane knapp ☠  wrote:
>>
>>> i wasn't able to get to it today, so i'm hoping to squeeze in a quick
>>> trip to the colo tomorrow morning.  if not, then first thing thursday.
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: m2 cache issues in Jenkins?

2020-07-06 Thread Jungtaek Lim
Just encountered the same and it's worker-05 again. (You can find [error]
in the console to see what's the problem. I guess jetty artifacts in the
worker might be messed up.)

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125127/consoleFull


On Tue, Jul 7, 2020 at 5:35 AM Jungtaek Lim 
wrote:

> Could this be a flaky or persistent issue? It failed with Scala gendoc but
> it didn't fail with the part the PR modified. It ran from worker-05.
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125121/consoleFull
>
> On Tue, Jul 7, 2020 at 2:10 AM shane knapp ☠  wrote:
>
>> i killed and retriggered the PRB jobs on 04, and wiped that workers' m2
>> cache.
>>
>> On Mon, Jul 6, 2020 at 9:24 AM shane knapp ☠  wrote:
>>
>>> once the jobs running on that worker are finished, yes.
>>>
>>> On Sun, Jul 5, 2020 at 7:41 PM Hyukjin Kwon  wrote:
>>>
>>>> Shane, can we remove .m2 in worker machine 4?
>>>>
>>>> On Fri, Jul 3, 2020 at 8:18 AM, Jungtaek Lim wrote:
>>>>
>>>>> Looks like Jenkins service itself becomes unstable. It took
>>>>> considerable time to just open the test report for a specific build, and
>>>>> Jenkins doesn't pick the request on rebuild (retest this, please) in 
>>>>> Github
>>>>> comment.
>>>>>
>>>>> On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Ah, okay. Actually there already is -
>>>>>> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening.
>>>>>>
>>>>>> On Thu, Jul 2, 2020 at 2:06 PM, Holden Karau wrote:
>>>>>>
>>>>>>> We don't I didn't file one originally, but Shane reminded me to in
>>>>>>> the future.
>>>>>>>
>>>>>>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Nope, do we have an existing ticket? I think we can reopen if there
>>>>>>>> is.
>>>>>>>>
>>>>>>>> On Thu, Jul 2, 2020 at 1:43 PM, Holden Karau wrote:
>>>>>>>>
>>>>>>>>> Huh interesting that it’s the same worker. Have you filed a ticket
>>>>>>>>> to Shane?
>>>>>>>>>
>>>>>>>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 25, 2020 at 3:15 AM, shane knapp ☠ wrote:
>>>>>>>>>>
>>>>>>>>>>> done:
>>>>>>>>>>> -bash-4.1$ cd .m2
>>>>>>>>>>> -bash-4.1$ ls
>>>>>>>>>>> repository
>>>>>>>>>>> -bash-4.1$ time rm -rf *
>>>>>>>>>>>
>>>>>>>>>>> real17m4.607s
>>>>>>>>>>> user0m0.950s
>>>>>>>>>>> sys 0m18.816s
>>>>>>>>>>> -bash-4.1$
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ <
>>>>>>>>>>> skn...@berkeley.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> ok, i've taken that worker offline and once the job running on
>>>>>>>>>>>> it finishes, i'll wipe the cache.
>>>>>>>>>>>>
>>>>>>>>>>>> in the future, please file a JIRA and assign it to me so i
>>>>>>>>>>>> don't have to track my work through emails to the dev@ list.
>>>>>>>>>>>> ;)
>>>>>>>>>>>>
>>>>>>>>>>>> thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> shane
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau <
>>>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>>>

Re: m2 cache issues in Jenkins?

2020-07-06 Thread Jungtaek Lim
Could this be a flaky or persistent issue? It failed with Scala gendoc but
it didn't fail with the part the PR modified. It ran from worker-05.

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125121/consoleFull

On Tue, Jul 7, 2020 at 2:10 AM shane knapp ☠  wrote:

> i killed and retriggered the PRB jobs on 04, and wiped that workers' m2
> cache.
>
> On Mon, Jul 6, 2020 at 9:24 AM shane knapp ☠  wrote:
>
>> once the jobs running on that worker are finished, yes.
>>
>> On Sun, Jul 5, 2020 at 7:41 PM Hyukjin Kwon  wrote:
>>
>>> Shane, can we remove .m2 in worker machine 4?
>>>
>>> On Fri, Jul 3, 2020 at 8:18 AM, Jungtaek Lim wrote:
>>>
>>>> Looks like Jenkins service itself becomes unstable. It took
>>>> considerable time to just open the test report for a specific build, and
>>>> Jenkins doesn't pick the request on rebuild (retest this, please) in Github
>>>> comment.
>>>>
>>>> On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Ah, okay. Actually there already is -
>>>>> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening.
>>>>>
>>>>> On Thu, Jul 2, 2020 at 2:06 PM, Holden Karau wrote:
>>>>>
>>>>>> We don't I didn't file one originally, but Shane reminded me to in
>>>>>> the future.
>>>>>>
>>>>>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Nope, do we have an existing ticket? I think we can reopen if there
>>>>>>> is.
>>>>>>>
>>>>>>> On Thu, Jul 2, 2020 at 1:43 PM, Holden Karau wrote:
>>>>>>>
>>>>>>>> Huh interesting that it’s the same worker. Have you filed a ticket
>>>>>>>> to Shane?
>>>>>>>>
>>>>>>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 25, 2020 at 3:15 AM, shane knapp ☠ wrote:
>>>>>>>>>
>>>>>>>>>> done:
>>>>>>>>>> -bash-4.1$ cd .m2
>>>>>>>>>> -bash-4.1$ ls
>>>>>>>>>> repository
>>>>>>>>>> -bash-4.1$ time rm -rf *
>>>>>>>>>>
>>>>>>>>>> real17m4.607s
>>>>>>>>>> user0m0.950s
>>>>>>>>>> sys 0m18.816s
>>>>>>>>>> -bash-4.1$
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ <
>>>>>>>>>> skn...@berkeley.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> ok, i've taken that worker offline and once the job running on
>>>>>>>>>>> it finishes, i'll wipe the cache.
>>>>>>>>>>>
>>>>>>>>>>> in the future, please file a JIRA and assign it to me so i don't
>>>>>>>>>>> have to track my work through emails to the dev@ list.  ;)
>>>>>>>>>>>
>>>>>>>>>>> thanks!
>>>>>>>>>>>
>>>>>>>>>>> shane
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau <
>>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The most recent one I noticed was
>>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>>>>>>>>>>>>  which
>>>>>>>>>>>> was run on  amp-jenkins-worker-04.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ <
>>>>>>>>>>>> skn...@berkeley.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> for those weird failures, it's super h

Re: m2 cache issues in Jenkins?

2020-07-02 Thread Jungtaek Lim
Looks like Jenkins service itself becomes unstable. It took considerable
time to just open the test report for a specific build, and Jenkins doesn't
pick the request on rebuild (retest this, please) in Github comment.

On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon  wrote:

> Ah, okay. Actually there already is -
> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening.
>
> On Thu, Jul 2, 2020 at 2:06 PM, Holden Karau wrote:
>
>> We don't I didn't file one originally, but Shane reminded me to in the
>> future.
>>
>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon  wrote:
>>
>>> Nope, do we have an existing ticket? I think we can reopen if there is.
>>>
>>> On Thu, Jul 2, 2020 at 1:43 PM, Holden Karau wrote:
>>>
 Huh interesting that it’s the same worker. Have you filed a ticket to
 Shane?

 On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon 
 wrote:

> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>
> On Thu, Jun 25, 2020 at 3:15 AM, shane knapp ☠ wrote:
>
>> done:
>> -bash-4.1$ cd .m2
>> -bash-4.1$ ls
>> repository
>> -bash-4.1$ time rm -rf *
>>
>> real17m4.607s
>> user0m0.950s
>> sys 0m18.816s
>> -bash-4.1$
>>
>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ 
>> wrote:
>>
>>> ok, i've taken that worker offline and once the job running on it
>>> finishes, i'll wipe the cache.
>>>
>>> in the future, please file a JIRA and assign it to me so i don't
>>> have to track my work through emails to the dev@ list.  ;)
>>>
>>> thanks!
>>>
>>> shane
>>>
>>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
>>> wrote:
>>>
 The most recent one I noticed was
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
  which
 was run on  amp-jenkins-worker-04.

 On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
 wrote:

> for those weird failures, it's super helpful to provide which
> workers are showing these issues.  :)
>
> i'd rather not wipe all of the m2 caches on all of the workers, as
> we'll then potentially get blacklisted again if we download too many
> packages from apache.org.
>
> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
> wrote:
>
>> Hi Folks,
>>
>> I've been see some weird failures on Jenkins and it looks like it
>> might be from the m2 cache. Would it be OK to clean it out? Or is it
>> important?
>>
>> Cheers,
>>
>> Holden
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
> --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-01 Thread Jungtaek Lim
https://issues.apache.org/jira/browse/SPARK-32148 was reported yesterday,
and if the report is valid it looks to be a blocker. I'll try to take a
look at it soon.

On Thu, Jul 2, 2020 at 12:48 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Thanks Holden -- it would be great to also get 2.4.7 started
>
> Thanks
> Shivaram
>
> On Tue, Jun 30, 2020 at 10:31 PM Holden Karau 
> wrote:
> >
> > I can take care of 2.4.7 unless someone else wants to do it.
> >
> > On Tue, Jun 30, 2020 at 8:29 PM Jason Moore 
> wrote:
> >>
> >> Hi all,
> >>
> >>
> >>
> >> Could I get some input on the severity of this one that I found
> yesterday?  If that’s a correctness issue, should it block this patch?  Let
> me know under the ticket if there’s more info that I can provide to help.
> >>
> >>
> >>
> >> https://issues.apache.org/jira/browse/SPARK-32136
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Jason.
> >>
> >>
> >>
> >> From: Jungtaek Lim 
> >> Date: Wednesday, 1 July 2020 at 10:20 am
> >> To: Shivaram Venkataraman 
> >> Cc: Prashant Sharma , 郑瑞峰 ,
> Gengliang Wang , gurwls223 <
> gurwls...@gmail.com>, Dongjoon Hyun , Jules
> Damji , Holden Karau , Reynold
> Xin , Yuanjian Li , "
> dev@spark.apache.org" , Takeshi Yamamuro <
> linguin@gmail.com>
> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
> >>
> >>
> >>
> >> SPARK-32130 [1] looks to be a performance regression introduced in
> Spark 3.0.0, which is ideal to look into before releasing another bugfix
> version.
> >>
> >>
> >>
> >> 1. https://issues.apache.org/jira/browse/SPARK-32130
> >>
> >>
> >>
> >> On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
> >>
> >> Hi all
> >>
> >>
> >>
> >> I just wanted to ping this thread to see if all the outstanding
> blockers for 3.0.1 have been fixed. If so, it would be great if we can get
> the release going. The CRAN team sent us a note that the version SparkR
> available on CRAN for the current R version (4.0.2) is broken and hence we
> need to update the package soon --  it will be great to do it with 3.0.1.
> >>
> >>
> >>
> >> Thanks
> >>
> >> Shivaram
> >>
> >>
> >>
> >> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma 
> wrote:
> >>
> >> +1 for 3.0.1 release.
> >>
> >> I too can help out as release manager.
> >>
> >>
> >>
> >> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
> >>
> >> I volunteer to be a release manager of 3.0.1, if nobody is working on
> this.
> >>
> >>
> >>
> >>
> >>
> >> -- Original Message --
> >>
> >> From: "Gengliang Wang";
> >>
> >> Sent: Wednesday, June 24, 2020, 4:15 PM
> >>
> >> To: "Hyukjin Kwon";
> >>
> >> Cc: "Dongjoon Hyun";"Jungtaek Lim"<
> kabhwan.opensou...@gmail.com>;"Jules Damji";"Holden
> Karau";"Reynold Xin";"Shivaram
> Venkataraman";"Yuanjian Li"<
> xyliyuanj...@gmail.com>;"Spark dev list";"Takeshi
> Yamamuro";
> >>
> >> Subject: Re: [DISCUSS] Apache Spark 3.0.1 Release
> >>
> >>
> >>
> >> +1, the issues mentioned are really serious.
> >>
> >>
> >>
> >> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon 
> wrote:
> >>
> >> +1.
> >>
> >> Just as a note,
> >> - SPARK-31918 is fixed now, and there's no blocker. - When we build
> SparkR, we should use the latest R version at least 4.0.0+.
> >>
> >>
> >>
> >> On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:
> >>
> >> +1
> >>
> >>
> >>
> >> Bests,
> >>
> >> Dongjoon.
> >>
> >>
> >>
> >> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
> >>
> >> +1 on a 3.0.1 soon.
> >>
> >>
> >>
> >> Probably it would be nice if some Scala experts can take a look at
> https://issues.apache.org/jira/browse/SPARK-32051 and include the fix
> into 3.0.1 if possible.
> >>
> >

Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-30 Thread Jungtaek Lim
SPARK-32130 [1] looks to be a performance regression introduced in Spark
3.0.0, which is ideal to look into before releasing another bugfix version.

1. https://issues.apache.org/jira/browse/SPARK-32130

On Wed, Jul 1, 2020 at 7:05 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Hi all
>
> I just wanted to ping this thread to see if all the outstanding blockers
> for 3.0.1 have been fixed. If so, it would be great if we can get the
> release going. The CRAN team sent us a note that the version SparkR
> available on CRAN for the current R version (4.0.2) is broken and hence we
> need to update the package soon --  it will be great to do it with 3.0.1.
>
> Thanks
> Shivaram
>
> On Wed, Jun 24, 2020 at 8:31 PM Prashant Sharma 
> wrote:
>
>> +1 for 3.0.1 release.
>> I too can help out as release manager.
>>
>> On Thu, Jun 25, 2020 at 4:58 AM 郑瑞峰  wrote:
>>
>>> I volunteer to be a release manager of 3.0.1, if nobody is working on
>>> this.
>>>
>>>
>>> -- Original Message --
>>> *From:* "Gengliang Wang";
>>> *Sent:* Wednesday, June 24, 2020, 4:15 PM
>>> *To:* "Hyukjin Kwon";
>>> *Cc:* "Dongjoon Hyun";"Jungtaek Lim"<
>>> kabhwan.opensou...@gmail.com>;"Jules Damji";"Holden
>>> Karau";"Reynold Xin";"Shivaram
>>> Venkataraman";"Yuanjian Li"<
>>> xyliyuanj...@gmail.com>;"Spark dev list";"Takeshi
>>> Yamamuro";
>>> *Subject:* Re: [DISCUSS] Apache Spark 3.0.1 Release
>>>
>>> +1, the issues mentioned are really serious.
>>>
>>> On Tue, Jun 23, 2020 at 7:56 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> +1.
>>>>
>>>> Just as a note,
>>>> - SPARK-31918 <https://issues.apache.org/jira/browse/SPARK-31918> is
>>>> fixed now, and there's no blocker. - When we build SparkR, we should use
>>>> the latest R version at least 4.0.0+.
>>>>
>>>> On Wed, Jun 24, 2020 at 11:20 AM, Dongjoon Hyun wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Tue, Jun 23, 2020 at 1:19 PM Jungtaek Lim <
>>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> +1 on a 3.0.1 soon.
>>>>>>
>>>>>> Probably it would be nice if some Scala experts can take a look at
>>>>>> https://issues.apache.org/jira/browse/SPARK-32051 and include the
>>>>>> fix into 3.0.1 if possible.
>>>>>> Looks like APIs designed to work with Scala 2.11 & Java bring
>>>>>> ambiguity in Scala 2.12 & Java.
>>>>>>
>>>>>> On Wed, Jun 24, 2020 at 4:52 AM Jules Damji 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>> Pardon the dumb thumb typos :)
>>>>>>>
>>>>>>> On Jun 23, 2020, at 11:36 AM, Holden Karau 
>>>>>>> wrote:
>>>>>>>
>>>>>>> 
>>>>>>> +1 on a patch release soon
>>>>>>>
>>>>>>> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1 on doing a new patch release soon. I saw some of these issues
>>>>>>>> when preparing the 3.0 release, and some of them are very serious.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>>>>>>>> shiva...@eecs.berkeley.edu> wrote:
>>>>>>>>
>>>>>>>>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1
>>>>>>>>> release soon.
>>>>>>>>>
>>>>>>>>> Shivaram
>>>>>>>>>
>>>>>>>>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro <
>>>>>>>>> linguin@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for the heads-up, Yuanjian!
>>>>>>>>>
>>>>>>>>> I also noticed branch-3.0 already has 39 commits after Spark
>>>>>>>

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Jungtaek Lim
Does this count only "new features" (probably major ones), or also
"improvements"? I'm aware of a couple of improvements which should ideally
be included in the next release, but if this counts only major new features
then I don't feel they should be listed.

On Tue, Jun 30, 2020 at 1:32 AM Holden Karau  wrote:

> Should we also consider the shuffle service refactoring to support
> pluggable storage engines as targeting the 3.1 release?
>
> On Mon, Jun 29, 2020 at 9:31 AM Maxim Gekk 
> wrote:
>
>> Hi Dongjoon,
>>
>> I would add:
>> - Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
>> - Filters pushdown to other datasources like Avro
>> - Support nested attributes of filters pushed down to JSON
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> After a short celebration of Apache Spark 3.0, I'd like to ask you the
>>> community opinion on Apache Spark 3.1 feature expectations.
>>>
>>> First of all, Apache Spark 3.1 is scheduled for December 2020.
>>> - https://spark.apache.org/versioning-policy.html
>>>
>>> I'm expecting the following items:
>>>
>>> 1. Support Scala 2.13
>>> 2. Use Apache Hadoop 3.2 by default for better cloud support
>>> 3. Declaring Kubernetes Scheduler GA
>>> In my perspective, the last main missing piece was Dynamic
>>> allocation and
>>> - Dynamic allocation with shuffle tracking is already shipped at 3.0.
>>> - Dynamic allocation with worker decommission/data migration is
>>> targeting 3.1. (Thanks, Holden)
>>> 4. DSv2 Stabilization
>>>
>>> I'm aware of some more features which are on the way currently, but I
>>> love to hear the opinions from the main developers and more over the main
>>> users who need those features.
>>>
>>> Thank you in advance. Welcome for any comments.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Inconsistent schema on Encoders.bean (reported issues from user@)

2020-06-26 Thread Jungtaek Lim
I just revisited the issue and realized it is resolved in Spark 3.0.0.
ExpressionEncoder was refactored in Spark 3.0.0 and its separate schema
field was removed as part of that refactor; that field seems to be the root
cause, since the schema and the serializer's data types didn't match in
such cases. ExpressionEncoder in Spark 3.0.0 derives the schema from the
serializer, which removes the problem.

The remaining question is: would we like to fix it in 2.4.x?
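
As a minimal sketch of the getter/setter distinction discussed in this thread
(class and field names below are illustrative, not taken from the original
report): a conventional read-write bean works with Encoders.bean, whereas a
getter-only property is the problematic "read-only" case.

    import org.apache.spark.sql.{Encoders, SparkSession}
    import scala.collection.JavaConverters._

    // Conventional read-write bean: has both a getter and a setter.
    class ReadWriteBean {
      private var value: Int = 0
      def getValue: Int = value
      def setValue(v: Int): Unit = { value = v }
      // A "read-only" property would declare only a getter with no matching
      // setter - the case this thread is about.
    }

    object BeanEncoderSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("bean-encoder-sketch").getOrCreate()

        val bean = new ReadWriteBean
        bean.setValue(42)

        val ds = spark.createDataset(Seq(bean).asJava)(Encoders.bean(classOf[ReadWriteBean]))
        ds.printSchema() // expected: a single "value" column
      }
    }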

On Tue, May 26, 2020 at 2:54 PM Jungtaek Lim 
wrote:

> I meant that how Java Beans are interpreted in Spark is not consistently
> defined.
>
> Unlike what you've guessed, in most paths Spark uses "read-only"
> properties. (All the failed existing tests in my experiment have
> "read-only" properties.) The problematic case is when a Java bean is used
> for both read and write; one example is using a Java bean as the data type
> of "state" in Structured Streaming, where Spark converts rows to Java
> beans and vice versa.
>
> On Sun, May 24, 2020 at 11:01 PM Sean Owen  wrote:
>
>> Java Beans are well-defined; it's valid to have a getter- or
>> setter-only property. That doesn't mean Spark can meaningfully use
>> such a property, as it typically has to both read and write them. I
>> guess it depends on context. For example, I don't see how you can have
>> a deserializer without setters, or a serializer without getters.
>>
>> case classes do have accessor (and if applicable mutator) methods
>> generated automatically but they do not follow bean conventions.
>> ("foo" gets a "foo" method, not "getFoo")
>>
>> I haven't read this in detail but it seems like most of the issue you
>> are seeing is that it's not checking the property names, just using
>> ordering, in your reproducer. That seems different?
>>
>> On Sun, May 24, 2020 at 3:00 AM Jungtaek Lim
>>  wrote:
>> >
>> > OK I just went through the change, and the change breaks bunch of
>> existing UTs.
>> >
>> > https://github.com/apache/spark/pull/28611
>> >
>> > Note that I modified all the cases where Spark extracts the columns for
>> "read method" only properties to both "read" & "write". It doesn't only
>> change the code path of Encoders.bean, but also change the code path of
>> createDataFrame from Java bean, including case class in Java language
>> (Scala-Java Interop). Case class doesn't have explicit setter & getter
>> methods.
>> >
>> > Personally I'm not in favor of the uncertainly of definition of Java
>> bean in Spark (explained nowhere), but also not sure we are OK with the
>> breaking changes. We might be able to reduce the breaking changes by
>> allowing the difference between createDataFrame (leave as it is) and
>> Encoders.bean (require read & write methods), but it is still a breaking
>> change and the difference would be confusing if we don't explain it enough.
>> >
>> > Any thoughts?
>> >
>> >
>> > On Mon, May 11, 2020 at 1:36 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>
>> >> First case is not tied to the batch / streaming as Encoders.bean
>> simply fails when inferring schema.
>> >>
>> >> Second case is tied to the streaming, and I've described the reason in
>> the last reply. I'm not sure we don't have similar case for batch though.
>> (If there're some operators only relying on the sequence of the columns
>> while matching row with schema, then it could be affected.)
>> >>
>> >> On Mon, May 11, 2020 at 1:24 PM Wenchen Fan 
>> wrote:
>> >>>
>> >>> is it a problem only for streaming or it affects batch queries as
>> well?
>> >>>
>> >>> On Fri, May 8, 2020 at 11:42 PM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >>>>
>> >>>> The first case of user report is obvious - according to the user
>> report, AVRO generated code contains getter which denotes to itself hence
>> Spark disallows (throws exception), but it doesn't have matching setter
>> method (if I understand correctly) so technically it shouldn't matter.
>> >>>>
>> >>>> For the second case of user report, I've reproduced with my own
>> code. Please refer the gist code:
>> https://gist.github.com/HeartSaVioR/fab85734b5be85198c48f45004c8e0ca
>> >>>>
>> >>>> This code aggregates the max value of the values in key where the
>> key is in the range of (0 ~ 9).
>> >>>>
>> >>>> We're

Re: Handling user-facing metadata issues on file stream source & sink

2020-06-25 Thread Jungtaek Lim
Bump, plus one more issue I fixed (and by chance there's a relevant report
on the user mailing list recently):

* [SPARK-30462][SS] Streamline the logic on file stream source and sink to
avoid memory issue [1]

The patch stabilizes the driver's memory usage when working with a huge
metadata log, which previously threw OOME.

1. https://github.com/apache/spark/pull/28904
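
For context, a hedged Scala sketch of the sink-side metadata log knobs that
interact with the growth being discussed - I believe these config names exist
as internal settings in recent Spark versions, but treat them as assumptions to
verify rather than documented API:

    import org.apache.spark.sql.SparkSession

    object FileSinkMetadataLogKnobs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("file-sink-metadata-knobs")
          // Compact the file sink metadata log every N batches (assumed default: 10).
          .config("spark.sql.streaming.fileSink.log.compactInterval", "10")
          // Allow expired (already-compacted) log files to be deleted ...
          .config("spark.sql.streaming.fileSink.log.deletion", "true")
          // ... after waiting this long, to avoid racing with readers.
          .config("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
          .getOrCreate()

        // Even with these knobs, every compact batch still rewrites all retained
        // entries, which is the cost the PRs referenced above aim to reduce.
        spark.stop()
      }
    }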

On Sun, Jun 14, 2020 at 4:14 PM Jungtaek Lim 
wrote:

> Bump again - hope to get some traction because these issues are either
> long-standing problems or noticeable improvements (each PR has numbers/UI
> graph to show the improvement).
>
> Fixed long-standing problems:
>
> * [SPARK-17604][SS] FileStreamSource: provide a new option to have
> retention on input files [1]
> * [SPARK-27188][SS] FileStreamSink: provide a new option to have retention
> on output files [2]
>
> There's no logic to control the size of metadata for file stream source &
> file stream sink, and it affects end users who run the streaming query with
> many input files / output files in the long run. Both are to resolve
> metadata growing incrementally over time. As the number of the issue
> represents for SPARK-17604 it's a fairly old problem. There're at least
> three relevant issues being reported on SPARK-27188.
>
> Improvements:
>
> * [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond
> maxFilesPerTrigger as unread files [3]
> * [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log
> twice if the query restarts from compact batch [4]
> * [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with
> LZ4 compression on FileStream(Source/Sink)Log [5]
>
> Above patches provide better performance on the condition described on
> each PR. Worth noting, SPARK-30946 provides pretty much better performance
> (~10x) on compaction per every compact batch, whereas it also reduces down
> the compact batch log file (~30% of current).
>
> 1. https://github.com/apache/spark/pull/28422
> 2. https://github.com/apache/spark/pull/28363
> 3. https://github.com/apache/spark/pull/27620
> 4. https://github.com/apache/spark/pull/27649
> 5. https://github.com/apache/spark/pull/27694
>
>
> On Fri, May 22, 2020 at 12:50 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Worth noting that I got similar question around local community as well.
>> These reporters didn't encounter the edge-case, they're encountered the
>> critical issue in the normal running of streaming query.
>>
>> On Fri, May 8, 2020 at 4:49 PM Jungtaek Lim 
>> wrote:
>>
>>> (bump to expose the discussion to more readers)
>>>
>>> On Mon, May 4, 2020 at 5:45 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> I'm seeing more and more structured streaming end users encountered the
>>>> metadata issues on file stream source and sink. They have been known-issues
>>>> and there're even long-standing JIRA issues reported before, end users
>>>> report them again in user@ mailing list in April.
>>>>
>>>> * Spark Structure Streaming | FileStreamSourceLog not deleting list of
>>>> input files | Spark -2.4.0 [1]
>>>> * [Structured Streaming] Checkpoint file compact file grows big [2]
>>>>
>>>> I've proposed various improvements on the area (see my PRs [3]) but
>>>> suffered on lack of interests/reviews. I feel the issue is critical
>>>> (under-estimated) because...
>>>>
>>>> 1. It's one of "built-in" data sources which is being maintained by
>>>> Spark community. (End users may judge the state of project/area on the
>>>> quality on the built-in data source, because that's the thing they would
>>>> start with.)
>>>> 2. It's the only built-in data source which provides "end-to-end
>>>> exactly-once" in structured streaming.
>>>>
>>>> I'd hope to see us address such issues so that end users can live with
>>>> built-in data source. (It may not need to be perfect, but at least be
>>>> reasonable on the long-run streaming workloads.) I know there're couple of
>>>> alternatives, but I don't think starter would start from there. End users
>>>> may just try to find alternatives - not alternative of data source, but
>>>> alternative of streaming processing framework.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> 1.
>>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>>> 2.
>>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>>> 3. https://github.com/apache/spark/pulls/HeartSaVioR
>>>>
>>>


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Jungtaek Lim
+1 on a 3.0.1 soon.

It would probably be nice if some Scala experts could take a look at
https://issues.apache.org/jira/browse/SPARK-32051 and include the fix in
3.0.1 if possible.
It looks like APIs designed to work with both Scala 2.11 and Java become
ambiguous under Scala 2.12 and Java.
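
As a hedged Scala sketch of the kind of ambiguity SPARK-32051 describes (my
reading of the ticket: DataStreamWriter.foreachBatch has both a Scala-function
overload and a Java VoidFunction2 overload, and with Scala 2.12's SAM
conversion a bare lambda can match both), one workaround is to bind the
argument to an explicit Scala function type:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ForeachBatchAmbiguitySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("foreachbatch-ambiguity").getOrCreate()
        val stream = spark.readStream.format("rate").load()

        // Binding to an explicit (DataFrame, Long) => Unit value leaves only the
        // Scala overload applicable, sidestepping the reported ambiguity.
        val handler: (DataFrame, Long) => Unit = (df, batchId) => {
          println(s"batch $batchId has ${df.count()} rows")
        }

        stream.writeStream
          .foreachBatch(handler)
          .start()
          .awaitTermination()

        // Potentially ambiguous under Scala 2.12 (kept as a comment, per the ticket):
        // stream.writeStream.foreachBatch((df, batchId) => println(df.count())).start()
      }
    }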

On Wed, Jun 24, 2020 at 4:52 AM Jules Damji  wrote:

> +1 (non-binding)
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jun 23, 2020, at 11:36 AM, Holden Karau  wrote:
>
> 
> +1 on a patch release soon
>
> On Tue, Jun 23, 2020 at 10:47 AM Reynold Xin  wrote:
>
>> +1 on doing a new patch release soon. I saw some of these issues when
>> preparing the 3.0 release, and some of them are very serious.
>>
>>
>> On Tue, Jun 23, 2020 at 8:06 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>> +1 Thanks Yuanjian -- I think it'll be great to have a 3.0.1 release
>>> soon.
>>>
>>> Shivaram
>>>
>>> On Tue, Jun 23, 2020 at 3:43 AM Takeshi Yamamuro 
>>> wrote:
>>>
>>> Thanks for the heads-up, Yuanjian!
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
>>>
>>> wow, the updates are so quick. Anyway, +1 for the release.
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li 
>>> wrote:
>>>
>>> Hi dev-list,
>>>
>>> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
>>> since 4 blocker issues were found after Spark 3.0.0:
>>>
>>> [SPARK-31990] The broken state store compatibility will cause a
>>> correctness issue when a streaming query with `dropDuplicates` uses a
>>> checkpoint written by an old Spark version.
>>>
>>> [SPARK-32038] The regression bug in handling NaN values in
>>> COUNT(DISTINCT)
>>>
>>> [SPARK-31918][WIP] CRAN requires the package to work with the latest R
>>> 4.0. This makes the 3.0 release unavailable on CRAN, and it only supports R
>>> [3.5, 4.0)
>>>
>>> [SPARK-31967] Downgrade vis.js to fix Jobs UI loading time regression
>>>
>>> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I
>>> think it would be great to have a Spark 3.0.1 release to deliver the critical
>>> fixes.
>>>
>>> Any comments are appreciated.
>>>
>>> Best,
>>>
>>> Yuanjian
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>>
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>


Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Jungtaek Lim
Great, thanks all for your efforts on the huge step forward!

On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon  wrote:

> Yay!
>
> On Fri, Jun 19, 2020 at 4:46 AM Mridul Muralidharan wrote:
>
>> Great job everyone ! Congratulations :-)
>>
>> Regards,
>> Mridul
>>
>> On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin  wrote:
>>
>>> Hi all,
>>>
>>> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on
>>> many of the innovations from Spark 2.x, bringing new ideas as well as
>>> continuing long-term projects that have been in development. This release
>>> resolves more than 3400 tickets.
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to this release. This release would not have been
>>> possible without you.
>>>
>>> To download Spark 3.0.0, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-0-0.html
>>>
>>>
>>>
>>>


Re: [DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-06-16 Thread Jungtaek Lim
Bump this again. I filed SPARK-31985 [1] and plan to submit a PR in a
couple of days if no one voices a reason we should keep it.

1. https://issues.apache.org/jira/browse/SPARK-31985

On Thu, May 21, 2020 at 8:54 AM Jungtaek Lim 
wrote:

> Let me share the effect of removing the incomplete and undocumented code
> path. I manually tried removing the code path and here's the change.
>
>
> https://github.com/HeartSaVioR/spark/commit/aa53e9b1b33c0b8aec37704ad290b42ffb2962d8
>
> 1,120 lines deleted without hurting any existing streaming tests, except for a
> suite I removed as well since it tests that code path. No need to update the
> documentation, as the feature was never publicized. This also removes some rules
> which only apply to continuous mode and which threw exceptions stating
> that "continuous mode is only available for map-like operations". Removing
> them would simplify the usage of continuous mode.
>
> Also worth noting that I had to manually remove the code path instead of
> reverting it, because the code path has been changed to reflect the DSv2 changes.
> What this means is that we have to keep "updating" the code path and worry
> about compatibility, etc., while it is never used in production.
>
> On Tue, May 19, 2020 at 1:14 PM Jungtaek Lim 
> wrote:
>
>> Hi devs,
>>
>> during an experiment with complete mode I realized we left some incomplete
>> code parts supporting aggregation for continuous mode (shuffle &
>> coalesce).
>>
>> The work occurred around the first half of 2018 and then stopped, and no
>> work has been done for around 2 years (so I don't expect anyone is working
>> on this). The functionality is undocumented (as the work was only done
>> partially) and continuous mode is experimental, so I don't see much risk in
>> getting rid of that part.
>>
>> What do you think? If it makes sense then I'll raise a PR to get rid of
>> the incomplete code.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. I feel much the same about continuous mode itself - it has
>> remained experimental for over 2 years, and while it lacks lots of
>> essential things (backpressure, epoch synchronization, query progress,
>> etc.) there has been no additional major work for over a year. We have
>> eventually excluded continuous mode from new streaming features like observable
>> metrics, streaming UI, etc., because they are built on top of features
>> which continuous mode lacks.
>>
>> Unlike the incomplete code path, I'm not strongly against keeping this, as it
>> has been documented and someone might use the feature in production. I
>> still think it would be ideal to retire the feature gracefully.
>> (Please chime in with the use case if someone has production cases.)
>>
>> ps2. It feels like a "feature branch" is something to consider for
>> efforts on one big feature.
>>
>


Re: Handling user-facing metadata issues on file stream source & sink

2020-06-14 Thread Jungtaek Lim
Bump again - I hope to get some traction because these issues are either
long-standing problems or noticeable improvements (each PR has numbers/UI
graphs to show the improvement).

Fixed long-standing problems:

* [SPARK-17604][SS] FileStreamSource: provide a new option to have
retention on input files [1]
* [SPARK-27188][SS] FileStreamSink: provide a new option to have retention
on output files [2]

There's no logic to control the size of the metadata for the file stream source &
file stream sink, and it affects end users who run streaming queries with
many input files / output files in the long run. Both patches resolve the
metadata growing incrementally over time. As the issue number of SPARK-17604
suggests, it's a fairly old problem. There are at least three relevant issues
reported against SPARK-27188.
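
For context, here is a minimal sketch of the existing knobs (config names as in
current Spark; the retention options proposed in the PRs above are not shown).
Compaction rewrites all previous entries into a single ".compact" file every N
batches, but nothing is ever dropped from it, so the compact file keeps growing
with the total number of input/output files over the lifetime of the query:

    import org.apache.spark.sql.SparkSession

    // Existing configs only - they tune how often compaction happens,
    // not how large the compacted metadata is allowed to grow.
    val spark = SparkSession.builder()
      .appName("file-stream-metadata-demo")
      // Compact the file stream source metadata log every 10 batches.
      .config("spark.sql.streaming.fileSource.log.compactInterval", "10")
      // Compact the file stream sink metadata log every 10 batches.
      .config("spark.sql.streaming.fileSink.log.compactInterval", "10")
      .getOrCreate()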

Improvements:

* [SPARK-30866][SS] FileStreamSource: Cache fetched list of files beyond
maxFilesPerTrigger as unread files [3]
* [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log
twice if the query restarts from compact batch [4]
* [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with
LZ4 compression on FileStream(Source/Sink)Log [5]

The above patches provide better performance under the conditions described in each
PR. Worth noting, SPARK-30946 provides significantly better performance
(~10x) on compaction for every compact batch, while also reducing the compact
batch log file size (to ~30% of the current size).

1. https://github.com/apache/spark/pull/28422
2. https://github.com/apache/spark/pull/28363
3. https://github.com/apache/spark/pull/27620
4. https://github.com/apache/spark/pull/27649
5. https://github.com/apache/spark/pull/27694
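
As a point of reference, here is a minimal sketch of the kind of long-running
file-to-file query these patches target (paths, schema, and option values are
placeholders). It exercises both the source metadata log, via file listing with
maxFilesPerTrigger, and the sink metadata log, via the parquet file sink:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("file-to-file-streaming").getOrCreate()

    // File sources require a user-provided schema for streaming reads.
    val input = spark.readStream
      .format("json")
      .schema("id LONG, value STRING")
      .option("maxFilesPerTrigger", 100)   // the scenario SPARK-30866 improves
      .load("/data/landing")

    // The file sink records every written file in its metadata log, which is
    // what grows (and gets compacted) over the lifetime of the query.
    val query = input.writeStream
      .format("parquet")
      .option("checkpointLocation", "/checkpoints/file-to-file")
      .start("/data/output")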


On Fri, May 22, 2020 at 12:50 PM Jungtaek Lim 
wrote:

> Worth noting that I got a similar question from the local community as well.
> These reporters didn't encounter an edge case; they encountered the
> critical issue during normal running of a streaming query.
>
> On Fri, May 8, 2020 at 4:49 PM Jungtaek Lim 
> wrote:
>
>> (bump to expose the discussion to more readers)
>>
>> On Mon, May 4, 2020 at 5:45 PM Jungtaek Lim 
>> wrote:
>>
>>> Hi devs,
>>>
>>> I'm seeing more and more structured streaming end users encountering the
>>> metadata issues on the file stream source and sink. These have been known issues,
>>> and there are even long-standing JIRA issues reported before; end users
>>> reported them again on the user@ mailing list in April.
>>>
>>> * Spark Structure Streaming | FileStreamSourceLog not deleting list of
>>> input files | Spark -2.4.0 [1]
>>> * [Structured Streaming] Checkpoint file compact file grows big [2]
>>>
>>> I've proposed various improvements in the area (see my PRs [3]) but they
>>> have suffered from a lack of interest/reviews. I feel the issue is critical
>>> (and under-estimated) because...
>>>
>>> 1. It's one of the "built-in" data sources which is maintained by the
>>> Spark community. (End users may judge the state of the project/area by the
>>> quality of the built-in data sources, because that's what they would
>>> start with.)
>>> 2. It's the only built-in data source which provides "end-to-end
>>> exactly-once" in structured streaming.
>>>
>>> I'd hope to see us address such issues so that end users can live with the
>>> built-in data source. (It doesn't need to be perfect, but it should at least be
>>> reasonable for long-running streaming workloads.) I know there are a couple of
>>> alternatives, but I don't think a starter would start from there. End users
>>> may just try to find alternatives - not an alternative data source, but an
>>> alternative stream processing framework.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1.
>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>> 2.
>>> https://lists.apache.org/thread.html/r0916e2fe8181a58c20ee8a76341aae243c76bbfd8758d8d94f79fe8e%40%3Cuser.spark.apache.org%3E
>>> 3. https://github.com/apache/spark/pulls/HeartSaVioR
>>>
>>

