PR builder broken

2023-05-10 Thread Xingbo Jiang
Hi dev,

I've seen multiple PR builder failures like below since this morning:
```
TypeError: Cannot read properties of undefined (reading 'head_sha')
at eval (eval at callAsyncFunction
(/home/runner/work/_actions/actions/github-script/v6/dist/index.js:15143:16),
:81:22)
Error: Unhandled error: TypeError: Cannot read properties of undefined
(reading 'head_sha')
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at async main
(/home/runner/work/_actions/actions/github-script/v6/dist/index.js:15236:20)
```
(Example links:
https://github.com/apache/spark/actions/runs/4940984520/jobs/8833154761?pr=40690,
https://github.com/apache/spark/actions/runs/4939269706/jobs/8829852985?pr=41123
)

It may be related to GitHub; could someone help take a look?

Thanks,
Xingbo


Re: [VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Xingbo Jiang
+1

On Wed, Nov 30, 2022 at 5:59 PM Jungtaek Lim 
wrote:

> Starting with +1 from me.
>
> On Thu, Dec 1, 2022 at 10:54 AM Jungtaek Lim 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Asynchronous Offset Management in
>> Structured Streaming.
>>
>> The high-level summary of the SPIP is that we propose a couple of
>> improvements to offset management in microbatch execution to lower
>> processing latency, which would help certain types of workloads.
>>
>> References:
>>
>>- JIRA ticket 
>>- SPIP doc
>>
>> 
>>- Discussion thread
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks!
>> Jungtaek Lim (HeartSaVioR)
>>
>


Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Xingbo Jiang
Congratulations!

Yuanjian Li wrote on Tue, Aug 9, 2022 at 20:31:

> Congratulations, Xinrong!
>
> XiDuo You wrote on Tue, Aug 9, 2022 at 19:18:
>
>> Congratulations!
>>
>> Haejoon Lee wrote on Wed, Aug 10, 2022 at 09:30:
>> >
>> > Congrats, Xinrong!!
>> >
>> > On Tue, Aug 9, 2022 at 5:12 PM Hyukjin Kwon 
>> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> The Spark PMC recently added Xinrong Meng as a committer on the
>> project. Xinrong is a major contributor to PySpark, especially the Pandas API
>> on Spark. She has enthusiastically guided a lot of new contributors. Please
>> join me in welcoming Xinrong!
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: SIGMOD System Award for Apache Spark

2022-05-13 Thread Xingbo Jiang
Congratulations!

On Fri, May 13, 2022 at 9:43 AM Xiao Li 
wrote:

> Congratulations to everyone!
>
> Xiao
>
> On Fri, May 13, 2022 at 9:34 AM Dongjoon Hyun 
> wrote:
>
>> Ya, it's really great! Congratulations to the whole community!
>>
>> Dongjoon.
>>
>> On Fri, May 13, 2022 at 8:12 AM Chao Sun  wrote:
>>
>>> Huge congrats to the whole community!
>>>
>>> On Fri, May 13, 2022 at 1:56 AM Wenchen Fan  wrote:
>>>
 Great! Congratulations to everyone!

 On Fri, May 13, 2022 at 10:38 AM Gengliang Wang 
 wrote:

> Congratulations to the whole spark community!
>
> On Fri, May 13, 2022 at 10:14 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Congrats Spark community!
>>
>> On Fri, May 13, 2022 at 10:40 AM Qian Sun 
>> wrote:
>>
>>> Congratulations !!!
>>>
>>> On May 13, 2022 at 3:44 AM, Matei Zaharia wrote:
>>>
>>> Hi all,
>>>
>>> We recently found out that Apache Spark received
>>>  the SIGMOD System Award
>>> this year, given by SIGMOD (the ACM’s data management research
>>> organization) to impactful real-world and research systems. This puts 
>>> Spark
>>> in good company with some very impressive previous recipients
>>> . This
>>> award is really an achievement by the whole community, so I wanted to 
>>> say
>>> congrats to everyone who contributes to Spark, whether through code, 
>>> issue
>>> reports, docs, or other means.
>>>
>>> Matei
>>>
>>>
>>>
>
> --
>
>


Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Xingbo Jiang
+1 This is an exciting new feature!

On Sun, Sep 13, 2020 at 8:00 PM Mridul Muralidharan 
wrote:

> Hi,
>
> I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based
> shuffle to improve shuffle efficiency.
> Please take a look at:
>
>- SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
>- SPIP doc:
>
> https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
>- POC against master and results summary :
>
> https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit
>
> Active discussions on the jira and SPIP document have settled.
>
> I will leave the vote open until Friday (the 18th September 2020), 5pm
> CST.
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
>
> Thanks,
> Mridul
>


Re: [OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Xingbo Jiang
Hi Holden,

This is the digest for commits merged between *June 3 and June 16.* The
commits you mentioned would be included in future digests.

Cheers,

Xingbo

On Tue, Jul 21, 2020 at 11:13 AM Holden Karau  wrote:

> I'd also add [SPARK-20629][CORE][K8S] Copy shuffle data when nodes are
> being shutdown &
>
> [SPARK-21040][CORE] Speculate tasks which are running on decommission
> executors two of the PRs merged after the decommissioning SPIP.
>
> On Tue, Jul 21, 2020 at 10:53 AM Xingbo Jiang 
> wrote:
>
>> Hi all,
>>
>> This is the bi-weekly Apache Spark digest from the Databricks OSS team.
>> For each API/configuration/behavior change, an *[API] *tag is added in
>> the title.
>>
>> CORE
>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31923core-ignore-internal-accumulators-that-use-unrecognized-types-rather-than-crashing-63--5>[3.0][SPARK-31923][CORE]
>> Ignore internal accumulators that use unrecognized types rather than
>> crashing (+63, -5)>
>> <https://github.com/apache/spark/commit/b333ed0c4a5733a9c36ad79de1d4c13c6cf3c5d4>
>>
>> A user may name his accumulators using the internal.metrics. prefix, so
>> that Spark treats them as internal accumulators and hides them from UI. We
>> should make JsonProtocol.accumValueToJson more robust and let it ignore
>> internal accumulators that use unrecognized types.
>>
>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api80spark-31486core-sparksubmitwaitappcompletion-flag-to-control-spark-submit-exit-in-standalone-cluster-mode-88--26>[API][3.1][SPARK-31486][CORE]
>> spark.submit.waitAppCompletion flag to control spark-submit exit in
>> Standalone Cluster Mode (+88, -26)>
>> <https://github.com/apache/spark/commit/6befb2d8bdc5743d0333f4839cf301af165582ce>
>>
>> This PR implements an application wait mechanism that allows spark-submit to
>> wait until the application finishes in Standalone mode. This will delay the
>> exit of spark-submit JVM until the job is completed. This implementation
>> will keep monitoring the application until it is either finished, failed,
>> or killed. This will be controlled via the following conf:
>>
>>-
>>
>>spark.standalone.submit.waitAppCompletion (Default: false)
>>
>>In standalone cluster mode, controls whether the client waits to exit
>>until the application completes. If set to true, the client process
>>will stay alive polling the driver's status. Otherwise, the client process
>>will exit after submission.
>>
>>
>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#sql>
>> SQL
>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#71spark-31220sql-repartition-obeys-initialpartitionnum-when-adaptiveexecutionenabled-27--12>[3.0][SPARK-31220][SQL]
>> repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27,
>> -12)>
>> <https://github.com/apache/spark/commit/1d1eacde9d1b6fb75a20e4b909d221e70ad737db>
>>
>> AQE and non-AQE use different configs to set the initial shuffle
>> partition number. This PR fixes repartition/DISTRIBUTE BY so that it
>> also uses the AQE config
>> spark.sql.adaptive.coalescePartitions.initialPartitionNum to set the
>> initial shuffle partition number if AQE is enabled.
>>
>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#70spark-31867sqlfollowup-check-result-differences-for-datetime-formatting-51--8>[3.0][SPARK-31867][SQL][FOLLOWUP]
>> Check result differences for datetime formatting (+51, -8)>
>> <https://github.com/apache/spark/commit/fc6af9d900ec6f6a1cbe8f987857a69e6ef600d1>
>>
>> Spark should throw SparkUpgradeException when getting DateTimeException for
>> datetime formatting in the EXCEPTION legacy Time Parser Policy.
>>
>> <https://github.com/databricks/runtime/wiki/OSS-Digest-June-3-~-June-9,-2020#api70spark-31879spark-31892sql-disable-week-based-pattern-letters-in-datetime-parsingformatting-1421--171-102--48>[API][3.0][SPARK-31879][SPARK-31892][SQL]
>> Disable week-based pattern letters in datetime parsing/formatting (+1421,
>> -171)>
>> <https://github.com/apache/spark/commit/9d5b5d0a5849ac329bbae26d9884d8843d8a8571>
>>  (+102,
>> -48)>
>> <https://github.com/apache/spark/commit/afe95bd9ad7a07c49deecf05f0a1000bb8f80caa>
>>
>> Week-based pattern letters have very weird behaviors during datetime
>> parsing in Spark 2.4, and it's very hard to simulate the legacy behaviors
>> with 

[OSS DIGEST] The major changes of Apache Spark from June 3 to June 16

2020-07-21 Thread Xingbo Jiang
Hi all,

This is the bi-weekly Apache Spark digest from the Databricks OSS team.
For each API/configuration/behavior change, an *[API] *tag is added in the
title.

CORE
[3.0][SPARK-31923][CORE]
Ignore internal accumulators that use unrecognized types rather than
crashing (+63, -5)>


A user may name his accumulators using the internal.metrics. prefix, so
that Spark treats them as internal accumulators and hides them from UI. We
should make JsonProtocol.accumValueToJson more robust and let it ignore
internal accumulators that use unrecognized types.
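
For illustration, a hedged sketch of the naming scenario (the accumulator name and job are hypothetical): a user accumulator named with the reserved internal.metrics. prefix is treated as an internal accumulator, and a value type JsonProtocol does not recognize for internal accumulators could previously crash event-log JSON serialization.

```
import org.apache.spark.sql.SparkSession

object InternalMetricsPrefixExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("internal-metrics-prefix-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // "internal.metrics." is reserved for Spark's internal accumulators; naming a
    // user accumulator this way hides it from the UI, and its java.util.List value
    // is not one of the types JsonProtocol expects for internal accumulators.
    val acc = sc.collectionAccumulator[String]("internal.metrics.myCustomStrings")
    sc.parallelize(1 to 100).foreach(i => acc.add(s"item-$i"))
    println(s"accumulator size: ${acc.value.size()}")

    spark.stop()
  }
}
```
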
[API][3.1][SPARK-31486][CORE]
spark.submit.waitAppCompletion flag to control spark-submit exit in
Standalone Cluster Mode (+88, -26)>


This PR implements an application wait mechanism that allows spark-submit to
wait until the application finishes in Standalone mode. This will delay the
exit of the spark-submit JVM until the job is completed. This implementation
will keep monitoring the application until it is either finished, failed,
or killed. This will be controlled via the following conf:

   -

   spark.standalone.submit.waitAppCompletion (Default: false)

   In standalone cluster mode, controls whether the client waits to exit
   until the application completes. If set to true, the client process will
   stay alive polling the driver's status. Otherwise, the client process will
   exit after submission.
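
A hedged launch sketch using the programmatic launcher org.apache.spark.launcher.SparkLauncher (the master URL, jar path, and class name below are placeholders), equivalent to passing the new conf to spark-submit in standalone cluster mode:

```
import org.apache.spark.launcher.SparkLauncher

object WaitAppCompletionLaunch {
  def main(args: Array[String]): Unit = {
    // Same effect as passing --conf spark.standalone.submit.waitAppCompletion=true
    // to spark-submit with --deploy-mode cluster on a standalone master.
    val handle = new SparkLauncher()
      .setMaster("spark://master-host:7077")
      .setDeployMode("cluster")
      .setAppResource("/path/to/my-app.jar")
      .setMainClass("com.example.MyApp")
      .setConf("spark.standalone.submit.waitAppCompletion", "true")
      .startApplication()

    // Poll until the launched application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"final state: ${handle.getState}")
  }
}
```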


SQL
[3.0][SPARK-31220][SQL]
repartition obeys initialPartitionNum when adaptiveExecutionEnabled (+27,
-12)>


AQE and non-AQE use different configs to set the initial shuffle partition
number. This PR fixes repartition/DISTRIBUTE BY so that it also uses the
AQE config spark.sql.adaptive.coalescePartitions.initialPartitionNum to set
the initial shuffle partition number if AQE is enabled.
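
A small hedged sketch (values are illustrative; `spark` is an assumed existing SparkSession, e.g. the spark-shell session): with AQE enabled, the repartition below starts from the AQE initial shuffle partition number instead of spark.sql.shuffle.partitions.

```
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "200")

import spark.implicits._
// Column-based repartition (the DataFrame form of DISTRIBUTE BY) introduces a
// shuffle that starts from the 200 initial partitions configured above.
val df = spark.range(0, 1000000).repartition($"id")
df.count()
```
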
[3.0][SPARK-31867][SQL][FOLLOWUP]
Check result differences for datetime formatting (+51, -8)>


Spark should throw SparkUpgradeException when getting DateTimeException for
datetime formatting in the EXCEPTION legacy Time Parser Policy.
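
The behavior is governed by a single policy config; a hedged sketch (assuming an existing SparkSession named `spark`) of keeping the default fail-fast mode versus opting back into the 2.4 formatter:

```
// EXCEPTION (the default) surfaces SparkUpgradeException when a formatted or parsed
// result would differ from Spark 2.4; LEGACY restores the old behavior, and
// CORRECTED silently uses the new behavior.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
// spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
```
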
[API][3.0][SPARK-31879][SPARK-31892][SQL]
Disable week-based pattern letters in datetime parsing/formatting (+1421,
-171)>

(+102,
-48)>


Week-based pattern letters have very weird behaviors during datetime
parsing in Spark 2.4, and it's very hard to simulate the legacy behaviors
with the new API. For formatting, the new API makes the start-of-week
localized, and it's not possible to keep the legacy behaviors. Since the
week-based fields are rarely used, we disable week-based pattern letters in
both parsing and formatting.
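
A hedged illustration (the exact error message may differ; `spark` is an assumed existing SparkSession): a pattern using week-based letters such as 'Y' (week-based year) or 'w' is now rejected up front instead of silently diverging from the 2.4 output.

```
// Expected to fail fast on Spark 3.0 because week-based pattern letters are disabled.
spark.sql("SELECT date_format(DATE '2020-06-01', 'YYYY-ww')").show()
```
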
[3.0][SPARK-31896][SQL]
Handle am-pm timestamp parsing when hour is missing (+39, -3)>


This PR sets the hour field to 0 or 12 when the AMPM_OF_DAY field is AM or
PM during datetime parsing, to keep the behavior the same as Spark 2.4.
[API][3.1][SPARK-31830][SQL]
Consistent error handling for datetime formatting and parsing functions
(+126, -580)>


When parsing/formatting datetime values, it's better to fail fast if the
pattern string is invalid, instead of returning null for each input record.
The formatting 

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Xingbo Jiang
Welcome, Huaxin, Jungtaek, and Dilip!

Congratulations!

On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia 
wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new roles! The new committers are:
>
> - Huaxin Gao
> - Jungtaek Lim
> - Dilip Biswal
>
> All three of them contributed to Spark 3.0 and we’re excited to have them
> join the project.
>
> Matei and the Spark PMC
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread Xingbo Jiang
+1(non-binding)

Jiaxin Shan wrote on Mon, Jun 8, 2020 at 9:50 PM:

> +1
> I build binary using the following command, test spark workloads on
> Kubernetes (AWS EKS) and it's working well.
>
> ./dev/make-distribution.sh --name spark-v3.0.0-rc3-20200608 --tgz
> -Phadoop-3.2 -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud
> -Pscala-2.12
>
> On Mon, Jun 8, 2020 at 7:13 PM Bryan Cutler  wrote:
>
>> +1 (non-binding)
>>
>> On Mon, Jun 8, 2020, 1:49 PM Tom Graves 
>> wrote:
>>
>>> +1
>>>
>>> Tom
>>>
>>> On Saturday, June 6, 2020, 03:09:09 PM CDT, Reynold Xin <
>>> r...@databricks.com> wrote:
>>>
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.0.0.
>>>
>>> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes
>>> are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.0-rc3 (commit
>>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>>> https://github.com/apache/spark/tree/v3.0.0-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1350/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>>>
>>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>
>>> This release is using the release script of the tag v3.0.0-rc3.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.0.0?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.0.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
>>>
>
> --
> Best Regards!
> Jiaxin Shan
> Tel:  412-230-7670
> Address: 470 2nd Ave S, Kirkland, WA
> 
>
>


Re: [OSS DIGEST] The major changes of Apache Spark from Mar 25 to Apr 7

2020-04-29 Thread Xingbo Jiang
Thank you so much for doing this, Xiao!

On Wed, Apr 29, 2020 at 11:09 AM Xiao Li  wrote:

> Hi all,
>
> This is the bi-weekly Apache Spark digest from the Databricks OSS team.
> For each API/configuration/behavior change, an *[API] *tag is added in
> the title.
>
> CORE
> [3.0][SPARK-30623][CORE]
> Spark external shuffle allow disable of separate event loop group (+66, -33)
> >
> 
>
> PR#22173  introduced a perf
> regression in shuffle, even if we disable the feature flag
> spark.shuffle.server.chunkFetchHandlerThreadsPercent. To fix the perf
> regression, this PR refactors the related code to completely disable this
> feature by default.
>
> [3.0][SPARK-31314][CORE]
> Revert SPARK-29285 to fix shuffle regression caused by creating temporary
> file eagerly (+10, -71)>
> 
>
> PR#25962  introduced a perf
> regression in shuffle, which may create empty files unnecessarily. This PR
> reverts it.
>
> [API][3.1][SPARK-29154][CORE]
> Update Spark scheduler for stage level scheduling (+704, -218)>
> 
>
> This PR updates the DAG scheduler to schedule tasks to match the resource
> profile. It's for the stage level scheduling.
> [API][3.1][SPARK-29153][CORE] Add ability to merge resource profiles
> within a stage with Stage Level Scheduling (+304, -15)>
> 
>
> Add the ability to optionally merged resource profiles if they are
> specified on multiple RDDs within a Stage. The feature is part of Stage
> Level Scheduling. There is a config
> spark.scheduler.resourceProfile.mergeConflicts to enable this feature,
> the config if off by default.
>
> spark.scheduler.resource.profileMergeConflicts (Default: false)
>
>- If set to true, Spark will merge ResourceProfiles when different
>profiles are specified in RDDs that get combined into a single stage. When
>they are merged, Spark chooses the maximum of each resource and creates a
>new ResourceProfile. The default of false results in Spark throwing an
>exception if multiple different ResourceProfiles are found in RDDs going
>into the same stage.
>
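A hedged sketch of where the merge applies (resource amounts are illustrative; `sc` is an assumed existing SparkContext on a cluster manager that supports stage-level scheduling, with spark.scheduler.resource.profileMergeConflicts=true):

```
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Two RDDs carrying different profiles end up in one stage; with merging enabled,
// Spark takes the max of each resource (here: 4 executor cores, 8g memory, 2 cpus per task).
val profileA = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests().cores(4).memory("4g"))
  .require(new TaskResourceRequests().cpus(1))
  .build
val profileB = new ResourceProfileBuilder()
  .require(new ExecutorResourceRequests().cores(2).memory("8g"))
  .require(new TaskResourceRequests().cpus(2))
  .build

val left = sc.parallelize(1 to 100).withResources(profileA)
val right = sc.parallelize(1 to 100).withResources(profileB)
left.zip(right).count()   // zip() combines both RDDs into the same stage
```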
>
> [API][3.1][SPARK-31208][CORE]
> Add an experimental API: cleanShuffleDependencies (+158, -71)>
> 
>
> Add a new experimental developer API RDD.cleanShuffleDependencies(blocking:
> Boolean) to allow explicitly clean up shuffle files. This could help
> dynamic scaling of K8s backend since the backend only recycles executors
> without shuffle files.
>
>   /**
>    * :: Experimental ::
>    * Removes an RDD's shuffles and it's non-persisted ancestors.
>    * When running without a shuffle service, cleaning up shuffle files enables downscaling.
>    * If you use the RDD after this call, you should checkpoint and materialize it first.
>    * If you are uncertain of what you are doing, please do not use this feature.
>    * Additional techniques for mitigating orphaned shuffle files:
>    *   * Tuning the driver GC to be more aggressive, so the regular context cleaner is triggered
>    *   * Setting an appropriate TTL for shuffle files to be auto cleaned
>    */
>   @Experimental
>   @DeveloperApi
>   @Since("3.1.0")
>   def cleanShuffleDependencies(blocking: Boolean = false): Unit
>
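A hedged usage sketch of the API above (`sc` is an assumed existing SparkContext; data and sizes are illustrative), invoked once the shuffle output is no longer needed:

```
val aggregated = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, 1L))
  .reduceByKey(_ + _)

aggregated.count()                      // materialize the shuffle
aggregated.cleanShuffleDependencies()   // drop the shuffle files (non-blocking by default)
```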
>
> [3.1][SPARK-31179]
> Fast fail the connection while last connection failed in fast fail time
> window (+68, -12)>
> 
>
> In TransportFactory, if a connection to the destination address fails, the
> new connection requests [that are created within a time window] fail fast
> for avoiding too many retries. This time window size is set to 95% of the
> IO retry wait time (spark.io.shuffle.retryWait whose default is 5 seconds).
>
> 

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-13 Thread Xingbo Jiang
Andrew, could you provide more context on your use case, please? Do you
deploy homogeneous containers on hosts with available resources, with each
container launching one worker? Or do you deploy workers directly on hosts,
so that you could have multiple workers from the same application on the
same host?

Thanks,

Xingbo

On Fri, Mar 13, 2020 at 10:23 AM Sean Owen  wrote:

> You have multiple workers in one Spark (standalone) app? this wouldn't
> prevent N apps from each having a worker on a machine.
>
> On Fri, Mar 13, 2020 at 11:51 AM Andrew Melo 
> wrote:
> >
> > Hello,
> >
> > On Fri, Feb 28, 2020 at 13:21 Xingbo Jiang 
> wrote:
> >>
> >> Hi all,
> >>
> >> Based on my experience, there is no scenario that necessarily requires
> deploying multiple Workers on the same node with Standalone backend. A
> worker should book all the resources reserved to Spark on the host it is
> launched, then it can allocate those resources to one or more executors
> launched by this worker. Since each executor runs in a separate JVM, we
> can limit the memory of each executor to avoid long GC pauses.
> >>
> >> The remaining concern is the local-cluster mode is implemented by
> launching multiple workers on the local host, we might need to re-implement
> LocalSparkCluster to launch only one Worker and multiple executors. It
> should be fine because local-cluster mode is only used in running Spark
> unit test cases, thus end users should not be affected by this change.
> >>
> >> Removing multiple workers on the same host support could simplify the
> deploy model of Standalone backend, and also reduce the burden to support
> legacy deploy pattern in the future feature developments. (There is an
> example in https://issues.apache.org/jira/browse/SPARK-27371 , where we
> designed a complex approach to coordinate resource requirements from
> different workers launched on the same host).
> >>
> >> The proposal is to update the document to deprecate the support of
> system environment `SPARK_WORKER_INSTANCES` in Spark 3.0, and remove the
> support in the next major version (Spark 3.1).
> >>
> >> Please kindly let me know if you have use cases relying on this feature.
> >
> >
> > When deploying spark on batch systems (by wrapping the standalone
> deployment in scripts that can be consumed by the batch scheduler), we
> typically end up with >1 worker per host. If I understand correctly, this
> proposal would make our use case unsupported.
> >
> > Thanks,
> > Andrew
> >
> >
> >
> >>
> >> Thanks!
> >>
> >> Xingbo
> >
> > --
> > It's dark in this basement.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-12 Thread Xingbo Jiang
Hi Prashant,

I guess you are referring to the local-cluster mode? AFAIK the
local-cluster mode has not been mentioned at all in the user guide, so it
should only be used in Spark tests. Also, there are a few differences
between having multiple workers on the same node and having one worker on
each node: as I mentioned in
https://issues.apache.org/jira/browse/SPARK-27371 , a complex approach is
needed to resolve resource requirement contention between different
workers running on the same node.

Cheers,

Xingbo

On Thu, Mar 5, 2020 at 8:49 PM Prashant Sharma  wrote:

> It was by design, one could run multiple workers on his laptop for trying
> out or testing spark in distributed mode, one could launch multiple workers
> and see how resource offers and requirements work. Certainly, I have not
> commonly seen, starting multiple workers on the same node as a practice so
> far.
>
> Why do we consider it as a special case for scheduling, where two workers
> are on the same node than two different nodes? Possibly, optimize on
> network I/o and disk I/O?
>
> On Tue, Mar 3, 2020 at 12:45 AM Xingbo Jiang 
> wrote:
>
>> Thanks Sean for your input, I really think it could simplify Spark
>> Standalone backend a lot by only allowing a single worker on the same host,
>> also I can confirm this deploy model can satisfy all the workloads deployed
>> on Standalone backend AFAIK.
>>
>> Regarding the case multiple distinct Spark clusters running a worker on
>> one machine, I'm not sure whether that's something we have claimed to
>> support, could someone with more context on this scenario share their use
>> case?
>>
>> Cheers,
>>
>> Xingbo
>>
>> On Fri, Feb 28, 2020 at 11:29 AM Sean Owen  wrote:
>>
>>> I'll admit, I didn't know you could deploy multiple workers per
>>> machine. I agree, I don't see the use case for it? multiple executors,
>>> yes of course. And I guess you could imagine multiple distinct Spark
>>> clusters running a worker on one machine. I don't have an informed
>>> opinion therefore, but agree that it seems like a best practice enough
>>> to enforce 1 worker per machine, if it makes things simpler rather
>>> than harder.
>>>
>>> On Fri, Feb 28, 2020 at 1:21 PM Xingbo Jiang 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Based on my experience, there is no scenario that necessarily requires
>>> deploying multiple Workers on the same node with Standalone backend. A
>>> worker should book all the resources reserved to Spark on the host it is
>>> launched, then it can allocate those resources to one or more executors
>>> launched by this worker. Since each executor runs in a separate JVM, we
>>> can limit the memory of each executor to avoid long GC pauses.
>>> >
>>> > The remaining concern is the local-cluster mode is implemented by
>>> launching multiple workers on the local host, we might need to re-implement
>>> LocalSparkCluster to launch only one Worker and multiple executors. It
>>> should be fine because local-cluster mode is only used in running Spark
>>> unit test cases, thus end users should not be affected by this change.
>>> >
>>> > Removing multiple workers on the same host support could simplify the
>>> deploy model of Standalone backend, and also reduce the burden to support
>>> legacy deploy pattern in the future feature developments. (There is an
>>> example in https://issues.apache.org/jira/browse/SPARK-27371 , where we
>>> designed a complex approach to coordinate resource requirements from
>>> different workers launched on the same host).
>>> >
>>> > The proposal is to update the document to deprecate the support of
>>> system environment `SPARK_WORKER_INSTANCES` in Spark 3.0, and remove the
>>> support in the next major version (Spark 3.1).
>>> >
>>> > Please kindly let me know if you have use cases relying on this
>>> feature.
>>> >
>>> > Thanks!
>>> >
>>> > Xingbo
>>>
>>


Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Xingbo Jiang
+1 (non-binding)

Cheers,

Xingbo

On Mon, Mar 9, 2020 at 9:35 AM Xiao Li  wrote:

> +1 (binding)
>
> Xiao
>
> On Mon, Mar 9, 2020 at 8:33 AM Denny Lee  wrote:
>
>> +1 (non-binding)
>>
>> On Mon, Mar 9, 2020 at 1:59 AM Hyukjin Kwon  wrote:
>>
>>> The proposal itself seems good as the factors to consider, Thanks
>>> Michael.
>>>
>>> Several concerns mentioned look good points, in particular:
>>>
>>> > ... assuming that this is for public stable APIs, not APIs that are
>>> marked as unstable, evolving, etc. ...
>>> I would like to confirm this. We already have API annotations such as
>>> Experimental, Unstable, etc. and the implication of each is still
>>> effective. If it's for stable APIs, it makes sense to me as well.
>>>
>>> > ... can we expand on 'when' an API change can occur ?  Since we are
>>> proposing to diverge from semver. ...
>>> I think this is a good point. If we're proposing to divert from semver,
>>> the delta compared to semver will have to be clarified to avoid different
>>> personal interpretations of the somewhat general principles.
>>>
>>> > ... can we narrow down on the migration from Apache Spark 2.4.5 to
>>> Apache Spark 3.0+? ...
>>>
>>> Assuming these concerns will be addressed, +1 (binding).
>>>
>>>
>>> On Mon, Mar 9, 2020 at 4:53 PM, Takeshi Yamamuro wrote:
>>>
 +1 (non-binding)

 Bests,
 Takeshi

 On Mon, Mar 9, 2020 at 4:52 PM Gengliang Wang <
 gengliang.w...@databricks.com> wrote:

> +1 (non-binding)
>
> Gengliang
>
> On Mon, Mar 9, 2020 at 12:22 AM Matei Zaharia 
> wrote:
>
>> +1 as well.
>>
>> Matei
>>
>> On Mar 9, 2020, at 12:05 AM, Wenchen Fan  wrote:
>>
>> +1 (binding), assuming that this is for public stable APIs, not APIs
>> that are marked as unstable, evolving, etc.
>>
>> On Mon, Mar 9, 2020 at 1:10 AM Ismaël Mejía 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Michael's section on the trade-offs of maintaining / removing an API
>>> are one of
>>> the best reads I have seeing in this mailing list. Enthusiast +1
>>>
>>> On Sat, Mar 7, 2020 at 8:28 PM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>> >
>>> > This new policy has a good indention, but can we narrow down on
>>> the migration from Apache Spark 2.4.5 to Apache Spark 3.0+?
>>> >
>>> > I saw that there already exists a reverting PR to bring back Spark
>>> 1.4 and 1.5 APIs based on this AS-IS suggestion.
>>> >
>>> > The AS-IS policy is clearly mentioning that JVM/Scala-level
>>> difficulty, and it's nice.
>>> >
>>> > However, for the other cases, it sounds like `recommending older
>>> APIs as much as possible` due to the following.
>>> >
>>> >  > How long has the API been in Spark?
>>> >
>>> > We had better be more careful when we add a new policy and should
>>> aim not to mislead the users and 3rd party library developers to say 
>>> "older
>>> is better".
>>> >
>>> > Technically, I'm wondering who will use new APIs in their examples
>>> (of books and StackOverflow) if they need to write an additional warning
>>> like `this only works at 2.4.0+` always .
>>> >
>>> > Bests,
>>> > Dongjoon.
>>> >
>>> > On Fri, Mar 6, 2020 at 7:10 PM Mridul Muralidharan <
>>> mri...@gmail.com> wrote:
>>> >>
>>> >> I am in broad agreement with the prposal, as any developer, I
>>> prefer
>>> >> stable well designed API's :-)
>>> >>
>>> >> Can we tie the proposal to stability guarantees given by spark and
>>> >> reasonable expectation from users ?
>>> >> In my opinion, an unstable or evolving could change - while an
>>> >> experimental api which has been around for ages should be more
>>> >> conservatively handled.
>>> >> Which brings in question what are the stability guarantees as
>>> >> specified by annotations interacting with the proposal.
>>> >>
>>> >> Also, can we expand on 'when' an API change can occur ?  Since we
>>> are
>>> >> proposing to diverge from semver.
>>> >> Patch release ? Minor release ? Only major release ? Based on
>>> 'impact'
>>> >> of API ? Stability guarantees ?
>>> >>
>>> >> Regards,
>>> >> Mridul
>>> >>
>>> >>
>>> >>
>>> >> On Fri, Mar 6, 2020 at 7:01 PM Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>> >> >
>>> >> > I'll start off the vote with a strong +1 (binding).
>>> >> >
>>> >> > On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>> >> >>
>>> >> >> I propose to add the following text to Spark's Semantic
>>> Versioning policy and adopt it as the rubric that should be used when
>>> deciding to break APIs (even at major versions such as 3.0).
>>> >> >>
>>> >> >>
>>> >> >> I'll leave the vote open until Tuesday, March 10th at 2pm. As
>>> this is a 

Re: [DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-03-02 Thread Xingbo Jiang
Thanks Sean for your input. I really think it could simplify the Spark
Standalone backend a lot to only allow a single worker on the same host,
and I can confirm that this deploy model satisfies all the workloads
deployed on the Standalone backend, AFAIK.

Regarding the case of multiple distinct Spark clusters each running a worker
on one machine, I'm not sure whether that's something we have claimed to
support; could someone with more context on this scenario share their use case?

Cheers,

Xingbo

On Fri, Feb 28, 2020 at 11:29 AM Sean Owen  wrote:

> I'll admit, I didn't know you could deploy multiple workers per
> machine. I agree, I don't see the use case for it? multiple executors,
> yes of course. And I guess you could imagine multiple distinct Spark
> clusters running a worker on one machine. I don't have an informed
> opinion therefore, but agree that it seems like a best practice enough
> to enforce 1 worker per machine, if it makes things simpler rather
> than harder.
>
> On Fri, Feb 28, 2020 at 1:21 PM Xingbo Jiang 
> wrote:
> >
> > Hi all,
> >
> > Based on my experience, there is no scenario that necessarily requires
> deploying multiple Workers on the same node with Standalone backend. A
> worker should book all the resources reserved to Spark on the host it is
> launched, then it can allocate those resources to one or more executors
> launched by this worker. Since each executor runs in a separate JVM, we
> can limit the memory of each executor to avoid long GC pauses.
> >
> > The remaining concern is the local-cluster mode is implemented by
> launching multiple workers on the local host, we might need to re-implement
> LocalSparkCluster to launch only one Worker and multiple executors. It
> should be fine because local-cluster mode is only used in running Spark
> unit test cases, thus end users should not be affected by this change.
> >
> > Removing multiple workers on the same host support could simplify the
> deploy model of Standalone backend, and also reduce the burden to support
> legacy deploy pattern in the future feature developments. (There is an
> example in https://issues.apache.org/jira/browse/SPARK-27371 , where we
> designed a complex approach to coordinate resource requirements from
> different workers launched on the same host).
> >
> > The proposal is to update the document to deprecate the support of
> system environment `SPARK_WORKER_INSTANCES` in Spark 3.0, and remove the
> support in the next major version (Spark 3.1).
> >
> > Please kindly let me know if you have use cases relying on this feature.
> >
> > Thanks!
> >
> > Xingbo
>


[DISCUSS] Remove multiple workers on the same host support from Standalone backend

2020-02-28 Thread Xingbo Jiang
Hi all,

Based on my experience, there is no scenario that necessarily requires
deploying multiple Workers on the same node with the Standalone backend. A
worker should book all the resources reserved for Spark on the host it is
launched on, then allocate those resources to one or more executors
launched by this worker. Since each executor runs in a separate JVM, we
can limit the memory of each executor to avoid long GC pauses.
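
For illustration, a hedged sketch of this model (host names and resource sizes are placeholders): a single worker per host books all of the host's resources, and per-executor limits split them into several smaller JVMs.

```
import org.apache.spark.sql.SparkSession

object SingleWorkerManyExecutors {
  def main(args: Array[String]): Unit = {
    // Suppose each host runs one worker exposing 16 cores / 32g to the master; the
    // settings below carve that into four 4-core / 8g executors per host, keeping
    // each executor JVM small enough to avoid long GC pauses.
    val spark = SparkSession.builder()
      .master("spark://master-host:7077")
      .appName("single-worker-many-executors")
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .getOrCreate()

    println(spark.sparkContext.defaultParallelism)
    spark.stop()
  }
}
```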

The remaining concern is that local-cluster mode is implemented by launching
multiple workers on the local host, so we might need to re-implement
LocalSparkCluster to launch only one Worker and multiple executors. This
should be fine because local-cluster mode is only used for running Spark
unit test cases, so end users should not be affected by this change.

Removing support for multiple workers on the same host could simplify the
deploy model of the Standalone backend, and also reduce the burden of
supporting a legacy deploy pattern in future feature development. (There is an
example in https://issues.apache.org/jira/browse/SPARK-27371 , where we
designed a complex approach to coordinate resource requirements from
different workers launched on the same host).

The proposal is to update the documentation to deprecate support for the
system environment variable `SPARK_WORKER_INSTANCES` in Spark 3.0, and remove
the support in the next major version (Spark 3.1).

Please kindly let me know if you have use cases relying on this feature.

Thanks!

Xingbo


Re: spark-3.0.0-preview release notes link is broken

2019-11-28 Thread Xingbo Jiang
Hi Sandeep,

Thanks for reporting! spark-3.0.0-preview is not a stable release, so we
should not include this version in the `Release Notes for Stable Releases`
section. I've submitted a PR (
https://github.com/apache/spark-website/pull/235) to fix the issue.

Cheers,

Xingbo

On Thu, Nov 28, 2019 at 8:53 PM Sandeep Katta <
sandeep0102.opensou...@gmail.com> wrote:

> Hi,
>
> I see for preview release, release notes link is broken.
>
> https://spark.apache.org/releases/spark-release-3-0-0-preview.html
>


[ANNOUNCE] Announcing Apache Spark 3.0.0-preview

2019-11-07 Thread Xingbo Jiang
Hi all,

To enable wide-scale community testing of the upcoming Spark 3.0 release,
the Apache Spark community has posted a preview release of Spark 3.0. This
preview is *not a stable release in terms of either API or functionality*,
but it is meant to give the community early access to try the code that
will become Spark 3.0. If you would like to test the release, please
download it, and send feedback using either the mailing lists or JIRA.

There are a lot of exciting new features added to Spark 3.0, including
Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware
Scheduling, Data Source API with Catalog Supports, Vectorization in SparkR,
support of Hadoop 3/JDK 11/Scala 2.12, and many more. For a full list of
major features and changes in Spark 3.0.0-preview, please check the thread
(http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-feature-list-and-major-changes-td28050.html).

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 3.0.0-preview, head over to the download page:
https://archive.apache.org/dist/spark/spark-3.0.0-preview

Thanks,

Xingbo


Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-04 Thread Xingbo Jiang
This vote passes! I'll follow up with a formal release announcement soon.

+1:
Sean Owen (binding)
Wenchen Fan (binding)
Hyukjin Kwon (binding)
Dongjoon Hyun (binding)
Takeshi Yamamuro

+0: None

-1: None

Thanks, everyone!

Xingbo

On Mon, Nov 4, 2019 at 9:35 AM Dongjoon Hyun 
wrote:

> Hi, Xingbo.
>
> Could you sent a vote result email to finalize this vote, please?
>
> Bests,
> Dongjoon.
>
> On Fri, Nov 1, 2019 at 2:55 PM Takeshi Yamamuro 
> wrote:
>
>> +1, too.
>>
>> On Sat, Nov 2, 2019 at 3:36 AM Hyukjin Kwon  wrote:
>>
>>> +1
>>>
>>> On Fri, 1 Nov 2019, 15:36 Wenchen Fan,  wrote:
>>>
>>>> The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7
>>>> is more stable and we should make releases using 2.7 by default.
>>>>
>>>> +1
>>>>
>>>> On Fri, Nov 1, 2019 at 7:16 AM Xiao Li  wrote:
>>>>
>>>>> Spark 3.0 will still use the Hadoop 2.7 profile by default, I think.
>>>>> Hadoop 2.7 profile is much more stable than Hadoop 3.2 profile.
>>>>>
>>>>> On Thu, Oct 31, 2019 at 3:54 PM Sean Owen  wrote:
>>>>>
>>>>>> This isn't a big thing, but I see that the pyspark build includes
>>>>>> Hadoop 2.7 rather than 3.2. Maybe later we change the build to put in
>>>>>> 3.2 by default.
>>>>>>
>>>>>> Otherwise, the tests all seems to pass with JDK 8 / 11 with all
>>>>>> profiles enabled, so I'm +1 on it.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 31, 2019 at 1:00 AM Xingbo Jiang 
>>>>>> wrote:
>>>>>> >
>>>>>> > Please vote on releasing the following candidate as Apache Spark
>>>>>> version 3.0.0-preview.
>>>>>> >
>>>>>> > The vote is open until November 3 PST and passes if a majority +1
>>>>>> PMC votes are cast, with
>>>>>> > a minimum of 3 +1 votes.
>>>>>> >
>>>>>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
>>>>>> > [ ] -1 Do not release this package because ...
>>>>>> >
>>>>>> > To learn more about Apache Spark, please see
>>>>>> http://spark.apache.org/
>>>>>> >
>>>>>> > The tag to be voted on is v3.0.0-preview-rc2 (commit
>>>>>> 007c873ae34f58651481ccba30e8e2ba38a692c4):
>>>>>> > https://github.com/apache/spark/tree/v3.0.0-preview-rc2
>>>>>> >
>>>>>> > The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> >
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-bin/
>>>>>> >
>>>>>> > Signatures used for Spark RCs can be found in this file:
>>>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>> >
>>>>>> > The staging repository for this release can be found at:
>>>>>> >
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1336/
>>>>>> >
>>>>>> > The documentation corresponding to this release can be found at:
>>>>>> >
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-docs/
>>>>>> >
>>>>>> > The list of bug fixes going into 3.0.0 can be found at the
>>>>>> following URL:
>>>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>>>> >
>>>>>> > FAQ
>>>>>> >
>>>>>> > =
>>>>>> > How can I help test this release?
>>>>>> > =
>>>>>> >
>>>>>> > If you are a Spark user, you can help us test this release by taking
>>>>>> > an existing Spark workload and running on this release candidate,
>>>>>> then
>>>>>> > reporting any regressions.
>>>>>> >
>>>>>> > If you're working in PySpark you can set up a virtual env and
>>>>>> install
>>>>>> > the current RC and see if anything important breaks, in the
>>>>>> Java/Scala
>>>>>> > you can add the staging

[VOTE] SPARK 3.0.0-preview (RC2)

2019-10-31 Thread Xingbo Jiang
Please vote on releasing the following candidate as Apache Spark version
3.0.0-preview.

The vote is open until November 3 PST and passes if a majority +1 PMC votes
are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0-preview
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-preview-rc2 (commit
007c873ae34f58651481ccba30e8e2ba38a692c4):
https://github.com/apache/spark/tree/v3.0.0-preview-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1336/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12339177

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).

===
What should happen to JIRA tickets still targeting 3.0.0?
===

The current list of open tickets targeted at 3.0.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.0.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-30 Thread Xingbo Jiang
I was trying to avoid changing the version names and reverting the changes on
master again. But you are right that it might lead to confusion about which
release script was used for RC2; I'll follow your advice and create a new RC2 tag.

Thanks!

Xingbo

On Wed, Oct 30, 2019 at 5:06 PM Dongjoon Hyun 
wrote:

> Hi, Xingbo.
>
> Currently, RC2 tag is pointing RC1 tag.
>
> https://github.com/apache/spark/tree/v3.0.0-preview-rc2
>
> Could you cut from the HEAD of master branch?
> Otherwise, nobody knows what release script you used for RC2.
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Oct 30, 2019 at 4:15 PM Xingbo Jiang 
> wrote:
>
>> Hi all,
>>
>> This RC fails because:
>> It fails to generate a PySpark release.
>>
>> I'll start RC2 soon.
>>
>> Thanks!
>>
>> Xingbo
>>
>>
>> On Wed, Oct 30, 2019 at 4:10 PM Xingbo Jiang 
>> wrote:
>>
>>> Thanks Sean. Since we need to generate the PySpark release with a different
>>> name, I would prefer to fail RC1 and start another release candidate.
>>>
>>> Sean Owen wrote on Wed, Oct 30, 2019 at 4:00 PM:
>>>
>>>> I agree that we need a Pyspark release for this preview release. If
>>>> it's a matter of producing it from the same tag, we can evaluate it
>>>> within this same release candidate. Otherwise, just roll another
>>>> release candidate.
>>>>
>>>> I was able to build it and pass all tests with JDK 8 and JDK 11
>>>> (hadoop-3.2 profile, note) on Ubuntu, so this is otherwise looking
>>>> good to me.
>>>>
>>>> On Tue, Oct 29, 2019 at 9:01 PM Xingbo Jiang 
>>>> wrote:
>>>> >
>>>> > Please vote on releasing the following candidate as Apache Spark
>>>> version 3.0.0-preview.
>>>> >
>>>> > The vote is open until November 2 PST and passes if a majority +1 PMC
>>>> votes are cast, with
>>>> > a minimum of 3 +1 votes.
>>>> >
>>>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
>>>> > [ ] -1 Do not release this package because ...
>>>> >
>>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>>> >
>>>> > The tag to be voted on is v3.0.0-preview-rc1 (commit
>>>> 5eddbb5f1d9789696927f435c55df887e50a1389):
>>>> > https://github.com/apache/spark/tree/v3.0.0-preview-rc1
>>>> >
>>>> > The release files, including signatures, digests, etc. can be found
>>>> at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
>>>> >
>>>> > Signatures used for Spark RCs can be found in this file:
>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> >
>>>> > The staging repository for this release can be found at:
>>>> >
>>>> https://repository.apache.org/content/repositories/orgapachespark-1334/
>>>> >
>>>> > The documentation corresponding to this release can be found at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
>>>> >
>>>> > The list of bug fixes going into 3.0.0 can be found at the following
>>>> URL:
>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>> >
>>>> > FAQ
>>>> >
>>>> > =
>>>> > How can I help test this release?
>>>> > =
>>>> >
>>>> > If you are a Spark user, you can help us test this release by taking
>>>> > an existing Spark workload and running on this release candidate, then
>>>> > reporting any regressions.
>>>> >
>>>> > If you're working in PySpark you can set up a virtual env and install
>>>> > the current RC and see if anything important breaks, in the Java/Scala
>>>> > you can add the staging repository to your projects resolvers and test
>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>> > you don't end up building with an out-of-date RC going forward).
>>>> >
>>>> > ===
>>>> > What should happen to JIRA tickets still targeting 3.0.0?
>>>> > ===
>>>> >
>>>> > The current list of open tickets targeted at 3.0.0 can be found at:
>>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.0.0
>>>> >
>>>> > Committers should look at those and triage. Extremely important bug
>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>> > be worked on immediately. Everything else please retarget to an
>>>> > appropriate release.
>>>> >
>>>> > ==
>>>> > But my bug isn't fixed?
>>>> > ==
>>>> >
>>>> > In order to make timely releases, we will typically not hold the
>>>> > release unless the bug in question is a regression from the previous
>>>> > release. That being said, if there is something which is a regression
>>>> > that has not been correctly targeted please ping me or a committer to
>>>> > help target the issue.
>>>>
>>>


Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-30 Thread Xingbo Jiang
Hi all,

This RC fails because:
It fails to generate a PySpark release.

I'll start RC2 soon.

Thanks!

Xingbo


On Wed, Oct 30, 2019 at 4:10 PM Xingbo Jiang  wrote:

> Thanks Sean. Since we need to generate the PySpark release with a different
> name, I would prefer to fail RC1 and start another release candidate.
>
> Sean Owen wrote on Wed, Oct 30, 2019 at 4:00 PM:
>
>> I agree that we need a Pyspark release for this preview release. If
>> it's a matter of producing it from the same tag, we can evaluate it
>> within this same release candidate. Otherwise, just roll another
>> release candidate.
>>
>> I was able to build it and pass all tests with JDK 8 and JDK 11
>> (hadoop-3.2 profile, note) on Ubuntu, so this is otherwise looking
>> good to me.
>>
>> On Tue, Oct 29, 2019 at 9:01 PM Xingbo Jiang 
>> wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 3.0.0-preview.
>> >
>> > The vote is open until November 2 PST and passes if a majority +1 PMC
>> votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v3.0.0-preview-rc1 (commit
>> 5eddbb5f1d9789696927f435c55df887e50a1389):
>> > https://github.com/apache/spark/tree/v3.0.0-preview-rc1
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1334/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
>> >
>> > The list of bug fixes going into 3.0.0 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.0.0?
>> > ===
>> >
>> > The current list of open tickets targeted at 3.0.0 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.0
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>


Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-30 Thread Xingbo Jiang
Thanks Sean. Since we need to generate the PySpark release with a different
name, I would prefer to fail RC1 and start another release candidate.

Sean Owen wrote on Wed, Oct 30, 2019 at 4:00 PM:

> I agree that we need a Pyspark release for this preview release. If
> it's a matter of producing it from the same tag, we can evaluate it
> within this same release candidate. Otherwise, just roll another
> release candidate.
>
> I was able to build it and pass all tests with JDK 8 and JDK 11
> (hadoop-3.2 profile, note) on Ubuntu, so this is otherwise looking
> good to me.
>
> On Tue, Oct 29, 2019 at 9:01 PM Xingbo Jiang 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 3.0.0-preview.
> >
> > The vote is open until November 2 PST and passes if a majority +1 PMC
> votes are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v3.0.0-preview-rc1 (commit
> 5eddbb5f1d9789696927f435c55df887e50a1389):
> > https://github.com/apache/spark/tree/v3.0.0-preview-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1334/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
> >
> > The list of bug fixes going into 3.0.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.0.0?
> > ===
> >
> > The current list of open tickets targeted at 3.0.0 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.0
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>


Re: Packages to release in 3.0.0-preview

2019-10-30 Thread Xingbo Jiang
Scala 2.13 support is tracked by
https://issues.apache.org/jira/browse/SPARK-25075 ; at the current time
there are still major issues remaining, so we don't include Scala 2.13
support in the 3.0.0-preview release.
If the task is finished before the code freeze of Spark 3.0.0, it's
still possible to release Spark 3.0.0 with Scala 2.13 packages.

Cheers,

Xingbo

antonkulaga wrote on Wed, Oct 30, 2019 at 3:36 PM:

> Why not try the current Scala (2.13)? Spark has always been one
> (sometimes
> - two) Scala versions away from the whole Scala ecosystem and it has always
> been a big pain point for everybody. I understand that in the past you
> could
> not switch because of compatibility issues, but 3.x is a major version
> update and you can break things, maybe you can finally consider to use the
> current Scala?
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-29 Thread Xingbo Jiang
Thanks Dongjoon for reporting this issue. I checked, and it seems the
release script uses `SPARK_VERSION` instead of the version specified in
`version.py`, so it's not able to generate a PySpark release tarball. I
submitted a PR (https://github.com/apache/spark/pull/26306) proposing a
workaround for the issue. Please review and decide whether it is feasible.

Thanks again!

Xingbo

Dongjoon Hyun  于2019年10月29日周二 下午8:17写道:

> Hi, Xingbo.
>
> PySpark seems to fail to build. There is only `sha512`.
>
> SparkR_3.0.0-preview.tar.gz
> SparkR_3.0.0-preview.tar.gz.asc
> SparkR_3.0.0-preview.tar.gz.sha512
> *pyspark-3.0.0.preview.tar.gz.sha512*
> spark-3.0.0-preview-bin-hadoop2.7.tgz
> spark-3.0.0-preview-bin-hadoop2.7.tgz.asc
> spark-3.0.0-preview-bin-hadoop2.7.tgz.sha512
> spark-3.0.0-preview-bin-hadoop3.2.tgz
> spark-3.0.0-preview-bin-hadoop3.2.tgz.asc
> spark-3.0.0-preview-bin-hadoop3.2.tgz.sha512
> spark-3.0.0-preview-bin-without-hadoop.tgz
> spark-3.0.0-preview-bin-without-hadoop.tgz.asc
> spark-3.0.0-preview-bin-without-hadoop.tgz.sha512
> spark-3.0.0-preview.tgz
> spark-3.0.0-preview.tgz.asc
> spark-3.0.0-preview.tgz.sha512
>
>
> Bests,
> Dongjoon.
>
>
> On Tue, Oct 29, 2019 at 7:18 PM Xingbo Jiang 
> wrote:
>
>> Thanks for the correction, we shall remove the statement
>>>
>>> Everything else please retarget to an appropriate release.
>>>
>>
>> Reynold Xin  于2019年10月29日周二 下午7:09写道:
>>
>>> Does the description make sense? This is a preview release so there is
>>> no need to retarget versions.
>>>
>>> On Tue, Oct 29, 2019 at 7:01 PM Xingbo Jiang 
>>> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 3.0.0-preview.
>>>>
>>>> The vote is open until November 2 PST and passes if a majority +1 PMC
>>>> votes are cast, with
>>>> a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 3.0.0-preview
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v3.0.0-preview-rc1 (commit
>>>> 5eddbb5f1d9789696927f435c55df887e50a1389):
>>>> https://github.com/apache/spark/tree/v3.0.0-preview-rc1
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1334/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
>>>>
>>>> The list of bug fixes going into 3.0.0 can be found at the following
>>>> URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>>
>>>> FAQ
>>>>
>>>> =
>>>> How can I help test this release?
>>>> =
>>>>
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC and see if anything important breaks; in Java/Scala
>>>> you can add the staging repository to your project's resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you don't end up building with an out-of-date RC going forward).
>>>>
>>>> ===
>>>> What should happen to JIRA tickets still targeting 3.0.0?
>>>> ===
>>>>
>>>> The current list of open tickets targeted at 3.0.0 can be found at:
>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.0.0
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>> be worked on immediately. Everything else please retarget to an
>>>> appropriate release.
>>>>
>>>> ==
>>>> But my bug isn't fixed?
>>>> ==
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from the previous
>>>> release. That being said, if there is something which is a regression
>>>> that has not been correctly targeted please ping me or a committer to
>>>> help target the issue.
>>>>
>>>


Re: [VOTE] SPARK 3.0.0-preview (RC1)

2019-10-29 Thread Xingbo Jiang
Thanks for the correction; we shall remove the statement:
>
> Everything else please retarget to an appropriate release.
>

Reynold Xin  于2019年10月29日周二 下午7:09写道:

> Does the description make sense? This is a preview release so there is no
> need to retarget versions.
>
> On Tue, Oct 29, 2019 at 7:01 PM Xingbo Jiang 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.0.0-preview.
>>
>> The vote is open until November 2 PST and passes if a majority +1 PMC
>> votes are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.0.0-preview
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.0.0-preview-rc1 (commit
>> 5eddbb5f1d9789696927f435c55df887e50a1389):
>> https://github.com/apache/spark/tree/v3.0.0-preview-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1334/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/
>>
>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.0.0?
>> ===
>>
>> The current list of open tickets targeted at 3.0.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>


[VOTE] SPARK 3.0.0-preview (RC1)

2019-10-29 Thread Xingbo Jiang
Please vote on releasing the following candidate as Apache Spark version
3.0.0-preview.

The vote is open until November 2 PST and passes if a majority +1 PMC votes
are cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.0-preview
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.0.0-preview-rc1 (commit
5eddbb5f1d9789696927f435c55df887e50a1389):
https://github.com/apache/spark/tree/v3.0.0-preview-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1334/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc1-docs/

The list of bug fixes going into 3.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12339177

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks; in Java/Scala
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
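
For the Java/Scala path, here is a minimal build.sbt sketch for compiling and
testing a downstream project against this RC via the staging repository listed
above. The Scala version and artifact coordinates below are assumptions based
on Spark's usual naming, not anything stated in this thread:

```scala
// build.sbt -- minimal sketch for testing a downstream project against the RC.
// The staging repository URL is the one listed above; everything else is an assumption.
scalaVersion := "2.12.10"

resolvers += "Apache Spark 3.0.0-preview RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1334/"

// spark-sql pulls in spark-core transitively; "provided" keeps it out of your assembly jar.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0-preview" % "provided"
```

Remember to clear the Spark entries from your local artifact cache between
release candidates (for example, the org.apache.spark directories under
~/.ivy2/cache and ~/.m2/repository), as noted above.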

===
What should happen to JIRA tickets still targeting 3.0.0?
===

The current list of open tickets targeted at 3.0.0 can be found at:
https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.0.0

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else please retarget to an
appropriate release.

==
But my bug isn't fixed?
==

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted please ping me or a committer to
help target the issue.


Packages to release in 3.0.0-preview

2019-10-25 Thread Xingbo Jiang
Hi all,

I would like to start a discussion on which package combinations shall be
released in 3.0.0-preview. The ones I can think of now:

* scala 2.12 + hadoop 2.7
* scala 2.12 + hadoop 3.2
* scala 2.12 + hadoop 3.2 + JDK 11

Do you have other combinations to add to the above list?

Cheers,

Xingbo


Unable to resolve dependency of sbt-mima-plugin since yesterday

2019-10-22 Thread Xingbo Jiang
Hi,

Do you have any idea why the `./dev/lint-scala` check has been failing with
the following message since yesterday?

WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.ivy.util.url.IvyAuthenticator
> (file:/home/runner/work/spark/spark/build/sbt-launch-0.13.18.jar) to field
> java.net.Authenticator.theAuthenticator
> WARNING: Please consider reporting this to the maintainers of
> org.apache.ivy.util.url.IvyAuthenticator
> WARNING: Use --illegal-access=warn to enable warnings of further illegal
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> Scalastyle checks failed at following occurrences:
> [error] (*:update) sbt.ResolveException: unresolved dependency:
> com.typesafe#sbt-mima-plugin;0.3.0: not found
> ##[error]Process completed with exit code 1.
>

I'm not able to reproduce the failure in my local environment, but it seems
all the open PRs are failing on this check.

Thanks,

Xingbo
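
For anyone hitting the same unresolved dependency locally, one hypothetical
workaround is to pin the plugin and point sbt at an explicit Ivy-style plugin
repository in project/plugins.sbt. This is only a sketch: the repository URL
is an assumption about where the artifact is hosted, and it is not a confirmed
root cause or fix for the PR builder failure.

```scala
// project/plugins.sbt -- hypothetical workaround sketch, not a confirmed fix.
// Declare the plugin explicitly and add an Ivy-style plugin resolver in case
// the default plugin repository is the part that is failing to resolve it.
resolvers += Resolver.url(
  "sbt-plugin-releases-mirror", // arbitrary name
  url("https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/") // assumed mirror location
)(Resolver.ivyStylePatterns)

addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.3.0")
```

If the root cause turns out to be a temporary outage of the upstream plugin
repository, no build change should be needed once it recovers.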


Re: [PMCs] Any project news or announcements this week?

2019-10-22 Thread Xingbo Jiang
I'm working with Wenchen to make a release candidate, but haven't been
successful with our release script. I think we shall be able to make a
Spark 3.0 preview release this week, though.

cc @Sally

Cheers,

Xingbo

Sean Owen  于2019年10月20日周日 下午11:04写道:

> I wonder if we are likely to have a Spark 3.0 preview release this
> week? no rush, but if we do, let's CC Sally to maybe mention at
> ApacheCon.
>
> -- Forwarded message -
> From: Sally Khudairi 
> Date: Sun, Oct 20, 2019 at 4:00 PM
> Subject: [PMCs] Any project news or announcements this week?
> To: ASF Marketing & Publicity 
>
>
> Hello Apache PMCs!
>
> With ApacheCon taking place in Berlin this week, some attending
> journalists are interested in covering any project news.
>
> If your PMC has any newsworthy releases, milestones, etc., taking
> place over the next few days, do let me know. It would also be good to
> know if you'll be at the conference as well.
>
> Many thanks,
> Sally
>
> - - -
> Vice President Marketing & Publicity
> Vice President Sponsor Relations
> The Apache Software Foundation
>
> Tel +1 617 921 8656 | s...@apache.org
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-17 Thread Xingbo Jiang
I've deleted the branch-3.0-preview branch and added the `3.0.0-preview` tag
to master (https://github.com/apache/spark/releases/tag/3.0.0-preview).
I'll be working on making an RC now.

Cheers,

Xingbo

Sean Owen  于2019年10月17日周四 下午4:23写道:

> Sure, if that works, that's a simpler solution. The preview release is
> like an RC of the master branch itself.
> Are there any issues with that approach right now?
> Yes if it turns out that we can't get a reasonably stable release off
> master, then we can branch and cherry-pick. We'd have to retain the
> branch though.
>
> On Thu, Oct 17, 2019 at 12:28 AM Xingbo Jiang 
> wrote:
> >
> > How about adding a `3.0.0-preview` tag on the master branch, and claiming
> that for the preview release we won't consider bugs introduced by new
> features merged into master after the first preview RC? This would rule out
> the risk that we keep importing new commits and need to resolve more
> critical bugs, so the release never converges.
> >
> > Cheers,
> >
> > Xingbo
> >
> > Sean Owen  于2019年10月16日周三 下午6:34写道:
> >>
> >> We do not have to do anything to branch-3.0-preview; it's just for the
> >> convenience of the RM. Just continue to merge to master for 3.0.
> >>
> >> If it happens that some state of the master branch works as a preview
> >> release, sure, just tag and release. We might get away with it. But if
> >> for example we have a small issue to fix with the preview and
> >> meanwhile something else has landed in the master branch that doesn't
> >> work, we'll struggle to get an RC out. I agree, that would be nice to
> >> not deal with this as a branch yet.
> >>
> >> But if we do: Yeah I figured the merge script would pick it up, which
> >> is a little annoying, but you can still just type branch-2.4.
> >> I think we have to retain the branch though if there are any
> >> cherry-picks, to record the state of the release.
> >>
> >> We don't want a "3.0-preview" version in JIRA. Let's fix the script if
> we must.
> >>
> >> So, I take it that the current preview RC didn't work. What if we
> >> delete that branch and try again from master? does that work?
> >>
> >> On Wed, Oct 16, 2019 at 11:19 AM Dongjoon Hyun 
> wrote:
> >> >
> >> > Technically, `branch-3.0-preview` has many issues.
> >> >
> >> > First of all, are we going to delete `branch-3.0-preview` after
> releasing `3.0-preview`?
> >> > I guess we didn't delete old branches (including feature branches
> like jdbc, yarn branches)
> >> >
> >> > Second, our merge script already starts to show `branch-3.0-preview`
> instead of `branch-2.4` already.
> >> > Currently, We need to merge to `master` -> `branch-3.0-preview` ->
> `branch-2.4`.
> >> > This already creates a burden to maintain our LTS branch `branch-2.4`.
> >> >
> >> > Third, during updating JIRA, our merge script starts to fail because
> it extracts the version number from `branch-3.0-preview` but Apache JIRA
> doesn't have a version `3.0-preview`. Are we going to add a release version
> at `Apache Spark JIRA`?
> >> > (I'm -1 for this. `Fixed Version: 3.0-preview` seems to be overkill).
> >> >
> >> > If we are reluctant to have `branch-3.0` because it has a meaning of
> `feature` and its merging cost, I'm +1 for tag on `master` (Reynold's
> suggestion)
> >> >
> >> > We can do vote and stabilize `3.0-alpha` in master branch.
> >> >
> >> > Bests,
> >> > Dongjoon.
> >> >
> >> >
> >> > On Wed, Oct 16, 2019 at 3:04 AM Sean Owen  wrote:
> >> >>
> >> >> I don't think we would want to cut 'branch-3.0' right now, which
> would
> >> >> imply that master is 3.1. We don't want to merge every new change
> into
> >> >> two branches.
> >> >> It may still be useful to have `branch-3.0-preview` as a short-lived
> >> >> branch just used to manage the preview release, as we will need to
> let
> >> >> development on 3.0 in master continue while stabilizing the preview
> >> >> release with a few selected cherry-picks, but that's only of concern
> >> >> to the release manager.
> >> >>
> >> >> On Wed, Oct 16, 2019 at 2:01 AM Xingbo Jiang 
> wrote:
> >> >> >
> >> >> > Hi Dongjoon,
> >> >> >
> >> >> > I'm not sure about the best practice of maintain

Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-17 Thread Xingbo Jiang
How about adding a `3.0.0-preview` tag on the master branch, and claiming that
for the preview release we won't consider bugs introduced by new features
merged into master after the first preview RC? This would rule out the risk
that we keep importing new commits and need to resolve more critical bugs, so
the release never converges.

Cheers,

Xingbo

Sean Owen  于2019年10月16日周三 下午6:34写道:

> We do not have to do anything to branch-3.0-preview; it's just for the
> convenience of the RM. Just continue to merge to master for 3.0.
>
> If it happens that some state of the master branch works as a preview
> release, sure, just tag and release. We might get away with it. But if
> for example we have a small issue to fix with the preview and
> meanwhile something else has landed in the master branch that doesn't
> work, we'll struggle to get an RC out. I agree, that would be nice to
> not deal with this as a branch yet.
>
> But if we do: Yeah I figured the merge script would pick it up, which
> is a little annoying, but you can still just type branch-2.4.
> I think we have to retain the branch though if there are any
> cherry-picks, to record the state of the release.
>
> We don't want a "3.0-preview" version in JIRA. Let's fix the script if we
> must.
>
> So, I take it that the current preview RC didn't work. What if we
> delete that branch and try again from master? does that work?
>
> On Wed, Oct 16, 2019 at 11:19 AM Dongjoon Hyun 
> wrote:
> >
> > Technically, `branch-3.0-preview` has many issues.
> >
> > First of all, are we going to delete `branch-3.0-preview` after
> releasing `3.0-preview`?
> > I guess we didn't delete old branches (including feature branches like
> jdbc, yarn branches)
> >
> > Second, our merge script already starts to show `branch-3.0-preview`
> instead of `branch-2.4` already.
> > Currently, We need to merge to `master` -> `branch-3.0-preview` ->
> `branch-2.4`.
> > This already creates a burden to maintain our LTS branch `branch-2.4`.
> >
> > Third, during updating JIRA, our merge script starts to fail because it
> extracts the version number from `branch-3.0-preview` but Apache JIRA
> doesn't have a version `3.0-preview`. Are we going to add a release version
> at `Apache Spark JIRA`?
> > (I'm -1 for this. `Fixed Version: 3.0-preview` seems to be overkill).
> >
> > If we are reluctant to have `branch-3.0` because it has a meaning of
> `feature` and its merging cost, I'm +1 for tag on `master` (Reynold's
> suggestion)
> >
> > We can do vote and stabilize `3.0-alpha` in master branch.
> >
> > Bests,
> > Dongjoon.
> >
> >
> > On Wed, Oct 16, 2019 at 3:04 AM Sean Owen  wrote:
> >>
> >> I don't think we would want to cut 'branch-3.0' right now, which would
> >> imply that master is 3.1. We don't want to merge every new change into
> >> two branches.
> >> It may still be useful to have `branch-3.0-preview` as a short-lived
> >> branch just used to manage the preview release, as we will need to let
> >> development on 3.0 in master continue while stabilizing the preview
> >> release with a few selected cherry-picks, but that's only of concern
> >> to the release manager.
> >>
> >> On Wed, Oct 16, 2019 at 2:01 AM Xingbo Jiang 
> wrote:
> >> >
> >> > Hi Dongjoon,
> >> >
> >> > I'm not sure about the best practice for maintaining a preview release
> branch. Since new features might still go into Spark 3.0 after the preview
> release, I guess it might make more sense to have separate branches for
> 3.0.0 and 3.0-preview.
> >> >
> >> > However, I'm open to both solutions; if we really want to reuse the
> branch to also release Spark 3.0.0, then I would be happy to create a new
> one.
> >> >
> >> > Thanks!
> >> >
> >> > Xingbo
> >> >
> >> > Dongjoon Hyun  于2019年10月16日周三 上午6:26写道:
> >> >>
> >> >> Hi,
> >> >>
> >> >> It seems that we have `branch-3.0-preview` branch.
> >> >>
> >> >> https://github.com/apache/spark/commits/branch-3.0-preview
> >> >>
> >> >> Can we have `branch-3.0` instead of `branch-3.0-preview`?
> >> >>
> >> >> We can tag `v3.0.0-preview` on `branch-3.0` and continue to use for
> `v3.0.0` later.
> >> >>
> >> >> Bests,
> >> >> Dongjoon.
>


Re: branch-3.0 vs branch-3.0-preview (?)

2019-10-16 Thread Xingbo Jiang
Hi Dongjoon,

I'm not sure about the best practice for maintaining a preview release
branch. Since new features might still go into Spark 3.0 after the preview
release, I guess it might make more sense to have separate branches for
3.0.0 and 3.0-preview.

However, I'm open to both solutions; if we really want to reuse the branch
to also release Spark 3.0.0, then I would be happy to create a new one.

Thanks!

Xingbo

Dongjoon Hyun  于2019年10月16日周三 上午6:26写道:

> Hi,
>
> It seems that we have `branch-3.0-preview` branch.
>
> https://github.com/apache/spark/commits/branch-3.0-preview
>
> Can we have `branch-3.0` instead of `branch-3.0-preview`?
>
> We can tag `v3.0.0-preview` on `branch-3.0` and continue to use for
> `v3.0.0` later.
>
> Bests,
> Dongjoon.
>


Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread Xingbo Jiang
Hi all,

Here is the updated feature list:


SPARK-11215  Multiple
columns support added to various Transformers: StringIndexer

SPARK-11150  Implement
Dynamic Partition Pruning

SPARK-13677  Support
Tree-Based Feature Transformation

SPARK-16692  Add
MultilabelClassificationEvaluator

SPARK-19591  Add sample
weights to decision trees

SPARK-19712  Pushing
Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827  R API for
Power Iteration Clustering

SPARK-20286  Improve
logic for timing out executors in dynamic allocation

SPARK-20636  Eliminate
unnecessary shuffle with adjacent Window expressions

SPARK-22148  Acquire new
executors to avoid hang because of blacklisting

SPARK-22796  Multiple
columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128  A new
approach to do adaptive execution in Spark SQL

SPARK-23155  Apply
custom log URL pattern for executor log URLs in SHS

SPARK-23539  Add support
for Kafka headers

SPARK-23674  Add Spark
ML Listener for Tracking ML Pipeline Status

SPARK-23710  Upgrade the
built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333  Add fit
with validation set to Gradient Boosted Trees: Python API

SPARK-24417  Build and
Run Spark on JDK11

SPARK-24615 
Accelerator-aware task scheduling for Spark

SPARK-24920  Allow
sharing Netty's memory pool allocators

SPARK-25250  Fix race
condition with tasks running when new attempt for same stage is created
leads to other task in the next attempt running on the same partition id
retry multiple times

SPARK-25341  Support
rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348  Data source
for binary files

SPARK-25390  data source
V2 API refactoring

SPARK-25501  Add Kafka
delegation token support

SPARK-25603  Generalize
Nested Column Pruning

SPARK-26132  Remove
support for Scala 2.11 in Spark 3.0.0

SPARK-26215  define
reserved keywords after SQL standard

SPARK-26412  Allow
Pandas UDF to take an iterator of pd.DataFrames

SPARK-26651  Use
Proleptic Gregorian calendar

SPARK-26759  Arrow
optimization in SparkR's interoperability

SPARK-26848  Introduce
new option to Kafka source: offset by timestamp (starting/ending)

SPARK-27064  create
StreamingWrite at the beginning of streaming execution

SPARK-27119  Do not
infer schema when reading Hive serde table with native data source

SPARK-27225  Implement
join strategy hints

SPARK-27240  Use pandas
DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338  Fix
deadlock between TaskMemoryManager and
UnsafeExternalSorter$SpillableIterator

SPARK-27396  Public APIs
for extended Columnar Processing Support

SPARK-27463  Support
Dataframe Cogroup via Pandas UDFs

SPARK-27589 
Re-implement file sources with data source V2 API

SPARK-27677 
Disk-persisted RDD blocks served by shuffle service, and ignored for
Dynamic Allocation

SPARK-27699 

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Xingbo Jiang
>
>  What's the process to propose a feature to be included in the final Spark
> 3.0 release?
>

I don't know of any specific process here; normally you just merge the
feature into Spark master before the release code freeze, and then the
feature will most likely be included in the release. The code freeze date
for Spark 3.0 has not been decided yet, though.

Li Jin  于2019年10月8日周二 下午2:14写道:

> Thanks for summary!
>
> I have a question that is semi-related - What's the process to propose a
> feature to be included in the final Spark 3.0 release?
>
> In particular, I am interested in
> https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do the
> work so want to make sure I don't miss the "cut" date.
>
> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang  wrote:
>
>> Hi all,
>>
>> Thanks for all the feedbacks, here is the updated feature list:
>>
>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple
>> columns support added to various Transformers: StringIndexer
>>
>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>> Implement Dynamic Partition Pruning
>>
>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support
>> Tree-Based Feature Transformation
>>
>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>> MultilabelClassificationEvaluator
>>
>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>> sample weights to decision trees
>>
>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing
>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>
>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API
>> for Power Iteration Clustering
>>
>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve
>> logic for timing out executors in dynamic allocation
>>
>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>> Eliminate unnecessary shuffle with adjacent Window expressions
>>
>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire
>> new executors to avoid hang because of blacklisting
>>
>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple
>> columns support added to various Transformers: PySpark QuantileDiscretizer
>>
>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
>> approach to do adaptive execution in Spark SQL
>>
>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply
>> custom log URL pattern for executor log URLs in SHS
>>
>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>> support for Kafka headers
>>
>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>> Spark ML Listener for Tracking ML Pipeline Status
>>
>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade
>> the built-in Hive to 2.3.5 for hadoop-3.2
>>
>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit
>> with validation set to Gradient Boosted Trees: Python API
>>
>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build
>> and Run Spark on JDK11
>>
>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>> Accelerator-aware task scheduling for Spark
>>
>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
>> sharing Netty's memory pool allocators
>>
>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race
>> condition with tasks running when new attempt for same stage is created
>> leads to other task in the next attempt running on the same partition id
>> retry multiple times
>>
>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support
>> rolling back a shuffle map stage and re-generate the shuffle files
>>
>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>> source for binary files
>>
>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>> kafka delegation token support
>>
>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>> Generalize Nested Column Pruning
>>
>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove
>> support for Scala 2.11 in Spark 3.0.0
>>
>> SPARK-26215 <https://issues.apache.org/jira/browse/SP

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Xingbo Jiang
andas UDFs

SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
Re-implement file sources with data source V2 API

SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
Disk-persisted RDD blocks served by shuffle service, and ignored for
Dynamic Allocation

SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially
push down disjunctive predicates in Parquet/ORC

SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test
cases from PostgreSQL to Spark SQL

SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate
Python 2 support

SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert
applicable *.sql tests into UDF integrated test base

SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
dynamic allocation without an external shuffle service

SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post
shuffle partition number in adaptive execution

SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
Trigger implementations to Triggers.scala and avoid exposing these to the
end users

SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document
Spark WEB UI

SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
RobustScaler feature transformer

SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata
Handling in Thrift Server

SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL
reference doc

SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve
test coverage of ThriftServer

SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically
reuse subqueries in AQE

SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove
outdated Experimental, Evolving annotations
SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908> SPARK-28980
<https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items
since <= 2.2.0

Cheers,

Xingbo

Hyukjin Kwon  于2019年10月7日周一 下午9:29写道:

> Cogroup Pandas UDF missing:
>
> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support
> Dataframe Cogroup via Pandas UDFs
> Vectorized R execution:
>
> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
> optimization in SparkR's interoperability
>
>
> 2019년 10월 8일 (화) 오전 7:50, Jungtaek Lim 님이
> 작성:
>
>> Thanks for bringing the nice summary of Spark 3.0 improvements!
>>
>> I'd like to add some items from structured streaming side,
>>
>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
>> Trigger implementations to Triggers.scala and avoid exposing these to the
>> end users (removal of deprecated)
>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>> support for Kafka headers in Structured Streaming
>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>> kafka delegation token support (there were follow-up issues to add
>> functionalities like support multi clusters, etc.)
>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848>
>> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn
>> message on possible correctness issue for multiple stateful operations in
>> single query
>>
>> and core side,
>>
>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New
>> feature: apply custom log URL pattern for executor log URLs in SHS
>> (follow-up issue expanded the functionality to Spark UI as well)
>>
>> FYI if we count on current work in progress, there's ongoing umbrella
>> issue regarding rolling event log & snapshot (SPARK-28594
>> <https://issues.apache.org/jira/browse/SPARK-28594>) which we struggle
>> to get things done in Spark 3.0.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang 
>> wrote:
>>
>>> Hi all,
>>>
>>> I went over all the finished JIRA tickets targeted to Spark 3.0.0, here
>>> I'm listing all the notable features and major changes that are ready to
>>> test/deliver, please don't hesitate to add more to the list:
>>>
>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215>
>>> Multiple columns support added to various Transformers: StringIndexer
>>>
>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>>> Implement Dynamic Partition Pruning
>>>
>>> SPARK-1

Spark 3.0 preview release feature list and major changes

2019-10-07 Thread Xingbo Jiang
Hi all,

I went over all the finished JIRA tickets targeted to Spark 3.0.0, here I'm
listing all the notable features and major changes that are ready to
test/deliver, please don't hesitate to add more to the list:

SPARK-11215  Multiple
columns support added to various Transformers: StringIndexer

SPARK-11150  Implement
Dynamic Partition Pruning

SPARK-13677  Support
Tree-Based Feature Transformation

SPARK-16692  Add
MultilabelClassificationEvaluator

SPARK-19591  Add sample
weights to decision trees

SPARK-19712  Pushing
Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827  R API for
Power Iteration Clustering

SPARK-20286  Improve
logic for timing out executors in dynamic allocation

SPARK-20636  Eliminate
unnecessary shuffle with adjacent Window expressions

SPARK-22148  Acquire new
executors to avoid hang because of blacklisting

SPARK-22796  Multiple
columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128  A new
approach to do adaptive execution in Spark SQL

SPARK-23674  Add Spark
ML Listener for Tracking ML Pipeline Status

SPARK-23710  Upgrade the
built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333  Add fit
with validation set to Gradient Boosted Trees: Python API

SPARK-24417  Build and
Run Spark on JDK11

SPARK-24615 
Accelerator-aware task scheduling for Spark

SPARK-24920  Allow
sharing Netty's memory pool allocators

SPARK-25250  Fix race
condition with tasks running when new attempt for same stage is created
leads to other task in the next attempt running on the same partition id
retry multiple times

SPARK-25341  Support
rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348  Data source
for binary files

SPARK-25603  Generalize
Nested Column Pruning

SPARK-26132  Remove
support for Scala 2.11 in Spark 3.0.0

SPARK-26215  define
reserved keywords after SQL standard

SPARK-26412  Allow
Pandas UDF to take an iterator of pd.DataFrames

SPARK-26785  data source
v2 API refactor: streaming write

SPARK-26956  remove
streaming output mode from data source v2 APIs

SPARK-27064  create
StreamingWrite at the beginning of streaming execution

SPARK-27119  Do not
infer schema when reading Hive serde table with native data source

SPARK-27225  Implement
join strategy hints

SPARK-27240  Use pandas
DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338  Fix
deadlock between TaskMemoryManager and
UnsafeExternalSorter$SpillableIterator

SPARK-27396  Public APIs
for extended Columnar Processing Support

SPARK-27589 
Re-implement file sources with data source V2 API

SPARK-27677 
Disk-persisted RDD blocks served by shuffle service, and ignored for
Dynamic Allocation

SPARK-27699  Partially
push down disjunctive predicates in Parquet/ORC

SPARK-27763  Port test
cases from PostgreSQL to Spark SQL (ongoing)

SPARK-27884  Deprecate
Python 2 support

SPARK-27921  Convert
applicable *.sql tests into UDF integrated test base

SPARK-27963 

Re: Spark 3.0 preview release on-going features discussion

2019-09-23 Thread Xingbo Jiang
Thanks everyone, let me first work on the feature list and major changes
that have already been finished in the master branch.

Cheers!

Xingbo

Ryan Blue  于2019年9月20日周五 上午10:56写道:

> I’m not sure that DSv2 list is accurate. We discussed this in the DSv2
> sync this week (just sent out the notes) and came up with these items:
>
>- Finish TableProvider update to avoid another API change: pass all
>table config from metastore
>- Catalog behavior fix:
>https://issues.apache.org/jira/browse/SPARK-29014
>- Stats push-down fix: move push-down to the optimizer
>- Make DataFrameWriter compatible when updating a source from v1 to
>v2, by adding extractCatalogName and extractIdentifier to TableProvider
>
> Some of the ideas that came up, like changing the pushdown API, were
> passed on because it is too close to the release to reasonably get the
> changes done without a serious delay (like the API changes just before the
> 2.4 release).
>
> On Fri, Sep 20, 2019 at 9:55 AM Dongjoon Hyun 
> wrote:
>
>> Thank you for the summarization, Xingbo.
>>
>> I also agree with Sean because I don't think those block 3.0.0 preview
>> release.
>> Especially, correctness issues should not be there.
>>
>> Instead, could you summarize what we have as of now for 3.0.0 preview?
>>
>> I believe JDK11 (SPARK-28684) and Hive 2.3.5 (SPARK-23710) will be in the
>> what-we-have list for 3.0.0 preview.
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Sep 20, 2019 at 6:22 AM Sean Owen  wrote:
>>
>>> Is this a list of items that might be focused on for the final 3.0
>>> release? At least, Scala 2.13 support shouldn't be on that list. The
>>> others look plausible, or are already done, but there are probably
>>> more.
>>>
>>> As for the 3.0 preview, I wouldn't necessarily block on any particular
>>> feature, though, yes, the more work that can go into important items
>>> between now and then, the better.
>>> I wouldn't necessarily present any list of things that will or might
>>> be in 3.0 with that preview; just list the things that are done, like
>>> JDK 11 support.
>>>
>>> On Fri, Sep 20, 2019 at 2:46 AM Xingbo Jiang 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Let's start a new thread to discuss the on-going features for Spark
>>> 3.0 preview release.
>>> >
>>> > Below is the feature list for the Spark 3.0 preview release. The list
>>> is collected from the previous discussions in the dev list.
>>> >
>>> > Followup of the shuffle+repartition correctness issue: support roll
>>> back shuffle stages (https://issues.apache.org/jira/browse/SPARK-25341)
>>> > Upgrade the built-in Hive to 2.3.5 for hadoop-3.2 (
>>> https://issues.apache.org/jira/browse/SPARK-23710)
>>> > JDK 11 support (https://issues.apache.org/jira/browse/SPARK-28684)
>>> > Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075)
>>> > DataSourceV2 features
>>> >
>>> > Enable file source v2 writers (
>>> https://issues.apache.org/jira/browse/SPARK-27589)
>>> > CREATE TABLE USING with DataSourceV2
>>> > New pushdown API for DataSourceV2
>>> > Support DELETE/UPDATE/MERGE Operations in DataSourceV2 (
>>> https://issues.apache.org/jira/browse/SPARK-28303)
>>> >
>>> > Correctness issue: Stream-stream joins - left outer join gives
>>> inconsistent output (https://issues.apache.org/jira/browse/SPARK-26154)
>>> > Revisiting Python / pandas UDF (
>>> https://issues.apache.org/jira/browse/SPARK-28264)
>>> > Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994)
>>> >
>>> > Features that are nice to have:
>>> >
>>> > Use remote storage for persisting shuffle data (
>>> https://issues.apache.org/jira/browse/SPARK-25299)
>>> > Spark + Hadoop + Parquet + Avro compatibility problems (
>>> https://issues.apache.org/jira/browse/SPARK-25588)
>>> > Introduce new option to Kafka source - specify timestamp to start and
>>> end offset (https://issues.apache.org/jira/browse/SPARK-26848)
>>> > Delete files after processing in structured streaming (
>>> https://issues.apache.org/jira/browse/SPARK-20568)
>>> >
>>> > Here, I am proposing to cut the branch on October 15th. If the
>>> features are targeting to 3.0 preview release, please prioritize the work
>>> and finish it before the date. Note, Oct. 15th is not the code freeze of
>>> Spark 3.0. That means, the community will still work on the features for
>>> the upcoming Spark 3.0 release, even if they are not included in the
>>> preview release. The goal of preview release is to collect more feedback
>>> from the community regarding the new 3.0 features/behavior changes.
>>> >
>>> > Thanks!
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Spark 3.0 preview release on-going features discussion

2019-09-20 Thread Xingbo Jiang
Hi all,

Let's start a new thread to discuss the on-going features for Spark 3.0
preview release.

Below is the feature list for the Spark 3.0 preview release. The list is
collected from the previous discussions in the dev list.

   - Followup of the shuffle+repartition correctness issue: support roll
   back shuffle stages (https://issues.apache.org/jira/browse/SPARK-25341)
   - Upgrade the built-in Hive to 2.3.5 for hadoop-3.2 (
   https://issues.apache.org/jira/browse/SPARK-23710)
   - JDK 11 support (https://issues.apache.org/jira/browse/SPARK-28684)
   - Scala 2.13 support (https://issues.apache.org/jira/browse/SPARK-25075)
   - DataSourceV2 features
  - Enable file source v2 writers (
  https://issues.apache.org/jira/browse/SPARK-27589)
  - CREATE TABLE USING with DataSourceV2
  - New pushdown API for DataSourceV2
  - Support DELETE/UPDATE/MERGE Operations in DataSourceV2 (
  https://issues.apache.org/jira/browse/SPARK-28303)
   - Correctness issue: Stream-stream joins - left outer join gives
   inconsistent output (https://issues.apache.org/jira/browse/SPARK-26154)
   - Revisiting Python / pandas UDF (
   https://issues.apache.org/jira/browse/SPARK-28264)
   - Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994)

Features that are nice to have:

   - Use remote storage for persisting shuffle data (
   https://issues.apache.org/jira/browse/SPARK-25299)
   - Spark + Hadoop + Parquet + Avro compatibility problems (
   https://issues.apache.org/jira/browse/SPARK-25588)
   - Introduce new option to Kafka source - specify timestamp to start and
   end offset (https://issues.apache.org/jira/browse/SPARK-26848)
   - Delete files after processing in structured streaming (
   https://issues.apache.org/jira/browse/SPARK-20568)

Here, I am proposing to cut the branch on October 15th. If a feature is
targeting the 3.0 preview release, please prioritize the work and finish it
before that date. Note that Oct. 15th is not the code freeze of Spark 3.0;
the community will still work on features for the upcoming Spark 3.0
release even if they are not included in the preview release. The goal of
the preview release is to collect more feedback from the community
regarding the new 3.0 features/behavior changes.

Thanks!


Re: Thoughts on Spark 3 release, or a preview release

2019-09-13 Thread Xingbo Jiang
Hi all,

I would like to volunteer to be the release manager of Spark 3 preview,
thanks!

Sean Owen  于2019年9月13日周五 上午11:21写道:

> Well, great to hear the unanimous support for a Spark 3 preview
> release. Now, I don't know how to make releases myself :) I would
> first open it up to our revered release managers: would anyone be
> interested in trying to make one? sounds like it's not too soon to get
> what's in master out for evaluation, as there aren't any major
> deficiencies left, although a number of items to consider for the
> final release.
>
> I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
> order to make it possible to test with JDK 11. (We're only on Scala
> 2.12 at this point.)
>
> On Thu, Sep 12, 2019 at 7:32 PM Reynold Xin  wrote:
> >
> > +1! Long due for a preview release.
> >
> >
> > On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau 
> wrote:
> >>
> >> I like the idea from the PoV of giving folks something to start testing
> against and exploring so they can raise issues with us earlier in the
> process and we have more time to make calls around this.
> >>
> >> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge  wrote:
> >>>
> >>> +1  Like the idea as a user and a DSv2 contributor.
> >>>
> >>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim 
> wrote:
> 
>  +1 (as a contributor) from me to have preview release on Spark 3 as
> it would help to test the feature. When to cut preview release is
> questionable, as major works are ideally to be done before that - if we are
> intended to introduce new features before official release, that should
> work regardless of this, but if we are intended to have opportunity to test
> earlier, ideally it should.
> 
>  As a one of contributors in structured streaming area, I'd like to
> add some items for Spark 3.0, both "must be done" and "better to have". For
> "better to have", I pick some items for new features which committers
> reviewed couple of rounds and dropped off without soft-reject (No valid
> reason to stop). For Spark 2.4 users, only added feature for structured
> streaming is Kafka delegation token. (given we assume revising Kafka
> consumer pool as improvement) I hope we provide some gifts for structured
> streaming users in Spark 3.0 envelope.
> 
>  > must be done
>  * SPARK-26154 Stream-stream joins - left outer join gives
> inconsistent output
>  It's a correctness issue with multiple users reported, being reported
> at Nov. 2018. There's a way to reproduce it consistently, and we have a
> patch submitted at Jan. 2019 to fix it.
> 
>  > better to have
>  * SPARK-23539 Add support for Kafka headers in Structured Streaming
>  * SPARK-26848 Introduce new option to Kafka source - specify
> timestamp to start and end offset
>  * SPARK-20568 Delete files after processing in structured streaming
> 
>  There're some more new features/improvements items in SS, but given
> we're talking about ramping-down, above list might be realistic one.
> 
> 
> 
>  On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin 
> wrote:
> >
> > As a user/non committer, +1
> >
> > I love the idea of an early 3.0.0 so we can test current dev against
> it. I know the final 3.x will probably need another round of testing when
> it gets out, but less, for sure... I know I could check out and compile, but
> having a “packaged” pre-release version is great if it does not take too
> much of the team's time...
> >
> > jg
> >
> >
> > On Sep 11, 2019, at 20:40, Hyukjin Kwon  wrote:
> >
> > +1 from me too but I would like to know what other people think too.
> >
> > 2019년 9월 12일 (목) 오전 9:07, Dongjoon Hyun 님이
> 작성:
> >>
> >> Thank you, Sean.
> >>
> >> I'm also +1 for the following three.
> >>
> >> 1. Start to ramp down (by the official branch-3.0 cut)
> >> 2. Apache Spark 3.0.0-preview in 2019
> >> 3. Apache Spark 3.0.0 in early 2020
> >>
> >> For JDK11 clean-up, it will meet the timeline and `3.0.0-preview`
> helps it a lot.
> >>
> >> After this discussion, can we have some timeline for `Spark 3.0
> Release Window` in our versioning-policy page?
> >>
> >> - https://spark.apache.org/versioning-policy.html
> >>
> >> Bests,
> >> Dongjoon.
> >>
> >>
> >> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer 
> wrote:
> >>>
> >>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
> problems resolved, e.g.
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-25588
> >>> https://issues.apache.org/jira/browse/SPARK-27781
> >>>
> >>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.
> As far as I know, Parquet has not cut a release based on this new version.
> >>>
> >>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
> >>>
> >>> https://github.com/apache/spark/pull/24851
> >>> 

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Xingbo Jiang
Congratulations!

Wenchen Fan 于2019年9月9日 周一下午7:49写道:

> Congratulations!
>
> On Tue, Sep 10, 2019 at 10:19 AM Yuanjian Li 
> wrote:
>
>> Congratulations!
>>
>> sujith chacko  于2019年9月10日周二 上午10:15写道:
>>
>>> Congratulations all.
>>>
>>> On Tue, 10 Sep 2019 at 7:27 AM, Haibo  wrote:
>>>
 congratulations~



 在2019年09月10日 09:30,Joseph Torres
  写道:

 congratulations!

 On Mon, Sep 9, 2019 at 6:27 PM 王 斐  wrote:

> congratulations!
>
> Get Outlook for iOS
>
> --
> *From:* Ye Xianjin
> *Sent:* Tuesday, September 10, 2019 09:26
> *To:* Jeff Zhang
> *Cc:* Saisai Shao; dev
> *Subject:* Re: Welcoming some new committers and PMC members
>
> Congratulations!
>
> Sent from my iPhone
>
> On Sep 10, 2019, at 9:19 AM, Jeff Zhang  wrote:
>
> Congratulations!
>
> Saisai Shao  于2019年9月10日周二 上午9:16写道:
>
>> Congratulations!
>>
>> Jungtaek Lim  于2019年9月9日周一 下午6:11写道:
>>
>>> Congratulations! Well deserved!
>>>
>>> On Tue, Sep 10, 2019 at 9:51 AM John Zhuge 
>>> wrote:
>>>
 Congratulations!

 On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp 
 wrote:

> congrats everyone!  :)
>
> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia <
> matei.zaha...@gmail.com> wrote:
> >
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers and
> one PMC member. Join me in welcoming them to their new roles!
> >
> > New PMC member: Dongjoon Hyun
> >
> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang,
> Yuming Wang, Weichen Xu, Ruifeng Zheng
> >
> > The new committers cover lots of important areas including ML,
> SQL, and data sources, so it’s great to have them here. All the best,
> >
> > Matei and the Spark PMC
> >
> >
> >
> -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

 --
 John Zhuge

>>>
>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>
>>
>
> --
> Best Regards
>
> Jeff Zhang
>
>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xingbo Jiang
+1 on the updated SPIP

Xingbo Jiang  于2019年3月26日周二 下午1:32写道:

> Hi all,
>
> Now that we have had a few discussions over the updated SPIP, we have also
> updated it to address new feedback from some committers. IMO the SPIP is
> ready for another round of voting now.
> On the updated SPIP we currently have two +1s (from Tom and Xiangrui);
> everyone else, please vote again.
>
> The vote will be up for the next 72 hours.
>
> Thanks!
>
> Xingbo
>
> Xiangrui Meng  于2019年3月26日周二 上午11:32写道:
>
>>
>>
>> On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra 
>> wrote:
>>
>>> Maybe.
>>>
>>> And I expect that we will end up doing something based on
>>> spark.task.cpus in the short term. I'd just rather that this SPIP not make
>>> it look like this is the way things should ideally be done. I'd prefer that
>>> we be quite explicit in recognizing that this approach is a significant
>>> compromise, and I'd like to see at least some references to the beginning
>>> of serious longer-term efforts to do something better in a deeper re-design
>>> of resource scheduling.
>>>
>>
>> It is also a feature I desire as a user. How about suggesting it as a
>> future work in the SPIP? It certainly requires someone who fully
>> understands Spark scheduler to drive. Shall we start with a Spark JIRA? I
>> don't know much about scheduler like you do, but I can speak for DL use
>> cases. Maybe we just view it from different angles. To you
>> application-level request is a significant compromise. To me it provides a
>> major milestone that brings GPU to Spark workload. I know many users who
>> tried to do DL on Spark ended up doing hacks here and there, huge pain. The
>> scope covered by the current SPIP makes those users much happier. Tom and
>> Andy from NVIDIA are certainly more calibrated on the usefulness of the
>> current proposal.
>>
>>
>>>
>>> On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng 
>>> wrote:
>>>
>>>> There are certainly use cases where different stages require different
>>>> number of CPUs or GPUs under an optimal setting. I don't think anyone
>>>> disagrees that ideally users should be able to do it. We are just dealing
>>>> with typical engineering trade-offs and see how we break it down into
>>>> smaller ones. I think it is fair to treat the task-level resource request
>>>> as a separate feature here because it also applies to CPUs alone without
>>>> GPUs, as Tom mentioned above. But having "spark.task.cpus" only for many
>>>> years Spark is still able to cover many many use cases. Otherwise we
>>>> shouldn't see many Spark users around now. Here we just apply similar
>>>> arguments to GPUs.
>>>>
>>>> Initially, I was the person who really wanted task-level requests
>>>> because it is ideal. In an offline discussion, Andy Feng pointed out an
>>>> application-level setting should fit common deep learning training and
>>>> inference cases and it greatly simplifies necessary changes required to
>>>> Spark job scheduler. With Imran's feedback to the initial design sketch,
>>>> the application-level approach became my first choice because it is still
>>>> very valuable but much less risky. If a feature brings great value to
>>>> users, we should add it even it is not ideal.
>>>>
>>>> Back to the default value discussion, let's forget GPUs and only
>>>> consider CPUs. Would an application-level default number of CPU cores
>>>> disappear if we added task-level requests? If yes, does it mean that users
>>>> have to explicitly state the resource requirements for every single stage?
>>>> It is tedious to do, and users who do not fully understand the impact
>>>> would probably do it wrong and waste even more resources. Then how many
>>>> cores should each task use if the user didn't specify it? I do see "spark.task.cpus"
>>>> is the answer here. The point I want to make is that "spark.task.cpus",
>>>> though less ideal, is still needed when we have task-level requests for
>>>> CPUs.
>>>>
>>>> On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
>>>> wrote:
>>>>
>>>>> I remain unconvinced that a default configuration at the application
>>>>> level makes sense even in that case. There may be some applications where
>>>>> you know a priori that almost all the tasks for all the stages for all the

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xingbo Jiang
>>>>>>> work than just trying to extend the existing resource allocation
>>>>>>> mechanisms to handle domain-specific resources, but it does feel to me
>>>>>>> like we should at least be considering doing that deeper redesign.
>>>>>>>
>>>>>>> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves
>>>>>>>  wrote:
>>>>>>>
>>>>>>> Tthe proposal here is that all your resources are static and the gpu
>>>>>>> per task config is global per application, meaning you ask for a certain
>>>>>>> amount memory, cpu, GPUs for every executor up front just like you do 
>>>>>>> today
>>>>>>> and every executor you get is that size.  This means that both static or
>>>>>>> dynamic allocation still work without explicitly adding more logic at 
>>>>>>> this
>>>>>>> point. Since the config for gpu per task is global it means every task 
>>>>>>> you
>>>>>>> want will need a certain ratio of cpu to gpu.  Since that is a global 
>>>>>>> you
>>>>>>> can't really have the scenario you mentioned, all tasks are assuming to
>>>>>>> need GPU.  For instance. I request 5 cores, 2 GPUs, set 1 gpu per task 
>>>>>>> for
>>>>>>> each executor.  That means that I could only run 2 tasks and 3 cores 
>>>>>>> would
>>>>>>> be wasted.  The stage/task level configuration of resources was removed 
>>>>>>> and
>>>>>>> is something we can do in a separate SPIP.
>>>>>>> We thought erroring would make it more obvious to the user.  We
>>>>>>> could change this to a warning if everyone thinks that is better but I
>>>>>>> personally like the error until we can implement the lower-level per-stage
>>>>>>> configuration.
>>>>>>>
>>>>>>> Tom
>>>>>>>
>>>>>>> On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <
>>>>>>> marcogaid...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Thanks for this SPIP.
>>>>>>> I cannot comment on the docs, but just wanted to highlight one
>>>>>>> thing. In page 5 of the SPIP, when we talk about DRA, I see:
>>>>>>>
>>>>>>> "For instance, if each executor consists 4 CPUs and 2 GPUs, and
>>>>>>> each task requires 1 CPU and 1GPU, then we shall throw an error on
>>>>>>> application start because we shall always have at least 2 idle CPUs per
>>>>>>> executor"
>>>>>>>
>>>>>>> I am not sure this is a correct behavior. We might have tasks
>>>>>>> requiring only CPU running in parallel as well, hence that may make 
>>>>>>> sense.
>>>>>>> I'd rather emit a WARN or something similar. Anyway we just said we will
>>>>>>> keep GPU scheduling on task level out of scope for the moment, right?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Marco
>>>>>>>
>>>>>>> Il giorno gio 21 mar 2019 alle ore 01:26 Xiangrui Meng <
>>>>>>> m...@databricks.com> ha scritto:
>>>>>>>
>>>>>>> Steve, the initial work would focus on GPUs, but we will keep the
>>>>>>> interfaces general to support other accelerators in the future. This was
>>>>>>> mentioned in the SPIP and draft design.
>>>>>>>
>>>>>>> Imran, you should have comment permission now. Thanks for making a
>>>>>>> pass! I don't think the proposed 3.0 features should block Spark 3.0
>>>>>>> release either. It is just an estimate of what we could deliver. I will
>>>>>>> update the doc to make it clear.
>>>>>>>
>>>>>>> Felix, it would be great if you can review the updated docs and let
>>>>>>> us know your feedback.
>>>>>>>
>>>>>>> ** How about setting a tentative vote closing time to next Tue (Mar
>>>>>>> 26)?
>>>>>>>
>>>>>>> On Wed, Mar 20, 2019 at 1

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-18 Thread Xingbo Jiang
Hi all,

I updated the SPIP doc and stories; I hope they now contain a clear scope of
the changes and enough details for the SPIP vote.
Please review the updated docs, thanks!

Xiangrui Meng wrote on Wed, Mar 6, 2019 at 8:35 AM:

> How about letting Xingbo make a major revision to the SPIP doc to make it
> clear what is proposed? I like Felix's suggestion to switch to the new
> Heilmeier template, which helps clarify what is proposed and what is not.
> Then let's review the new SPIP and resume the vote.
>
> On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid  wrote:
>
>> OK, I suppose then we are getting bogged down into what a vote on an SPIP
>> means then anyway, which I guess we can set aside for now.  With the level
>> of detail in this proposal, I feel like there is a reasonable chance I'd
>> still -1 the design or implementation.
>>
>> And the other thing you're implicitly asking the community for is to
>> prioritize this feature for continued review and maintenance.  There is
>> already work to be done in things like making barrier mode support dynamic
>> allocation (SPARK-24942), bugs in failure handling (eg. SPARK-25250), and
>> general efficiency of failure handling (eg. SPARK-25341, SPARK-20178).  I'm
>> very concerned about getting spread too thin.
>>
>
>> But if this is really just a vote on (1) is better gpu support important
>> for spark, in some form, in some release? and (2) is it *possible* to do
>> this in a safe way?  then I will vote +0.
>>
>> On Tue, Mar 5, 2019 at 8:25 AM Tom Graves  wrote:
>>
>>> So to me most of the questions here are implementation/design questions,
>>> I've had this issue in the past with SPIP's where I expected to have more
>>> high level design details but was basically told that belongs in the design
>>> jira follow on. This makes me think we need to revisit what a SPIP really
>>> need to contain, which should be done in a separate thread.  Note
>>> personally I would be for having more high level details in it.
>>> But the way I read our documentation on a SPIP right now that detail is
>>> all optional, now maybe we could argue its based on what reviewers request,
>>> but really perhaps we should make the wording of that more required.
>>>  thoughts?  We should probably separate that discussion if people want to
>>> talk about that.
>>>
>>> For this SPIP in particular the reason I +1 it is because it came down
>>> to 2 questions:
>>>
>>> 1) do I think spark should support this -> my answer is yes, I think
>>> this would improve spark, users have been requesting both better GPUs
>>> support and support for controlling container requests at a finer
>>> granularity for a while.  If spark doesn't support this then users may go
>>> to something else, so I think we should support it
>>>
>>> 2) do I think its possible to design and implement it without causing
>>> large instabilities?   My opinion here again is yes. I agree with Imran and
>>> others that the scheduler piece needs to be looked at very closely as we
>>> have had a lot of issues there and that is why I was asking for more
>>> details in the design jira:
>>> https://issues.apache.org/jira/browse/SPARK-27005.  But I do believe
>>> its possible to do.
>>>
>>> If others have reservations on similar questions then I think we should
>>> resolve here or take the discussion of what a SPIP is to a different thread
>>> and then come back to this, thoughts?
>>>
>>> Note there is a high level design for at least the core piece, which is
>>> what people seem concerned with, already so including it in the SPIP should
>>> be straight forward.
>>>
>>> Tom
>>>
>>> On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid <
>>> im...@therashids.com> wrote:
>>>
>>>
>>> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:
>>>
>>> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung 
>>> wrote:
>>>
>>> IMO upfront allocation is less useful. Specifically too expensive for
>>> large jobs.
>>>
>>>
>>> This is also an API/design discussion.
>>>
>>>
>>> I agree with Felix -- this is more than just an API question.  It has a
>>> huge impact on the complexity of what you're proposing.  You might be
>>> proposing big changes to a core and brittle part of spark, which is already
>>> short of experts.
>>>
>>> I don't see any value in having a vote on "does feature X sound cool?"
>>> We have to evaluate the potential benefit against the risks the feature
>>> brings and the continued maintenance cost.  We don't need super low-level
>>> details, but we have to a sketch of the design to be able to make that
>>> tradeoff.
>>>
>>


Re: SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Xingbo Jiang
Hi Sean,

To support GPU scheduling with a YARN cluster, we have to update the Hadoop
version to 3.1.2+. However, if we decide not to upgrade Hadoop beyond
that version for Spark 3.0, then we just have to disable/fall back the GPU
scheduling with YARN; users will still be able to have that feature with
Standalone or Kubernetes clusters.

We didn't include Mesos support in the current SPIP because we didn't
receive use cases that require GPU scheduling on a Mesos cluster; however, we
can still add Mesos support in the future if we observe valid use cases.

Thanks!

Xingbo

Sean Owen wrote on Fri, Mar 1, 2019 at 10:39 PM:

> Two late breaking questions:
>
> This basically requires Hadoop 3.1 for YARN support?
> Mesos support is listed as a non goal but it already has support for
> requesting GPUs in Spark. That would be 'harmonized' with this
> implementation even if it's not extended?
>
> On Fri, Mar 1, 2019, 7:48 AM Xingbo Jiang  wrote:
>
>> I think we are aligned on the commitment, I'll start a vote thread for
>> this shortly.
>>
>> Xiangrui Meng wrote on Wed, Feb 27, 2019 at 6:47 AM:
>>
>>> In case there are issues visiting Google doc, I attached PDF files to
>>> the JIRA.
>>>
>>> On Tue, Feb 26, 2019 at 7:41 AM Xingbo Jiang 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I want to send a revised SPIP on implementing Accelerator(GPU)-aware
>>>> Scheduling. It improves Spark by making it aware of GPUs exposed by cluster
>>>> managers, and hence Spark can match GPU resources with user task requests
>>>> properly. If you have scenarios that need to run workloads(DL/ML/Signal
>>>> Processing etc.) on Spark cluster with GPU nodes, please help review and
>>>> check how it fits into your use cases. Your feedback would be greatly
>>>> appreciated!
>>>>
>>>> # Links to SPIP and Product doc:
>>>>
>>>> * Jira issue for the SPIP:
>>>> https://issues.apache.org/jira/browse/SPARK-24615
>>>> * Google Doc:
>>>> https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit?usp=sharing
>>>> * Product Doc:
>>>> https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit?usp=sharing
>>>>
>>>> Thank you!
>>>>
>>>> Xingbo
>>>>
>>>


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Xingbo Jiang
Start with +1 from myself.

Xingbo Jiang wrote on Fri, Mar 1, 2019 at 10:14 PM:

> Hi all,
>
> I want to call for a vote of SPARK-24615
> <https://issues.apache.org/jira/browse/SPARK-24615>. It improves Spark by
> making it aware of GPUs exposed by cluster managers, and hence Spark can
> match GPU resources with user task requests properly. The proposal
> <https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit?usp=sharing>
>  and product doc
> <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit?usp=sharing>
>  were
> made available on dev@ to collect input. You can also find a design
> sketch at SPARK-27005 <https://issues.apache.org/jira/browse/SPARK-27005>.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thank you!
>
> Xingbo
>


[VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Xingbo Jiang
Hi all,

I want to call for a vote of SPARK-24615. It improves Spark by
making it aware of GPUs exposed by cluster managers, and hence Spark can
match GPU resources with user task requests properly. The proposal and product
doc were made available on dev@ to collect input. You can also find a design
sketch at SPARK-27005.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical
reasons.

Thank you!

Xingbo


Re: SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Xingbo Jiang
I think we are aligned on the commitment, I'll start a vote thread for this
shortly.

Xiangrui Meng wrote on Wed, Feb 27, 2019 at 6:47 AM:

> In case there are issues visiting Google doc, I attached PDF files to the
> JIRA.
>
> On Tue, Feb 26, 2019 at 7:41 AM Xingbo Jiang 
> wrote:
>
>> Hi all,
>>
>> I want to send a revised SPIP on implementing Accelerator(GPU)-aware
>> Scheduling. It improves Spark by making it aware of GPUs exposed by cluster
>> managers, and hence Spark can match GPU resources with user task requests
>> properly. If you have scenarios that need to run workloads(DL/ML/Signal
>> Processing etc.) on Spark cluster with GPU nodes, please help review and
>> check how it fits into your use cases. Your feedback would be greatly
>> appreciated!
>>
>> # Links to SPIP and Product doc:
>>
>> * Jira issue for the SPIP:
>> https://issues.apache.org/jira/browse/SPARK-24615
>> * Google Doc:
>> https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit?usp=sharing
>> * Product Doc:
>> https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit?usp=sharing
>>
>> Thank you!
>>
>> Xingbo
>>
>


SPIP: Accelerator-aware Scheduling

2019-02-26 Thread Xingbo Jiang
Hi all,

I want to send a revised SPIP on implementing Accelerator(GPU)-aware
Scheduling. It improves Spark by making it aware of GPUs exposed by cluster
managers, and hence Spark can match GPU resources with user task requests
properly. If you have scenarios that need to run workloads (DL/ML/Signal
Processing etc.) on a Spark cluster with GPU nodes, please help review and
check how it fits into your use cases. Your feedback would be greatly
appreciated!
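
To make the proposal concrete, here is a rough sketch of what an
application-level resource request could look like from the user side. This is
illustrative only: `spark.executor.cores` and `spark.task.cpus` are existing
configs, while the GPU-related keys and the discovery-script path below are
assumptions based on the SPIP and may change as the design discussion evolves.

```
from pyspark import SparkConf

# Sketch of application-level CPU/GPU requests (the GPU config names and the
# discovery script path are illustrative, not final).
conf = (
    SparkConf()
    .setAppName("gpu-aware-scheduling-sketch")
    # Existing application-level CPU settings.
    .set("spark.executor.cores", "4")
    .set("spark.task.cpus", "1")
    # Proposed GPU analogs: 2 GPUs per executor, 1 GPU per task, plus a
    # script that reports the GPU addresses available on each executor.
    .set("spark.executor.resource.gpu.amount", "2")
    .set("spark.task.resource.gpu.amount", "1")
    .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
)

# With 4 cores + 2 GPUs per executor and 1 CPU + 1 GPU per task, at most two
# such tasks can run per executor, so two cores would stay idle: the CPU/GPU
# ratio trade-off that an application-level setting implies.
```

The conf would then be passed to SparkContext/SparkSession on a cluster whose
nodes actually expose GPUs.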

# Links to SPIP and Product doc:

* Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-24615
* Google Doc:
https://docs.google.com/document/d/1C4J_BPOcSCJc58HL7JfHtIzHrjU0rLRdQM3y7ejil64/edit?usp=sharing
* Product Doc:
https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit?usp=sharing

Thank you!

Xingbo


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-13 Thread Xingbo Jiang
I'm working on the fix of SPARK-23243 and should be able to push
another commit in 1~2 days. More detailed discussion can go to the PR.
Thanks for pushing this issue forward! I really appreciate the efforts of
everyone who submitted PRs or actively joined the discussions!

2018-08-13 22:50 GMT+08:00 Tom Graves :

> I agree with Imran, we need to fix SPARK-23243
>  and any correctness
> issues for that matter.
>
> Tom
>
> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid
>  wrote:
>
>
> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>
> SPARK-23243 : 
> Shuffle+Repartition
> on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue, there is no consensus about
> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
> long-standing issue, not a regression.
>
>
> This is a really serious data loss bug.  Yes it's very complex, but we
> absolutely have to fix this, I really think it should be in 2.4.
> Has work on it stopped?
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xingbo Jiang
Speaking of the code from the hydrogen PRs, we actually didn't remove any of
the existing logic, and I tried my best to hide almost all of the newly
added logic behind an `isBarrier` flag (or something similar). I had to add
some new variables and new methods to the core code paths, but I think they
should not be hit if you are not running barrier workloads.

The only significant change I can think of is that I swapped the sequence of
failure handling in DAGScheduler, moving the `case FetchFailed` block to
before the `case Resubmitted` block, but again I don't think this should
affect a regular workload, because a task can only hit one failure type at a time.

Actually I also reviewed the previous PRs adding Spark on K8s support, and
I feel it's a good example of how to add new features to a project without
breaking existing workloads, I'm trying to follow that way in adding
barrier execution mode support.

I really appreciate any attention to the hydrogen PRs and welcome comments to
help improve the feature, thanks!

2018-08-01 4:19 GMT+08:00 Reynold Xin :

> I actually totally agree that we should make sure it should have no impact
> on existing code if the feature is not used.
>
>
> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
> wrote:
>
>> I don't have a comprehensive knowledge of the project hydrogen PRs,
>> however I've perused them, and they make substantial modifications to
>> Spark's core DAG scheduler code.
>>
>> What I'm wondering is: how high is the confidence level that the
>> "traditional" code paths are still stable. Put another way, is it even
>> possible to "turn off" or "opt out" of this experimental feature? This
>> analogy isn't perfect, but for example the k8s back-end is a major body of
>> code, but it has a very small impact on any *core* code paths, and so if
>> you opt out of it, it is well understood that you aren't running any
>> experimental code.
>>
>> Looking at the project hydrogen code, I'm less sure the same is true.
>> However, maybe there is a clear way to show how it is true.
>>
>>
>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
>> wrote:
>>
>>> No reasonable amount of time is likely going to be sufficient to fully
>>> vet the code as a PR. I'm not entirely happy with the design and code as
>>> they currently are (and I'm still trying to find the time to more publicly
>>> express my thoughts and concerns), but I'm fine with them going into 2.4
>>> much as they are as long as they go in with proper stability annotations
>>> and are understood not to be cast-in-stone final implementations, but
>>> rather as a way to get people using them and generating the feedback that
>>> is necessary to get us to something more like a final design and
>>> implementation.
>>>
>>> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
>>> wrote:
>>>

 Barrier mode seems like a high impact feature on Spark's core code: is
 one additional week enough time to properly vet this feature?

 On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
 joseph.tor...@databricks.com> wrote:

> Full continuous processing aggregation support ran into unanticipated
> scalability and scheduling problems. We’re planning to overcome those by
> using some of the barrier execution machinery, but since barrier execution
> itself is still in progress the full support isn’t going to make it into
> 2.4.
>
> Jose
>
> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
> tomasz.gaw...@outlook.com> wrote:
>
>> Hi,
>>
>> what is the status of Continuous Processing + Aggregations? As far as
>> I
>> remember, Jose Torres said it should  be easy to perform aggregations
>> if
>> coalesce(1) work. IIRC it's already merged to master.
>>
>> Is this work in progress? If yes, it would be great to have full
>> aggregation/join support in Spark 2.4 in CP.
>>
>> Pozdrawiam / Best regards,
>>
>> Tomek
>>
>>
>> On 2018-07-31 10:43, Petar Zečević wrote:
>> > This one is important to us: https://issues.apache.org/
>> jira/browse/SPARK-24020 (Sort-merge join inner range optimization)
>> but I think it could be useful to others too.
>> >
>> > It is finished and is ready to be merged (was ready a month ago at
>> least).
>> >
>> > Do you think you could consider including it in 2.4?
>> >
>> > Petar
>> >
>> >
>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>> >
>> >> I went through the open JIRA tickets and here is a list that we
>> should consider for Spark 2.4:
>> >>
>> >> High Priority:
>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>> >> This one is critical to the Spark ecosystem for deep learning. It
>> only has a few remaining works and I think we should have it in Spark 
>> 2.4.
>> >>
>> >> Middle Priority:
>> >> SPARK-23899: Built-in SQL Function Improvement
>> >> We've already added a lot of built-in functions 

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-25 Thread Xingbo Jiang
Xiangrui and I are leading an effort to implement a highly desirable
feature, Barrier Execution Mode.
https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new
scheduling model to Apache Spark so users can properly embed distributed DL
training as a Spark stage to simplify the distributed training workflow.
The prototype was demoed in the Spark Summit keynote. This new feature
got very positive feedback from the whole community. The design doc and
pull requests got more comments than we initially anticipated. We want to
finish this feature in the upcoming release, Spark 2.4. Would it be
possible to extend the code freeze by a week?

Thanks,

Xingbo

2018-07-07 0:47 GMT+08:00 Reynold Xin :

> FYI 6 mo is coming up soon since the last release. We will cut the branch
> and code freeze on Aug 1st in order to get 2.4 out on time.
>
>


[SPARK-24581] Design: BarrierTaskContext.barrier()

2018-07-24 Thread Xingbo Jiang
Hi All,

This is a follow-up to [SPARK-24374] SPIP: Support Barrier
Execution Mode in Apache Spark. The design doc:
https://docs.google.com/document/d/1r07-vU5JTH6s1jJ6azkmK0K5it6jwpfO6b_K3mJmxR4/edit?usp=sharing

We need to provide a communication barrier function to help coordinate
tasks within a barrier stage, which is frequently required by ML/DL
workloads. Similar to the MPI_Barrier function in MPI, the barrier() function
call blocks until all tasks in the same stage have reached this routine.
The design doc proposes to implement the barrier() function on top of the
netty-based RPC framework in Spark; it introduces a new driver-side
BarrierCoordinator and a new BarrierCoordinatorMessage, as well as a new
config to handle timeouts.
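
For illustration, here is a minimal sketch of how user code would reach the
global sync point, written against the API shape proposed for this work
(RDDBarrier plus BarrierTaskContext); the exact names and Python surface are
assumptions and may change as the design is reviewed. It also assumes the
cluster (or local[N]) has enough free slots to launch all barrier tasks at once.

```
from pyspark.sql import SparkSession
from pyspark import BarrierTaskContext

spark = SparkSession.builder.appName("barrier-sync-sketch").getOrCreate()
sc = spark.sparkContext

def train(iterator):
    context = BarrierTaskContext.get()
    data = list(iterator)      # e.g. load this worker's partition
    context.barrier()          # blocks until every task in the stage gets here
    # ... run one step of distributed training after the global sync ...
    yield (context.partitionId(), len(data))

# rdd.barrier() marks the stage as a barrier stage, so both tasks are
# launched together and synchronized by barrier().
rdd = sc.parallelize(range(8), numSlices=2)
print(rdd.barrier().mapPartitions(train).collect())
```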

Please feel free to review and discuss the design proposal.

Thanks,
Xingbo


[DESIGN] Barrier Execution Mode

2018-07-08 Thread Xingbo Jiang
Hi All,

I would like to invite you to review the design document for Barrier
Execution Mode:
https://docs.google.com/document/d/1GvcYR6ZFto3dOnjfLjZMtTezX0W5VYN9w1l4-tQXaZk/edit#

TL;DR: We announced project Hydrogen at the recent Spark+AI Summit; a major
part of the project involves significant changes to the execution mode of
Spark. This design doc proposes new APIs as well as a new execution mode
(known as barrier execution mode) to provide high-performance support for
DL workloads.

Major changes include:

   - Add RDDBarrier to support gang scheduling.
   - Add BarrierTaskContext to support global sync of all tasks in a stage.
   - A better fault tolerance approach for barrier stages: in case some
   tasks fail in the middle, retry all tasks in the same stage.
   - Integrate barrier execution mode with the Standalone cluster manager.

Please feel free to review and discuss the design proposal.

Thanks,
Xingbo


Re: Time for 2.3.2?

2018-06-28 Thread Xingbo Jiang
+1

Wenchen Fan wrote on Thu, Jun 28, 2018 at 2:06 PM:

> Hi Saisai, that's great! please go ahead!
>
> On Thu, Jun 28, 2018 at 12:56 PM Saisai Shao 
> wrote:
>
>> +1, like mentioned by Marcelo, these issues seem quite severe.
>>
>> I can work on the release if short of hands :).
>>
>> Thanks
>> Jerry
>>
>>
>> Marcelo Vanzin wrote on Thu, Jun 28, 2018 at 11:40 AM:
>>
>>> +1. SPARK-24589 / SPARK-24552 are kinda nasty and we should get fixes
>>> for those out.
>>>
>>> (Those are what delayed 2.2.2 and 2.1.3 for those watching...)
>>>
>>> On Wed, Jun 27, 2018 at 7:59 PM, Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > Spark 2.3.1 was released just a while ago, but unfortunately we
>>> discovered
>>> > and fixed some critical issues afterward.
>>> >
>>> > SPARK-24495: SortMergeJoin may produce wrong result.
>>> > This is a serious correctness bug, and is easy to hit: have duplicated
>>> join
>>> > key from the left table, e.g. `WHERE t1.a = t2.b AND t1.a = t2.c`, and
>>> the
>>> > join is a sort merge join. This bug is only present in Spark 2.3.
>>> >
>>> > SPARK-24588: stream-stream join may produce wrong result
>>> > This is a correctness bug in a new feature of Spark 2.3: the
>>> stream-stream
>>> > join. Users can hit this bug if one of the join side is partitioned by
>>> a
>>> > subset of the join keys.
>>> >
>>> > SPARK-24552: Task attempt numbers are reused when stages are retried
>>> > This is a long-standing bug in the output committer that may introduce
>>> data
>>> > corruption.
>>> >
>>> > SPARK-24542: UDFXPath allow users to pass carefully crafted XML to
>>> > access arbitrary files
>>> > This is a potential security issue if users build access control
>>> module upon
>>> > Spark.
>>> >
>>> > I think we need a Spark 2.3.2 to address these issues(especially the
>>> > correctness bugs) ASAP. Any thoughts?
>>> >
>>> > Thanks,
>>> > Wenchen
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-01 Thread Xingbo Jiang
+1

2018-06-01 9:21 GMT-07:00 Xiangrui Meng :

> Hi all,
>
> I want to call for a vote of SPARK-24374
> . It introduces a new
> execution mode to Spark, which would help both integration with external
> DL/AI frameworks and MLlib algorithm performance. This is one of the
> follow-ups from a previous discussion on dev@
> 
> .
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Best,
> Xiangrui
> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc. http://databricks.com
>


Re: Clarify window behavior in Spark SQL

2018-04-03 Thread Xingbo Jiang
This is actually by design: without an `ORDER BY` clause, all rows are
considered peers of the current row, which means that the frame
is effectively the entire partition. This behavior follows the window
syntax of PostgreSQL.
You can refer to the comment by yhuai:
https://github.com/apache/spark/pull/5604#discussion_r157931911
:)
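
To make that concrete: if you want the ordered window in example (2) to keep
aggregating over the whole partition, you can state the frame explicitly. A
small sketch, assuming a Spark version where Window.unboundedPreceding /
Window.unboundedFollowing are available (2.1+):

```
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import mean

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

# Ordered window, but with an explicit whole-partition frame.
w = (Window.partitionBy('id')
           .orderBy('v')
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.withColumn('v2', mean(df.v).over(w)).show()
# v2 is 2.0 on every row, matching example (1); dropping rowsBetween falls
# back to the default (unboundedPreceding, currentRow) frame once ORDER BY
# is present.
```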

2018-04-04 6:27 GMT+08:00 Reynold Xin :

> Do other (non-Hive) SQL systems do the same thing?
>
> On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <
> her...@databricks.com> wrote:
>
>> This is something we inherited from Hive:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
>>
>> When ORDER BY is specified with missing WINDOW clause, the WINDOW
>>> specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT
>>> ROW.
>>
>> When both ORDER BY and WINDOW clauses are missing, the WINDOW
>>> specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
>>> FOLLOWING.
>>
>>
>> It sort of makes sense if you think about it. If there is no ordering
>> there is no way to have a bound frame. If there is ordering we default to
>> the most commonly used deterministic frame.
>>
>>
>> On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin  wrote:
>>
>>> Seems like a bug.
>>>
>>>
>>>
>>> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin  wrote:
>>>
 Hi Devs,

 I am seeing some behavior with window functions that is a bit
 unintuitive and would like to get some clarification.

 When using aggregation function with window, the frame boundary seems
 to change depending on the order of the window.

 Example:
 (1)

 df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

 w1 = Window.partitionBy('id')

 df.withColumn('v2', mean(df.v).over(w1)).show()

 +---+---+---+

 | id|  v| v2|

 +---+---+---+

 |  0|  1|2.0|

 |  0|  2|2.0|

 |  0|  3|2.0|

 +---+---+---+

 (2)
 df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

 w2 = Window.partitionBy('id').orderBy('v')

 df.withColumn('v2', mean(df.v).over(w2)).show()

 +---+---+---+

 | id|  v| v2|

 +---+---+---+

 |  0|  1|1.0|

 |  0|  2|1.5|

 |  0|  3|2.0|

 +---+---+---+

 Seems like orderBy('v') in the example (2) also changes the frame boundaries
 from (unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
 currentRow).


 I found this behavior a bit unintuitive. I wonder if this behavior is
 by design and if so, what's the specific rule that orderBy() interacts with
 frame boundaries?


 Thanks,

 Li


>>>
>>
>


Re: Welcome Zhenhua Wang as a Spark committer

2018-04-01 Thread Xingbo Jiang
congs & welcome!

2018-04-02 13:28 GMT+08:00 Wenchen Fan :

> Hi all,
>
> The Spark PMC recently added Zhenhua Wang as a committer on the project.
> Zhenhua is the major contributor of the CBO project, and has been
> contributing across several areas of Spark for a while, focusing especially
> on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>
> Wenchen
>


Re: Welcoming some new committers

2018-03-02 Thread Xingbo Jiang
Congratulations to everyone!

2018-03-03 8:51 GMT+08:00 Ilan Filonenko :

> Congrats to everyone! :)
>
> On Fri, Mar 2, 2018 at 7:34 PM Felix Cheung 
> wrote:
>
>> Congrats and welcome!
>>
>> --
>> *From:* Dongjoon Hyun 
>> *Sent:* Friday, March 2, 2018 4:27:10 PM
>> *To:* Spark dev list
>> *Subject:* Re: Welcoming some new committers
>>
>> Congrats to all!
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Mar 2, 2018 at 4:13 PM, Wenchen Fan  wrote:
>>
>>> Congratulations to everyone and welcome!
>>>
>>> On Sat, Mar 3, 2018 at 7:26 AM, Cody Koeninger 
>>> wrote:
>>>
 Congrats to the new committers, and I appreciate the vote of confidence.

 On Fri, Mar 2, 2018 at 4:41 PM, Matei Zaharia 
 wrote:
 > Hi everyone,
 >
 > The Spark PMC has recently voted to add several new committers to the
 project, based on their contributions to Spark 2.3 and other past work:
 >
 > - Anirudh Ramanathan (contributor to Kubernetes support)
 > - Bryan Cutler (contributor to PySpark and Arrow support)
 > - Cody Koeninger (contributor to streaming and Kafka support)
 > - Erik Erlandson (contributor to Kubernetes support)
 > - Matt Cheah (contributor to Kubernetes support and other parts of
 Spark)
 > - Seth Hendrickson (contributor to MLlib and PySpark)
 >
 > Please join me in welcoming Anirudh, Bryan, Cody, Erik, Matt and Seth
 as committers!
 >
 > Matei
 > -
 > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-22 Thread Xingbo Jiang
+1

2018-02-23 11:26 GMT+08:00 Takuya UESHIN :

> +1
>
> On Fri, Feb 23, 2018 at 12:24 PM, Wenchen Fan  wrote:
>
>> +1
>>
>> On Fri, Feb 23, 2018 at 6:23 AM, Sameer Agarwal 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.3.0. The vote is open until Tuesday February 27, 2018 at 8:00:00 am UTC
>>> and passes if a majority of at least 3 PMC +1 votes are cast.
>>>
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.0-rc5:
>>> https://github.com/apache/spark/tree/v2.3.0-rc5 (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>>>
>>> List of JIRA tickets resolved in this release can be found here:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1266/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/index.html
>>>
>>>
>>> FAQ
>>>
>>> ===
>>> What are the unresolved issues targeted for 2.3.0?
>>> ===
>>>
>>> Please see https://s.apache.org/oXKi. At the time of writing, there are
>>> currently no known release blockers.
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala you
>>> can add the staging repository to your projects resolvers and test with the
>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>> up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.0?
>>> ===
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
>>> appropriate.
>>>
>>> ===
>>> Why is my bug not fixed?
>>> ===
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.2.0. That being said, if
>>> there is something which is a regression from 2.2.0 and has not been
>>> correctly targeted please ping me or a committer to help target the issue
>>> (you can see the open issues listed as impacting Spark 2.3.0 at
>>> https://s.apache.org/WmoI).
>>>
>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-19 Thread Xingbo Jiang
+1


Wenchen Fan wrote on Tue, Feb 20, 2018 at 1:09 PM:

> +1
>
> On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin  wrote:
>
>> +1
>>
>> On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal ,
>> wrote:
>>
>> this file shouldn't be included?
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>>
>>
>> I've now deleted this file
>>
>> *From:* Sameer Agarwal 
>>> *Sent:* Saturday, February 17, 2018 1:43:39 PM
>>> *To:* Sameer Agarwal
>>> *Cc:* dev
>>> *Subject:* Re: [VOTE] Spark 2.3.0 (RC4)
>>>
>>> I'll start with a +1 once again.
>>>
>>> All blockers reported against RC3 have been resolved and the builds are
>>> healthy.
>>>
>>> On 17 February 2018 at 13:41, Sameer Agarwal 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.3.0. The vote is open until Thursday February 22, 2018 at 8:00:00
 am UTC and passes if a majority of at least 3 PMC +1 votes are cast.


 [ ] +1 Release this package as Apache Spark 2.3.0

 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see https://spark.apache.org/

 The tag to be voted on is v2.3.0-rc4:
 https://github.com/apache/spark/tree/v2.3.0-rc4
 (44095cb65500739695b0324c177c19dfa1471472)

 List of JIRA tickets resolved in this release can be found here:
 https://issues.apache.org/jira/projects/SPARK/versions/12339551

 The release files, including signatures, digests, etc. can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1265/

 The documentation corresponding to this release can be found at:

 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs/_site/index.html


 FAQ

 ===
 What are the unresolved issues targeted for 2.3.0?
 ===

 Please see https://s.apache.org/oXKi. At the time of writing, there
 are currently no known release blockers.

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.

 If you're working in PySpark you can set up a virtual env and install
 the current RC and see if anything important breaks, in the Java/Scala you
 can add the staging repository to your projects resolvers and test with the
 RC (make sure to clean up the artifact cache before/after so you don't end
 up building with a out of date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 2.3.0?
 ===

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should be
 worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 as
 appropriate.

 ===
 Why is my bug not fixed?
 ===

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from 2.2.0. That being
 said, if there is something which is a regression from 2.2.0 and has not
 been correctly targeted please ping me or a committer to help target the
 issue (you can see the open issues listed as impacting Spark 2.3.0 at
 https://s.apache.org/WmoI).

>>>
>>>
>>>
>>> --
>>> Sameer Agarwal
>>> Computer Science | UC Berkeley
>>> http://cs.berkeley.edu/~sameerag
>>>
>>
>>
>>
>> --
>> Sameer Agarwal
>> Computer Science | UC Berkeley
>> http://cs.berkeley.edu/~sameerag
>>
>>
>


Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-04 Thread Xingbo Jiang
I filed another NPE problem in the Web UI; I believe this is a regression in 2.3:
https://issues.apache.org/jira/browse/SPARK-23330

2018-02-01 10:38 GMT-08:00 Tom Graves :

> I filed a jira [SPARK-23304] Spark SQL coalesce() against hive not
> working - ASF JIRA  for
> the coalesce issue.
>
>
> Tom
>
> On Thursday, February 1, 2018, 12:36:02 PM CST, Sameer Agarwal <
> samee...@apache.org> wrote:
>
>
> [+ Xiao]
>
> SPARK-23290  does sound like a blocker. On the SQL side, I can confirm
> that there were non-trivial changes around repartitioning/coalesce and
> cache performance in 2.3 --  we're currently investigating these.
>
> On 1 February 2018 at 10:02, Andrew Ash  wrote:
>
> I'd like to nominate SPARK-23290
>  as a potential
> blocker for the 2.3.0 release.  It's a regression from 2.2.0 in that user
> pyspark code that works in 2.2.0 now fails in the 2.3.0 RCs: the type
> return type of date columns changed from object to datetime64[ns].  My
> understanding of the Spark Versioning Policy
>  is that user code should
> continue to run in future versions of Spark with the same major version
> number.
>
> Thanks!
>
> On Thu, Feb 1, 2018 at 9:50 AM, Tom Graves 
> wrote:
>
>
> Testing with spark 2.3 and I see a difference in the sql coalesce talking
> to hive vs spark 2.2. It seems spark 2.3 ignores the coalesce.
>
> Query:
> spark.sql("SELECT COUNT(DISTINCT(something)) FROM sometable WHERE dt >=
> '20170301' AND dt <= '20170331' AND something IS NOT
> NULL").coalesce(16).show()
>
> in spark 2.2 the coalesce works here, but in spark 2.3, it doesn't.
>  Anyone know about this issue or are there some weird config changes,
> otherwise I'll file a jira?
>
> Note I also see a performance difference when reading cached data. Spark
> 2.3. Small query on 19GB cached data, spark 2.3 is 30% worse.  This is only
> 13 seconds on spark 2.2 vs 17 seconds on spark 2.3.  Straight up reading
> from hive (orc) seems better though.
>
> Tom
>
>
>
> On Thursday, February 1, 2018, 11:23:45 AM CST, Michael Heuer <
> heue...@gmail.com> wrote:
>
>
> We found two classes new to Spark 2.3.0 that must be registered in Kryo
> for our tests to pass on RC2
>
> org.apache.spark.sql.execution.datasources.BasicWriteTaskStats
> org.apache.spark.sql.execution.datasources.ExecutedWriteSummary
>
> https://github.com/bigdatagenomics/adam/pull/1897
> 
>
> Perhaps a mention in release notes?
>
>michael
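
(For anyone hitting the same failure: a minimal sketch of one way to register
the two classes reported above, assuming you use the standard
`spark.kryo.classesToRegister` config; a custom KryoRegistrator works equally
well.)

```
from pyspark import SparkConf

# Sketch: register the classes mentioned above so that Kryo with
# spark.kryo.registrationRequired=true keeps working on Spark 2.3.0.
conf = (
    SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrationRequired", "true")
    .set(
        "spark.kryo.classesToRegister",
        "org.apache.spark.sql.execution.datasources.BasicWriteTaskStats,"
        "org.apache.spark.sql.execution.datasources.ExecutedWriteSummary",
    )
)
```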
>
>
> On Thu, Feb 1, 2018 at 3:29 AM, Nick Pentreath 
> wrote:
>
> All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side
> that should be everything outstanding.
>
>
> On Thu, 1 Feb 2018 at 06:21 Yin Huai  wrote:
>
> seems we are not running tests related to pandas in pyspark tests (see my
> email "python tests related to pandas are skipped in jenkins"). I think we
> should fix this test issue and make sure all tests are good before cutting
> RC3.
>
> On Wed, Jan 31, 2018 at 10:12 AM, Sameer Agarwal 
> wrote:
>
> Just a quick status update on RC3 -- SPARK-23274
>  was resolved
> yesterday and tests have been quite healthy throughout this week and the
> last. I'll cut the new RC as soon as the remaining blocker (SPARK-23202
> ) is resolved.
>
>
> On 30 January 2018 at 10:12, Andrew Ash  wrote:
>
> I'd like to nominate SPARK-23274
>  as a potential
> blocker for the 2.3.0 release as well, due to being a regression from
> 2.2.0.  The ticket has a simple repro included, showing a query that works
> in prior releases but now fails with an exception in the catalyst optimizer.
>
> On Fri, Jan 26, 2018 at 10:41 AM, Sameer Agarwal 
> wrote:
>
> This vote has failed due to a number of aforementioned blockers. I'll
> follow up with RC3 as soon as the 2 remaining (non-QA) blockers are
> resolved: https://s.apache.org/oXKi
>
>
> On 25 January 2018 at 12:59, Sameer Agarwal  wrote:
>
>
> Most tests pass on RC2, except I'm still seeing the timeout caused by 
> https://issues.apache.org/jira/browse/SPARK-23055; the tests never
> finish. I followed the thread a bit further and wasn't clear whether it was
> subsequently re-fixed for 2.3.0 or not. It says it's resolved along with
> https://issues.apache.org/jira/browse/SPARK-22908
>