Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xingbo Jiang
+1 on the updated SPIP

Xingbo Jiang  于2019年3月26日周二 下午1:32写道:

> Hi all,
>
> Now we have had a few discussions over the updated SPIP, we also updated
> the SPIP addressing new feedbacks from some committers. IMO the SPIP is
> ready for another round of vote now.
> On the updated SPIP, we currently have two +1s (from Tom and Xiangrui),
> everyone else please vote again.
>
> The vote will be up for the next 72 hours.
>
> Thanks!
>
> Xingbo
>
> Xiangrui Meng  于2019年3月26日周二 上午11:32写道:
>
>>
>>
>> On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra 
>> wrote:
>>
>>> Maybe.
>>>
>>> And I expect that we will end up doing something based on
>>> spark.task.cpus in the short term. I'd just rather that this SPIP not make
>>> it look like this is the way things should ideally be done. I'd prefer that
>>> we be quite explicit in recognizing that this approach is a significant
>>> compromise, and I'd like to see at least some references to the beginning
>>> of serious longer-term efforts to do something better in a deeper re-design
>>> of resource scheduling.
>>>
>>
>> It is also a feature I desire as a user. How about suggesting it as a
>> future work in the SPIP? It certainly requires someone who fully
>> understands Spark scheduler to drive. Shall we start with a Spark JIRA? I
>> don't know much about scheduler like you do, but I can speak for DL use
>> cases. Maybe we just view it from different angles. To you
>> application-level request is a significant compromise. To me it provides a
>> major milestone that brings GPU to Spark workload. I know many users who
>> tried to do DL on Spark ended up doing hacks here and there, huge pain. The
>> scope covered by the current SPIP makes those users much happier. Tom and
>> Andy from NVIDIA are certainly more calibrated on the usefulness of the
>> current proposal.
>>
>>
>>>
>>> On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng 
>>> wrote:
>>>
 There are certainly use cases where different stages require different
 number of CPUs or GPUs under an optimal setting. I don't think anyone
 disagrees that ideally users should be able to do it. We are just dealing
 with typical engineering trade-offs and see how we break it down into
 smaller ones. I think it is fair to treat the task-level resource request
 as a separate feature here because it also applies to CPUs alone without
 GPUs, as Tom mentioned above. But having "spark.task.cpus" only for many
 years Spark is still able to cover many many use cases. Otherwise we
 shouldn't see many Spark users around now. Here we just apply similar
 arguments to GPUs.

 Initially, I was the person who really wanted task-level requests
 because it is ideal. In an offline discussion, Andy Feng pointed out an
 application-level setting should fit common deep learning training and
 inference cases and it greatly simplifies necessary changes required to
 Spark job scheduler. With Imran's feedback to the initial design sketch,
 the application-level approach became my first choice because it is still
 very valuable but much less risky. If a feature brings great value to
 users, we should add it even it is not ideal.

 Back to the default value discussion, let's forget GPUs and only
 consider CPUs. Would an application-level default number of CPU cores
 disappear if we added task-level requests? If yes, does it mean that users
 have to explicitly state the resource requirements for every single stage?
 It is tedious to do and who do not fully understand the impact would
 probably do it wrong and waste even more resources. Then how many cores
 each task should use if user didn't specify it? I do see "spark.task.cpus"
 is the answer here. The point I want to make is that "spark.task.cpus",
 though less ideal, is still needed when we have task-level requests for
 CPUs.

 On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
 wrote:

> I remain unconvinced that a default configuration at the application
> level makes sense even in that case. There may be some applications where
> you know a priori that almost all the tasks for all the stages for all the
> jobs will need some fixed number of gpus; but I think the more common 
> cases
> will be dynamic configuration at the job or stage level. Stage level could
> have a lot of overlap with barrier mode scheduling -- barrier mode stages
> having a need for an inter-task channel resource, gpu-ified stages needing
> gpu resources, etc. Have I mentioned that I'm not a fan of the current
> barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."
>
> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng 
> wrote:
>
>> Say if we support per-task resource requests in the future, it would
>> be still inconvenient for users to declare the resource requirements for
>> every single task/stage. So there must be some default 

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xingbo Jiang
Hi all,

Now that we have had a few discussions over the updated SPIP, we have also
updated it to address new feedback from some committers. IMO the SPIP is
ready for another round of voting now.
On the updated SPIP, we currently have two +1s (from Tom and Xiangrui);
everyone else, please vote again.

The vote will be up for the next 72 hours.

Thanks!

Xingbo

Xiangrui Meng  于2019年3月26日周二 上午11:32写道:

>
>
> On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra 
> wrote:
>
>> Maybe.
>>
>> And I expect that we will end up doing something based on spark.task.cpus
>> in the short term. I'd just rather that this SPIP not make it look like
>> this is the way things should ideally be done. I'd prefer that we be quite
>> explicit in recognizing that this approach is a significant compromise, and
>> I'd like to see at least some references to the beginning of serious
>> longer-term efforts to do something better in a deeper re-design of
>> resource scheduling.
>>
>
> It is also a feature I desire as a user. How about suggesting it as a
> future work in the SPIP? It certainly requires someone who fully
> understands Spark scheduler to drive. Shall we start with a Spark JIRA? I
> don't know much about scheduler like you do, but I can speak for DL use
> cases. Maybe we just view it from different angles. To you
> application-level request is a significant compromise. To me it provides a
> major milestone that brings GPU to Spark workload. I know many users who
> tried to do DL on Spark ended up doing hacks here and there, huge pain. The
> scope covered by the current SPIP makes those users much happier. Tom and
> Andy from NVIDIA are certainly more calibrated on the usefulness of the
> current proposal.
>
>
>>
>> On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng 
>> wrote:
>>
>>> There are certainly use cases where different stages require different
>>> number of CPUs or GPUs under an optimal setting. I don't think anyone
>>> disagrees that ideally users should be able to do it. We are just dealing
>>> with typical engineering trade-offs and see how we break it down into
>>> smaller ones. I think it is fair to treat the task-level resource request
>>> as a separate feature here because it also applies to CPUs alone without
>>> GPUs, as Tom mentioned above. But having "spark.task.cpus" only for many
>>> years Spark is still able to cover many many use cases. Otherwise we
>>> shouldn't see many Spark users around now. Here we just apply similar
>>> arguments to GPUs.
>>>
>>> Initially, I was the person who really wanted task-level requests
>>> because it is ideal. In an offline discussion, Andy Feng pointed out an
>>> application-level setting should fit common deep learning training and
>>> inference cases and it greatly simplifies necessary changes required to
>>> Spark job scheduler. With Imran's feedback to the initial design sketch,
>>> the application-level approach became my first choice because it is still
>>> very valuable but much less risky. If a feature brings great value to
>>> users, we should add it even it is not ideal.
>>>
>>> Back to the default value discussion, let's forget GPUs and only
>>> consider CPUs. Would an application-level default number of CPU cores
>>> disappear if we added task-level requests? If yes, does it mean that users
>>> have to explicitly state the resource requirements for every single stage?
>>> It is tedious to do and who do not fully understand the impact would
>>> probably do it wrong and waste even more resources. Then how many cores
>>> each task should use if user didn't specify it? I do see "spark.task.cpus"
>>> is the answer here. The point I want to make is that "spark.task.cpus",
>>> though less ideal, is still needed when we have task-level requests for
>>> CPUs.
>>>
>>> On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
>>> wrote:
>>>
 I remain unconvinced that a default configuration at the application
 level makes sense even in that case. There may be some applications where
 you know a priori that almost all the tasks for all the stages for all the
 jobs will need some fixed number of gpus; but I think the more common cases
 will be dynamic configuration at the job or stage level. Stage level could
 have a lot of overlap with barrier mode scheduling -- barrier mode stages
 having a need for an inter-task channel resource, gpu-ified stages needing
 gpu resources, etc. Have I mentioned that I'm not a fan of the current
 barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."

 On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:

> Say if we support per-task resource requests in the future, it would
> be still inconvenient for users to declare the resource requirements for
> every single task/stage. So there must be some default values defined
> somewhere for task resource requirements. "spark.task.cpus" and
> "spark.task.accelerator.gpu.count" could serve for this purpose without
> introducing 

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Xiao Li
Thanks, DB!

The Hive UDAF fix
https://github.com/apache/spark/commit/0cfefa7e864f443cfd76cff8c50617a8afd080fb
was merged this weekend.

Xiao

DB Tsai  于2019年3月25日周一 下午9:46写道:

> RC9 was just cut. Will send out another thread once the build is finished.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Mon, Mar 25, 2019 at 5:10 PM Sean Owen  wrote:
> >
> > That's all merged now. I think you're clear to start an RC.
> >
> > On Mon, Mar 25, 2019 at 4:06 PM DB Tsai 
> wrote:
> > >
> > > I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961
> > > https://github.com/apache/spark/pull/24126 , anything critical that we
> > > have to wait for 2.4.1 release? Thanks!
> > >
> > > Sincerely,
> > >
> > > DB Tsai
> > > --
> > > Web: https://www.dbtsai.com
> > > PGP Key ID: 42E5B25A8F7A82C1
> > >
> > > On Sun, Mar 24, 2019 at 8:19 PM Sean Owen  wrote:
> > > >
> > > > Still waiting on a successful test - hope this one works.
> > > >
> > > > On Sun, Mar 24, 2019, 10:13 PM DB Tsai  wrote:
> > > >>
> > > >> Hello Sean,
> > > >>
> > > >> By looking at SPARK-26961 PR, seems it's ready to go. Do you think
> we
> > > >> can merge it into 2.4 branch soon?
> > > >>
> > > >> Sincerely,
> > > >>
> > > >> DB Tsai
> > > >> --
> > > >> Web: https://www.dbtsai.com
> > > >> PGP Key ID: 42E5B25A8F7A82C1
> > > >>
> > > >> On Sat, Mar 23, 2019 at 12:04 PM Sean Owen 
> wrote:
> > > >> >
> > > >> > I think we can/should get in SPARK-26961 too; it's all but ready
> to commit.
> > > >> >
> > > >> > On Sat, Mar 23, 2019 at 2:02 PM DB Tsai 
> wrote:
> > > >> > >
> > > >> > > -1
> > > >> > >
> > > >> > > I will fail RC8, and cut another RC9 on Monday to include
> SPARK-27160,
> > > >> > > SPARK-27178, SPARK-27112. Please let me know if there is any
> critical
> > > >> > > PR that has to be back-ported into branch-2.4.
> > > >> > >
> > > >> > > Thanks.
> > > >> > >
> > > >> > > Sincerely,
> > > >> > >
> > > >> > > DB Tsai
> > > >> > > --
> > > >> > > Web: https://www.dbtsai.com
> > > >> > > PGP Key ID: 42E5B25A8F7A82C1
> > > >> > >
> > > >> > > On Fri, Mar 22, 2019 at 12:28 AM DB Tsai 
> wrote:
> > > >> > > >
> > > >> > > > Since we have couple concerns and hesitations to release rc8,
> how
> > > >> > > > about we give it couple days, and have another vote on March
> 25,
> > > >> > > > Monday? In this case, I will cut another rc9 in the Monday
> morning.
> > > >> > > >
> > > >> > > > Darcy, as Dongjoon mentioned,
> > > >> > > > https://github.com/apache/spark/pull/24092 is conflict
> against
> > > >> > > > branch-2.4, can you make anther PR against branch-2.4 so we
> can
> > > >> > > > include the ORC fix in 2.4.1?
> > > >> > > >
> > > >> > > > Thanks.
> > > >> > > >
> > > >> > > > Sincerely,
> > > >> > > >
> > > >> > > > DB Tsai
> > > >> > > > --
> > > >> > > > Web: https://www.dbtsai.com
> > > >> > > > PGP Key ID: 42E5B25A8F7A82C1
> > > >> > > >
> > > >> > > > On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung <
> felixcheun...@hotmail.com> wrote:
> > > >> > > > >
> > > >> > > > > Reposting for shane here
> > > >> > > > >
> > > >> > > > > [SPARK-27178]
> > > >> > > > >
> https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> > > >> > > > >
> > > >> > > > > Is problematic too and it’s not in the rc8 cut
> > > >> > > > >
> > > >> > > > > https://github.com/apache/spark/commits/branch-2.4
> > > >> > > > >
> > > >> > > > > (Personally I don’t want to delay 2.4.1 either..)
> > > >> > > > >
> > > >> > > > > 
> > > >> > > > > From: Sean Owen 
> > > >> > > > > Sent: Wednesday, March 20, 2019 11:18 AM
> > > >> > > > > To: DB Tsai
> > > >> > > > > Cc: dev
> > > >> > > > > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC8)
> > > >> > > > >
> > > >> > > > > +1 for this RC. The tag is correct, licenses and sigs check
> out, tests
> > > >> > > > > of the source with most profiles enabled works for me.
> > > >> > > > >
> > > >> > > > > On Tue, Mar 19, 2019 at 5:28 PM DB Tsai
>  wrote:
> > > >> > > > > >
> > > >> > > > > > Please vote on releasing the following candidate as
> Apache Spark version 2.4.1.
> > > >> > > > > >
> > > >> > > > > > The vote is open until March 23 PST and passes if a
> majority +1 PMC votes are cast, with
> > > >> > > > > > a minimum of 3 +1 votes.
> > > >> > > > > >
> > > >> > > > > > [ ] +1 Release this package as Apache Spark 2.4.1
> > > >> > > > > > [ ] -1 Do not release this package because ...
> > > >> > > > > >
> > > >> > > > > > To learn more about Apache Spark, please see
> http://spark.apache.org/
> > > >> > > > > >
> > > >> > > > > > The tag to be voted on is v2.4.1-rc8 (commit
> 746b3ddee6f7ad3464e326228ea226f5b1f39a41):
> > > >> > > > > 

Re: [DISCUSS] Spark Columnar Processing

2019-03-25 Thread Wenchen Fan
Do you have some initial perf numbers? It seems fine to me to remain
row-based inside Spark with whole-stage-codegen, and convert rows to
columnar batches when communicating with external systems.

On Mon, Mar 25, 2019 at 1:05 PM Bobby Evans  wrote:

> This thread is to discuss adding in support for data frame processing
> using an in-memory columnar format compatible with Apache Arrow.  My main
> goal in this is to lay the groundwork so we can add in support for GPU
> accelerated processing of data frames, but this feature has a number of
> other benefits.  Spark currently supports Apache Arrow formatted data as an
> option to exchange data with python for pandas UDF processing. There has
> also been discussion around extending this to allow for exchanging data
> with other tools like pytorch, tensorflow, xgboost,... If Spark supports
> processing on Arrow compatible data it could eliminate the
> serialization/deserialization overhead when going between these systems.
> It also would allow for doing optimizations on a CPU with SIMD instructions
> similar to what Hive currently supports. Accelerated processing using a GPU
> is something that we will start a separate discussion thread on, but I
> wanted to set the context a bit.
>
> Jason Lowe, Tom Graves, and I created a prototype over the past few months
> to try and understand how to make this work.  What we are proposing is
> based off of lessons learned when building this prototype, but we really
> wanted to get feedback early on from the community. We will file a SPIP
> once we can get agreement that this is a good direction to go in.
>
> The current support for columnar processing lets a Parquet or Orc file
> format return a ColumnarBatch inside an RDD[InternalRow] using Scala’s type
> erasure. The code generation is aware that the RDD actually holds
> ColumnarBatchs and generates code to loop through the data in each batch as
> InternalRows.
>
> Instead, we propose a new set of APIs to work on an
> RDD[InternalColumnarBatch] instead of abusing type erasure. With this we
> propose adding in a Rule similar to how WholeStageCodeGen currently works.
> Each part of the physical SparkPlan would expose columnar support through a
> combination of traits and method calls. The rule would then decide when
> columnar processing would start and when it would end. Switching between
> columnar and row based processing is not free, so the rule would make a
> decision based off of an estimate of the cost to do the transformation and
> the estimated speedup in processing time.
>
> This should allow us to disable columnar support by simply disabling the
> rule that modifies the physical SparkPlan.  It should be minimal risk to
> the existing row-based code path, as that code should not be touched, and
> in many cases could be reused to implement the columnar version.  This also
> allows for small easily manageable patches. No huge patches that no one
> wants to review.
>
> As far as the memory layout is concerned OnHeapColumnVector and
> OffHeapColumnVector are already really close to being Apache Arrow
> compatible so shifting them over would be a relatively simple change.
> Alternatively we could add in a new implementation that is Arrow compatible
> if there are reasons to keep the old ones.
>
> Again this is just to get the discussion started, any feedback is welcome,
> and we will file a SPIP on it once we feel like the major changes we are
> proposing are acceptable.
>
> Thanks,
>
> Bobby Evans
>
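
For illustration only, here is a rough Scala sketch of the kind of API the
proposal describes: a physical operator advertising columnar support so a
planner rule (similar in spirit to WholeStageCodeGen) can decide where
columnar processing starts and ends. The trait and method names below are
invented for this sketch and are not taken from the proposal or from any
existing Spark API.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical trait an operator could mix in to expose a columnar path
// alongside the existing row-based one. A planner rule would inspect
// supportsColumnar on each node, estimate the row<->batch transition cost,
// and insert conversions where columnar execution starts and ends.
trait ColumnarSupport {
  // Whether this operator can produce ColumnarBatch output directly.
  def supportsColumnar: Boolean

  // The existing row-based execution path.
  def executeRows(): RDD[InternalRow]

  // The columnar execution path, used only when supportsColumnar is true.
  def executeColumnar(): RDD[ColumnarBatch]
}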


Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Reynold Xin
At some point we should celebrate having the largest RC number ever in Spark ...

On Mon, Mar 25, 2019 at 9:44 PM, DB Tsai <dbt...@dbtsai.com.invalid> wrote:

> RC9 was just cut. Will send out another thread once the build is finished.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Mon, Mar 25, 2019 at 5:10 PM Sean Owen <sro...@apache.org> wrote:
>
> > That's all merged now. I think you're clear to start an RC.
> >
> > On Mon, Mar 25, 2019 at 4:06 PM DB Tsai <dbt...@dbtsai.com.invalid> wrote:
> >
> > > I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961
> > > https://github.com/apache/spark/pull/24126 , anything critical that we
> > > have to wait for 2.4.1 release? Thanks!
> > >
> > > Sincerely,
> > >
> > > DB Tsai
> > > --
> > > Web: https://www.dbtsai.com
> > > PGP Key ID: 42E5B25A8F7A82C1
> > >
> > > On Sun, Mar 24, 2019 at 8:19 PM Sean Owen <sro...@apache.org> wrote:
> > >
> > > > Still waiting on a successful test - hope this one works.
> > > >
> > > > On Sun, Mar 24, 2019, 10:13 PM DB Tsai <dbt...@dbtsai.com> wrote:
> > > >
> > > > > Hello Sean,
> > > > >
> > > > > By looking at SPARK-26961 PR, seems it's ready to go. Do you think we
> > > > > can merge it into 2.4 branch soon?
> > > > >
> > > > > Sincerely,
> > > > >
> > > > > DB Tsai
> > > > > --
> > > > > Web: https://www.dbtsai.com
> > > > > PGP Key ID: 42E5B25A8F7A82C1
> > > > >
> > > > > On Sat, Mar 23, 2019 at 12:04 PM Sean Owen <sro...@apache.org> wrote:
> > > > >
> > > > > > I think we can/should get in SPARK-26961 too; it's all but ready to
> > > > > > commit.
> > > > > >
> > > > > > On Sat, Mar 23, 2019 at 2:02 PM DB Tsai <dbt...@dbtsai.com> wrote:
> > > > > >
> > > > > > > -1
> > > > > > >
> > > > > > > I will fail RC8, and cut another RC9 on Monday to include
> > > > > > > SPARK-27160, SPARK-27178, SPARK-27112. Please let me know if there
> > > > > > > is any critical PR that has to be back-ported into branch-2.4.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > Sincerely,
> > > > > > >
> > > > > > > DB Tsai
> > > > > > > --
> > > > > > > Web: https://www.dbtsai.com
> > > > > > > PGP Key ID: 42E5B25A8F7A82C1
> > > > > > >
> > > > > > > On Fri, Mar 22, 2019 at 12:28 AM DB Tsai <dbt...@dbtsai.com> wrote:
> > > > > > >
> > > > > > > > Since we have couple concerns and hesitations to release rc8,
> > > > > > > > how about we give it couple days, and have another vote on
> > > > > > > > March 25, Monday? In this case, I will cut another rc9 in the
> > > > > > > > Monday morning.
> > > > > > > >
> > > > > > > > Darcy, as Dongjoon mentioned,
> > > > > > > > https://github.com/apache/spark/pull/24092 is conflict against
> > > > > > > > branch-2.4, can you make anther PR against branch-2.4 so we can
> > > > > > > > include the ORC fix in 2.4.1?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > Sincerely,
> > > > > > > >
> > > > > > > > DB Tsai
> > > > > > > > --
> > > > > > > > Web: https://www.dbtsai.com
> > > > > > > > PGP Key ID: 42E5B25A8F7A82C1
> > > > > > > >
> > > > > > > > On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung
> > > > > > > > <felixcheun...@hotmail.com> wrote:
> > > > > > > >
> > > > > > > > > Reposting for shane here
> > > > > > > > >
> > > > > > > > > [SPARK-27178]
> > > > > > > > > https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> > > > > > > > >
> > > > > > > > > Is problematic too and it’s not in the rc8 cut
> > > > > > > > >
> > > > > > > > > https://github.com/apache/spark/commits/branch-2.4
> > > > > > > > >
> > > > > > > > > (Personally I don’t want to delay 2.4.1 either..)

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread DB Tsai
RC9 was just cut. Will send out another thread once the build is finished.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Mon, Mar 25, 2019 at 5:10 PM Sean Owen  wrote:
>
> That's all merged now. I think you're clear to start an RC.
>
> On Mon, Mar 25, 2019 at 4:06 PM DB Tsai  wrote:
> >
> > I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961
> > https://github.com/apache/spark/pull/24126 , anything critical that we
> > have to wait for 2.4.1 release? Thanks!
> >
> > Sincerely,
> >
> > DB Tsai
> > --
> > Web: https://www.dbtsai.com
> > PGP Key ID: 42E5B25A8F7A82C1
> >
> > On Sun, Mar 24, 2019 at 8:19 PM Sean Owen  wrote:
> > >
> > > Still waiting on a successful test - hope this one works.
> > >
> > > On Sun, Mar 24, 2019, 10:13 PM DB Tsai  wrote:
> > >>
> > >> Hello Sean,
> > >>
> > >> By looking at SPARK-26961 PR, seems it's ready to go. Do you think we
> > >> can merge it into 2.4 branch soon?
> > >>
> > >> Sincerely,
> > >>
> > >> DB Tsai
> > >> --
> > >> Web: https://www.dbtsai.com
> > >> PGP Key ID: 42E5B25A8F7A82C1
> > >>
> > >> On Sat, Mar 23, 2019 at 12:04 PM Sean Owen  wrote:
> > >> >
> > >> > I think we can/should get in SPARK-26961 too; it's all but ready to 
> > >> > commit.
> > >> >
> > >> > On Sat, Mar 23, 2019 at 2:02 PM DB Tsai  wrote:
> > >> > >
> > >> > > -1
> > >> > >
> > >> > > I will fail RC8, and cut another RC9 on Monday to include 
> > >> > > SPARK-27160,
> > >> > > SPARK-27178, SPARK-27112. Please let me know if there is any critical
> > >> > > PR that has to be back-ported into branch-2.4.
> > >> > >
> > >> > > Thanks.
> > >> > >
> > >> > > Sincerely,
> > >> > >
> > >> > > DB Tsai
> > >> > > --
> > >> > > Web: https://www.dbtsai.com
> > >> > > PGP Key ID: 42E5B25A8F7A82C1
> > >> > >
> > >> > > On Fri, Mar 22, 2019 at 12:28 AM DB Tsai  wrote:
> > >> > > >
> > >> > > > Since we have couple concerns and hesitations to release rc8, how
> > >> > > > about we give it couple days, and have another vote on March 25,
> > >> > > > Monday? In this case, I will cut another rc9 in the Monday morning.
> > >> > > >
> > >> > > > Darcy, as Dongjoon mentioned,
> > >> > > > https://github.com/apache/spark/pull/24092 is conflict against
> > >> > > > branch-2.4, can you make anther PR against branch-2.4 so we can
> > >> > > > include the ORC fix in 2.4.1?
> > >> > > >
> > >> > > > Thanks.
> > >> > > >
> > >> > > > Sincerely,
> > >> > > >
> > >> > > > DB Tsai
> > >> > > > --
> > >> > > > Web: https://www.dbtsai.com
> > >> > > > PGP Key ID: 42E5B25A8F7A82C1
> > >> > > >
> > >> > > > On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung 
> > >> > > >  wrote:
> > >> > > > >
> > >> > > > > Reposting for shane here
> > >> > > > >
> > >> > > > > [SPARK-27178]
> > >> > > > > https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
> > >> > > > >
> > >> > > > > Is problematic too and it’s not in the rc8 cut
> > >> > > > >
> > >> > > > > https://github.com/apache/spark/commits/branch-2.4
> > >> > > > >
> > >> > > > > (Personally I don’t want to delay 2.4.1 either..)
> > >> > > > >
> > >> > > > > 
> > >> > > > > From: Sean Owen 
> > >> > > > > Sent: Wednesday, March 20, 2019 11:18 AM
> > >> > > > > To: DB Tsai
> > >> > > > > Cc: dev
> > >> > > > > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC8)
> > >> > > > >
> > >> > > > > +1 for this RC. The tag is correct, licenses and sigs check out, 
> > >> > > > > tests
> > >> > > > > of the source with most profiles enabled works for me.
> > >> > > > >
> > >> > > > > On Tue, Mar 19, 2019 at 5:28 PM DB Tsai 
> > >> > > > >  wrote:
> > >> > > > > >
> > >> > > > > > Please vote on releasing the following candidate as Apache 
> > >> > > > > > Spark version 2.4.1.
> > >> > > > > >
> > >> > > > > > The vote is open until March 23 PST and passes if a majority 
> > >> > > > > > +1 PMC votes are cast, with
> > >> > > > > > a minimum of 3 +1 votes.
> > >> > > > > >
> > >> > > > > > [ ] +1 Release this package as Apache Spark 2.4.1
> > >> > > > > > [ ] -1 Do not release this package because ...
> > >> > > > > >
> > >> > > > > > To learn more about Apache Spark, please see 
> > >> > > > > > http://spark.apache.org/
> > >> > > > > >
> > >> > > > > > The tag to be voted on is v2.4.1-rc8 (commit 
> > >> > > > > > 746b3ddee6f7ad3464e326228ea226f5b1f39a41):
> > >> > > > > > https://github.com/apache/spark/tree/v2.4.1-rc8
> > >> > > > > >
> > >> > > > > > The release files, including signatures, digests, etc. can be 
> > >> > > > > > found at:
> > >> > > > > > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-bin/
> > >> > > > > >
> > >> > > > > > Signatures used for Spark RCs can be found in this file:
> > 

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Reynold Xin
+1 on doing this in 3.0.

On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> I’m +1 if 3.0
>
> *From:* Sean Owen <sro...@gmail.com>
> *Sent:* Monday, March 25, 2019 6:48 PM
> *To:* Hyukjin Kwon
> *Cc:* dev; Bryan Cutler; Takuya UESHIN; shane knapp
> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]
>
> I don't know a lot about Arrow here, but seems reasonable. Is this for
> Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
> seems right.
>
> On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >
> > Hi all,
> >
> > We really need to upgrade the minimal version soon. It's actually
> > slowing down the PySpark dev, for instance, by the overhead that
> > sometimes we need currently to test all multiple matrix of Arrow and
> > Pandas. Also, it currently requires to add some weird hacks or ugly
> > codes. Some bugs exist in lower versions, and some features are not
> > supported in low PyArrow, for instance.
> >
> > Per, (Apache Arrow'+ Spark committer FWIW), Bryan's recommendation and
> > my opinion as well, we should better increase the minimal version to
> > 0.12.x. (Also, note that Pandas <> Arrow is an experimental feature).
> >
> > So, I and Bryan will proceed this roughly in few days if there isn't
> > objections assuming we're fine with increasing it to 0.12.x. Please let
> > me know if there are some concerns.
> >
> > For clarification, this requires some jobs in Jenkins to upgrade the
> > minimal version of PyArrow (I cc'ed Shane as well).
> >
> > PS: I roughly heard that Shane's busy for some work stuff .. but it's
> > kind of important in my perspective.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Felix Cheung
I’m +1 if 3.0



From: Sean Owen 
Sent: Monday, March 25, 2019 6:48 PM
To: Hyukjin Kwon
Cc: dev; Bryan Cutler; Takuya UESHIN; shane knapp
Subject: Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

I don't know a lot about Arrow here, but seems reasonable. Is this for
Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
seems right.

On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon  wrote:
>
> Hi all,
>
> We really need to upgrade the minimal version soon. It's actually slowing 
> down the PySpark dev, for instance, by the overhead that sometimes we need 
> currently to test all multiple matrix of Arrow and Pandas. Also, it currently 
> requires to add some weird hacks or ugly codes. Some bugs exist in lower 
> versions, and some features are not supported in low PyArrow, for instance.
>
> Per, (Apache Arrow'+ Spark committer FWIW), Bryan's recommendation and my 
> opinion as well, we should better increase the minimal version to 0.12.x. 
> (Also, note that Pandas <> Arrow is an experimental feature).
>
> So, I and Bryan will proceed this roughly in few days if there isn't 
> objections assuming we're fine with increasing it to 0.12.x. Please let me 
> know if there are some concerns.
>
> For clarification, this requires some jobs in Jenkins to upgrade the minimal 
> version of PyArrow (I cc'ed Shane as well).
>
> PS: I roughly heard that Shane's busy for some work stuff .. but it's kind of 
> important in my perspective.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra 
wrote:

> Maybe.
>
> And I expect that we will end up doing something based on spark.task.cpus
> in the short term. I'd just rather that this SPIP not make it look like
> this is the way things should ideally be done. I'd prefer that we be quite
> explicit in recognizing that this approach is a significant compromise, and
> I'd like to see at least some references to the beginning of serious
> longer-term efforts to do something better in a deeper re-design of
> resource scheduling.
>

It is also a feature I desire as a user. How about suggesting it as future
work in the SPIP? It certainly requires someone who fully understands the
Spark scheduler to drive it. Shall we start with a Spark JIRA? I don't know
the scheduler as well as you do, but I can speak for DL use cases. Maybe we
just view it from different angles. To you, an application-level request is
a significant compromise. To me, it provides a major milestone that brings
GPUs to Spark workloads. I know many users who tried to do DL on Spark and
ended up doing hacks here and there, which was a huge pain. The scope
covered by the current SPIP makes those users much happier. Tom and Andy
from NVIDIA are certainly better calibrated on the usefulness of the
current proposal.


>
> On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng  wrote:
>
>> There are certainly use cases where different stages require different
>> number of CPUs or GPUs under an optimal setting. I don't think anyone
>> disagrees that ideally users should be able to do it. We are just dealing
>> with typical engineering trade-offs and see how we break it down into
>> smaller ones. I think it is fair to treat the task-level resource request
>> as a separate feature here because it also applies to CPUs alone without
>> GPUs, as Tom mentioned above. But having "spark.task.cpus" only for many
>> years Spark is still able to cover many many use cases. Otherwise we
>> shouldn't see many Spark users around now. Here we just apply similar
>> arguments to GPUs.
>>
>> Initially, I was the person who really wanted task-level requests because
>> it is ideal. In an offline discussion, Andy Feng pointed out an
>> application-level setting should fit common deep learning training and
>> inference cases and it greatly simplifies necessary changes required to
>> Spark job scheduler. With Imran's feedback to the initial design sketch,
>> the application-level approach became my first choice because it is still
>> very valuable but much less risky. If a feature brings great value to
>> users, we should add it even it is not ideal.
>>
>> Back to the default value discussion, let's forget GPUs and only consider
>> CPUs. Would an application-level default number of CPU cores disappear if
>> we added task-level requests? If yes, does it mean that users have to
>> explicitly state the resource requirements for every single stage? It is
>> tedious to do and who do not fully understand the impact would probably do
>> it wrong and waste even more resources. Then how many cores each task
>> should use if user didn't specify it? I do see "spark.task.cpus" is the
>> answer here. The point I want to make is that "spark.task.cpus", though
>> less ideal, is still needed when we have task-level requests for CPUs.
>>
>> On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
>> wrote:
>>
>>> I remain unconvinced that a default configuration at the application
>>> level makes sense even in that case. There may be some applications where
>>> you know a priori that almost all the tasks for all the stages for all the
>>> jobs will need some fixed number of gpus; but I think the more common cases
>>> will be dynamic configuration at the job or stage level. Stage level could
>>> have a lot of overlap with barrier mode scheduling -- barrier mode stages
>>> having a need for an inter-task channel resource, gpu-ified stages needing
>>> gpu resources, etc. Have I mentioned that I'm not a fan of the current
>>> barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."
>>>
>>> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:
>>>
 Say if we support per-task resource requests in the future, it would be
 still inconvenient for users to declare the resource requirements for every
 single task/stage. So there must be some default values defined somewhere
 for task resource requirements. "spark.task.cpus" and
 "spark.task.accelerator.gpu.count" could serve for this purpose without
 introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
 separated necessary GPU support from risky scheduler changes.

 On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra 
 wrote:

> Of course there is an issue of the perfect becoming the enemy of the
> good, so I can understand the impulse to get something done. I am left
> wanting, however, at least something more of a roadmap to a task-level
> future than just a vague "we may choose to do something more in the
> 

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
Maybe.

And I expect that we will end up doing something based on spark.task.cpus
in the short term. I'd just rather that this SPIP not make it look like
this is the way things should ideally be done. I'd prefer that we be quite
explicit in recognizing that this approach is a significant compromise, and
I'd like to see at least some references to the beginning of serious
longer-term efforts to do something better in a deeper re-design of
resource scheduling.

On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng  wrote:

> There are certainly use cases where different stages require different
> number of CPUs or GPUs under an optimal setting. I don't think anyone
> disagrees that ideally users should be able to do it. We are just dealing
> with typical engineering trade-offs and see how we break it down into
> smaller ones. I think it is fair to treat the task-level resource request
> as a separate feature here because it also applies to CPUs alone without
> GPUs, as Tom mentioned above. But having "spark.task.cpus" only for many
> years Spark is still able to cover many many use cases. Otherwise we
> shouldn't see many Spark users around now. Here we just apply similar
> arguments to GPUs.
>
> Initially, I was the person who really wanted task-level requests because
> it is ideal. In an offline discussion, Andy Feng pointed out an
> application-level setting should fit common deep learning training and
> inference cases and it greatly simplifies necessary changes required to
> Spark job scheduler. With Imran's feedback to the initial design sketch,
> the application-level approach became my first choice because it is still
> very valuable but much less risky. If a feature brings great value to
> users, we should add it even it is not ideal.
>
> Back to the default value discussion, let's forget GPUs and only consider
> CPUs. Would an application-level default number of CPU cores disappear if
> we added task-level requests? If yes, does it mean that users have to
> explicitly state the resource requirements for every single stage? It is
> tedious to do and who do not fully understand the impact would probably do
> it wrong and waste even more resources. Then how many cores each task
> should use if user didn't specify it? I do see "spark.task.cpus" is the
> answer here. The point I want to make is that "spark.task.cpus", though
> less ideal, is still needed when we have task-level requests for CPUs.
>
> On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
> wrote:
>
>> I remain unconvinced that a default configuration at the application
>> level makes sense even in that case. There may be some applications where
>> you know a priori that almost all the tasks for all the stages for all the
>> jobs will need some fixed number of gpus; but I think the more common cases
>> will be dynamic configuration at the job or stage level. Stage level could
>> have a lot of overlap with barrier mode scheduling -- barrier mode stages
>> having a need for an inter-task channel resource, gpu-ified stages needing
>> gpu resources, etc. Have I mentioned that I'm not a fan of the current
>> barrier mode API, Xiangrui? :) Yes, I know: "Show me something better."
>>
>> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:
>>
>>> Say if we support per-task resource requests in the future, it would be
>>> still inconvenient for users to declare the resource requirements for every
>>> single task/stage. So there must be some default values defined somewhere
>>> for task resource requirements. "spark.task.cpus" and
>>> "spark.task.accelerator.gpu.count" could serve for this purpose without
>>> introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
>>> separated necessary GPU support from risky scheduler changes.
>>>
>>> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra 
>>> wrote:
>>>
 Of course there is an issue of the perfect becoming the enemy of the
 good, so I can understand the impulse to get something done. I am left
 wanting, however, at least something more of a roadmap to a task-level
 future than just a vague "we may choose to do something more in the
 future." At the risk of repeating myself, I don't think the
 existing spark.task.cpus is very good, and I think that building more on
 that weak foundation without a more clear path or stated intention to move
 to something better runs the risk of leaving Spark stuck in a bad
 neighborhood.

 On Thu, Mar 21, 2019 at 10:10 AM Tom Graves 
 wrote:

> While I agree with you that it would be ideal to have the task level
> resources and do a deeper redesign for the scheduler, I think that can be 
> a
> separate enhancement like was discussed earlier in the thread. That 
> feature
> is useful without GPU's.  I do realize that they overlap some but I think
> the changes for this will be minimal to the scheduler, follow existing
> conventions, and it is an improvement over what we have now. I know many

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
There are certainly use cases where different stages require different
numbers of CPUs or GPUs under an optimal setting. I don't think anyone
disagrees that ideally users should be able to do it. We are just dealing
with typical engineering trade-offs and seeing how to break them down into
smaller ones. I think it is fair to treat the task-level resource request
as a separate feature here because it also applies to CPUs alone without
GPUs, as Tom mentioned above. But with only "spark.task.cpus" for many
years, Spark has still been able to cover many, many use cases; otherwise
we wouldn't see so many Spark users around now. Here we just apply similar
arguments to GPUs.

Initially, I was the person who really wanted task-level requests because
they are ideal. In an offline discussion, Andy Feng pointed out that an
application-level setting should fit common deep learning training and
inference cases and that it greatly simplifies the changes required to the
Spark job scheduler. With Imran's feedback on the initial design sketch,
the application-level approach became my first choice because it is still
very valuable but much less risky. If a feature brings great value to
users, we should add it even if it is not ideal.

Back to the default value discussion: let's forget GPUs and only consider
CPUs. Would an application-level default number of CPU cores disappear if
we added task-level requests? If yes, does it mean that users would have
to explicitly state the resource requirements for every single stage? That
is tedious to do, and users who do not fully understand the impact would
probably do it wrong and waste even more resources. Then how many cores
should each task use if the user didn't specify it? I do see
"spark.task.cpus" as the answer here. The point I want to make is that
"spark.task.cpus", though less than ideal, is still needed even when we
have task-level requests for CPUs.
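
As a concrete illustration of the application-level defaults discussed
here, a job might set them once at session creation, roughly as in the
sketch below. The GPU setting uses the name floated in this thread; it is
a proposed config, not an existing one, and the final name may differ.

import org.apache.spark.sql.SparkSession

// Sketch only: application-wide defaults that every task inherits, which
// is exactly the trade-off debated in this thread.
val spark = SparkSession.builder()
  .appName("gpu-aware-app")
  .config("spark.executor.cores", "5")
  .config("spark.task.cpus", "1")
  // Proposed in the SPIP discussion; not a shipped Spark setting.
  .config("spark.task.accelerator.gpu.count", "1")
  .getOrCreate()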

On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra 
wrote:

> I remain unconvinced that a default configuration at the application level
> makes sense even in that case. There may be some applications where you
> know a priori that almost all the tasks for all the stages for all the jobs
> will need some fixed number of gpus; but I think the more common cases will
> be dynamic configuration at the job or stage level. Stage level could have
> a lot of overlap with barrier mode scheduling -- barrier mode stages having
> a need for an inter-task channel resource, gpu-ified stages needing gpu
> resources, etc. Have I mentioned that I'm not a fan of the current barrier
> mode API, Xiangrui? :) Yes, I know: "Show me something better."
>
> On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:
>
>> Say if we support per-task resource requests in the future, it would be
>> still inconvenient for users to declare the resource requirements for every
>> single task/stage. So there must be some default values defined somewhere
>> for task resource requirements. "spark.task.cpus" and
>> "spark.task.accelerator.gpu.count" could serve for this purpose without
>> introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
>> separated necessary GPU support from risky scheduler changes.
>>
>> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra 
>> wrote:
>>
>>> Of course there is an issue of the perfect becoming the enemy of the
>>> good, so I can understand the impulse to get something done. I am left
>>> wanting, however, at least something more of a roadmap to a task-level
>>> future than just a vague "we may choose to do something more in the
>>> future." At the risk of repeating myself, I don't think the
>>> existing spark.task.cpus is very good, and I think that building more on
>>> that weak foundation without a more clear path or stated intention to move
>>> to something better runs the risk of leaving Spark stuck in a bad
>>> neighborhood.
>>>
>>> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves 
>>> wrote:
>>>
 While I agree with you that it would be ideal to have the task level
 resources and do a deeper redesign for the scheduler, I think that can be a
 separate enhancement like was discussed earlier in the thread. That feature
 is useful without GPU's.  I do realize that they overlap some but I think
 the changes for this will be minimal to the scheduler, follow existing
 conventions, and it is an improvement over what we have now. I know many
 users will be happy to have this even without the task level scheduling as
 many of the conventions used now to scheduler gpus can easily be broken by
 one bad user. I think from the user point of view this gives many users
 an improvement and we can extend it later to cover more use cases.

 Tom
 On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <
 m...@clearstorydata.com> wrote:


 I understand the application-level, static, global nature
 of spark.task.accelerator.gpu.count and its similarity to the
 existing spark.task.cpus, but to me this feels like extending a 

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Sean Owen
I don't know a lot about Arrow here, but seems reasonable. Is this for
Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
seems right.

On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon  wrote:
>
> Hi all,
>
> We really need to upgrade the minimal version soon. It's actually slowing 
> down the PySpark dev, for instance, by the overhead that sometimes we need 
> currently to test all multiple matrix of Arrow and Pandas. Also, it currently 
> requires to add some weird hacks or ugly codes. Some bugs exist in lower 
> versions, and some features are not supported in low PyArrow, for instance.
>
> Per, (Apache Arrow'+ Spark committer FWIW), Bryan's recommendation and my 
> opinion as well, we should better increase the minimal version to 0.12.x. 
> (Also, note that Pandas <> Arrow is an experimental feature).
>
> So, I and Bryan will proceed this roughly in few days if there isn't 
> objections assuming we're fine with increasing it to 0.12.x. Please let me 
> know if there are some concerns.
>
> For clarification, this requires some jobs in Jenkins to upgrade the minimal 
> version of PyArrow (I cc'ed Shane as well).
>
> PS: I roughly heard that Shane's busy for some work stuff .. but it's kind of 
> important in my perspective.
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread shane knapp
thanks for the heads up...  i'll test deploy this tomorrow and see what
gotchas turn up.  we may need to upgrade from python 3.4 to 3.5 IIRC.

On Mon, Mar 25, 2019 at 6:16 PM Hyukjin Kwon  wrote:

> Hi all,
>
> We really need to upgrade the minimal version soon. It's actually slowing
> down the PySpark dev, for instance, by the overhead that sometimes we need
> currently to test all multiple matrix of Arrow and Pandas. Also, it
> currently requires to add some weird hacks or ugly codes. Some bugs exist
> in lower versions, and some features are not supported in low PyArrow, for
> instance.
>
> Per, (Apache Arrow'+ Spark committer FWIW), Bryan's recommendation and my
> opinion as well, we should better increase the minimal version to 0.12.x.
> (Also, note that Pandas <> Arrow is an experimental feature).
>
> So, I and Bryan will proceed this roughly in few days if there isn't
> objections assuming we're fine with increasing it to 0.12.x. Please let me
> know if there are some concerns.
>
> For clarification, this requires some jobs in Jenkins to upgrade the
> minimal version of PyArrow (I cc'ed Shane as well).
>
> PS: I roughly heard that Shane's busy for some work stuff .. but it's kind
> of important in my perspective.
>
>

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
I remain unconvinced that a default configuration at the application level
makes sense even in that case. There may be some applications where you
know a priori that almost all the tasks for all the stages for all the jobs
will need some fixed number of gpus; but I think the more common cases will
be dynamic configuration at the job or stage level. Stage level could have
a lot of overlap with barrier mode scheduling -- barrier mode stages having
a need for an inter-task channel resource, gpu-ified stages needing gpu
resources, etc. Have I mentioned that I'm not a fan of the current barrier
mode API, Xiangrui? :) Yes, I know: "Show me something better."
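
Purely as a thought experiment, the per-stage requests Mark is arguing
toward might look something like the sketch below. Nothing like this
existed in Spark at the time of this thread; the names (ResourceRequest,
withResources) are invented here only to make the idea concrete.

// Hypothetical per-stage resource request, not a real Spark interface.
final case class ResourceRequest(cpus: Int = 1, gpus: Int = 0)

// Imagined usage: attach requirements to the stage that needs them rather
// than one global spark.task.cpus / GPU count for the whole application:
//   rdd.withResources(ResourceRequest(cpus = 1, gpus = 2))
//      .mapPartitions(trainOnGpu)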

On Mon, Mar 25, 2019 at 3:55 PM Xiangrui Meng  wrote:

> Say if we support per-task resource requests in the future, it would be
> still inconvenient for users to declare the resource requirements for every
> single task/stage. So there must be some default values defined somewhere
> for task resource requirements. "spark.task.cpus" and
> "spark.task.accelerator.gpu.count" could serve for this purpose without
> introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
> separated necessary GPU support from risky scheduler changes.
>
> On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra 
> wrote:
>
>> Of course there is an issue of the perfect becoming the enemy of the
>> good, so I can understand the impulse to get something done. I am left
>> wanting, however, at least something more of a roadmap to a task-level
>> future than just a vague "we may choose to do something more in the
>> future." At the risk of repeating myself, I don't think the
>> existing spark.task.cpus is very good, and I think that building more on
>> that weak foundation without a more clear path or stated intention to move
>> to something better runs the risk of leaving Spark stuck in a bad
>> neighborhood.
>>
>> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves  wrote:
>>
>>> While I agree with you that it would be ideal to have the task level
>>> resources and do a deeper redesign for the scheduler, I think that can be a
>>> separate enhancement like was discussed earlier in the thread. That feature
>>> is useful without GPU's.  I do realize that they overlap some but I think
>>> the changes for this will be minimal to the scheduler, follow existing
>>> conventions, and it is an improvement over what we have now. I know many
>>> users will be happy to have this even without the task level scheduling as
>>> many of the conventions used now to scheduler gpus can easily be broken by
>>> one bad user. I think from the user point of view this gives many users
>>> an improvement and we can extend it later to cover more use cases.
>>>
>>> Tom
>>> On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <
>>> m...@clearstorydata.com> wrote:
>>>
>>>
>>> I understand the application-level, static, global nature
>>> of spark.task.accelerator.gpu.count and its similarity to the
>>> existing spark.task.cpus, but to me this feels like extending a weakness of
>>> Spark's scheduler, not building on its strengths. That is because I
>>> consider binding the number of cores for each task to an application
>>> configuration to be far from optimal. This is already far from the desired
>>> behavior when an application is running a wide range of jobs (as in a
>>> generic job-runner style of Spark application), some of which require or
>>> can benefit from multi-core tasks, others of which will just waste the
>>> extra cores allocated to their tasks. Ideally, the number of cores
>>> allocated to tasks would get pushed to an even finer granularity that jobs,
>>> and instead being a per-stage property.
>>>
>>> Now, of course, making allocation of general-purpose cores and
>>> domain-specific resources work in this finer-grained fashion is a lot more
>>> work than just trying to extend the existing resource allocation mechanisms
>>> to handle domain-specific resources, but it does feel to me like we should
>>> at least be considering doing that deeper redesign.
>>>
>>> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves 
>>> wrote:
>>>
>>> The proposal here is that all your resources are static and the gpu per
>>> task config is global per application, meaning you ask for a certain amount
>>> memory, cpu, GPUs for every executor up front just like you do today and
>>> every executor you get is that size.  This means that both static or
>>> dynamic allocation still work without explicitly adding more logic at this
>>> point. Since the config for gpu per task is global it means every task you
>>> want will need a certain ratio of cpu to gpu.  Since that is a global you
>>> can't really have the scenario you mentioned, all tasks are assuming to
>>> need GPU.  For instance. I request 5 cores, 2 GPUs, set 1 gpu per task for
>>> each executor.  That means that I could only run 2 tasks and 3 cores would
>>> be wasted.  The stage/task level configuration of resources was removed and
>>> is something we can do in a 
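
To spell out the arithmetic in Tom's example above, here is a tiny sketch
of the waste calculation under a global GPU-per-task setting: 5 cores and
2 GPUs per executor, 1 GPU per task, and (an assumption here) the default
of one CPU per task. The numbers come from his example and are
illustrative only.

// With a global GPU-per-task ratio, concurrent tasks per executor are
// bounded by the scarcer resource, so some cores can sit idle.
val executorCores = 5
val executorGpus  = 2
val cpusPerTask   = 1
val gpusPerTask   = 1

val concurrentTasks = math.min(executorCores / cpusPerTask,
                               executorGpus / gpusPerTask)          // 2 tasks
val wastedCores     = executorCores - concurrentTasks * cpusPerTask // 3 idle cores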

Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-25 Thread Hyukjin Kwon
Hi all,

We really need to upgrade the minimal version soon. It's actually slowing
down PySpark development, for instance, through the overhead of sometimes
having to test the full matrix of Arrow and Pandas versions. It also
currently requires adding some weird hacks or ugly code. Some bugs exist in
lower versions, and some features are not supported in older PyArrow, for
instance.

Per Bryan's recommendation (he is an Apache Arrow + Spark committer, FWIW)
and my opinion as well, we had better increase the minimal version to
0.12.x. (Also, note that Pandas <> Arrow is an experimental feature.)

So Bryan and I will proceed with this in roughly a few days if there are no
objections, assuming we're fine with increasing it to 0.12.x. Please let me
know if there are any concerns.

For clarification, this requires some jobs in Jenkins to upgrade the
minimal version of PyArrow (I cc'ed Shane as well).

PS: I roughly heard that Shane's busy with some work stuff .. but this is
kind of important from my perspective.


Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Sean Owen
That's all merged now. I think you're clear to start an RC.

On Mon, Mar 25, 2019 at 4:06 PM DB Tsai  wrote:
>
> I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961
> https://github.com/apache/spark/pull/24126 , anything critical that we
> have to wait for 2.4.1 release? Thanks!
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Sun, Mar 24, 2019 at 8:19 PM Sean Owen  wrote:
> >
> > Still waiting on a successful test - hope this one works.
> >
> > On Sun, Mar 24, 2019, 10:13 PM DB Tsai  wrote:
> >>
> >> Hello Sean,
> >>
> >> By looking at SPARK-26961 PR, seems it's ready to go. Do you think we
> >> can merge it into 2.4 branch soon?
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> --
> >> Web: https://www.dbtsai.com
> >> PGP Key ID: 42E5B25A8F7A82C1

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Sean Owen
This last test failed again, but I claim we've actually seen it pass:
https://github.com/apache/spark/pull/24126#issuecomment-476410462
Would anybody else endorse merging it into 2.4 to proceed? I'll kick
off one more test for good measure.

On Mon, Mar 25, 2019 at 4:33 PM Sean Owen  wrote:
>
> Don't wait on this, but, I was going to slip in a message in the 2.4.1
> docs saying that Scala 2.11 support is deprecated, as it will be gone
> in Spark 3. I'll bang that out right now.
> Still waiting on a clean test build for that last JIRA, but maybe
> about to happen.
>
> On Mon, Mar 25, 2019 at 4:06 PM DB Tsai  wrote:
> >
> > I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961
> > https://github.com/apache/spark/pull/24126 , anything critical that we
> > have to wait for 2.4.1 release? Thanks!
> >
> > Sincerely,
> >
> > DB Tsai




Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
Even if we support per-task resource requests in the future, it would still
be inconvenient for users to declare the resource requirements for every
single task/stage. So there must be some default values defined somewhere
for task resource requirements. "spark.task.cpus" and
"spark.task.accelerator.gpu.count" could serve this purpose without
introducing breaking changes. So I'm +1 on the updated SPIP. It fairly
separates the necessary GPU support from the risky scheduler changes.
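
For illustration, a minimal sketch of what such application-level defaults
could look like, together with the per-executor slot arithmetic they imply
(the GPU config name is only the one proposed in the SPIP and may change; the
numbers mirror the 5-core / 2-GPU example quoted below):

```scala
import org.apache.spark.SparkConf

// Application-wide defaults as discussed here; "spark.task.accelerator.gpu.count"
// is only the name proposed in the SPIP, not an existing config.
val conf = new SparkConf()
  .set("spark.executor.cores", "5")
  .set("spark.task.cpus", "1")
  .set("spark.task.accelerator.gpu.count", "1")

// With 2 GPUs per executor, concurrency is capped by the scarcest resource:
val executorCores = 5
val executorGpus  = 2
val taskCpus      = 1
val taskGpus      = 1
val slotsPerExecutor = math.min(executorCores / taskCpus, executorGpus / taskGpus) // 2
val idleCores        = executorCores - slotsPerExecutor * taskCpus                 // 3
```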

On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra 
wrote:

> Of course there is an issue of the perfect becoming the enemy of the good,
> so I can understand the impulse to get something done. I am left wanting,
> however, at least something more of a roadmap to a task-level future than
> just a vague "we may choose to do something more in the future." At the
> risk of repeating myself, I don't think the existing spark.task.cpus is
> very good, and I think that building more on that weak foundation without a
> more clear path or stated intention to move to something better runs the
> risk of leaving Spark stuck in a bad neighborhood.
>
> On Thu, Mar 21, 2019 at 10:10 AM Tom Graves  wrote:
>
>> While I agree with you that it would be ideal to have the task-level
>> resources and do a deeper redesign of the scheduler, I think that can be a
>> separate enhancement, as was discussed earlier in the thread. That feature
>> is useful without GPUs. I do realize that they overlap some, but I think
>> the changes for this will be minimal to the scheduler, follow existing
>> conventions, and it is an improvement over what we have now. I know many
>> users will be happy to have this even without the task-level scheduling, as
>> many of the conventions used now to schedule GPUs can easily be broken by
>> one bad user. I think from the user's point of view this gives many users
>> an improvement, and we can extend it later to cover more use cases.
>>
>> Tom
>> On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>>
>>
>> I understand the application-level, static, global nature
>> of spark.task.accelerator.gpu.count and its similarity to the
>> existing spark.task.cpus, but to me this feels like extending a weakness of
>> Spark's scheduler, not building on its strengths. That is because I
>> consider binding the number of cores for each task to an application
>> configuration to be far from optimal. This is already far from the desired
>> behavior when an application is running a wide range of jobs (as in a
>> generic job-runner style of Spark application), some of which require or
>> can benefit from multi-core tasks, others of which will just waste the
>> extra cores allocated to their tasks. Ideally, the number of cores
>> allocated to tasks would get pushed to an even finer granularity than jobs,
>> instead becoming a per-stage property.
>>
>> Now, of course, making allocation of general-purpose cores and
>> domain-specific resources work in this finer-grained fashion is a lot more
>> work than just trying to extend the existing resource allocation mechanisms
>> to handle domain-specific resources, but it does feel to me like we should
>> at least be considering doing that deeper redesign.
>>
>> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves 
>> wrote:
>>
>> The proposal here is that all your resources are static and the GPU-per-task
>> config is global per application, meaning you ask for a certain amount of
>> memory, CPU, and GPUs for every executor up front, just like you do today,
>> and every executor you get is that size. This means that both static and
>> dynamic allocation still work without explicitly adding more logic at this
>> point. Since the config for GPUs per task is global, it means every task you
>> want will need a certain ratio of CPU to GPU. Since that is global, you
>> can't really have the scenario you mentioned; all tasks are assumed to need
>> a GPU. For instance, I request 5 cores and 2 GPUs, and set 1 GPU per task
>> for each executor. That means that I could only run 2 tasks and 3 cores
>> would be wasted. The stage/task level configuration of resources was removed
>> and is something we can do in a separate SPIP.
>> We thought erroring would make it more obvious to the user. We could change
>> this to a warning if everyone thinks that is better, but I personally like
>> the error until we can implement the lower-level per-stage configuration.
>>
>> Tom
>>
>> On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <
>> marcogaid...@gmail.com> wrote:
>>
>>
>> Thanks for this SPIP.
>> I cannot comment on the docs, but just wanted to highlight one thing. In
>> page 5 of the SPIP, when we talk about DRA, I see:
>>
>> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each
>> task requires 1 CPU and 1GPU, then we shall throw an error on application
>> start because we shall always have at least 2 idle CPUs per executor"
>>
>> I am not sure this is a correct behavior. We 

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread Sean Owen
Don't wait on this, but, I was going to slip in a message in the 2.4.1
docs saying that Scala 2.11 support is deprecated, as it will be gone
in Spark 3. I'll bang that out right now.
Still waiting on a clean test build for that last JIRA, but maybe
about to happen.

On Mon, Mar 25, 2019 at 4:06 PM DB Tsai  wrote:
>
> I am going to cut a 2.4.1 rc9 soon tonight. Besides SPARK-26961
> https://github.com/apache/spark/pull/24126 , anything critical that we
> have to wait for 2.4.1 release? Thanks!
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 42E5B25A8F7A82C1
>
> On Sun, Mar 24, 2019 at 8:19 PM Sean Owen  wrote:
> >
> > Still waiting on a successful test - hope this one works.
> >
> > On Sun, Mar 24, 2019, 10:13 PM DB Tsai  wrote:
> >>
> >> Hello Sean,
> >>
> >> By looking at SPARK-26961 PR, seems it's ready to go. Do you think we
> >> can merge it into 2.4 branch soon?
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> --
> >> Web: https://www.dbtsai.com
> >> PGP Key ID: 42E5B25A8F7A82C1

Re: [VOTE] Release Apache Spark 2.4.1 (RC8)

2019-03-25 Thread DB Tsai
I am going to cut a 2.4.1 rc9 tonight. Besides SPARK-26961
(https://github.com/apache/spark/pull/24126), is there anything critical that
we have to wait for before the 2.4.1 release? Thanks!

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Sun, Mar 24, 2019 at 8:19 PM Sean Owen  wrote:
>
> Still waiting on a successful test - hope this one works.
>
> On Sun, Mar 24, 2019, 10:13 PM DB Tsai  wrote:
>>
>> Hello Sean,
>>
>> Looking at the SPARK-26961 PR, it seems ready to go. Do you think we
>> can merge it into the 2.4 branch soon?
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>>
>> On Sat, Mar 23, 2019 at 12:04 PM Sean Owen  wrote:
>> >
>> > I think we can/should get in SPARK-26961 too; it's all but ready to commit.
>> >
>> > On Sat, Mar 23, 2019 at 2:02 PM DB Tsai  wrote:
>> > >
>> > > -1
>> > >
>> > > I will fail RC8, and cut another RC9 on Monday to include SPARK-27160,
>> > > SPARK-27178, SPARK-27112. Please let me know if there is any critical
>> > > PR that has to be back-ported into branch-2.4.
>> > >
>> > > Thanks.
>> > >
>> > > Sincerely,
>> > >
>> > > DB Tsai
>> > > --
>> > > Web: https://www.dbtsai.com
>> > > PGP Key ID: 42E5B25A8F7A82C1
>> > >
>> > > On Fri, Mar 22, 2019 at 12:28 AM DB Tsai  wrote:
>> > > >
>> > > > > Since we have a couple of concerns and hesitations about releasing rc8, how
>> > > > > about we give it a couple of days and have another vote on March 25,
>> > > > > Monday? In this case, I will cut another rc9 on Monday morning.
>> > > > >
>> > > > > Darcy, as Dongjoon mentioned,
>> > > > > https://github.com/apache/spark/pull/24092 conflicts with
>> > > > > branch-2.4; can you make another PR against branch-2.4 so we can
>> > > > > include the ORC fix in 2.4.1?
>> > > >
>> > > > Thanks.
>> > > >
>> > > > Sincerely,
>> > > >
>> > > > DB Tsai
>> > > > --
>> > > > Web: https://www.dbtsai.com
>> > > > PGP Key ID: 42E5B25A8F7A82C1
>> > > >
>> > > > On Wed, Mar 20, 2019 at 9:11 PM Felix Cheung 
>> > > >  wrote:
>> > > > >
>> > > > > Reposting for shane here
>> > > > >
>> > > > > [SPARK-27178]
>> > > > > https://github.com/apache/spark/commit/342e91fdfa4e6ce5cc3a0da085d1fe723184021b
>> > > > >
>> > > > > Is problematic too and it’s not in the rc8 cut
>> > > > >
>> > > > > https://github.com/apache/spark/commits/branch-2.4
>> > > > >
>> > > > > (Personally I don’t want to delay 2.4.1 either..)
>> > > > >
>> > > > > 
>> > > > > From: Sean Owen 
>> > > > > Sent: Wednesday, March 20, 2019 11:18 AM
>> > > > > To: DB Tsai
>> > > > > Cc: dev
>> > > > > Subject: Re: [VOTE] Release Apache Spark 2.4.1 (RC8)
>> > > > >
>> > > > > +1 for this RC. The tag is correct, licenses and sigs check out, and tests
>> > > > > of the source with most profiles enabled work for me.
>> > > > >
>> > > > > On Tue, Mar 19, 2019 at 5:28 PM DB Tsai  
>> > > > > wrote:
>> > > > > >
>> > > > > > Please vote on releasing the following candidate as Apache Spark 
>> > > > > > version 2.4.1.
>> > > > > >
>> > > > > > The vote is open until March 23 PST and passes if a majority +1 
>> > > > > > PMC votes are cast, with
>> > > > > > a minimum of 3 +1 votes.
>> > > > > >
>> > > > > > [ ] +1 Release this package as Apache Spark 2.4.1
>> > > > > > [ ] -1 Do not release this package because ...
>> > > > > >
>> > > > > > To learn more about Apache Spark, please see 
>> > > > > > http://spark.apache.org/
>> > > > > >
>> > > > > > The tag to be voted on is v2.4.1-rc8 (commit 
>> > > > > > 746b3ddee6f7ad3464e326228ea226f5b1f39a41):
>> > > > > > https://github.com/apache/spark/tree/v2.4.1-rc8
>> > > > > >
>> > > > > > The release files, including signatures, digests, etc. can be 
>> > > > > > found at:
>> > > > > > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-bin/
>> > > > > >
>> > > > > > Signatures used for Spark RCs can be found in this file:
>> > > > > > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> > > > > >
>> > > > > > The staging repository for this release can be found at:
>> > > > > > https://repository.apache.org/content/repositories/orgapachespark-1318/
>> > > > > >
>> > > > > > The documentation corresponding to this release can be found at:
>> > > > > > https://dist.apache.org/repos/dist/dev/spark/v2.4.1-rc8-docs/
>> > > > > >
>> > > > > > The list of bug fixes going into 2.4.1 can be found at the 
>> > > > > > following URL:
>> > > > > > https://issues.apache.org/jira/projects/SPARK/versions/2.4.1
>> > > > > >
>> > > > > > FAQ
>> > > > > >
>> > > > > > =
>> > > > > > How can I help test this release?
>> > > > > > =
>> > > > > >
>> > > > > > If you are a Spark user, you can help us test this release by 
>> > > > > > taking
>> > > > > > an existing 

[DISCUSS] Spark Columnar Processing

2019-03-25 Thread Bobby Evans
This thread is to discuss adding in support for data frame processing using
an in-memory columnar format compatible with Apache Arrow.  My main goal in
this is to lay the groundwork so we can add in support for GPU accelerated
processing of data frames, but this feature has a number of other
benefits.  Spark currently supports Apache Arrow formatted data as an
option to exchange data with python for pandas UDF processing. There has
also been discussion around extending this to allow for exchanging data
with other tools like pytorch, tensorflow, xgboost,... If Spark supports
processing on Arrow compatible data it could eliminate the
serialization/deserialization overhead when going between these systems.
It also would allow for doing optimizations on a CPU with SIMD instructions
similar to what Hive currently supports. Accelerated processing using a GPU
is something that we will start a separate discussion thread on, but I
wanted to set the context a bit.
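
The existing Arrow exchange path referred to here is gated by a configuration
flag; a minimal sketch of enabling it (using the Spark 2.x config name):

```scala
import org.apache.spark.sql.SparkSession

// Enable the existing Arrow-based data exchange used when transferring
// data to Python for pandas UDFs.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("arrow-exchange")
  .getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
```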

Jason Lowe, Tom Graves, and I created a prototype over the past few months
to try and understand how to make this work.  What we are proposing is
based off of lessons learned when building this prototype, but we really
wanted to get feedback early on from the community. We will file a SPIP
once we can get agreement that this is a good direction to go in.

The current support for columnar processing lets a Parquet or ORC file
format return a ColumnarBatch inside an RDD[InternalRow] using Scala's type
erasure. The code generation is aware that the RDD actually holds
ColumnarBatches and generates code to loop through the data in each batch as
InternalRows.

Instead, we propose a new set of APIs to work on an
RDD[InternalColumnarBatch] instead of abusing type erasure. With this we
propose adding a Rule similar to how WholeStageCodeGen currently works.
Each part of the physical SparkPlan would expose columnar support through a
combination of traits and method calls. The rule would then decide when
columnar processing would start and when it would end. Switching between
columnar and row-based processing is not free, so the rule would make a
decision based on an estimate of the cost of the transformation and the
estimated speedup in processing time.
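
As a rough sketch of the shape such an API could take (ColumnarBatch and RDD
are existing Spark classes; the trait and cost-model names below are
hypothetical illustrations, not the proposed API itself):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical sketch: a physical operator advertises columnar support, and a
// planner rule decides, per plan subtree, whether to run it columnar or row-based.
trait ColumnarSupport {
  def supportsColumnar: Boolean = false
  def executeColumnar(): RDD[ColumnarBatch] =
    throw new UnsupportedOperationException("operator does not produce columnar data")
}

// Hypothetical cost model used by the rule: only switch to columnar when the
// estimated speedup outweighs the row<->columnar conversion cost.
case class ColumnarCostEstimate(conversionCost: Double, estimatedSpeedup: Double) {
  def worthConverting: Boolean = estimatedSpeedup > conversionCost
}
```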

This should allow us to disable columnar support by simply disabling the
rule that modifies the physical SparkPlan. It should be minimal risk to
the existing row-based code path, as that code should not be touched, and
in many cases could be reused to implement the columnar version. This also
allows for small, easily manageable patches rather than huge patches that no
one wants to review.

As far as the memory layout is concerned, OnHeapColumnVector and
OffHeapColumnVector are already really close to being Apache Arrow
compatible, so shifting them over would be a relatively simple change.
Alternatively, we could add a new implementation that is Arrow compatible
if there are reasons to keep the old ones.
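
On the Arrow-compatibility point, Spark already ships an Arrow-backed
ColumnVector implementation, which gives a feel for how small the gap is; a
minimal sketch (assuming the Arrow Java artifacts Spark already pulls in are
on the classpath):

```scala
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector}

// Build a small Arrow vector and expose it through Spark's ColumnVector API.
val allocator = new RootAllocator(Long.MaxValue)
val intVector = new IntVector("ints", allocator)
intVector.allocateNew(3)
(0 until 3).foreach(i => intVector.setSafe(i, i * 10))
intVector.setValueCount(3)

val sparkColumn: ColumnVector = new ArrowColumnVector(intVector)
assert(sparkColumn.getInt(2) == 20) // read back through Spark's columnar API
```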

Again this is just to get the discussion started, any feedback is welcome,
and we will file a SPIP on it once we feel like the major changes we are
proposing are acceptable.

Thanks,

Bobby Evans


Re: Scala 2.11 support removed for Spark 3.0.0

2019-03-25 Thread Darcy Shen
Cool, Scala 2.12 compiles faster than Scala 2.11.

But it runs slower than Scala 2.11 by default. We may enable some compiler
optimization options.
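
For reference, the kind of options meant here are the Scala 2.12
optimizer/inliner flags; in sbt this could look roughly like the following
(an illustrative sketch of the flags, not a change that has been made to
Spark's build):

```scala
// Scala 2.12 inliner flags (sbt syntax). Inlining is restricted to our own
// packages, since inlining from other jars ties the binary to exact versions.
scalacOptions ++= Seq(
  "-opt:l:inline",
  "-opt-inline-from:org.apache.spark.**",
  "-opt-warnings"
)
```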



On Mon, 25 Mar 2019 23:53:18 +0800 Sean Owen wrote:

> I merged https://github.com/apache/spark/pull/23098 . "-Pscala-2.11"
> won't work anymore in master. I think this shouldn't be a surprise or
> disruptive as 2.12 is already the default.
>
> The change isn't big and I think pretty reliable, but keep an eye out
> for issues.
>
> Shane you are welcome to remove the Scala 2.11 test job.
>
> We could proceed to make some more enhancements that require 2.12, but
> I think we got most of them in this PR.

 


Scala 2.11 support removed for Spark 3.0.0

2019-03-25 Thread Sean Owen
I merged https://github.com/apache/spark/pull/23098 . "-Pscala-2.11"
won't work anymore in master. I think this shouldn't be a surprise or
disruptive as 2.12 is already the default.

The change isn't big and I think pretty reliable, but keep an eye out
for issues.

Shane you are welcome to remove the Scala 2.11 test job.

We could proceed to make some more enhancements that require 2.12, but
I think we got most of them in this PR.




Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
Of course there is an issue of the perfect becoming the enemy of the good,
so I can understand the impulse to get something done. I am left wanting,
however, at least something more of a roadmap to a task-level future than
just a vague "we may choose to do something more in the future." At the
risk of repeating myself, I don't think the existing spark.task.cpus is
very good, and I think that building more on that weak foundation without a
more clear path or stated intention to move to something better runs the
risk of leaving Spark stuck in a bad neighborhood.

On Thu, Mar 21, 2019 at 10:10 AM Tom Graves  wrote:

> While I agree with you that it would be ideal to have the task-level
> resources and do a deeper redesign of the scheduler, I think that can be a
> separate enhancement, as was discussed earlier in the thread. That feature
> is useful without GPUs. I do realize that they overlap some, but I think
> the changes for this will be minimal to the scheduler, follow existing
> conventions, and it is an improvement over what we have now. I know many
> users will be happy to have this even without the task-level scheduling, as
> many of the conventions used now to schedule GPUs can easily be broken by
> one bad user. I think from the user's point of view this gives many users
> an improvement, and we can extend it later to cover more use cases.
>
> Tom
> On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <
> m...@clearstorydata.com> wrote:
>
>
> I understand the application-level, static, global nature
> of spark.task.accelerator.gpu.count and its similarity to the
> existing spark.task.cpus, but to me this feels like extending a weakness of
> Spark's scheduler, not building on its strengths. That is because I
> consider binding the number of cores for each task to an application
> configuration to be far from optimal. This is already far from the desired
> behavior when an application is running a wide range of jobs (as in a
> generic job-runner style of Spark application), some of which require or
> can benefit from multi-core tasks, others of which will just waste the
> extra cores allocated to their tasks. Ideally, the number of cores
> allocated to tasks would get pushed to an even finer granularity than jobs,
> instead becoming a per-stage property.
>
> Now, of course, making allocation of general-purpose cores and
> domain-specific resources work in this finer-grained fashion is a lot more
> work than just trying to extend the existing resource allocation mechanisms
> to handle domain-specific resources, but it does feel to me like we should
> at least be considering doing that deeper redesign.
>
> On Thu, Mar 21, 2019 at 7:33 AM Tom Graves 
> wrote:
>
> The proposal here is that all your resources are static and the GPU-per-task
> config is global per application, meaning you ask for a certain amount of
> memory, CPU, and GPUs for every executor up front, just like you do today,
> and every executor you get is that size. This means that both static and
> dynamic allocation still work without explicitly adding more logic at this
> point. Since the config for GPUs per task is global, it means every task you
> want will need a certain ratio of CPU to GPU. Since that is global, you
> can't really have the scenario you mentioned; all tasks are assumed to need
> a GPU. For instance, I request 5 cores and 2 GPUs, and set 1 GPU per task
> for each executor. That means that I could only run 2 tasks and 3 cores
> would be wasted. The stage/task level configuration of resources was removed
> and is something we can do in a separate SPIP.
> We thought erroring would make it more obvious to the user. We could change
> this to a warning if everyone thinks that is better, but I personally like
> the error until we can implement the lower-level per-stage configuration.
>
> Tom
>
> On Thursday, March 21, 2019, 1:45:01 AM PDT, Marco Gaido <
> marcogaid...@gmail.com> wrote:
>
>
> Thanks for this SPIP.
> I cannot comment on the docs, but just wanted to highlight one thing. In
> page 5 of the SPIP, when we talk about DRA, I see:
>
> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each task
> requires 1 CPU and 1GPU, then we shall throw an error on application start
> because we shall always have at least 2 idle CPUs per executor"
>
> I am not sure this is a correct behavior. We might have tasks requiring
> only CPU running in parallel as well, hence that may make sense. I'd rather
> emit a WARN or something similar. Anyway we just said we will keep GPU
> scheduling on task level out of scope for the moment, right?
>
> Thanks,
> Marco
>
> On Thu, Mar 21, 2019 at 1:26 AM Xiangrui Meng <m...@databricks.com> wrote:
>
> Steve, the initial work would focus on GPUs, but we will keep the
> interfaces general to support other accelerators in the future. This was
> mentioned in the SPIP and draft design.
>
> Imran, you should have comment permission now. Thanks for making a pass! I
> 

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Tom Graves
+1 on the updated SPIP.

Tom

On Monday, March 18, 2019, 12:56:22 PM CDT, Xingbo Jiang wrote:

Hi all,

I updated the SPIP doc and stories; I hope it now contains a clear scope of
the changes and enough details for the SPIP vote. Please review the updated
docs, thanks!
Xiangrui Meng wrote on Wed, Mar 6, 2019, at 8:35 AM:

How about letting Xingbo make a major revision to the SPIP doc to make it
clear what is being proposed? I like Felix's suggestion to switch to the new
Heilmeier template, which helps clarify what is proposed and what is not.
Then let's review the new SPIP and resume the vote.
On Tue, Mar 5, 2019 at 7:54 AM Imran Rashid  wrote:

OK, I suppose we are getting bogged down in what a vote on an SPIP means
anyway, which I guess we can set aside for now. With the level of detail in
this proposal, I feel like there is a reasonable chance I'd still -1 the
design or implementation.

And the other thing you're implicitly asking the community for is to
prioritize this feature for continued review and maintenance. There is
already work to be done in things like making barrier mode support dynamic
allocation (SPARK-24942), bugs in failure handling (e.g. SPARK-25250), and
general efficiency of failure handling (e.g. SPARK-25341, SPARK-20178). I'm
very concerned about getting spread too thin.


But if this is really just a vote on (1) is better GPU support important for
Spark, in some form, in some release? and (2) is it *possible* to do this in
a safe way? then I will vote +0.
On Tue, Mar 5, 2019 at 8:25 AM Tom Graves  wrote:

So to me most of the questions here are implementation/design questions. I've
had this issue in the past with SPIPs, where I expected to see more high-level
design details but was basically told that belongs in the design JIRA
follow-on. This makes me think we need to revisit what a SPIP really needs to
contain, which should be done in a separate thread. Note that, personally, I
would be for having more high-level details in it. But the way I read our
documentation on a SPIP right now, that detail is all optional; maybe we could
argue it's based on what reviewers request, but perhaps we should make the
wording of that more required. Thoughts? We should probably separate that
discussion if people want to talk about that.
For this SPIP in particular, the reason I +1'd it is because it came down to
2 questions:

1) Do I think Spark should support this? My answer is yes. I think this would
improve Spark; users have been requesting both better GPU support and support
for controlling container requests at a finer granularity for a while. If
Spark doesn't support this, then users may go to something else, so I think
we should support it.

2) Do I think it's possible to design and implement it without causing large
instabilities? My opinion here again is yes. I agree with Imran and others
that the scheduler piece needs to be looked at very closely, as we have had a
lot of issues there, and that is why I was asking for more details in the
design JIRA: https://issues.apache.org/jira/browse/SPARK-27005. But I do
believe it's possible to do.

If others have reservations on similar questions, then I think we should
resolve them here, or take the discussion of what a SPIP is to a different
thread and then come back to this. Thoughts?

Note there is already a high-level design for at least the core piece, which
is what people seem concerned with, so including it in the SPIP should be
straightforward.

Tom
On Monday, March 4, 2019, 2:52:43 PM CST, Imran Rashid wrote:
 On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng  wrote:

On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung  wrote:
IMO upfront allocation is less useful. Specifically too expensive for large 
jobs.

This is also an API/design discussion.

I agree with Felix -- this is more than just an API question. It has a huge
impact on the complexity of what you're proposing. You might be proposing big
changes to a core and brittle part of Spark, which is already short of
experts.

I don't see any value in having a vote on "does feature X sound cool?" We
have to evaluate the potential benefit against the risks the feature brings
and the continued maintenance cost. We don't need super low-level details,
but we have to have a sketch of the design to be able to make that tradeoff.