Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-26 Thread Mark Hamstra
wandered out of the context of this SPIP, I know. I'll at least +0 this SPIP, but I also couldn't let my concerns go unvoiced. On Mon, Mar 25, 2019 at 8:32 PM Xiangrui Meng wrote: > > > On Mon, Mar 25, 2019 at 8:07 PM Mark Hamstra > wrote: > >> Maybe. >> >> And I

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
swer here. The point I want to make is that "spark.task.cpus", though > less ideal, is still needed when we have task-level requests for CPUs. > > On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra > wrote: > >> I remain unconvinced that a default configuration at the

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
rly > separated necessary GPU support from risky scheduler changes. > > On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra > wrote: > >> Of course there is an issue of the perfect becoming the enemy of the >> good, so I can understand the impulse to get something done. I am lef

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
conventions used now to schedule gpus can easily be broken by > one bad user. I think from the user point of view this gives many users > an improvement and we can extend it later to cover more use cases. > > Tom > On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Hamstra <

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Mark Hamstra
I understand the application-level, static, global nature of spark.task.accelerator.gpu.count and its similarity to the existing spark.task.cpus, but to me this feels like extending a weakness of Spark's scheduler, not building on its strengths. That is because I consider binding the number of
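
For context, a minimal sketch of the static, application-level configuration under debate. `spark.task.cpus` is an existing Spark setting; the GPU key below is the name floated in the SPIP discussion, not a released API:

```scala
import org.apache.spark.SparkConf

// Sketch of application-level, static task resource requests.
// "spark.task.cpus" is an existing setting; the GPU count key is the
// name from the SPIP discussion, not a released API.
val conf = new SparkConf()
  .setAppName("gpu-workload")
  .set("spark.task.cpus", "1")                  // CPUs reserved per task
  .set("spark.task.accelerator.gpu.count", "1") // proposed: GPUs per task
```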

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Mark Hamstra
> Spark uses, which triggers it > - it doesn't work in 2.4.0 > > It's not a regression from 2.4.0, which is the immediate question. > There isn't even a Parquet fix available. > But I'm not even seeing why this is excuse-making? > > On Sun, Mar 10, 2019 at 8:44 PM Mark Hams

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Mark Hamstra
Now wait... we created a regression in 2.4.0. Arguably, we should have blocked that release until we had a fix; but the issue came up late in the release process and it looks to me like there wasn't an adequate fix immediately available, so we did something bad and released 2.4.0 with a known

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
I'll try to find some time, but it's really at a premium right now. On Mon, Mar 4, 2019 at 3:17 PM Xiangrui Meng wrote: > > > On Mon, Mar 4, 2019 at 3:10 PM Mark Hamstra > wrote: > >> :) Sorry, that was ambiguous. I was seconding Imran's comment. >> > > Co

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
:) Sorry, that was ambiguous. I was seconding Imran's comment. On Mon, Mar 4, 2019 at 3:09 PM Xiangrui Meng wrote: > > > On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra > wrote: > >> +1 >> > > Mark, just to be clear, are you +1 on the SPIP or Imran's point? >

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
+1 On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid wrote: > On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng wrote: > >> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung >> wrote: >> >>> IMO upfront allocation is less useful. Specifically too expensive for >>> large jobs. >>> >> >> This is also an

Re: [RESULT] [VOTE] Functional DataSourceV2 in Spark 3.0

2019-03-03 Thread Mark Hamstra
ue wrote: > > This vote fails with the following counts: > > 3 +1 votes: > >- Matt Cheah >- Ryan Blue >- Sean Owen (binding) > > 1 -0 vote: > >- Jose Torres > > 2 -1 votes: > >- Mark Hamstra (binding) >- Mridul Mura

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
ion is to this commitment for 3.0, but remember > that 3.0 is the next release so that we can remove deprecated APIs. It does > not mean that we aren't adding new features in that release and aren't > considering other goals. > > On Thu, Feb 28, 2019 at 10:12 AM Mark Hamstra > wrote:

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Then I'm -1. Setting new features as blockers of major releases is not proper project management, IMO. On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue wrote: > Mark, if this goal is adopted, "we" is the Apache Spark community. > > On Thu, Feb 28, 2019 at 9:52 AM Mark Hamstra

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Who is "we" in these statements, such as "we should consider a functional DSv2 implementation a blocker for Spark 3.0"? If it means those contributing to the DSv2 effort want to set their own goals, milestones, etc., then that is fine with me. If you mean that the Apache Spark project should

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Mark Hamstra
est a direction for the community to take, and I fully accept that the > decision is up to the community. I think it is reasonable to candidly state > how this matters; that context informs the discussion. > > On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra > wrote: > >> To your other

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-22 Thread Mark Hamstra
rom throwing out a date, I probably just restated what everyone > said. But I was 'summoned' :) > > On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra > wrote: > >> However, as other people mentioned, Spark 3.0 has many other major >>> features as well >>> >

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-08 Thread Mark Hamstra
There are 2. C'mon Marcelo, you can make it 3! On Fri, Feb 8, 2019 at 5:03 PM Marcelo Vanzin wrote: > Hi Takeshi, > > Since we only really have one +1 binding vote, do you want to extend > this vote a bit? > > I've been stuck on a few things but plan to test this (setting things > up now), but

Re: Trigger full GC during executor idle time?

2019-01-02 Thread Mark Hamstra
Without addressing whether the change is beneficial or not, I will note that the logic in the paper and the PR's description is incorrect: "During execution, some executor nodes finish the tasks assigned to them early and wait for the entire stage to complete before more tasks are assigned to

Re: A survey about IP clearance of Spark in UC Berkeley for donating to Apache

2018-11-28 Thread Mark Hamstra
Your history isn't really accurate. Years before Spark became an Apache project, the AMPlab and UC Berkeley placed the Spark code under a 3-clause BSD License and made the code publicly available. Later, a group of developers and Spark users from both inside and outside Berkeley brought Spark and

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Mark Hamstra
te 2.11 right now via announcement and/or Spark 2.4.1 soon. > Drop 2.11 support in Spark 3.0, and support only 2.12. > - (same as above, but add Scala 2.13 support if possible for Spark 3.0) > > > On Wed, Nov 7, 2018 at 12:32 PM Mark Hamstra > wrote: > > > > I'm not followin

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Mark Hamstra
I'm not following "exclude Scala 2.13". Is there something inherent in making 2.12 the default Scala version in Spark 3.0 that would prevent us from supporting the option of building with 2.13? On Tue, Nov 6, 2018 at 5:48 PM Sean Owen wrote: > That's possible here, sure. The issue is: would you

Re: What's a blocker?

2018-10-24 Thread Mark Hamstra
Yeah, I can pretty much agree with that. Before we get into release candidates, it's not as big a deal if something gets labeled as a blocker. Once we are into an RC, I'd like to see any discussions as to whether something is or isn't a blocker at least cross-referenced in the RC VOTE thread so

Re: About introduce function sum0 to Spark

2018-10-23 Thread Mark Hamstra
Fan wrote: > This is logically `sum( if(isnull(col), 0, col) )` right? > > On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛 wrote: > >> The name is from Apache Calcite, And it doesn’t matter, we can introduce >> our own. >> >> >> >> >> >> --- >

Re: About introduce function sum0 to Spark

2018-10-22 Thread Mark Hamstra
That's a horrible name. This is just a fold. On Mon, Oct 22, 2018 at 7:39 PM 陶 加涛 wrote: > Hi, in calcite, has the concept of sum0, here I quote the definition of > sum0: > > > > Sum0 is an aggregator which returns the sum of the values which > > go into it like Sum. It differs in that when no
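
A small sketch of the point being made, assuming a local SparkContext: a fold carries its own zero element, so it naturally returns 0 on empty input, which is exactly the sum0 behavior quoted above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Sum0AsFold {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sum0-as-fold").setMaster("local[*]"))
    val empty = sc.parallelize(Seq.empty[Int])

    // fold supplies a zero element, so summing an empty RDD yields 0 ...
    println(empty.fold(0)(_ + _)) // prints 0

    // ... whereas reduce has no zero and fails on empty input:
    // empty.reduce(_ + _)        // throws UnsupportedOperationException

    sc.stop()
  }
}
```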

Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-27 Thread Mark Hamstra
obably have to do some forks (at least for the CliDriver), the > thriftserver has a bunch of code which doesn't run under "startWithContext" > so we may have an issue there as well. > > > On Wed, Sep 26, 2018, 6:21 PM Mark Hamstra > wrote: > >> You're talking about

Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-26 Thread Mark Hamstra
You're talking about users starting Thriftserver or SqlShell from the command line, right? It's much easier if you are starting a Thriftserver programmatically so that you can register functions when initializing a SparkContext and then HiveThriftServer2.startWithContext using that context. On
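
A minimal sketch of that programmatic route; the UDF name and body are illustrative, but `HiveThriftServer2.startWithContext` is the real entry point:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerWithUdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("thriftserver-with-udfs")
      .enableHiveSupport()
      .getOrCreate()

    // Register custom functions before exposing the context over JDBC/ODBC.
    spark.udf.register("shout", (s: String) => s.toUpperCase + "!")

    // Start the Thriftserver against this already-initialized context, so
    // JDBC clients see the functions registered above.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}
```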

Re: ***UNCHECKED*** Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Mark Hamstra
That's overstated. We will also block for a data correctness issue -- and that is, arguably, what this is. On Wed, Sep 19, 2018 at 12:21 AM Reynold Xin wrote: > We also only block if it is a new regression. > > On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao > wrote: > >> Hi Marco, >> >> From my

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
ully removing it: for > example, if Pandas and TensorFlow no longer support Python 2 past some > point, that might be a good point to remove it. > > Matei > > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra > wrote: > > > > If we're going to do that, then we need to d

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
spark versions supporting > Py2 past the point where Py2 is no longer receiving security patches > > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra > wrote: > >> We could also deprecate Py2 already in the 2.4.0 release. >> >> On Sat, Sep 15, 2018 at 11:46 A

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
>> it more obvious to Pandas users, that will help the most. The other issue >> though is that a bunch of Pandas functions are just missing in Spark — it >> would be awesome to set up an umbrella JIRA to just track those and let >> people fill them in. >> >> Matei >&

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
It's not splitting hairs, Erik. It's actually very close to something that I think deserves some discussion (perhaps on a separate thread.) What I've been thinking about also concerns API "friendliness" or style. The original RDD API was very intentionally modeled on the Scala parallel collections

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Mark Hamstra
We could also deprecate Py2 already in the 2.4.0 release. On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson wrote: > In case this didn't make it onto this thread: > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove > it entirely on a later 3.x release. > > On Sat, Sep

Re: time for Apache Spark 3.0?

2018-09-06 Thread Mark Hamstra
Yes, that is why we have these annotations in the code and the corresponding labels appearing in the API documentation: https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java As long as it is properly annotated, we can change or
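
For reference, a sketch of how those annotations are applied; the class here is hypothetical:

```scala
import org.apache.spark.annotation.InterfaceStability

// Evolving: may change between minor releases. Stable: only between major
// releases. Unstable: may change at any time. Class name is hypothetical.
@InterfaceStability.Evolving
class MyEvolvingApi {
  def answer: Int = 42
}
```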

Re: Naming policy for packages

2018-08-15 Thread Mark Hamstra
While it is permissible to have a maven identity like "spark-foo" from "org.bar", I'll agree with Sean that avoiding that kind of name is often wiser. It is just too easy to slip into prohibited usage if the most popular, de facto identification turns out to become "spark-foo" instead of something

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-08 Thread Mark Hamstra
I'm inclined to agree. Just saying that it is not a regression doesn't really cut it when it is a now known data correctness issue. We need something a lot more than nothing before releasing 2.4.0. At a barest minimum, that has to be much more complete and publicly highlighted documentation of the

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Mark Hamstra
No reasonable amount of time is likely going to be sufficient to fully vet the code as a PR. I'm not entirely happy with the design and code as they currently are (and I'm still trying to find the time to more publicly express my thoughts and concerns), but I'm fine with them going into 2.4 much

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Mark Hamstra
See some of the related discussion under https://github.com/apache/spark/pull/21589 It feels to me like we need some kind of user code mechanism to signal policy preferences to Spark. This could also include ways to signal scheduling policy, which could include things like scheduling pool and/or
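
As a baseline for that discussion, the blunt mechanism users reach for today is an explicit repartition before the write; paths and counts below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("output-file-count").master("local[*]").getOrCreate()
val df = spark.range(1000000L).toDF("id")

// Exactly 8 output files, at the cost of a full shuffle:
df.repartition(8).write.parquet("/tmp/out-eight-files")

// A single output file; no shuffle, but no write parallelism either:
df.coalesce(1).write.parquet("/tmp/out-one-file")
```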

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
cked mirrors then we might have bigger problems, but > there the issue is verifying the download sigs in the first place. Those > would have to come from archive.apache.org. > > If you're up for it, yes that could be a fine security precaution. > > On Thu, Jul 19, 2018,

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
Is there or should there be some checking of digests just to make sure that we are really testing against the same thing in /tmp/test-spark that we are distributing from the archive? On Thu, Jul 19, 2018 at 11:15 AM Sean Owen wrote: > Ideally, that list is updated with each release, yes.
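
A sketch of the kind of check being suggested, assuming the expected digest string is fetched separately from archive.apache.org; the file name is illustrative:

```scala
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

// Hash the archive cached by the suite and compare it to the published digest.
def sha512Hex(path: String): String = {
  val bytes = Files.readAllBytes(Paths.get(path))
  MessageDigest.getInstance("SHA-512").digest(bytes)
    .map("%02x".format(_)).mkString
}

val expected = "..." // digest published on archive.apache.org, elided here
assert(sha512Hex("/tmp/test-spark/spark-2.2.0-bin-hadoop2.7.tgz") == expected)
```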

Re: time for Apache Spark 3.0?

2018-06-15 Thread Mark Hamstra
Changing major version numbers is not about new features or a vague notion that it is time to do something that will be seen to be a significant release. It is about breaking stable public APIs. I still remain unconvinced that the next version can't be 2.4.0. On Fri, Jun 15, 2018 at 1:34 AM Andy

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Mark Hamstra
+1 On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > Given that I expect at least a few people to be busy with Spark Summit next > week, I'm taking the liberty of setting an extended voting period. The

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Mark Hamstra
There is no hadoop-2.8 profile. Use hadoop-2.7, which is effectively hadoop-2.7+ On Fri, Jun 1, 2018 at 4:01 PM Nicholas Chammas wrote: > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 > using Flintrock . However, trying > to load

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Mark Hamstra
If I am understanding you correctly, you're just saying that the problem is that you know what you want to keep, not what you want to throw away, and that there is no unpersist DataFrames call based on that what-to-keep information. On Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas
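
A hedged sketch of that inversion: track the IDs you want to keep, then unpersist whatever else `getPersistentRDDs` reports; `keepIds` is application-specific:

```scala
import org.apache.spark.SparkContext

// Unpersist every cached RDD except the ones whose ids we want to keep.
def unpersistAllExcept(sc: SparkContext, keepIds: Set[Int]): Unit = {
  sc.getPersistentRDDs.foreach { case (id, rdd) =>
    if (!keepIds.contains(id)) rdd.unpersist(blocking = false)
  }
}
```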

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
ter locality of in-memory partitions. > > Regards, > Matthias > > On Sat, Apr 7, 2018 at 8:50 AM, Mark Hamstra <m...@clearstorydata.com> > wrote: > > Sorry, but I'm still not understanding this use case. Are you somehow > > creating additional scheduling poo

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
Sorry, but I'm still not understanding this use case. Are you somehow creating additional scheduling pools dynamically as Jobs execute? If so, that is a very unusual thing to do. Scheduling pools are intended to be statically configured -- initialized, living and dying with the Application. On
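
For context, a minimal sketch of normal pool usage: pools are defined statically (fairscheduler.xml, loaded via spark.scheduler.allocation.file) and a thread merely selects one per job; naming an undefined pool makes Spark create one on the fly with default settings, which is the dynamic behavior questioned here. The pool name is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf()
  .setAppName("pools").setMaster("local[*]")
  .set("spark.scheduler.mode", "FAIR"))

// Jobs submitted from this thread run in the "production" pool.
sc.setLocalProperty("spark.scheduler.pool", "production")
sc.parallelize(1 to 100).count()

// Revert this thread to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)
```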

Re: time for Apache Spark 3.0?

2018-04-05 Thread Mark Hamstra
As with Sean, I'm not sure that this will require a new major version, but we should also be looking at Java 9 & 10 support -- particularly with regard to their better functionality in a containerized environment (memory limits from cgroups, not sysconf; support for cpusets). In that regard, we

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
ing to > deprecate the fork and migrate all the work done on the fork into the main > line. > > > > -Matt Cheah > > > > *From: *Mark Hamstra <m...@clearstorydata.com> > *Date: *Monday, February 5, 2018 at 1:44 PM > *To: *Matt Cheah <mch...@palantir.com

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
That's good, but you should probably stop and consider whether the discussions that led up to this document's creation could have taken place on this dev list -- because if they could have, then they probably should have as part of the whole spark-on-k8s project becoming part of mainline spark

Re: Union in Spark context

2018-02-05 Thread Mark Hamstra
First, the public API cannot be changed except when there is a major version change, and there is no way that we are going to do Spark 3.0.0 just for this change. Second, the change would be a mistake since the two different union methods are quite different. The method in RDD only ever works on
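
A short sketch of the difference being pointed at: `RDD.union` is binary and nests one `UnionRDD` per call, while `SparkContext.union` takes a whole collection at once:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("union-variants").setMaster("local[*]"))
val rdds = (1 to 4).map(i => sc.parallelize(Seq(i)))

// Binary union, applied pairwise: builds a nested chain of UnionRDDs.
val chained = rdds.reduce(_ union _)

// Collection union: one flat UnionRDD over all inputs.
val flat = sc.union(rdds)
```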

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Mark Hamstra
https://github.com/apache/spark/pull/19717#discussion_r154502824 >> >> to the best of my understanding, neither of those poses a problem. If we >> based the image off of centos I'd also expect the licensing of any image >> deps to be compatible. >> >> On Thu, D

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-14 Thread Mark Hamstra
cases, as long as our images are bare bones, they can use the >> >> spark-driver/spark-executor images we publish as the base, and build >> their >> >> customization as a layer on top of it. >> >> >> >> I think the composability of docker images, makes

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-11-29 Thread Mark Hamstra
It's probably also worth considering whether there is only one, well-defined, correct way to create such an image or whether this is a reasonable avenue for customization. Part of why we don't do something like maintain and publish canonical Debian packages for Spark is because different

Re: Object in compiler mirror not found - maven build

2017-11-26 Thread Mark Hamstra
Or you just have zinc running but in a bad state. `zinc -shutdown` should kill it off and let you try again. On Sun, Nov 26, 2017 at 2:12 PM, Sean Owen wrote: > I'm not seeing that on OS X or Linux. It sounds a bit like you have an old > version of zinc or scala or something

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-14 Thread Mark Hamstra
e in >> question (e.g. Why are we downloading Spark in a test case ?). >> >> Thanks >> Shivaram >> >> On Wed, Sep 13, 2017 at 11:50 AM, Mark Hamstra <m...@clearstorydata.com> >> wrote: >> > Yeah, but that discussion and use case is a bit dif

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Mark Hamstra
Yeah, but that discussion and use case is a bit different -- providing a different route to download the final released and approved artifacts that were built using only acceptable artifacts and sources vs. building and checking prior to release using something that is not from an Apache mirror.

Re: Supporting Apache Aurora as a cluster manager

2017-09-11 Thread Mark Hamstra
While it may be worth creating the design doc and JIRA ticket so that we at least have a better idea and a record of what you are talking about, I kind of doubt that we are going to want to merge this into the Spark codebase. That's not because of anything specific to this Aurora effort, but

Re: SPIP: Spark on Kubernetes

2017-08-28 Thread Mark Hamstra
> > In my opinion, the fact that there are nearly no changes to spark-core, > and most of our changes are additive should go to prove that this adds > little complexity to the workflow of the committers. Actually (and somewhat perversely), the otherwise praiseworthy isolation of the Kubernetes

Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Mark Hamstra
This is another argument for getting the code to the point where this can default to "true": SQLConf.scala: val ADAPTIVE_EXECUTION_ENABLED = buildConf("spark.sql.adaptive.enabled") On Tue, Aug 22, 2017 at 12:27 PM, Reynold Xin wrote: > +1 > > > On Tue, Aug 22, 2017 at

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Mark Hamstra
Points 2, 3 and 4 of the Project Plan in that document (i.e. "port existing data sources using internal APIs to use the proposed public Data Source V2 API") have my full support. Really, I'd like to see that dog-fooding effort completed and the lessons learned from it fully digested before we remove

Re: a stage can belong to more than one job please?

2017-06-06 Thread Mark Hamstra
Yes, a Stage can be part of more than one Job. The jobIds field of Stage is used repeatedly in the DAGScheduler. On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > I read some code of Spark about stages. > > The constructor of Stage keeps the first job ID the stage was
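
A minimal way to observe this, assuming a local run with the UI open: the second action below launches a new job whose shuffle map stage already ran, so the UI reports it as skipped while the stage's jobIds field contains both jobs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("shared-stage").setMaster("local[*]"))

val counts = sc.parallelize(1 to 1000)
  .map(x => (x % 10, 1))
  .reduceByKey(_ + _) // introduces a shuffle map stage

counts.count()   // job 0: runs the shuffle map stage plus a result stage
counts.collect() // job 1: shares the same map stage, shown as "skipped"
```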

Re: Why did spark switch from AKKA to net / ...

2017-05-07 Thread Mark Hamstra
The point is that Spark's prior usage of Akka was limited enough that it could fairly easily be removed entirely instead of forcing particular architectural decisions on Spark's users. On Sun, May 7, 2017 at 1:14 PM, geoHeil wrote: > Thank you! > In the issue they

Re: Should we consider a Spark 2.1.1 release?

2017-03-19 Thread Mark Hamstra
That doesn't necessarily follow, Jacek. There is a point where too frequent releases decrease quality. That is because releases don't come for free -- each one demands a considerable amount of time from release managers, testers, etc. -- time that would otherwise typically be devoted to improving

Re: Spark Improvement Proposals

2017-03-09 Thread Mark Hamstra
-0 on voting on whether we need a vote. On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin wrote: > I'm fine without a vote. (are we voting on whether we need a vote?) > > > On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen wrote: > >> I think a VOTE is over-thinking

Re: Sharing data in columnar storage between two applications

2016-12-26 Thread Mark Hamstra
> On Dec 25, 2016, at 5:24 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > > Not so much about between applications, rather multiple frameworks within > an application, but still related: https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf > > On Sun, Dec 25, 2

Re: Shuffle intermidiate results not being cached

2016-12-26 Thread Mark Hamstra
Shuffle results are only reused if you are reusing the exact same RDD. If you are working with Dataframes that you have not explicitly cached, then they are going to be producing new RDDs within their physical plan creation and evaluation, so you won't get implicit shuffle reuse. This is what
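
A small sketch of the distinction, assuming a local session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-reuse").master("local[*]").getOrCreate()
import spark.implicits._

// Same RDD reference: the second action skips the shuffle map stage.
val byKey = spark.sparkContext.parallelize(1 to 1000)
  .map(x => (x % 10, 1)).reduceByKey(_ + _)
byKey.count()
byKey.count() // map stage reported as skipped

// Uncached DataFrame: each action re-plans into fresh RDDs, so the shuffle
// is executed again unless you call grouped.cache() first.
val grouped = spark.range(1000).groupBy($"id" % 10).count()
grouped.count()
grouped.count()
```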

Re: Sharing data in columnar storage between two applications

2016-12-25 Thread Mark Hamstra
Not so much about between applications, rather multiple frameworks within an application, but still related: https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf On Sun, Dec 25, 2016 at 8:12 PM, Kazuaki Ishizaki wrote: > Here is an interesting discussion to share data

Re: Can I add a new method to RDD class?

2016-12-07 Thread Mark Hamstra
The easiest way is probably with: mvn versions:set -DnewVersion=your_new_version On Wed, Dec 7, 2016 at 11:31 AM, Teng Long wrote: > Hi Holden, > > Can you please tell me how to edit version numbers efficiently? the > correct way? I'm really struggling with this and

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
You still have the problem that even within a single Job it is often the case that not every Exchange really wants to use the same number of shuffle partitions. On Tue, Nov 15, 2016 at 2:46 AM, Sean Owen wrote: > Once you get to needing this level of fine-grained control,

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
AFAIK, the adaptive shuffle partitioning still isn't completely ready to be made the default, and there are some corner issues that need to be addressed before this functionality is declared finished and ready. E.g., the current logic can make data skew problems worse by turning One Big Partition

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread Mark Hamstra
Take a look at spark.sql.adaptive.enabled and the ExchangeCoordinator. A single, fixed-size sql.shuffle.partitions is not the only way to control the number of partitions in an Exchange -- if you are willing to deal with code that is still off by default. On Mon, Nov 14, 2016 at 4:19 PM,
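
A minimal sketch of opting in, with illustrative values; the target-size key is the knob the ExchangeCoordinator uses in place of a fixed partition count:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("adaptive-exchange").master("local[*]").getOrCreate()
import spark.implicits._

// Off by default at this point; opting in per session:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
  (64 * 1024 * 1024).toString) // aim for ~64 MB post-shuffle partitions

// The ExchangeCoordinator now picks the partition count at runtime.
spark.range(1000000L).groupBy($"id" % 100).count().collect()
```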

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Mark Hamstra
the 2.0 release notes which I linked to. Here > they are > <http://spark.apache.org/releases/spark-release-2-0-0.html#deprecations> > again. > ​ > > On Tue, Oct 25, 2016 at 3:19 PM Mark Hamstra <m...@clearstorydata.com> > wrote: > >> No, I think our intent is that usin

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Mark Hamstra
What's changed since the last time we discussed these issues, about 7 months ago? Or, another way to formulate the question: What are the threshold criteria that we should use to decide when to end Scala 2.10 and/or Java 7 support? On Tue, Oct 25, 2016 at 8:36 AM, Sean Owen

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-24 Thread Mark Hamstra
age suffices, especially given we > have people from lots of language backgrounds here. > > > On Mon, Oct 24, 2016 at 6:11 PM Mark Hamstra <m...@clearstorydata.com> > wrote: > >> Alright, that does it! Who is responsible for this "straw-man" >> abuse

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-24 Thread Mark Hamstra
Alright, that does it! Who is responsible for this "straw-man" abuse that is becoming too commonplace in the Spark community? "Straw-man" does not mean something like "trial balloon" or "run it up the flagpole and see if anyone salutes", and I would really appreciate it if Spark developers would

Re: DAGScheduler.handleJobCancellation uses jobIdToStageIds for verification while jobIdToActiveJob for lookup?

2016-10-13 Thread Mark Hamstra
, Imran Rashid <iras...@cloudera.com> > wrote: > > Hi Jacek, > > > > doesn't look like there is any good reason -- Mark Hamstra might know > this > > best. Feel free to open a jira & pr for it, you can ping Mark, Kay > > Ousterhout, and me (@squ

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
documents say lots of confusing stuff, including that committers are > in practice given a vote. > > https://www.apache.org/foundation/voting.html > > I don't care either way, if someone wants me to sub committer for PMC in > the voting section, fine, we just need a clear outcome. > &g

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
I'm not a fan of the SEP acronym. Besides its prior established meaning of "Somebody else's problem", there are other inappropriate or offensive connotations, such as this Australian slang that often gets shortened to just "sep": http://www.urbandictionary.com/define.php?term=Seppo On Sun, Oct 9,

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-10-01 Thread Mark Hamstra
not be a blocker. > > > On Fri, Sep 30, 2016 at 8:34 AM, Mark Hamstra <m...@clearstorydata.com> > wrote: > >> 0 >> >> RC4 is causing a build regression for me on at least one of my machines. >> RC3 built and ran tests successfully, but the tests co

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-30 Thread Mark Hamstra
0 RC4 is causing a build regression for me on at least one of my machines. RC3 built and ran tests successfully, but the tests consistently fail with RC4 unless I revert 9e91a1009e6f916245b4d4018de1664ea3decfe7, "[SPARK-15703][SCHEDULER][CORE][WEBUI] Make ListenerBus event queue size configurable

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Mark Hamstra
d on this RC, we would need to cut > 2.0.2 immediately. > > > > > > On Tue, Sep 27, 2016 at 10:18 AM, Mark Hamstra <m...@clearstorydata.com> > wrote: > >> I've got a couple of build niggles that should really be investigated at >> some point (what look to be

Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Mark Hamstra
+1 And I'll dare say that for those with Spark in production, what is more important is that maintenance releases come out in a timely fashion than that new features are released one month sooner or later. On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin wrote: > We are 2

Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-25 Thread Mark Hamstra
Spark's branch-2.0 is a maintenance branch, effectively meaning that only bug-fixes will be added to it. There are other maintenance branches (such as branch-1.6) that are also receiving bug-fixes in theory, but not so much in fact as maintenance branches get older. The major and minor version

Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-23 Thread Mark Hamstra
Similar but not identical configuration (Java 8/macOS 10.12 with build/mvn -Phive -Phive-thriftserver -Phadoop-2.7 -Pyarn clean install); Similar but not identical failure: ... - line wrapper only initialized once when used as encoder outer scope Spark context available as 'sc' (master =

Re: spark roadmap

2016-08-29 Thread Mark Hamstra
At this point, there is no target date set for 2.1. That's something that we should do fairly soon, but right now there is at least a little room for discussion as to whether we want to continue with the same pace of releases that we targeted throughout the 1.x development cycles, or whether

Re: renaming "minor release" to "feature release"

2016-07-29 Thread Mark Hamstra
One issue worth at least considering is that our minor releases usually do not include only new features, but also many bug-fixes -- at least some of which often do not get backported into the next patch-level release. "Feature release" does not convey that information. On Thu, Jul 28, 2016 at

Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2016-07-23 Thread Mark Hamstra
ill. On Sat, Jul 23, 2016 at 3:50 PM, Koert Kuipers <ko...@tresata.com> wrote: > i care about signalling it in advance mostly. and given the performance > differences we do have some interest in pushing towards java 8 > > On Jul 23, 2016 6:10 PM, "Mark Hamstra" &

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-15 Thread Mark Hamstra
Yes. https://github.com/apache/spark/pull/11796 On Fri, Jul 15, 2016 at 2:50 PM, Krishna Sankar wrote: > Can't find the "spark-assembly-2.0.0-hadoop2.7.0.jar" after compilation. > Usually it is in the assembly/target/scala-2.11 > Has the packaging changed for 2.0.0 ? >

Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Mark Hamstra
You've got to satisfy my curiosity, though. Why would you want to run such a badly out-of-date version in production? I mean, 2.0.0 is just about ready for release, and lagging three full releases behind, with one of them being a major version release, is a long way from where Spark is now. On

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
e are people that develop for Spark on Windows. The > referenced issue is indeed Minor and has nothing to do with unit tests. > > > > *From:* Mark Hamstra [mailto:m...@clearstorydata.com] > *Sent:* Wednesday, June 22, 2016 4:09 PM > *To:* Marcelo Vanzin <van...@cloudera.com> &

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
It's also marked as Minor, not Blocker. On Wed, Jun 22, 2016 at 4:07 PM, Marcelo Vanzin wrote: > On Wed, Jun 22, 2016 at 4:04 PM, Ulanov, Alexander > wrote: > > -1 > > > > Spark Unit tests fail on Windows. Still not resolved, though marked as > >

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
SPARK-15893 is resolved as a duplicate of SPARK-15899. SPARK-15899 is Unresolved. On Wed, Jun 22, 2016 at 4:04 PM, Ulanov, Alexander wrote: > -1 > > Spark Unit tests fail on Windows. Still not resolved, though marked as > resolved. > >

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Mark Hamstra
make this consistent. > > But, I think the resolution is simple: it's not 'dangerous' to release > this and I don't think people who say they think this really do. So > just finish this release normally, and we're done. Even if you think > there's an argument against it, weig

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Mark Hamstra
This is not a Databricks vs. The World situation, and the fact that some persist in forcing every issue into that frame is getting annoying. There are good engineering and project-management reasons not to populate the long-term, canonical repository of Maven artifacts with what are known to be

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-03 Thread Mark Hamstra
It's not a question of whether the preview artifacts can be made available on Maven central, but rather whether they must be or should be. I've got no problems leaving these unstable, transitory artifacts out of the more permanent, canonical repository. On Fri, Jun 3, 2016 at 1:53 AM, Steve

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-20 Thread Mark Hamstra
This isn't yet a release candidate since, as Reynold mentioned in his opening post, preview releases are "not meant to be functional, i.e. they can and highly likely will contain critical bugs or documentation errors." Once we're at the point where we expect there not to be such bugs and errors,

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
ber, which is the case > with dynamic allocation. HDFS nodes aren't decreasing in number though, > and we can still colocate on those nodes, as always. > > On Thu, Apr 28, 2016 at 11:19 AM, Mark Hamstra <m...@clearstorydata.com> > wrote: > >> So you are only considering

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
after a work-load burst your cluster dynamically changes from 1 > workers to 1000, will the typical HDFS replication factor be sufficient to > retain access to the shuffle files in HDFS > > HDFS isn't resizing. Spark is. HDFS files should be HA and durable. > > On Thu,

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Yes, replicated and distributed shuffle materializations are a key requirement for maintaining performance in a fully elastic cluster where Executors aren't just reallocated across an essentially fixed number of Worker nodes, but rather the number of Workers itself is dynamic. Retaining the file

Re: Question about Scala style, explicit typing within transformation functions and anonymous val.

2016-04-17 Thread Mark Hamstra
dability. AFAIK, for this reason, that is > not (or rarely) used in Spark. > > 2016-04-17 15:54 GMT+09:00 Mark Hamstra <m...@clearstorydata.com>: > >> FWIW, 3 should work as just `.map(function)`. >> >> On Sat, Apr 16, 2016 at 11:48 PM, Reynold Xin <r...@datab

Re: Question about Scala style, explicit typing within transformation functions and anonymous val.

2016-04-17 Thread Mark Hamstra
FWIW, 3 should work as just `.map(function)`. On Sat, Apr 16, 2016 at 11:48 PM, Reynold Xin wrote: > Hi Hyukjin, > > Thanks for asking. > > For 1 the change is almost always better. > > For 2 it depends on the context. In general if the type is not obvious, it > helps
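
Spelled out on a throwaway RDD (the numbering of the styles comes from the quoted thread, which is truncated here):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("eta-expansion").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 10)

def function(x: Int): Int = x + 1

rdd.map(x => function(x)) // explicit anonymous function
rdd.map(function(_))      // placeholder syntax
rdd.map(function)         // eta-expansion: the `.map(function)` form above
```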
