Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-21 Thread Mark Hamstra
I understand the application-level, static, global nature of spark.task.accelerator.gpu.count and its similarity to the existing spark.task.cpus, but to me this feels like extending a weakness of Spark's scheduler, not building on its strengths. That is because I consider binding the number of core
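For context, a minimal sketch (spark-shell style Scala) of the static, application-level configuration being debated; the gpu property name is the one quoted in this thread's SPIP draft and is hypothetical here, not a released Spark setting:

    import org.apache.spark.SparkConf

    // Static, application-wide task resource requests, analogous to spark.task.cpus:
    // every task in the application claims the same fixed amounts.
    val conf = new SparkConf()
      .set("spark.task.cpus", "1")
      .set("spark.task.accelerator.gpu.count", "1") // name per the SPIP draft; hypothetical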

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
any of the conventions used now to schedule gpus can easily be broken by > one bad user. I think from the user point of view this gives many users > an improvement and we can extend it later to cover more use cases. > > Tom > On Thursday, March 21, 2019, 9:15:05 AM PDT, Mark Ham

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
IP. It fairly > separated necessary GPU support from risky scheduler changes. > > On Mon, Mar 25, 2019 at 8:39 AM Mark Hamstra > wrote: > >> Of course there is an issue of the perfect becoming the enemy of the >> good, so I can understand the impulse to get something done

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Mark Hamstra
"spark.task.cpus" is the > answer here. The point I want to make is that "spark.task.cpus", though > less ideal, is still needed when we have task-level requests for CPUs. > > On Mon, Mar 25, 2019 at 6:46 PM Mark Hamstra > wrote: > >> I remain unconvinced

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-26 Thread Mark Hamstra
fault of anyone currently contributing. I've wandered out of the context of this SPIP, I know. I'll at least +0 this SPIP, but I also couldn't let my concerns go unvoiced. On Mon, Mar 25, 2019 at 8:32 PM Xiangrui Meng wrote: > > > On Mon, Mar 25, 2019 at 8:07 PM Mark Hamst

Re: Object in compiler mirror not found - maven build

2017-11-26 Thread Mark Hamstra
Or you just have zinc running but in a bad state. `zinc -shutdown` should kill it off and let you try again. On Sun, Nov 26, 2017 at 2:12 PM, Sean Owen wrote: > I'm not seeing that on OS X or Linux. It sounds a bit like you have an old > version of zinc or scala or something installed. > > On Su

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-11-29 Thread Mark Hamstra
It's probably also worth considering whether there is only one, well-defined, correct way to create such an image or whether this is a reasonable avenue for customization. Part of why we don't do something like maintain and publish canonical Debian packages for Spark is because different organizati

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-14 Thread Mark Hamstra
to always be built that way. >> The >> >> driver and executor images, there may be cases where people want to >> >> customize it - (like putting all dependencies into it for example). >> >> In those cases, as long as our images are bare bones, t

Re: Publishing official docker images for KubernetesSchedulerBackend

2017-12-19 Thread Mark Hamstra
sion_r154502824 >> >> to the best of my understanding, neither of those poses a problem. If we >> based the image off of centos I'd also expect the licensing of any image >> deps to be compatible. >> >> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra >>

Re: Union in Spark context

2018-02-05 Thread Mark Hamstra
First, the public API cannot be changed except when there is a major version change, and there is no way that we are going to do Spark 3.0.0 just for this change. Second, the change would be a mistake since the two different union methods are quite different. The method in RDD only ever works on t

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
That's good, but you should probably stop and consider whether the discussions that led up to this document's creation could have taken place on this dev list -- because if they could have, then they probably should have as part of the whole spark-on-k8s project becoming part of mainline spark deve

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
issue that the work done on the fork was > isolated from the dev mailing list. Moving forward as we push our work into > mainline Spark, we aim to be transparent with the Spark community via the > Spark mailing list and Spark JIRA tickets. We’re specifically aiming to > deprecate the f

Re: time for Apache Spark 3.0?

2018-04-05 Thread Mark Hamstra
As with Sean, I'm not sure that this will require a new major version, but we should also be looking at Java 9 & 10 support -- particularly with regard to their better functionality in a containerized environment (memory limits from cgroups, not sysconf; support for cpusets). In that regard, we sho

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
Sorry, but I'm still not understanding this use case. Are you somehow creating additional scheduling pools dynamically as Jobs execute? If so, that is a very unusual thing to do. Scheduling pools are intended to be statically configured -- initialized, living and dying with the Application. On Sat
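For reference, a sketch of the intended static usage, assuming pools are pre-declared in fairscheduler.xml and sc is a live SparkContext:

    // Jobs submitted from this thread land in the named, statically configured pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")
    // ... run jobs ...
    sc.setLocalProperty("spark.scheduler.pool", null) // revert to the default pool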

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
ode, which is equivalent to FIFO. Providing a way to set the mode of > the default scheduler would be awesome. > > Regarding why fair scheduling showed generally better performance for > out-of-core datasets, I don't have a good answer. My guess was > isolated job scheduling and b

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Mark Hamstra
If I am understanding you correctly, you're just saying that the problem is that you know what you want to keep, not what you want to throw away, and that there is no unpersist DataFrames call based on that what-to-keep information. On Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas wrote: > I cer
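A hedged sketch of that keep-list workaround for plain RDDs (sc is a live SparkContext; keepIds is illustrative) -- for DataFrames, mapping getPersistentRDDs ids back to a given DataFrame is exactly the hard part under discussion:

    // Unpersist everything except the RDDs we have decided to keep.
    val keepIds: Set[Int] = Set(importantRdd.id) // hypothetical keep-list
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      if (!keepIds.contains(id)) rdd.unpersist(blocking = false)
    }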

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-01 Thread Mark Hamstra
There is no hadoop-2.8 profile. Use hadoop-2.7, which is effectively hadoop-2.7+ On Fri, Jun 1, 2018 at 4:01 PM Nicholas Chammas wrote: > I was able to successfully launch a Spark cluster on EC2 at 2.3.1 RC4 > using Flintrock . However, trying > to load the

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Mark Hamstra
+1 On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > Given that I expect at least a few people to be busy with Spark Summit next > week, I'm taking the liberty of setting an extended voting period. The vot

Re: time for Apache Spark 3.0?

2018-06-15 Thread Mark Hamstra
Changing major version numbers is not about new features or a vague notion that it is time to do something that will be seen to be a significant release. It is about breaking stable public APIs. I still remain unconvinced that the next version can't be 2.4.0. On Fri, Jun 15, 2018 at 1:34 AM Andy

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
Is there or should there be some checking of digests just to make sure that we are really testing against the same thing in /tmp/test-spark that we are distributing from the archive? On Thu, Jul 19, 2018 at 11:15 AM Sean Owen wrote: > Ideally, that list is updated with each release, yes. Non-cur

Re: Cleaning Spark releases from mirrors, and the flakiness of HiveExternalCatalogVersionsSuite

2018-07-19 Thread Mark Hamstra
cked mirrors then we might have bigger problems, but > there the issue is verifying the download sigs in the first place. Those > would have to come from archive.apache.org. > > If you're up for it, yes that could be a fine security precaution. > > On Thu, Jul 19, 2018, 2:11 PM

Re: [DISCUSS][SQL] Control the number of output files

2018-07-25 Thread Mark Hamstra
See some of the related discussion under https://github.com/apache/spark/pull/21589 It feels to me like we need some kind of user code mechanism to signal policy preferences to Spark. This could also include ways to signal scheduling policy, which could include things like scheduling pool and/or b
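Absent such a mechanism, the usual user-code lever for output file count is an explicit coalesce or repartition before the write, e.g. (paths illustrative):

    // Shrink to ~16 output files without a full shuffle where possible:
    df.coalesce(16).write.parquet("/path/out1")
    // Or force an exact post-shuffle partition count (incurs a shuffle):
    df.repartition(16).write.parquet("/path/out2")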

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Mark Hamstra
No reasonable amount of time is likely going to be sufficient to fully vet the code as a PR. I'm not entirely happy with the design and code as they currently are (and I'm still trying to find the time to more publicly express my thoughts and concerns), but I'm fine with them going into 2.4 much as

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-08 Thread Mark Hamstra
I'm inclined to agree. Just saying that it is not a regression doesn't really cut it when it is a now known data correctness issue. We need something a lot more than nothing before releasing 2.4.0. At a barest minimum, that has to be much more complete and publicly highlighted documentation of the

Re: Naming policy for packages

2018-08-15 Thread Mark Hamstra
While it is permissible to have a Maven identifier like "spark-foo" from "org.bar", I'll agree with Sean that avoiding that kind of name is often wiser. It is just too easy to slip into prohibited usage if the most popular, de facto identification turns out to become "spark-foo" instead of something

Re: time for Apache Spark 3.0?

2018-09-06 Thread Mark Hamstra
Yes, that is why we have these annotations in the code and the corresponding labels appearing in the API documentation: https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java As long as it is properly annotated, we can change or ev
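For illustration, a sketch of marking an API with one of those annotations (annotation per the linked file; the class itself is hypothetical):

    import org.apache.spark.annotation.InterfaceStability

    // Evolving: may change between minor releases, and the API docs say so.
    @InterfaceStability.Evolving
    class SomeNewFeatureApi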

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Mark Hamstra
We could also deprecate Py2 already in the 2.4.0 release. On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson wrote: > In case this didn't make it onto this thread: > > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove > it entirely on a later 3.x release. > > On Sat, Sep 15

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
It's not splitting hairs, Erik. It's actually very close to something that I think deserves some discussion (perhaps on a separate thread.) What I've been thinking about also concerns API "friendliness" or style. The original RDD API was very intentionally modeled on the Scala parallel collections

Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
make >> it more obvious to Pandas users, that will help the most. The other issue >> though is that a bunch of Pandas functions are just missing in Spark — it >> would be awesome to set up an umbrella JIRA to just track those and let >> people fill them in. >> >> M

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
some spark versions supporting > Py2 past the point where Py2 is no longer receiving security patches > > > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra > wrote: > >> We could also deprecate Py2 already in the 2.4.0 release. >> >> On Sat, Sep 15, 2018 at 11:46 A

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Mark Hamstra
ng before fully removing it: for > example, if Pandas and TensorFlow no longer support Python 2 past some > point, that might be a good point to remove it. > > Matei > > > On Sep 17, 2018, at 11:01 AM, Mark Hamstra > wrote: > > > > If we're going to do tha

Re: ***UNCHECKED*** Re: Re: Re: [VOTE] SPARK 2.3.2 (RC6)

2018-09-19 Thread Mark Hamstra
That's overstated. We will also block for a data correctness issue -- and that is, arguably, what this is. On Wed, Sep 19, 2018 at 12:21 AM Reynold Xin wrote: > We also only block if it is a new regression. > > On Wed, Sep 19, 2018 at 12:18 AM Saisai Shao > wrote: > >> Hi Marco, >> >> From my u

Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-26 Thread Mark Hamstra
You're talking about users starting Thriftserver or SqlShell from the command line, right? It's much easier if you are starting a Thriftserver programmatically so that you can register functions when initializing a SparkContext and then HiveThriftServer2.startWithContext using that context. On We
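A minimal sketch of that programmatic route (startWithContext is a real API; the UDF is illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val spark = SparkSession.builder()
      .appName("thriftserver-with-custom-functions")
      .enableHiveSupport()
      .getOrCreate()

    // Register custom functions before exposing the context over JDBC/ODBC.
    spark.udf.register("plus_one", (x: Long) => x + 1)

    // JDBC clients connecting to this server now see the registered functions.
    HiveThriftServer2.startWithContext(spark.sqlContext)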

Re: Adding Extension to Load Custom functions into Thriftserver/SqlShell

2018-09-27 Thread Mark Hamstra
'll > probably have to do some forks (at least for the CliDriver), the > thriftserver has a bunch of code which doesn't run under "startWithContext" > so we may have an issue there as well. > > > On Wed, Sep 26, 2018, 6:21 PM Mark Hamstra > wrote: > >

Re: About introduce function sum0 to Spark

2018-10-22 Thread Mark Hamstra
That's a horrible name. This is just a fold. On Mon, Oct 22, 2018 at 7:39 PM 陶 加涛 wrote: > Hi, Calcite has the concept of sum0; here I quote the definition of > sum0: > > > > Sum0 is an aggregator which returns the sum of the values which > > go into it like Sum. It differs in that when no n

Re: About introduce function sum0 to Spark

2018-10-23 Thread Mark Hamstra
2:23 AM Wenchen Fan wrote: > This is logically `sum( if(isnull(col), 0, col) )` right? > > On Tue, Oct 23, 2018 at 2:58 PM 陶 加涛 wrote: > >> The name is from Apache Calcite, And it doesn’t matter, we can introduce >> our own. >> >> >> >> >> &
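A sketch of sum0 semantics with the existing DataFrame API; note that the sum(if(isnull(col), 0, col)) form quoted above still returns null over zero input rows, so the empty case needs a coalesce around the aggregate itself (df and column "x" are illustrative):

    import org.apache.spark.sql.functions._

    // sum0: like sum, but yields 0 rather than null when there are no input rows.
    df.agg(coalesce(sum(col("x")), lit(0)).as("sum0"))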

Re: What's a blocker?

2018-10-24 Thread Mark Hamstra
Yeah, I can pretty much agree with that. Before we get into release candidates, it's not as big a deal if something gets labeled as a blocker. Once we are into an RC, I'd like to see any discussions as to whether something is or isn't a blocker at least cross-referenced in the RC VOTE thread so tha

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Mark Hamstra
I'm not following "exclude Scala 2.13". Is there something inherent in making 2.12 the default Scala version in Spark 3.0 that would prevent us from supporting the option of building with 2.13? On Tue, Nov 6, 2018 at 5:48 PM Sean Owen wrote: > That's possible here, sure. The issue is: would you

Re: Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-07 Thread Mark Hamstra
- Deprecate 2.11 right now via announcement and/or Spark 2.4.1 soon. > Drop 2.11 support in Spark 3.0, and support only 2.12. > - (same as above, but add Scala 2.13 support if possible for Spark 3.0) > > > On Wed, Nov 7, 2018 at 12:32 PM Mark Hamstra > wrote: > > > > I'

Re: A survey about IP clearance of Spark in UC Berkeley for donating to Apache

2018-11-28 Thread Mark Hamstra
Your history isn't really accurate. Years before Spark became an Apache project, the AMPlab and UC Berkeley placed the Spark code under a 3-clause BSD License and made the code publicly available. Later, a group of developers and Spark users from both inside and outside Berkeley brought Spark and t

Re: Trigger full GC during executor idle time?

2019-01-02 Thread Mark Hamstra
Without addressing whether the change is beneficial or not, I will note that the logic in the paper and the PR's description is incorrect: "During execution, some executor nodes finish the tasks assigned to them early and wait for the entire stage to complete before more tasks are assigned to them,

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-08 Thread Mark Hamstra
There are 2. C'mon Marcelo, you can make it 3! On Fri, Feb 8, 2019 at 5:03 PM Marcelo Vanzin wrote: > Hi Takeshi, > > Since we only really have one +1 binding vote, do you want to extend > this vote a bit? > > I've been stuck on a few things but plan to test this (setting things > up now), but i

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-22 Thread Mark Hamstra
g to get. The > perfect is the enemy of the good. > > Aside from throwing out a date, I probably just restated what everyone > said. But I was 'summoned' :) > > On Fri, Feb 22, 2019 at 12:40 PM Mark Hamstra > wrote: > >> However, as other people mentioned, Sp

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-24 Thread Mark Hamstra
munity, to > suggest a direction for the community to take, and I fully accept that the > decision is up to the community. I think it is reasonable to candidly state > how this matters; that context informs the discussion. > > On Fri, Feb 22, 2019 at 1:55 PM Mark Hamstra > wrote: >

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Who is "we" in these statements, such as "we should consider a functional DSv2 implementation a blocker for Spark 3.0"? If it means those contributing to the DSv2 effort want to set their own goals, milestones, etc., then that is fine with me. If you mean that the Apache Spark project should offici

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Then I'm -1. Setting new features as blockers of major releases is not proper project management, IMO. On Thu, Feb 28, 2019 at 10:06 AM Ryan Blue wrote: > Mark, if this goal is adopted, "we" is the Apache Spark community. > > On Thu, Feb 28, 2019 at 9:52 AM Mark Hamst

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
ike your objection is to this commitment for 3.0, but remember > that 3.0 is the next release so that we can remove deprecated APIs. It does > not mean that we aren't adding new features in that release and aren't > considering other goals. > > On Thu, Feb 28, 2019 at 10:12 AM

Re: [RESULT] [VOTE] Functional DataSourceV2 in Spark 3.0

2019-03-03 Thread Mark Hamstra
an Blue wrote: > > This vote fails with the following counts: > > 3 +1 votes: > >- Matt Cheah >- Ryan Blue >- Sean Owen (binding) > > 1 -0 vote: > >- Jose Torres > > 2 -1 votes: > >- Mark Hamstra (binding) >- Mridul Muralidh

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
+1 On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid wrote: > On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng wrote: > >> On Sun, Mar 3, 2019 at 10:20 AM Felix Cheung >> wrote: >> >>> IMO upfront allocation is less useful. Specifically too expensive for >>> large jobs. >>> >> >> This is also an API/de

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
:) Sorry, that was ambiguous. I was seconding Imran's comment. On Mon, Mar 4, 2019 at 3:09 PM Xiangrui Meng wrote: > > > On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra > wrote: > >> +1 >> > > Mark, just to be clear, are you +1 on the SPIP or Imran's po

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Mark Hamstra
I'll try to find some time, but it's really at a premium right now. On Mon, Mar 4, 2019 at 3:17 PM Xiangrui Meng wrote: > > > On Mon, Mar 4, 2019 at 3:10 PM Mark Hamstra > wrote: > >> :) Sorry, that was ambiguous. I was seconding Imran's comment. >&g

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Mark Hamstra
Now wait... we created a regression in 2.4.0. Arguably, we should have blocked that release until we had a fix; but the issue came up late in the release process and it looks to me like there wasn't an adequate fix immediately available, so we did something bad and released 2.4.0 with a known regre

Re: [VOTE] Release Apache Spark 2.4.1 (RC6)

2019-03-10 Thread Mark Hamstra
Avro than > Spark uses, which triggers it > - it doesn't work in 2.4.0 > > It's not a regression from 2.4.0, which is the immediate question. > There isn't even a Parquet fix available. > But I'm not even seeing why this is excuse-making? > > On Sun, Mar

Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Mark Hamstra
You've got to satisfy my curiosity, though. Why would you want to run such a badly out-of-date version in production? I mean, 2.0.0 is just about ready for release, and lagging three full releases behind, with one of them being a major version release, is a long way from where Spark is now. On W

Re: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-15 Thread Mark Hamstra
Yes. https://github.com/apache/spark/pull/11796 On Fri, Jul 15, 2016 at 2:50 PM, Krishna Sankar wrote: > Can't find the "spark-assembly-2.0.0-hadoop2.7.0.jar" after compilation. > Usually it is in the assembly/target/scala-2.11 > Has the packaging changed for 2.0.0 ? > Cheers > > > On Thu, Jul

Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2016-07-23 Thread Mark Hamstra
Why the push to remove Java 7 support as soon as possible (which is how I read your "cluster admins plan to migrate by date X, so Spark should end Java 7 support then, too")? First, I don't think we should be removing Java 7 support until some time after all or nearly all relevant clusters are act

Re: drop java 7 support for spark 2.1.x or spark 2.2.x

2016-07-23 Thread Mark Hamstra
le still. On Sat, Jul 23, 2016 at 3:50 PM, Koert Kuipers wrote: > i care about signalling it in advance mostly. and given the performance > differences we do have some interest in pushing towards java 8 > > On Jul 23, 2016 6:10 PM, "Mark Hamstra" wrote: > > Why

Re: renaming "minor release" to "feature release"

2016-07-29 Thread Mark Hamstra
One issue worth at least considering is that our minor releases usually do not include only new features, but also many bug-fixes -- at least some of which often do not get backported into the next patch-level release. "Feature release" does not convey that information. On Thu, Jul 28, 2016 at 8:

Re: spark roadmap

2016-08-29 Thread Mark Hamstra
At this point, there is no target date set for 2.1. That's something that we should do fairly soon, but right now there is at least a little room for discussion as to whether we want to continue with the same pace of releases that we targeted throughout the 1.x development cycles, or whether lengt

Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-23 Thread Mark Hamstra
Similar but not identical configuration (Java 8/macOS 10.12 with build/mvn -Phive -Phive-thriftserver -Phadoop-2.7 -Pyarn clean install); Similar but not identical failure: ... - line wrapper only initialized once when used as encoder outer scope Spark context available as 'sc' (master = local-c

Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-25 Thread Mark Hamstra
Spark's branch-2.0 is a maintenance branch, effectively meaning that only bug-fixes will be added to it. There are other maintenance branches (such as branch-1.6) that are also receiving bug-fixes in theory, but not so much in fact as maintenance branches get older. The major and minor version nu

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Mark Hamstra
I've got a couple of build niggles that should really be investigated at some point (what look to be OOM issues in spark-repl when building and testing with mvn in a single pass instead of in two passes with -DskipTests first; the killing of ssh sessions by YarnClusterSuite), but these aren't anyth

Re: [discuss] Spark 2.x release cadence

2016-09-27 Thread Mark Hamstra
+1 And I'll dare say that for those with Spark in production, what is more important is that maintenance releases come out in a timely fashion than that new features are released one month sooner or later. On Tue, Sep 27, 2016 at 12:06 PM, Reynold Xin wrote: > We are 2 months past releasing Spa

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-27 Thread Mark Hamstra
we would need to cut > 2.0.2 immediately. > > > > > > On Tue, Sep 27, 2016 at 10:18 AM, Mark Hamstra > wrote: > >> I've got a couple of build niggles that should really be investigated at >> some point (what look to be OOM issues in spark-repl when

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-30 Thread Mark Hamstra
0 RC4 is causing a build regression for me on at least one of my machines. RC3 built and ran tests successfully, but the tests consistently fail with RC4 unless I revert 9e91a1009e6f916245b4d4018de1664ea3decfe7, "[SPARK-15703][SCHEDULER][CORE][WEBUI] Make ListenerBus event queue size configurable

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-10-01 Thread Mark Hamstra
tch you mentioned increased the memory > usage of BlockManagerSuite and made the tests easy to OOM. It can be > fixed by mocking SparkContext (or may be not necessary since Jenkins's > maven and sbt builds are green now). > > However, since this is only a test issue, it should not

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
If I'm correctly understanding the kind of voting that you are talking about, then to be accurate, it is only the PMC members that have a vote, not all committers: https://www.apache.org/foundation/how-it-works.html#pmc-members On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger wrote: > I think th

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
I'm not a fan of the SEP acronym. Besides its prior established meaning of "Somebody else's problem", there are other inappropriate or offensive connotations, such as this Australian slang that often gets shortened to just "sep": http://www.urbandictionary.com/define.php?term=Seppo On Sun, Oct 9, 20

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
fusing stuff, including that committers are > in practice given a vote. > > https://www.apache.org/foundation/voting.html > > I don't care either way, if someone wants me to sub committer for PMC in > the voting section, fine, we just need a clear outcome. > > On Oc

Re: DAGScheduler.handleJobCancellation uses jobIdToStageIds for verification while jobIdToActiveJob for lookup?

2016-10-13 Thread Mark Hamstra
an Rashid > wrote: > > Hi Jacek, > > > > doesn't look like there is any good reason -- Mark Hamstra might know > this > > best. Feel free to open a jira & pr for it, you can ping Mark, Kay > > Ousterhout, and me (@squito) for review. > > > >

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-24 Thread Mark Hamstra
Alright, that does it! Who is responsible for this "straw-man" abuse that is becoming too commonplace in the Spark community? "Straw-man" does not mean something like "trial balloon" or "run it up the flagpole and see if anyone salutes", and I would really appreciate it if Spark developers would

Re: Mini-Proposal: Make it easier to contribute to the contributing to Spark Guide

2016-10-24 Thread Mark Hamstra
guage suffices, especially given we > have people from lots of language backgrounds here. > > > On Mon, Oct 24, 2016 at 6:11 PM Mark Hamstra > wrote: > >> Alright, that does it! Who is responsible for this "straw-man" >> abuse that is becoming too commonplac

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Mark Hamstra
What's changed since the last time we discussed these issues, about 7 months ago? Or, another way to formulate the question: What are the threshold criteria that we should use to decide when to end Scala 2.10 and/or Java 7 support? On Tue, Oct 25, 2016 at 8:36 AM, Sean Owen wrote: > I'd like to

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Mark Hamstra
No, I think our intent is that using a deprecated language version can generate warnings, but that it should still work; whereas once we remove support for a language version, then it really is ok for Spark developers to do things not compatible with that version and for users attempting to use tha

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-25 Thread Mark Hamstra
per the 2.0 release notes which I linked to. Here > they are > <http://spark.apache.org/releases/spark-release-2-0-0.html#deprecations> > again. > ​ > > On Tue, Oct 25, 2016 at 3:19 PM Mark Hamstra > wrote: > >> No, I think our intent is that using a deprecated

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-14 Thread Mark Hamstra
Take a look at spark.sql.adaptive.enabled and the ExchangeCoordinator. A single, fixed-size sql.shuffle.partitions is not the only way to control the number of partitions in an Exchange -- if you are willing to deal with code that is still off by default. On Mon, Nov 14, 2016 at 4:19 PM, leo9
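A sketch of opting in, assuming the Spark 2.x property names referenced here (the tuning value is illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Off by default at the time of this thread; lets the ExchangeCoordinator
      // merge small post-shuffle partitions at runtime.
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
      .getOrCreate()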

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
AFAIK, the adaptive shuffle partitioning still isn't completely ready to be made the default, and there are some corner issues that need to be addressed before this functionality is declared finished and ready. E.g., the current logic can make data skew problems worse by turning One Big Partition

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
You still have the problem that even within a single Job it is often the case that not every Exchange really wants to use the same number of shuffle partitions. On Tue, Nov 15, 2016 at 2:46 AM, Sean Owen wrote: > Once you get to needing this level of fine-grained control, should you not > consid

Re: Can I add a new method to RDD class?

2016-12-07 Thread Mark Hamstra
The easiest way is probably with: mvn versions:set -DnewVersion=your_new_version On Wed, Dec 7, 2016 at 11:31 AM, Teng Long wrote: > Hi Holden, > > Can you please tell me how to edit version numbers efficiently? the > correct way? I'm really struggling with this and don't know where to look. >

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-12 Thread Mark Hamstra
Yes, I see the same. On Mon, Dec 12, 2016 at 5:52 PM, Marcelo Vanzin wrote: > Another failing test is "ReplSuite:should clone and clean line object > in ClosureCleaner". It never passes for me, just keeps spinning until > the JVM eventually starts throwing OOM errors. Anyone seeing that? > > On

Re: Reg: Any Dev member in and around Chennai / Tamilnadu

2016-12-20 Thread Mark Hamstra
http://spark.apache.org/committers.html On Tue, Dec 20, 2016 at 4:48 AM, Sivanesan Govindaraj < nesan.commit...@gmail.com> wrote: > HI Dev, > >Sorry to bother with non-technical query. I wish to connect with any > active contributor / committer in and around Chennai / TamilNadu. I wish to > c

Re: Sharing data in columnar storage between two applications

2016-12-25 Thread Mark Hamstra
Not so much about sharing between applications as between multiple frameworks within an application, but still related: https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf On Sun, Dec 25, 2016 at 8:12 PM, Kazuaki Ishizaki wrote: > Here is an interesting discussion to share data in columnar storage > b

Re: Shuffle intermidiate results not being cached

2016-12-26 Thread Mark Hamstra
Shuffle results are only reused if you are reusing the exact same RDD. If you are working with Dataframes that you have not explicitly cached, then they are going to be producing new RDDs within their physical plan creation and evaluation, so you won't get implicit shuffle reuse. This is what htt
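A sketch of the distinction (names illustrative):

    // Two logically identical but separately constructed DataFrames plan separate
    // RDDs, so the shuffle runs twice -- no implicit reuse:
    val a = df.groupBy("k").count()
    val b = df.groupBy("k").count()
    a.collect()
    b.collect()

    // Explicit caching is what buys reuse across actions:
    a.cache()
    a.collect() // computes and caches
    a.collect() // served from the cache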

Re: Sharing data in columnar storage between two applications

2016-12-26 Thread Mark Hamstra
16, at 5:24 PM, Mark Hamstra wrote: > > NOt so much about between applications, rather multiple frameworks within > an application, but still related: https://cs.stanford. > edu/~matei/papers/2017/cidr_weld.pdf > > On Sun, Dec 25, 2016 at 8:12 PM, Kazuaki Ishizaki > wrote: > &

Re: Spark Improvement Proposals

2017-03-09 Thread Mark Hamstra
-0 on voting on whether we need a vote. On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin wrote: > I'm fine without a vote. (are we voting on whether we need a vote?) > > > On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen wrote: >> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt. >>

Re: Should we consider a Spark 2.1.1 release?

2017-03-19 Thread Mark Hamstra
That doesn't necessarily follow, Jacek. There is a point where too frequent releases decrease quality. That is because releases don't come for free -- each one demands a considerable amount of time from release managers, testers, etc. -- time that would otherwise typically be devoted to improving (

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-01 Thread Mark Hamstra
LocalityPlacementStrategySuite hangs -- definitely been seeing that one for quite a while, not just with 2.1.1-rc, also with Ubuntu 16.10, and not with macOS Sierra. On Sat, Apr 1, 2017 at 12:34 PM, Sean Owen wrote: > (Tiny nits: first line says '2.1.0', just a note for next copy/paste of > the e

Re: Why did spark switch from AKKA to net / ...

2017-05-07 Thread Mark Hamstra
The point is that Spark's prior usage of Akka was limited enough that it could fairly easily be removed entirely instead of forcing particular architectural decisions on Spark's users. On Sun, May 7, 2017 at 1:14 PM, geoHeil wrote: > Thank you! > In the issue they outline that hard wired depende

Re: a stage can belong to more than one job please?

2017-06-06 Thread Mark Hamstra
Yes, a Stage can be part of more than one Job. The jobIds field of Stage is used repeatedly in the DAGScheduler. On Tue, Jun 6, 2017 at 5:04 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > I read some code of Spark about stages. > > The constructor of Stage keeps the first job ID the stage was
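A minimal sketch of how that arises (spark-shell style; data illustrative):

    val grouped = sc.parallelize(Seq(("a", 1), ("b", 2))).groupByKey()
    grouped.count()   // job 0 runs the ShuffleMapStage
    grouped.collect() // job 1 depends on the same ShuffleMapStage, so that
                      // stage's jobIds now contains both jobs (its output is reused)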

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Mark Hamstra
Points 2, 3 and 4 of the Project Plan in that document (i.e. "port existing data sources using internal APIs to use the proposed public Data Source V2 API") have my full support. Really, I'd like to see that dog-fooding effort completed and the lessons learned from it fully digested before we remove any

Re: Increase Timeout or optimize Spark UT?

2017-08-22 Thread Mark Hamstra
This is another argument for getting the code to the point where this can default to "true": SQLConf.scala: val ADAPTIVE_EXECUTION_ENABLED = buildConf("spark.sql.adaptive.enabled") On Tue, Aug 22, 2017 at 12:27 PM, Reynold Xin wrote: > +1 > > On Tue, Aug 22, 2017 at 12:25 PM, Maciej Szymk

Re: SPIP: Spark on Kubernetes

2017-08-28 Thread Mark Hamstra
> > In my opinion, the fact that there are nearly no changes to spark-core, > and most of our changes are additive should go to prove that this adds > little complexity to the workflow of the committers. Actually (and somewhat perversely), the otherwise praiseworthy isolation of the Kubernetes co

Re: Supporting Apache Aurora as a cluster manager

2017-09-10 Thread Mark Hamstra
While it may be worth creating the design doc and JIRA ticket so that we at least have a better idea and a record of what you are talking about, I kind of doubt that we are going to want to merge this into the Spark codebase. That's not because of anything specific to this Aurora effort, but rather

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-13 Thread Mark Hamstra
Yeah, but that discussion and use case is a bit different -- providing a different route to download the final released and approved artifacts that were built using only acceptable artifacts and sources vs. building and checking prior to release using something that is not from an Apache mirror. Th

Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-14 Thread Mark Hamstra
; question (e.g. Why are we downloading Spark in a test case ?). >> >> Thanks >> Shivaram >> >> On Wed, Sep 13, 2017 at 11:50 AM, Mark Hamstra >> wrote: >> > Yeah, but that discussion and use case is a bit different -- providing a >> > different

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Mark Hamstra
+1 (binding) On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas wrote: > +1 on this proposal. > > On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu wrote: > > > Will these maintainers have a cleanup for those pending PRs upon we start > > to apply this model? > > > I second Nan's question. I would like to

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
> > The console mode of sbt (just run > sbt/sbt and then a long running console session is started that will accept > further commands) is great for building individual subprojects or running > single test suites. In addition to being faster since it's a long-running > JVM, it's got a lot of nice fe

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
Ok, strictly speaking, that's equivalent to your second class of examples, "development console", not the first "sbt console" On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra wrote: > The console mode of sbt (just run >> sbt/sbt and then a long running co

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
More or less correct, but I'd add that there are an awful lot of software systems out there that use Maven. Integrating with those systems is generally easier if you are also working with Spark in Maven. (And I wouldn't classify all of those Maven-built systems as "legacy", Michael :) What that

Re: Spurious test failures, testing best practices

2014-11-30 Thread Mark Hamstra
> > - Start the SBT interactive console with sbt/sbt > - Build your assembly by running the "assembly" target in the assembly > project: assembly/assembly > - Run all the tests in one module: core/test > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this > also supports tab
