Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Mark Hamstra
n't. The appearance alone is > enough to act to make this consistent. > > But, I think the resolution is simple: it's not 'dangerous' to release > this and I don't think people who say they think this really do. So > just finish this release normally, and we

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Mark Hamstra
This is not a Databricks vs. The World situation, and the fact that some persist in forcing every issue into that frame is getting annoying. There are good engineering and project-management reasons not to populate the long-term, canonical repository of Maven artifacts with what are known to be se

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-03 Thread Mark Hamstra
It's not a question of whether the preview artifacts can be made available on Maven central, but rather whether they must be or should be. I've got no problems leaving these unstable, transitory artifacts out of the more permanent, canonical repository. On Fri, Jun 3, 2016 at 1:53 AM, Steve Lough

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-20 Thread Mark Hamstra
This isn't yet a release candidate since, as Reynold mentioned in his opening post, preview releases are "not meant to be functional, i.e. they can and highly likely will contain critical bugs or documentation errors." Once we're at the point where we expect there not to be such bugs and errors,

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
case > with dynamic allocation. HDFS nodes aren't decreasing in number though, > and we can still colocate on those nodes, as always. > > On Thu, Apr 28, 2016 at 11:19 AM, Mark Hamstra > wrote: > >> So you are only considering the case where your set of HDFS nodes is

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
cluster dynamically changes from 1 > workers to 1000, will the typical HDFS replication factor be sufficient to > retain access to the shuffle files in HDFS > > HDFS isn't resizing. Spark is. HDFS files should be HA and durable. > > On Thu, Apr 28, 2016 at 11:08 AM, Ma

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Yes, replicated and distributed shuffle materializations are a key requirement for maintaining performance in a fully elastic cluster where Executors aren't just reallocated across an essentially fixed number of Worker nodes, but rather the number of Workers itself is dynamic. Retaining the file interfac

Re: Question about Scala style, explicit typing within transformation functions and anonymous val.

2016-04-17 Thread Mark Hamstra
s reason, that is > not (or rarely) used in Spark. > > 2016-04-17 15:54 GMT+09:00 Mark Hamstra : > >> FWIW, 3 should work as just `.map(function)`. >> >> On Sat, Apr 16, 2016 at 11:48 PM, Reynold Xin >> wrote: >> >>> Hi Hyukjin, >>

Re: Question about Scala style, explicit typing within transformation functions and anonymous val.

2016-04-16 Thread Mark Hamstra
FWIW, 3 should work as just `.map(function)`. On Sat, Apr 16, 2016 at 11:48 PM, Reynold Xin wrote: > Hi Hyukjin, > > Thanks for asking. > > For 1 the change is almost always better. > > For 2 it depends on the context. In general if the type is not obvious, it > helps readability to explicitly d
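The "3 should work as just `.map(function)`" point can be sketched in plain Scala (the method and values here are illustrative, not from the thread): the compiler eta-expands a method reference into a function value, so all three spellings below are equivalent, and the same holds for `RDD#map`, which takes an ordinary `T => U`.

```scala
// Three equivalent styles from the style discussion, shown on plain
// Scala collections rather than RDDs.
def double(x: Int): Int = x * 2

val xs = List(1, 2, 3)

val a = xs.map(x => double(x)) // 1: explicit anonymous function
val b = xs.map(double(_))      // 2: placeholder syntax
val c = xs.map(double)         // 3: the method itself, eta-expanded
// all three yield List(2, 4, 6)
```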

Re: Coding style question (about extra anonymous closure within functional transformations)

2016-04-14 Thread Mark Hamstra
I don't believe the Scala compiler understands the difference between your two examples the same way that you do. Looking at a few similar cases, I've only found the bytecode produced to be the same regardless of which style is used. On Wed, Apr 13, 2016 at 7:46 PM, Hyukjin Kwon wrote: > Hi all

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-11 Thread Mark Hamstra
Wed, Apr 6, 2016 at 2:57 PM, Mark Hamstra > wrote: > > ... My concern is that either of those options will take more resources >> than some Spark users will have available in the ~3 months remaining before >> Spark 2.0.0, which will cause fragmentation into Spark 1.

Re: Executor shutdown hooks?

2016-04-06 Thread Mark Hamstra
Why would the Executors shutdown when the Job is terminated? Executors are bound to Applications, not Jobs. Furthermore, unless spark.job.interruptOnCancel is set to true, canceling the Job at the Application and DAGScheduler level won't actually interrupt the Tasks running on the Executors. If
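A minimal sketch of the distinction being drawn, using the public `SparkContext` API (the group id, app name, and threading are illustrative assumptions, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("cancel-demo").setMaster("local[2]"))

// interruptOnCancel is opt-in and set per job group; without it,
// cancellation is handled at the DAGScheduler level and tasks already
// running on the Executors are not interrupted.
sc.setJobGroup("my-group", "cancellable work", interruptOnCancel = true)

// ... actions launched from another thread under this group ...

sc.cancelJobGroup("my-group") // with the flag set, task threads are interrupted
```

Either way, the Executors themselves stay up for the life of the Application; only the group's jobs and tasks are affected.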

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-06 Thread Mark Hamstra
I agree with your general logic and understanding of semver. That is why if we are going to violate the strictures of semver, I'd only be happy doing so if support for Java 7 and/or Scala 2.10 were clearly understood to be deprecated already in the 2.0.0 release -- i.e. from the outset not to be u

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-06 Thread Mark Hamstra
>>> once we get there. >>> >>> On Fri, Apr 1, 2016 at 10:00 PM, Raymond Honderdors < >>> raymond.honderd...@sizmek.com> wrote: >>> >>>> What about a seperate branch for scala 2.10? >>>> >>>> >>>> >>>

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
see how supporting scala 2.10 for >> spark 2.0 implies supporting it for all of spark 2.x >> >> Regarding Koert's comment on akka, I thought all akka dependencies >> have been removed from spark after SPARK-7997 and the recent removal >> of external/akka >>

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
> Regarding Koert's comment on akka, I thought all akka dependencies >> have been removed from spark after SPARK-7997 and the recent removal >> of external/akka >> >> On Wed, Mar 30, 2016 at 9:36 AM, Mark Hamstra >> wrote: >> > Dropping Scala 2.10 support ha

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
Dropping Scala 2.10 support has to happen at some point, so I'm not fundamentally opposed to the idea; but I've got questions about how we go about making the change and what degree of negative consequences we are willing to accept. Until now, we have been saying that 2.10 support will be continue

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Mark Hamstra
There aren't many such libraries, but there are a few. When faced with one of those dependencies that still doesn't go beyond 2.10, you essentially have the choice of taking on the maintenance burden to bring the library up to date, or you do what is potentially a fairly larger refactoring to use

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Mark Hamstra
It's a pain in the ass. Especially if some of your transitive dependencies never upgraded from 2.10 to 2.11. On Thu, Mar 24, 2016 at 4:50 PM, Reynold Xin wrote: > If you want to go down that route, you should also ask somebody who has > had experience managing a large organization's application

Re: Dynamic allocation availability on standalone mode. Misleading doc.

2016-03-07 Thread Mark Hamstra
Yes, it works in standalone mode. On Mon, Mar 7, 2016 at 4:25 PM, Eugene Morozov wrote: > Hi, the feature looks like the one I'd like to use, but there are two > different descriptions in the docs of whether it's available. > > I'm on a standalone deployment mode and here: > http://spark.apache.
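As a sketch, the standalone-mode setup looks like this (property names per the Spark configuration docs of that era): dynamic allocation also requires the external shuffle service on each Worker, so that shuffle files outlive the Executors that wrote them.

```scala
import org.apache.spark.SparkConf

// Both settings are needed for dynamic allocation in standalone mode.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
```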

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-02 Thread Mark Hamstra
+1 On Wed, Mar 2, 2016 at 2:45 PM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.1! > > The vote is open until Saturday, March 5, 2016 at 20:00 UTC and passes if > a majority of at least 3+1 PMC votes are cast. > > [ ] +1 Release this pa

Re: Dataframe Partitioning

2016-03-01 Thread Mark Hamstra
I don't entirely agree. You're best off picking the right size :). That's almost impossible, though, since at the input end of the query processing you often want a large number of partitions to get sufficient parallelism for both performance and to avoid spilling or OOM, while at the output end
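The input-vs-output tension reads roughly like this in code (the dataset and transformation names are made up for illustration):

```scala
// Many partitions up front, for parallelism and to keep per-task memory
// low; then coalesce at the end so the job doesn't write thousands of
// tiny output files.
val wide    = input.repartition(2000)              // shuffle-heavy middle stages
val reduced = wide.map(transform).reduceByKey(_ + _)
reduced.coalesce(16).saveAsTextFile("hdfs:///out") // few, larger files
```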

Re: Spark 2.0.0 release plan

2016-01-29 Thread Mark Hamstra
https://github.com/apache/spark/pull/10608 On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote: > I'm not an authoritative source but I think it is indeed the plan to > move the default build to 2.11. > > See this discussion for more detail > > http://apache-spark-developers-list.1001551.n3.na

Re: Latency due to driver fetching sizes of output statuses

2016-01-23 Thread Mark Hamstra
Do all of those thousands of Stages end up being actual Stages that need to be computed, or are the vast majority of them eventually "skipped" Stages? If the latter, then there is the potential to modify the DAGScheduler to avoid much of this behavior: https://issues.apache.org/jira/browse/SPARK-10

Re: RDD[Vector] Immutability issue

2015-12-29 Thread Mark Hamstra
You can, but you shouldn't. Using backdoors to mutate the data in an RDD is a good way to produce confusing and inconsistent results when, e.g., an RDD's lineage needs to be recomputed or a Task is resubmitted on fetch failure. On Tue, Dec 29, 2015 at 11:24 AM, ai he wrote: > Same thing. > > Sa
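A sketch of the point, assuming a hypothetical `vectors: RDD[Vector]` from MLlib: derive new data with a transformation instead of mutating in place, because any partition can be recomputed from its lineage at any time.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Anti-pattern: reaching into a vector's backing array and mutating it.
// If the partition is later recomputed (cache eviction, fetch failure,
// resubmitted task), the mutation is silently lost or reapplied.

// Instead, build new vectors via map:
val scaled = vectors.map(v => Vectors.dense(v.toArray.map(_ * 2.0)))
```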

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Mark Hamstra
+1 On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes > if a majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release th

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Mark Hamstra
+1 On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and > passes if a majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release t

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Mark Hamstra
I'm afraid you're correct, Krishna: core/src/main/scala/org/apache/spark/package.scala: val SPARK_VERSION = "1.6.0-SNAPSHOT" docs/_config.yml:SPARK_VERSION: 1.6.0-SNAPSHOT On Mon, Dec 14, 2015 at 6:51 PM, Krishna Sankar wrote: > Guys, >The sc.version gives 1.6.0-SNAPSHOT. Need to change to

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Mark Hamstra
+1 On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes > if a majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release thi

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-04 Thread Mark Hamstra
0 Currently figuring out who is responsible for the regression that I am seeing in some user code ScalaUDFs that make use of Timestamps and where NULL from a CSV file read in via a TestHive#registerTestTable is now producing 1969-12-31 23:59:59.99 instead of null. On Thu, Dec 3, 2015 at 1:57

Re: Quick question regarding Maven and Spark Assembly jar

2015-12-03 Thread Mark Hamstra
Try to read this before Marcelo gets to you. https://issues.apache.org/jira/browse/SPARK-11157 On Thu, Dec 3, 2015 at 5:27 PM, Matt Cheah wrote: > Hi everyone, > > A very brief question out of curiosity – is there any particular reason > why we don’t publish the Spark assembly jar on the Maven r

Re: A proposal for Spark 2.0

2015-12-03 Thread Mark Hamstra
>>>> difficult for them to make this transition. > >>>> > >>>> Using the same set of APIs also means that it will be easier to > backport > >>>> critical fixes to the 1.x line. > >>>> > >>>> It's not clear to me t

Re: A proposal for Spark 2.0

2015-11-18 Thread Mark Hamstra
can't move to Spark 2.0 because of the backwards incompatible > changes, like removal of deprecated APIs, Scala 2.11 etc. > > Kostas > > > On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra > wrote: > >> Why does stabilization of those two features require a 1.7 release >

Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Mark Hamstra
FiloDB is also closely related. https://github.com/tuplejump/FiloDB On Mon, Nov 16, 2015 at 12:24 AM, Nick Pentreath wrote: > Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop > input/output format support: > https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/ma

Re: A proposal for Spark 2.0

2015-11-13 Thread Mark Hamstra
>> I mean, we need to think about what kind of RDD APIs we have to provide >> to developer, maybe the fundamental API is enough, like, the ShuffledRDD >> etc.. But PairRDDFunctions probably not in this category, as we can do the >> same thing easily with DF/DS, even better perform

Re: A proposal for Spark 2.0

2015-11-12 Thread Mark Hamstra
now that the source relation > (/RDD) is already partitioned on the grouping expressions. AFAIK the spark > sql still does not allow that knowledge to be applied to the optimizer - so > a full shuffle will be performed. However in the native RDD we can use > preservesPartitioning=true.

Re: A proposal for Spark 2.0

2015-11-12 Thread Mark Hamstra
The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and th

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage. On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote: > Hey Jeff, > Do you mean reading from multiple text files? In that case, as a > workar
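The difference, sketched with hypothetical RDDs: chained `RDD#union` nests one `UnionRDD` per call, while `SparkContext#union` builds a single flat `UnionRDD` over all inputs at once.

```scala
// Deepens the lineage by one level per input:
val chained = rdd1.union(rdd2).union(rdd3) // UnionRDD(UnionRDD(rdd1, rdd2), rdd3)

// One flat UnionRDD, however many inputs there are:
val flat = sc.union(Seq(rdd1, rdd2, rdd3))
```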

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
t. On Tue, Nov 10, 2015 at 7:04 PM, Mark Hamstra wrote: > Heh... ok, I was intentionally pushing those bullet points to be extreme > to find where people would start pushing back, and I'll agree that we do > probably want some new features in 2.0 -- but I think we've got goo

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
> > I think we are in agreement, although I wouldn't go to the extreme and say > "a release with no new features might even be best." > > Can you elaborate "anticipatory changes"? A concrete example or so would > be helpful. > > On Tue, Nov 10, 2015 a

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
I'm liking the way this is shaping up, and I'd summarize it this way (let me know if I'm misunderstanding or misrepresenting anything): - New features are not at all the focus of Spark 2.0 -- in fact, a release with no new features might even be best. - Remove deprecated API that we agree

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
Really, Sandy? "Extra consideration" even for already-deprecated API? If we're not going to remove these with a major version change, then just when will we remove them? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza wrote: > Another +1 to Reynold's proposal. > > Maybe this is obvious, but I'd li

Re: Ready to talk about Spark 2.0?

2015-11-08 Thread Mark Hamstra
Yes, that's clearer -- at least to me. But before going any further, let me note that we are already sliding past Sean's opening question of "Should we start talking about Spark 2.0?" to actually start talking about Spark 2.0. I'll try to keep the rest of this post at a higher- or meta-level in o

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Mark Hamstra
+1 On Tue, Nov 3, 2015 at 3:22 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a > majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release this package as Apache

Re: State of the Build

2015-11-05 Thread Mark Hamstra
There was a lot of discussion that preceded our arriving at this statement in the Spark documentation: "Maven is the official build tool recommended for packaging Spark, and is the build of reference." https://spark.apache.org/docs/latest/building-spark.html#building-with-sbt I'm not aware of anyt

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-25 Thread Mark Hamstra
Should 1.5.2 wait for Josh's fix of SPARK-11293? On Sun, Oct 25, 2015 at 2:25 PM, Sean Owen wrote: > The signatures and licenses are fine. I continue to get failures in > these tests though, with "-Pyarn -Phadoop-2.6 -Phive > -Phive-thriftserver" on Ubuntu 15 / Java 7. > > - Unpersisting HttpBro

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-25 Thread Mark Hamstra
You're correct, Sean: That build change isn't in branch-1.5, so the two-phase build is still needed there. On Sun, Oct 25, 2015 at 9:30 AM, Sean Owen wrote: > I believe you still need to "clean package" and then "test" > separately. Or did the change to make that unnecessary go in to 1.5? > > FW

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Mark Hamstra
full git commit history of exactly what is in that build readily available, not just somewhat arbitrary JARs. On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang wrote: > But I cannot find 1.5.1-SNAPSHOT either at > https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.1

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Mark Hamstra
There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released. The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release candidates and then the 1.5.1 release. On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang wrote: > I'd like to use some important bug fixes in 1.5 branch and I

Re: New Spark json endpoints

2015-09-17 Thread Mark Hamstra
While we're at it, adding endpoints that get results by jobGroup (cf. SparkContext#setJobGroup) instead of just for a single Job would also be very useful to some of us. On Thu, Sep 17, 2015 at 7:30 AM, Imran Rashid wrote: > Hi Kevin, > > I think it would be great if you added this. It never go

Re: Foundation policy on releases and Spark nightly builds

2015-07-14 Thread Mark Hamstra
> > Please keep in mind that you are also "ASF people," as is the entire Spark > community (users and all)[4]. Phrasing things in terms of "us and them" by > drawing a distinction on "[they] get in a fight on our mailing list" is not > helpful. But they started it! A bit more seriously, my perspe

Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-09 Thread Mark Hamstra
+1 On Wed, Jul 8, 2015 at 10:55 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.4.1! > > This release fixes a handful of known issues in Spark 1.4.0, listed here: > http://s.apache.org/spark-1.4.1 > > The tag to be voted on is v1.4.1-rc4

Re: [VOTE] Release Apache Spark 1.4.1 (RC3)

2015-07-08 Thread Mark Hamstra
HiveSparkSubmitSuite is fine for me, but I do see the same issue with DataFrameStatSuite -- OSX 10.10.4, java 1.7.0_75, -Phive -Phive-thriftserver -Phadoop-2.4 -Pyarn On Wed, Jul 8, 2015 at 4:18 AM, Sean Owen wrote: > The POM issue is resolved and the build succeeds. The license and sigs > stil

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-06 Thread Mark Hamstra
+1 On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.4.0! > > The tag to be voted on is v1.4.0-rc3 (commit 22596c5): > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h= > 22596c534a38cfdda91aef18aa9

Re: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Mark Hamstra
This discussion belongs on the dev list. Please post any replies there. On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park wrote: > Hi, > > I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to > confirm whether these are bugs or not before opening a jira. > > *1)* I can no longer

Re: Speeding up Spark build during development

2015-05-03 Thread Mark Hamstra
https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri wrote: > This is great. I didn't know about the mvn script in the build directory. > > Pramod > > On Fri, May 1, 2015 at 9:51 AM, York, Brennon > > wrote: > > > Follow

Re: Should we let everyone set Assignee?

2015-04-22 Thread Mark Hamstra
Agreed. The Spark project and community that Vinod describes do not resemble the ones with which I am familiar. On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell wrote: > Hi Vinod, > > Thanks for you thoughts - However, I do not agree with your sentiment > and implications. Spark is broadly quit

Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-12 Thread Mark Hamstra
+1 On Fri, Apr 10, 2015 at 11:05 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.3.1! > > The tag to be voted on is v1.3.1-rc2 (commit 3e83913): > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043b

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Mark Hamstra
+1 On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.3.1! > > The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f): > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-06 Thread Mark Hamstra
+1 On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.2.2! > > The tag to be voted on is v1.2.2-rc1 (commit 7531b50): > > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7531b50e406ee2e3301b009ceea7

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Mark Hamstra
Is that correct, or is the JIRA just out of sync, since TD's PR was merged? https://github.com/apache/spark/pull/5008 On Mon, Apr 6, 2015 at 11:10 AM, Hari Shreedharan wrote: > It does not look like https://issues.apache.org/jira/browse/SPARK-6222 > made it. It was targeted towards this release.

Re: What is the meaning to of 'STATE' in a worker/ an executor?

2015-03-29 Thread Mark Hamstra
A LOADING Executor is on the way to RUNNING, but hasn't yet been registered with the Master, so it isn't quite ready to do useful work. > On Mar 29, 2015, at 9:09 PM, Niranda Perera wrote: > > Hi, > > I have noticed in the Spark UI, workers and executors run on several states, > ALIVE, LOAD

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Mark Hamstra
problem, but I haven't run it to ground yet.) On Mon, Feb 23, 2015 at 12:18 PM, Michael Armbrust wrote: > On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra > wrote: > >> So what are we expecting of Hive 0.12.0 builds with this RC? I know not >> every combination of Hadoop

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-22 Thread Mark Hamstra
So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the "Building Spark" page isn't looking too good to me. Working from f97b0d4, the example build command works: mvn -Pyarn -

Re: Keep or remove Debian packaging in Spark?

2015-02-10 Thread Mark Hamstra
here: > > > > https://issues.apache.org/jira/browse/BIGTOP-1480 > > > > > > > > -Original Message- > > From: Sean Owen [mailto:so...@cloudera.com] > > Sent: Monday, February 9, 2015 3:52 PM > > To: Nicholas Chammas > > Cc: Patric

Re: Keep or remove Debian packaging in Spark?

2015-02-09 Thread Mark Hamstra
> > it sounds like nobody intends these to be used to actually deploy Spark I wouldn't go quite that far. What we have now can serve as useful input to a deployment tool like Chef, but the user is then going to need to add some customization or configuration within the context of that tooling to

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Mark Hamstra
In master, Reynold has already taken care of moving Row into org.apache.spark.sql; so, even though the implementation of Row (and GenericRow et al.) is in Catalyst (which is more optimizer than parser), that needn't be of concern to users of the API in its most recent state. On Tue, Jan 27, 2015 a

Re: Job priority

2015-01-11 Thread Mark Hamstra
weight-1000 pool >>>> will always get to launch tasks first whenever it has jobs active." >>>> >>>> On Sat, Jan 10, 2015 at 11:57 PM, Alessandro Baretta < >>>> alexbare...@gmail.com> wrote: >>>> >>>>> Mark

Re: ANNOUNCE: New build script ./build/mvn

2014-12-27 Thread Mark Hamstra
> > Scala complication speed Heh. I like that. On Sat, Dec 27, 2014 at 1:51 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Linkies for the curious: > >- SPARK-4501 : Create >build/mvn to automatically download maven/zinc/sc

Re: What RDD transformations trigger computations?

2014-12-18 Thread Mark Hamstra
SPARK-2992 is a good start, but it's not exhaustive. For example, zipWithIndex is also an eager transformation, and we occasionally see PRs suggesting additional eager transformations. On Thu, Dec 18, 2014 at 12:14 PM, Reynold Xin wrote: > > Alessandro was probably referring to some transformati
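For a hypothetical multi-partition `rdd`, the eagerness looks like this: computing each partition's starting index requires counting all earlier partitions, so a Spark job runs at the call site rather than at the first action.

```scala
// Eager: for an RDD with more than one partition, zipWithIndex runs a
// count job immediately to learn the per-partition start offsets.
val indexed = rdd.zipWithIndex()

// Lazy alternative: ids are unique but not contiguous, and no job runs
// until an action is invoked.
val unique = rdd.zipWithUniqueId()
```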

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-12 Thread Mark Hamstra
+1 On Fri, Dec 12, 2014 at 8:00 PM, Josh Rosen wrote: > > +1. Tested using spark-perf and the Spark EC2 scripts. I didn’t notice > any performance regressions that could not be attributed to changes of > default configurations. To be more specific, when running Spark 1.2.0 with > the Spark 1.1

Re: Adding RDD function to segment an RDD (like substring)

2014-12-09 Thread Mark Hamstra
`zipWithIndex` is both compute intensive and breaks Spark's "transformations are lazy" model, so it is probably not appropriate to add this to the public RDD API. If `zipWithIndex` weren't already what I consider to be broken, I'd be much friendlier to building something more on top of it, but I r

Re: drop table if exists throws exception

2014-12-05 Thread Mark Hamstra
And that is no different from how Hive has worked for a long time. On Fri, Dec 5, 2014 at 11:42 AM, Michael Armbrust wrote: > The command run fine for me on master. Note that Hive does print an > exception in the logs, but that exception does not propogate to user code. > > On Thu, Dec 4, 2014

Re: Spurious test failures, testing best practices

2014-11-30 Thread Mark Hamstra
> > - Start the SBT interactive console with sbt/sbt > - Build your assembly by running the "assembly" target in the assembly > project: assembly/assembly > - Run all the tests in one module: core/test > - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this > also supports tab

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
More or less correct, but I'd add that there are an awful lot of software systems out there that use Maven. Integrating with those systems is generally easier if you are also working with Spark in Maven. (And I wouldn't classify all of those Maven-built systems as "legacy", Michael :) What that

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
Ok, strictly speaking, that's equivalent to your second class of examples, "development console", not the first, "sbt console". On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra wrote: > The console mode of sbt (just run >> sbt/sbt and then a long running co

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
> > The console mode of sbt (just run > sbt/sbt and then a long running console session is started that will accept > further commands) is great for building individual subprojects or running > single test suites. In addition to being faster since its a long running > JVM, its got a lot of nice fe

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Mark Hamstra
+1 (binding) On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas wrote: > +1 on this proposal. > > On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu wrote: > > > Will these maintainers have a cleanup for those pending PRs upon we start > > to apply this model? > > > I second Nan's question. I would like to

Re: Moving PR Builder to mvn

2014-10-24 Thread Mark Hamstra
Yours are in the same ballpark as mine, where Maven builds with zinc take about 1.4x the time of builds with SBT. On Fri, Oct 24, 2014 at 4:24 PM, Sean Owen wrote: > Here's a crude benchmark on a Linux box (GCE n1-standard-4). zinc gets > the assembly build in range of SBT's time. > > mvn -Dsk

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Mark Hamstra
https://github.com/apache/spark/pull/2576 On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan wrote: > James, > > Michael at the meetup last night said there was some development > activity around ORCFiles. > > I'm curious though, what are the pros and cons of ORCFiles vs Parquet? > > On Wed, Oct 8, 20

Re: Workflow Scheduler for Spark

2014-09-17 Thread Mark Hamstra
See https://issues.apache.org/jira/browse/SPARK-3530 and this doc, referenced in that JIRA: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov wrote: > I have problems using Oozie. For example it doesn't

JIRA content request

2014-07-29 Thread Mark Hamstra
Of late, I've been coming across quite a few pull requests and associated JIRA issues that contain nothing indicating their purpose beyond a pretty minimal description of what the pull request does. On the pull request itself, a reference to the corresponding JIRA in the title combined with a desc

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Mark Hamstra
> in which branch. > > - Patrick > > On Mon, Jul 28, 2014 at 10:02 AM, Mark Hamstra > wrote: > > Where and how is that fork being maintained? I'm not seeing an obviously > > correct branch or tag in the main asf hive repo & github mirror. > > > > >

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Mark Hamstra
Where and how is that fork being maintained? I'm not seeing an obviously correct branch or tag in the main asf hive repo & github mirror. On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell wrote: > It would be great if the hive team can fix that issue. If not, we'll > have to continue forking ou

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
You can find some of the prior, related discussion here: https://issues.apache.org/jira/browse/SPARK-1021 On Mon, Jul 21, 2014 at 1:25 PM, Erik Erlandson wrote: > > > - Original Message - > > Rather than embrace non-lazy transformations and add more of them, I'd > > rather we 1) try to

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate existing non-lazy transformation. T

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but rather that each such exception to

Re: Master compilation with sbt

2014-07-19 Thread Mark Hamstra
> project mllib . . . > clean . . . > compile . . . > test ... all works fine for me @2a732110d46712c535b75dd4f5a73761b6463aa8 On Sat, Jul 19, 2014 at 11:10 AM, Debasish Das wrote: > I am at the reservoir sampling commit: > > commit 586e716e47305cd7c2c3ff35c0e828b63ef2f6a8 > Author: Reynold Xin

Re: ExecutorState.LOADING?

2014-07-09 Thread Mark Hamstra
ater (replaced by RUNNING) by the same > Mr. Zaharia: > > https://github.com/apache/spark/commit/bb1bce79240da22c2677d9f8159683cdf73158c2#diff-776a630ac2b2ec5fe85c07ca20a58fc0 > > So I'd say it's safe to delete it. > > > On Wed, Jul 9, 2014 at 2:36 PM, Mark Hamstra >

ExecutorState.LOADING?

2014-07-09 Thread Mark Hamstra
Doesn't look to me like this is used. Does anybody recall what it was intended for?

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-05 Thread Mark Hamstra
+1 On Fri, Jul 4, 2014 at 12:40 PM, Patrick Wendell wrote: > I'll start the voting with a +1 - ran tests on the release candidate > and ran some basic programs. RC1 passed our performance regression > suite, and there are no major changes from that RC. > > On Fri, Jul 4, 2014 at 12:39 PM, Patri

Re: Assorted project updates (tests, build, etc)

2014-06-22 Thread Mark Hamstra
Just a couple of FYI notes: With Zinc and the scala-maven-plugin, repl and incremental builds are also available to those doing day-to-day development using Maven. As long as you don't have to delve into the extra boilerplate and verbosity of Maven's POMs relative to an SBT build file, there is li

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-27 Thread Mark Hamstra
+1 On Tue, May 27, 2014 at 9:26 AM, Ankur Dave wrote: > 0 > > OK, I withdraw my downvote. > > Ankur >

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-21 Thread Mark Hamstra
+1 On Tue, May 20, 2014 at 11:09 PM, Henry Saputra wrote: > Signature and hash for source looks good > No external executable package with source - good > Compiled with git and maven - good > Ran examples and sample programs locally and standalone -good > > +1 > > - Henry > > > > On Tue, May 20,

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Mark Hamstra
That's all very old functionality in Spark terms, so it shouldn't have anything to do with your installation being out-of-date. There is also no need to cast as long as the relevant implicit conversions are in scope: import org.apache.spark.SparkContext._ On Tue, May 20, 2014 at 1:00 PM, GlennSt
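The no-cast enrichment mentioned here (importing `org.apache.spark.SparkContext._` to add methods to certain RDDs) is the standard Scala implicit-conversion pattern; a self-contained sketch with hypothetical names:

```scala
// Importing an object's implicits adds methods to an existing type with
// no casting -- the same pattern SparkContext._ uses to enrich pair RDDs.
object Enrichments {
  implicit class PairSeqOps[K, V](val self: Seq[(K, V)]) {
    def valuesOnly: Seq[V] = self.map(_._2)
  }
}

import Enrichments._
val pairs = Seq(("a", 1), ("b", 2))
assert(pairs.valuesOnly == Seq(1, 2))  // available once the implicit is in scope
```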

Re: spark 1.0 standalone application

2014-05-19 Thread Mark Hamstra
That's the crude way to do it. If you run `sbt/sbt publishLocal`, then you can resolve the artifact from your local cache in the same way that you would resolve it if it were deployed to a remote cache. That's just the build step. Actually running the application will require the necessary jars
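For illustration, after `sbt/sbt publishLocal` the artifact can be resolved from the local Ivy cache like any other dependency. A hedged build-file fragment (the coordinates and version string are assumptions and will vary with the branch being built):

```scala
// build.sbt sketch -- resolves a locally-published Spark build; the
// version shown is illustrative, not authoritative.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT"
```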

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
busy? I do think the rate of > > >>>> significant issues will slow down. > > >>>> > > >>>> Version ain't nothing but a number, but if it has any meaning it's > the > > >>>> semantic versioning meaning. 1.0 imposes e

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
table idea, then I'm listening. On Sat, May 17, 2014 at 11:59 AM, Mridul Muralidharan wrote: > On 17-May-2014 11:40 pm, "Mark Hamstra" wrote: > > > > That is a past issue that we don't need to be re-opening now. The > present > > Huh ? If we need to

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
o 1.0 - so I made my opinion known but left it to the wisdom of larger > group of committers to decide ... I did not think it was critical enough to > do a binding -1 on. > > Regards > Mridul > On 17-May-2014 9:43 pm, "Mark Hamstra" wrote: > > > Which of the u

Re: [VOTE] Release Apache Spark 1.0.0 (rc9)

2014-05-17 Thread Mark Hamstra
+1 On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell wrote: > I'll start the voting with a +1. > > On Sat, May 17, 2014 at 12:58 AM, Patrick Wendell > wrote: > > Please vote on releasing the following candidate as Apache Spark version > 1.0.0! > > This has one bug fix and one minor feature on t
