Merging code into branch 1.3

2015-02-18 Thread Patrick Wendell
Hey Committers, Now that Spark 1.3 rc1 is cut, please restrict branch-1.3 merges to the following: 1. Fixes for issues blocking the 1.3 release (i.e. 1.2.X regressions) 2. Documentation and tests. 3. Fixes for non-blocker issues that are surgical, low-risk, and/or outside of the core. If there

[VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.3.0! The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2 The release files, including signatures, digests, etc.
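The release files come with signatures and digests for verification. As a side note, checking a digest amounts to hashing the downloaded artifact and comparing against the published hex string; a minimal sketch (the file name and digest value here are placeholders, and real release digest files have varying formats):

```python
import hashlib

def sha512_hex(path, chunk_size=1 << 20):
    """Compute the SHA-512 hex digest of a file, reading in chunks
    so large release tarballs are never loaded whole into memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published_digest(path, published_hex):
    """Compare a locally computed digest to the published one,
    tolerating case and surrounding whitespace."""
    return sha512_hex(path) == published_hex.strip().lower()
```

Signature verification (`gpg --verify`) additionally requires the release manager's key from the Apache keys page.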

Re: Replacing Jetty with Tomcat

2015-02-18 Thread Sean Owen
I do not think it makes sense to make the web server configurable. Mostly because there's no real problem in running an HTTP service internally based on Netty while you run your own HTTP service based on something else like Tomcat. What's the problem? On Wed, Feb 18, 2015 at 3:14 AM, Niranda

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Sean Owen
The serializer is created with val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue) Which is definitely not the closure serializer and so should respect what you are setting with spark.serializer. Maybe you can do a quick bit of debugging to see where that assumption

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Patrick Wendell
UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal ... This is a newer test suite. There is something flaky about it, we should definitely fix it, IMO it's not a blocker though. Patrick this link gives a 404:

Re: quick jenkins restart tomorrow morning, ~7am PST

2015-02-18 Thread shane knapp
i'm actually going to do this now -- it's really quiet today. there are two spark pull request builds running, which i will kill and retrigger once jenkins is back up: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27689/

quick jenkins restart tomorrow morning, ~7am PST

2015-02-18 Thread shane knapp
i'll be kicking jenkins to up the open file limits on the workers. it should be a very short downtime, and i'll post updates on my progress tomorrow. shane

Re: Streaming partitions to driver for use in .toLocalIterator

2015-02-18 Thread Imran Rashid
This would be pretty tricky to do -- the issue is that right now sparkContext.runJob has you pass in a function from a partition to *one* result object that gets serialized and sent back: Iterator[T] => U, and that idea is baked pretty deep into a lot of the internals, DAGScheduler, Task,

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Matt Cheah
But RDD.aggregate() has this code: // Clone the zero value since we will also be serializing it as part of tasks var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance()) I do see the SparkEnv.get.serializer used in aggregateByKey however. Perhaps we just missed it
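Behind the serializer question is the reason the zero value gets cloned at all: it may be mutable, and each partition needs its own fresh copy or they would all mutate one shared object. A toy Python model (not Spark's code; `copy.deepcopy` stands in for the serializer round-trip that `Utils.clone` performs):

```python
import copy

def aggregate(partitions, zero_value, seq_op, comb_op):
    """Toy aggregate: every partition starts from its own clone of
    zero_value. Without the clone, a mutable zero (e.g. a list) would
    be shared and corrupted across partitions -- which is why Spark
    serializes/clones it before shipping it out in tasks."""
    per_partition = []
    for part in partitions:
        acc = copy.deepcopy(zero_value)  # stand-in for serializer clone
        for x in part:
            acc = seq_op(acc, x)
        per_partition.append(acc)
    total = copy.deepcopy(zero_value)    # driver-side combine also needs a copy
    for r in per_partition:
        total = comb_op(total, r)
    return total
```

Which serializer performs that clone is the inconsistency under discussion; the cloning itself is required either way.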

If job fails shuffle space is not cleaned

2015-02-18 Thread Debasish Das
Hi, Some of my jobs failed due to no space left on device and on those jobs I was monitoring the shuffle space...when the job failed shuffle space did not clean and I had to manually clean it... Is there a JIRA already tracking this issue ? If no one has been assigned to it, I can take a look.

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Sean Owen
On OS X and Ubuntu I see the following test failure in the source release for 1.3.0-RC1: UISeleniumSuite: *** RUN ABORTED *** java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal ... Patrick this link gives a 404: https://people.apache.org/keys/committer/pwendell.asc Finally, I

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Josh Rosen
It looks like this was fixed in https://issues.apache.org/jira/browse/SPARK-4743 / https://github.com/apache/spark/pull/3605. Can you see whether that patch fixes this issue for you? On Tue, Feb 17, 2015 at 8:31 PM, Matt Cheah mch...@palantir.com wrote: Hi everyone, I was using

Issue SPARK-5008 (persistent-hdfs broken)

2015-02-18 Thread Joe Wass
I've recently run into problems caused by ticket SPARK-5008 https://issues.apache.org/jira/browse/SPARK-5008 This seems like quite a serious regression in 1.2.0, meaning that it's not really possible to use persistent-hdfs. The config for the persistent-hdfs points to the wrong part of the

Streaming partitions to driver for use in .toLocalIterator

2015-02-18 Thread Andrew Ash
Hi Spark devs, I'm creating a streaming export functionality for RDDs and am having some trouble with large partitions. The RDD.toLocalIterator() call pulls over a partition at a time to the driver, and then streams the RDD out from that partition before pulling in the next one. When you have
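The partition-at-a-time behavior described here can be modeled in a few lines (a simplification, not Spark's implementation; `run_partition_job` stands in for a single-partition `sc.runJob` call):

```python
def to_local_iterator(run_partition_job, num_partitions):
    """Toy model of RDD.toLocalIterator: run one job per partition and
    stream its elements before fetching the next. Peak driver memory is
    one partition rather than the whole dataset -- but each partition
    still arrives fully materialized, which is exactly the trouble with
    large partitions described above."""
    for pid in range(num_partitions):
        # models sc.runJob on one partition, which returns that
        # partition's elements as a single materialized array
        for elem in run_partition_job(pid):
            yield elem
```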

Re: [ml] Lost persistence for fold in crossvalidation.

2015-02-18 Thread Joseph Bradley
Now in JIRA form: https://issues.apache.org/jira/browse/SPARK-5844 On Tue, Feb 17, 2015 at 3:12 PM, Xiangrui Meng men...@gmail.com wrote: There are three different regParams defined in the grid and there are three folds. For simplicity, we didn't split the dataset into three and reuse them,

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Sean Owen
That looks, at the least, inconsistent. As far as I know this should be changed so that the zero value is always cloned via the non-closure serializer. Any objection to that? On Wed, Feb 18, 2015 at 10:28 PM, Matt Cheah mch...@palantir.com wrote: But RDD.aggregate() has this code: // Clone

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Reynold Xin
Yes, that's a bug and should be using the standard serializer. On Wed, Feb 18, 2015 at 2:58 PM, Sean Owen so...@cloudera.com wrote: That looks, at the least, inconsistent. As far as I know this should be changed so that the zero value is always cloned via the non-closure serializer. Any

Re: Batch prediction for ALS

2015-02-18 Thread Xiangrui Meng
Please create a JIRA for it and we should discuss the APIs first before updating the code. -Xiangrui On Tue, Feb 17, 2015 at 4:10 PM, Debasish Das debasish.da...@gmail.com wrote: It will be really help us if we merge it but I guess it is already diverged from the new ALS...I will also take a

[Performance] Possible regression in rdd.take()?

2015-02-18 Thread Matt Cheah
Hi everyone, Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take() consistently has a slower execution time on the later release. I was wondering if anyone else has had similar observations. I have two setups where this reproduces. The first is a local test. I launched a spark

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Sean Owen
On Wed, Feb 18, 2015 at 6:13 PM, Patrick Wendell pwend...@gmail.com wrote: Patrick this link gives a 404: https://people.apache.org/keys/committer/pwendell.asc Works for me. Maybe it's some ephemeral issue? Yes works now; I swear it didn't before! that's all set now. The signing key is in

Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Patrick Wendell
I believe the heuristic governing the way that take() decides to fetch partitions changed between these versions. It could be that in certain cases the new heuristic is worse, but it might be good to just look at the source code and see, for your number of elements taken and number of partitions,

Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Matt Cheah
I actually tested Spark 1.2.0 with the code in the rdd.take() method swapped out for what was in Spark 1.0.2. The run time was still slower, which indicates to me something at work lower in the stack. -Matt Cheah On 2/18/15, 4:54 PM, Patrick Wendell pwend...@gmail.com wrote: I believe the

Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Matt Cheah
Ah okay, I turned on spark.localExecution.enabled and the performance returned to what Spark 1.0.2 had. However I can see how users can inadvertently incur memory and network strain in fetching the whole partition to the driver. I'll evaluate on my side if we want to turn this on or not. Thanks

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-18 Thread Krishna Sankar
+1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 14:50 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11 2. Tested pyspark, mllib - running as well as compare results with 1.1.x 1.2.x 2.1.

Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Aaron Davidson
You might be seeing the result of this patch: https://github.com/apache/spark/commit/d069c5d9d2f6ce06389ca2ddf0b3ae4db72c5797 which was introduced in 1.1.1. This patch disabled the ability for take() to run without launching a Spark job, which means that the latency is significantly increased
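Once local execution is disabled, every round of take() is a separate job, and the number of rounds is governed by the scan heuristic. A simplified sketch of that driver-side loop (Spark's actual code also estimates how many partitions to try from the results so far; this version just quadruples each round):

```python
def take(partitions, n, scale_up_factor=4):
    """Simplified model of RDD.take's multi-round scan: try one
    partition first, and if that round did not yield n elements,
    quadruple the number of partitions scanned in the next round.
    Each round models one runJob, which is where the extra per-job
    latency comes from once local execution is off."""
    results = []
    scanned = 0
    num_to_scan = 1
    while len(results) < n and scanned < len(partitions):
        batch = partitions[scanned:scanned + num_to_scan]
        for part in batch:              # one job over `batch`
            for elem in part:
                if len(results) == n:
                    break
                results.append(elem)
        scanned += len(batch)
        num_to_scan *= scale_up_factor
    return results
```

With few elements per partition, a small take() can thus pay for several jobs where it previously paid for none.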

Re: Streaming partitions to driver for use in .toLocalIterator

2015-02-18 Thread Mingyu Kim
Another alternative would be to compress the partition in memory in a streaming fashion instead of calling .toArray on the iterator. Would it be an easier mitigation to the problem? Or, is it hard to compress the rows one by one without materializing the full partition in memory using the
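Streaming compression of a partition is straightforward with an incremental compressor, which never needs the uncompressed rows as one array. A Python sketch of the idea (illustrative only; `encode` is a placeholder for whatever row serialization is in use):

```python
import zlib

def compress_partition_streaming(iterator, encode=lambda row: row.encode()):
    """Feed rows one at a time into a streaming zlib compressor, so
    the uncompressed partition is never materialized in memory (the
    alternative to calling .toArray on the iterator)."""
    comp = zlib.compressobj()
    chunks = []
    for row in iterator:
        chunks.append(comp.compress(encode(row) + b"\n"))
    chunks.append(comp.flush())          # emit any buffered tail
    return b"".join(chunks)
```

The compressed output still accumulates, but for compressible row data that is a much smaller footprint than the raw partition.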

Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-18 Thread Kannan Rajah
According to hive documentation, sort by is supposed to order the results for each reducer. So if we set a single reducer, then the results should be sorted, right? But this is not happening. Any idea why? Looks like the settings I am using to restrict the number of reducers is not having an
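The expectation rests on Hive's SORT BY contract: rows are distributed to reducers and sorted only within each reducer, so with exactly one reducer the output happens to be totally ordered. A toy model of those semantics (not Spark SQL or Hive code; hash distribution is an illustrative choice):

```python
def sort_by(rows, key, num_reducers, partitioner=hash):
    """Toy model of Hive's SORT BY: distribute rows to reducers
    (here by hash of the key), then sort within each reducer only.
    Global order is guaranteed only when num_reducers == 1 -- so if
    the reducer-count setting is not actually taking effect, the
    sorted output silently disappears."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[partitioner(key(row)) % num_reducers].append(row)
    out = []
    for bucket in reducers:
        out.extend(sorted(bucket, key=key))   # per-reducer sort only
    return out
```

This is consistent with the symptom described: if the setting restricting the number of reducers is ignored, the per-reducer guarantee still holds but the single-reducer total ordering does not.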