Hey Committers,
Now that Spark 1.3 rc1 is cut, please restrict branch-1.3 merges to
the following:
1. Fixes for issues blocking the 1.3 release (i.e. regressions from 1.2.X)
2. Documentation and tests.
3. Fixes for non-blocker issues that are surgical, low-risk, and/or
outside of the core.
If there
Please vote on releasing the following candidate as Apache Spark version 1.3.0!
The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
The release files, including signatures, digests, etc.
I do not think it makes sense to make the web server configurable.
Mostly because there's no real problem in running an HTTP service
internally based on Netty while you run your own HTTP service based on
something else like Tomcat. What's the problem?
On Wed, Feb 18, 2015 at 3:14 AM, Niranda
The serializer is created with

    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)

which is definitely not the closure serializer, and so should respect
what you are setting with spark.serializer.
Maybe you can do a quick bit of debugging to see where that assumption
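As a rough stand-in for what that Scala line does, here is the serialize/deserialize round-trip idea in plain Python (pickle standing in for Spark's configured serializer; the names are illustrative, not Spark code):

```python
import pickle

def clone_via_serializer(zero_value):
    """Clone a value by round-tripping it through the serializer --
    the same idea as serializing zeroValue with SparkEnv's serializer."""
    return pickle.loads(pickle.dumps(zero_value))

zero = {"count": 0, "items": []}
cloned = clone_via_serializer(zero)
cloned["items"].append(1)
# mutating the clone leaves the original zero value untouched,
# which is why each task needs its own copy
```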
UISeleniumSuite:
*** RUN ABORTED ***
java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
...
This is a newer test suite. There is something flaky about it; we
should definitely fix it, but IMO it's not a blocker.
Patrick this link gives a 404:
i'm actually going to do this now -- it's really quiet today.
there are two spark pull request builds running, which i will kill and
retrigger once jenkins is back up:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27689/
i'll be kicking jenkins to up the open file limits on the workers. it
should be a very short downtime, and i'll post updates on my progress
tomorrow.
shane
This would be pretty tricky to do -- the issue is that right now
sparkContext.runJob has you pass in a function from a partition to *one*
result object that gets serialized and sent back: Iterator[T] => U, and
that idea is baked pretty deep into a lot of the internals, DAGScheduler,
Task,
But RDD.aggregate() has this code:

    // Clone the zero value since we will also be serializing it as part of tasks
    var jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())
I do see the SparkEnv.get.serializer used in aggregateByKey however. Perhaps
we just missed it
Hi,
Some of my jobs failed due to "no space left on device", and on those jobs I
was monitoring the shuffle space. When a job failed, the shuffle space was
not cleaned up and I had to clean it manually.
Is there a JIRA already tracking this issue? If no one has been assigned
to it, I can take a look.
On OS X and Ubuntu I see the following test failure in the source
release for 1.3.0-RC1:
UISeleniumSuite:
*** RUN ABORTED ***
java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
...
Patrick this link gives a 404:
https://people.apache.org/keys/committer/pwendell.asc
Finally, I
It looks like this was fixed in
https://issues.apache.org/jira/browse/SPARK-4743 /
https://github.com/apache/spark/pull/3605. Can you see whether that patch
fixes this issue for you?
On Tue, Feb 17, 2015 at 8:31 PM, Matt Cheah mch...@palantir.com wrote:
Hi everyone,
I was using
I've recently run into problems caused by ticket SPARK-5008
https://issues.apache.org/jira/browse/SPARK-5008
This seems like quite a serious regression in 1.2.0, meaning that it's not
really possible to use persistent-hdfs. The config for the persistent-hdfs
points to the wrong part of the
Hi Spark devs,
I'm creating a streaming export functionality for RDDs and am having some
trouble with large partitions. The RDD.toLocalIterator() call pulls over a
partition at a time to the driver, and then streams the RDD out from that
partition before pulling in the next one. When you have
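The pattern being described can be sketched in pure Python (no Spark; partitions are simulated as a list of lists, and the fetch step is a placeholder for the per-partition job toLocalIterator runs):

```python
def to_local_iterator(partitions):
    """Yield elements one partition at a time, so only a single
    partition's worth of data is materialized at once -- the same
    idea as RDD.toLocalIterator()."""
    for partition in partitions:
        # in Spark this step is a job that fetches one partition to the driver
        fetched = list(partition)
        for element in fetched:
            yield element

parts = [[1, 2], [3, 4, 5], [6]]
streamed = list(to_local_iterator(parts))
```

The trouble described above comes from the `list(partition)` step: a single large partition is still materialized whole on the driver.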
Now in JIRA form: https://issues.apache.org/jira/browse/SPARK-5844
On Tue, Feb 17, 2015 at 3:12 PM, Xiangrui Meng men...@gmail.com wrote:
There are three different regParams defined in the grid and there are
three folds. For simplicity, we didn't split the dataset into three and
reuse them,
That looks, at the least, inconsistent. As far as I know this should
be changed so that the zero value is always cloned via the non-closure
serializer. Any objection to that?
On Wed, Feb 18, 2015 at 10:28 PM, Matt Cheah mch...@palantir.com wrote:
But RDD.aggregate() has this code:
// Clone
Yes, that's a bug and should be using the standard serializer.
On Wed, Feb 18, 2015 at 2:58 PM, Sean Owen so...@cloudera.com wrote:
That looks, at the least, inconsistent. As far as I know this should
be changed so that the zero value is always cloned via the non-closure
serializer. Any
Please create a JIRA for it and we should discuss the APIs first
before updating the code. -Xiangrui
On Tue, Feb 17, 2015 at 4:10 PM, Debasish Das debasish.da...@gmail.com wrote:
It would really help us if we merge it, but I guess it has already diverged
from the new ALS... I will also take a
Hi everyone,
Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take()
consistently has a slower execution time on the later release. I was
wondering if anyone else has had similar observations.
I have two setups where this reproduces. The first is a local test. I
launched a spark
On Wed, Feb 18, 2015 at 6:13 PM, Patrick Wendell pwend...@gmail.com wrote:
Patrick this link gives a 404:
https://people.apache.org/keys/committer/pwendell.asc
Works for me. Maybe it's some ephemeral issue?
Yes, it works now; I swear it didn't before! That's all set now. The
signing key is in
I believe the heuristic governing the way that take() decides to fetch
partitions changed between these versions. It could be that in certain
cases the new heuristic is worse, but it might be good to just look at
the source code and see, for your number of elements taken and number
of partitions,
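The heuristic in question can be sketched roughly like this (pure Python, no Spark; the scale-up factor of 4 mirrors what RDD.take used around this era, but treat the details as an approximation, not the actual source):

```python
def take(partitions, num):
    """Incrementally fetch partitions until `num` elements are collected,
    growing the number of partitions tried per round when earlier rounds
    come back short -- roughly what RDD.take() does."""
    results = []
    parts_to_try = 1
    index = 0
    while len(results) < num and index < len(partitions):
        for p in partitions[index:index + parts_to_try]:
            results.extend(p)
            if len(results) >= num:
                break
        index += parts_to_try
        parts_to_try *= 4  # scale up after an undershoot
    return results[:num]

taken = take([[1], [], [2, 3], [4, 5, 6]], 4)
```

If the first partitions are empty or small, each extra round is another job, which is one way a heuristic change shows up as a latency regression.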
I actually tested Spark 1.2.0 with the code in the rdd.take() method
swapped out for what was in Spark 1.0.2. The run time was still slower,
which indicates to me something at work lower in the stack.
-Matt Cheah
On 2/18/15, 4:54 PM, Patrick Wendell pwend...@gmail.com wrote:
I believe the
Ah okay, I turned on spark.localExecution.enabled and the performance
returned to what Spark 1.0.2 had. However I can see how users can
inadvertently incur memory and network strain in fetching the whole
partition to the driver.
I'll evaluate on my side if we want to turn this on or not. Thanks
+1 (non-binding, of course)
1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 14:50 min
   mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
   -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, MLlib - running, as well as comparing results with
   1.1.x and 1.2.x
2.1.
You might be seeing the result of this patch:
https://github.com/apache/spark/commit/d069c5d9d2f6ce06389ca2ddf0b3ae4db72c5797
which was introduced in 1.1.1. This patch disabled the ability for take()
to run without launching a Spark job, which means that the latency is
significantly increased
Another alternative would be to compress the partition in memory in a
streaming fashion instead of calling .toArray on the iterator. Would it be
an easier mitigation to the problem? Or, is it hard to compress the rows
one by one without materializing the full partition in memory using the
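The row-by-row compression idea can be sketched in pure Python (zlib's streaming compressor standing in for whatever codec Spark would use; this is an illustration of the approach, not Spark code):

```python
import zlib

def compress_rows(rows):
    """Feed rows through a streaming compressor one at a time instead of
    materializing the whole partition first (as .toArray would)."""
    compressor = zlib.compressobj()
    chunks = []
    for row in rows:
        # each row is compressed as it arrives; only the compressed
        # output accumulates in memory
        chunks.append(compressor.compress(row.encode("utf-8") + b"\n"))
    chunks.append(compressor.flush())
    return b"".join(chunks)

compressed = compress_rows(str(i) for i in range(1000))
restored = zlib.decompress(compressed).decode("utf-8").splitlines()
```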
According to the Hive documentation, SORT BY is supposed to order the results
within each reducer. So if we set a single reducer, then the results should be
fully sorted, right? But this is not happening. Any idea why? It looks like the
setting I am using to restrict the number of reducers is not having an
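For reference, this is the kind of statement under discussion (an illustrative Hive session; the table, column, and setting shown are assumptions, not taken from the original message):

```sql
-- SORT BY only orders rows *within* each reducer;
-- with exactly one reducer it should coincide with a total ORDER BY
SET mapred.reduce.tasks = 1;
SELECT name, score FROM scores SORT BY score;
```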