better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Hi Spark devs, I was looking into the memory usage of shuffle, and one annoying thing about the default compression codec (LZF) is that the implementation we use allocates buffers pretty generously. I did a simple experiment and found that creating 1000 LZFOutputStreams allocated 198976424 bytes
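
A minimal sketch of how such a measurement can be done (illustrative only, not Reynold's actual benchmark; it assumes the Ning compress-lzf library on the classpath):

    import java.io.ByteArrayOutputStream
    import com.ning.compress.lzf.LZFOutputStream

    // Rough heap delta around allocating many LZF streams; System.gc() is only
    // a hint, so treat the number as indicative rather than exact.
    val rt = Runtime.getRuntime
    System.gc()
    val before = rt.totalMemory() - rt.freeMemory()
    // Hold references so the streams' internal buffers stay reachable.
    val streams = (1 to 1000).map(_ => new LZFOutputStream(new ByteArrayOutputStream()))
    System.gc()
    val after = rt.totalMemory() - rt.freeMemory()
    println(s"${streams.size} streams hold roughly ${after - before} bytes")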

Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Patrick Wendell
1. The first error I met is the different SerializationVersionUID in ExecuterStatus, which I resolved by explicitly declaring SerializationVersionUID in ExecuterStatus.scala and recompiling branch-0.1-jdbc. I don't think there is a class in Spark named ExecuterStatus (sic) ... or ExecutorStatus. Is

Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
Ah, sorry, sorry. It's ExecutorState under the deploy package. On Monday, July 14, 2014, Patrick Wendell pwend...@gmail.com wrote: 1. The first error I met is the different SerializationVersionUID in ExecuterStatus, which I resolved by explicitly declaring SerializationVersionUID in
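
For reference, this is how a fixed serial version UID is typically pinned in Scala (the class name here is illustrative, not the actual deploy.ExecutorState definition):

    // Pinning the UID keeps serialized instances compatible across
    // differently-compiled builds of the same class.
    @SerialVersionUID(1L)
    class SomeClusterState extends Serializable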

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Mridul Muralidharan
We tried with lower block size for lzf, but it barfed all over the place. Snappy was the way to go for our jobs. Regards, Mridul On Mon, Jul 14, 2014 at 12:31 PM, Reynold Xin r...@databricks.com wrote: Hi Spark devs, I was looking into the memory usage of shuffle and one annoying thing is

Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Hi all, I've been evaluating YourKit and would like to profile the heap and CPU usage of certain tests from the Spark test suite. In particular, I'm very interested in tracking heap usage by allocation site. Unfortunately, I get a lot of crashes running Spark tests with profiling (and thus

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Matei Zaharia
I haven't seen issues using the JVM's own tools (jstack, jmap, hprof and such), so maybe there's a problem in YourKit or in your release of the JVM. Otherwise I'd suggest increasing the heap size of the unit tests a bit (you can do this in the SBT build file). Maybe they are very close to full
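
A sketch of the SBT change Matei suggests (the exact settings and where they live in Spark's build definition are assumptions):

    // In the SBT build: run tests in a forked JVM with a larger heap.
    fork in Test := true
    javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=512m")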

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd

Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Cody Koeninger
Hi all, just wanted to give a heads up that we're seeing a reproducible deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2 If jira is a better place for this, apologies in advance - figured talking about it on the mailing list was friendlier than randomly (re)opening jira tickets. I know Gary had

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Matei Zaharia
Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote: Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Cody, This Jstack seems truncated, would you mind giving the entire stack trace? For the second thread, for instance, we can't see where the lock is being acquired. - Patrick On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger cody.koenin...@mediacrossing.com wrote: Hi all, just wanted to give

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Thanks, Matei; I have also had some success with jmap and friends and will probably just stick with them! best, wb - Original Message - From: Matei Zaharia matei.zaha...@gmail.com To: dev@spark.apache.org Sent: Monday, July 14, 2014 1:02:04 PM Subject: Re: Profiling Spark tests

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The full jstack would still be useful, but our current working theory is that this is due to the fact that Configuration#loadDefaults goes through every Configuration object that was ever created (via Configuration.REGISTRY) and locks it, thus introducing a dependency from new Configuration to
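
The shape of the workaround being discussed is roughly the following (a sketch under that theory, not the actual patch):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.JobConf

    // Configuration's constructor iterates the shared static REGISTRY of all
    // live Configuration objects and locks them, so constructing one while
    // holding another's lock can deadlock. Funneling construction through a
    // single global lock imposes a consistent ordering.
    object SafeJobConf {
      private val lock = new Object
      def create(base: Configuration): JobConf = lock.synchronized {
        new JobConf(base)
      }
    }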

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Nishkam Ravi
Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would help. I suspect we will start seeing the ConcurrentModificationException again. The right fix has gone into Hadoop through HADOOP-10456. Unfortunately, I don't have any bright ideas on how to synchronize this at the Spark level

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Stephen Haberman
Just a comment from the peanut gallery, but these buffers are a real PITA for us as well. Probably 75% of our non-user-error job failures are related to them. Just naively, what about not doing compression on the fly? E.g. during the shuffle just write straight to disk, uncompressed? For us, we

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Sandy Ryza
Stephen, Often the shuffle is bound by writes to disk, so even if disks have enough space to store the uncompressed data, the shuffle can complete faster by writing less data. Reynold, This isn't a big help in the short term, but if we switch to a sort-based shuffle, we'll only need a single

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Matei Zaharia
You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out, there will still be some buffers for the various OutputStreams, but they should be smaller. Matei On Jul 14, 2014, at 3:30 PM, Stephen Haberman stephen.haber...@gmail.com wrote: Just a
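
A minimal sketch of that setting in application code (the app name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-no-compress")
      .set("spark.shuffle.compress", "false")  // write shuffle outputs uncompressed
    val sc = new SparkContext(conf)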

Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of strings when loading / storing data using parquet and Spark SQL. Before, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as Strings. 9fe693
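
In practice the change means a value stored as Parquet binary now surfaces as a byte array rather than a String, so code relying on the old implicit treatment needs an explicit decode. A tiny illustration (the value here is made up):

    // A hypothetical column value read back as bytes after the change.
    val raw: Array[Byte] = "hello".getBytes("UTF-8")
    // Decode explicitly where string semantics are intended.
    val text: String = new String(raw, "UTF-8")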

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Nishkam, Aaron's fix should prevent two concurrent accesses to getJobConf (and the Hadoop code therein). But if there is code elsewhere that tries to mutate the configuration, then I could see how we might still have the ConcurrentModificationException. I looked at your patch for

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We use the Hadoop configuration inside our code executing on Spark, as we need to list out files in the path. Maybe that is why it is exposed for us. On Mon, Jul 14, 2014 at 6:57 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Nishkam, Aaron's fix should prevent two concurrent accesses
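
The pattern Gary describes looks roughly like this (the path is hypothetical, and sc is an existing SparkContext); every such call touches the shared Hadoop Configuration, which is what exposes the locking issue above:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // List input files using the job's Hadoop configuration.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val inputs = fs.listStatus(new Path("hdfs:///data/incoming")).map(_.getPath)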

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Copying Jon here since he worked on the lzf library at Ning. Jon - any comments on this topic? On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can actually turn off shuffle compression by setting spark.shuffle.compress to false. Try that out, there will

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We'll try to run a build tomorrow AM. On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com wrote: Andrew and Gary, Would you guys be able to test https://github.com/apache/spark/pull/1409/files and see if it solves your problem? - Patrick On Mon, Jul 14, 2014 at 4:18 PM,

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The patch won't solve the problem where two people try to add a configuration option at the same time, but I think there is currently an issue where two people can try to initialize the Configuration at the same time and still run into a ConcurrentModificationException. This at least reduces

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Davies Liu
Maybe we could try LZ4 [1], which has better performance and a smaller footprint than LZF and Snappy. In fast scan mode, its performance is 1.5-2x higher than LZF [2], and it uses roughly 10x less memory (16k vs 190k). [1] https://github.com/jpountz/lz4-java [2]
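
A small sketch of the lz4-java API with a 16 KB block, which is where the smaller footprint comes from (the payload is illustrative):

    import java.io.ByteArrayOutputStream
    import net.jpountz.lz4.LZ4BlockOutputStream

    val sink = new ByteArrayOutputStream()
    // 16 KB block size, versus the ~190 KB of buffers an LZF stream holds.
    val out = new LZ4BlockOutputStream(sink, 16 * 1024)
    out.write("example shuffle payload".getBytes("UTF-8"))
    out.close()
    println(s"compressed to ${sink.size()} bytes")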

SBT gen-idea doesn't work well after merging SPARK-1776

2014-07-14 Thread DB Tsai
I have a clean clone of the Spark master repository, and I generated the IntelliJ project file with sbt gen-idea as usual. There are two issues we have after merging SPARK-1776 (read dependencies from Maven). 1) After SPARK-1776, sbt gen-idea will download the dependencies from the internet, even those jars

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Jon Hartlaub
Is the held memory due to just instantiating the LZFOutputStream? If so, I'm surprised, and I consider that a bug. I suspect the held memory may be due to a SoftReference - memory will be released with enough memory pressure. Finally, is it necessary to keep 1000 (or more) decoders active?
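
A minimal sketch of the recycler pattern Jon is alluding to (an assumption about the design, not the library's actual code); the point is that a SoftReference-cached buffer is reclaimable under pressure, so heap "held" right after allocation can overstate steady-state use:

    import java.lang.ref.SoftReference

    object BufferRecyclerSketch {
      // One cached buffer per thread, reclaimable by the GC when memory is tight.
      private val cache = new ThreadLocal[SoftReference[Array[Byte]]]
      def acquire(size: Int): Array[Byte] = {
        val ref = cache.get()
        val buf = if (ref == null) null else ref.get()
        if (buf != null && buf.length >= size) buf else new Array[Byte](size)
      }
      def release(buf: Array[Byte]): Unit = cache.set(new SoftReference(buf))
    }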

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Aaron Davidson
One of the core problems here is the number of open streams we have, which is (# cores * # reduce partitions), which can easily climb into the tens of thousands for large jobs. This is a more general problem that we are planning on fixing for our largest shuffles, as even moderate buffer sizes can
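
A back-of-envelope version of Aaron's arithmetic (all numbers illustrative):

    val cores = 16
    val reducePartitions = 2000
    val openStreams = cores * reducePartitions   // 32,000 concurrent streams
    val bufferBytes = 190L * 1024                // ~190 KB per LZF stream, per this thread
    // Prints roughly 5,900 MB of compression buffers on one machine.
    println(s"~${openStreams * bufferBytes / (1L << 20)} MB")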

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
- Original Message - From: Aaron Davidson ilike...@gmail.com To: dev@spark.apache.org Sent: Monday, July 14, 2014 5:21:10 PM Subject: Re: Profiling Spark tests with YourKit (or something else) Out of curiosity, what problems are you seeing with Utils.getCallSite? Aaron, if I enable

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Aaron Davidson
Would you mind filing a JIRA for this? That does sound like something bogus happening on the JVM/YourKit level, but this sort of diagnosis is sufficiently important that we should be resilient against it. On Mon, Jul 14, 2014 at 6:01 PM, Will Benton wi...@redhat.com wrote: - Original

ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Just launched an EC2 cluster from git hash 9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD accessing data in S3 yields the following error output. I understand that NoClassDefFoundError errors may mean something in the deployment was messed up. Is that correct? When I launch a

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread scwf
Hi Cody, I met this issue a few days ago and posted a PR for it (https://github.com/apache/spark/pull/1385). It's very strange: if I synchronize conf it deadlocks, but it is OK when synchronizing initLocalJobConfFuncOpt. Here's the entire jstack output. On Mon, Jul 14, 2014 at 4:44 PM,

Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
I resolved the issue by setting up an internal Maven repository to contain the Spark-1.0.1 jar compiled from branch-0.1-jdbc and replacing the dependency on the central repository with our own repository. I believe there should be a more lightweight way. Best, -- Nan Zhu On Monday, July

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Sure thing: https://issues.apache.org/jira/browse/SPARK-2486 https://github.com/apache/spark/pull/1413 best, wb - Original Message - From: Aaron Davidson ilike...@gmail.com To: dev@spark.apache.org Sent: Monday, July 14, 2014 8:38:16 PM Subject: Re: Profiling Spark tests with

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I'm not sure either of those PRs will fix the concurrent adds to Configuration issue I observed. I've got a stack trace and writeup I'll share in an hour or two (traveling today). On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote: hi,Cody i met this issue days before and i post a PR for

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Andrew, is your issue also a regression from 1.0.0 to 1.0.1? The immediate priority is addressing regressions between these two releases. On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash and...@andrewash.com wrote: I'm not sure either of those PRs will fix the concurrent adds to Configuration issue I

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case where a small amount of duplicated code could get rid of the dependency, that could also be a good short-term option. - Patrick On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, I'd just add

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I don't believe mine is a regression. But it is related to thread safety on Hadoop Configuration objects. Should I start a new thread? On Jul 15, 2014 12:55 AM, Patrick Wendell pwend...@gmail.com wrote: Andrew is your issue also a regression from 1.0.0 to 1.0.1? The immediate priority is

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Aaron Davidson
This one is typically due to a mismatch between the Hadoop versions -- i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the classpath, or something like that. Not certain why you're seeing this with spark-ec2, but I'm assuming this is related to the issues you posted in a

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Shivaram Venkataraman
My guess is that this is related to https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library gets excluded from the SBT assembly jar. I am not sure if the assembly jar used in EC2 is generated using SBT though. Shivaram On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Patrick Wendell
Yeah - this is likely caused by SPARK-2471. On Mon, Jul 14, 2014 at 10:11 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: My guess is that this is related to https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library gets excluded from the SBT assembly jar. I am not sure

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Andrew, Yeah, that would be preferable. Definitely worth investigating both, but the regression is more pressing at the moment. - Patrick On Mon, Jul 14, 2014 at 10:02 PM, Andrew Ash and...@andrewash.com wrote: I don't believe mine is a regression. But it is related to thread safety on

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Okie doke--added myself as a watcher on that issue. On a related note, what are the thoughts on automatically spinning up/down EC2 clusters and running tests against them? It would probably be way too cumbersome to do that for every build, but perhaps on some schedule it could help validate that