When you say "Spark is one of the forerunners for our technology choice", what are the other options you are looking into ?
I start cross validation runs on a 40 core, 160 GB spark job using a script...I woke up in the morning, none of the jobs crashed ! and the project just came out of incubation I wish Spark keep evolving as a standalone Akka cluster (MPI cluster if you remember C++ mpiexec :-) where you can plug and play any distributed file system (HDFS,..,) or distributed caching systems (HBase, Cassandra,..) I am also confident that Spark as a standalone akka cluster can serve analytics driven scalable frontend apps....and by analytics I don't mean sql analytics...but predictive analytics... On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth < andras.nem...@lynxanalytics.com> wrote: > Hello Spark Users, > > With the recent graduation of Spark to a top level project (grats, btw!), > maybe a well timed question. :) > > We are at the very beginning of a large scale big data project and after > two months of exploration work we'd like to settle on the technologies to > use, roll up our sleeves and start to build the system. > > Spark is one of the forerunners for our technology choice. > > My question in essence is whether it's a good idea or is Spark too > 'experimental' just yet to bet our lives (well, the project's life) on it. > > The benefits of choosing Spark are numerous and I guess all too obvious > for this audience - e.g. we love its powerful abstraction, ease of > development and the potential for using a single system for serving and > manipulating huge amount of data. > > This email aims to ask about the risks. I enlist concrete issues we've > encountered below, but basically my concern boils down to two philosophical > points: > I. Is it too much magic? Lots of things "just work right" in Spark and > it's extremely convenient and efficient when it indeed works. But should we > be worried that customization is hard if the built in behavior is not quite > right for us? Are we to expect hard to track down issues originating from > the black box behind the magic? > II. Is it mature enough? E.g. we've created a pull > request<https://github.com/apache/spark/pull/181>which fixes a problem that > we were very surprised no one ever stumbled upon > before. So that's why I'm asking: is Spark being already used in > professional settings? Can one already trust it being reasonably bug free > and reliable? > > I know I'm asking a biased audience, but that's fine, as I want to be > convinced. :) > > So, to the concrete issues. Sorry for the long mail, and let me know if I > should break this out into more threads or if there is some other way to > have this discussion... > > 1. Memory management > The general direction of these questions is whether it's possible to take > RDD caching related memory management more into our own hands as LRU > eviction is nice most of the time but can be very suboptimal in some of our > use cases. > A. Somehow prioritize cached RDDs, E.g. mark some "essential" that one > really wants to keep. I'm fine with going down in flames if I mark too much > data essential. > B. Memory "reflection": can you pragmatically get the memory size of a > cached rdd and memory sizes available in total/per executor? If we could do > this we could indirectly avoid automatic evictions of things we might > really want to keep in memory. > C. Evictions caused by RDD partitions on the driver. I had a setup with > huge worker memory and smallish memory on the driver JVM. To my surprise, > the system started to cache RDD partitions on the driver as well. As the > driver ran out of memory I started to see evictions while there were still > plenty of space on workers. This resulted in lengthy recomputations. Can > this be avoided somehow? > D. Broadcasts. Is it possible to get rid of a broadcast manually, without > waiting for the LRU eviction taking care of it? Can you tell the size of a > broadcast programmatically? > > > 2. Akka lost connections > We have quite often experienced lost executors due to akka exceptions - > mostly connection lost or similar. It seems to happen when an executor gets > extremely busy with some CPU intensive work. Our hypothesis is that akka > network threads get starved and the executor fails to respond within > timeout limits. Is this plausible? If yes, what can we do with it? > > In general, these are scary errors in the sense that they come from the > very core of the framework and it's hard to link it to something we do in > our own code, and thus hard to find a fix. So a question more for the > community: how often do you end up scratching your head about cases where > spark magic doesn't work perfectly? > > > 3. Recalculation of cached rdds > I see the following scenario happening. I load two RDDs A,B from disk, > cache them and then do some jobs on them, at the very least a count on > each. After these jobs are done I see on the storage panel that 100% of > these RDDs are cached in memory. > > Then I create a third RDD C which is created by multiple joins and maps > from A and B, also cache it and start a job on C. When I do this I still > see A and B completely cached and also see C slowly getting more and more > cached. This is all fine and good, but in the meanwhile I see stages > running on the UI that point to code which is used to load A and B. How is > this possible? Am I misunderstanding how cached RDDs should behave? > > And again the general question - how can one debug such issues? > > 4. Shuffle on disk > Is it true - I couldn't find it in official docs, but did see this > mentioned in various threads - that shuffle _always_ hits disk? > (Disregarding OS caches.) Why is this the case? Are you planning to add a > function to do shuffle in memory or are there some intrinsic reasons for > this to be impossible? > > > Sorry again for the giant mail, and thanks for any insights! > > Andras > > >