Santosh,

To add a bit more to what Nabeel said, Spark and Impala are very different tools. Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS and is part of the Hadoop ecosystem. Impala really shines when your entire dataset fits into memory and your processing can be expressed in terms of SQL. Paired with the column-oriented Parquet format, it can really scream with the right dataset.
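To give a rough idea of the Spark-plus-Parquet side of that, here is a minimal sketch against the Spark 1.0-era SQLContext API; the paths, the Event case class, and the query are made up for illustration, not our actual flow:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type; any case class works with Spark SQL's schema inference.
    case class Event(userId: String, url: String, ts: Long)

    object ParquetSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD[Event] -> SchemaRDD

        // Batch side: build the dataset in Spark and persist it as column-oriented Parquet.
        val events = sc.textFile("hdfs:///data/raw/events").map { line =>
          val Array(user, url, ts) = line.split("\t")
          Event(user, url, ts.toLong)
        }
        events.saveAsParquetFile("hdfs:///data/warehouse/events_parquet")

        // The same files can be read back with Spark SQL (or handed to a query engine).
        val parquetEvents = sqlContext.parquetFile("hdfs:///data/warehouse/events_parquet")
        parquetEvents.registerAsTable("events")

        // The result is an ordinary RDD, so SQL and regular RDD operations mix freely.
        val recent = sqlContext.sql("SELECT userId, url FROM events WHERE ts > 0")
        println("matching rows: " + recent.count())

        sc.stop()
      }
    }

Impala can then be pointed at the same Parquet directory (e.g. via an external table) for the interactive query side.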
Spark also has a SQL layer (formerly Shark, now more tightly integrated with Spark), but at least for our dataset, Impala was faster. However, Spark has a fantastic and far more flexible programming model. As has been mentioned a few times in this thread, it has a better batch processing model than map/reduce, it can do stream processing, and in the newest release it looks like it can even mix and match SQL queries. You do need to be more aware of memory issues than with map/reduce, since using more memory is one of the primary sources of Spark's speed, but with that caveat, it's a great technology. In our particular workflow, we're replacing map/reduce with Spark for our batch layer and using Impala for our query layer.

Daniel,

For what it's worth, we've had a bunch of hanging issues because the garbage collector seems to get out of control. The most effective technique has been to dramatically increase the numPartitions argument in our various groupBy and cogroup calls, which reduces the per-task memory requirements. We also reduced the memory used by the shuffler (spark.shuffle.memoryFraction) and turned off RDD memory caching (since we don't have any iterative algorithms). Finally, using Kryo delivered a huge performance and memory boost (even without registering any custom serializers). (A rough sketch of what those settings look like is included after the quoted thread, at the end of this message.)

Keith

On Tue, Jul 8, 2014 at 2:58 PM, Robert James <srobertja...@gmail.com> wrote:

> As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in between all using more or less straight Scala is unarguable. But the experience of deploying Spark has been quite painful, mainly because of gaps between compile time and run time on the JVM: dependency conflicts, having to use uber jars, Spark's own uber jar which includes some very old libs, etc.
>
> What's more, there are very few resources available to help. Sometimes I've been able to get help via public sources, but, more often than not, it's been trial and error. Enough that, despite Spark's unmistakable appeal, we are seriously considering dropping it entirely and just doing classical Hadoop.
>
> On 7/8/14, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
> > Aaron,
> >
> > I don't think anyone was saying Spark can't handle this data size, given testimony from the Spark team, Bizo, etc., on large datasets. This has kept us trying different things to get our flow to work over the course of several weeks.
> >
> > Agreed that the first instinct should be "what did I do wrong".
> >
> > I believe that is what every person facing this issue has done, in reaching out to the user group repeatedly over the course of the few months that I've been active here. I also know other companies (all experienced with large production datasets on other platforms) facing the same types of issues - flows that run on subsets of data but not the whole production set.
> >
> > So I think, as you are saying, it points to the need for further diagnostics. And maybe also some type of guidance on typical issues with different types of datasets (wide rows, narrow rows, etc.), flow topologies, etc.? Hard to tell where we are going wrong right now. We've tried many things over the course of 6 weeks or so.
> >
> > I tried to look for the professional services link on databricks.com but didn't find it. ;-) (jk).
> > -Suren
> >
> > On Tue, Jul 8, 2014 at 4:16 PM, Aaron Davidson <ilike...@gmail.com> wrote:
> >
> >>> Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads.
> >>
> >> +1 @Reynold
> >>
> >> Spark can handle big "big data". There are known issues with informing the user about what went wrong and how to fix it that we're actively working on, but the first impulse when a job fails should be "what did I do wrong" rather than "Spark can't handle this workload". Messaging is a huge part in making this clear -- getting things like a job hanging or an out-of-memory error can be very difficult to debug, and improving this is one of our highest priorities.
> >>
> >> On Tue, Jul 8, 2014 at 12:47 PM, Reynold Xin <r...@databricks.com> wrote:
> >>
> >>> Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads.
> >>>
> >>> I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5 TB/node and stuffed my disks so full that the file system was close to breaking.
> >>>
> >>> We can definitely do a better job in Spark to make it output more meaningful diagnostics and be more robust with partitions of data that don't fit in memory, though. A lot of the work in the next few releases will be on that.
> >>>
> >>> On Tue, Jul 8, 2014 at 10:04 AM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
> >>>
> >>>> I'll respond for Dan.
> >>>>
> >>>> Our test dataset was a total of 10 GB of input data (the full production dataset for this particular dataflow would be roughly 60 GB).
> >>>>
> >>>> I'm not sure what the size of the final output data was, but I think it was on the order of 20 GB for the given 10 GB of input data. Also, I can say that when we were experimenting with persist(DISK_ONLY), the size of all RDDs on disk was around 200 GB, which gives a sense of overall transient memory usage with no persistence.
> >>>>
> >>>> In terms of our test cluster, we had 15 nodes. Each node had 24 cores and 2 workers. Each executor got 14 GB of memory.
> >>>>
> >>>> -Suren
> >>>>
> >>>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com> wrote:
> >>>>
> >>>>> When you say "large data sets", how large?
> >>>>> Thanks
> >>>>>
> >>>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
> >>>>>
> >>>>> From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development are also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell, and if you want that in your production code it will look exactly the same.
> >>>>>
> >>>>> Unfortunately, the picture isn't so rosy when it gets to production.
> >>>>> In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce).
> >>>>>
> >>>>> I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution.
> >>>>>
> >>>>> Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API, e.g. Apache Crunch. So you might consider those as well.
> >>>>>
> >>>>> (Note: the above is with Spark 1.0.0.)
> >>>>>
> >>>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com> wrote:
> >>>>>
> >>>>>> Hello Experts,
> >>>>>>
> >>>>>> I am doing some comparative study on the below:
> >>>>>>
> >>>>>> Spark vs Impala
> >>>>>> Spark vs MapReduce. Is it worth migrating from an existing MR implementation to Spark?
> >>>>>>
> >>>>>> Please share your thoughts and expertise.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Santosh
> >>>>>>
> >>>>>> ------------------------------
> >>>>>>
> >>>>>> This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy.
> >>>>>> ______________________________________________________________________________________
> >>>>>>
> >>>>>> www.accenture.com
> >>>>>
> >>>>> --
> >>>>> Daniel Siegmann, Software Developer
> >>>>> Velos
> >>>>> Accelerating Machine Learning
> >>>>>
> >>>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> >>>>> E: daniel.siegm...@velos.io W: www.velos.io
> >>>>
> >>>> --
> >>>>
> >>>> SUREN HIRAMAN, VP TECHNOLOGY
> >>>> Velos
> >>>> Accelerating Machine Learning
> >>>>
> >>>> 440 NINTH AVENUE, 11TH FLOOR
> >>>> NEW YORK, NY 10001
> >>>> O: (917) 525-2466 ext. 105
> >>>> F: 646.349.4063
> >>>> E: suren.hiraman@velos.io <suren.hira...@sociocast.com>
> >>>> W: www.velos.io
> >
> > --
> >
> > SUREN HIRAMAN, VP TECHNOLOGY
> > Velos
> > Accelerating Machine Learning
> >
> > 440 NINTH AVENUE, 11TH FLOOR
> > NEW YORK, NY 10001
> > O: (917) 525-2466 ext. 105
> > F: 646.349.4063
> > E: suren.hiraman@velos.io <suren.hira...@sociocast.com>
> > W: www.velos.io
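P.S. To make the tuning notes to Daniel above concrete, here is a minimal sketch of roughly what those settings look like, assuming Spark 1.0-era configuration keys; the specific values, paths, and partition count are illustrative, not our actual ones:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD operations like groupByKey/cogroup

    object TuningSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("tuning-sketch")
          // Kryo instead of Java serialization (a win even without custom registrations).
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          // Shrink the shuffle's share of the heap...
          .set("spark.shuffle.memoryFraction", "0.2")
          // ...and the RDD cache's share, since nothing iterative is being persisted.
          .set("spark.storage.memoryFraction", "0.1")
        val sc = new SparkContext(conf)

        // Illustrative pair RDDs; the key point is passing an explicit, large
        // partition count to the wide operations so each task holds less state.
        val numPartitions = 2000
        val left  = sc.textFile("hdfs:///data/left").map(line => (line.take(8), line))
        val right = sc.textFile("hdfs:///data/right").map(line => (line.take(8), line))

        val grouped   = left.groupByKey(numPartitions)
        val cogrouped = left.cogroup(right, numPartitions)

        println(grouped.count() + " groups, " + cogrouped.count() + " cogrouped keys")
        sc.stop()
      }
    }

Raising the partition count spreads each groupBy/cogroup across more, smaller tasks, which is what brought per-task memory (and the garbage collector) back under control for us.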
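And Daniel's point in the quoted thread about a local context making unit tests easy looks roughly like this; the word-count flow, the local[2] master, and the fixture path are stand-ins, not anything from this thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // reduceByKey and friends

    object LocalFlowCheck {
      // The flow under test; a word count stands in for a real pipeline.
      def wordCount(sc: SparkContext, path: String): Map[String, Long] =
        sc.textFile(path)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)
          .collect()
          .toMap

      def main(args: Array[String]): Unit = {
        // A local master and a plain local file are all a test needs; the flow
        // itself is identical to what runs on the cluster.
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("local-flow-check"))
        try {
          val counts = wordCount(sc, "src/test/resources/sample.txt")
          assert(counts.values.sum > 0, "expected some words in the fixture file")
          println(counts)
        } finally {
          sc.stop()
        }
      }
    }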