Also, the exact same flow completed fine with 1 GB of input data.

-Suren
On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:

> How wide are the rows of data, either the raw input data or any
> generated intermediate data?
>
> We are at a loss as to why our flow doesn't complete. We banged our
> heads against it for a few weeks.
>
> -Suren
>
> On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey <kevin.mar...@oracle.com> wrote:
>
>> Nothing particularly custom. We've tested with small (4-node)
>> development clusters, single-node pseudo-clusters, and bigger, using
>> plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark
>> master, Spark local, and Spark YARN (client and cluster) modes, with
>> total memory resources ranging from 4 GB to 256 GB+.
>>
>> K
>>
>> On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:
>>
>> To clarify, we are not persisting to disk. That was just one of the
>> experiments we did because of some issues we had along the way.
>>
>> At this time, we are NOT using persist but cannot get the flow to
>> complete in standalone cluster mode. We do not have a YARN-capable
>> cluster at this time.
>>
>> We agree with what you're saying. Your results are what we were
>> hoping for and expecting. :-) Unfortunately, we still haven't gotten
>> the flow to run end to end on this relatively small dataset.
>>
>> It must be something related to our cluster, standalone mode, or our
>> flow, but as far as we can tell, we are not doing anything unusual.
>>
>> Did you do any custom configuration? Any advice would be appreciated.
>>
>> -Suren
>>
>> On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey <kevin.mar...@oracle.com> wrote:
>>
>>> It seems to me that you're not taking full advantage of the lazy
>>> evaluation, especially persisting to disk only. While it might be
>>> true that the cumulative size of the RDDs looks like 300 GB, only a
>>> small portion of that should be resident at any one time. We've
>>> evaluated data sets much greater than 10 GB in Spark using the Spark
>>> master and Spark with YARN (cluster -- formerly standalone -- mode).
>>> The nice thing about using YARN is that it reports the actual memory
>>> *demand*, not just the memory requested for the driver and workers.
>>> Processing a 60 GB data set through thousands of stages in a rather
>>> complex set of analytics and transformations consumed a total
>>> cluster resource (divided among all workers and the driver) of only
>>> 9 GB. We were somewhat startled at first by this result, thinking
>>> that it would be much greater, but realized that it is a consequence
>>> of Spark's lazy evaluation model. This is even with several
>>> intermediate computations being cached as input to multiple
>>> evaluation paths.
>>>
>>> Good luck.
>>>
>>> Kevin
>>>
>>> On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:
>>>
>>> I'll respond for Dan.
>>>
>>> Our test dataset was a total of 10 GB of input data (the full
>>> production dataset for this particular dataflow would be roughly
>>> 60 GB).
>>>
>>> I'm not sure what the size of the final output data was, but I think
>>> it was on the order of 20 GB for the given 10 GB of input. I can
>>> also say that when we were experimenting with persist(DISK_ONLY),
>>> the size of all RDDs on disk was around 200 GB, which gives a sense
>>> of overall transient memory usage with no persistence.
>>>
>>> In terms of our test cluster, we had 15 nodes, each with 24 cores
>>> and 2 workers. Each executor got 14 GB of memory.
>>>
>>> -Suren
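
For concreteness, here is a rough sketch of what the setup and persistence choices discussed above might look like in the Spark 1.0 Scala API. The master URL, input path, and transformations are illustrative assumptions; only the 14 GB executor memory figure comes from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object FlowSketch {
      def main(args: Array[String]): Unit = {
        // Standalone-mode sizing roughly matching the cluster described
        // above; the master host is hypothetical. Running 2 workers per
        // node would be configured via SPARK_WORKER_INSTANCES=2 in
        // spark-env.sh on each node.
        val conf = new SparkConf()
          .setAppName("FlowSketch")
          .setMaster("spark://master-host:7077")
          .set("spark.executor.memory", "14g")
        val sc = new SparkContext(conf)

        // Transformations are lazy: nothing executes until an action
        // runs, so intermediate RDDs need not all be resident at once.
        val input  = sc.textFile("hdfs:///data/input")   // hypothetical path
        val parsed = input.map(_.split('\t'))

        // Cache an intermediate result feeding multiple evaluation paths
        // (the pattern Kevin mentions). The DISK_ONLY experiment from the
        // thread would instead use:
        //   parsed.persist(StorageLevel.DISK_ONLY)
        val shared = parsed.cache()

        val pathA = shared.map(_.length).distinct().count() // evaluation path 1
        val pathB = shared.count()                          // evaluation path 2
        println(s"distinct row widths: $pathA, total rows: $pathB")

        sc.stop()
      }
    }

Note that cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); swapping in DISK_ONLY reproduces the disk-only experiment, at the cost of writing every partition of that RDD to local disk.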
>>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com> wrote:
>>>
>>>> When you say "large data sets", how large?
>>>> Thanks
>>>>
>>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>>>
>>>> From a development perspective, I vastly prefer Spark to MapReduce.
>>>> The MapReduce API is very constrained; Spark's API feels much more
>>>> natural to me. Testing and local development are also very easy:
>>>> creating a local Spark context is trivial, and it reads local
>>>> files. For your unit tests, you can just have them create a local
>>>> context and execute your flow with some test data. Even better, you
>>>> can do ad hoc work in the Spark shell, and if you want that in your
>>>> production code, it will look exactly the same.
>>>>
>>>> Unfortunately, the picture isn't so rosy when it comes to
>>>> production. In my experience, Spark simply doesn't scale to the
>>>> volumes that MapReduce will handle. Not with a standalone cluster,
>>>> anyway -- maybe Mesos or YARN would be better, but I haven't had
>>>> the opportunity to try them. I find jobs tend to just hang forever
>>>> for no apparent reason on large data sets (though smaller than what
>>>> I push through MapReduce).
>>>>
>>>> I am hopeful the situation will improve -- Spark is developing
>>>> quickly -- but if you have large amounts of data, you should
>>>> proceed with caution.
>>>>
>>>> Keep in mind there are some frameworks for Hadoop which can hide
>>>> the ugly MapReduce behind something very similar in form to Spark's
>>>> API, e.g. Apache Crunch. So you might consider those as well.
>>>>
>>>> (Note: the above is with Spark 1.0.0.)
>>>>
>>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com> wrote:
>>>>
>>>>> Hello Experts,
>>>>>
>>>>> I am doing some comparative study on the below:
>>>>>
>>>>> Spark vs. Impala
>>>>> Spark vs. MapReduce. Is it worth migrating from an existing MR
>>>>> implementation to Spark?
>>>>>
>>>>> Please share your thoughts and expertise.
>>>>>
>>>>> Thanks,
>>>>> Santosh
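
As a footnote to Daniel's point above about local-context testing: a minimal, self-contained sketch of that pattern, assuming the Spark 1.0.0 Scala API (the flow under test, names, and data are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // implicits for reduceByKey

    object LocalFlowTest {
      // Hypothetical stand-in for the real pipeline logic under test.
      def wordCount(sc: SparkContext, lines: Seq[String]): Map[String, Long] =
        sc.parallelize(lines)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)
          .collect()
          .toMap

      def main(args: Array[String]): Unit = {
        // local[2]: run in-process on two threads; no cluster required.
        // A test framework (ScalaTest, JUnit) would normally wrap this.
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("LocalFlowTest"))
        try {
          val result = wordCount(sc, Seq("a b a", "b c"))
          assert(result == Map("a" -> 2L, "b" -> 2L, "c" -> 1L))
          println("flow produced expected counts: " + result)
        } finally {
          sc.stop()
        }
      }
    }

The same flow runs unchanged on a cluster by swapping local[2] for the cluster's master URL, which is what makes this pattern convenient for unit tests and ad hoc work alike.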