Also, the exact same flow completed fine with 1 GB of input data.

-Suren


On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman <
suren.hira...@velos.io> wrote:

> How wide are the rows of data, either the raw input data or any generated
> intermediate data?
>
> We are at a loss as to why our flow doesn't complete. We banged our heads
> against it for a few weeks.
>
> -Suren
>
>
>
> On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey <kevin.mar...@oracle.com>
> wrote:
>
>>  Nothing particularly custom.  We've tested with small (4-node)
>> development clusters, single-node pseudoclusters, and bigger, using
>> plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master,
>> Spark local, and Spark Yarn (client and cluster) modes, with total memory
>> resources ranging from 4 GB to 256 GB+.
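>>
>> For reference, switching among those modes is just a matter of the
>> --master URL passed to spark-submit; something like the following (the
>> class and jar names are placeholders):
>>
>>   spark-submit --class com.example.Flow --master local[*] app.jar
>>   spark-submit --class com.example.Flow --master spark://master:7077 app.jar
>>   spark-submit --class com.example.Flow --master yarn-client app.jar
>>   spark-submit --class com.example.Flow --master yarn-cluster app.jar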
>>
>> K
>>
>>
>>
>> On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:
>>
>> To clarify, we are not persisting to disk. That was just one of the
>> experiments we did because of some issues we had along the way.
>>
>>  At this time we are NOT using persist, but we cannot get the flow to
>> complete in Standalone Cluster mode. We do not currently have a
>> YARN-capable cluster.
>>
>>  We agree with what you're saying. Your results are what we were hoping
>> for and expecting. :-)  Unfortunately we still haven't gotten the flow to
>> run end to end on this relatively small dataset.
>>
>>  It must be something related to our cluster, standalone mode, or our
>> flow, but as far as we can tell we are not doing anything unusual.
>>
>>  Did you do any custom configuration? Any advice would be appreciated.
>>
>>  -Suren
>>
>>
>>
>>
>> On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey <kevin.mar...@oracle.com>
>> wrote:
>>
>>>  It seems to me that you're not taking full advantage of lazy
>>> evaluation, especially if you're persisting to disk only.  While it might
>>> be true that the cumulative size of the RDDs looks like 300 GB, only a
>>> small portion of that should be resident at any one time.  We've evaluated
>>> data sets much larger than 10 GB in Spark using the Spark master and Spark
>>> with Yarn (cluster -- formerly standalone -- mode).  The nice thing about
>>> using Yarn is that it reports the actual memory *demand*, not just the
>>> memory requested for the driver and workers.  Processing a 60 GB data set
>>> through thousands of stages in a rather complex set of analytics and
>>> transformations consumed a total cluster resource (divided among all
>>> workers and the driver) of only 9 GB.  We were somewhat startled at first
>>> by this result, thinking it would be much greater, but realized that it is
>>> a consequence of Spark's lazy evaluation model.  This is even with several
>>> intermediate computations being cached as input to multiple evaluation
>>> paths.
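>>>
>>> To make that concrete, here is a minimal sketch of the pattern (not
>>> anyone's actual code; names and paths are placeholders):
>>>
>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>
>>>   val sc = new SparkContext(new SparkConf().setAppName("lazy-sketch"))
>>>
>>>   // Transformations are lazy: these lines only build a lineage graph,
>>>   // nothing is read or computed yet.
>>>   val lines  = sc.textFile("hdfs:///tmp/input")
>>>   val widths = lines.map(_.length)
>>>
>>>   // Cache one intermediate RDD because two evaluation paths reuse it.
>>>   val wide = widths.filter(_ > 100).cache()
>>>
>>>   // Only these actions trigger execution; partitions stream through
>>>   // memory rather than the whole data set being resident at once.
>>>   val total = wide.reduce(_ + _)
>>>   val count = wide.count()
>>>
>>>   sc.stop()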
>>>
>>> Good luck.
>>>
>>> Kevin
>>>
>>>
>>>
>>> On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:
>>>
>>> I'll respond for Dan.
>>>
>>>  Our test dataset was a total of 10 GB of input data (the full production
>>> dataset for this particular dataflow would be roughly 60 GB).
>>>
>>>  I'm not sure what the size of the final output data was, but I think it
>>> was on the order of 20 GB for the given 10 GB of input data. Also, I can
>>> say that when we were experimenting with persist(DISK_ONLY), the size of
>>> all RDDs on disk was around 200 GB, which gives a sense of the overall
>>> transient data volume when nothing is persisted.
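>>>
>>> (For reference, that experiment was essentially the following one-liner;
>>> the RDD name is a placeholder:)
>>>
>>>   import org.apache.spark.storage.StorageLevel
>>>
>>>   // Materialize this RDD to disk only, never in memory; the ~200 GB
>>>   // above is the on-disk footprint of RDDs persisted this way.
>>>   val intermediate = someRdd.persist(StorageLevel.DISK_ONLY)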
>>>
>>>  In terms of our test cluster, we had 15 nodes; each node had 24 cores
>>> and ran 2 workers, and each executor got 14 GB of memory.
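>>>
>>> In standalone terms, that works out to roughly the following per-node
>>> settings (a sketch, not our exact files):
>>>
>>>   # conf/spark-env.sh on each of the 15 nodes
>>>   export SPARK_WORKER_INSTANCES=2   # 2 workers per node
>>>   export SPARK_WORKER_CORES=12      # 24 cores split across the 2 workers
>>>   export SPARK_WORKER_MEMORY=14g    # memory each worker can hand out
>>>
>>>   # conf/spark-defaults.conf on the submitting machine
>>>   spark.executor.memory  14g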
>>>
>>>  -Suren
>>>
>>>
>>>
>>> On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey <kevin.mar...@oracle.com>
>>> wrote:
>>>
>>>>  When you say "large data sets", how large?
>>>> Thanks
>>>>
>>>>
>>>> On 07/07/2014 01:39 PM, Daniel Siegmann wrote:
>>>>
>>>>  From a development perspective, I vastly prefer Spark to MapReduce.
>>>> The MapReduce API is very constrained; Spark's API feels much more natural
>>>> to me. Testing and local development are also very easy: creating a local
>>>> Spark context is trivial and it reads local files. For your unit tests you
>>>> can just have them create a local context and execute your flow with some
>>>> test data. Even better, you can do ad-hoc work in the Spark shell and if
>>>> you want that in your production code it will look exactly the same.
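>>>>
>>>> For example, a unit test needs little more than this (a sketch; the
>>>> logic under test is a placeholder):
>>>>
>>>>   import org.apache.spark.SparkContext
>>>>
>>>>   // "local[2]" runs Spark in-process with 2 threads; no cluster needed.
>>>>   val sc = new SparkContext("local[2]", "unit-test")
>>>>   try {
>>>>     // Stand-in for the real flow under test.
>>>>     val result = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect()
>>>>     assert(result.sameElements(Array(2, 4, 6)))
>>>>   } finally {
>>>>     sc.stop()
>>>>   }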
>>>>
>>>>  Unfortunately, the picture isn't so rosy when it gets to production.
>>>> In my experience, Spark simply doesn't scale to the volumes that MapReduce
>>>> will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN
>>>> would be better, but I haven't had the opportunity to try them. I find jobs
>>>> tend to just hang forever for no apparent reason on large data sets (but
>>>> smaller than what I push through MapReduce).
>>>>
>>>>  I am hopeful the situation will improve - Spark is developing quickly
>>>> - but if you have large amounts of data you should proceed with caution.
>>>>
>>>>  Keep in mind there are some frameworks for Hadoop which can hide the
>>>> ugly MapReduce API behind something very similar in form to Spark's API,
>>>> e.g. Apache Crunch. So you might consider those as well.
>>>>
>>>>  (Note: the above is with Spark 1.0.0.)
>>>>
>>>>
>>>>
>>>> On Mon, Jul 7, 2014 at 11:07 AM, <santosh.viswanat...@accenture.com>
>>>> wrote:
>>>>
>>>>>  Hello Experts,
>>>>>
>>>>>
>>>>>
>>>>> I am doing a comparative study of the following:
>>>>>
>>>>>
>>>>>
>>>>> Spark vs Impala
>>>>>
>>>>> Spark vs MapReduce. Is it worth migrating from an existing MR
>>>>> implementation to Spark?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Please share your thoughts and expertise.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Santosh
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>  Daniel Siegmann, Software Developer
>>>> Velos
>>>>  Accelerating Machine Learning
>>>>
>>>> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
>>>> E: daniel.siegm...@velos.io W: www.velos.io
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
