Compared to YARN, Mesos is simply faster: it has a shorter startup time, and the delay between tasks is smaller. Run times for a 100GB terasort had a median of about 110 sec on Mesos versus roughly double that on YARN.
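An editorial aside on the port mix-up mentioned later in this thread (Spark was pointed at 7077 instead of 5050): Spark's standalone master listens on port 7077, while the Mesos master listens on 5050, and running Spark on Mesos requires a `mesos://` master URL. A minimal sketch of the distinction; the hostname and jar path are illustrative, not taken from the thread:

```shell
# Spark standalone master URL:  spark://<host>:7077
# Spark on Mesos master URL:    mesos://<host>:5050
#   (or mesos://zk://<host>:2181/mesos when the master is ZooKeeper-managed)
# Hostname and example jar path below are illustrative.
./bin/spark-submit \
  --master mesos://yarnmaster-8245:5050 \
  --class org.apache.spark.examples.SparkPi \
  lib/spark-examples.jar 100
```

Pointing `--master` at 7077 on a Mesos cluster produces connection failures rather than a clear error, which is consistent with the symptom described below.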
Unfortunately we require mature multi-tenancy/isolation/queues support, which is still in the initial stages of WIP for Mesos. So we will need to use YARN for the near and likely medium term.

2015-09-17 15:52 GMT-07:00 Marco Massenzio <[email protected]>:

> Hey Stephen,
>
>> The spark on mesos is twice as fast as yarn on our 20 node cluster. In
>> addition, Mesos is handling data sizes that yarn simply dies on. But
>> mesos is still just taking linearly increased time compared to smaller
>> data sizes.
>
> Obviously delighted to hear that, BUT me not much like "but" :)
> I've added Tim, who is one of the main contributors to our Mesos/Spark
> bindings, and it would be great to hear your use case/experience and find
> out whether we can improve on that front too!
>
> As the case may be, we could also jump on a hangout if it makes the
> conversation easier/faster.
>
> Cheers,
>
> *Marco Massenzio*
> *Distributed Systems Engineer, http://codetrips.com*
>
> On Wed, Sep 9, 2015 at 1:33 PM, Stephen Boesch <[email protected]> wrote:
>
>> Thanks Vinod. I went back to see the logs and found nothing interesting.
>> However, in the process I found that my spark port was pointing to 7077
>> instead of 5050. After re-running, spark on mesos worked!
>>
>> The spark on mesos is twice as fast as yarn on our 20 node cluster. In
>> addition, Mesos is handling data sizes that yarn simply dies on. But
>> mesos is still just taking linearly increased time compared to smaller
>> data sizes.
>>
>> We have significant additional work to incorporate mesos into operations
>> and support, but given the strong performance and stability characteristics
>> we are initially seeing here, that effort is likely to get underway.
>>
>> 2015-09-09 12:54 GMT-07:00 Vinod Kone <[email protected]>:
>>
>>> sounds like it. can you see what the slave/agent and executor logs say?
>>>
>>> On Tue, Sep 8, 2015 at 11:46 AM, Stephen Boesch <[email protected]>
>>> wrote:
>>>
>>>> I am in the process of learning how to run a mesos cluster with the
>>>> intent for it to be the resource manager for Spark. As a small step in
>>>> that direction, a basic test of mesos was performed, as suggested by the
>>>> Mesos Getting Started page.
>>>>
>>>> In the following output we see tasks launched and resources offered on
>>>> a 20 node cluster:
>>>>
>>>> [stack@yarnmaster-8245 build]$ ./src/examples/java/test-framework $(hostname -s):5050
>>>> I0908 18:40:10.900964 31959 sched.cpp:157] Version: 0.23.0
>>>> I0908 18:40:10.918957 32000 sched.cpp:254] New master detected at [email protected]:5050
>>>> I0908 18:40:10.921525 32000 sched.cpp:264] No credentials provided. Attempting to register without authentication
>>>> I0908 18:40:10.928963 31997 sched.cpp:448] Framework registered with 20150908-182014-2093760522-5050-15313-0000
>>>> Registered! ID = 20150908-182014-2093760522-5050-15313-0000
>>>> Received offer 20150908-182014-2093760522-5050-15313-O0 with cpus: 16.0 and mem: 119855.0
>>>> Launching task 0 using offer 20150908-182014-2093760522-5050-15313-O0
>>>> Launching task 1 using offer 20150908-182014-2093760522-5050-15313-O0
>>>> Launching task 2 using offer 20150908-182014-2093760522-5050-15313-O0
>>>> Launching task 3 using offer 20150908-182014-2093760522-5050-15313-O0
>>>> Launching task 4 using offer 20150908-182014-2093760522-5050-15313-O0
>>>> Received offer 20150908-182014-2093760522-5050-15313-O1 with cpus: 16.0 and mem: 119855.0
>>>> [... offers O2 through O20 received, each with cpus: 16.0 and mem: 119855.0 ...]
>>>> Status update: task 0 is in state TASK_LOST
>>>> Aborting because task 0 is in unexpected state TASK_LOST with reason
>>>> 'REASON_EXECUTOR_TERMINATED' from source 'SOURCE_SLAVE' with message
>>>> 'Executor terminated'
>>>> I0908 18:40:12.466081 31996 sched.cpp:1625] Asked to abort the driver
>>>> I0908 18:40:12.467051 31996 sched.cpp:861] Aborting framework '20150908-182014-2093760522-5050-15313-0000'
>>>> I0908 18:40:12.468053 31959 sched.cpp:1591] Asked to stop the driver
>>>> I0908 18:40:12.468683 31991 sched.cpp:835] Stopping framework '20150908-182014-2093760522-5050-15313-0000'
>>>>
>>>> Why did the task transition to TASK_LOST? Is there a
>>>> misconfiguration on the cluster?
>>>
>>
>
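Following Vinod's suggestion to check the slave/agent and executor logs: in a Mesos 0.23-era setup, the two places to look for the cause of an `'Executor terminated'` TASK_LOST are the agent's glog output and the executor sandbox. A sketch, assuming the agent was started with `--log_dir=/var/log/mesos` and the default `--work_dir=/tmp/mesos` (both are configurable flags, so adjust the paths to match your deployment):

```shell
# Agent (slave) log: search around the failure timestamp for the
# executor-termination message and the lines immediately preceding it.
grep -i 'executor terminated' /var/log/mesos/mesos-slave.INFO

# Executor sandbox: the failed executor's stdout/stderr live under the
# agent work dir, keyed by framework ID (from the scheduler output above);
# 'latest' is a symlink to the most recent run.
ls /tmp/mesos/slaves/*/frameworks/20150908-182014-2093760522-5050-15313-0000/executors/*/runs/latest/
cat /tmp/mesos/slaves/*/frameworks/20150908-182014-2093760522-5050-15313-0000/executors/*/runs/latest/stderr
```

The sandbox `stderr` typically shows why the executor process exited (missing binary, permission problem, crash on startup); the same files are also browsable from the agent's web UI on port 5051.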

