Spark Mesos Architecture and Integration Details
Hi all, I am a new member of this group; sorry if this question has already been asked. I am looking for some information regarding Spark-Mesos integration:

1. How does Mesos schedule and launch Spark executors? (Any pointer to code would be helpful.)
2. How does Mesos front-end the Spark administration capabilities? Are there any URLs through which Mesos proxies requests to the Spark master? I am particularly interested in how to access the Spark master admin and monitoring URLs when running on Mesos.
3. Any other good documentation on Mesos/Spark architecture and internals to get me started?

Regards
Sumit Chawla
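On point 2, one way to locate the Spark UI is the Mesos master's /state endpoint, which lists registered frameworks along with each framework's advertised `webui_url`. A minimal sketch, assuming a placeholder master address (the field names follow the master's /state JSON):

```python
import json
import urllib.request

def spark_webui_urls(state):
    """Map each Spark framework's name to its advertised webui_url,
    given the JSON object returned by the master's /state endpoint."""
    return {
        fw["name"]: fw.get("webui_url", "")
        for fw in state.get("frameworks", [])
        if "spark" in fw["name"].lower()
    }

def fetch_state(master="http://mesos-master.example.com:5050"):
    """Fetch /state from the master (the address above is a placeholder)."""
    with urllib.request.urlopen(master + "/state") as resp:
        return json.load(resp)

# Usage: print(spark_webui_urls(fetch_state()))
```

Note this only surfaces what the framework itself advertises; whether the Mesos web UI proxies those URLs is a separate question.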
Re: Quota
The dispatcher needs 1 CPU and 1G memory.

Regards,
Vijay

Sent from my iPhone

> On Dec 9, 2016, at 4:51 PM, Vinod Kone wrote:
>
> And how many resources does spark need?
>
> [quoted slave state and earlier messages in this thread elided]
Re: Quota
And how many resources does spark need?

On Fri, Dec 9, 2016 at 4:05 PM, Vijay Srinivasaraghavan <vijikar...@yahoo.com> wrote:

> Here is the slave state info. I see marathon is registered as the
> "slave_public" role and is configured with
> "default_accepted_resource_roles" as "*".
>
> [quoted slave state and earlier messages in this thread elided]
Re: Quota
Here is the slave state info. I see marathon is registered as the "slave_public" role and is configured with "default_accepted_resource_roles" as "*":

"slaves": [
  {
    "id": "69356344-e2c4-453d-baaf-22df4a4cc430-S0",
    "pid": "slave(1)@xxx.xxx.xxx.100:5051",
    "hostname": "xxx.xxx.xxx.100",
    "registered_time": 1481267726.19244,
    "resources": {
      "disk": 12099.0,
      "mem": 14863.0,
      "gpus": 0.0,
      "cpus": 4.0,
      "ports": "[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
    },
    "used_resources": { "disk": 0.0, "mem": 0.0, "gpus": 0.0, "cpus": 0.0 },
    "offered_resources": { "disk": 0.0, "mem": 0.0, "gpus": 0.0, "cpus": 0.0 },
    "reserved_resources": {},
    "unreserved_resources": {
      "disk": 12099.0,
      "mem": 14863.0,
      "gpus": 0.0,
      "cpus": 4.0,
      "ports": "[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
    },
    "attributes": {},
    "active": true,
    "version": "1.0.1"
  }
],

Regards
Vijay

On Friday, December 9, 2016 3:48 PM, Vinod Kone wrote:

> How many resources does the agent register with the master? How many
> resources does the spark task need?
>
> I'm guessing marathon is not registered with the "test" role, so it is
> only getting unreserved resources, which are not enough for the spark
> task?
>
> [quoted original "Quota" message elided]
Re: Quota
How many resources does the agent register with the master? How many resources does the spark task need?

I'm guessing marathon is not registered with the "test" role, so it is only getting unreserved resources, which are not enough for the spark task?

On Fri, Dec 9, 2016 at 2:54 PM, Vijay Srinivasaraghavan <vijikar...@yahoo.com> wrote:

> [original "Quota" message quoted in full; elided]
Re: Multi-agent machine
Ok, thanks!

On Fri, Dec 9, 2016 at 2:32 PM, Benjamin Mahler wrote:

> Maintenance should work in this case, it will just be applied to all
> agents on the machine.
>
> [earlier quoted messages in this thread elided]
Quota
I have a standalone DCOS setup (single-node Vagrant VM running a DCOS v1.9-dev build + Mesos 1.0.1 + Marathon 1.3.0). Both master and agent are running on the same VM.

Resources: 4 CPU, 16GB memory, 20G disk.

I have created a quota using the new V1 API, which creates a role "test" with resource constraints of 0.5 CPU and 1G memory.

When I try to deploy the Spark package, Marathon receives the request but the task stays in the "waiting" state since it did not receive any offers from the master, though I don't see any resource constraints from the hardware perspective.

However, when I deleted the quota, Marathon was able to move forward with the deployment and Spark was deployed and running. I could see from the Mesos master logs that it had sent an offer to the Marathon framework.

To debug the issue, I created a quota again but this time did not provide any CPU or memory (0 cpu and 0 mem). After this, when I deployed Spark from the DCOS UI, I could see Marathon getting an offer from the master and deploying Spark without the need to delete the quota.

Did anyone notice similar behavior?

Regards
Vijay
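For reference, a quota like the one described is set by POSTing a quota request to the master's /quota endpoint. A sketch of building that request body (the exact values mirror the message above; the helper name is mine):

```python
import json

def quota_request(role, cpus, mem_mb):
    """Build a quota request body for POST http://<master>:5050/quota."""
    def scalar(name, value):
        return {"name": name, "type": "SCALAR", "scalar": {"value": value}}
    return {"role": role, "guarantee": [scalar("cpus", cpus), scalar("mem", mem_mb)]}

body = json.dumps(quota_request("test", 0.5, 1024))
```

One plausible explanation for the observed behavior, consistent with Vinod's reply but not a confirmed diagnosis: to guarantee the quota, the master lays away unreserved resources for the "test" role, and in a single-agent cluster the agent's entire offer can be withheld, leaving nothing for Marathon; deleting the quota (or setting a zero guarantee) releases those resources.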
Re: Multi-agent machine
Maintenance should work in this case, it will just be applied to all agents on the machine.

On Fri, Dec 9, 2016 at 1:20 PM, Charles Allen wrote:

> Thanks for the insight.
>
> I take that to mean the maintenance primitives might not work right for
> multi-agent machines? I.e., I can't do maintenance on one agent but not
> the others?
>
> [earlier quoted messages in this thread elided]
Re: Multi-agent machine
Thanks for the insight.

I take that to mean the maintenance primitives might not work right for multi-agent machines? I.e., I can't do maintenance on one agent but not the others?

On Fri, Dec 9, 2016 at 12:16 PM, Jie Yu wrote:

> [Jie's reply quoted in full; elided]
Re: Multi-agent machine
Charles,

It should be possible. Here are the global objects that might conflict:

1) cgroups (you can use a different cgroup root)
2) work_dir and runtime_dir (you can set them to be different between agents)
3) network (e.g., iptables; if you use host networking it should not be a problem, otherwise you might need to configure your network isolator properly)

But we haven't tested this. Another potential issue is code that relies on the hostname of the agent (MachineID in the maintenance primitives?).

- Jie

On Fri, Dec 9, 2016 at 12:11 PM, Charles Allen <charles.al...@metamarkets.com> wrote:

> Is it possible to set up a machine such that multiple mesos agents are
> running on the same machine and registering with the same master?
>
> For example, with different cgroup roots or different default working
> directories.
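A sketch of what those suggestions might look like as concrete agent invocations. The flag names are real mesos-agent flags, but the ports, paths, and master address are illustrative assumptions:

```python
def agent_cmd(n, master="zk://localhost:2181/mesos"):
    """Command line for the n-th mesos-agent on a shared machine (n = 0, 1, ...)."""
    return [
        "mesos-agent",
        "--master=%s" % master,
        "--port=%d" % (5051 + n),                    # distinct port per agent
        "--work_dir=/var/lib/mesos/agent%d" % n,     # distinct work_dir
        "--runtime_dir=/var/run/mesos/agent%d" % n,  # distinct runtime_dir
        "--cgroups_root=mesos_agent%d" % n,          # distinct cgroup root
    ]

for n in range(2):
    print(" ".join(agent_cmd(n)))
```

Each agent then registers with the master under its own agent ID, since the (hostname, port) pair and work_dir differ.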
Multi-agent machine
Is it possible to set up a machine such that multiple mesos agents are running on the same machine and registering with the same master? For example, with different cgroup roots or different default working directories.
Re: Duplicate task IDs
Hey Neil,

I concur that using duplicate task IDs is bad practice and asking for trouble.

Could you please clarify *why* you want to use a hashmap? Is your goal to remove duplicate task IDs, or is this just a side-effect and you have a different reason (e.g. performance) for using a hashmap? I'm wondering why a multi-hashmap is not sufficient. This would be clear if you were explicitly *trying* to get rid of duplicates, of course :-)

Thanks,
Joris

*Joris Van Remoortere*
Mesosphere

On Fri, Dec 9, 2016 at 7:08 AM, Neil Conway wrote:

> Folks,
>
> The master stores a cache of metadata about recently completed tasks;
> for example, this information can be accessed via the "/tasks" HTTP
> endpoint or the "GET_TASKS" call in the new Operator API.
>
> The master currently stores this metadata using a list; this means
> that duplicate task IDs are permitted. We're considering [1] changing
> this to use a hashmap instead. Using a hashmap would mean that
> duplicate task IDs would be discarded: if two completed tasks have the
> same task ID, only the metadata for the most recently completed task
> would be retained by the master.
>
> If this behavior change would cause problems for your framework or
> other software that relies on Mesos, please let me know.
>
> (Note that if you do have two completed tasks with the same ID, you'd
> need an unambiguous way to tell them apart. As a recommendation, I
> would strongly encourage framework authors to never reuse task IDs.)
>
> Neil
>
> [1] https://reviews.apache.org/r/54179/
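To make the proposed semantic change concrete, here is a small sketch (illustrative task records, not the master's actual data structures) of what switching from a list to a hashmap keyed by task ID does to duplicates:

```python
# Completed-task metadata with a reused task ID.
completed = [
    {"task_id": "job-1", "state": "TASK_FINISHED", "ts": 100},
    {"task_id": "job-1", "state": "TASK_FAILED",   "ts": 200},  # reused ID
    {"task_id": "job-2", "state": "TASK_FINISHED", "ts": 150},
]

as_list = completed  # current behavior: both "job-1" entries are retained

# Proposed behavior: keyed by task ID, the most recently inserted
# entry for "job-1" silently replaces the earlier one.
as_map = {t["task_id"]: t for t in completed}

print(len(as_list))              # 3
print(len(as_map))               # 2
print(as_map["job-1"]["state"])  # TASK_FAILED
```

This is exactly why frameworks that reuse task IDs would see history disappear from "/tasks" and "GET_TASKS" after the change.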
Command healthcheck failed but status KILLED
Hi,

What is the desired behavior when a command health check fails? On Mesos 1.0.2, when the health check fails, the task ends in state KILLED instead of FAILED with a reason specifying that it was killed due to the failing health check.

Thanks
Tomek
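For context, a command health check is declared on the TaskInfo roughly as sketched below; the field names follow the Mesos HealthCheck protobuf, while the command and threshold values are illustrative assumptions:

```python
# Illustrative COMMAND health check, shaped like the Mesos HealthCheck
# message as embedded in a TaskInfo. Values here are assumptions.
health_check = {
    "delay_seconds": 15.0,      # grace period before the first check
    "interval_seconds": 10.0,   # time between checks
    "timeout_seconds": 5.0,     # per-check timeout
    "consecutive_failures": 3,  # failures tolerated before the task is killed
    "command": {"value": "curl -f http://localhost:8080/health"},
}
```

Once `consecutive_failures` is exceeded, the executor kills the task, which would be consistent with seeing a KILLED terminal state rather than FAILED; whether that is the intended behavior is exactly the question above.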