Re: [VOTE] Release Apache Mesos 0.22.0 (rc4)

2015-03-19 Thread Alex Rukletsov
+1 (non-binding)

Mac OS 10.9.5 + clang
CentOS 7 + gcc 4.4.7 [cgroups tests disabled]

On Wed, Mar 18, 2015 at 4:04 PM, Brenden Matthews bren...@diddyinc.com
wrote:

 +1

 Tested with internal testing cluster.

 On Wed, Mar 18, 2015 at 1:25 PM, craig w codecr...@gmail.com wrote:

 +1

 On Wed, Mar 18, 2015 at 3:52 PM, Niklas Nielsen nik...@mesosphere.io
 wrote:

 Hi all,

 Please vote on releasing the following candidate as Apache Mesos 0.22.0.


 0.22.0 includes the following:

 

 * Support for explicitly sending status updates acknowledgements from
   schedulers; refer to the upgrades document for upgrading schedulers.
 * Rate limiting slave removal, to safeguard against unforeseen bugs
 leading to
   widespread slave removal.
 * Disk quota isolation in Mesos containerizer; refer to the
 containerization
   documentation to enable disk quota monitoring and enforcement.
 * Support for module hooks in task launch sequence. Refer to the modules
   documentation for more information.
 * Anonymous modules: a new kind of module that does not receive any
 callbacks
   but coexists with its parent process.
 * New service discovery info in task info allows framework users to
 specify
   discoverability of tasks for external service discovery systems. Refer
 to
   the framework development guide for more information.
 * New '--external_log_file' flag to serve external logs through the
 Mesos web UI.
 * New '--gc_disk_headroom' flag to control maximum executor sandbox age.


 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0-rc4

 

 The candidate for Mesos 0.22.0 release is available at:

 https://dist.apache.org/repos/dist/dev/mesos/0.22.0-rc4/mesos-0.22.0.tar.gz

 The tag to be voted on is 0.22.0-rc4:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.22.0-rc4

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.22.0-rc4/mesos-0.22.0.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.22.0-rc4/mesos-0.22.0.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1048

 Please vote on releasing this package as Apache Mesos 0.22.0!

 The vote is open until Sat Mar 21 12:49:56 PDT 2015 and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.22.0
 [ ] -1 Do not release this package because ...

 Thanks,
 Niklas




 --

 https://github.com/mindscratch
 https://www.google.com/+CraigWickesser
 https://twitter.com/mind_scratch
 https://twitter.com/craig_links





Re: Resource allocation module

2015-03-18 Thread Alex Rukletsov
Hi Gidon,

and thanks for your interest. As you have already noticed, the work is
currently in progress and should land in the master branch in around two
weeks. It will also be part of the 0.23 release. There is no documentation so
far, but we plan to document the API once the patches land. Right now you may
want to look at the Allocator interface and check the DRF implementation for
more details.

—Alex

On Wed, Mar 18, 2015 at 7:05 AM, Gidon Gershinsky gi...@il.ibm.com wrote:

 We need to develop a new resource allocation module, replacing the
 off-the-shelf DRF.
 As I understand, the current mechanism
 http://mesos.apache.org/documentation/latest/allocation-module/

 is being replaced with a less intrusive module architecture,
 https://issues.apache.org/jira/browse/MESOS-2160

 The capabilities of the new mechanism have real advantages for us.
 However, it is not clear when it will be released. The jira has an 'in
 progress' status. What is the current target / horizon for making this
 available to the users? Also, is there any documentation on the SPIs /
 technical interfaces of these modules (what info is passed from slaves,
 frameworks, offers; what calls can be made by the modules; etc)?

 Regards, Gidon


Re: Deploying containers to every mesos slave node

2015-03-12 Thread Alex Rukletsov
No, this won't make it into 0.22.

On Thu, Mar 12, 2015 at 10:28 AM, Gurvinder Singh 
gurvinder.si...@uninett.no wrote:

 On 03/12/2015 02:00 PM, Tim St Clair wrote:
  You may want to also view
  - https://issues.apache.org/jira/browse/MESOS-1806
 
  as folks have discussed straight up consul integration on that JIRA.
 Any plans to resolve this JIRA for the upcoming 0.22 release?

 - Gurvinder
 
  
 
  *From: *Aaron Carey aca...@ilm.com
  *To: *user@mesos.apache.org
  *Sent: *Thursday, March 12, 2015 3:54:52 AM
  *Subject: *Deploying containers to every mesos slave node
 
  Hi All,
 
  In setting up our cluster, we require things like consul to be
  running on all of our nodes. I was just wondering if there was any
  sort of best practice (or a scheduler perhaps) that people could
  share for this sort of thing?
 
  Currently the approach is to use salt to provision each node and add
  consul/mesos slave process and so on to it, but it'd be nice to
  remove the dependency on salt.
 
  Thanks,
  Aaron
 
 
 
 
  --
  Cheers,
  Timothy St. Clair
  Red Hat Inc.




Re: Deploying containers to every mesos slave node

2015-03-12 Thread Alex Rukletsov
You don't even need to create a custom framework: you can run a separate
instance of Marathon for a dedicated role.

On Thu, Mar 12, 2015 at 10:58 AM, Brian Devins badev...@gmail.com wrote:

 This was actually going to be my suggestion. You could create a custom
 framework/scheduler to handle these types of tasks and configure mesos to
 give priority to this framework using roles and weights.

 On Thu, Mar 12, 2015 at 1:38 PM, Konrad Scherer 
 konrad.sche...@windriver.com wrote:

 On 03/12/2015 04:54 AM, Aaron Carey wrote:

 Hi All,

 In setting up our cluster, we require things like consul to be running
 on all of
 our nodes. I was just wondering if there was any sort of best practice
 (or a
 scheduler perhaps) that people could share for this sort of thing?


 I am in a similar situation. I want to start a single source cache
 (over 200GB) data container on each of my builder nodes. I had the idea of
 creating a custom resource on each slave and creating a scheduler to handle
 this resource only. Has anyone tried this? The only problem I can see is
 that there is no way to prevent another scheduler from taking the offered
 custom resource, but since it is custom it seems unlikely.

 I would love to use Marathon for this, but looks like Marathon does not
 support custom resources and the issue[1] is in the backlog. Perhaps when
 Mesos and Marathon get dynamic resources[2] support?

 [1]: https://github.com/mesosphere/marathon/issues/375
 [2]: https://issues.apache.org/jira/browse/MESOS-2018

 --
 Konrad Scherer, MTS, Linux Products Group, Wind River





Re: Question on Monitoring a Mesos Cluster

2015-03-11 Thread Alex Rukletsov
The master/cpus_percent metric is nothing more than used / total. It
represents resources allocated to tasks, but tasks may not use them
fully (or may use more if isolation is not enabled). You can't get
actual cluster utilisation from it; the best option is to aggregate the
system/* metrics, which report the node load. This, however, includes
all the processes running on a node, not only Mesos and its tasks. Hope
this helps.
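
For illustration, a minimal Go sketch that polls the master's /metrics/snapshot
endpoint and derives the same allocation percentages from the used/total pairs
(the master host and the lack of error handling are assumptions, not taken from
this thread):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    // Poll the leading master's metrics endpoint (host/port are assumptions).
    resp, err := http.Get("http://mesos-master.example.com:5050/metrics/snapshot")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The snapshot is a flat JSON map of metric name -> numeric value.
    var metrics map[string]float64
    if err := json.NewDecoder(resp.Body).Decode(&metrics); err != nil {
        panic(err)
    }

    // Allocation, not utilisation: what has been offered to and taken by frameworks.
    for _, res := range []string{"cpus", "mem", "disk"} {
        used := metrics["master/"+res+"_used"]
        total := metrics["master/"+res+"_total"]
        if total > 0 {
            fmt.Printf("%s allocated: %.1f%%\n", res, 100*used/total)
        }
    }
}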


On Mon, Mar 9, 2015 at 8:16 AM, Andras Kerekes 
andras.kere...@ishisystems.com wrote:

 We use the same monitoring script from rayrod2030. However, instead of
 master_cpus_percent, we use master_cpus_used and master_cpus_total to
 calculate a percentage. This gives the allocated percentage of CPUs in
 the cluster; the actual utilization is measured by collectd.

 -Original Message-
 From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick
 Davies
 Sent: Saturday, March 07, 2015 2:15 PM
 To: user@mesos.apache.org
 Subject: Re: Question on Monitoring a Mesos Cluster

 Yeah, that confused me too - I think that figure is specific to the
 master/slave polled (and that'll just be the active one since you're only
 reporting when master/elected is true).

 I'm using this one https://github.com/rayrod2030/collectd-mesos  , not
 sure if
 that's the same as yours?


 On 7 March 2015 at 18:56, Jeff Schroeder jeffschroe...@computer.org
 wrote:
  Responses inline
 
  On Sat, Mar 7, 2015 at 12:48 PM, CCAAT cc...@tampabay.rr.com wrote:
 
  ... snip ...
 
  After getting everything working, I built a few dashboards, one of
  which displays these stats from http://master:5051/metrics/snapshot:
 
  master/disk_percent
  master/cpus_percent
  master/mem_percent
 
  I had assumed that this was something like aggregate cluster
  utilization, but this seems incorrect in practice. I have a small
  cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had
  a dozen or so small tasks running, and launched 500 tasks with 1G of
  memory and 1 CPU each.
 
  Now I'd expect to see the disk/cpu/mem percentage metrics above go up
  considerably. I did notice that cpus_percent went to around 0.94.
 
  What is the correct way to measure overall cluster utilization for
  capacity planning? We can have the NOC watch this and simply add
  more hardware when the number starts getting low.
 
 
  Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
  development group has more accurate information if not some vague
  roadmap on resource/process monitoring. Sooner or later, this is
  going to become a quintessential need; so I hope the deep thinkers
  are all over this need both in the user and dev groups.
 
  In fact, the monitoring itself can easily create a significant load on the
  cluster/cloud if one is not judicious in how it is architected,
  implemented, and dynamically tuned.
 
 
 
 
  Monitoring via passive metrics gathering and application telemetry
  is one of the best ways to do it. That is how I've implemented things
 
 
 
  The beauty of the rest api is that it isn't heavyweight, and every
  master has it on port 5050 (by default) and every slave has it on port
  5051 (by default). Since I'm throwing this all into graphite (well
  technically cassandra fronted by cyanite fronted by graphite-api...
  but same difference), I found a reasonable way to do capacity
  planning. Collectd will poll the master/slave on each mesos host every
  10 seconds (localhost:5050 on masters and localhost:5051 on slaves).
  This gets put into graphite via collectd's write_graphite plugin.
  These 3 graphite targets give me percentages of utilization for nice
 graphs:
 
  alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used,
  collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
  alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used,
  collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
  alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used,
  collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")
 
  With that data, you can have your monitoring tools such as
  nagios/icinga poll graphite. Using the native graphite render api, you
 can
  do things like:
 
  * if the cpu usage is over 80% for 24 hours, send a warning event
  * if the cpu usage is over 95% for 6 hours, send a critical event
 
  This allows mostly no-impact monitoring since the monitoring tools are
  hitting graphite.
 
  Anyways, back to the original questions:
 
  How does everyone do proper monitoring and capacity planning for large
  mesos clusters? I expect my cluster to grow beyond what it currently
  is by quite a bit.
 
  --
  Jeff Schroeder
 
  Don't drink and derive, alcohol and analysis don't mix.
  http://www.digitalprognosis.com



Re: Compiling Slave on Windows

2015-03-06 Thread Alex Rukletsov
Hi Alexandre,

sorry for the tardy reply. The Mesos master and slaves (or workers, as per
MESOS-1478) communicate via protobuf messages. Any agent that understands
these messages can be (or pretend to be) a Mesos slave. So the answer to
your question is yes, it is possible to provide an alternative slave
implementation. The question is what such a slave will do with the tasks it
gets from the master after successfully registering. But since a simplified
version suffices for you, you can start such a slave with custom resources,
say win-cpus:4;win-mem:1024. Since there will be no overlap in resources
between standard slaves and custom ones, you will have two independent
subclusters within your cluster, with your win tasks sent only to the win
subcluster. Does that make sense?
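
To sketch the subcluster idea with the Go bindings (the package paths, the
winScheduler type and the win-cpus resource name are illustrative assumptions,
not an actual implementation): a scheduler for the win subcluster would only
accept offers that advertise the custom resource and decline everything else.

package winsched

import (
    mesos "github.com/mesos/mesos-go/mesosproto"
    sched "github.com/mesos/mesos-go/scheduler"
)

// winScheduler is a hypothetical scheduler serving only the "win" subcluster.
// (All other Scheduler callbacks are omitted from this sketch.)
type winScheduler struct{}

// ResourceOffers accepts only offers that advertise the custom win-cpus
// resource and declines everything else, so win tasks never land on the
// standard slaves.
func (s *winScheduler) ResourceOffers(driver sched.SchedulerDriver, offers []*mesos.Offer) {
    for _, offer := range offers {
        winCpus := 0.0
        for _, res := range offer.GetResources() {
            if res.GetName() == "win-cpus" {
                winCpus += res.GetScalar().GetValue()
            }
        }
        if winCpus == 0 {
            // Not a win slave: hand the offer back to the master.
            driver.DeclineOffer(offer.Id, &mesos.Filters{})
            continue
        }
        // ... build TaskInfos for the Windows work and call driver.LaunchTasks here.
    }
}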

In reality, implementing (and maintaining) such a slave is a lot of work.
Anyway, I would be happy to see this effort and to help out with it if you
decide to work on it.

Alex


Re: Weird behavior when stopping the mesos master leader of a HA mesos cluster

2015-03-06 Thread Alex Rukletsov
Geoffroy,

could you please provide master logs (both from the killed master and the
one taking over)?

On Fri, Mar 6, 2015 at 4:26 AM, Geoffroy Jabouley 
geoffroy.jabou...@gmail.com wrote:

 Hello

 we are facing some unexpected issues when testing the high availability
 behavior of our mesos cluster.

 *Our use case:*

 *State*: the mesos cluster is up (3 machines), 1 docker task is running
 on each slave (started from marathon)

 *Action*: stop the mesos master leader process

 *Expected*: mesos master leader has changed, *active tasks remain
 unchanged*

 *Seen*: mesos master leader has changed, *all active tasks are now FAILED
 but docker containers are still running*, marathon detects FAILED tasks
 and starts new tasks. We end up with 2 docker containers running on each
 machine, but only one is linked to a RUNNING mesos task.


 Is the seen behavior correct?

 Have we misunderstood the high availability concept? We thought that doing
 this use case would not have any impact on the current cluster state
 (except leader re-election)

 Thanks in advance for your help
 Regards

 ---

 our setup is the following:
 3 identical mesos nodes with:
 + zookeeper
 + docker 1.5
 + mesos master 0.21.1 configured in HA mode
 + mesos slave 0.21.1 configured with checkpointing, strict and
 reconnect
 + marathon 0.8.0 configured in HA mode with checkpointing

 ---

 Command lines:


 *mesos-master*
 /usr/sbin/mesos-master --zk=zk://10.195.30.19:2181,
 10.195.30.20:2181,10.195.30.21:2181/mesos --port=5050
 --cluster=ECP_FFaaS_Cluster --hostname=10.195.30.19 --ip=10.195.30.19
 --quorum=2 --slave_reregister_timeout=1days --work_dir=/var/lib/mesos

 *mesos-slave*
 /usr/sbin/mesos-slave --master=zk://10.195.30.19:2181,10.195.30.20:2181,
 10.195.30.21:2181/mesos --checkpoint --containerizers=docker,mesos
 --executor_registration_timeout=5mins --hostname=10.195.30.19
 --ip=10.195.30.19 --isolation=cgroups/cpu,cgroups/mem --recover=reconnect
 --recovery_timeout=120mins --strict --resources=ports:[31000-32000,80,443]

 *marathon*
 java -Djava.library.path=/usr/local/lib:/usr/lib:/usr/lib64
 -Djava.util.logging.SimpleFormatter.format=%2$s%5$s%6$s%n -Xmx512m -cp
 /usr/bin/marathon mesosphere.marathon.Main --local_port_max 32000
 --local_port_min 31000 --task_launch_timeout 30 --http_port 8080
 --hostname 10.195.30.19 --event_subscriber http_callback --ha --https_port
 8443 --checkpoint --zk zk://10.195.30.19:2181,10.195.30.20:2181,
 10.195.30.21:2181/marathon --master zk://10.195.30.19:2181,
 10.195.30.20:2181,10.195.30.21:2181/mesos



Re: multiple frameworks or one big one

2015-03-04 Thread Alex Rukletsov
Not exactly an Enterprise-oriented benchmark, but it may give some insight.
https://mesos.apache.org/documentation/latest/powered-by-mesos/
https://www.youtube.com/watch?v=LQxnuPcRl4st=1m31s

On Wed, Mar 4, 2015 at 2:18 AM, Diego Medina di...@fmpwizard.com wrote:


 Well, I deeply think that, in addition to the architecture and
 organisational concerns, Mesos needs to provide some Enterprise-oriented
 benchmarks and proof to really be ready for prime time in the enterprise
 world, and not only in startup-style companies. But that's not the topic of
 your post, and I'll start my own thread on this specific topic.


 Looking forward to that discussion.



 Anyway, thank you very much for your answers.

 Regarding your choice of Golang instead of Scala because of some pain
 points, could you send me some examples (apart from compile time)? Even in
 private, if you do not want to sidetrack the thread, as I'm really torn
 between the two.



 I'll send you a separate private message with the reply. I don't mind
 talking about it, but I wouldn't want to distract this mailing list with
 the topic.

 Thanks

 Diego



 2015-03-03 14:26 GMT+01:00 Diego Medina di...@fmpwizard.com:

 Hi Alex,

 On Tue, Mar 3, 2015 at 7:37 AM, Alex Rukletsov a...@mesosphere.io
 wrote:

 The next big thing would be to handle task state updates. Instead of
 dying on TASK_LOST, you may want to retry such a task several times.


 Yes, this is definitely something I need to address. For now I use it to
 help me find bugs in the code: if the app stops, I know I did something
 wrong :)
 I also need to find out why some tasks stay in the Staging status on the
 Mesos UI, but I'll start a separate thread for it.

 Thanks

 Diego





 On Tue, Mar 3, 2015 at 10:38 AM, Billy Bones gael.ther...@gmail.com
 wrote:

 Oh, and you've got a glitch in one of your executor names in your first
 code block.

 You've got:

 extractorExe := mesos.ExecutorInfo{
   ExecutorId: util.NewExecutorID("owl-cralwer-extractor"),
   Name:       proto.String("OwlCralwer Fetcher"),
   Source:     proto.String("owl-cralwer"),
   Command: &mesos.CommandInfo{
     Value: proto.String(extractorExecutorCommand),
     Uris:  executorUris,
   },
 }

 It should rather be:

 extractorExe := mesos.ExecutorInfo{
   ExecutorId: util.NewExecutorID("owl-cralwer-extractor"),
   Name:       proto.String("OwlCralwer Extractor"),
   Source:     proto.String("owl-cralwer"),
   Command: &mesos.CommandInfo{
     Value: proto.String(extractorExecutorCommand),
     Uris:  executorUris,
   },
 }



 2015-03-03 10:28 GMT+01:00 Billy Bones gael.ther...@gmail.com:

 Hi Diego, do you already plan to benchmark your results on the Mesos
 platform vs. bare-metal servers?
 It would be really interesting for Enterprise evangelism; they love
 benchmarks and metrics.

 I'm impressed by your project and how fast it is moving. I'm myself a fan
 of Golang, but why did you choose it?

 2015-03-03 3:03 GMT+01:00 Diego Medina di...@fmpwizard.com:

 Hi everyone, based on all the great feedback I got here, I updated the
 code and now I have one scheduler and two executors: one that fetches the
 html and a second one that extracts links and text from the html.
 I also run the actual work in its own goroutine per task (goroutines are
 like threads, for those not familiar with Go) and it's working great.

 I wrote about the changes here

 http://blog.fmpwizard.com/blog/owlcrawler-multiple-executors-using-meso
 and you can find the updated code here
 https://github.com/fmpwizard/owlcrawler

 Again, thanks everyone for your input.

 Diego




 On Fri, Feb 27, 2015 at 1:52 PM, Diego Medina di...@fmpwizard.com
 wrote:

 Thanks for looking at the code and the feedback Alex. I'll be
 working on those changes later tonight!

 Diego

 On Fri, Feb 27, 2015 at 12:15 PM, Alex Rukletsov 
 a...@mesosphere.io wrote:

 Diego,

 I've checked your code, nice effort! Great to see people hacking
 with mesos and go bindings!

 One thing though. You do the actual job in the launchTask() of
 your executor. This prevents you from running multiple tasks in parallel
 on one executor. That means you can't have more simultaneous tasks than
 executors in your cluster. You may want to spawn a thread for every
 incoming task and do the job there, while launchTask() does solely the
 task initialization (basically, starting a thread). Check the project John
 referenced: https://github.com/mesosphere/RENDLER.

 Best,
 Alex

 On Fri, Feb 27, 2015 at 11:03 AM, Diego Medina 
 di...@fmpwizard.com wrote:

 Hi Billy,

 comments inline:

 On Fri, Feb 27, 2015 at 4:07 AM, Billy Bones 
 gael.ther...@gmail.com wrote:

 Hi Diego, as a real fan of Golang, kudos and applause for your work on
 this distributed crawler; I hope you'll eventually release it ;-)



 Thanks! My 3-month-old baby is making sure I don't sleep much and
 have plenty of time to work on this project :)


 About your question, the common architecture is to have one
 scheduler and multiple

Re: multiple frameworks or one big one

2015-02-27 Thread Alex Rukletsov
Diego,

I've checked your code, nice effort! Great to see people hacking with mesos
and go bindings!

One thing though. You do the actual job in the launchTask() of your
executor. This prevents you from running multiple tasks in parallel on one
executor. That means you can't have more simultaneous tasks than executors
in your cluster. You may want to spawn a thread for every incoming task and
do the job there, while launchTask() does solely the task initialization
(basically, starting a thread). Check the project John referenced:
https://github.com/mesosphere/RENDLER.
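
A minimal sketch of this pattern with the mesos-go bindings (the package and
executor names are assumptions, and error handling is omitted):

package crawlexec

import (
    exec "github.com/mesos/mesos-go/executor"
    mesos "github.com/mesos/mesos-go/mesosproto"
)

// crawlExecutor is a placeholder type; the other Executor callbacks are omitted.
type crawlExecutor struct{}

// LaunchTask only acknowledges the task and hands the real work to a
// goroutine, so a single executor can run many tasks in parallel.
func (e *crawlExecutor) LaunchTask(driver exec.ExecutorDriver, task *mesos.TaskInfo) {
    driver.SendStatusUpdate(&mesos.TaskStatus{
        TaskId: task.GetTaskId(),
        State:  mesos.TaskState_TASK_RUNNING.Enum(),
    })

    go func() {
        // ... do the actual fetching/extracting for this task here ...

        driver.SendStatusUpdate(&mesos.TaskStatus{
            TaskId: task.GetTaskId(),
            State:  mesos.TaskState_TASK_FINISHED.Enum(),
        })
    }()
}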

Best,
Alex

On Fri, Feb 27, 2015 at 11:03 AM, Diego Medina di...@fmpwizard.com wrote:

 Hi Billy,

 comments inline:

 On Fri, Feb 27, 2015 at 4:07 AM, Billy Bones gael.ther...@gmail.com
 wrote:

 Hi Diego, as a real fan of Golang, kudos and applause for your work on
 this distributed crawler; I hope you'll eventually release it ;-)



 Thanks! My 3-month-old baby is making sure I don't sleep much and have
 plenty of time to work on this project :)


 About your question: the common architecture is to have one scheduler and
 multiple executors rather than one big executor.
 The basic idea of Mesos is to take any resources, pool them together, and
 then swarm tasks onto this pool; so basically the architecture of your
 application should share this philosophy. Explode / decouple your
 application as much as possible, but be careful not to lock yourself up on
 threads and tasks if they're interdependent.

 I don't know if I'm explaining myself correctly so do not hesitate if you
 need more clarification.



 Your answer was very clear. Today I started to split the executor into
 two: one that simply fetches the html and a second one that extracts
 text without tags from it; this second executor gets the data from a
 database. So far it seems like a natural way to split the tasks. I was
 going with the idea of also having two schedulers, but I think I just
 figured out how to use just one.

 Thanks!

 Diego




 2015-02-26 21:50 GMT+01:00 Diego Medina di...@fmpwizard.com:

 @John: thanks for the link. I see that RENDLER uses the ExecutorId from
 ExecutorInfo to decide what to do; I'll give this a try.
 @Craig: you are right. After I sent the email I continued to read more
 of the mesos docs and saw that I used the wrong term, where I meant
 scheduler instead of framework. Thanks.

 Thanks and looking forward to any other feedback you may all have.

 Diego


 On Thu, Feb 26, 2015 at 5:24 AM, craig w codecr...@gmail.com wrote:

 Diego,

 I'm also interested in hearing feedback on your question. One minor
 thing I'd point out is that a Framework is made up of a Scheduler and
 Executor(s), so I think it's more correct to say you've created a Scheduler
 (instead of one big framework) and an Executor.

 Anyhow, for what it's worth, the Aurora framework has multiple
 executors (
 https://github.com/apache/incubator-aurora/blob/master/examples/vagrant/aurorabuild.sh#L61).
 You might pop into the #aurora IRC chat room and ask, usually a few Aurora
 contributors are in there answering questions when they can.

 On Wed, Feb 25, 2015 at 9:02 PM, John Pampuch j...@mesosphere.io
 wrote:

 Diego-

 You might want to look at this project for some insights:

 https://github.com/mesosphere/RENDLER


 -John


 On Wed, Feb 25, 2015 at 5:27 PM, Diego Medina di...@fmpwizard.com
 wrote:

 Hi,


 Short: Is it better to have one big framework and executor with if
 statements to select what to do, or several smaller framework/executor
 pairs, when writing a Mesos app?

 Longer question:

 Last week I started a side project based on mesos (using Go),

 http://blog.fmpwizard.com/blog/web-crawler-using-mesos-and-golang
 https://github.com/fmpwizard/owlcrawler

 It's a web crawler written on top of Mesos. The very first version of it
 had a framework that sent a task to an executor, and that single executor
 would fetch the page, extract links from the html and then send them to a
 message queue.

 Then the framework reads that queue and starts again, runs the executor,
 etc., etc.

 Now I'm splitting fetching the html and extracting links into two
 separate tasks, and putting those two tasks in the same executor doesn't
 feel right, so I'm thinking that I need at least two different executors
 and one framework. But then I wonder if people more experienced with mesos
 would normally write several framework/executor pairs to keep the design
 cleaner.

 In this particular case, I can see the project growing into even more
 tasks that can be decoupled.

 Any feedback on the design would be great and let me know if I should
 explain this better.

 Thanks


 Diego





 --
 Diego Medina
 Lift/Scala consultant
 di...@fmpwizard.com
 http://fmpwizard.telegr.am





 --

 https://github.com/mindscratch
 https://www.google.com/+CraigWickesser
 https://twitter.com/mind_scratch
 https://twitter.com/craig_links




 --
 Diego Medina
 Lift/Scala consultant
 

Re: Mesos-slave start error

2015-02-05 Thread Alex Rukletsov
Hi Siva,

it looks like you bumped into
https://issues.apache.org/jira/browse/MESOS-2276. Feel free to upvote!

On Thu, Feb 5, 2015 at 1:56 PM, Sivaram Kannan sivara...@gmail.com wrote:


 Hi,

 In our deployments of mesos-slave, we are getting the following error
 during start up. I understand the slave is failing due to a large number of
 fds being opened. I have increased the ulimit on fds to 4096 from 1024,
 but still see the same behavior. What can I do to solve this problem, and
 what should I do to prevent it?

 Thanks,
 ./Siva.


 Initiating client connection, host=11.0.190.1:2181 sessionTimeout=1
 watcher=0x7f6de4
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076289 15
 slave.cpp:169] Slave started on 1)@11.1.6.1:5051
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076544 15
 slave.cpp:289] Slave resources: cpus(*):24; mem(*):47336; disk(*):469416;
 ports(*):[31000-32000]
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076575 15
 slave.cpp:318] Slave hostname: 11.1.6.1
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076582 15
 slave.cpp:319] Slave checkpoint: true
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078135 25
 state.cpp:33] Recovering state from '/var/lib/mesos/slave/meta'
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078233 20
 status_update_manager.cpp:197] Recovering status update manager
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078333 20
 docker.cpp:767] Recovering Docker containers
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: 2015-02-05
 12:33:58,102:6(0x7f6dc3fff700):ZOO_INFO@check_events@1703: initiated
 connection to server [11.0.190.1:2181]
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: 2015-02-05
 12:33:58,104:6(0x7f6dc3fff700):ZOO_INFO@check_events@1750: session
 establishment complete on server [11.0.190.1:2181],
 sessionId=0x14b3c82555299c7,
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104671 30
 group.cpp:313] Group process (group(1)@11.1.6.1:5051) connected to
 ZooKeeper
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104708 30
 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas)
 = (0, 0, 0)
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104725 30
 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.106376 22
 detector.cpp:138] Detected a new leader: (id='3')
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.106477 25
 group.cpp:659] Trying to get '/mesos/info_03' in ZooKeeper
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.107293 30
 detector.cpp:433] A new leading master (UPID=master@11.1.4.1:5050) is
 detected
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Failed to perform recovery:
 Collect failed: Failed to create pipe: Too many open files
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: To remedy this do as follows:
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Step 1: rm -f
 /var/lib/mesos/slave/meta/slaves/latest
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: This ensures slave doesn't
 recover old live executors.
 Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Step 2: Restart the slave.
 Feb 05 12:33:58 node-d4856455ad5c systemd[1]: mesos-slave.service: main
 process exited, code=exited, status=1/FAILURE
 Feb 05 12:33:58 node-d4856455ad5c docker[3351]: mesos_slave
 Feb 05 12:33:58 node-d4856455ad5c systemd[1]: Unit mesos-slave.service
 entered failed state.





Re: Unable to follow Sandbox links from Mesos UI.

2015-01-27 Thread Alex Rukletsov
Let's make sure instead of assuming. Could you please add this line:
console.log('Fetch url:', url);
between lines 17 and 18, click the link, copy the output from Firebug or
Chrome dev console and paste it here together with the link corresponding
to the download button?

Thanks,
Alex

On Tue, Jan 27, 2015 at 9:19 PM, Dan Dong dongda...@gmail.com wrote:

 Hi, All,
   Checked again: when I hover the cursor over stdout/stderr, the links
 point to the IP address of the master node, so that's why nothing shows up
 when clicking on them. The Download button beside them, however, points
 correctly to the IP address of the slave node, so I can download the files
 without problem. Seems like a config problem somewhere? A bit lost here.

 So it seems the host in the pailer function is resolved to the master
 instead of the slave node:

  14   // Invokes the pailer for the specified host and path using the
  15   // specified window_title.
  16   function pailer(host, path, window_title) {
  17 var url = 'http://' + host + '/files/read.json?path=' + path;
  18 var pailer =
  19   window.open('/static/pailer.html', url, 'width=580px,
 height=700px');
  20
  21 // Need to use window.onload instead of document.ready to make
  22 // sure the title doesn't get overwritten.
  23 pailer.onload = function() {
  24   pailer.document.title = window_title + ' (' + host + ')';
  25 };
  26   }
  27


 Cheers,
 Dan


 2015-01-27 2:51 GMT-06:00 Alex Rukletsov a...@mesosphere.io:

 Dan,

 the link to the sandbox on a slave is prepared in the JS. As Geoffroy
 suggests, could you please check that the JS code works correctly and
 the url is constructed normally (see controllers.js:16)? If
 everything's fine on your side, could you please file a JIRA for this
 issue?

 On Tue, Jan 27, 2015 at 8:21 AM, Geoffroy Jabouley
 geoffroy.jabou...@gmail.com wrote:
  Hello
 
  just in case, which internet browser are you using?
 
  Have you installed any extensions (NoScript, Ghostery, ...) that could
  prevent the /static/pailer display?
 
  I personally use NoScript with Firefox, and I have to turn it off on all
  IPs of our cluster to correctly access slave information from the Mesos
  UI.
 
  My 2 cents
  Regards
 
  2015-01-26 21:08 GMT+01:00 Suijian Zhou suijian.z...@ige-project.eu:
 
  Hi, Alex,
Yes, I can see the link points to the slave machine when I hover on
 the
  Download button and stdout/stderr can be downloaded. So do you mean
 it is
  expected/designed that clicking on 'stdout/stderr' themselves will not
 show
  you anything? Thanks!
 
  Cheers,
  Dan
 
 
  2015-01-26 7:44 GMT-06:00 Alex Rukletsov a...@mesosphere.io:
 
  Dan,
 
  that's correct. The 'static/pailer.html' is a page that lives on the
  master and it gets a url to the actual slave as a parameter. The url
  is computed in 'controllers.js' based on where the associated executor
  lives. You should see this 'actual' url if you hover over the Download
  button. Please check this url for correctness and that you can access
  it from your browser.
 
  On Fri, Jan 23, 2015 at 9:24 PM, Dan Dong dongda...@gmail.com
 wrote:
   I see the problem: when I move the cursor onto the link, e.g. stderr,
   it actually points to the IP address of the master machine, so it tries
   to follow links of Master_IP:/tmp/mesos/slaves/...
   which is not there. So why does the link not point to the IP address of
   the slaves (config problem somewhere)?
  
   Cheers,
   Dan
  
  
   2015-01-23 11:25 GMT-06:00 Dick Davies d...@hellooperator.net:
  
   Start with 'inspect element' in the browser and see if that gives any
   clues.
   Sounds like your network is a little strict, so it may be that something
   else needs opening up.
  
   On 23 January 2015 at 16:56, Dan Dong dongda...@gmail.com wrote:
Hi, Alex,
  That is what expected, but when I click on it, it pops a new
 blank
window(pailer.html) without the content of the file(9KB size).
 Any
hints?
   
Cheers,
Dan
   
   
2015-01-23 4:37 GMT-06:00 Alex Rukletsov a...@mesosphere.io:
   
Dan,
   
you should be able to view file contents just by clicking on the
link.
   
On Thu, Jan 22, 2015 at 9:57 PM, Dan Dong dongda...@gmail.com
wrote:
   
Yes, --hostname solves the problem. Now I can see all files
 there
like
stdout, stderr etc, but when I click on e.g stdout, it pops a
 new
blank
window(pailer.html) without the content of the file(9KB size).
Although it
provides a Download link beside, it would be much more
convenient if
one
can view the stdout and stderr directly. Is this normal or
 there
is
still
problem on my envs? Thanks!
   
Cheers,
Dan
   
   
2015-01-22 11:33 GMT-06:00 Adam Bordelon a...@mesosphere.io:
   
Try the --hostname parameters for master/slave. If you want
 to be
extra
explicit about the IP (e.g. publish the public IP instead of
 the
private one
in a cloud environment), you

Re: Trying to debug an issue in mesos task tracking

2015-01-26 Thread Alex Rukletsov
Itamar,

you are right, the Mesos executor and containerizer cannot distinguish
between busy and stuck processes. However, since you use your own
custom executor, you may want to implement a sort of health check. It
depends on what your task processes are doing.

There are hundreds of reasons why an OS process may get stuck; it
doesn't look like it's Mesos-related in this case.

On Sat, Jan 24, 2015 at 9:17 PM, Itamar Ostricher ita...@yowza3d.com wrote:
 Alex, Sharma, thanks for your input!

 Trying to recreate the issue with a small cluster for the last few days, I
 was not able to observe a scenario in which I could be sure that my executor
 sent the TASK_FINISHED update but the scheduler did not receive it.
 I did observe multiple times a scenario where a task seemed to be stuck in
 the TASK_RUNNING state, but when I SSH'ed into the slave that has the task, I
 always saw that the process related to that task was still running (by
 grepping `ps aux`). Most of the time, it seemed that the process did the
 work (judging by the logs produced by the PID), but for some reason it was
 stuck without exiting cleanly. Sometimes it seemed that the process
 didn't do any work (an empty log file for the PID). Every time, as soon as I
 killed the PID, a TASK_FAILED update was sent and received successfully.

 So, it seems that the problem is in processes spawned by my executor, but I
 don't fully understand why this happens.
 Any ideas why a process would do some work (either 1% (just creating a log
 file) or 99% (doing everything but not exiting)) and get stuck?

 On Fri, Jan 23, 2015 at 1:01 PM, Alex Rukletsov a...@mesosphere.io wrote:

 Itamar,

 beyond checking the master and slave logs, could you please verify that
 your executor does send the TASK_FINISHED update? You may want to add some
 logging and then check the executor log. Mesos guarantees the delivery of
 status updates, so I suspect the problem is on the executor's side.

 On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila spod...@netflix.com
 wrote:
  Have you checked the mesos-slave and mesos-master logs for that task id?
  There should be logs in there for task state updates, including
  FINISHED.
  There can be specific cases where sometimes the task status is not
  reliably
  sent to your scheduler (due to mesos-master restarts, leader election
  changes, etc.). There is task reconciliation support in Mesos. A
  periodic
  call to reconcile tasks from the scheduler can be helpful. There are
  also
  newer enhancements coming to the task reconciliation. In the mean time,
  there are other strategies such as what I use, which is periodic
  heartbeats
  from my custom executor to my scheduler (out of band). The timeouts for
  task
  runtimes are similar to heartbeats, except, you need a priori knowledge
  of
  all tasks' runtimes.
 
  Task runtime limits are not supported inherently, as far as I know. Your
  executor can implement it, and that may be one simple way to do it. That
  could also be a good way to implement shell's rlimit*, in general.
 
 
 
  On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher ita...@yowza3d.com
  wrote:
 
  I'm using a custom internal framework, loosely based on MesosSubmit.
  The phenomenon I'm seeing is something like this:
  1. Task X is assigned to slave S.
  2. I know this task should run for ~10minutes.
  3. On the master dashboard, I see that task X is in the Running state
  for several *hours*.
  4. I SSH into slave S, and see that task X is *not* running. According
  to
  the local logs on that slave, task X finished a long time ago, and
  seemed to
  finish OK.
  5. According to the scheduler logs, it never got any update from task X
  after the Staging-Running update.
 
  The phenomenon occurs pretty often, but it's not consistent or
  deterministic.
 
  I'd appreciate your input on how to go about debugging it, and/or
  implement a workaround to avoid wasted resources.
 
  I'm pretty sure the executor on the slave sends the TASK_FINISHED
  status
  update (how can I verify that beyond my own logging?).
  I'm pretty sure the scheduler never receives that update (again, how
  can I
  verify that beyond my own logging?).
  I have no idea if the master got the update and passed it through (how
  can
  I check that?).
  My scheduler and executor are written in Python.
 
  As for a workaround - setting a timeout on a task should do the trick.
  I
  did not see any timeout field in the TaskInfo message. Does mesos
  support
  the concept of per-task timeouts? Or should I implement my own task
  tracking
  and timeout mechanism in the scheduler?
 
 




Re: Unable to follow Sandbox links from Mesos UI.

2015-01-23 Thread Alex Rukletsov
Dan,

you should be able to view file contents just by clicking on the link.

On Thu, Jan 22, 2015 at 9:57 PM, Dan Dong dongda...@gmail.com wrote:

 Yes, --hostname solves the problem. Now I can see all files there like
 stdout, stderr etc, but when I click on e.g stdout, it pops a new blank
 window(pailer.html) without the content of the file(9KB size). Although it
 provides a Download link beside, it would be much more convenient if one
 can view the stdout and stderr directly. Is this normal or there is still
 problem on my envs? Thanks!

 Cheers,
 Dan


 2015-01-22 11:33 GMT-06:00 Adam Bordelon a...@mesosphere.io:

 Try the --hostname parameters for master/slave. If you want to be extra
 explicit about the IP (e.g. publish the public IP instead of the private
 one in a cloud environment), you can also set the --ip parameter on
 master/slave.

 On Thu, Jan 22, 2015 at 8:43 AM, Dan Dong dongda...@gmail.com wrote:

 Thanks Ryan, yes, from the machine where the browser is on slave
 hostnames could not be resolved, so that's why failure, but it can reach
 them by IP address( I don't think sys admin would like to add those VMs
 entries to /etc/hosts on the server).  I tried to change masters and slaves
 of mesos to IP addresses instead of hostname but UI still points to
 hostnames of slaves. Is threre a way to let mesos only use IP address of
 master and slaves?

 Cheers,
 Dan


 2015-01-22 9:48 GMT-06:00 Ryan Thomas r.n.tho...@gmail.com:

 It is a request from your browser session, not from the master that is
 going to the slaves - so in order to view the sandbox you need to ensure
 that the machine your browser is on can resolve and route to the masters
 _and_ the slaves.

 The master doesn't proxy the sandbox requests through itself (yet) -
 they are made directly from your browser instance to the slaves.

 Make sure you can resolve the slaves from the machine you're browsing
 the UI on.

 Cheers,

 ryan

 On 22 January 2015 at 15:42, Dan Dong dongda...@gmail.com wrote:

 Thank you all, the master and slaves can resolve each others' hostname
 and ssh login without password, firewalls have been switched off on all 
 the
 machines too.
 So I'm confused what will block such a pull of info of slaves from UI?

 Cheers,
 Dan


 2015-01-21 16:35 GMT-06:00 Cody Maloney c...@mesosphere.io:

 Also see https://issues.apache.org/jira/browse/MESOS-2129 if you want
 to track progress on changing this.

 Unfortunately it is on hold for me at the moment to fix.

 Cody

 On Wed, Jan 21, 2015 at 2:07 PM, Ryan Thomas r.n.tho...@gmail.com
 wrote:

 Hey Dan,

 The UI will attempt to pull that info directly from the slave so you
 need to make sure the host is resolvable  and routeable from your 
 browser.

 Cheers,

 Ryan

 From my phone


 On Wednesday, 21 January 2015, Dan Dong dongda...@gmail.com wrote:

 Hi, All,
  When I try to access sandbox  on mesos UI, I see the following info( 
 The
  same error appears on every slave sandbox.):

  Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0'
  on 'centos-2.local:5051'.

  Potential reasons:
  The slave's hostname, 'centos-2.local', is not accessible from your
 network  The slave's port, '5051', is not accessible from your network


  I checked that:
  slave centos-2.local can be login from any machine in the cluster 
 without
  password by ssh centos-2.local ;

  port 5051 on slave centos-2.local could be connected from master by
  telnet centos-2.local 5051
 The stdout and stderr are there on each slave's /tmp/mesos/..., but 
 seems mesos UI just could not access it.
 (and Both master and slaves are on the same network IP ranges).  
 Should I open any port on slaves? Any hint what's the problem here?

  Cheers,
  Dan










Re: Trying to debug an issue in mesos task tracking

2015-01-23 Thread Alex Rukletsov
Itamar,

beyond checking the master and slave logs, could you please verify that your
executor does send the TASK_FINISHED update? You may want to add some
logging and then check the executor log. Mesos guarantees the delivery of
status updates, so I suspect the problem is on the executor's side.

On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila spod...@netflix.com wrote:
 Have you checked the mesos-slave and mesos-master logs for that task id?
 There should be logs in there for task state updates, including FINISHED.
 There can be specific cases where sometimes the task status is not reliably
 sent to your scheduler (due to mesos-master restarts, leader election
 changes, etc.). There is task reconciliation support in Mesos. A periodic
 call to reconcile tasks from the scheduler can be helpful. There are also
 newer enhancements coming to task reconciliation. In the meantime,
 there are other strategies such as what I use, which is periodic heartbeats
 from my custom executor to my scheduler (out of band). The timeouts for task
 runtimes are similar to heartbeats, except you need a priori knowledge of
 all tasks' runtimes.

 Task runtime limits are not supported inherently, as far as I know. Your
 executor can implement it, and that may be one simple way to do it. That
 could also be a good way to implement shell's rlimit*, in general.
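
A rough sketch of that periodic reconciliation call with the Go bindings (the
helper for collecting active task IDs and the interval are assumptions):

package reconcile

import (
    "time"

    mesos "github.com/mesos/mesos-go/mesosproto"
    sched "github.com/mesos/mesos-go/scheduler"
)

// reconcileLoop periodically asks the master to re-send the latest status
// for every task the scheduler still believes to be active.
func reconcileLoop(driver sched.SchedulerDriver, activeTaskIDs func() []*mesos.TaskID, every time.Duration) {
    for range time.Tick(every) {
        var statuses []*mesos.TaskStatus
        for _, id := range activeTaskIDs() {
            statuses = append(statuses, &mesos.TaskStatus{
                TaskId: id,
                // Best guess; the master replies with the authoritative state.
                State: mesos.TaskState_TASK_RUNNING.Enum(),
            })
        }
        driver.ReconcileTasks(statuses)
    }
}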



 On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher ita...@yowza3d.com
 wrote:

 I'm using a custom internal framework, loosely based on MesosSubmit.
 The phenomenon I'm seeing is something like this:
 1. Task X is assigned to slave S.
 2. I know this task should run for ~10minutes.
 3. On the master dashboard, I see that task X is in the Running state
 for several *hours*.
 4. I SSH into slave S, and see that task X is *not* running. According to
 the local logs on that slave, task X finished a long time ago, and seemed to
 finish OK.
 5. According to the scheduler logs, it never got any update from task X
 after the Staging-Running update.

 The phenomenon occurs pretty often, but it's not consistent or
 deterministic.

 I'd appreciate your input on how to go about debugging it, and/or
 implement a workaround to avoid wasted resources.

 I'm pretty sure the executor on the slave sends the TASK_FINISHED status
 update (how can I verify that beyond my own logging?).
 I'm pretty sure the scheduler never receives that update (again, how can I
 verify that beyond my own logging?).
 I have no idea if the master got the update and passed it through (how can
 I check that?).
 My scheduler and executor are written in Python.

 As for a workaround - setting a timeout on a task should do the trick. I
 did not see any timeout field in the TaskInfo message. Does mesos support
 the concept of per-task timeouts? Or should I implement my own task tracking
 and timeout mechanism in the scheduler?




<    1   2