Re: Marathon can no longer deploy any apps after a failover
Thanks for the guidelines! I'll try these paths out, and join the Marathon mailing list (I was oblivious that there was one ;))

Maciej Strzelecki
Operations Engineer
Tel: +49 30 6098381-50
Fax: +49 851-213728-88
E-mail: mstrzele...@crealytics.de
www.crealytics.com
blog.crealytics.com

crealytics GmbH - Semantic PPC Advertising Technology
Brunngasse 1 - 94032 Passau - Germany
Oranienstraße 185 - 10999 Berlin - Germany
Managing directors: Andreas Reiffen, Christof König, Dr. Markus Kurch
Register court: Amtsgericht Passau, HRB 7466
Geschäftsführer: Andreas Reiffen, Christof König, Daniel Trost
Reg.-Gericht: Amtsgericht Passau, HRB 7466

From: Vinod Kone vinodk...@gmail.com
Sent: Thursday, July 16, 2015 7:09 PM
To: user@mesos.apache.org
Subject: Re: Marathon can no longer deploy any apps after a failover

Sounds like a Marathon issue. Mind asking on Marathon's mailing list?

On Thu, Jul 16, 2015 at 8:02 AM, Nikolay Borodachev nbo...@adobe.com wrote:

Maciej,

I had a similar problem, but it got solved by setting the LIBPROCESS_IP environment variable to the host IP address for the Marathon process.

Nikolay

From: Maciej Strzelecki [mailto:maciej.strzele...@crealytics.com]
Sent: Thursday, July 16, 2015 7:30 AM
To: user@mesos.apache.org
Subject: Marathon can no longer deploy any apps after a failover

Problem: If I restart the current framework leader for Marathon (the host from the active frameworks tab in the Mesos UI), a new one is elected after a moment, but any new deployments are stuck indefinitely in the 'deploying' state (empty black bar, 0/1, and hanging; even with debug-level logging I don't see any errors in the Marathon/Mesos logs). The old tasks are also untouchable at that time: yes, they keep running, but I can't kill, restart, or scale them.
When that happens I can:
- stop Marathon on all masters
- remove the framework via a curl to the Mesos API /shutdown endpoint
- purge /marathon via the ZooKeeper CLI
- restart the Docker services on all slaves (that kills the zombie containers)
- restart the mesos-slave services on all slaves (pampering my paranoia here)

Then I can deploy apps again. How can I avoid this problem? Any basic settings I'm missing? This is scary, as the reboot of a single master (out of 3 or 5 servers) freezes everything that is deployed using Marathon, and the steps to reclaim control introduce downtime for every single app running there.

Configuration:

Running Ubuntu 14.04.2 LTS
mesos 0.22.1-1.0.ubuntu1404
marathon 0.9.0-1.0.381.ubuntu1404
chronos 2.3.4-1.0.81.ubuntu1404

The cluster uses 3 masters and 15 slaves. The master machines also run a mesos-slave process (albeit those machines contribute only a portion of their resources as offers). The mesos/marathon configuration relies mostly on defaults; the options specified are shown below. The quorum is 2. The Marathon service runs on the 3 master machines.

root@mesos-master1 ~ # tree /etc/marathon/
/etc/marathon/
`-- conf
    |-- event_subscriber
    |-- framework_name
    |-- hostname
    |-- logging_level
    `-- zk

1 directory, 5 files

root@mesos-master1 ~ # tree /etc/mesos
/etc/mesos
`-- zk

0 directories, 1 file

root@mesos-master1 ~ # tree /etc/mesos-slave/
/etc/mesos-slave/
|-- containerizers
|-- docker_stop_timeout
|-- executor_registration_timeout
|-- executor_shutdown_grace_period
|-- hostname
|-- ip
|-- logging_level
`-- resources

0 directories, 8 files

root@mesos-master1 ~ # tree /etc/mesos-master
/etc/mesos-master
|-- cluster
|-- hostname
|-- ip
|-- logging_level
|-- quorum
`-- work_dir
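For reference, the manual recovery sequence above can be sketched as a script. Everything here is a hypothetical dry run: the hostnames, the slave list, and the framework-ID placeholder are assumptions, and commands are echoed rather than executed so nothing fires blindly. The /master/shutdown endpoint and the ZooKeeper purge are the calls the thread describes.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps above. Hostnames, the slave list,
# and the framework ID are hypothetical placeholders; swap the echo in
# run() for real execution only after checking each command.
MASTERS="mesos-master1 mesos-master2 mesos-master3"
SLAVES="mesos-slave1 mesos-slave2"
FRAMEWORK_ID="<marathon-framework-id>"   # visible in the master's /master/state.json

run() { echo "WOULD RUN: $*"; }

# 1. stop marathon on all masters
for m in $MASTERS; do run "ssh $m service marathon stop"; done
# 2. shut the framework down via the master's HTTP API
run "curl -X POST -d frameworkId=$FRAMEWORK_ID http://mesos-master1:5050/master/shutdown"
# 3. purge marathon state from ZooKeeper
run "zkCli.sh -server mesos-master1:2181 rmr /marathon"
# 4+5. restart docker and mesos-slave on every slave
for s in $SLAVES; do
  run "ssh $s service docker restart"
  run "ssh $s service mesos-slave restart"
done
```

Scripting it at least keeps the order of the steps fixed, since doing them by hand under pressure is error-prone.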
Can marathon cancel a deployment if the application is sick?
How can I make Marathon cancel a deployment if the app is not starting after several tries? I saw these three settings (with defaults) in the documentation:

backoffSeconds: 1,
backoffFactor: 1.15,
maxLaunchDelaySeconds: 3600,

backoffSeconds, backoffFactor and maxLaunchDelaySeconds configure exponential backoff behavior when launching potentially sick apps. This prevents sandboxes associated with consecutively failing tasks from filling up the hard disk on Mesos slaves. The backoff period is multiplied by the factor for each consecutive failure until it reaches maxLaunchDelaySeconds. This also applies to tasks that are killed due to failing too many health checks.

I would expect to be able to tell Marathon to give up after it has tried a few times. Is there a way? What I have in mind:

backoffSeconds - 5
backoffFactor - high, 100-200ish (so it reaches the maximum delay very quickly, after just a few failures)
maxLaunchDelaySeconds - 600 (to allow for a docker pull to finish and general startup lag)

Root cause: a developer deploys an application with either a code failure (a skipped test) or a Docker image that can't be pulled. If such a task is left in Marathon's retry-deployment loop for some time, the Mesos UI shows thousands of failed tasks. I'd love to see one, maybe two failed start attempts, then back-off.
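To see what the proposed numbers would do, here is a quick sketch of the delay schedule that exponential backoff produces with backoffSeconds=5, backoffFactor=100, maxLaunchDelaySeconds=600 (plain awk; the cap-at-max behavior follows the documentation text quoted above — note that, as far as this thread goes, there is no give-up knob, so this only shows how fast the proposed values hit the ceiling):

```shell
# Delay before each consecutive launch attempt: delay *= factor, capped at max.
awk 'BEGIN {
  delay = 5; factor = 100; max = 600
  for (attempt = 1; attempt <= 4; attempt++) {
    wait = (delay > max) ? max : delay
    print "attempt " attempt ": wait " wait "s"
    delay = delay * factor
  }
}'
# attempt 1: wait 5s
# attempt 2: wait 500s
# attempt 3: wait 600s
# attempt 4: wait 600s
```

So with a factor that high, the second failure already waits over eight minutes and every failure after that sits at the 600-second ceiling, which is close to the "one, maybe two attempts, then back off" behavior described.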
Re: [DISCUSS] Renaming Mesos Slave
Please don't change the naming. There is absolutely no reason for it apart from the rambling of a delusional political-correctness maniac. What happened to all the inside jokes and humour of the IT world? Why are we letting the bacteria of political correctness into IT? Apart from the confusion, it will create unnecessary architecture/provisioning changes in many, many places. Possible benefits are *none*. Literally *none*. Please focus on the PRs that really matter and make apache-mesos a better product; this name change is only a (pretty good) troll attempt.

(Also, the logic that led you to changing the "slave" part is somewhat flawed. After all, we are all equal, right? So there shouldn't be any master either. Rename master to democratically-elected-member-of-equal-rights-community, to match the insanity; after all, it's only fair if we take all areas into account.)

PS. The first of April was 2 months ago.

From: Alexander Rojas alexan...@mesosphere.io
Sent: Wednesday, June 3, 2015 9:58 AM
To: user@mesos.apache.org
Subject: Re: [DISCUSS] Renaming Mesos Slave

+1 to Isabel's plan. Times change, languages change, so let's not be anachronistic.

On 02 Jun 2015, at 22:19, Isabel Jimenez contact.isabeljime...@gmail.com wrote:

Hi Adam,

1. Mesos Agent
2. Mesos Agent
3. No, but is master-agent a logical coupling?
4. +1 Dave: documentation, then API, then the rest of the code base.
We should also make sure that we only have to change once and that we cover all the connotations that might offend.

On Tue, Jun 2, 2015 at 11:50 AM, Dave Lester d...@davelester.org wrote:

Hi Adam,

I've been using Master/Worker in presentations for the past 9 months and it hasn't led to any confusion.

1. Mesos worker
2. Mesos worker
3. No
4. Documentation, then API with a full deprecation cycle

Dave

On Mon, Jun 1, 2015, at 02:18 PM, Adam Bordelon wrote:

There has been much discussion about finding a less offensive name than Slave, and many of these thoughts have been captured in https://issues.apache.org/jira/browse/MESOS-1478

I would like to open up the discussion on this topic for one week, and if we cannot arrive at a lazy consensus, I will draft a proposal from the discussion and call for a VOTE. Here are the questions I would like us to answer:

1. What should we call the Mesos Slave node/host/machine?
2. What should we call the mesos-slave process (could be the same)?
3. Do we need to rename Mesos Master too?

Another topic worth discussing is the deprecation process, but we don't necessarily need to decide on that at the same time as deciding the new name(s).

4. How will we phase in the new name and phase out the old name?

Please voice your thoughts and opinions below. Thanks!
-Adam-

P.S. My personal thoughts:
1. Mesos Worker [Node]
2. Mesos Worker or Agent
3. No
4. Carefully
EXECUTOR_SIGNAL_ESCALATION_TIMEOUT vs EXECUTOR_SHUTDOWN_GRACE_PERIOD vs docker_stop_timeout
Hi,

EXECUTOR_SIGNAL_ESCALATION_TIMEOUT is set to 3 seconds, hard-coded.
EXECUTOR_SHUTDOWN_GRACE_PERIOD has a default of 5 seconds and can be configured.
docker_stop_timeout has a default of 0 and is configurable as well.

I am running a jobsystem app that needs to clean up and write back some data before it dies. It is run by Mesos through Docker, and it preferably needs more than 3 seconds (15 would be safe).

For testing, I have set docker_stop_timeout = 20 seconds and executor_shutdown_grace_period = 30 seconds. How do these two play with EXECUTOR_SIGNAL_ESCALATION_TIMEOUT (which is 3 seconds)? Could someone explain the logic and the order in which those parameters are enforced?
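For what it's worth, the app side of the question can be sketched as a SIGTERM handler that performs the write-back before the kill escalation arrives. This is a generic sketch, not the jobsystem app itself: the state-file path and the self-signal used to make the demo terminate are assumptions, and the echo stands in for the ~15-second cleanup described above.

```shell
#!/bin/sh
# Sketch: trap SIGTERM (what `docker stop` sends first), flush state, then
# let the main loop exit before the SIGKILL that follows the stop timeout.
STATE_FILE="${TMPDIR:-/tmp}/jobsystem.state"   # hypothetical path

finished=0
cleanup() {
  echo "state flushed" > "$STATE_FILE"   # stands in for the real write-back
  finished=1
}
trap cleanup TERM

# Demo only: send ourselves SIGTERM after a second so the sketch terminates.
( sleep 1; kill -TERM $$ ) &

while [ "$finished" -eq 0 ]; do sleep 1; done
```

Whatever the exact ordering of the three timeouts turns out to be, the handler's work has to fit inside the shortest SIGTERM-to-SIGKILL window in effect, so measuring the cleanup time and padding the configured timeouts above it seems prudent.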
Re: mesos hadoop jobtracker - cant start.
Hi, regarding symlinks I followed rather explicit directions:

cd hadoop-2.5.0-cdh5.2.0
mv bin bin-mapreduce2
mv examples examples-mapreduce2
ln -s bin-mapreduce1 bin
ln -s examples-mapreduce1 examples
pushd etc
mv hadoop hadoop-mapreduce2
ln -s hadoop-mapreduce1 hadoop
popd
pushd share/hadoop
rm mapreduce
ln -s mapreduce1 mapreduce
popd

Please tell me if the mesos/hadoop README is not specific enough or wrong.

From: Elizabeth Lingg elizab...@mesosphere.io
Sent: Friday, May 22, 2015 8:22 PM
To: user
Subject: Re: mesos hadoop jobtracker - cant start.

Just looking at this quickly, are you sure you set up your symlinks correctly? Since CDH5 includes both MRv1 and MRv2 (YARN) and is configured for YARN by default, we need to update the symlinks to point to the correct directories. If you are interested in running YARN/MR2 instead of MR1, you might be interested in trying out the Myriad project, https://github.com/mesos/myriad. https://github.com/mesos/hadoop is for Hadoop 1 / MR1.

Thanks,
Elizabeth

On Fri, May 22, 2015 at 6:49 AM, Maciej Strzelecki maciej.strzele...@crealytics.com wrote:

I am following the install steps on https://github.com/mesos/hadoop

Question one: Where should conf/mapred-site.xml be? I can only see etc/hadoop/mapred-site.xml; there is no conf/ directory.

Question two: How can I start the hadoop mesos framework?
Steps:
- mvn package, and put target/mesos-hadoop-mr1-0.1.1-SNAPSHOT.jar into hadoop-2.5.0-cdh5.2.0/share/hadoop/common/lib/
- made the link changes described
- packaged the tar, uploaded it to HDFS (hdfs://hdfs/hadoop-2.5.0-cdh5.2.0.tar.gz)
- patched the config file (etc/hadoop/mapred-site.xml)
- uploaded again (with the changed config file)

Invoking the hadoop binary from the downloaded/unpacked tar:

root@mesos-master3 ~/hadoop-2.5.0-cdh5.2.0 # MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so ./bin/hadoop jobtracker
Error: Could not find or load main class org.apache.hadoop.mapred.JobTracker

My classpath echo:

root@mesos-master3 ~/hadoop-2.5.0-cdh5.2.0 # MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so ./bin/hadoop jobtracker
/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../etc/hadoop:/usr/lib/tools.jar:/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../share/hadoop/mapreduce1/hadoop-core-2.5.0-mr1-cdh5.2.0.jar:/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../lib/*.jar:/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../lib/jsp-2.1/*.jar:/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../etc/hadoop:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/common/lib/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/common/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/hdfs:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/hdfs/lib/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/hdfs/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/yarn/lib/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/yarn/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce/lib/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce/*
Error: Could not find or load main class org.apache.hadoop.mapred.JobTracker

Invoking the system-wide hadoop gives me an error about the jobtracker command no longer being supported:

root@mesos-master3 ~/hadoop-2.5.0-cdh5.2.0 # MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so hadoop jobtracker
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.
Sorry, the jobtracker command is no longer supported.
You may find similar functionality with the yarn shell command.
Usage: mapred [--config confdir] COMMAND
       where COMMAND is one of:
  pipes                run a Pipes job
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  classpath            prints the class path needed for running mapreduce subcommands
  historyserver        run job history servers as a standalone daemon
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  hsadmin              job history server admin interface

Most commands print help when invoked w/o parameters.
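The symlink directions quoted earlier in the thread can be sanity-checked mechanically. Here is a small sketch; the CDH directory name is the one from the thread, and the script only reads links, printing OK/BAD rather than assuming the layout exists. If all four read OK and JobTracker is still missing, the next suspect would be whether the MR1 core jar actually ends up on the classpath.

```shell
#!/bin/sh
# Verify the MR1 symlinks from the README steps point where they should;
# prints BAD for anything missing or pointing elsewhere.
HADOOP_DIR="${1:-hadoop-2.5.0-cdh5.2.0}"

check() {  # usage: check <link-relative-to-HADOOP_DIR> <expected-target>
  actual=$(readlink "$HADOOP_DIR/$1" 2>/dev/null)
  if [ "$actual" = "$2" ]; then
    echo "OK  $1 -> $2"
  else
    echo "BAD $1 -> ${actual:-missing}"
  fi
}

check bin bin-mapreduce1
check examples examples-mapreduce1
check etc/hadoop hadoop-mapreduce1
check share/hadoop/mapreduce mapreduce1
```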