Re: Marathon can no longer deploy any apps after a failover

2015-07-17 Thread Maciej Strzelecki
Thanks for the guidelines! I'll try these paths out and join the Marathon mailing list (I was oblivious that there was one ;))


Maciej Strzelecki
Operations Engineer
Tel: +49 30 6098381-50
Fax: +49 851-213728-88
E-mail: mstrzele...@crealytics.de
www.crealytics.com
blog.crealytics.com

crealytics GmbH - Semantic PPC Advertising Technology

Brunngasse 1 - 94032 Passau - Germany
Oranienstraße 185 - 10999 Berlin - Germany

Managing directors: Andreas Reiffen, Christof König, Dr. Markus Kurch
Register court: Amtsgericht Passau, HRB 7466
Geschäftsführer: Andreas Reiffen, Christof König, Daniel Trost
Reg.-Gericht: Amtsgericht Passau, HRB 7466


From: Vinod Kone vinodk...@gmail.com
Sent: Thursday, July 16, 2015 7:09 PM
To: user@mesos.apache.org
Subject: Re: Marathon can no longer deploy any apps after a failover

Sounds like a marathon issue. Mind asking in marathon's mailing list?

On Thu, Jul 16, 2015 at 8:02 AM, Nikolay Borodachev nbo...@adobe.com wrote:
Maciej,

I had a similar problem, but it was solved by setting the LIBPROCESS_IP environment variable to the host's IP address for the Marathon process.
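
A minimal sketch of what that fix can look like, assuming Marathon is launched through an init script that sources /etc/default/marathon - the path, the variable placement and the address 10.0.0.11 are assumptions to adapt to your setup:

# /etc/default/marathon
# Make libprocess bind to and advertise the host's routable IP,
# so the master and slaves can reach the framework after a failover.
LIBPROCESS_IP=10.0.0.11

# then restart Marathon so the new environment is picked up:
#   service marathon restart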

Nikolay



Marathon can no longer deploy any apps after a failover

2015-07-16 Thread Maciej Strzelecki
Problem:


If I restart the current framework leader for Marathon (the host shown in the Active Frameworks tab of the Mesos UI), a new leader is elected after a moment, but any new deployments are stuck indefinitely in the 'deploying' state (empty black bar, 0/1, hanging); even at debug log level I don't see any errors in the Marathon/Mesos logs.

The old tasks are also untouchable at that time: yes, they keep running, but I can't kill, restart, or scale them.


When that happens I can recover by the following steps (a consolidated sketch follows below):

stop Marathon on all masters

remove the framework via a curl to the Mesos API /shutdown endpoint

purge /marathon from the ZooKeeper CLI

restart the Docker service on all slaves (that kills the zombie containers)

restart the mesos-slave service on all slaves (pampering my paranoia here)

Then I can deploy apps again.
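
For reference, a rough shell sketch of that recovery sequence, under a few assumptions: the /master/shutdown endpoint this Mesos version exposes (newer releases call it /master/teardown), a zkCli.sh available on the master, and placeholder host names and framework ID - check all of these against your own cluster:

# 1. stop Marathon on every master
for h in mesos-master1 mesos-master2 mesos-master3; do ssh $h service marathon stop; done

# 2. tear down the stuck framework (frameworkId copied from the Mesos UI)
curl -X POST -d 'frameworkId=<framework-id-from-the-ui>' http://mesos-master1:5050/master/shutdown

# 3. purge Marathon's state from ZooKeeper
/usr/share/zookeeper/bin/zkCli.sh -server mesos-master1:2181 rmr /marathon

# 4. restart Docker and mesos-slave on every slave
for h in $(cat slaves.txt); do ssh $h 'service docker restart && service mesos-slave restart'; done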


How can I avoid this problem? Are there any basic settings I'm missing? This is scary, as the reboot of a single master (out of 3 or 5 servers) freezes everything that is deployed through Marathon, and the steps needed to reclaim control introduce downtime for every single app running there.





Configuration:


Running Ubuntu 14.04.2 LTS

mesos     0.22.1-1.0.ubuntu1404

marathon  0.9.0-1.0.381.ubuntu1404

chronos   2.3.4-1.0.81.ubuntu1404


The cluster uses 3 masters and 15 slaves. The master machines also run the mesos-slave process (although those machines contribute only a portion of their resources as offers).


The Mesos/Marathon configuration relies mostly on defaults; the options I have specified are listed below. The quorum is 2.


The Marathon service runs on all 3 master machines.


root@mesos-master1 ~ # tree /etc/marathon/
/etc/marathon/
`-- conf
|-- event_subscriber
|-- framework_name
|-- hostname
|-- logging_level
`-- zk

1 directory, 5 files
root@mesos-master1 ~ # tree /etc/mesos
/etc/mesos
`-- zk

0 directories, 1 file
root@mesos-master1 ~ # tree /etc/mesos-slave/
/etc/mesos-slave/
|-- containerizers
|-- docker_stop_timeout
|-- executor_registration_timeout
|-- executor_shutdown_grace_period
|-- hostname
|-- ip
|-- logging_level
`-- resources

0 directories, 8 files
root@mesos-master1 ~ # tree /etc/mesos-master
/etc/mesos-master
|-- cluster
|-- hostname
|-- ip
|-- logging_level
|-- quorum
`-- work_dir
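
For context, each file in these directories holds the value of the identically named command-line flag, which the packaging's wrapper scripts pass to the daemon at start-up. A hypothetical sample of contents (only the quorum value of 2 is stated above; the rest are placeholders):

# /etc/mesos-master/quorum
2

# /etc/mesos/zk
zk://mesos-master1:2181,mesos-master2:2181,mesos-master3:2181/mesos

# /etc/mesos-slave/containerizers
docker,mesos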


Can marathon cancel a deployment if the application is sick?

2015-07-07 Thread Maciej Strzelecki
How can I make Marathon cancel a deployment if the app does not start after several tries?

I saw these three settings (with their defaults) in the documentation:

backoffSeconds: 1,
backoffFactor: 1.15,
maxLaunchDelaySeconds: 3600,

backoffSeconds, backoffFactor and maxLaunchDelaySeconds

Configures exponential backoff behavior when launching potentially sick apps. 
This prevents sandboxes associated with consecutively failing tasks from 
filling up the hard disk on Mesos slaves. The backoff period is multiplied by 
the factor for each consecutive failure until it reaches maxLaunchDelaySeconds. 
This applies also to tasks that are killed due to failing too many health 
checks.
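
As I read the quoted paragraph, the launch delay after the n-th consecutive failure is roughly backoffSeconds * backoffFactor^n, capped at maxLaunchDelaySeconds - with the defaults, 1 * 1.15^n seconds up to 3600. A very large factor pushes the delay to the cap after a single failure (e.g. 5 * 150 = 750 > 600), but as far as I can tell the backoff only slows the retries down; it never aborts the deployment.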



I would expect to be able to tell Marathon to give up after it has tried a few times. Is there a way?


What I am considering (sketched as an app definition below):

backoffSeconds - 5

backoffFactor - high, 100-200ish (so it reaches the max delay very quickly, after just a few failures)

maxLaunchDelaySeconds - 600 (to allow for a docker pull to finish and for general startup lag)
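
A sketch of how those values would look in an app definition POSTed to Marathon's /v2/apps endpoint; the app id, Docker image and resource numbers are made up for illustration:

curl -X POST http://mesos-master1:8080/v2/apps \
     -H 'Content-Type: application/json' \
     -d '{
           "id": "/example-app",
           "container": {
             "type": "DOCKER",
             "docker": { "image": "registry.example.com/example-app:latest" }
           },
           "cpus": 0.5,
           "mem": 256,
           "instances": 1,
           "backoffSeconds": 5,
           "backoffFactor": 150,
           "maxLaunchDelaySeconds": 600
         }'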


Root cause: a developer deploys an application with either a code failure (a skipped test) or a Docker image that can't be pulled. If such a task is left in Marathon's retry-the-deployment loop for a while, the Mesos UI shows thousands of failed tasks. I'd love to see one, maybe two, failed start attempts and then a back-off.





Maciej Strzelecki
Operations Engineer
Tel: +49 30 6098381-50
Fax: +49 851-213728-88
E-mail: mstrzele...@crealytics.de
www.crealytics.com
blog.crealytics.com

crealytics GmbH - Semantic PPC Advertising Technology

Brunngasse 1 - 94032 Passau - Germany
Oranienstraße 185 - 10999 Berlin - Germany

Managing directors: Andreas Reiffen, Christof König, Dr. Markus Kurch
Register court: Amtsgericht Passau, HRB 7466
Geschäftsführer: Andreas Reiffen, Christof König, Daniel Trost
Reg.-Gericht: Amtsgericht Passau, HRB 7466


Re: [DISCUSS] Renaming Mesos Slave

2015-06-03 Thread Maciej Strzelecki
Please don't change the naming. There is absolutely no reason for it apart from the rambling of a delusional political-correctness maniac.


What happened to all the inside jokes and humour of the IT world - why are we 
letting the bacteria of political-correctness into IT?


Apart from the confusion, it will create unnecessary architecture/provisioning changes in many, many places.


Possible benefits are *none*. Literally - *none*.


Please focus on the PRs that really matter and make apache-mesos a better product; this name change is only a (pretty good) troll attempt.


(Also, the logic that led you to changing the 'slave' part is somewhat flawed - after all, we are all equal, right? So there shouldn't be any 'master' either. Rename master to democratically-elected-member-of-equal-rights-community to match the insanity; after all, it's only fair if we take all areas into account.)


PS. First of April was 2 months ago.

Maciej Strzelecki
Operations Engineer
Tel: +49 30 6098381-50
Fax: +49 851-213728-88
E-mail: mstrzele...@crealytics.de
www.crealytics.com
blog.crealytics.com

crealytics GmbH - Semantic PPC Advertising Technology

Brunngasse 1 - 94032 Passau - Germany
Oranienstraße 185 - 10999 Berlin - Germany

Managing directors: Andreas Reiffen, Christof König, Dr. Markus Kurch
Register court: Amtsgericht Passau, HRB 7466
Geschäftsführer: Andreas Reiffen, Christof König, Daniel Trost
Reg.-Gericht: Amtsgericht Passau, HRB 7466


From: Alexander Rojas alexan...@mesosphere.io
Sent: Wednesday, June 3, 2015 9:58 AM
To: user@mesos.apache.org
Subject: Re: [DISCUSS] Renaming Mesos Slave

+1 to Isabel's plan.

Times change, languages change, so let's not be anachronistic.

On 02 Jun 2015, at 22:19, Isabel Jimenez contact.isabeljime...@gmail.com wrote:

Hi Adam,

1. Mesos Agent
2. Mesos Agent
3. No, but is master-agent a logical coupling?
4. +1 Dave: documentation, then API, then the rest of the code base. We should also make sure that we only have to make the change once and that we cover all the connotations that might offend.



On Tue, Jun 2, 2015 at 11:50 AM, Dave Lester d...@davelester.org wrote:
Hi Adam,

I've been using Master/Worker in presentations for the past 9 months and it 
hasn't led to any confusion.

1. Mesos worker
2. Mesos worker
3. No
4. Documentation, then API with a full deprecation cycle

Dave

On Mon, Jun 1, 2015, at 02:18 PM, Adam Bordelon wrote:
There has been much discussion about finding a less offensive name than 
Slave, and many of these thoughts have been captured in 
https://issues.apache.org/jira/browse/MESOS-1478

I would like to open up the discussion on this topic for one week, and if we 
cannot arrive at a lazy consensus, I will draft a proposal from the discussion 
and call for a VOTE.
Here are the questions I would like us to answer:
1. What should we call the Mesos Slave node/host/machine?
2. What should we call the mesos-slave process (could be the same)?
3. Do we need to rename Mesos Master too?

Another topic worth discussing is the deprecation process, but we don't 
necessarily need to decide on that at the same time as deciding the new name(s).
4. How will we phase in the new name and phase out the old name?

Please voice your thoughts and opinions below.

Thanks!
-Adam-

P.S. My personal thoughts:
1. Mesos Worker [Node]
2. Mesos Worker or Agent
3. No
4. Carefully





EXECUTOR_SIGNAL_ESCALATION_TIMEOUT vs EXECUTOR_SHUTDOWN_GRACE_PERIOD vs docker_stop_timeout

2015-06-01 Thread Maciej Strzelecki
Hi,


EXECUTOR_SIGNAL_ESCALATION_TIMEOUT is hard-coded to 3 seconds.

EXECUTOR_SHUTDOWN_GRACE_PERIOD has a default of 5 seconds and can be configured.

docker_stop_timeout has a default of 0 and is configurable as well.


I am running a job-system app that needs to clean up and write back some data before it dies. It is run by Mesos through Docker and, preferably, it needs more than 3 seconds (15 would be safe).


For testing, I have set:

docker_stop_timeout = 20secs

and

executor_shutdown_grace_period = 30secs

How do the above two play with EXECUTOR_SIGNAL_ESCALATION_TIMEOUT (which is 3 seconds)? Could someone explain the logic and the order in which those parameters are enforced? (How I applied the two settings is sketched below.)
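
A minimal sketch of setting those two values with the per-file flag convention shown earlier for /etc/mesos-slave ('NNsecs' is Mesos's duration syntax; assume mesos-slave has to be restarted to pick the flags up):

# on every slave
echo '20secs' > /etc/mesos-slave/docker_stop_timeout
echo '30secs' > /etc/mesos-slave/executor_shutdown_grace_period
service mesos-slave restart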




Maciej Strzelecki
Operations Engineer
Tel: +49 30 6098381-50
Fax: +49 851-213728-88
E-mail: mstrzele...@crealytics.de
www.crealytics.com
blog.crealytics.com

crealytics GmbH - Semantic PPC Advertising Technology

Brunngasse 1 - 94032 Passau - Germany
Oranienstraße 185 - 10999 Berlin - Germany

Managing directors: Andreas Reiffen, Christof König, Dr. Markus Kurch
Register court: Amtsgericht Passau, HRB 7466
Geschäftsführer: Andreas Reiffen, Christof König, Daniel Trost
Reg.-Gericht: Amtsgericht Passau, HRB 7466


Re: mesos hadoop jobtracker - can't start.

2015-05-23 Thread Maciej Strzelecki
Hi,

Regarding the symlinks, I followed rather explicit directions:


cd hadoop-2.5.0-cdh5.2.0

# point the default bin/ and examples/ at the MRv1 variants
mv bin bin-mapreduce2
mv examples examples-mapreduce2
ln -s bin-mapreduce1 bin
ln -s examples-mapreduce1 examples

# same for the configuration directory
pushd etc
mv hadoop hadoop-mapreduce2
ln -s hadoop-mapreduce1 hadoop
popd

# and for the MapReduce libraries
pushd share/hadoop
rm mapreduce
ln -s mapreduce1 mapreduce
popd


Please tell me if the mesos/hadoop README is not specific enough or wrong.





From: Elizabeth Lingg elizab...@mesosphere.io
Sent: Friday, May 22, 2015 8:22 PM
To: user
Subject: Re: mesos hadoop jobtracker - can't start.

Just looking at this quickly: are you sure you set up your symlinks correctly? Since CDH5 includes both MRv1 and MRv2 (YARN) and is configured for YARN by default, we need to update the symlinks to point to the correct directories.

If you are interested in running YARN (MR2) instead of MR1, you might be interested in trying out the Myriad project, https://github.com/mesos/myriad. https://github.com/mesos/hadoop is for Hadoop 1 / MR1.

Thanks,
Elizabeth





mesos hadoop jobtracker - can't start.

2015-05-22 Thread Maciej Strzelecki
I am following the install steps on
https://github.com/mesos/hadoop

Question one:

Where should conf/mapred-site.xml be?

I can only see etc/hadoop/mapred-site.xml; there is no conf/ directory.


Question two:
---
How can I start the Hadoop Mesos framework?

Steps (roughly the sequence sketched below):
- ran mvn package and put target/mesos-hadoop-mr1-0.1.1-SNAPSHOT.jar into hadoop-2.5.0-cdh5.2.0/share/hadoop/common/lib/
- made the link changes described above
- packaged the tar and uploaded it to HDFS (hdfs://hdfs/hadoop-2.5.0-cdh5.2.0.tar.gz)
- patched the config file (etc/hadoop/mapred-site.xml)
- uploaded it again (with the changed config file)
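
A consolidated sketch of those build-and-upload steps as I understand them (the hdfs://hdfs/ URI is the one from above; everything else is illustrative):

# build the framework jar and drop it into the Hadoop distribution
mvn package
cp target/mesos-hadoop-mr1-0.1.1-SNAPSHOT.jar hadoop-2.5.0-cdh5.2.0/share/hadoop/common/lib/

# re-package the tree (after the symlink and mapred-site.xml changes) and push it to HDFS
tar czf hadoop-2.5.0-cdh5.2.0.tar.gz hadoop-2.5.0-cdh5.2.0
hadoop fs -put -f hadoop-2.5.0-cdh5.2.0.tar.gz hdfs://hdfs/hadoop-2.5.0-cdh5.2.0.tar.gz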

Invoking the hadoop binary from the downloaded/unpacked tar:

root@mesos-master3 ~/hadoop-2.5.0-cdh5.2.0 # 
MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so ./bin/hadoop jobtracker
Error: Could not find or load main class org.apache.hadoop.mapred.JobTracker

my classpath echo:
root@mesos-master3 ~/hadoop-2.5.0-cdh5.2.0 # 
MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so ./bin/hadoop jobtracker
/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../etc/hadoop:/usr/lib/tools.jar:/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../share/hadoop/mapreduce1/hadoop-core-2.5.0-mr1-cdh5.2.0.jar:/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../lib/*.jar:/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../lib/jsp-2.1/*.jar:/root/hadoop-2.5.0-cdh5.2.0/bin-mapreduce1/../etc/hadoop:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/common/lib/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/common/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/hdfs:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/hdfs/lib/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/hdfs/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/yarn/lib/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/yarn/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce/lib/*:/root/hadoop-2.5.0-cdh5.2.0/share/hadoop/mapreduce/*
Error: Could not find or load main class org.apache.hadoop.mapred.JobTracker


Invoking the system-wide hadoop gives me an error that the jobtracker command is no longer supported:
MESOS_NATIVE_JAVA_LIBRARY=/path/to/libmesos.so hadoop jobtracker

root@mesos-master3 ~/hadoop-2.5.0-cdh5.2.0 # 
MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so hadoop jobtracker
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

Sorry, the jobtracker command is no longer supported.
You may find similar functionality with the yarn shell command.
Usage: mapred [--config confdir] COMMAND
       where COMMAND is one of:
  pipes            run a Pipes job
  job              manipulate MapReduce jobs
  queue            get information regarding JobQueues
  classpath        prints the class path needed for running
                   mapreduce subcommands
  historyserver    run job history servers as a standalone daemon
  distcp <srcurl> <desturl>   copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest>   create a hadoop archive
  hsadmin          job history server admin interface

Most commands print help when invoked w/o parameters.