Re: Apache Spark Over Mesos

2016-03-15 Thread Tim Chen
Hi Pradeep,

You'll need to specify an s3:// URL instead of a relative path like you
did; you can also use http or hdfs if you want.

You also need to make sure the S3 credentials are available in the Hadoop
configuration that's embedded in your docker image; you should be able
to find help on that easily. (We're still working through these user
experience problems around configuration; for now it's easier if
the docker image has all the right configuration.)
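
For illustration, a submit command along these lines should do it (the
bucket and jar path are placeholders; whether you use s3://, s3n:// or
s3a:// depends on the S3 client available in your Hadoop build):

$ bin/spark-submit --deploy-mode cluster \
    --master mesos://spark-dispatcher.service.consul:7077 \
    --class org.apache.spark.examples.SparkPi \
    s3a://my-build-bucket/jars/spark-examples.jar 10

The matching credentials (for s3a, the fs.s3a.access.key and
fs.s3a.secret.key properties) then have to be present in the Hadoop
configuration baked into the docker image, as noted above.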

Tim

On Tue, Mar 15, 2016 at 11:18 AM, Pradeep Chhetri <
pradeep.chhetr...@gmail.com> wrote:

> Hello Tim,
>
> I was able to start the spark tasks also as docker containers.
>
> I have one question:
>
> Currently, when I submit a sample job as follows:
>
> $ bin/spark-submit --deploy-mode cluster --master
> mesos://spark-dispatcher.service.consul:7077 --class
> org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10
>
> It tries to copy the spark-examples*.jar from an absolute path on the host to
> the sandbox. Is there any way I can make it pull from S3 instead of looking
> on the host for the jar, so that my build pipeline can push the jar to S3 and
> running spark-submit as a deployment job will pull that jar from S3?
>
> Thanks.
>
>
>
> On Tue, Mar 15, 2016 at 5:17 PM, Pradeep Chhetri <
> pradeep.chhetr...@gmail.com> wrote:
>
>> Tim, sorry, I was wrong above.
>>
>> The above config is from the spark dispatcher container, and that
>> configuration is not being propagated to the driver.
>>
>> I will try the workaround you suggested and let you know how it goes.
>>
>> Thanks
>>
>>
>>
>> On Tue, Mar 15, 2016 at 4:42 PM, Pradeep Chhetri <
>> pradeep.chhetr...@gmail.com> wrote:
>>
>>> Hello Tim,
>>>
>>> Here is my conf/spark-defaults.conf which is inside the docker image:
>>>
>>> $ cat conf/spark-defaults.conf
>>>
>>> spark.mesos.coarse: false
>>> spark.mesos.executor.docker.image: docker-registry/mesos-spark:master-12
>>> spark.mesos.mesosExecutor.cores: 0.25
>>> spark.mesos.executor.home: /opt/spark
>>> spark.mesos.uris: file:///etc/docker.tar.gz
>>>
>>> I am already setting it inside the docker image.
>>>
>>> Am I missing something ?
>>>
>>> Regards,
>>>
>>> On Tue, Mar 15, 2016 at 4:37 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>
>>>> Hi Pradeep,
>>>>
>>>> Yes, we still have a pending PR that will start propagating these
>>>> settings down to the executors; right now they are only applied on the driver.
>>>> As a workaround you can set spark.mesos.executor.docker.image
>>>> in the spark-defaults.conf file in the docker image you use to launch the
>>>> driver, and the driver should automatically pick up this setting when it is
>>>> launched.
>>>>
>>>> Tim
>>>>
>>>> On Tue, Mar 15, 2016 at 9:26 AM, Pradeep Chhetri <
>>>> pradeep.chhetr...@gmail.com> wrote:
>>>>
>>>>> Hello Timothy,
>>>>>
>>>>> I am setting spark.mesos.executor.docker.image. In my case, the
>>>>> driver is actually started as a docker container (SparkPi in the screenshot),
>>>>> but the tasks which are spawned by the driver are not starting as containers
>>>>> but as plain java processes. Is this expected?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Tue, Mar 15, 2016 at 4:19 PM, Timothy Chen <t...@mesosphere.io>
>>>>> wrote:
>>>>>
>>>>>> You can launch the driver and executor in docker containers as well
>>>>>> by setting spark.mesos.executor.docker.image to the image you want to use
>>>>>> to launch them.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Mar 15, 2016, at 8:49 AM, Radoslaw Gruchalski <
>>>>>> ra...@gruchalski.com> wrote:
>>>>>>
>>>>>> Pradeep,
>>>>>>
>>>>>> You can mount a spark directory as a volume. This means you have to
>>>>>> have spark deployed on every agent.
>>>>>>
>>>>>> Another thing you can do is place Spark in HDFS, assuming you have
>>>>>> HDFS available, but that too will download a copy to the sandbox.
>>>>>>
>>>>>> I'd prefer the former.
>>>>>>
>>>>>> Sent from Outlook Mobile

Re: Apache Spark Over Mesos

2016-03-15 Thread Tim Chen
Hi Pradeep,

Yes, we still have a pending PR that will start propagating these settings
down to the executors; right now they are only applied on the driver. As a
workaround you can set spark.mesos.executor.docker.image in the
spark-defaults.conf file in the docker image you use to launch the driver,
and the driver should automatically pick up this setting when it is launched.
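
For example, the driver image could ship a conf/spark-defaults.conf with a
line along these lines (the image name is a placeholder):

spark.mesos.executor.docker.image   docker-registry.example.com/mesos-spark:latest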

Tim

On Tue, Mar 15, 2016 at 9:26 AM, Pradeep Chhetri <
pradeep.chhetr...@gmail.com> wrote:

> Hello Timothy,
>
> I am setting spark.mesos.executor.docker.image. In my case, the driver is
> actually started as a docker container (SparkPi in the screenshot), but the
> tasks which are spawned by the driver are not starting as containers but as
> plain java processes. Is this expected?
>
> Thanks
>
> On Tue, Mar 15, 2016 at 4:19 PM, Timothy Chen  wrote:
>
>> You can launch the driver and executor in docker containers as well by
>> setting spark.mesos.executor.docker.image to the image you want to use to
>> launch them.
>>
>> Tim
>>
>> On Mar 15, 2016, at 8:49 AM, Radoslaw Gruchalski 
>> wrote:
>>
>> Pradeep,
>>
>> You can mount a spark directory as a volume. This means you have to have
>> spark deployed on every agent.
>>
>> Another thing you can do is place Spark in HDFS, assuming you have
>> HDFS available, but that too will download a copy to the sandbox.
>>
>> I'd prefer the former.
>>
>> Sent from Outlook Mobile 
>>
>> _
>> From: Pradeep Chhetri 
>> Sent: Tuesday, March 15, 2016 4:41 pm
>> Subject: Apache Spark Over Mesos
>> To: 
>>
>>
>> Hello,
>>
>> I am able to run Apache Spark over Mesos. It's quite simple to run the Spark
>> Dispatcher over Marathon and ask it to run the Spark Executor (which I guess
>> can also be called the Spark Driver) as a docker container.
>>
>> I have a query regarding this:
>>
>> All Spark tasks are spawned by first downloading the Spark
>> artifacts. I was wondering if there is some way I can start them as
>> docker containers too; this would save the time spent downloading the Spark
>> artifacts. I am running Spark in fine-grained mode.
>>
>> I have attached a screenshot of a sample job
>>
>> Thanks,
>>
>> --
>> Pradeep Chhetri
>>
>>
>>
>
>
> --
> Pradeep Chhetri
>


Re: Mesos 0.27 and docker

2016-03-11 Thread Tim Chen
Hi Walter,

The parameters field in container.docker is for optional flags that are
passed to the Docker CLI when the container is started, not for the command
line arguments of the command running inside your container.

You should specify those in the "command" json value, either as part of the
value string or in the "args" array.
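
A minimal sketch of what that could look like in a Marathon app definition,
keeping the real docker options (such as the volumes-from flags from the
command quoted below, with no leading "--") under parameters and moving the
container's own arguments into args:

{
  "id": "helloworld",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "toyota/oraclelinux-7-toyota-jdk-8-jetty-9.3:1.0",
      "parameters": [
        { "key": "volumes-from", "value": "toyota-apps" },
        { "key": "volumes-from", "value": "toyota-logs" }
      ]
    }
  },
  "args": ["--application=Test", "--jvm-heap-size=-Xms64m -Xmx256m"]
}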

Tim

On Fri, Mar 11, 2016 at 8:56 AM, haosdent  wrote:

> Hi, you don't need to add "--" before the key. Just using "application" and
> "instance" should be enough. Mesos will prepend "--" to the parameter key
> automatically.
>
> On Fri, Mar 11, 2016 at 11:53 PM, Walter Heestermans (TME) <
> walter.heesterm...@external.toyota-europe.com> wrote:
>
>> Just a small question. I have it up and running for a simple HelloWorld,
>> but I have this docker run command:
>>
>>
>>
>> sudo docker run --rm -i -t --hostname=${HOSTNAME} --volumes-from
>> toyota-apps --volumes-from toyota-logs -p 8080:8080
>> toyota/oraclelinux-7-toyota-jdk-8-jetty-9.3:1.0 --application=Test
>> --jvm-heap-size="-Xms64m -Xmx256m"
>>
>>
>>
>> These --application=Test --jvm-heap-size="-Xms64m -Xmx256m" arguments are
>> input arguments to the container, not real docker options.
>>
>>
>>
>> I specified these as
>>
>>
>>
>> {
>>     …
>>     "container": {
>>         "type": "DOCKER",
>>         "docker": {
>>             …
>>             "parameters": [
>>                 { "key": "--application", "value": "Test" },
>>                 { "key": "--instance", "value": "Test" }
>>             ]
>>         },
>>         …
>> }
>>
>>
>>
>> These don't seem to be passed to the container. Are these parameters
>> only meant for real docker command options? And if so, how can I give input
>> arguments to the container?
>>
>>
>>
>> Walter
>>
>>
>>
>> *From:* Rad Gruchalski [mailto:ra...@gruchalski.com]
>> *Sent:* 11 March 2016 12:51
>>
>> *To:* user@mesos.apache.org
>> *Subject:* Re: Mesos 0.27 and docker
>>
>>
>>
>> I like my life easy so I use Marathon.
>>
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski/
>>
>>
>> *Confidentiality: *This communication is intended for the above-named
>> person and may be confidential and/or legally privileged.
>> If it has come to you in error you must take no action based on it, nor
>> must you copy or show it to anyone; please delete/destroy and inform the
>> sender immediately.
>>
>> On Friday, 11 March 2016 at 12:24, Walter Heestermans (TME) wrote:
>>
>> You are specifying two ways, what’s the preferred way?
>>
>>
>>
>> Walter
>>
>>
>>
>>
>>
>> *From:* Rad Gruchalski [mailto:ra...@gruchalski.com
>> ]
>> *Sent:* 11 March 2016 12:20
>> *To:* user@mesos.apache.org
>> *Subject:* Re: Mesos 0.27 and docker
>>
>>
>>
>> Walter,
>>
>>
>>
>> All you need to know to start is documented here:
>> https://mesosphere.github.io/marathon/docs/native-docker.html.
>>
>> That’s with Marathon, if you are planning on using it directly with
>> Mesos, http://mesos.apache.org/documentation/latest/docker-containerizer/
>>
>> No problem using latest Docker, I have a 0.27.2 cluster with Docker
>> 1.10.2 (docker-engine). All working perfectly fine.
>>
>> Kind regards,
>> Radek Gruchalski
>> ra...@gruchalski.com 
>> de.linkedin.com/in/radgruchalski/
>>
>>
>>
>> On Friday, 11 March 2016 at 11:20, Walter Heestermans (TME) wrote:
>>
>> Hi,
>>
>>
>>
>> I'm new to Mesos, and I would like to study Docker
>> containerization inside Mesos.
>>
>>
>>
>> Can somebody provide me some interesting links, and some links to samples
>> showing how to use and configure it?
>>
>>
>>
>> Walter
>>
>>
>>
>>
>>
>> This e-mail may contain confidential information. If you are not an
>> addressee or otherwise authorised to receive this message, you should not
>> use, copy, disclose or take any action based on this e-mail. If you have
>> received this e-mail in error, please inform the sender promptly and delete
>> this message and any attachments immediately.
>>
>>
>>

Re: Asking for Help: Destroy docker container from marathon kills mesos slave

2016-03-01 Thread Tim Chen
Not in particular; I remember seeing something similar in the past when
the Mesos slave itself is launched in a Docker container, but in your case I
don't think you guys are doing that.

Does it repro 100% of the time? If you create a ticket with repro steps we
can take a look.

Tim

On Tue, Mar 1, 2016 at 5:41 PM, zhz shi <messi.sh...@gmail.com> wrote:

> Yes we have a plan to do the upgrade but do you know the root cause of
> this problem for 0.25?
>
> On Wed, Mar 2, 2016 at 1:49 AM, Tim Chen <t...@mesosphere.io> wrote:
>
>> Are you able to try out the latest Mesos release instead of 0.25?
>>
>> Tim
>>
>> On Mon, Feb 29, 2016 at 9:11 PM, shizhz <messi.sh...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Is this the correct place to ask for help? If it is, could anybody help
>>> me with the problem I posted on Stack Overflow:
>>> http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave
>>>
>>> Sorry for the bother, and I wish you all a nice day.
>>>
>>> BR, Shizhz
>>>
>>
>>
>
>
> --
> BR, Zhenzhong
>


Re: Asking for Help: Destroy docker container from marathon kills mesos slave

2016-03-01 Thread Tim Chen
Are you able to try out the latest Mesos release instead of 0.25?

Tim

On Mon, Feb 29, 2016 at 9:11 PM, shizhz  wrote:

> Hi all,
>
> Is this the correct place to ask for help? If it is, could anybody help me
> with the problem I posted on Stack Overflow:
> http://stackoverflow.com/questions/35713985/destroy-docker-container-from-marathon-kills-mesos-slave
>
> Sorry for the bother, and I wish you all a nice day.
>
> BR, Shizhz
>


Re: Help needed (alas, urgently)

2016-01-15 Thread Tim Chen
dout: write unix @: broken pipe
>>>>> INFO[3190] Container
>>>>> cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to
>>>>> exit within 15 seconds of SIGTERM - using the force
>>>>> INFO[3200] Container cf7fc7c48324 failed to exit within 10
>>>>> seconds of kill - trying direct SIGKILL
>>>>>
>>>>> *STDOUT from Mesos:*
>>>>>
>>>>> *--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
>>>>> *--docker="/usr/local/ecxmcc/weaveShim" --help="false"
>>>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>>>> --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b"
>>>>> --stop_timeout="15secs"
>>>>> --container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
>>>>> --docker="/usr/local/ecxmcc/weaveShim" --help="false"
>>>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>>>> --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b"
>>>>> --stop_timeout="15secs"
>>>>> Registered docker executor on 71.100.202.99
>>>>> Starting task ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285
>>>>> 2016-01-14T20:45:38.613+ [initandlisten] MongoDB starting : pid=1
>>>>> port=27017 dbpath=/data/db 64-bit host=ecxconfigdb
>>>>> 2016-01-14T20:45:38.614+ [initandlisten] db version v2.6.8
>>>>> 2016-01-14T20:45:38.614+ [initandlisten] git version:
>>>>> 3abc04d6d4f71de00b57378e3277def8fd7a6700
>>>>> 2016-01-14T20:45:38.614+ [initandlisten] build info: Linux
>>>>> build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3
>>>>> 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
>>>>> 2016-01-14T20:45:38.614+ [initandlisten] allocator: tcmalloc
>>>>> 2016-01-14T20:45:38.614+ [initandlisten] options: { storage: {
>>>>> journal: { enabled: true } } }
>>>>> 2016-01-14T20:45:38.616+ [initandlisten] journal
>>>>> dir=/data/db/journal
>>>>> 2016-01-14T20:45:38.616+ [initandlisten] recover : no journal
>>>>> files present, no recovery needed
>>>>> 2016-01-14T20:45:39.006+ [initandlisten] waiting for connections
>>>>> on port 27017
>>>>> 2016-01-14T20:46:38.975+ [clientcursormon] mem (MB) res:77
>>>>> virt:12942
>>>>> 2016-01-14T20:46:38.975+ [clientcursormon]  mapped (incl journal
>>>>> view):12762
>>>>> 2016-01-14T20:46:38.975+ [clientcursormon]  connections:0
>>>>> Killing docker task
>>>>> Shutting down
>>>>> Killing docker task
>>>>> Shutting down
>>>>> Killing docker task
>>>>> Shutting down
>>>>>
>>>>> On Thu, Jan 14, 2016 at 3:38 PM, Paul Bell <arach...@gmail.com> wrote:
>>>>>
>>>>>> Hey Tim,
>>>>>>
>>>>>> Thank you very much for your reply.
>>>>>>
>>>>>> Yes, I am in the midst of trying to reproduce the problem. If
>>>>>> successful (so to speak), I will do as you ask.
>>>>>>
>>>>>> Cordially,
>>>>>>
>>>>>> Paul
>>>>>>
>>>>>> On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> Looks like we've already issued the docker stop, as you can see in the
>>>>>>> ps output, but the containers are still running. Can you look at the
>>>>>>> Docker
>>>>>>>

Re: Help needed (alas, urgently)

2016-01-14 Thread Tim Chen
Hi Paul,

Looks like we've already issued the docker stop, as you can see in the ps
output, but the containers are still running. Can you look at the Docker
daemon logs and see what's going on there?

And can you also try setting docker_stop_timeout to 0 so that we
SIGKILL the containers right away, and see if this still happens?
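
Concretely, that means changing the flag on the agent command line shown in
the ps output below, for example something like:

/usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos \
    --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim \
    --docker_stop_timeout=0secs ...

with the rest of the flags left as they are today (0secs replacing 15secs).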

Tim



On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell  wrote:

> Hi All,
>
> It's been quite some time since I've posted here and that's chiefly
> because up until a day or two ago, things were working really well.
>
> I actually may have posted about this some time back. But then the problem
> seemed more intermittent.
>
> In summa, several "docker stops" don't work, i.e., the containers are not
> stopped.
>
> Deployment:
>
> one Ubuntu VM (vmWare) LTS 14.04 with kernel 3.19
> Zookeeper
> Mesos-master (0.23.0)
> Mesos-slave (0.23.0)
> Marathon (0.10.0)
> Docker 1.9.1
> Weave 1.1.0
> Our application containers, which include:
> MongoDB (4)
> PostGres
> ECX (our product)
>
> The only thing that's changed at all in the config above is the version of
> Docker. It used to be 1.6.2, but today I upgraded it hoping to solve the
> problem.
>
>
> My automater program stops the application by sending Marathon an "http
> delete" for each running app. Every now & then (reliably reproducible today)
> not all containers get stopped. Most recently, 3 containers failed to stop.
>
> Here are the attendant phenomena:
>
> Marathon shows the 3 applications in deployment mode (presumably
> "deployment" in the sense of "stopping")
>
> *ps output:*
>
> root@71:~# ps -ef | grep docker
> root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
> unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
> root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
> --hostname=71.100.202.99 --ip=71.100.202.99
> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
> root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
> 6783
> root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
> 6783
> root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
> 53
> root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
> 53
> root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
> --stop_timeout=15secs
> root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> --stop_timeout=15secs
> root  7640  4967  0 14:01 ?00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298
> --stop_timeout=15secs
> *root  9696  9695  0 14:06 ?00:00:00 /usr/bin/docker stop -t
> 15
> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
> *root  9709  9708  0 14:06 ?00:00:00 /usr/bin/docker stop -t
> 15
> mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
> *root  9720  9719  0 14:06 ?00:00:00 /usr/bin/docker stop -t
> 15
> mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*
>
> *docker ps output:*
>
> root@71:~# docker ps
> CONTAINER ID    IMAGE    COMMAND    CREATED    STATUS

Re: Mesos fetcher in dockerized slave

2015-12-24 Thread Tim Chen
cs="0" --logging_level="INFO"
>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>> --sandbox_directory="/tmp/mesos/slaves/db70e09f-f39d-491c-8480-73d9858c140b-S0/frameworks/246b272b-d649-47c0-88ca-6b1ff35f437a-/executors/test-app.d4398af9-a67a-11e5-b1cf-fa163e920cd0/runs/d965f59b-cc1a-4081-95d2-f3370214c84d"
>> --stop_timeout="0ns"
>> I1219 18:03:40.177598 6 exec.cpp:136] Version: 0.26.0
>> I1219 18:03:40.19206010 exec.cpp:210] Executor registered on slave
>> db70e09f-f39d-491c-8480-73d9858c140b-S0
>> Registered docker executor on 90.147.170.246
>> Starting task test-app.d4398af9-a67a-11e5-b1cf-fa163e920cd0
>>
>> root@mesos-slave:~# *docker exec -it
>> mesos-db70e09f-f39d-491c-8480-73d9858c140b-S0.d965f59b-cc1a-4081-95d2-f3370214c84d.executor
>> bash*
>> root@mesos-slave:/# ls -R /tmp/
>> hsperfdata_root/ mesos/
>> root@mesos-slave:/# ls -R /tmp/mesos/
>> /tmp/mesos/:
>> *slaves*
>>
>> /tmp/mesos/slaves:
>> *db70e09f-f39d-491c-8480-73d9858c140b-S0*
>>
>> /tmp/mesos/slaves/db70e09f-f39d-491c-8480-73d9858c140b-S0:
>> *frameworks*
>>
>> /tmp/mesos/slaves/db70e09f-f39d-491c-8480-73d9858c140b-S0/frameworks:
>> *246b272b-d649-47c0-88ca-6b1ff35f437a-*
>>
>>
>> /tmp/mesos/slaves/db70e09f-f39d-491c-8480-73d9858c140b-S0/frameworks/246b272b-d649-47c0-88ca-6b1ff35f437a-:
>> *executors*
>>
>>
>> /tmp/mesos/slaves/db70e09f-f39d-491c-8480-73d9858c140b-S0/frameworks/246b272b-d649-47c0-88ca-6b1ff35f437a-/executors:
>> *test-app.d4398af9-a67a-11e5-b1cf-fa163e920cd0*
>>
>>
>> /tmp/mesos/slaves/db70e09f-f39d-491c-8480-73d9858c140b-S0/frameworks/246b272b-d649-47c0-88ca-6b1ff35f437a-/executors/test-app.d4398af9-a67a-11e5-b1cf-fa163e920cd0:
>> *runs*
>>
>>
>> /tmp/mesos/slaves/db70e09f-f39d-491c-8480-73d9858c140b-S0/frameworks/246b272b-d649-47c0-88ca-6b1ff35f437a-/executors/test-app.d4398af9-a67a-11e5-b1cf-fa163e920cd0/runs:
>> *d965f59b-cc1a-4081-95d2-f3370214c84d*
>>
>>
>> /tmp/mesos/slaves/db70e09f-f39d-491c-8480-73d9858c140b-S0/frameworks/246b272b-d649-47c0-88ca-6b1ff35f437a-/executors/test-app.d4398af9-a67a-11e5-b1cf-fa163e920cd0/runs/d965f59b-cc1a-4081-95d2-f3370214c84d:
>> stderr  stdout
>>
>>
>> root@mesos-slave:~# *docker exec -it
>> mesos-db70e09f-f39d-491c-8480-73d9858c140b-S0.d965f59b-cc1a-4081-95d2-f3370214c84d
>> bash*
>> root@mesos-slave:/# env
>> HOSTNAME=mesos-slave
>> HOST=90.147.170.246
>> PORT0=31220
>> PORT_1=31220
>> MESOS_TASK_ID=test-app.d4398af9-a67a-11e5-b1cf-fa163e920cd0
>> PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>> PWD=/
>> PORTS=31220
>>
>> MESOS_CONTAINER_NAME=mesos-db70e09f-f39d-491c-8480-73d9858c140b-S0.d965f59b-cc1a-4081-95d2-f3370214c84d
>> SHLVL=1
>> HOME=/
>> MARATHON_APP_ID=/test-app
>> MARATHON_APP_DOCKER_IMAGE=libmesos/ubuntu
>> MARATHON_APP_VERSION=2015-12-19T18:03:37.542Z
>> PORT=31220
>> MESOS_SANDBOX=/mnt/mesos/sandbox
>> _=/usr/bin/env
>> root@mesos-slave:/# ls -R $MESOS_SANDBOX
>> /mnt/mesos/sandbox:
>> stderr stdout
>>
>>
>> root@mesos-slave:/# ls /var/log/mesos*
>> */var/log/mesos-slave.INFO*
>> /var/log/mesos-slave.mesos-slave.invalid-user.log.INFO.20151219-182512.20647
>>
>> /var/log/mesos:
>>
>>
>> Disabling the flag --docker_mesos_image, the fetcher is called, the log is
>> created, and the file is downloaded into the sandbox:
>>
>> root@mesos-slave:~# docker exec -it slave bash
>> root@mesos-slave:/#
>> root@mesos-slave:/#
>> root@mesos-slave:/#
>> root@mesos-slave:/# ls /var/log/mesos*
>> */var/log/mesos-fetcher.INFO*
>> /var/log/mesos-fetcher.mesos-slave.invalid-user.log.INFO.20151219-181621.20157
>> */var/log/mesos-slave.INFO*
>> /var/log/mesos-slave.mesos-slave.invalid-user.log.INFO.20151219-181612.20124
>>
>> /var/log/mesos:
>> root@mesos-slave:/#
>> root@mesos-slave:/# cat /var/log/mesos-fetcher.INFO
>> Log file created at: 2015/12/19 18:16:21
>> Running on machine: mesos-slave
>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>> I1219 18:16:21.127075 20157 logging.cpp:172] INFO level logging started!
>> I1219 18:16:21.127499 20157 fetcher.cpp:422] Fetcher Info:
>> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/db70e09f-f39d-491c-8480-73d9858c140b-S1","items":[{"action":"BYPASS_CACHE","uri":{"extract":false,"value":"http:\/

Re: Mesos fetcher in dockerized slave

2015-12-18 Thread Tim Chen
Can you share exactly how you run the slave in a docker container?

Tim

On Thu, Dec 17, 2015 at 1:11 PM, Marica Antonacci <
marica.antona...@ba.infn.it> wrote:

> No, using the socket:
>
> -v /var/run/docker.sock:/var/run/docker.sock
>
>
> Il giorno 17/dic/2015, alle ore 18:07, tommy xiao  ha
> scritto:
>
> docker in docker mode?
>
> 2015-12-17 19:08 GMT+08:00 Marica Antonacci :
>
>> Dear all,
>>
>> I'm testing the URI fetching mechanism for both Marathon applications
>> and Chronos jobs, and I have found that if the slave is running inside a
>> docker container (using the docker_mesos_image startup flag) and you
>> submit the deployment of a dockerized application or job, the fetcher step
>> is not performed. On the other hand, if I request the deployment of a
>> non-dockerized application, the URIs are correctly fetched. Moreover, if I
>> don't provide the docker_mesos_image flag, the fetcher works fine again for
>> both dockerized and non-dockerized applications.
>>
>> Therefore, it seems that the information about the URIs gets lost when
>> the dockerized mesos slave spawns the executor docker container that in
>> turn launches the application docker container… Has anyone seen this problem
>> before? I would like to know if there is a workaround or a fix.
>>
>> Thanks a lot in advance for you help
>> Best Regards,
>> Marica
>>
>>
>> --
>>
>> Marica ANTONACCI
>> INFN - National Institute of Nuclear Physics
>> Via Orabona 4
>> 70126 Bari - ITALY
>> Phone +39 080 5443244
>> Skype: marica.antonacci
>> e-mail marica.antona...@ba.infn.it
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>
>
> --
>
> Marica ANTONACCI
> INFN - National Institute of Nuclear Physics
> Via Orabona 4
> 70126 Bari - ITALY
> Phone +39 080 5443244
> Skype: marica.antonacci
> e-mail marica.antona...@ba.infn.it
>
>
>
>
>
>
>
>
>
>
>


Re: Mesos fetcher in dockerized slave

2015-12-18 Thread Tim Chen
Hi Marica,

Did you see the fetcher invoked at all in the slave logs? It doesn't seem
possible that we don't pass down the URI flags, and if the fetcher failed,
the container launch should have failed too.

Another possible situation is that the web UI is not really showing the exact
contents of the sandbox; can you actually go into the directory and see if
the files are there?

Thanks,

Tim

On Fri, Dec 18, 2015 at 4:23 PM, Marica Antonacci <
marica.antona...@ba.infn.it> wrote:

> Hi Tim,
>
> looking at the sandbox I can see only the stderr and stdout files (see the
> attached screenshot). If I remove --docker_mesos_image (and therefore the
> executor is run inside the slave container), the file specified in the URI
> field is shown in the sandbox.
> Did you verify that the fetcher is called when using the
> --docker_mesos_image flag?
>
> Thanks a lot for your feedback
> Marica
>
>
> Il giorno 19/dic/2015, alle ore 00:25, Tim Chen <t...@mesosphere.io> ha
> scritto:
>
> Hi Marica,
>
> It should work as we fetch all the files before we launch the executor and
> place them in the sandbox, and we mount the sandbox into that container as
> well.
>
> How did you verify that the file is not downloaded?
>
> Tim
>
> On Fri, Dec 18, 2015 at 5:26 AM, Marica Antonacci <
> marica.antona...@ba.infn.it> wrote:
>
>> Hi Grzegorz,
>>
>> I’m using this command line for docker run
>>
>> # docker run -d -e MESOS_HOSTNAME= -e MESOS_IP= -e
>> MESOS_MASTER=zk://:2181,:2181,:2181/mesos
>> -e MESOS_CONTAINERIZERS=docker,mesos -e
>> MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins -e MESOS_LOG_DIR=/var/log -e
>> MESOS_docker_mesos_image=mesos-slave -v /sys/fs/cgroup:/sys/fs/cgroup -v
>> /var/run/docker.sock:/var/run/docker.sock -v /tmp/mesos:/tmp/mesos --name
>> slave --net host --privileged --pid host mesos-slave
>>
>> where mesos-slave is the image built from the docker file in this repo
>> https://github.com/maricaantonacci/mesos-slave-dev
>>
>> I have successfully tested the deployment of dockerized applications
>> through Marathon and dockerized jobs through Chronos, and the recovery also
>> seems to work fine with the flag docker_mesos_image. What is not working
>> for me is the fetcher: it seems that when the executor is launched as a
>> separate container (thanks to the flag docker_mesos_image), the information
>> about the URIs to be downloaded is lost… I hope someone can help me
>> understand if this is a bug or I'm missing something.
>>
>> Cheers,
>> Marica
>>
>>
>>
>> Il giorno 18/dic/2015, alle ore 12:11, Grzegorz Graczyk <
>> gregor...@gmail.com> ha scritto:
>>
>> I've tried to use this flag, but cannot really run any container when
>> this flag is set.
>> I've raised this issue here:
>> https://www.mail-archive.com/user@mesos.apache.org/msg04975.html and
>> here:
>> https://github.com/mesosphere/docker-containers/issues/6#issuecomment-155364351
>>  but
>> sadly no one was able to help me...
>>
>> pt., 18.12.2015 o 11:33 użytkownik Marica Antonacci <
>> marica.antona...@ba.infn.it> napisał:
>>
>>> OK, the problem I spotted is related to the usage of the
>>> flag --docker_mesos_image, which causes the executor to be launched in a
>>> separate docker container:
>>>
>>> --docker_mesos_image=VALUE    The docker image used to launch this mesos
>>> slave instance. If an image is specified, the docker containerizer assumes
>>> the slave is running in a docker container, and launches executors with
>>> docker containers in order to recover them when the slave restarts and
>>> recovers.
>>> Has anyone used this flag and tested the behavior of the fetcher?
>>>
>>> Thank you
>>> Marica
>>>
>>>
>>> Il giorno 18/dic/2015, alle ore 10:38, tommy xiao <xia...@gmail.com> ha
>>> scritto:
>>>
>>> There is no docker_mesos_image flag in my docker run, and the docker image is
>>> built by myself.
>>>
>>>
>>>
>>> 2015-12-18 17:20 GMT+08:00 Marica Antonacci <marica.antona...@ba.infn.it
>>> >:
>>>
>>> Yes, I did check inside the container and the csv file was not
>>>> downloaded as shown also by the app details (see the screenshot below).
>>>>
>>>> Are you running your slave with the --docker_mesos_image flag? Can you
>>>> please provide me the docker run command you are using to run your
>>>> dockerized slave?
>>>>
>>>> Thank you very much
>>>>
>>> Marica
>>>>
>>>>

Re: Mesos fetcher in dockerized slave

2015-12-18 Thread Tim Chen
Hi Marica,

It should work as we fetch all the files before we launch the executor and
place them in the sandbox, and we mount the sandbox into that container as
well.

How did you verify that the file is not downloaded?
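
For example, listing the run's sandbox directly on the agent host should show
the fetched files next to stdout and stderr (the path pattern matches the
sandbox listings earlier in this thread; "latest" is the symlink to the
current run, and the IDs are placeholders):

$ ls /tmp/mesos/slaves/<agent-id>/frameworks/<framework-id>/executors/<executor-id>/runs/latest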

Tim

On Fri, Dec 18, 2015 at 5:26 AM, Marica Antonacci <
marica.antona...@ba.infn.it> wrote:

> Hi Grzegorz,
>
> I’m using this command line for docker run
>
> # docker run -d -e MESOS_HOSTNAME= -e MESOS_IP= -e
> MESOS_MASTER=zk://:2181,:2181,:2181/mesos
> -e MESOS_CONTAINERIZERS=docker,mesos -e
> MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins -e MESOS_LOG_DIR=/var/log -e
> MESOS_docker_mesos_image=mesos-slave -v /sys/fs/cgroup:/sys/fs/cgroup -v
> /var/run/docker.sock:/var/run/docker.sock -v /tmp/mesos:/tmp/mesos --name
> slave --net host --privileged --pid host mesos-slave
>
> where mesos-slave is the image built from the docker file in this repo
> https://github.com/maricaantonacci/mesos-slave-dev
>
> I have successfully tested the deployment of dockerized applications
> through Marathon and dockerized jobs through Chronos, and the recovery also
> seems to work fine with the flag docker_mesos_image. What is not working
> for me is the fetcher: it seems that when the executor is launched as a
> separate container (thanks to the flag docker_mesos_image), the information
> about the URIs to be downloaded is lost… I hope someone can help me
> understand if this is a bug or I'm missing something.
>
> Cheers,
> Marica
>
>
>
> Il giorno 18/dic/2015, alle ore 12:11, Grzegorz Graczyk <
> gregor...@gmail.com> ha scritto:
>
> I've tried to use this flag, but cannot really run any container when this
> flag is set.
> I've raised this issue here:
> https://www.mail-archive.com/user@mesos.apache.org/msg04975.html and
> here:
> https://github.com/mesosphere/docker-containers/issues/6#issuecomment-155364351
>  but
> sadly no one was able to help me...
>
> pt., 18.12.2015 o 11:33 użytkownik Marica Antonacci <
> marica.antona...@ba.infn.it> napisał:
>
>> OK, the problem I spotted is related to the usage of the
>> flag --docker_mesos_image, which causes the executor to be launched in a
>> separate docker container:
>>
>> --docker_mesos_image=VALUE    The docker image used to launch this mesos
>> slave instance. If an image is specified, the docker containerizer assumes
>> the slave is running in a docker container, and launches executors with
>> docker containers in order to recover them when the slave restarts and
>> recovers.
>> Has anyone used this flag and tested the behavior of the fetcher?
>>
>> Thank you
>> Marica
>>
>>
>> Il giorno 18/dic/2015, alle ore 10:38, tommy xiao  ha
>> scritto:
>>
>> There is no docker_mesos_image flag in my docker run, and the docker image is
>> built by myself.
>>
>>
>>
>> 2015-12-18 17:20 GMT+08:00 Marica Antonacci 
>> :
>>
>> Yes, I did check inside the container and the csv file was not downloaded
>>> as shown also by the app details (see the screenshot below).
>>>
>>> Are you running your slave with the --docker_mesos_image flag? Can you
>>> please provide me the docker run command you are using to run your
>>> dockerized slave?
>>>
>>> Thank you very much
>>>
>> Marica
>>>
>>>
>>> 
>>>
>>
>>>
>>> Il giorno 18/dic/2015, alle ore 10:00, tommy xiao  ha
>>> scritto:
>>>
>>> Hi Marica,
>>>
>>> Using your test-app json, I can run it correctly; the csv is truly
>>> downloaded by the mesos slave. Please check mesos-master:5050 and look at the
>>> task details for the downloaded files.
>>>
>>> You ask why the csv is not found in the app container: it is because the csv is
>>> downloaded into the slave container's folder, not into the app container. So if you run
>>>
>>> cd $MESOS_SANDBOX;
>>>
>>> the folder in app container is default value:
>>>
>>> MESOS_SANDBOX=/mnt/mesos/sandbox
>>> but in the real world, the sandbox is in the slave container, not in the app
>>> container.
>>>
>>>
>>>
>>> 2015-12-18 16:11 GMT+08:00 Marica Antonacci >> >:
>>>
 Thank you very much,

 I’m using a sample application definition file, just for testing
 purpose:

 {
  "id": "test-app",
  "container": {
"type": "DOCKER",
"docker": {
  "image": "libmesos/ubuntu"
}
  },
  "cpus": 1,
  "mem": 512,
  *"uris": [
 "http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv
 "
 ],*
  "cmd": "cd $MESOS_SANDBOX; ls -latr; while sleep 10; do date -u +%T;
 done"
 }

 Here is the docker run command line:

 # docker run -d -e MESOS_HOSTNAME= -e MESOS_IP= -e
 MESOS_MASTER=zk://:2181,:2181,:2181/mesos
 -e MESOS_CONTAINERIZERS=docker,mesos \
   -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins -e
 MESOS_LOG_DIR=/var/log -e MESOS_docker_mesos_image=mesos-slave
   -v /sys/fs/cgroup:/sys/fs/cgroup -v
 /var/run/docker.sock:/var/run/docker.sock --name slave --net host

Re: Mesos fetcher in dockerized slave

2015-12-18 Thread Tim Chen
Hi Shuai,

You need to specify the --pid=host flag.
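
With the docker run command quoted below, that would look something like:

sudo docker run -it --rm \
    -e MESOS_HOSTNAME=localhost \
    -e MESOS_IP=127.0.0.1 \
    -e MESOS_MASTER=zk://127.0.0.1:2181/mesos \
    -v /sys/fs/cgroup:/sys/fs/cgroup \
    -v /var/run/docker.sock:/var/run/docker.sock \
    --name mesos-slave \
    --net host \
    --pid=host \
    --privileged \
    mesoscloud/mesos-slave:0.24.1-ubuntu-14.04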

Tim

On Fri, Dec 18, 2015 at 5:19 AM, Shuai Lin  wrote:

> The problem happens to me if I don't specify the --docker_mesos_image
> flag. However, specifying the flag only makes things worse: the task
> fails again and again, but there does exist a container for this task.
>
> master and zookeeper is running on host, and slave is running inside a
> docker image:
>
> ```
> sudo docker run -it --rm \
> -e MESOS_HOSTNAME=localhost \
> -e MESOS_IP=127.0.0.1 \
> -e MESOS_MASTER=zk://127.0.0.1:2181/mesos \
> -v /sys/fs/cgroup:/sys/fs/cgroup \
> -v /var/run/docker.sock:/var/run/docker.sock \
> --name mesos-slave \
> --net host \
> --privileged \
> mesoscloud/mesos-slave:0.24.1-ubuntu-14.04
>
> ```
>
> However my setup may affect the outcome: master is 0.25.0 and slave is
> 0.24.1 (can't find a public docker image for mesos 2.5.1)
>
> Output of http http://127.0.0.1:8080/v2/apps (irrelevant parts omitted):
>
> ```
> {
>   "apps": [
>   "container": {
> "docker": {
>   "parameters": [],
>   "privileged": false,
>   "network": "BRIDGE",
>   "image": "testapp:latest"
> },
> "volumes": [],
> "type": "DOCKER"
>   },
>   "uris": [
> "https://google.com/robots.txt;
>   ],
> }
>   ]
> }
> ```
>
> On Fri, Dec 18, 2015 at 7:11 PM, Grzegorz Graczyk 
> wrote:
>
>> I've tried to use this flag, but cannot really run any container when
>> this flag is set.
>> I've raised this issue here:
>> https://www.mail-archive.com/user@mesos.apache.org/msg04975.html and
>> here:
>> https://github.com/mesosphere/docker-containers/issues/6#issuecomment-155364351
>>  but
>> sadly no one was able to help me...
>>
>> pt., 18.12.2015 o 11:33 użytkownik Marica Antonacci <
>> marica.antona...@ba.infn.it> napisał:
>>
>>> OK, the problem I spotted is related to the usage of the
>>> flag --docker_mesos_image, which causes the executor to be launched in a
>>> separate docker container:
>>>
>>> --docker_mesos_image=VALUE    The docker image used to launch this mesos
>>> slave instance. If an image is specified, the docker containerizer assumes
>>> the slave is running in a docker container, and launches executors with
>>> docker containers in order to recover them when the slave restarts and
>>> recovers.
>>> Has anyone used this flag and tested the behavior of the fetcher?
>>>
>>> Thank you
>>> Marica
>>>
>>>
>>> Il giorno 18/dic/2015, alle ore 10:38, tommy xiao  ha
>>> scritto:
>>>
>>> There is no docker_mesos_image flag in my docker run, and the docker image is
>>> built by myself.
>>>
>>>
>>>
>>> 2015-12-18 17:20 GMT+08:00 Marica Antonacci >> >:
>>>
>>> Yes, I did check inside the container and the csv file was not
 downloaded as shown also by the app details (see the screenshot below).

 Are you running your slave with the --docker_mesos_image flag? Can you
 please provide me the docker run command you are using to run your
 dockerized slave?

 Thank you very much

>>> Marica


 

>>>

 Il giorno 18/dic/2015, alle ore 10:00, tommy xiao 
 ha scritto:

 Hi Marica,

 use your test-app json, i can run it correctly, the csv is truely
 download by mesos slave. please check mesos-master:5050 to check the task
 detail download files.

 you describe the app container why not found the csv, because the csv
 is download in slave container's folder, not in app container. so if you
 run

 cd $MESOS_SANDBOX;

 the folder in app container is default value:

 MESOS_SANDBOX=/mnt/mesos/sandbox
 but in real world, the sandbox is in slave container, not in app
 container.



 2015-12-18 16:11 GMT+08:00 Marica Antonacci <
 marica.antona...@ba.infn.it>:

> Thank you very much,
>
> I’m using a sample application definition file, just for testing
> purpose:
>
> {
>  "id": "test-app",
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "libmesos/ubuntu"
>}
>  },
>  "cpus": 1,
>  "mem": 512,
>  *"uris": [
> "http://www.stat.cmu.edu/~cshalizi/402/lectures/16-glm-practicals/snoqualmie.csv
> "
> ],*
>  "cmd": "cd $MESOS_SANDBOX; ls -latr; while sleep 10; do date -u +%T;
> done"
> }
>
> Here is the docker run command line:
>
> # docker run -d -e MESOS_HOSTNAME= -e MESOS_IP= -e
> MESOS_MASTER=zk://:2181,:2181,:2181/mesos
> -e MESOS_CONTAINERIZERS=docker,mesos \
>   -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins -e
> MESOS_LOG_DIR=/var/log -e MESOS_docker_mesos_image=mesos-slave
>   -v 

Re: How does Mesos parse hadoop command??

2015-11-04 Thread Tim Chen
What OS are you running this with?

And I assume if you run /bin/sh and try to run hadoop it can be found in
your PATH as well?
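
One thing to keep in mind: the fetcher runs 'hadoop version' through a
non-interactive /bin/sh in the slave's own environment, and that shell does
not read ~/.bashrc, so the PATH and JAVA_HOME exported there are invisible to
it. A workaround sketch (paths copied from the .bashrc quoted below) is to
export them in whatever environment starts the slave, e.g.:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.1.el7_1.x86_64/jre/
export HADOOP_HOME=/opt/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
# ...then start mesos-slave from this environment (or set these in the
# service's init/default file).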

Tim

On Wed, Nov 4, 2015 at 12:34 AM, Du, Fan  wrote:

> Hi Mesos experts
>
> I set up a small mesos cluster with 1 master and 6 slaves,
> and deployed hdfs on the same cluster topology, both under the root user role.
>
> #cat spark-1.5.1-bin-hadoop2.6/conf/spark-env.sh
> export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
> export
> JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.1.el7_1.x86_64/jre/
> export SPARK_EXECUTOR_URI=hdfs://test/spark-1.5.1-bin-hadoop2.6.tgz
>
> When I run a simple SparkPi test
> #export MASTER=mesos://Mesos_Master_IP:5050
> #spark-1.5.1-bin-hadoop2.6/bin/run-example SparkPi 1
>
> I got this on slaves:
>
> I1104 22:24:02.238471 14518 fetcher.cpp:414] Fetcher Info:
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/556b49c1-7e6a-4f99-b320-c3f0c849e836-S6\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/test\/spark-1.5.1-bin-hadoop2.6.tgz"}}],"sandbox_directory":"\/ws\/mesos\/slaves\/556b49c1-7e6a-4f99-b320-c3f0c849e836-S6\/frameworks\/556b49c1-7e6a-4f99-b320-c3f0c849e836-0003\/executors\/556b49c1-7e6a-4f99-b320-c3f0c849e836-S6\/runs\/9ec70f41-67d5-4a95-999f-933f3aa9e261","user":"root"}
> I1104 22:24:02.240910 14518 fetcher.cpp:369] Fetching URI
> 'hdfs://test/spark-1.5.1-bin-hadoop2.6.tgz'
> I1104 22:24:02.240931 14518 fetcher.cpp:243] Fetching directly into the
> sandbox directory
> I1104 22:24:02.240952 14518 fetcher.cpp:180] Fetching URI
> 'hdfs://test/spark-1.5.1-bin-hadoop2.6.tgz'
> E1104 22:24:02.245264 14518 shell.hpp:90] Command 'hadoop version 2>&1'
> failed; this is the output:
> sh: hadoop: command not found
> Failed to fetch 'hdfs://test/spark-1.5.1-bin-hadoop2.6.tgz': Skipping
> fetch with Hadoop client: Failed to execute 'hadoop version 2>&1'; the
> command was either not found or exited with a non-zero exit status: 127
> Failed to synchronize with slave (it's probably exited)
>
>
> As for "sh: hadoop: command not found", it indicates when mesos executes
> "hadoop version" command,
> it cannot find any valid hadoop command, but actually when I log into the
> slave, "hadoop vesion"
> runs well, because I update hadoop path into PATH env.
>
> cat ~/.bashrc
> export
> JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.91-2.6.2.1.el7_1.x86_64/jre/
> export HADOOP_PREFIX=/opt/hadoop-2.6.0
> export HADOOP_HOME=$HADOOP_PREFIX
> export HADOOP_COMMON_HOME=$HADOOP_PREFIX
> export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
> export HADOOP_HDFS_HOME=$HADOOP_PREFIX
> export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
> export HADOOP_YARN_HOME=$HADOOP_PREFIX
> export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
>
> I also tried to set hadoop_home when launching mesos-slave, but no luck;
> the slave
> complains it can't find the JAVA_HOME env when executing "hadoop version".
>
> Finally I checked the Mesos code where this error happens; it looks quite
> straightforward.
>
>  ./src/hdfs/hdfs.hpp
>  44 // HTTP GET on hostname:port and grab the information in the
>  45 // ... (this is the best hack I can think of to get
>  46 // 'fs.default.name' given the tools available).
>  47 struct HDFS
>  48 {
>  49   // Look for `hadoop' first where proposed, otherwise, look for
>  50   // HADOOP_HOME, otherwise, assume it's on the PATH.
>  51   explicit HDFS(const std::string& _hadoop)
>  52 : hadoop(os::exists(_hadoop)
>  53  ? _hadoop
>  54  : (os::getenv("HADOOP_HOME").isSome()
>  55 ? path::join(os::getenv("HADOOP_HOME").get(),
> "bin/hadoop")
>  56 : "hadoop")) {}
>  57
>  58   // Look for `hadoop' in HADOOP_HOME or assume it's on the PATH.
>  59   HDFS()
>  60 : hadoop(os::getenv("HADOOP_HOME").isSome()
>  61  ? path::join(os::getenv("HADOOP_HOME").get(),
> "bin/hadoop")
>  62  : "hadoop") {}
>  63
>  64   // Check if hadoop client is available at the path that was set.
>  65   // This can be done by executing `hadoop version` command and
>  66   // checking for status code == 0.
>  67   Try<bool> available()
>  68   {
>  69 Try<std::string> command = strings::format("%s version", hadoop);
>  70
>  71 CHECK_SOME(command);
>  72
>  73 // We are piping stderr to stdout so that we can see the error (if
>  74 // any) in the logs emitted by `os::shell()` in case of failure.
>  75 Try<std::string> out = os::shell(command.get() + " 2>&1");
>  76
>  77 if (out.isError()) {
>  78   return Error(out.error());
>  79 }
>  80
>  81 return true;
>  82   }
>
> It puzzled me for a while; am I missing something obvious?
> Thanks in advance.
>
>


Re: Can't start docker container when SSL_ENABLED is on.

2015-10-31 Thread Tim Chen
nday and give you feedback.
>
> On Fri, Oct 30, 2015 at 11:30 AM, Xiaodong Zhang <xdzh...@alauda.io>
> wrote:
>
>> Anybody know about this?
>>
>> From: Xiaodong Zhang <xdzh...@alauda.io>
>> Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Date: Thursday, October 29, 2015, 7:38 PM
>>
>> To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>>
>> I think it is easy to reproduce this error.
>>
>> Start master with env:
>>
>> SSL_SUPPORT_DOWNGRADE
>> SSL_ENABLED
>> SSL_KEY_FILE
>> SSL_CERT_FILE
>>
>> Start slave with env:
>>
>> SSL_ENABLED
>> SSL_KEY_FILE
>> SSL_CERT_FILE
>> LIBPROCESS_ADVERTISE_IP
>>
>>
>> Then run a docker task via marathon.
>>
>> From: Xiaodong Zhang <xdzh...@alauda.io>
>> Date: Thursday, October 29, 2015, 3:09 PM
>> To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>>
>> So right now, mesos tasks work well but docker tasks don't.
>>
>> From: Xiaodong Zhang <xdzh...@alauda.io>
>> Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Date: Thursday, October 29, 2015, 2:08 PM
>> To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>>
>> I run a task by marathon:
>>
>> {
>>     "id": "basic-0",
>>     "cmd": "while [ true ] ; do echo 'Hello Marathon' ; sleep 5 ; done",
>>     "cpus": 0.1,
>>     "mem": 10.0,
>>     "instances": 1
>> }
>>
>>
>> It works well.
>>
>> <742629F2-78E8-43F2-9015-F3D22720826B.png>
>>
>> The docker task can pull the image but can't run, as I mentioned.
>>
>> My docker version is 1.5.0.
>>
>> From: Tim Chen <t...@mesosphere.io>
>> Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Date: Thursday, October 29, 2015, 1:48 PM
>> To: "user@mesos.apache.org" <user@mesos.apache.org>
>> Subject: Re: Can't start docker container when SSL_ENABLED is on.
>>
>> Does running a task without a docker container (the Mesos containerizer) work
>> with ssl in your environment?
>>
>> Tim
>>
>> On Wed, Oct 28, 2015 at 10:19 PM, Xiaodong Zhang <xdzh...@alauda.io>
>> wrote:
>>
>>> Thanks a lot. I find the log file in slave.
>>>
>>> One of the task:
>>>
>>> Stdout:
>>>
>>> --container="mesos-20151029-043755-3549436724-5050-5674-S0.e2c2580f-8082-4f17-b0cc-4e32e040d444"
>>> --docker="/home/ubuntu/luna/bin/docker" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/20151029-043755-3549436724-5050-5674-S0/frameworks/20151029-043755-3549436724-5050-5674-/executors/e4a3bed5-64e6-4970-8bb1-df6404656a48.e3a20f3b-7dfb-11e5-b57b-0247b493b22f/runs/e2c2580f-8082-4f17-b0cc-4e32e040d444"
>>> --stop_timeout="0ns"
>>> --container="mesos-20151029-043755-3549436724-5050-5674-S0.e2c2580f-8082-4f17-b0cc-4e32e040d444"
>>> --docker="/home/ubuntu/luna/bin/docker" --help="false"
>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false"
>>> --sandbox_directory="/tmp/mesos/slaves/20151029-043755-3549436724-5050-5674-S0/frameworks/20151029-043755-3549436724-5050-5674-/executors/e4a3bed5-64e6-4970-8bb1-df6404656a48.e3a20f3b-7dfb-11e5-b57b-0247b493b22f/runs/e2c2580f-8082-4f17-b0cc-4e32e040d444"
>>> --stop_timeout="0ns"
>>> Shutting down
>>>
>>> Stderr:
>>>
>>> I1029 05:14:06.529364 27862 fetcher.cpp:414] Fetcher Info:
>>> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/20151029-043755-3549436724-5050-5674-S0","items":[{"action":"BYPASS_CACHE","uri":{"extract":false,"value":"file:\/\/\/etc\/.dockercfg"}}],"sandbox_directory":"\/tmp\/mesos\/slaves\/20151029-043755-3549436724-5050-5674-S0\/frameworks\/20151029-043755-3549436724-5050-5674-\/executors\/e4a3bed5-64e6-4970-8bb1-df6404656a48.e3a20f3b-7dfb-11e5-b57b-024

Re: Can't start docker container when SSL_ENABLED is on.

2015-10-28 Thread Tim Chen
Hi Xiaodong,

That's the master log, but if you click on "sandbox" next to the
TASK_FAILED task and find the stdout/stderr files, click on them and paste
the results here.

Tim

On Wed, Oct 28, 2015 at 9:59 PM, Xiaodong Zhang  wrote:

>
> The web UI has a LOG link; when clicked, it shows this:
>
> I1029 04:44:32.293445  5697 http.cpp:321] HTTP GET for /master/state.json
> from 114.113.20.135:55682 with User-Agent='Mozilla/5.0 (Macintosh; Intel
> Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko)
> Chrome/46.0.2490.71 Safari/537.36'
> I1029 04:44:34.533504  5704 master.cpp:4613] Sending 1 offers to framework
> 20151029-043755-3549436724-5050-5674- (marathon) at
> scheduler-b532233f-2fc5-4455-b1e6-7a66ae79a8b9@172.31.43.77:53373
> I1029 04:44:34.539579  5702 master.cpp:2739] Processing ACCEPT call for
> offers: [ 20151029-043755-3549436724-5050-5674-O2 ] on slave
> 20151029-043755-3549436724-5050-5674-S0 at slave(1)@50.112.136.148:5051 (
> ec2-50-112-136-148.us-west-2.compute.amazonaws.com) for framework
> 20151029-043755-3549436724-5050-5674- (marathon) at
> scheduler-b532233f-2fc5-4455-b1e6-7a66ae79a8b9@172.31.43.77:53373
> I1029 04:44:34.539710  5702 hierarchical.hpp:814] Recovered cpus(*):1;
> mem(*):999; disk(*):3962; ports(*):[31000-32000] (total: cpus(*):1;
> mem(*):999; disk(*):3962; ports(*):[31000-32000], allocated: ) on slave
> 20151029-043755-3549436724-5050-5674-S0 from framework
> 20151029-043755-3549436724-5050-5674-
> I1029 04:44:37.360901  5703 master.cpp:4294] Performing implicit task
> state reconciliation for framework
> 20151029-043755-3549436724-5050-5674- (marathon) at
> scheduler-b532233f-2fc5-4455-b1e6-7a66ae79a8b9@172.31.43.77:53373
> I1029 04:44:40.539989  5704 master.cpp:4613] Sending 1 offers to framework
> 20151029-043755-3549436724-5050-5674- (marathon) at
> scheduler-b532233f-2fc5-4455-b1e6-7a66ae79a8b9@172.31.43.77:53373
> I1029 04:44:40.610321  5702 master.cpp:2739] Processing ACCEPT call for
> offers: [ 20151029-043755-3549436724-5050-5674-O3 ] on slave
> 20151029-043755-3549436724-5050-5674-S0 at slave(1)@50.112.136.148:5051 (
> ec2-50-112-136-148.us-west-2.compute.amazonaws.com) for framework
> 20151029-043755-3549436724-5050-5674- (marathon) at
> scheduler-b532233f-2fc5-4455-b1e6-7a66ae79a8b9@172.31.43.77:53373
> I1029 04:44:40.610846  5702 master.hpp:170] Adding task
> e4a3bed5-64e6-4970-8bb1-df6404656a48.c4239b84-7df7-11e5-b57b-0247b493b22f
> with resources cpus(*):0.0625; mem(*):256; ports(*):[31864-31864] on slave
> 20151029-043755-3549436724-5050-5674-S0 (
> ec2-50-112-136-148.us-west-2.compute.amazonaws.com)
> I1029 04:44:40.610911  5702 master.cpp:3069] Launching task
> e4a3bed5-64e6-4970-8bb1-df6404656a48.c4239b84-7df7-11e5-b57b-0247b493b22f
> of framework 20151029-043755-3549436724-5050-5674- (marathon) at
> scheduler-b532233f-2fc5-4455-b1e6-7a66ae79a8b9@172.31.43.77:53373 with
> resources cpus(*):0.0625; mem(*):256; ports(*):[31864-31864] on slave
> 20151029-043755-3549436724-5050-5674-S0 at slave(1)@50.112.136.148:5051 (
> ec2-50-112-136-148.us-west-2.compute.amazonaws.com)
> I1029 04:44:40.611095  5702 hierarchical.hpp:814] Recovered
> cpus(*):0.9375; mem(*):743; disk(*):3962; ports(*):[31000-31863,
> 31865-32000] (total: cpus(*):1; mem(*):999; disk(*):3962;
> ports(*):[31000-32000], allocated: cpus(*):0.0625; mem(*):256;
> ports(*):[31864-31864]) on slave 20151029-043755-3549436724-5050-5674-S0
> from framework 20151029-043755-3549436724-5050-5674-
> I1029 04:44:43.324970  5698 http.cpp:321] HTTP GET for /master/state.json
> from 114.113.20.135:55682 with User-Agent='Mozilla/5.0 (Macintosh; Intel
> Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko)
> Chrome/46.0.2490.71 Safari/537.36'
> I1029 04:44:46.546671  5703 master.cpp:4613] Sending 1 offers to framework
> 20151029-043755-3549436724-5050-5674- (marathon) at
> scheduler-b532233f-2fc5-4455-b1e6-7a66ae79a8b9@172.31.43.77:53373
> I1029 04:44:46.557266  5699 master.cpp:2739] Processing ACCEPT call for
> offers: [ 20151029-043755-3549436724-5050-5674-O4 ] on slave
> 20151029-043755-3549436724-5050-5674-S0 at slave(1)@50.112.136.148:5051 (
> ec2-50-112-136-148.us-west-2.compute.amazonaws.com) for framework
> 20151029-043755-3549436724-5050-5674- (marathon) at
> scheduler-b532233f-2fc5-4455-b1e6-7a66ae79a8b9@172.31.43.77:53373
> I1029 04:44:46.557394  5699 hierarchical.hpp:814] Recovered
> cpus(*):0.9375; mem(*):743; disk(*):3962; ports(*):[31000-31863,
> 31865-32000] (total: cpus(*):1; mem(*):999; disk(*):3962;
> ports(*):[31000-32000], allocated: cpus(*):0.0625; mem(*):256;
> ports(*):[31864-31864]) on slave 20151029-043755-3549436724-5050-5674-S0
> from framework 20151029-043755-3549436724-5050-5674-
> I1029 04:44:47.267562  5700 master.cpp:4069] Status update TASK_FAILED
> (UUID: 0ea607fc-bf24-4bda-b107-55a54aba31cf) for task
> e4a3bed5-64e6-4970-8bb1-df6404656a48.c4239b84-7df7-11e5-b57b-0247b493b22f
> of framework 

Re: Spark Job Submitting on Mesos Cluster

2015-09-14 Thread Tim Chen
Thanks Haosdent!

Tim

On Mon, Sep 14, 2015 at 1:29 AM, SLiZn Liu <sliznmail...@gmail.com> wrote:

> I found the --no-switch_user flag in mesos slave configuration. Will give
> it a try. Thanks Tim, and haosdent !
> ​
>
> On Mon, Sep 14, 2015 at 4:15 PM haosdent <haosd...@gmail.com> wrote:
>
>> > turn off --switch-user flag in the Mesos slave
>> --no-switch_user :-)
>>
>> On Mon, Sep 14, 2015 at 4:03 PM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> Actually, --proxy-user is more about which user you're impersonating to
>>> run the driver, not the user that is going to be passed to Mesos to run
>>> as.
>>>
>>> The way to use a particular user when running a spark job is to set the
>>> SPARK_USER environment variable, and that user will be passed to Mesos.
>>>
>>> Alternatively you can also turn off the --switch_user flag in the Mesos
>>> slave so that all jobs will just use the slave's current user.
>>>
>>> Tim
>>>
>>> On Sun, Sep 13, 2015 at 11:20 PM, SLiZn Liu <sliznmail...@gmail.com>
>>> wrote:
>>>
>>>> Thx Tommy, did you mean add proxy user like this:
>>>>
>>>> spark-submit --proxy-user  ...
>>>>
>>>> where represents the user who started Mesos?
>>>>
>>>> and is this parameter documented anywhere?
>>>> ​
>>>>
>>>> On Mon, Sep 14, 2015 at 1:34 PM tommy xiao <xia...@gmail.com> wrote:
>>>>
>>>>> @SLiZn Liu, yes, you need to add the proxy_user parameter, and your cluster
>>>>> should have the proxy_user in /etc/passwd on every node.
>>>>>
>>>>> 2015-09-14 13:05 GMT+08:00 haosdent <haosd...@gmail.com>:
>>>>>
>>>>>> Do you start your mesos cluster with root?
>>>>>>
>>>>>> On Mon, Sep 14, 2015 at 12:10 PM, SLiZn Liu <sliznmail...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Mesos Users,
>>>>>>>
>>>>>>> I’m trying to run Spark jobs on my Mesos cluster. However I
>>>>>>> discovered that my Spark job must be submitted by the same user who 
>>>>>>> started
>>>>>>> Mesos, otherwise a ExecutorLostFailure will rise, and the job won’t
>>>>>>> be executed. Is there anyway that every user share a same Mesos cluster 
>>>>>>> in
>>>>>>> harmony? =D
>>>>>>>
>>>>>>> BR,
>>>>>>> Todd Leo
>>>>>>> ​
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Haosdent Huang
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Deshi Xiao
>>>>> Twitter: xds2000
>>>>> E-mail: xiaods(AT)gmail.com
>>>>>
>>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>


Re: Spark Job Submitting on Mesos Cluster

2015-09-14 Thread Tim Chen
Actually, --proxy-user is more about which user you're impersonating to run
the driver, not the user that is going to be passed to Mesos to run as.

The way to use a partciular user when running a spark job is to set the
SPARK_USER environment variable, and that user will be passed to Mesos.

Atlernatively you can also turn off --switch-user flag in the Mesos slave
so that all jobs will just use the Slave's current user.
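For example (a minimal sketch; the user name, master address, and jar are
illustrative, not from this thread):

  $ export SPARK_USER=analytics
  $ bin/spark-submit --master mesos://<master-host>:5050 \
      --class com.example.MyJob my-job.jar

  # or, on every slave, disable user switching so tasks run as the slave's user:
  $ mesos-slave --master=zk://<zk-host>:2181/mesos --no-switch_user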

Tim

On Sun, Sep 13, 2015 at 11:20 PM, SLiZn Liu  wrote:

> Thx Tommy, did you mean add proxy user like this:
>
> spark-submit --proxy-user  ...
>
> where represents the user who started Mesos?
>
> and is this parameter documented anywhere?
> ​
>
> On Mon, Sep 14, 2015 at 1:34 PM tommy xiao  wrote:
>
>> @SLiZn Liu, yes, you need to add the proxy_user parameter, and your cluster
>> should have that proxy user in /etc/passwd on every node.
>>
>> 2015-09-14 13:05 GMT+08:00 haosdent :
>>
>>> Do you start your mesos cluster with root?
>>>
>>> On Mon, Sep 14, 2015 at 12:10 PM, SLiZn Liu 
>>> wrote:
>>>
 Hi Mesos Users,

 I’m trying to run Spark jobs on my Mesos cluster. However, I discovered
 that my Spark job must be submitted by the same user who started Mesos,
 otherwise an ExecutorLostFailure will arise and the job won’t be
 executed. Is there any way that every user can share the same Mesos cluster
 in harmony? =D

 BR,
 Todd Leo
 ​

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>>
>>
>> --
>> Deshi Xiao
>> Twitter: xds2000
>> E-mail: xiaods(AT)gmail.com
>>
>


Re: mesos-slave crashing with CHECK_SOME

2015-09-02 Thread Tim Chen
Hi Scott,

I wonder if you can try the latest Mesos and see if you can repro this?

And if it is, can you put down the example task and steps? I couldn't see a
disk-full error in your slave log, so I'm not sure if it's exactly the same
problem as MESOS-2684.

Tim

On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin  wrote:

> Hi Marco,
>
> I certainly don’t want to start a flame war, and I actually realized after
> I added my comment to MESOS-2684 that it’s not quite the same thing.
>
> As far as I can tell, in our situation, there’s no underlying disk issue.
> It seems like this is some sort of race condition (maybe?) with docker
> containers and executors shutting down.  I’m perfectly happy with Mesos
> choosing to shut down in the case of a failure or unexpected situation –
> that’s a methodology that we adopt ourselves.  I’m just trying to get a
> little more information about what the underlying issue is so that we can
> resolve it. I don’t know enough about Mesos internals to be able to answer
> that question just yet.
>
> It’s also inconvenient because, while Mesos is well-behaved and restarts
> gracefully, as of 0.22.1, it’s not recovering the Docker executors – so a
> mesos-slave crash also brings down applications.
>
> Thanks,
> Scott
>
> From: Marco Massenzio
> Reply-To: "user@mesos.apache.org"
> Date: Tuesday, September 1, 2015 at 7:33 PM
> To: "user@mesos.apache.org"
> Subject: Re: mesos-slave crashing with CHECK_SOME
>
> That's one of those areas for discussions that is so likely to generate a
> flame war that I'm hesitant to wade in :)
>
> In general, I would agree with the sentiment expressed there:
>
> > If the task fails, that is unfortunate, but not the end of the world.
> Other tasks should not be affected.
>
> which is, in fact, to a large extent exactly what Mesos does; the example
> given in MESOS-2684, as it happens, is for a "disk full failure" - carrying
> on as if nothing had happened is only likely to lead to further (and
> worse) disappointment.
>
> The general philosophy back at Google (and which certainly informs the
> design of Borg[0]) was "fail early, fail hard" so that either (a) the
> service is restarted and hopefully the root cause cleared or (b) someone
> (who can hopefully do something) will be alerted about it.
>
> I think it's ultimately a matter of scale: up to a few tens of servers,
> you can assume there is some sort of 'log-monitor' that looks out for
> errors and other anomalies and alerts humans that will then take a look and
> possibly apply some corrective action - when you're up to hundreds or
> thousands (definitely Mesos territory) that's not practical: the system
> should either self-heal or crash-and-restart.
>
> All this to say, that it's difficult to come up with a general *automated*
> approach to unequivocally decide if a failure is "fatal" or could just be
> safely "ignored" (after appropriate error logging) - in general, when in
> doubt it's probably safer to "noisily crash & restart" and rely on the
> overall system's HA architecture to take care of replication and
> consistency.
> (and an intelligent monitoring system that only alerts when some failure
> threshold is exceeded).
>
> From what I've seen so far (granted, still a novice here) it seems that
> Mesos subscribes to this notion, assuming that Agent Nodes will come and
> go, and usually Tasks survive (for a certain amount of time anyway) a Slave
> restart (obviously, if the physical h/w is the ultimate cause of failure,
> well, then all bets are off).
>
> Having said all that - if there are areas where we have been over-eager
> with our CHECKs, we should definitely revisit that and make it more
> crash-resistant, absolutely.
>
> [0] http://research.google.com/pubs/pub43438.html
>
> *Marco Massenzio*
>
> *Distributed Systems Engineer http://codetrips.com *
>
> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
> sschlans...@opentable.com> wrote:
>
>>
>>
>> On Aug 31, 2015, at 11:54 AM, Scott Rankin  wrote:
>> >
>> > tag=mesos-slave[12858]:  F0831 09:37:29.838184 12898 slave.cpp:3354]
>> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>>
>> I reported a similar bug a while back:
>>
>> https://issues.apache.org/jira/browse/MESOS-2684
>>
>> This seems to be a class of bugs where some filesystem operations which
>> may fail for unforeseen reasons are written as assertions which crash the
>> process, rather than failing only the task and communicating back the error
>> reason.
>>
>>
>>

Re: Use docker start rather than docker run?

2015-08-28 Thread Tim Chen
We have primitives for persistent volumes in next release (0.25.0) but
DockerContainerizer integration will happen most likely the version after.

Tim

On Fri, Aug 28, 2015 at 11:50 AM, Tim Chen t...@mesosphere.io wrote:

 Hi Paul,

 Alternatively, as a work-around, you can try to launch your task on the same
 host by specifying a constraint with Marathon and mounting a directory from
 the host into your container every time.

 Tim

 On Fri, Aug 28, 2015 at 11:44 AM, Paul Bell arach...@gmail.com wrote:

 Alex & Tim,

 Thank you both; most helpful.

 Alex, can you dispel my confusion on this point: I keep reading that a
 framework in Mesos (e.g., Marathon) consists of a scheduler and an
 executor. This reference to executor made me think that Marathon must
 have *some* kind of presence on the slave node. But the more familiar I
 become with Mesos the less likely this seems to me. So, what does it mean
 to talk about the Marathon framework executor?

 Tim, I did come up with a simple work-around that involves re-copying the
 needed file into the container each time the application is started. For
 reasons unknown, this file is not kept in a location that would readily
 lend itself to my use of persistent storage (Docker -v). That said, I am
 keenly interested in learning how to write both custom executors &
 schedulers. Any sense for what release of Mesos will see persistent
 volumes?

 Thanks again, gents.

 -Paul



 On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen t...@mesosphere.io wrote:

 Hi Paul,

 We don't [re]start a container since we assume that once the task has
 terminated the container is no longer reused. In Mesos, to allow tasks to
 reuse the same executor and handle task logic accordingly, people usually
 opt for the custom executor route.

 We're working on a way to keep your sandbox data beyond a container
 lifecycle, which is called persistent volumes. We haven't integrated that
 with Docker containerizer yet, so you'll have to wait to use that feature.

 You could also choose to implement a custom executor for now if you like.

 Tim

 On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov a...@mesosphere.com
 wrote:

 Paul,

 that component is called DockerContainerizer and it's part of Mesos
 Agent (check
 /Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp). @Tim,
 could you answer the docker start vs. docker run question?

 On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell arach...@gmail.com wrote:

 Hi All,

 I first posted this to the Marathon list, but someone suggested I try
 it here.

 I'm still not sure what component (mesos-master, mesos-slave,
 marathon) generates the docker run command that launches containers on a
 slave node. I suppose that it's the framework executor (Marathon) on the
 slave that actually executes the docker run, but I'm not sure.

 What I'm really after is whether or not we can cause the use of
 docker start rather than docker run.

 At issue here is some persistent data inside
 /var/lib/docker/aufs/mnt/CTR_ID. docker run will by design (re)launch
 my application with a different CTR_ID effectively rendering that data
 inaccessible. But docker start will restart the container and its old
 data will still be there.
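 For illustration (a minimal sketch; the image and container names are made up):

   $ docker run --name myapp myrepo/myapp   # first launch: creates a new container ID
   $ docker start -a myapp                  # later: restarts the *same* container,
                                            # so files written inside it are still there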

 Thanks.

 -Paul








Re: Use docker start rather than docker run?

2015-08-28 Thread Tim Chen
Hi Paul,

Alternatively, as a work-around, you can try to launch your task on the same
host by specifying a constraint with Marathon and mounting a directory from
the host into your container every time.
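A minimal Marathon app sketch of that work-around (all values are illustrative):

  {
    "id": "/my-app",
    "cpus": 0.5,
    "mem": 256,
    "instances": 1,
    "constraints": [["hostname", "CLUSTER", "slave-1.example.com"]],
    "container": {
      "type": "DOCKER",
      "docker": { "image": "myrepo/myapp:latest" },
      "volumes": [
        { "hostPath": "/var/data/myapp", "containerPath": "/data", "mode": "RW" }
      ]
    }
  }

The hostname CLUSTER constraint pins the app to one slave, so the same host
directory is mounted on every restart.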

Tim

On Fri, Aug 28, 2015 at 11:44 AM, Paul Bell arach...@gmail.com wrote:

 Alex & Tim,

 Thank you both; most helpful.

 Alex, can you dispel my confusion on this point: I keep reading that a
 framework in Mesos (e.g., Marathon) consists of a scheduler and an
 executor. This reference to executor made me think that Marathon must
 have *some* kind of presence on the slave node. But the more familiar I
 become with Mesos the less likely this seems to me. So, what does it mean
 to talk about the Marathon framework executor?

 Tim, I did come up with a simple work-around that involves re-copying the
 needed file into the container each time the application is started. For
 reasons unknown, this file is not kept in a location that would readily
 lend itself to my use of persistent storage (Docker -v). That said, I am
 keenly interested in learning how to write both custom executors &
 schedulers. Any sense for what release of Mesos will see persistent
 volumes?

 Thanks again, gents.

 -Paul



 On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen t...@mesosphere.io wrote:

 Hi Paul,

 We don't [re]start a container since we assume that once the task has
 terminated the container is no longer reused. In Mesos, to allow tasks to
 reuse the same executor and handle task logic accordingly, people usually
 opt for the custom executor route.

 We're working on a way to keep your sandbox data beyond a container
 lifecycle, which is called persistent volumes. We haven't integrated that
 with Docker containerizer yet, so you'll have to wait to use that feature.

 You could also choose to implement a custom executor for now if you like.

 Tim

 On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov a...@mesosphere.com
 wrote:

 Paul,

 that component is called DockerContainerizer and it's part of Mesos
 Agent (check
 /Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp). @Tim,
 could you answer the docker start vs. docker run question?

 On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell arach...@gmail.com wrote:

 Hi All,

 I first posted this to the Marathon list, but someone suggested I try
 it here.

 I'm still not sure what component (mesos-master, mesos-slave, marathon)
 generates the docker run command that launches containers on a slave
 node. I suppose that it's the framework executor (Marathon) on the slave
 that actually executes the docker run, but I'm not sure.

 What I'm really after is whether or not we can cause the use of docker
 start rather than docker run.

 At issue here is some persistent data inside
 /var/lib/docker/aufs/mnt/CTR_ID. docker run will by design (re)launch
 my application with a different CTR_ID effectively rendering that data
 inaccessible. But docker start will restart the container and its old
 data will still be there.

 Thanks.

 -Paul







Re: [VOTE] Release Apache Mesos 0.24.0 (rc1)

2015-08-27 Thread Tim Chen
I'm working on a quick fix for the test; if we can just apply it we
shouldn't need to worry about this anymore.

Tim

On Thu, Aug 27, 2015 at 4:27 PM, Jie Yu yujie@gmail.com wrote:

 Tim, maybe just remove CgroupsCpushareIsolatorProcess
 from CgroupsIsolatorTypes and add a TODO there for this release?

 We can definitely work around it (like you said).

 - Jie

 On Thu, Aug 27, 2015 at 4:05 PM, Timothy Chen tnac...@gmail.com wrote:

 That test is failing because of a weird bug in CentOS 7 not naming the
 cgroups correctly (or at least not following the pattern of every other
 OS).

 I filed a CentOS bug but no response so far, if we want to fix it we
 will have to work around this problem by hardcoding another cgroup
 name to test cpuacct,cpu.

 Tim

 On Thu, Aug 27, 2015 at 4:00 PM, Vinod Kone vinodk...@apache.org wrote:
  Happy to cut another RC.
 
  IIUC, https://reviews.apache.org/r/37684 doesn't fix the below test.
 
  [  FAILED  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where
  TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess
 
  Is someone working on fixing that (MESOS-3294
  https://issues.apache.org/jira/browse/MESOS-3294)? If yes, I would
 wait a
  day or two to get that in.
 
  Any other issues people have encountered with RC1?
 
 
 
  On Thu, Aug 27, 2015 at 3:45 PM, Niklas Nielsen nik...@mesosphere.io
  wrote:
 
  If it is that easy to fix, why not get it in?
 
  How about https://issues.apache.org/jira/browse/MESOS-3053 (which
  Haosdent ran into)?
 
  On 27 August 2015 at 15:36, Jie Yu yujie@gmail.com wrote:
 
  Niklas,
 
  This is the known problem reported by Marco. I am OK with both because
  the linux filesystem isolator cannot be used in 0.24.0.
 
  If you guys prefer to cut another RC, here is the patch that needs to
 be
  cherry picked:
 
  commit 3ecd54320397c3a813d555f291b51778372e273b
  Author: Greg Mann g...@mesosphere.io
  Date:   Fri Aug 21 13:21:10 2015 -0700
 
  Added symlink test for /bin, lib, and /lib64 when preparing test
 root
  filesystem.
 
  Review: https://reviews.apache.org/r/37684
 
 
 
  On Thu, Aug 27, 2015 at 3:30 PM, Niklas Nielsen nik...@mesosphere.io
 
  wrote:
 
  -1: sudo make check on centos 7
 
  [--] Global test environment tear-down
 
  [==] 793 tests from 121 test cases ran. (606946 ms total)
 
  [  PASSED  ] 786 tests.
 
  [  FAILED  ] 7 tests, listed below:
 
  [  FAILED  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where
  TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess
 
  [  FAILED  ] LinuxFilesystemIsolatorTest.ROOT_ChangeRootFilesystem
 
  [  FAILED  ] LinuxFilesystemIsolatorTest.ROOT_VolumeFromSandbox
 
  [  FAILED  ] LinuxFilesystemIsolatorTest.ROOT_VolumeFromHost
 
  [  FAILED  ]
  LinuxFilesystemIsolatorTest.ROOT_VolumeFromHostSandboxMountPoint
 
  [  FAILED  ]
  LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithRootFilesystem
 
  [  FAILED  ] MesosContainerizerLaunchTest.ROOT_ChangeRootfs
 
  Configured with:
 
  ../mesos/configure --prefix=/home/vagrant/releases/0.24.0/
  --disable-python
 
  On 26 August 2015 at 17:00, Khanduja, Vaibhav 
 vaibhav.khand...@emc.com
  wrote:
 
  +1
 
   On Aug 26, 2015, at 4:43 PM, Vinod Kone vinodk...@gmail.com
 wrote:
  
   Pinging the thread for more (binding) votes. Hopefully people have
  caught
   up with emails after Mesos madness.
  
   On Wed, Aug 19, 2015 at 1:28 AM, haosdent haosd...@gmail.com
  wrote:
  
   +1
  
    OS: Ubuntu 14.04
   Verify command: sudo make -j8 check
   Compiler: Both gcc4.8 and clang3.5
   Configuration: default configuration
   Result: all tests(828 tests) pass
  
   MESOS-3053 https://issues.apache.org/jira/browse/MESOS-3053 is
  because
   need update add iptable first.
  
   On Wed, Aug 19, 2015 at 2:39 PM, haosdent haosd...@gmail.com
  wrote:
  
   Could not
   pass
 DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged in
  Ubuntu
   14.04. Already have a issue for this
   https://issues.apache.org/jira/browse/MESOS-3053, it is
 acceptable?
  
   On Wed, Aug 19, 2015 at 12:55 PM, Marco Massenzio 
  ma...@mesosphere.io
   wrote:
  
   +1 (non-binding)
  
   All tests (including ROOT) pass on:
   Ubuntu 14.04 (physical box)
  
   All non-ROOT tests pass on:
   CentOS 7 (VirtualBox VM)
  
   Known issue (MESOS-3050) for ROOT tests on CentOS 7,
 non-blocker.
  
   Thanks,
  
   *Marco Massenzio*
  
    *Distributed Systems Engineer http://codetrips.com*
  
   On Tue, Aug 18, 2015 at 3:26 PM, Vinod Kone 
 vinodk...@apache.org
   wrote:
  
   0.24.0 includes the following:
  
  
  
 
 
  
   Experimental support for v1 scheduler HTTP API!
  
   This release also wraps up support for fetcher.
  
  
   The CHANGELOG for the release is available at:
  
  
  
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.24.0-rc1
  
  
  
 
 

Re: [VOTE] Release Apache Mesos 0.24.0 (rc1)

2015-08-27 Thread Tim Chen
The fix is in now. Vinod, can you include the test fix?

https://issues.apache.org/jira/browse/MESOS-3294

Tim

On Thu, Aug 27, 2015 at 4:46 PM, Tim Chen t...@mesosphere.io wrote:

 I'm working on a quick fix for the test; if we can just apply it we
 shouldn't need to worry about this anymore.

 Tim

 On Thu, Aug 27, 2015 at 4:27 PM, Jie Yu yujie@gmail.com wrote:

 Tim, maybe just remove CgroupsCpushareIsolatorProcess
 from CgroupsIsolatorTypes and add a TODO there for this release?

 We can definitely work around it (like you said).

 - Jie

 On Thu, Aug 27, 2015 at 4:05 PM, Timothy Chen tnac...@gmail.com wrote:

 That test is failing because of a weird bug in CentOS 7 not naming the
 cgroups correctly (or at least not following the pattern of every other
 OS).

 I filed a CentOS bug but no response so far, if we want to fix it we
 will have to work around this problem by hardcoding another cgroup
 name to test cpuacct,cpu.

 Tim

 On Thu, Aug 27, 2015 at 4:00 PM, Vinod Kone vinodk...@apache.org
 wrote:
  Happy to cut another RC.
 
  IIUC, https://reviews.apache.org/r/37684 doesn't fix the below test.
 
  [  FAILED  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where
  TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess
 
  Is someone working on fixing that (MESOS-3294
  https://issues.apache.org/jira/browse/MESOS-3294)? If yes, I would
 wait a
  day or two to get that in.
 
  Any other issues people have encountered with RC1?
 
 
 
  On Thu, Aug 27, 2015 at 3:45 PM, Niklas Nielsen nik...@mesosphere.io
  wrote:
 
  If it is that easy to fix, why not get it in?
 
  How about https://issues.apache.org/jira/browse/MESOS-3053 (which
  Haosdent ran into)?
 
  On 27 August 2015 at 15:36, Jie Yu yujie@gmail.com wrote:
 
  Niklas,
 
  This is the known problem reported by Marco. I am OK with both
 because
  the linux filesystem isolator cannot be used in 0.24.0.
 
  If you guys prefer to cut another RC, here is the patch that needs
 to be
  cherry picked:
 
  commit 3ecd54320397c3a813d555f291b51778372e273b
  Author: Greg Mann g...@mesosphere.io
  Date:   Fri Aug 21 13:21:10 2015 -0700
 
  Added symlink test for /bin, lib, and /lib64 when preparing test
 root
  filesystem.
 
  Review: https://reviews.apache.org/r/37684
 
 
 
  On Thu, Aug 27, 2015 at 3:30 PM, Niklas Nielsen 
 nik...@mesosphere.io
  wrote:
 
  -1: sudo make check on centos 7
 
  [--] Global test environment tear-down
 
  [==] 793 tests from 121 test cases ran. (606946 ms total)
 
  [  PASSED  ] 786 tests.
 
  [  FAILED  ] 7 tests, listed below:
 
  [  FAILED  ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where
  TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess
 
  [  FAILED  ] LinuxFilesystemIsolatorTest.ROOT_ChangeRootFilesystem
 
  [  FAILED  ] LinuxFilesystemIsolatorTest.ROOT_VolumeFromSandbox
 
  [  FAILED  ] LinuxFilesystemIsolatorTest.ROOT_VolumeFromHost
 
  [  FAILED  ]
  LinuxFilesystemIsolatorTest.ROOT_VolumeFromHostSandboxMountPoint
 
  [  FAILED  ]
  LinuxFilesystemIsolatorTest.ROOT_PersistentVolumeWithRootFilesystem
 
  [  FAILED  ] MesosContainerizerLaunchTest.ROOT_ChangeRootfs
 
  Configured with:
 
  ../mesos/configure --prefix=/home/vagrant/releases/0.24.0/
  --disable-python
 
  On 26 August 2015 at 17:00, Khanduja, Vaibhav 
 vaibhav.khand...@emc.com
  wrote:
 
  +1
 
   On Aug 26, 2015, at 4:43 PM, Vinod Kone vinodk...@gmail.com
 wrote:
  
   Pinging the thread for more (binding) votes. Hopefully people
 have
  caught
   up with emails after Mesos madness.
  
   On Wed, Aug 19, 2015 at 1:28 AM, haosdent haosd...@gmail.com
  wrote:
  
   +1
  
    OS: Ubuntu 14.04
   Verify command: sudo make -j8 check
   Compiler: Both gcc4.8 and clang3.5
   Configuration: default configuration
   Result: all tests(828 tests) pass
  
   MESOS-3053 https://issues.apache.org/jira/browse/MESOS-3053
 is
  because
   need update add iptable first.
  
   On Wed, Aug 19, 2015 at 2:39 PM, haosdent haosd...@gmail.com
  wrote:
  
   Could not
   pass
 DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged in
  Ubuntu
   14.04. Already have a issue for this
   https://issues.apache.org/jira/browse/MESOS-3053, it is
 acceptable?
  
   On Wed, Aug 19, 2015 at 12:55 PM, Marco Massenzio 
  ma...@mesosphere.io
   wrote:
  
   +1 (non-binding)
  
   All tests (including ROOT) pass on:
   Ubuntu 14.04 (physical box)
  
   All non-ROOT tests pass on:
   CentOS 7 (VirtualBox VM)
  
   Known issue (MESOS-3050) for ROOT tests on CentOS 7,
 non-blocker.
  
   Thanks,
  
   *Marco Massenzio*
  
    *Distributed Systems Engineer http://codetrips.com*
  
   On Tue, Aug 18, 2015 at 3:26 PM, Vinod Kone 
 vinodk...@apache.org
   wrote:
  
   0.24.0 includes the following:
  
  
  
 
 
  
   Experimental support for v1 scheduler HTTP API!
  
   This release also wraps up support for fetcher

Re: Custom Scheduler: Diagnosing cause of container task failures

2015-08-20 Thread Tim Chen
It received a TASK_FAILED from the executor, so you'll need to look at your
task's sandbox logs (the stdout and stderr files) to see what went wrong.

These files should be reachable by the Mesos UI.
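If the UI isn't handy, the same files also live on the slave's disk (a sketch
assuming the default work_dir; the IDs are placeholders):

  $ cd /tmp/mesos/slaves/<slave-id>/frameworks/<framework-id>/executors/<executor-id>/runs/latest
  $ cat stdout stderr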

Tim

On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor outtat...@gmail.com wrote:

 Hey everyone,

 I am writing a scheduler for Mesos and one of my first goals is to get a
 simple docker container to run.

 The tasks get marked as failed with the failure messages originating from
 the slave logs.  Now I'm not sure how to determine exactly what is causing
 the failure.

 The most informative log messages I've found were in the slave log:

 == /var/log/mesos/mesos-slave.INFO ==
 W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown
 container: e190037a-b011-4681-9e10-dcbacf6cb819
 I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received
 status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for
 task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update
 TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task
 jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to
 master@63.198.215.105:5050
 I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received
 status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60)
 for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor
 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework
 20150804-211741-1608624320-5050-18273-0060

 And this doesn't really tell me much about *why* it's failed.

 Is there somewhere else I should be looking or an option that needs to be
 turned on to show more information?

 Your assistance is greatly appreciated!

 Jay



Re: MesosCon Seattle attendee introduction thread

2015-08-18 Thread Tim Chen
Hi all,

I'm an Engineer here at Mesosphere and also a Mesos PMC member/Committer, for
the most part working on Docker and Containerizer related things in Mesos.

Looking forward to meeting you all at the Hackathon and during the conference!

Tim

On Mon, Aug 17, 2015 at 10:30 PM, Adam Bordelon a...@mesosphere.io wrote:

 Greetings, friends.

 I'm adam-mesos, and I'm a Distributed Systems Architect at Mesosphere and
 an Apache Mesos committer. My current areas of focus are security
 https://mesoscon2015.sched.org/event/7c88ed02112a47935292554102f6d25a,
 storage, and stateful services
 https://mesoscon2015.sched.org/event/7151c36724e5c3bc9de9e452fe4c866a#.VdK815M2tyQ.
 I have been involved in the development of the Kubernetes-Mesos, HDFS, and
 Myriad
 https://mesoscon2015.sched.org/event/76ed472dbfb388b5f939dde31c7a3302
 (Hadoop2) frameworks, and will be sharing some of that experience in our 
 Framework
 Development Workshop
 https://mesoscon2015.sched.org/event/db9d4039e0bdf91d4ba25af65028644c#.VdK9s5M2tyQ.
 I am also excited about the recent/upcoming developments in networking,
 oversubscription, and maintenance primitives
 https://mesoscon2015.sched.org/event/748798f2d3eb45de4c1538a3f46d0258.
 I'm eager to see some old friends and meet new ones as we grow the Mesos
 project, community, and ecosystem. I'd love to hear your thoughts,
 concerns, and experiences working with Mesos so we can keep improving it.
 Tell me what Mesos can't do for you (yet!)

 Cheers,
 -A-


 On Mon, Aug 17, 2015 at 12:12 PM, Joseph Smith yasumo...@gmail.com
 wrote:

 Howdy all!

 I’m Joe Smith, Site Reliability Engineer for Aurora/Mesos at Twitter.
 I’ve been running our Aurora/Mesos clusters for over three years, so I’ve
 got lots of war stories around migrations, pitching teams/organizations,
 and operations + maintenance.

 I’m really excited to share our experience at Twitter at 2pm on Thursday
 http://mesoscon2015.sched.org/event/81e64ff605ec62217d2efec90376281a#.VdIverQaxB8,
  as
 well as learn a lot from the rest of the community during the conference!
 I’m particularly interested to hear how people are provisioning new
 machines, as we’re hoping to start revising our approach (to take advantage
 of the forthcoming Filesystem Isolation) within the next few months.

 Thanks!
 Joe

 On Aug 17, 2015, at 12:06 PM, Mark Eijsermans 
 mark.eijserm...@hootsuite.com wrote:

 I’m Mark Eijsermans, Sr Software Engineer at Hootsuite on the platform
 team. Currently running Mesos for our build (Jenkins), some internal
 tooling and looking to move our stateless dockerized scala services in the
 future. Really excited to meet everyone and hear people’s experiences
 transitioning to Mesos on production. From both technical and
 organizational perspectives. Curious to hear any challenges people have
 overcome with convincing adoption of Mesos to upper and C-level management.


 On Monday, August 17, 2015 at 11:10 AM, Alexander Gallego wrote:


 Hi,

 I'm Alex, I'm working on a distributed stream processor in c++ (
 concord.io (http://concord.io)). Looking forward to connecting with all
 of you. Would be great to meet with people doing large cluster load testing
 on mesos :)

 I'll be at the hackathon with some coworkers as well.



 On Mon, Aug 17, 2015 at 1:48 PM, Nic Grayson nic.gray...@banno.com wrote:

 Hi,

 I'm Nic Grayson, Software Engineer at Banno/Jack Henry  Associates. I'm
 excited to return to mesoscon this year. I'll be bringing more of our team
 with me this year, 7 in total.

 We've been hard at work automating deployments with terraform, marathon,
 and mesos. I’m excited to see the progress all of the major frameworks have
 made over the last year. We are now using terraform to interact with the
 kafka framework api (http://nicgrayson.com/mesos-kafka-terraform/)

 Nic

 On Mon, Aug 17, 2015 at 12:20 PM, Sharma Podila spod...@netflix.com wrote:

 Hello Everyone,

 I am Sharma Podila, senior software engineer at Netflix. It is exciting
 to be a part of MesosCon again this year.
 We developed a cloud native Mesos framework to run a mix of service,
 batch, and stream processing workloads. To which end we created a reusable
 plug-ins based scheduling library, Fenzo. I am looking forward to
 presenting an in-depth look on Thurs at 2pm about how we achieve scheduling
 objectives and cluster autoscaling, as well as share some of our results
 with you.

 I am interested in learning about and collaborating with you all
 regarding scheduling and framework development.

 Sharma



 On Mon, Aug 17, 2015 at 2:11 AM, Ankur Chauhan an...@malloc64.com wrote:

 Hi all,

 I am Ankur Chauhan. I am a Sr. Software engineer with the Reporting and
 Analytics team
 at Brightcove Inc. I have been evaluating, tinkering, developing with
 mesos for about an year
 now. My latest adventure has been in the spark 

Re: Custom docker executor

2015-08-08 Thread Tim Chen
Hi Kapil,

What kind of pre/post actions do you like to perform?

The community has been contributing hooks that can be run pre and post
container launch, so I'd like to see what your use cases are; perhaps the new
hooks can satisfy your need, or maybe there is already some other way to
achieve what you'd like.

Tim

On Sat, Aug 8, 2015 at 1:01 AM, Kapil Malik kma...@adobe.com wrote:

 … posting in a fresh thread

 Hi,



 We have a usecase to run multi-user workloads on mesos. Users provide
 docker images encapsulating application logic, which we (we = say some
 “Central API”) schedule on Chronos / Marathon. However, we need to run some
 standard pre / post steps for every docker submitted by users. We have the
 following options:



 1.   Ask every user to embed their logic inside a pre-defined docker
 template which will perform pre/post steps.

 => This is error prone, makes us dependent on whether the users followed the
 template, and is not very popular with users either.



 2.   Extend every user docker (FROM ) and find a way to add
 pre-post steps in our docker. Refer this docker when scheduling on chronos
 / marathon.

 => Building new dockers does not scale as users and applications grow



 3.   Write a custom executor which will perform the pre-post steps
 and manage the user docker lifetime.

 => Deals with user docker lifetime and is obviously complex.



 Is there a standard / openly available DockerExecutor which manages the
 docker lifetime and which I can extend to build my custom executor?

 For instance, do you suggest extending
 https://github.com/apache/mesos/blob/master/src/docker/executor.cpp as a
 starting point? Can I access it in Java?



 This way I will be concerned only with my custom logic (pre/post steps)
 and still get benefits of a standard way to manage docker containers.





 Thanks and regards,



 Kapil Malik | kma...@adobe.com | 33430 / 8800836581





Re: Troubles with slave recovery via Docker containerizer on 0.23.0

2015-08-06 Thread Tim Chen
Got it, this shouldn't happen. Can you open a JIRA ticket? I'll try to
repro today.

Tim

On Thu, Aug 6, 2015 at 9:37 AM, Benjamin Anderson benja...@ivysoftworks.com
 wrote:

 Hi Tim,

 That's the output from `docker inspect`. I've gisted the full contents
 of the container's log file (in all of its JSON-encoded glory) here:


 https://gist.githubusercontent.com/banjiewen/6450a06f958a2e7630bf/raw/12183fe891c1ddaf7019b478278c47c479d77c01/gistfile1.txt

 The slave itself isn't logging much of interest, just various
 Executor has terminated with unknown status messages, etc.

 For context, my container is running 0.23.0 installed from packages on
 Ubuntu 14.04. Docker is at 1.6.2.

 --
 b

 On Wed, Aug 5, 2015 at 4:28 PM, Tim Chen t...@mesosphere.io wrote:
  Hi Ben,
 
  Did you get the command from docker inspect or from the slave log?
 
  If it's from the slave log then we don't actually print out the exact
  way we exec the command, but just join the exec arguments with a space in
  between.
 
  What's the exact error in the slave/sandbox stderr log?
 
  Tim
 
 
  On Wed, Aug 5, 2015 at 4:18 PM, Benjamin Anderson
  benja...@ivysoftworks.com wrote:
 
  Hi there - I'm working on setting up a Mesos environment with the
  Docker containerizer and can't seem to get the recovery feature
  working. I'm running CoreOS, so the slave processes themselves are
  containerized. I have no issues running jobs without the recovery
  features enabled, but all jobs fail to boot when I add the following
  flags:
 
  MESOS_DOCKER_KILL_ORPHANS=false
  MESOS_DOCKER_MESOS_IMAGE=myrepo/my-slave-container
 
  Inspecting the Docker images and their log output reveals that the
  container invocation appears to be flawed - see this gist:
 
  https://gist.github.com/banjiewen/a2dc1784a82ed87edd6b
 
  The containerizer is attempting to invoke an unquoted command via
  `/bin/sh -c`, which, predictably, fails to pass the complete command.
  This results in the error message shown in the second file in the
  linked gist.
 
  This is reproducible manually; quoting the arguments to `/bin/sh -c`
  results in success (at least, it correctly receives the supplied
  arguments).
 
  I gather that this is related to MESOS-2115, and it's clear that this
  patch[1] changed that behavior significantly, but if it introduced a
  bug I can't see it. It's possible that my instance is configured
  incorrectly as well; the documentation here is a bit vague and there
  aren't many examples on the web.
 
  Thanks in advance,
  --
  b
 
  [1]:
 
 https://github.com/apache/mesos/commit/3baa60965407bf0c3eb9c3da1b2ba7c0a4fee968
 
 



Re: Troubles with slave recovery via Docker containerizer on 0.23.0

2015-08-05 Thread Tim Chen
Hi Ben,

Did you get the command from docker inspect or from the slave log?

If it's from the slave log then we don't actually print out the exact way
we exec the command, but just join the exec arguments with a space in
between.

What's the exact error in the slave/sandbox stderr log?

Tim


On Wed, Aug 5, 2015 at 4:18 PM, Benjamin Anderson benja...@ivysoftworks.com
 wrote:

 Hi there - I'm working on setting up a Mesos environment with the
 Docker containerizer and can't seem to get the recovery feature
 working. I'm running CoreOS, so the slave processes themselves are
 containerized. I have no issues running jobs without the recovery
 features enabled, but all jobs fail to boot when I add the following
 flags:

 MESOS_DOCKER_KILL_ORPHANS=false
 MESOS_DOCKER_MESOS_IMAGE=myrepo/my-slave-container

 Inspecting the Docker images and their log output reveals that the
 container invocation appears to be flawed - see this gist:

 https://gist.github.com/banjiewen/a2dc1784a82ed87edd6b

 The containerizer is attempting to invoke an unquoted command via
 `/bin/sh -c`, which, predictably, fails to pass the complete command.
 This results in the error message shown in the second file in the
 linked gist.

 This is reproducible manually; quoting the arguments to `/bin/sh -c`
 results in success (at least, it correctly receives the supplied
 arguments).

 I gather that this is related to MESOS-2115, and it's clear that this
 patch[1] changed that behavior significantly, but if it introduced a
 bug I can't see it. It's possible that my instance is configured
 incorrectly as well; the documentation here is a bit vague and there
 aren't many examples on the web.

 Thanks in advance,
 --
 b

 [1]:
 https://github.com/apache/mesos/commit/3baa60965407bf0c3eb9c3da1b2ba7c0a4fee968



Re: Docker on Marathon 0.9.0 on Mesos 0.23.0

2015-08-04 Thread Tim Chen
It seems like the binary (mesos-docker-executor) that was built is looking
for libmesos-0.23.0 at a place where it doesn't exist.

How are you running Mesos? Are you running from the source/build/src folder
or after make install?

Usually this happens when you don't run make install before you run Mesos.
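A quick way to check and work around it (the paths are illustrative):

  $ ldd $(which mesos-docker-executor) | grep libmesos   # shows whether the library resolves
  $ cd build && sudo make install && sudo ldconfig       # install the libraries system-wide
  # or, if running straight from the build tree:
  $ export LD_LIBRARY_PATH=/path/to/mesos/build/src/.libs:$LD_LIBRARY_PATH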

Tim

On Tue, Aug 4, 2015 at 4:08 PM, John Omernik j...@omernik.com wrote:

 I am finding that Docker containers won't start for me in the versions
 above; the only information I am getting from the sandbox is below. I am
 not sure what the issue is, given that the file is in the same location
 where the previous version's files were...  Any help is appreciated.

 John



 mesos-docker-executor: error while loading shared libraries: 
 libmesos-0.23.0.so: cannot open shared object file: No such file or directory






Re: Custom executor

2015-07-28 Thread Tim Chen
Can you explain what your motivations are and what your new custom executor
will do?

Tim

On Tue, Jul 28, 2015 at 5:08 AM, Aaron Carey aca...@ilm.com wrote:

  Hi,

 Is it possible to build a custom executor which is not associated with a
 particular scheduler framework? I want to be able to write a custom
 executor which is available to multiple schedulers (eg Marathon, Chronos
 and our own custom scheduler). Is this possible? I couldn't quite figure
 out the best way to go about this from the docs? Is it possible to mix and
 match languages for schedulers and executors? (ie one is python one is C++)

 Thanks,
 Aaron



Re: Problems connecting with Mesos Master

2015-07-28 Thread Tim Chen
spark-env.sh works since it will be sourced by spark-submit/spark-shell, or
you can just set the variable before you call spark-shell yourself.
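For example (the address is illustrative):

  $ echo 'export LIBPROCESS_IP=10.0.0.12' >> conf/spark-env.sh
  # or just for one session:
  $ LIBPROCESS_IP=10.0.0.12 bin/spark-shell --master mesos://<master-ip>:5050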

Tim

On Tue, Jul 28, 2015 at 1:43 PM, Haripriya Ayyalasomayajula 
aharipriy...@gmail.com wrote:

 Hi,

 Where can I set the LIBPROCESS_IP env variable? spark-env.sh? That's the
 only place I can think of. Can you please point me to any related
 documentation?

 On Tue, Jul 28, 2015 at 12:46 PM, Nikolaos Ballas neXus 
 nikolaos.bal...@nexusgroup.com wrote:

  If you are not using any DNS-like service, then under /etc/mesos-master/
  create two files called ip and hostname, and put in them the IP of the eth
  interface.
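 For example (the address is illustrative):

  $ echo 10.0.0.11 | sudo tee /etc/mesos-master/ip
  $ echo 10.0.0.11 | sudo tee /etc/mesos-master/hostname
  $ sudo service mesos-master restart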



  Sent from my Samsung device


  Original message 
 From: Haripriya Ayyalasomayajula aharipriy...@gmail.com
 Date: 28/07/2015 20:18 (GMT+01:00)
 To: user@mesos.apache.org
 Subject: Problems connecting with Mesos Master

  Hi all,

  I am trying to use Spark 1.4.1 with Mesos 0.23.0.

  When I try to start my spark-shell, it gives me the following warning :

 **

 Scheduler driver bound to loopback interface! Cannot communicate with
 remote master(s). You might want to set 'LIBPROCESS_IP' environment
 variable to use a routable IP address.
 ---

  Spark-shell works fine on the node where I run master, but if I start
 running on any of the other slave nodes it gives me the following error:

  E0728 11:22:53.176515 10503 socket.hpp:107] Shutdown failed on fd=6:
 Transport endpoint is not connected [107]

 E0728 11:22:53.210146 10503 socket.hpp:107] Shutdown failed on fd=6:
 Transport endpoint is not connected [107]

 I have the following configs:


- zookeeper configured to the mesos master
- /etc/mesos/zk on all nodes pointing to mesos master ip.

  I am not sure if I have to set the ip flag and where I have to set the
 --ip flag?

  --
   Regards,
 Haripriya Ayyalasomayajula




 --
 Regards,
 Haripriya Ayyalasomayajula




Re: mesos-execute + docker_image

2015-07-07 Thread Tim Chen
Hi there,

What kind of parameters do you like to pass to mesos-execute?

You can run mesos-execute --help and it shows you all the available
parameters.
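For example, a bare-bones run looks like this (the master address is
illustrative):

  $ mesos-execute --master=<master-ip>:5050 --name=test --command="echo hello"

Whether a docker image option is available depends on the Mesos version, so
check the --help output of the build you have installed.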

Tim

On Tue, Jul 7, 2015 at 7:26 AM, Jürgen Jakobitsch 
j.jakobit...@semantic-web.at wrote:

 hi,

 I just installed mesos-0.22.0 (from the Mesosphere repos) on CentOS 6.
 Can anyone point me in the right direction on how to run a docker image
 inside mesos using mesos-execute plus the docker_image parameter?

 Also note that I would like to pass some parameters to the docker run
 command.

 Any pointer really appreciated.

 wkr j


 | Jürgen Jakobitsch,
 | Software Developer
 | Semantic Web Company GmbH
 | Mariahilfer Straße 70 / Neubaugasse 1, Top 8
 | A - 1070 Wien, Austria
 | Mob +43 676 62 12 710 | Fax +43.1.402 12 35 - 22

 COMPANY INFORMATION
 | web   : http://www.semantic-web.at/
 | foaf  : http://company.semantic-web.at/person/juergen_jakobitsch
 PERSONAL INFORMATION
 | web   : http://www.turnguard.com
 | foaf  : http://www.turnguard.com/turnguard
 | g+: https://plus.google.com/111233759991616358206/posts
 | skype : jakobitsch-punkt
 | xmlns:tg  = http://www.turnguard.com/turnguard#;



Re: Running storm over mesos

2015-07-03 Thread Tim Chen
Hi Pradeep,

Without any more information it's quite impossible to know what's going on.

What's in the slave logs and storm framework logs?

Tim

On Fri, Jul 3, 2015 at 10:06 AM, Pradeep Chhetri 
pradeep.chhetr...@gmail.com wrote:

 Hello all,

 I am trying to run Storm over Mesos using the tutorial (
 http://open.mesosphere.com/tutorials/run-storm-on-mesos) over vagrant.
 When I try to submit a sample topology, it is not spawning any storm
 supervisors on the mesos-slaves. I didn't find anything interesting in
 the logs either. Can someone help in figuring out the problem?

 Thank you.

 --
 Pradeep Chhetri




Re: service discovery in Mesos on CoreOS

2015-07-01 Thread Tim Chen
As others have mentioned earlier, definitely don't use the mesos- prefix
to name your docker containers, since at the time we did the integration
docker labels weren't merged yet.

Also you'll need to run mesos-slave with --pid=host, bind mount in the
docker socket, and also bind mount a host directory into the slave as its
work directory so the slave can recover its tasks when it restarts.
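For example, a sketch of such a slave container (the image name and paths are
illustrative; the container name avoids the mesos- prefix on purpose):

  docker run -d --name my_slave \
    --net=host --pid=host \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/mesos:/var/lib/mesos \
    -e MESOS_MASTER=zk://<zk-host>:2181/mesos \
    -e MESOS_CONTAINERIZERS=docker,mesos \
    -e MESOS_WORK_DIR=/var/lib/mesos \
    myrepo/mesos-slave:latest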

Tim





On Tue, Jun 30, 2015 at 11:11 PM, zhou weitao zhouwtl...@gmail.com wrote:

 If I understand right, the root trouble is mesos-slave-in-docker. I don't
 know CoreOS well; do you run mesos-slave on CoreOS as follows?

  docker run --rm -it --name mesos-slave --net host
 --volume /var/run/docker.sock:/var/run/docker.sock --entrypoint mesos-slave
 

 or map the CoreOS docker.sock into the mesos-slave container for better
 service discovery.

 If none of this helps, sorry for the interruption.


 2015-06-30 21:10 GMT+08:00 Andras Kerekes andras.kere...@ishisystems.com
 :

 Wouldn't using Bamboo to update the haproxy config have the same problems I
 described for the Marathon-provided script? It would still run in a
 separate container.



 *From:* zhou weitao [mailto:zhouwtl...@gmail.com]
 *Sent:* Monday, June 29, 2015 10:51 PM
 *To:* user@mesos.apache.org
 *Subject:* Re: service discovery in Mesos on CoreOS







 2015-06-30 6:23 GMT+08:00 Andras Kerekes andras.kere...@ishisystems.com
 :

 Hi,



 Is there a preferred way to do service discovery in Mesos via mesos-dns
 running on CoreOS? I’m trying to implement a simple app which consists of
 two docker containers and one of them (A) depends on the other (B). What
 I’d like to do is to tell container A to use a fix dns name
 (containerB.marathon.mesos in case of mesos-dns) to find the other service.
 There are at least 3 different ways I think it can be done, but the 3 I
 found all have some shortcomings.



 1.   Use SRV records to get the port along with the IP. Con: I’d
 prefer not to build the logic of handling SRV records into the app, it can
 be a legacy app that is difficult to modify

 2.   Use haproxy on slaves and connect via a well-known port on
 localhost. Cons: the Marathon provided script does not run on CoreOS, also
 I don’t know how to run haproxy on CoreOS outside of a docker container. If
 it is running in a docker container, then how can it dynamically allocate
 ports on localhost if a new service is discovered in Marathon/Mesos?

 Do you know this repo? https://github.com/QubitProducts/bamboo . And
 here is our corp one, https://github.com/Dataman-Cloud/bamboo , branched
 from the above.

 3.   Use dedicated port to bind the containers to. Con: I can have
 only as many instances of a service as many slaves I have because they bind
 to the same port.



 What other alternatives are there?



 Thanks,

 Andras







Re: Cluster autoscaling in Spark+Mesos ?

2015-06-05 Thread Tim Chen
Hi Sharma,

What metrics do you watch for demand and supply for Spark? Do you just
watch node resources or do you actually look at some Spark JMX stats?

Tim

On Thu, Jun 4, 2015 at 10:35 PM, Sharma Podila spod...@netflix.com wrote:

 We autoscale our Mesos cluster in EC2 from within our framework. Scaling
 up can be easy by watching demand vs. supply. However, scaling down
 requires bin-packing the tasks tightly onto as few servers as possible.
 Do you have any specific ideas on how you would leverage Mantis/Mesos for
 Spark based jobs? Fenzo, the scheduler part of Mantis, could be another
 point of leverage, which could give a framework the ability to autoscale
 the cluster among other benefits.



 On Thu, Jun 4, 2015 at 1:06 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Vinod. I'm really interested in how we could leverage something
 like Mantis and Mesos to achieve autoscaling in a Spark-based data
 processing system...

 On Jun 4, 2015, at 3:54 PM, Vinod Kone vinodk...@gmail.com wrote:

 Hey Dmitry. At the current time there is no built-in support for Mesos to
 autoscale nodes in the cluster. I've heard people (Netflix?) do it out of
 band on EC2.

 On Thu, Jun 4, 2015 at 9:08 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 A Mesos noob here. Could someone point me at the doc or summary for the
 cluster autoscaling capabilities in Mesos?

 Is there a way to feed it events and have it detect the need to bring in
 more machines or decommission machines?  Is there a way to receive events
 back that notify you that machines have been allocated or decommissioned?

 Would this work within a certain set of
 preallocated/pre-provisioned/stand-by machines or will Mesos go and
 grab machines from the cloud?

 What are the integration points of Apache Spark and Mesos?  What are the
 true advantages of running Spark on Mesos?

 Can Mesos autoscale the cluster based on some signals/events coming out
 of Spark runtime or Spark consumers, then cause the consumers to run on the
 updated cluster, or signal to the consumers to restart themselves into an
 updated cluster?

 Thanks.






Re: Cluster autoscaling in Spark+Mesos ?

2015-06-04 Thread Tim Chen
Spark becomes aware that there are more resources by getting more resource
offers and using those new offers.

I don't think there is a way to refresh the Spark context for streaming.

Tim

On Thu, Jun 4, 2015 at 1:59 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:

 Thanks, Ankur. I'd be curious to understand how the data exchange happens
 in this case. How does Spark become aware of the fact that machines have
 been added to the cluster or have been removed from it?  And then, do you
 have some mechanism to perhaps restart the Spark consumers into refreshed
 Spark context's which are aware of the new cluster topology?

 On Thu, Jun 4, 2015 at 4:23 PM, Ankur Chauhan an...@malloc64.com wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 AFAIK Mesos does not support host level auto-scaling because that is
 not the scope of the mesos-master or mesos-slave. In EC2 (like in my
 case) we have autoscaling groups set with cloudwatch metrics hooked up
 to scaling policies. In our case, we have the following.
 * Add 1 host per AZ when cpu load is > 85% for 15 mins continuously.
 * Remove 1 host if the cpu load is < 15% for 15 mins continuously.
 * Similar monitoring + scale-up/scale-down based on memory.

 All of these rules have a cooldown period of 30mins so that we don't
 end-up scaling up/down too fast.
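 Roughly, that maps to something like the following (a sketch; group names,
 thresholds, and ARNs are placeholders):

  $ aws autoscaling put-scaling-policy --auto-scaling-group-name mesos-slaves \
      --policy-name scale-up --adjustment-type ChangeInCapacity \
      --scaling-adjustment 1 --cooldown 1800
  $ aws cloudwatch put-metric-alarm --alarm-name mesos-high-cpu \
      --namespace AWS/EC2 --metric-name CPUUtilization --statistic Average \
      --dimensions Name=AutoScalingGroupName,Value=mesos-slaves \
      --period 900 --evaluation-periods 1 --threshold 85 \
      --comparison-operator GreaterThanThreshold \
      --alarm-actions <scale-up-policy-arn>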

 Then again, our workload is bursty (spark on mesos in fine-grained
 mode). So, the new resources get used up and tasks distribute pretty
 fast. The above may not work in case you have long-running tasks (such
 as marathon tasks) because they would not be redistributed till some
 task restarting happens.

 - -- Ankur

 On 04/06/2015 13:13, Dmitry Goldenberg wrote:
  Would it be accurate to say that Mesos helps you optimize resource
  utilization out of a preset  pool of resources, presumably servers?
  And its level of autoscaling is within that pool?
 
 
  On Jun 4, 2015, at 3:54 PM, Vinod Kone vinodk...@gmail.com
  mailto:vinodk...@gmail.com wrote:
 
  Hey Dmitry. At the current time there is no built-in support for
  Mesos to autoscale nodes in the cluster. I've heard people
  (Netflix?) do it out of band on EC2.
 
  On Thu, Jun 4, 2015 at 9:08 AM, Dmitry Goldenberg
  dgoldenberg...@gmail.com mailto:dgoldenberg...@gmail.com
  wrote:
 
  A Mesos noob here. Could someone point me at the doc or summary
  for the cluster autoscaling capabilities in Mesos?
 
  Is there a way to feed it events and have it detect the need to
  bring in more machines or decommission machines?  Is there a way
  to receive events back that notify you that machines have been
  allocated or decommissioned?
 
  Would this work within a certain set of
  preallocated/pre-provisioned/stand-by machines or will Mesos
  go and grab machines from the cloud?
 
  What are the integration points of Apache Spark and Mesos?  What
  are the true advantages of running Spark on Mesos?
 
  Can Mesos autoscale the cluster based on some signals/events
  coming out of Spark runtime or Spark consumers, then cause the
  consumers to run on the updated cluster, or signal to the
  consumers to restart themselves into an updated cluster?
 
  Thanks.
 
 
 -BEGIN PGP SIGNATURE-

 iQEcBAEBAgAGBQJVcLO2AAoJEOSJAMhvLp3LDuEH/1Bu3vhALR8+TPbsM5TscDOy
 vFwyb+ACh8tKL2XoXPwBaMkXU5qPFGX9Wa5weDNCqcUqbvoZ6G9ScrXbpTpWVFTn
 n240CxKGMqplgelDZmQAlixlPB8jUi9ZUfn6Z4FjuPUz1scLSyIOATxh57z0qRyp
 kdbS3pcU5ZmS9N/CHwNGOI9qwk7ebA1HPLqkRnBJLHKXJ6savW4FbANYb8OLWcAM
 It2GzbyAdrMMs7dgeaaEPnvwqnF5nSf2aERA9EjFyxBhJMgKidlUxFSxvMTD1jkx
 xjMZJeeVDqVsdZWtJkNwNsjXQG7X7f2bWY14rDL4XM59X8XCLnxkODRMTeGjXBM=
 =cHZK
 -END PGP SIGNATURE-





Re: Cluster autoscaling in Spark+Mesos ?

2015-06-04 Thread Tim Chen
Hi Dmitry,

That certainly can work; it just needs to coordinate the events you mentioned
and make sure they happen accordingly. Currently the Spark scheduler is very
job agnostic and doesn't understand what Spark job it is running. That's the
next type of optimization I'd like to put into the roadmap: understanding the
job type that it's running and supporting certain actions depending on what
it is.

Do you have a specific use case we can prototype this with? We can certainly
make this happen on the Spark side.

Tim





On Thu, Jun 4, 2015 at 2:11 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:

 Tim,

 Aware of more resources - is that if it runs on Mesos or via any type of
 cluster manager?  Our thinking was that once we can determine that the
 cluster has changed, we could notify the streaming consumers to finish
 processing the current batch, then terminate, then resume streaming with a
 new instance of the Context.  Would that not cause Spark to refresh its
 awareness of the cluster resources?

 - Dmitry

 On Thu, Jun 4, 2015 at 5:03 PM, Tim Chen t...@mesosphere.io wrote:

 Spark becomes aware that there are more resources by getting more resource
 offers and using those new offers.

 I don't think there is a way to refresh the Spark context for streaming.

 Tim

 On Thu, Jun 4, 2015 at 1:59 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Ankur. I'd be curious to understand how the data exchange
 happens in this case. How does Spark become aware of the fact that machines
 have been added to the cluster or have been removed from it?  And then, do
 you have some mechanism to perhaps restart the Spark consumers into
 refreshed Spark context's which are aware of the new cluster topology?

 On Thu, Jun 4, 2015 at 4:23 PM, Ankur Chauhan an...@malloc64.com
 wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 AFAIK Mesos does not support host level auto-scaling because that is
 not the scope of the mesos-master or mesos-slave. In EC2 (like in my
 case) we have autoscaling groups set with cloudwatch metrics hooked up
 to scaling policies. In our case, we have the following.
 * Add 1 host per AZ when cpu load is > 85% for 15 mins continuously.
 * Remove 1 host if the cpu load is < 15% for 15 mins continuously.
 * Similar monitoring + scale-up/scale-down based on memory.

 All of these rules have a cooldown period of 30mins so that we don't
 end-up scaling up/down too fast.

 Then again, our workload is bursty (spark on mesos in fine-grained
 mode). So, the new resources get used up and tasks distribute pretty
 fast. The above may not work in case you have long-running tasks (such
 as marathon tasks) because they would not be redistributed till some
 task restarting happens.

 - -- Ankur

 On 04/06/2015 13:13, Dmitry Goldenberg wrote:
  Would it be accurate to say that Mesos helps you optimize resource
  utilization out of a preset  pool of resources, presumably servers?
  And its level of autoscaling is within that pool?
 
 
  On Jun 4, 2015, at 3:54 PM, Vinod Kone vinodk...@gmail.com
  mailto:vinodk...@gmail.com wrote:
 
  Hey Dmitry. At the current time there is no built-in support for
  Mesos to autoscale nodes in the cluster. I've heard people
  (Netflix?) do it out of band on EC2.
 
  On Thu, Jun 4, 2015 at 9:08 AM, Dmitry Goldenberg
  dgoldenberg...@gmail.com mailto:dgoldenberg...@gmail.com
  wrote:
 
  A Mesos noob here. Could someone point me at the doc or summary
  for the cluster autoscaling capabilities in Mesos?
 
  Is there a way to feed it events and have it detect the need to
  bring in more machines or decommission machines?  Is there a way
  to receive events back that notify you that machines have been
  allocated or decommissioned?
 
  Would this work within a certain set of
  preallocated/pre-provisioned/stand-by machines or will Mesos
  go and grab machines from the cloud?
 
  What are the integration points of Apache Spark and Mesos?  What
  are the true advantages of running Spark on Mesos?
 
  Can Mesos autoscale the cluster based on some signals/events
  coming out of Spark runtime or Spark consumers, then cause the
  consumers to run on the updated cluster, or signal to the
  consumers to restart themselves into an updated cluster?
 
  Thanks.
 
 
 -BEGIN PGP SIGNATURE-

 iQEcBAEBAgAGBQJVcLO2AAoJEOSJAMhvLp3LDuEH/1Bu3vhALR8+TPbsM5TscDOy
 vFwyb+ACh8tKL2XoXPwBaMkXU5qPFGX9Wa5weDNCqcUqbvoZ6G9ScrXbpTpWVFTn
 n240CxKGMqplgelDZmQAlixlPB8jUi9ZUfn6Z4FjuPUz1scLSyIOATxh57z0qRyp
 kdbS3pcU5ZmS9N/CHwNGOI9qwk7ebA1HPLqkRnBJLHKXJ6savW4FbANYb8OLWcAM
 It2GzbyAdrMMs7dgeaaEPnvwqnF5nSf2aERA9EjFyxBhJMgKidlUxFSxvMTD1jkx
 xjMZJeeVDqVsdZWtJkNwNsjXQG7X7f2bWY14rDL4XM59X8XCLnxkODRMTeGjXBM=
 =cHZK
 -END PGP SIGNATURE-







Re: Running mesos-execute inside docker.

2015-06-01 Thread Tim Chen
Hi Giulio,

Can you share your exact docker commands to start the mesos slave and
master?

Thanks!

Tim

On Thu, May 21, 2015 at 12:17 PM, Giulio Eulisse giulio.euli...@cern.ch
wrote:

 Mmm, no this does not seem to work. The message is still there. Any other
 suggestions?

 --
 Ciao,
 Giulio

 On 21 May 2015, at 17:43, Tyson Norris wrote:

  You might try adding --pid=host. I found that when running a docker-based
 executor with the slave also running as a docker container, I had to do this
 so that the pids are visible between containers.

 Tyson

 On May 21, 2015, at 6:04 AM, Giulio Eulisse giulio.euli...@cern.ch wrote:


 Hi,

 I've a problem which can be reduced to running:

  mesos-execute --name=foo --command="uname -a && hostname"
 --master=leader.mesos:5050



 inside a docker container. If I run without --net=host, it blocks
 completely (I guess the master / slave cannot communicate back to the
 framework), if I run with --net=host everything is fine but I get:

 May 21 14:59:13 cmsbuild30 mesos-slave[1514]: I0521 14:59:13.115659  1546
 slave.cpp:1533] Asked to shut down framework
  20150418-223037-3834547840-5050-6-2757 by master@128.142.142.228:5050
 May 21 14:59:13 cmsbuild30 mesos-slave[1514]: W0521 14:59:13.117231  1546
 slave.cpp:1548] Cannot shut down unknown framework
 20150418-223037-3834547840-5050-6-2757


 in my host machine logs, which is not ideal. Any idea on how to do this
 correctly?

 The actual problem I'm trying to solve is using the mesos plugin for a
 jenkins instance which runs inside docker.

 --
 Ciao
 Giulio




Re: Batch Scheduler with dependency support

2015-05-13 Thread Tim Chen
Hi Alex,

Thanks for replying, and Stolos looks like a really interesting framework! I was
originally aiming the question back at Aaron, as he stated he's looking for a
batch job scheduler.

Do you have some rough stats on how many jobs or how much data you guys are
computing with Stolos?

Tim



On Wed, May 13, 2015 at 1:04 PM, Alex Gaudio adgau...@gmail.com wrote:

 Hi Tim (and everyone else!),

 I am the primary author of Stolos.  We use Stolos to run all of our batch
 jobs on Mesos.  The batch jobs are scripts we can run from the
 command-line.  Scripts range from bash scripts, Spark jobs and R scripts.

  It's a great tool for us because, unlike Chronos, it lets us define a
  script as a stage in a dependency chain, where the script can run with
  different parameters for different dependency contexts.  (The closest
  approximation would be to have many Chronos servers, though this does not work
  in all cases.)

 The tool is a critical component of Sailthru's data science
 infrastructure, but I believe we are the only people who use the tool right
 now.

 If you are interested in learning more, I'm happy to invest time to talk
 more about Stolos, what it does and how we use it!

 Alex

 On Wed, May 13, 2015 at 2:02 PM Tim Chen t...@mesosphere.io wrote:

 How are you running your batch jobs? Is the batch job script/executable
 an in-house app?

 Tim

 On Wed, May 13, 2015 at 9:46 AM, Andras Kerekes 
 andras.kere...@ishisystems.com wrote:

 You might want to have a look at stolos too:



 https://github.com/sailthru/stolos



 Andras





 *From:* Aaron Carey [mailto:aca...@ilm.com]
 *Sent:* Wednesday, May 13, 2015 11:54 AM
 *To:* user@mesos.apache.org
 *Subject:* RE: Batch Scheduler with dependency support



 Thanks! I hadn't come across that one before :)
 --

 *From:* jeffschr...@gmail.com [jeffschr...@gmail.com] on behalf of Jeff
 Schroeder [jeffschroe...@computer.org]
 *Sent:* 13 May 2015 16:39
 *To:* user@mesos.apache.org
 *Subject:* Re: Batch Scheduler with dependency support

  Look up HubSpot's Singularity

 On Wednesday, May 13, 2015, Aaron Carey aca...@ilm.com wrote:

 Thanks Jeff,

 Any other options around as well?
 --

  *From:* jeffschr...@gmail.com [jeffschr...@gmail.com] on behalf of Jeff
  Schroeder [jeffschroe...@computer.org]
  *Sent:* 13 May 2015 14:12
  *To:* user@mesos.apache.org
 *Subject:* Batch Scheduler with dependency support

 It does both just as well, along with cron-like functionality. It is
 harder to install and takes a bit more understanding however. The official
 tutorial is a process that loops 100 times and then exits.



 http://aurora.apache.org/documentation/latest/tutorial/#the-script

 Aurora is pretty much a superset of most other generic frameworks sans
 maybe hubspot's singularity.


  On Wednesday, May 13, 2015, Aaron Carey aca...@ilm.com wrote:

  I was under the impression Aurora was for long running services? Is it
  suitable for scheduling one-off batch processes too?

 thanks,
 Aaron
 --

 *From:* jeffschr...@gmail.com [jeffschr...@gmail.com] on behalf of Jeff
 Schroeder [jeffschroe...@computer.org]
 *Sent:* 13 May 2015 13:12
 *To:* user@mesos.apache.org
 *Subject:* Re: Batch Scheduler with dependency support

 Apache Aurora does this and you can be explicit about the ordering

 On Wednesday, May 13, 2015, Aaron Carey aca...@ilm.com wrote:

 Hi All,

 I was just wondering if anyone out there knew of a good mesos batch
 scheduler which supports dependencies between tasks? (ie Task B cannot run
 until Task A is complete)

 Thanks,
 Aaron



 --
 Text by Jeff, typos by iPhone



 --
 Text by Jeff, typos by iPhone



 --
 Text by Jeff, typos by iPhone





Re: Spark Bootstrapping on Mesos

2015-05-13 Thread Tim Chen
Hi Stephen,

I'm not quite sure what you mean by bootstrapping classes; do you have some
particular examples?

Usually, to run any user jar you just need it to be reachable by your slaves,
so it can live on S3 or any other accessible place; then you just provide your
jar URL when you run spark-submit.

Tim

On Wed, May 13, 2015 at 8:09 AM, Stephen Carman scar...@coldlight.com
wrote:

 Hi,

 We have a small mesos cluster and we'd like to be able to initialize some
 of our classes; mostly we have a vfs we set up to allow our code
 to access S3, but there doesn't seem to be any readily obvious way to
 bootstrap these kinds of classes so that they have the properly initialized
 configuration they need to operate.

 Is there some accepted way to accomplish this?


 thanks,
 Steve



Re: Batch Scheduler with dependency support

2015-05-13 Thread Tim Chen
How are you running your batch jobs? Is the batch job script/executable an
in-house app?

Tim

On Wed, May 13, 2015 at 9:46 AM, Andras Kerekes 
andras.kere...@ishisystems.com wrote:

 You might want to have a look at stolos too:



 https://github.com/sailthru/stolos



 Andras





 *From:* Aaron Carey [mailto:aca...@ilm.com]
 *Sent:* Wednesday, May 13, 2015 11:54 AM
 *To:* user@mesos.apache.org
 *Subject:* RE: Batch Scheduler with dependency support



 Thanks! I hadn't come across that one before :)
 --

 *From:* jeffschr...@gmail.com [jeffschr...@gmail.com] on behalf of Jeff
 Schroeder [jeffschroe...@computer.org]
 *Sent:* 13 May 2015 16:39
 *To:* user@mesos.apache.org
 *Subject:* Re: Batch Scheduler with dependency support

  Look up HubSpot's Singularity

 On Wednesday, May 13, 2015, Aaron Carey aca...@ilm.com wrote:

 Thanks Jeff,

 Any other options around as well?
 --

  *From:* jeffschr...@gmail.com [jeffschr...@gmail.com] on behalf of Jeff
  Schroeder [jeffschroe...@computer.org]
  *Sent:* 13 May 2015 14:12
  *To:* user@mesos.apache.org
 *Subject:* Batch Scheduler with dependency support

 It does both just as well, along with cron-like functionality. It is
 harder to install and takes a bit more understanding however. The official
 tutorial is a process that loops 100 times and then exits.



 http://aurora.apache.org/documentation/latest/tutorial/#the-script

 Aurora is pretty much a superset of most other generic frameworks sans
 maybe hubspot's singularity.


  On Wednesday, May 13, 2015, Aaron Carey aca...@ilm.com wrote:

  I was under the impression Aurora was for long running services? Is it
  suitable for scheduling one-off batch processes too?

 thanks,
 Aaron
 --

 *From:* jeffschr...@gmail.com [jeffschr...@gmail.com] on behalf of Jeff
 Schroeder [jeffschroe...@computer.org]
 *Sent:* 13 May 2015 13:12
 *To:* user@mesos.apache.org
 *Subject:* Re: Batch Scheduler with dependency support

 Apache Aurora does this and you can be explicit about the ordering

 On Wednesday, May 13, 2015, Aaron Carey aca...@ilm.com wrote:

 Hi All,

 I was just wondering if anyone out there knew of a good mesos batch
 scheduler which supports dependencies between tasks? (ie Task B cannot run
 until Task A is complete)

 Thanks,
 Aaron



 --
 Text by Jeff, typos by iPhone



 --
 Text by Jeff, typos by iPhone



 --
 Text by Jeff, typos by iPhone



Re: cpu hard limit for docker containerizer?

2015-05-07 Thread Tim Chen
Hi Chengwei,

It's a known issue and there is an open JIRA (MESOS-2154) and also an open
review request on ReviewBoard that hasn't been updated for a while.

I'd like this to go into 0.23 if we can get to it; if you'd like to pick
up the review, feel free to do so.

Tim

On Thu, May 7, 2015 at 7:21 PM, Chengwei Yang chengwei.yang...@gmail.com
wrote:

 Hi List,

 I see mesos-slave has a `--cgroups_enable_cfs` option to enable the CFS hard CPU
 limit; that may be really helpful for running online and offline jobs within a
 single mesos cluster, since some offline jobs are very CPU-bound.
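
 For reference, with the Mesos containerizer today that flag is enabled roughly
 like this (the master URL is just an example):

   mesos-slave --master=zk://leader.mesos:2181/mesos \
               --containerizers=mesos \
               --isolation=cgroups/cpu,cgroups/mem \
               --cgroups_enable_cfs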

 However, after a small trip through the source code, I saw that
 `--cgroups_enable_cfs` is only used by the *mesos* containerizer. Is there a
 plan to reuse this in the *docker* containerizer?

 Please correct me if I was wrong, thanks in advance!

 --
 Thanks,
 Chengwei



Re: Kill task with configurable options?

2015-04-30 Thread Tim Chen
Hi Chengwei,

If you're launching tasks with the Docker containerizer, then we support a flag
you can set on the slave (docker_stop_timeout) that basically does what you
described.

When you kill a docker task, we use the docker stop command with that timeout
value, so the docker daemon sends a SIGTERM and escalates to SIGKILL after the
timeout.

Tim
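
A sketch of what that looks like on the slave command line; the timeout value
and master URL here are only examples:

  mesos-slave --master=zk://leader.mesos:2181/mesos \
              --containerizers=docker,mesos \
              --docker_stop_timeout=30secs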

On Wed, Apr 29, 2015 at 11:33 PM, Adam Bordelon a...@mesosphere.io wrote:

 Chengwei, see the discussion of configurable SIGTERM/KILL escalation on
 https://issues.apache.org/jira/browse/MESOS-1571

 On Wed, Apr 29, 2015 at 8:55 PM, Chengwei Yang chengwei.yang...@gmail.com
  wrote:

 Hi List,

 Is there a way to configure how mesos to kill task?

 The background is that we have a type of task which runs with mesos native
 docker support. The task is a kind of message consumer which will take a task
 from a message queue and take care of it. However, a task may be very large and
 cost hours to finish, so we'd like to kill the container like this: send a
 signal (TERM maybe) to it and wait for a configurable timeout before killing it
 with KILL. That way our task consumer can handle TERM and will exit once it has
 finished its current task.

 Is there a good way to do that or any other tips?

 Thank you all in advance.


 --
 Thanks,
 Chengwei





Re: Storm Mesos Error

2015-04-29 Thread Tim Chen
We'll need to fix the README for sure, thanks for reporting!

Tim

On Wed, Apr 29, 2015 at 11:39 AM, John Omernik j...@omernik.com wrote:

  Thanks Tim. That fixed it.  I unpacked, renamed the folder to storm-mesos-0.9.3,
  repacked, copied to HDFS and executed, and all is well... that's a bit unclear
  to us n00bs in the audience, but I explained it verbosely here to help anyone
  else who made the same mistake as me.
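
  For anyone else hitting this, the repack John describes is roughly the
  following; the paths are illustrative, and the final copy goes to wherever the
  executor URI in your topology configuration points (here, the file:// path
  from the fetcher log below):

    tar xzf storm-mesos-0.9.3.tgz          # unpacks to apache-storm-0.9.3/
    mv apache-storm-0.9.3 storm-mesos-0.9.3
    tar czf storm-mesos-0.9.3.tgz storm-mesos-0.9.3
    cp storm-mesos-0.9.3.tgz /mapr/brewpot/mesos/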



 On Wed, Apr 29, 2015 at 1:34 PM, John Omernik j...@omernik.com wrote:

 I used the bin/build-release.sh package

  and it put it all in a folder named apache-storm-0.9.3... that's probably
 my problem? :)

 On Wed, Apr 29, 2015 at 1:30 PM, Tim Chen t...@mesosphere.io wrote:

 Hi John,

  Does your storm-mesos tarball have a folder storm-mesos-0.9.3 in it?

 Tim

 On Wed, Apr 29, 2015 at 11:26 AM, John Omernik j...@omernik.com wrote:

 Greetings all,

 I got my storm nimbus running, but when I try to run a test topology,
 the task enters a lost state and  I get the below in my stderr on the
 sandbox. Note, the URL for the storm.yaml works fine, not sure why it's
 causing an issue on the cp.




 cp: cannot create regular file `storm-mesos*/conf': No such file or
 directory

 Full:

 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0429 13:20:08.873922  5061 fetcher.cpp:76] Fetching URI
 'file:///mapr/brewpot/mesos/storm-mesos-0.9.3.tgz'
 I0429 13:20:08.874048  5061 fetcher.cpp:179] Copying resource from
 '/mapr/brewpot/mesos/storm-mesos-0.9.3.tgz' to
 '/tmp/mesos/slaves/20150402-121212-1644210368-5050-21808-S0/frameworks/20150402-121212-1644210368-5050-21808-0008/executors/myTopo-1-1430331608/runs/8ac24d9e-bb7f-49fe-9519-a07aa6a1d969'
 I0429 13:20:09.103682  5061 fetcher.cpp:64] Extracted resource
 '/tmp/mesos/slaves/20150402-121212-1644210368-5050-21808-S0/frameworks/20150402-121212-1644210368-5050-21808-0008/executors/myTopo-1-1430331608/runs/8ac24d9e-bb7f-49fe-9519-a07aa6a1d969/storm-mesos-0.9.3.tgz'
 into
 '/tmp/mesos/slaves/20150402-121212-1644210368-5050-21808-S0/frameworks/20150402-121212-1644210368-5050-21808-0008/executors/myTopo-1-1430331608/runs/8ac24d9e-bb7f-49fe-9519-a07aa6a1d969'
 I0429 13:20:09.109590  5061 fetcher.cpp:76] Fetching URI '
 http://hadoopmapr1.brewingintel.com:43938/conf/storm.yaml'
 I0429 13:20:09.109658  5061 fetcher.cpp:126] Downloading '
 http://hadoopmapr1.brewingintel.com:43938/conf/storm.yaml' to
 '/tmp/mesos/slaves/20150402-121212-1644210368-5050-21808-S0/frameworks/20150402-121212-1644210368-5050-21808-0008/executors/myTopo-1-1430331608/runs/8ac24d9e-bb7f-49fe-9519-a07aa6a1d969/storm.yaml'
 cp: cannot create regular file `storm-mesos*/conf': No such file or
 directory







Re: Storm Mesos Error

2015-04-29 Thread Tim Chen
Hi John,

Does your storm-mesos tarball have a folder storm-mesos-0.9.3 in it?

Tim

On Wed, Apr 29, 2015 at 11:26 AM, John Omernik j...@omernik.com wrote:

 Greetings all,

 I got my storm nimbus running, but when I try to run a test topology, the
 task enters a lost state and  I get the below in my stderr on the
 sandbox. Note, the URL for the storm.yaml works fine, not sure why it's
 causing an issue on the cp.




 cp: cannot create regular file `storm-mesos*/conf': No such file or
 directory

 Full:

 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0429 13:20:08.873922  5061 fetcher.cpp:76] Fetching URI
 'file:///mapr/brewpot/mesos/storm-mesos-0.9.3.tgz'
 I0429 13:20:08.874048  5061 fetcher.cpp:179] Copying resource from
 '/mapr/brewpot/mesos/storm-mesos-0.9.3.tgz' to
 '/tmp/mesos/slaves/20150402-121212-1644210368-5050-21808-S0/frameworks/20150402-121212-1644210368-5050-21808-0008/executors/myTopo-1-1430331608/runs/8ac24d9e-bb7f-49fe-9519-a07aa6a1d969'
 I0429 13:20:09.103682  5061 fetcher.cpp:64] Extracted resource
 '/tmp/mesos/slaves/20150402-121212-1644210368-5050-21808-S0/frameworks/20150402-121212-1644210368-5050-21808-0008/executors/myTopo-1-1430331608/runs/8ac24d9e-bb7f-49fe-9519-a07aa6a1d969/storm-mesos-0.9.3.tgz'
 into
 '/tmp/mesos/slaves/20150402-121212-1644210368-5050-21808-S0/frameworks/20150402-121212-1644210368-5050-21808-0008/executors/myTopo-1-1430331608/runs/8ac24d9e-bb7f-49fe-9519-a07aa6a1d969'
 I0429 13:20:09.109590  5061 fetcher.cpp:76] Fetching URI '
 http://hadoopmapr1.brewingintel.com:43938/conf/storm.yaml'
 I0429 13:20:09.109658  5061 fetcher.cpp:126] Downloading '
 http://hadoopmapr1.brewingintel.com:43938/conf/storm.yaml' to
 '/tmp/mesos/slaves/20150402-121212-1644210368-5050-21808-S0/frameworks/20150402-121212-1644210368-5050-21808-0008/executors/myTopo-1-1430331608/runs/8ac24d9e-bb7f-49fe-9519-a07aa6a1d969/storm.yaml'
 cp: cannot create regular file `storm-mesos*/conf': No such file or
 directory



Re: docker based executor

2015-04-18 Thread Tim Chen
Hi Tyson,

The error message you saw in the logs about the executor exited actually
just means the executor process has exited.

Since you're launching a custom executor with MesosSupervisor, it seems
like MesosSupervisor simply exited without reporting any task status.

Can you look at what's the actual logs of the container? They can be found
in the sandbox stdout and stderr logs.

Tim

On Fri, Apr 17, 2015 at 11:16 PM, Tyson Norris tnor...@adobe.com wrote:

  The sequence I see in the docker.log when my executor is launched is
 something like:
 GET /containers/id/json
 POST /containers/id/wait
 POST /containers/id/stop
 GET /containers/id/logs

   So I’m wondering if the slave is calling docker stop out of order in
  slave/containerizer/docker.cpp.
  I only see it being called in recover and destroy, and I don’t see logs
  indicating either of those happening, but I may be missing something else.

  Tyson

  On Apr 17, 2015, at 9:42 PM, Tyson Norris tnor...@adobe.com wrote:

  mesos master INFO log says:
 I0418 04:26:31.573763 6 master.cpp:3755] Sending 1 offers to framework
 20150411-165219-771756460-5050-1- (marathon) at scheduler-
 8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
 I0418 04:26:31.580003 9 master.cpp:2268] Processing ACCEPT call for
 offers: [ 20150418-041001-553718188-5050-1-O165 ] on
 slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051
 (mesos-slave1.service.consul) for framework
 20150411-165219-771756460-5050-1- (marathon) at
 scheduler-8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
 I0418 04:26:31.580369 9 hierarchical.hpp:648] Recovered cpus(*):6;
 mem(*):3862; disk(*):13483; ports(*):[31001-32000] (total allocatable:
 cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000]) on slave
 20150418-041001-553718188-5050-1-S0 from
 framework 20150411-165219-771756460-5050-1-
 I0418 04:26:32.48003612 master.cpp:3388] Executor
 insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 on
 slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051
 (mesos-slave1.service.consul) terminated with signal Unknown signal 127

  mesos slave  INFO log says:
 I0418 04:26:31.390650 8 slave.cpp:1231] Launching task
 mesos-slave1.service.consul-31000 for framework
 20150418-041001-553718188-5050-1-0001
 I0418 04:26:31.392432 8 slave.cpp:4160] Launching executor
 insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001
 in work directory '/tmp/mesos/slaves/20150418-041001-553718188-5050-

 1-S0/frameworks/20150418-041001-553718188-5050-1-0001/executors/insights-1-1429330829/runs/3cc411b0-c2e0-41ae-80c2-f0306371da5a'
 I0418 04:26:31.392587 8 slave.cpp:1378] Queuing task
 'mesos-slave1.service.consul-31000' for executor insights-1-1429330829
 of framework '20150418-041001-553718188-5050-1-0001
 I0418 04:26:31.397415 7 docker.cpp:755] Starting container
 '3cc411b0-c2e0-41ae-80c2-f0306371da5a' for executor
 'insights-1-1429330829' and framework
 '20150418-041001-553718188-5050-1-0001'
 I0418 04:26:31.397835 7 fetcher.cpp:238] Fetching URIs using command
 '/usr/libexec/mesos/mesos-fetcher'
 I0418 04:26:32.17747911 docker.cpp:1333] Executor for container
 '3cc411b0-c2e0-41ae-80c2-f0306371da5a' has exited
 I0418 04:26:32.17781711 docker.cpp:1159] Destroying container
 '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
 I0418 04:26:32.17799911 docker.cpp:1248] Running docker stop on
 container '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
 I0418 04:26:32.177620 6 slave.cpp:3135] Monitoring executor
 'insights-1-1429330829' of framework
 '20150418-041001-553718188-5050-1-0001' in container
 '3cc411b0-c2e0-41ae-80c2-f0306371da5a'
 I0418 04:26:32.47799012 slave.cpp:3186] Executor
 'insights-1-1429330829' of framework 20150418-041001-553718188-5050-1-0001
 has terminated with unknown status
 I0418 04:26:32.47939412 slave.cpp:2508] Handling status update
 TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task
 mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-
 5050-1-0001 from @0.0.0.0:0
 W0418 04:26:32.47964512 docker.cpp:841] Ignoring updating unknown
 container: 3cc411b0-c2e0-41ae-80c2-f0306371da5a
 I0418 04:26:32.48004110 status_update_manager.cpp:317] Received status
 update TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task
 mesos-slave1.service.consul-31000 of framework 20150418-04
 1001-553718188-5050-1-0001
 I0418 04:26:32.48107312 slave.cpp:2753] Forwarding the update
 TASK_LOST (UUID: 9dbc3859-0409-47b4-888f-2871b0b48dfa) for task
 mesos-slave1.service.consul-31000 of framework 20150418-041001-553718188-5
 050-1-0001 to master@172.17.1.33:5050

  docker.log says
 time=2015-04-18T04:26:31Z level=debug msg=Calling POST
 /containers/create

 time=2015-04-18T04:26:31Z level=info msg=POST
 /v1.18/containers/create?name=mesos-3cc411b0-c2e0-41ae-80c2-f0306371da5a


 time=2015-04-18T04:26:31Z level=info msg=+job
 

Re: docker based executor

2015-04-18 Thread Tim Chen
That does seem odd. How did you run this via mesos? Are you using your own
framework or going through another framework like Marathon?

And what does the TaskInfo look like?

Also note that if you're just testing a container, you don't want to set
the ExecutorInfo with a command as Executors in Mesos are expected to
communicate back to Mesos slave and implement the protocol between mesos
and executor. For a test image like this you want to set the CommandInfo
with a ContainerInfo holding the docker image instead.

Tim

On Sat, Apr 18, 2015 at 12:17 PM, Tyson Norris tnor...@adobe.com wrote:

  Hi Tim -
 Yes, I mentioned below when using a script like:
 --
 #!/bin/bash
 until false; do
   echo waiting for something to do something
   sleep 0.2
 done
 --

 In my sandbox stdout I get exactly 2 lines:
 waiting for something to do something
 waiting for something to do something

  Running this container any other way, e.g. docker run --rm -it
 testexecutor, the output is an endless stream of waiting for something to
 do something”.

  So something is stopping the container, as opposed to the container just
 exiting; at least that’s how it looks - I only get the container to stop
 when it is launched as an executor.

  Also, based on the docker logs, something is calling the
 /container/id/stop endpoint, *before* the /container/id/logs endpoint - so
 the stop is arriving before the logs are tailed, which also seems
  incorrect, and suggests that there is some code explicitly stopping the
  container, instead of the container exiting on its own.

  Thanks
 Tyson



  On Apr 18, 2015, at 3:33 AM, Tim Chen t...@mesosphere.io wrote:

  Hi Tyson,

  The error message you saw in the logs about the executor exited actually
 just means the executor process has exited.

  Since you're launching a custom executor with MesosSupervisor, it seems
 like MesosSupervisor simply exited without reporting any task status.

  Can you look at what's the actual logs of the container? They can be
 found in the sandbox stdout and stderr logs.

  Tim

 On Fri, Apr 17, 2015 at 11:16 PM, Tyson Norris tnor...@adobe.com wrote:

  The sequence I see in the docker.log when my executor is launched is
 something like:
 GET /containers/id/json
 POST /containers/id/wait
 POST /containers/id/stop
 GET /containers/id/logs

  So I’m wondering if the slave is calling docker-stop out of order in
 slave/containerizer/docker.cpp
 I only see it being called in recover and destroy and I don’t see logs
 indicating either of those happening, but I may be missing something else

  Tyson

  On Apr 17, 2015, at 9:42 PM, Tyson Norris tnor...@adobe.com wrote:

  mesos master INFO log says:
 I0418 04:26:31.573763 6 master.cpp:3755] Sending 1 offers to
 framework 20150411-165219-771756460-5050-1- (marathon) at scheduler-
 8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
 I0418 04:26:31.580003 9 master.cpp:2268] Processing ACCEPT call for
 offers: [ 20150418-041001-553718188-5050-1-O165 ] on
 slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051
 (mesos-slave1.service.consul) for framework
 20150411-165219-771756460-5050-1- (marathon) at
 scheduler-8b8d994e-5881-4687-81eb-5b3694c66342@172.17.1.34:44364
 I0418 04:26:31.580369 9 hierarchical.hpp:648] Recovered cpus(*):6;
 mem(*):3862; disk(*):13483; ports(*):[31001-32000] (total allocatable:
 cpus(*):6; mem(*):3862; disk(*):13483; ports(*):[31001-32000]) on slave
 20150418-041001-553718188-5050-1-S0 from
 framework 20150411-165219-771756460-5050-1-
 I0418 04:26:32.48003612 master.cpp:3388] Executor
 insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001 on
 slave 20150418-041001-553718188-5050-1-S0 at slave(1)@172.17.1.35:5051
 (mesos-slave1.service.consul) terminated with signal Unknown signal 127

  mesos slave  INFO log says:
 I0418 04:26:31.390650 8 slave.cpp:1231] Launching task
 mesos-slave1.service.consul-31000 for framework
 20150418-041001-553718188-5050-1-0001
 I0418 04:26:31.392432 8 slave.cpp:4160] Launching executor
 insights-1-1429330829 of framework 20150418-041001-553718188-5050-1-0001
 in work directory '/tmp/mesos/slaves/20150418-041001-553718188-5050-

 1-S0/frameworks/20150418-041001-553718188-5050-1-0001/executors/insights-1-1429330829/runs/3cc411b0-c2e0-41ae-80c2-f0306371da5a'
 I0418 04:26:31.392587 8 slave.cpp:1378] Queuing task
 'mesos-slave1.service.consul-31000' for executor insights-1-1429330829
 of framework '20150418-041001-553718188-5050-1-0001
 I0418 04:26:31.397415 7 docker.cpp:755] Starting container
 '3cc411b0-c2e0-41ae-80c2-f0306371da5a' for executor
 'insights-1-1429330829' and framework
 '20150418-041001-553718188-5050-1-0001'
 I0418 04:26:31.397835 7 fetcher.cpp:238] Fetching URIs using command
 '/usr/libexec/mesos/mesos-fetcher'
 I0418 04:26:32.17747911 docker.cpp:1333] Executor for container
 '3cc411b0-c2e0-41ae-80c2

Re: docker based executor

2015-04-18 Thread Tim Chen
Hi Tyson,

Glad you figured it out; sorry, I didn't realize you were running the mesos slave
in a docker container (which surely complicates things).

I have a series of patches pending to be merged that will also make
recovering tasks work when relaunching mesos-slave in a docker container.
Currently, even with --pid=host, when your slave dies your tasks are not able
to recover when it restarts.

Tim

On Sat, Apr 18, 2015 at 10:32 PM, Tyson Norris tnor...@adobe.com wrote:

  Yes, this was the problem - sorry for the noise.

  For the record, running mesos-slave in a container requires --pid=host”
 option as mentioned in MESOS-2183

  Now if docker-compose would just get released with the support for
 setting pid flag, life would be easy...

  Thanks
 Tyson

  On Apr 18, 2015, at 9:48 PM, Tyson Norris tnor...@adobe.com wrote:

  I think I may be running into this:
 https://issues.apache.org/jira/browse/MESOS-2183

  I’m trying to get docker-compose to launch slave with --pid=host, but
 having a few separate problems with that.

  I will update this thread when I’m able to test that.

  Thanks
 Tyson

  On Apr 18, 2015, at 1:14 PM, Tyson Norris tnor...@adobe.com wrote:

  Hi Tim - Actually, rereading your email: “For a test image like this you
 want to set the CommandInfo with a ContainerInfo holding the docker image
 instead.” It sounds like you are suggesting running the container as a task
 command? But part of what I’m doing is trying to provide a custom executor,
 so I think what I had before is appropriate - eventually I want to make the
 tasks launch the same way (e.g. similar to the existing mesos-storm framework),
 but I am trying to launch the executor as a container instead of a script
 command, which I think should be possible.

  So maybe you can comment on using a container within an ExecutorInfo as
 below?
 Docs here:
 https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L267
 suggest that ContainerInfo and CommandInfo should be provided - I am using
 setShell(false) to avoid changing the entry point, which already uses the
 default /bin/sh -c”.


  Thanks
 Tyson


  On Apr 18, 2015, at 1:03 PM, Tyson Norris tnor...@adobe.com wrote:

  Hi Tim -
 I am using my own framework - a modified version of mesos-storm,
 attempting to use docker containers instead of

  TaskInfo is like:
    TaskInfo task = TaskInfo.newBuilder()
        .setName("worker " + slot.getNodeId() + ":" + slot.getPort())
        .setTaskId(taskId)
        .setSlaveId(offer.getSlaveId())
        .setExecutor(ExecutorInfo.newBuilder()
            .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
            .setData(ByteString.copyFromUtf8(executorDataStr))
            .setCommand(CommandInfo.newBuilder()
                .setShell(false)
            )
            .setContainer(ContainerInfo.newBuilder()
                .setType(ContainerInfo.Type.DOCKER)
                .setDocker(ContainerInfo.DockerInfo.newBuilder()
                    .setImage("testexecutor")
                )
            )

  I understand this test image will be expected to fail  - I expect it to
 fail by registration timeout, and not by simply dying though. I’m only
 using a test image, because I see the same behavior with my actual image
 that properly handles mesos - executor registration protocol.

  I will try moving the Container inside the Command, and see if it
 survives longer.

  I see now at
 https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L675
 it mentions Either ExecutorInfo or CommandInfo should be set”

  Thanks
 Tyson


  On Apr 18, 2015, at 12:38 PM, Tim Chen t...@mesosphere.io wrote:

  That does seems odd, how did you run this via mesos? Are you using your
 own framework or through another framework like Marathon?

  And what does the TaskInfo look like?

  Also note that if you're just testing a container, you don't want to set
 the ExecutorInfo with a command as Executors in Mesos are expected to
 communicate back to Mesos slave and implement the protocol between mesos
 and executor. For a test image like this you want to set the CommandInfo
 with a ContainerInfo holding the docker image instead.

  Tim

 On Sat, Apr 18, 2015 at 12:17 PM, Tyson Norris tnor...@adobe.com wrote:

 Hi Tim -
 Yes, I mentioned below when using a script like:
 --
 #!/bin/bash
 until false; do
   echo waiting for something to do something
   sleep 0.2
 done
 --

 In my sandbox stdout I get exactly 2 lines:
 waiting for something to do something
 waiting for something to do something

  Running this container any other way, e.g. docker run --rm -it
 testexecutor, the output is an endless stream of waiting for something to
 do something”.

  So something is stopping the container

Re: Spark on Mesos / Executor Memory

2015-04-11 Thread Tim Chen
(Adding spark user list)

Hi Tom,

If I understand correctly, you're saying that you're running into memory
problems because the scheduler is allocating too many CPUs and not enough
memory to accommodate them, right?

In the case of fine-grained mode I don't think that's a problem, since we have
a fixed amount of CPU and memory per task.
However, in coarse-grained mode you can run into that problem if you're within
the spark.cores.max limit and memory is a fixed number.

I have a patch out to configure the maximum number of CPUs a coarse-grained
executor should use, and it also allows multiple executors in coarse-grained
mode. So you could, say, launch multiple executors of at most 4 cores each,
with spark.executor.memory (plus overhead, etc.) per executor on a slave. (
https://github.com/apache/spark/pull/4027)
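
For context, a sketch of the per-application knobs that exist today; the master
URL, class, and jar names here are placeholders:

  spark-submit --master mesos://zk://leader.mesos:2181/mesos \
    --conf spark.mesos.coarse=true \
    --conf spark.cores.max=16 \
    --conf spark.executor.memory=4g \
    --class org.example.MyJob myjob.jar

The per-executor core cap and multiple coarse-grained executors per slave are
what the linked pull request is about.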

It also might be interesting to include a cores-to-memory multiplier so that
with a larger number of cores we scale the memory by some factor, but I'm not
entirely sure that's intuitive to use or that people would know what to set it
to, as that can likely change with different workloads.

Tim







On Sat, Apr 11, 2015 at 9:51 AM, Tom Arnfeld t...@duedil.com wrote:

 We're running Spark 1.3.0 (with a couple of patches over the top for
 docker related bits).

 I don't think SPARK-4158 is related to what we're seeing; things do run
 fine on the cluster, given a ridiculously large executor memory
 configuration. As for SPARK-3535, although that looks useful I think we're
 seeing something else.

 Put a different way, the amount of memory required at any given time by
 the spark JVM process is directly proportional to the amount of CPU it has,
 because more CPU means more tasks and more tasks means more memory. Even if
 we're using coarse mode, the amount of executor memory should be
 proportionate to the amount of CPUs in the offer.

 On 11 April 2015 at 17:39, Brenden Matthews bren...@diddyinc.com wrote:

 I ran into some issues with it a while ago, and submitted a couple PRs to
 fix it:

 https://github.com/apache/spark/pull/2401
 https://github.com/apache/spark/pull/3024

 Do these look relevant? What version of Spark are you running?

 On Sat, Apr 11, 2015 at 9:33 AM, Tom Arnfeld t...@duedil.com wrote:

 Hey,

 Not sure whether it's best to ask this on the spark mailing list or the
 mesos one, so I'll try here first :-)

 I'm having a bit of trouble with out of memory errors in my spark
 jobs... it seems fairly odd to me that memory resources can only be set at
 the executor level, and not also at the task level. For example, as far as
 I can tell there's only a *spark.executor.memory* config option.

 Surely the memory requirements of a single executor are quite
 dramatically influenced by the number of concurrent tasks running? Given a
 shared cluster, I have no idea what % of an individual slave my executor is
 going to get, so I basically have to set the executor memory to a value
 that's correct when the whole machine is in use...

 Has anyone else running Spark on Mesos come across this, or maybe
 someone could correct my understanding of the config options?

 Thanks!

 Tom.






Re: [RESULT][VOTE] Release Apache Mesos 0.22.0 (rc4)

2015-03-25 Thread Tim Chen
Hi there,

You can already pass in multiple values separated by commas
(cgroups/cpu,cgroups/mem,posix/disk).

Tim
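
That is, something along these lines (the master URL is just an example):

  mesos-slave --master=zk://leader.mesos:2181/mesos \
              --isolation=cgroups/cpu,cgroups/mem,posix/disk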

On Wed, Mar 25, 2015 at 12:46 AM, Dick Davies d...@hellooperator.net
wrote:

 Thanks Craig, that's really handy!

 Dumb question for the list: are there any plans to support multiple
 isolation flags somehow?
 I need cgroups, but would really like the disk quota feature too (and
 network isolation come to that.
 And a pony).

 On 25 March 2015 at 01:00, craig w codecr...@gmail.com wrote:
  Congrats, I was working on a quick post summarizing what's new (based on
  jira and the video from niklas) which I just posted (great timing)
 
  http://craigwickesser.com/2015/03/mesos-022-release/
 
  On Tue, Mar 24, 2015 at 8:30 PM, Paul Otto p...@ottoops.com wrote:
 
  This is awesome! Thanks for all the hard work you all have put into
 this!
  I am really excited to update to the latest stable version of Apache
 Mesos!
 
  Regards,
  Paul
 
 
  Paul Otto
  Principal DevOps Architect, Co-founder
  Otto Ops LLC | OttoOps.com
  970.343.4561 office
  720.381.2383 cell
 
  On Tue, Mar 24, 2015 at 6:04 PM, Niklas Nielsen nik...@mesosphere.io
  wrote:
 
  Hi all,
 
  The vote for Mesos 0.22.0 (rc4) has passed with the
  following votes.
 
  +1 (Binding)
  --
  Ben Mahler
  Tim St Clair
  Adam Bordelon
  Brenden Matthews
 
  +1 (Non-binding)
  --
  Alex Rukletsov
  Craig W
  Ben Whitehead
  Elizabeth Lingg
  Dario Rexin
  Jeff Schroeder
  Michael Park
  Alexander Rojas
  Andrew Langhorn
 
  There were no 0 or -1 votes.
 
  Please find the release at:
  https://dist.apache.org/repos/dist/release/mesos/0.22.0
 
  It is recommended to use a mirror to download the release:
  http://www.apache.org/dyn/closer.cgi
 
  The CHANGELOG for the release is available at:
 
 
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.22.0
 
  The mesos-0.22.0.jar has been released to:
  https://repository.apache.org
 
  The website (http://mesos.apache.org) will be updated shortly to
 reflect
  this release.
 
  Thanks,
  Niklas
 
 
 
 
 
  --
 
  https://github.com/mindscratch
  https://www.google.com/+CraigWickesser
  https://twitter.com/mind_scratch
  https://twitter.com/craig_links



Re: Mesos slaves connecting but not active.

2015-03-23 Thread Tim Chen
How many containers are you running, and what is your system like?

Also are you able to capture through perf or strace what docker rm is
blocked on?

Tim


On Mon, Mar 23, 2015 at 10:12 AM, Giulio Eulisse giulio.euli...@cern.ch
wrote:

 I suspect my problem is that docker rm takes forever in my case. I'm not
 running docker in docker though.


 On 23 Mar 2015, at 18:01, haosdent wrote:

  Are your issue relevant to this?
 https://issues.apache.org/jira/browse/MESOS-2115

 On Tue, Mar 24, 2015 at 12:52 AM, Giulio Eulisse giulio.euli...@cern.ch
 wrote:

  Hi,

  I'm running 0.20.1 and I seem to have trouble due to the fact that a
 mesos slave is not able to recover the docker containers after a restart,
 resulting in a very long wait.

 Is this some known issue?

 --
 Ciao,
 Giulio




 --
 Best Regards,
 Haosdent Huang




Re: mesos on coreos

2015-03-10 Thread Tim Chen
Hi all,

As Alex said you can run Mesos in CoreOS without Docker if you put in the
dependencies in.

It is a common ask though to run mesos-slave in a Docker container in
general, whether on CoreOS or not. It's definitely a bit involved, as you need
to mount in a directory for persisting the work dir and also mount in /sys/fs
for cgroups; you should also use the --pid=host flag (available since Docker
1.5) so it shares the host pid namespace.
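
A rough sketch of such an invocation; the image name and paths are
illustrative, and the exact mounts vary by Docker and Mesos version:

  docker run -d --net=host --pid=host --privileged \
    -v /var/lib/mesos:/var/lib/mesos \
    -v /sys/fs/cgroup:/sys/fs/cgroup \
    -v /var/run/docker.sock:/var/run/docker.sock \
    some-mesos-slave-image \
    mesos-slave --master=zk://leader.mesos:2181/mesos \
                --containerizers=docker,mesos \
                --work_dir=/var/lib/mesos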

Although you get a lot less isolation, there are still motivations to run
slave in Docker regardless.

One thing that's missing from the mesos docker containerizer is that it
won't be able to recover tasks on restart, and I have a series of patches
pending review to fix that.

Tim

On Tue, Mar 10, 2015 at 3:16 PM, Alex Rukletsov a...@mesosphere.io wrote:

 My 2¢.


  First of all, it doesn’t look like a great idea to package a resource
  manager into Docker, putting one more abstraction layer between the resources
  themselves and the resource manager.


 You can run mesos-slave on CoreOS node without putting it into a Docker
 container.

 —Alex



Re: spark on mesos.

2015-02-27 Thread Tim Chen
Hi Dan,

You won't see active frameworks happening until you start running a Spark
job. This is because each Spark job actually launches a new Spark framework
that is scheduling for that single job.

Tim

On Fri, Feb 27, 2015 at 1:39 PM, Dan Dong dongda...@gmail.com wrote:

 Hi, Dick,
   By Spark daemons I mean Master and Worker process running on master and
 slaves respectively when you run sbin/start-master.sh and
 ./sbin/start-slaves.sh.

 Cheers,
 Dan


 2015-02-27 15:02 GMT-06:00 Dick Davies d...@hellooperator.net:

 What do you mean by spark daemons?

 the spark shell (or any other spark application) acts as a Mesos
 framework,
 so until that's running spark isn't 'on' Mesos.

 On 27 February 2015 at 16:23, Dan Dong dongda...@gmail.com wrote:
  Hi, All,
When I configured and started spark daemons, why I could not see it in
  Active Frameworks on Mesos UI(for Hadoop, it will show up immediately
 when
  hadoop daemons started)?. I can see the spark framework on mesos UI only
  when I run the spark-shell command interactively. Is it normal?
 
  export
  MESOS_NATIVE_LIBRARY=/home/ubuntu/mesos-0.21.0/build/lib/
 libmesos-0.21.0.so
  export
  PROTOBUF_JAR=/home/ubuntu/hadoop-2.5.0-cdh5.2.0/protobuf-java-2.5.0.jar
  export MESOS_JAR=/home/ubuntu/hadoop-2.5.0-cdh5.2.0//mesos-0.21.0.jar
  export
 
 SPARK_EXECUTOR_URI=hdfs://clus-1:9000/user/ubuntu/spark-1.1.1-bin-hadoop2.4_mesos.tar.gz
  export MASTER=mesos://clus-1:5050/mesos
 
 
  Cheers,
  Dan
 





Re: Updating FrameworkInfo settings

2015-02-24 Thread Tim Chen
Mesos checkpoints the FrameworkInfo to disk and recovers it on relaunch.

I don't think we expose any API to remove the framework manually if you
really want to keep the FrameworkID. If you hit the failover timeout,
the framework will get removed from the master and slave.

I think for now the best way is to just use a new FrameworkID when you want to
change the FrameworkInfo.

Tim



On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr tp...@hubspot.com wrote:

 Hey folks,

 Is there a best practice for rolling out FrameworkInfo changes? We need to
 set checkpoint to true, so I redeployed our framework with the new
 settings (with tasks still running), but when I hit a slave's stats.json
 endpoint, it appears that the old FrameworkInfo data is still there (which
 makes sense since there's active executors running). I then tried draining
 the tasks and completely restarting a Mesos slave, but still no luck.

 Is there anything additional / special I need to do here? Is some part of
 Mesos caching FrameworkInfo based on the framework ID?

 Another wrinkle with our setup is we have a rather large failover_timeout
 set for the framework -- maybe that's affecting things too?

 Thanks,
 Tom



Re: preparing a host on task launch event

2015-02-23 Thread Tim Chen
Hi Michael,

If you're only launching docker containers, one possibility is to also use
the new powerstrips extension for Docker:

https://github.com/ClusterHQ/powerstrip

You can override default docker behaviors and do custom actions on the host
before a container is launched. Note that, as stated on its GitHub site, this
is not a production-ready solution.

Tim

On Sun, Feb 22, 2015 at 5:16 PM, Michael Neale michael.ne...@gmail.com
wrote:

  Hi Adam - yes, the hooks one possibly fits the bill - not entirely clear how
  to use it yet. The persistent one *should* work, but the problem for me
  isn't so much the management of the volume, but the preparation of it (I have
  no need to make tasks sticky when the data can freely move around anyway).

 On Sun Feb 22 2015 at 8:04:58 PM Adam Bordelon a...@mesosphere.io wrote:

 Michael, check out https://issues.apache.org/jira/browse/MESOS-2060 for
 a recent feature to provide task launch hooks like you're asking about,
 although it acts as a master/slave-specific library rather than a
 task-specific prep step, so you'll have to customize the behavior based on
 some information about the task.
 Alternatively, you could use the upcoming Persistent Volumes feature (
 https://issues.apache.org/jira/browse/MESOS-1554) in such a way that you
 first launch a task to prep the state in a volume, and after its completion
 launch the long-running docker task that uses that volume.

 On Thu, Feb 19, 2015 at 6:45 PM, Michael Neale michael.ne...@gmail.com
 wrote:

 (in a vain help to try and clarify) - I started with a similar pattern
 to what I have seen with redis - people ensure there is a redis on each
 host listening on a known port so apps can use it (by setting a unique
 constraint on host name, and then making sure number of instances == size
 of cluster). I started doing the same thing with a service that provides
 the volume data - this works great - but has to prepare the data *before*
 the docker container launches - or perhaps just as it is launching (docker
 can't see host mounts in bind mounts after it has launched - for boring
 reasons...).

 On Fri Feb 20 2015 at 1:38:20 PM Michael Neale michael.ne...@gmail.com
 wrote:

 well not specifically talking about the mesos containerizer - it was
 just something I tried. The main aim is to deploy containers that can be
 bind mounted in a volume which is prepared on the host - the container
 apps (docker apps) being deployed don't particularly care how that was
 prepared - just that it was there. I was hoping for another task (or
 something) that had run before had prepared it (in some cases it may simply
 be rsyncing some data in place, in others, mounting a device - result is
 the same - a volume/path can be provided to the docker container).

 Does that make a little more sense ? (a bit hard to explain).

 On Fri Feb 20 2015 at 1:23:46 PM Tim Chen t...@mesosphere.io wrote:

 Hi Michael,

Can you elaborate on how you use the Mesos containerizer to prepare
your host?

In general, hooks are exactly for this purpose; work is underway right
now to define the hooks in Mesos and also to allow them to be customized.

 Tim

 On Thu, Feb 19, 2015 at 6:18 PM, Michael Neale 
 michael.ne...@gmail.com wrote:

 I am currently using marathon and have a need to prepare the host
 in some cases (currently looking at mounting a volume that the task may
 need - how that device is created is out of band BTW).

 In theory this would be ideally done on some hook - but I am not sure
 where (the hook would be called before the task proper is launched) - it
 could be simply as part of a task launch script if a plain command.

 With the docker containerizer - I can actually use priv mode and
 control the host (if I want) - but then I would like to have this task 
 run
 separately to the main marathon long running task (as it has extra access
 which normally apps don't need) - I could bind mount in the docker socket
 and launch a non priv container from within the mesos launched start
 container ...

 I can also use the default (?) mesos containerizer - which seems to
 let me run docker commands (ie bypassing the firstclass support in mesos
 for docker) but this feels like I am doing it wrong - is that wrong?

 So in summary: is there a concept of a pre-launch step, and should I
 be working around the docker containerizer by using the mesos default
 containerizer instead?

 pointers appreciated.






Re: Spark on Mesos Submitted from multiple users

2015-02-20 Thread Tim Chen
Hi John,

I'm currently working on a cluster mode design and a PoC, but it also does not
share drivers, as Spark AFAIK is designed not to share drivers between
apps.

The cluster mode for Mesos is going to be a way to submit apps to your
cluster, and each app will be running in the cluster as a new driver that
is managed by a cluster dispatcher, and you don't need to wait for the
client to finish to get all the results.

I'll be updating the JIRA and PR once I have this ready, which is aimed for
this next release.

Tim

On Fri, Feb 20, 2015 at 8:09 AM, John Omernik j...@omernik.com wrote:

 Tim - on the Spark list your name was brought up in relation to
 https://issues.apache.org/jira/browse/SPARK-5338 I asked this question
 there but I'll ask it here too, what can I do to help on this. I am
 not a coder unfortunately, but I am user willing to try things :) This
 looks really cool for what we would like to do with Spark and Mesos
 and I'd love to be able to contribute and/or get an understanding of a
 (even tentative) timeline.  I am not trying to be pushy, I understand
 lots of things are likely on your agenda :)

 John



 On Tue, Feb 17, 2015 at 6:33 AM, John Omernik j...@omernik.com wrote:
  Tim, thanks, that makes sense, the checking for ports and incrementing
  was new to me, so hearing about that helps.  Next question is it
  possible, for a driver to be shared by the same user some how? This
  would be desirable from the standpoint of running an iPython notebook
  server (Jupyter Hub).  I have it setup that every time a notebook is
  opened, that the imports for spark are run, (the idea is the
  environment is ready to go for analysis) however, if each user, has 5
  notebooks open at any time, that would be a lot of spark drivers! But,
  I suppose before asking that, I should ask about the sequence of
  drivers... are they serial? i.e. can one driver server only one query
  at a time?   What is the optimal size for a driver (in memory) what
  does the memory affect in the driver? I.e. is a driver with smaller
  amounts of memory limited in the number of results etc?
 
  Lots of questions here, if these are more spark related questions, let
  me know, I can hop over to spark users, but since I am curious on
  spark on mesos, I figured I'd try here first.
 
  Thanks for your help!
 
 
 
  On Mon, Feb 16, 2015 at 10:30 AM, Tim Chen t...@mesosphere.io wrote:
  Hi John,
 
  With Spark on Mesos, each client (spark-submit) starts a SparkContext
 which
  initializes its own SparkUI and framework. There is a default 4040 for
 the
  Spark UI port, but if it's occupied Spark automatically tries ports
  incrementally for you, so your next could be 4041 if it's available.
 
  Driver is not shared between user, each user creates its own driver.
 
  About slowness it's hard to say without any information, you need to
 tell us
  your cluster setup, what mode you're Mesos with and if there is anything
  else running in the cluster, the job, etc.
 
  Tim
 
  On Sat, Feb 14, 2015 at 5:06 PM, John Omernik j...@omernik.com wrote:
 
  Hello all, I am running Spark on Mesos and I think I am love, but I
  have some questions. I am running the python shell via iPython
  Notebooks (Jupyter) and it works great, but I am trying to figure out
  how things are actually submitted... like for example, when I submit
  the spark app from the iPython notebook server, I am opening a new
  kernel and I see a new spark submit (similar to the below) for each
  kernel... but, how is that actually working on the cluster, I can
  connect to the spark server UI on 4040, but shouldn't there be a
  different one for each driver? Is that causing conflicts? after a
  while things seem to run slow is this due to some weird conflicts?
  Should I be specifying unique ports for each server? Is the driver
  shared between users? what about between kerne's for the same user?
  Curious if anyone has any insight.
 
  Thanks!
 
 
  java org.apache.spark.deploy.SparkSubmitDriverBootstrapper --master
  mesos://hadoopmapr3:5050 --driver-memory 1G --executor-memory 4096M
  pyspark-shell
 
 



Re: preparing a host on task launch event

2015-02-19 Thread Tim Chen
Hi Michael,

Can you elaborate on how you use the Mesos containerizer to prepare your
host?

In general, hooks are exactly for this purpose; work is underway right now
to define the hooks in Mesos and also to allow them to be customized.

Tim

On Thu, Feb 19, 2015 at 6:18 PM, Michael Neale michael.ne...@gmail.com
wrote:

 I am currently using marathon and have a need to prepare the host in
 some cases (currently looking at mounting a volume that the task may need -
 how that device is created is out of band BTW).

 In theory this would be ideally done on some hook - but I am not sure
 where (the hook would be called before the task proper is launched) - it
 could be simply as part of a task launch script if a plain command.

 With the docker containerizer - I can actually use priv mode and control
 the host (if I want) - but then I would like to have this task run
 separately to the main marathon long running task (as it has extra access
 which normally apps don't need) - I could bind mount in the docker socket
 and launch a non priv container from within the mesos launched start
 container ...

 I can also use the default (?) mesos containerizer - which seems to let me
 run docker commands (ie bypassing the firstclass support in mesos for
 docker) but this feels like I am doing it wrong - is that wrong?

 So in summary: is there a concept of a pre-launch step, and should I be
 working around the docker containerizer by using the mesos default
 containerizer instead?

 pointers appreciated.



Re: Spark on Mesos Submitted from multiple users

2015-02-16 Thread Tim Chen
Hi John,

With Spark on Mesos, each client (spark-submit) starts a SparkContext which
initializes its own SparkUI and framework. There is a default 4040 for the
Spark UI port, but if it's occupied Spark automatically tries ports
incrementally for you, so your next could be 4041 if it's available.
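
If you would rather pin the UI port per driver instead of relying on the
auto-increment, you can set it explicitly; the port value and the app file here
are just examples:

  spark-submit --master mesos://hadoopmapr3:5050 \
    --conf spark.ui.port=4050 \
    my_app.py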

Driver is not shared between user, each user creates its own driver.

About the slowness, it's hard to say without any information; you need to tell
us your cluster setup, what mode you're running Mesos in, and whether there is
anything else running in the cluster, the job itself, etc.

Tim

On Sat, Feb 14, 2015 at 5:06 PM, John Omernik j...@omernik.com wrote:

 Hello all, I am running Spark on Mesos and I think I am in love, but I
 have some questions. I am running the python shell via iPython
 Notebooks (Jupyter) and it works great, but I am trying to figure out
 how things are actually submitted... like for example, when I submit
 the spark app from the iPython notebook server, I am opening a new
 kernel and I see a new spark-submit (similar to the below) for each
 kernel... but how is that actually working on the cluster? I can
 connect to the Spark server UI on 4040, but shouldn't there be a
 different one for each driver? Is that causing conflicts? After a
 while things seem to run slow; is this due to some weird conflicts?
 Should I be specifying unique ports for each server? Is the driver
 shared between users? What about between kernels for the same user?
 Curious if anyone has any insight.

 Thanks!


 java org.apache.spark.deploy.SparkSubmitDriverBootstrapper --master
 mesos://hadoopmapr3:5050 --driver-memory 1G --executor-memory 4096M
 pyspark-shell



Re: Mesos 0.22.0

2015-01-20 Thread Tim Chen
Hi Dave,

Sorry about the blog post, I lost track of it in the middle of other tasks.

I'm going to update the website and the blog post very soon.

Tim

On Tue, Jan 20, 2015 at 12:37 PM, Dave Lester d...@davelester.org wrote:

  Thanks Niklas for kicking off this thread. +1 to you as release manager,
 could you please create a JIRA ticket to track the progress so we could
 subscribe?

 A minor correction to your email, Mesos 0.21.1 was voted on in late
 December (see http://markmail.org/message/e2iam7guxukl3r6c), however the
 website wasn't updated nor was blogged about like we normally do. Tim
 (cc'd), do you still plan to make this update? Any way others can help? I'd
 like to see this updated before we cut another release.

 +1 to Chris' suggestion of a page to plan future release managers, this
 would bring some longer-term clarity to who is driving feature releases and
 what they include.

 Dave

 On Tue, Jan 20, 2015, at 12:03 PM, Chris Aniszczyk wrote:

 definite +1, lets keep the release rhythm going!

 maybe some space on the wiki for release planning / release managers would
 be a step forward

 On Tue, Jan 20, 2015 at 1:59 PM, Joe Stein joe.st...@stealth.ly wrote:

 +1

 so excited for the persistence primitives, awesome!

 /***
  Joe Stein
  Founder, Principal Consultant
  Big Data Open Source Security LLC
 http://www.stealth.ly
  Twitter: @allthingshadoop http://www.twitter.com/allthingshadoop
 /

 On Tue, Jan 20, 2015 at 2:55 PM, John Pampuch j...@mesosphere.io wrote:

 +1!

 -John


 On Tue, Jan 20, 2015 at 11:52 AM, Niklas Nielsen nik...@mesosphere.io
 wrote:

  Hi all,
 
  We have been releasing major versions of Mesos roughly every second month
  (current average is ~66 days) and we are now 2 months after the 0.21.0
  release, so I would like to propose that we start planning for 0.22.0
  Not only in terms of timing, but also because we have some exciting
  features which are getting ready, including persistence primitives,
 modules
  and SSL support (I probably forgot a ton - please chime in).
 
  Since we are stakeholders in SSL and Modules, I would like to volunteer
 as
  release manager.
  Like in previous releases, I'd be happy to collaborate with co-release
  managers to make 0.22.0 a successful release.
 
  Niklas
 






 --
 Cheers,

 Chris Aniszczyk | Open Source | Twitter, Inc.
 @cra | +1 512 961 6719





Re: implementing data locality via mesos resource offers

2015-01-16 Thread Tim Chen
Hi Douglas,

The simplest way that Mesos supports this is to add attributes via CLI flags
when you launch a mesos slave. When this slave's resources are being
offered, the offer will also include all the attributes you've tagged.

This is currently static information set at launch, and I believe there are JIRA
tickets to make this dynamic (updatable at runtime).
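
Concretely, the static version looks something like this at slave start; the
attribute names and values are just examples:

  mesos-slave --master=zk://leader.mesos:2181/mesos \
              --attributes="dataset:genomes_batch7;rack:r42"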

Tim

On Thu, Jan 15, 2015 at 7:23 PM, Douglas Voet dv...@broadinstitute.org
wrote:

 Hello,

 I am evaluating mesos in the context of running analyses of many large
 files. I only want to download a file to a small subset of my nodes and
 route the related processing there. The mesos paper talks about using
 resource offers as a mechanism to achieve data locality but I can't find
 any reference to how one might do this in the documentation. How would a
 mesos slave know what data is available keeping in mind that that might
 change over time? How can I configure a slave to include this information
 in resource offers?

 Thanks in advance for any pointers.

 -Doug



Re: implementing data locality via mesos resource offers

2015-01-16 Thread Tim Chen
Hi Sharma,

You're correct, and that's how most schedulers handle this: they manage the
locality information themselves.

We've been considering and looking for primitives to help on this front though,
so if you have any input, let us know how to manage locality in a way that fits
at the level of Mesos.

Tim

On Fri, Jan 16, 2015 at 9:34 AM, Sharma Podila spod...@netflix.com wrote:

 Using the attributes would be the simplest way, if the slave were to
 support dynamic updates of the attributes. The JIRA that Tim references
 would be nice! Otherwise one would have to resort to something like a
 wrapper script of the mesos-slave process that detects new data
 availability and restarts mesos-slave with new attributes in cmdline.
 Restarts may be OK when slaves are run to checkpoint state and recover
 state upon restart.
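
 A sketch of that wrapper idea; everything here (paths, the attribute name, the
 marker file, the poll interval) is a placeholder:

   #!/bin/bash
   # Restart mesos-slave whenever the local dataset marker changes, so the
   # slave re-registers with an up-to-date attribute.
   while true; do
     MARKER=$(cat /data/current-dataset 2>/dev/null || echo none)
     mesos-slave --master=zk://leader.mesos:2181/mesos \
                 --attributes="dataset:${MARKER}" &
     SLAVE_PID=$!
     # Poll until the marker changes, then restart the slave with the new attribute.
     while [ "$(cat /data/current-dataset 2>/dev/null || echo none)" = "${MARKER}" ]; do
       sleep 60
     done
     kill "${SLAVE_PID}"
     wait "${SLAVE_PID}"
   done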

 Another possibility in the interim would be for the framework scheduler to
 launch the task that does the download of the file(s) to the small subset
 of nodes. Then, the scheduler can maintain this state information and
 assign the tasks based on that. This has the additional advantage of
 maintaining the list of that subset of nodes in a more dynamic way, if that
 is useful to you.

 In general, I am a fan of achieving data locality via the scheduler's
 state info. In a more generic scenario, the data would be created
 dynamically by tasks previously run (instead of just an initial download)
 and therefore locality for such data is easier done via the scheduler.



 On Fri, Jan 16, 2015 at 12:15 AM, Tim Chen t...@mesosphere.io wrote:

 Hi Douglas,

 The simplest way Mesos can support this is to add attributes via CLI
 flags when you launch a mesos-slave. When this slave's resources are
 being offered, the offer will also include all the attributes you've tagged.

 This is currently static information set at launch, and I believe there are
 JIRA tickets to make it dynamic (updatable at runtime).

 Tim

 On Thu, Jan 15, 2015 at 7:23 PM, Douglas Voet dv...@broadinstitute.org
 wrote:

 Hello,

 I am evaluating mesos in the context of running analyses of many large
 files. I only want to download a file to a small subset of my nodes and
 route the related processing there. The mesos paper talks about using
 resource offers as a mechanism to achieve data locality but I can't find
 any reference to how one might do this in the documentation. How would a
 mesos slave know what data is available keeping in mind that that might
 change over time? How can I configure a slave to include this information
 in resource offers?

 Thanks in advance for any pointers.

 -Doug






Re: Accessing stdout/stderr of a task programmattically?

2015-01-13 Thread Tim Chen
You can get the slave_id, framework_id and executor_id of a task all from
state.json.

ie:


{
  "executor_id": "20141231-115728-16777343-5050-49193-S0",
  "framework_id": "20141231-115728-16777343-5050-49193-",
  "id": "1",
  "labels": [ ],
  "name": "Task 1",
  "resources": {
    "cpus": 6,
    "disk": 0,
    "mem": 13312
  },
  "slave_id": "20141231-115728-16777343-5050-49193-S0",
  "state": "TASK_KILLED",
  "statuses": [
    {
      "state": "TASK_RUNNING",
      "timestamp": 1420056049.88177
    },
    {
      "state": "TASK_KILLED",
      "timestamp": 1420056124.66483
    }
  ]
}
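
If you have curl and jq handy, a rough sketch for pulling those IDs out for a
given task id (the master host and task id are just placeholders) would be:

$ curl -s http://master.example.com:5050/state.json \
    | jq '.frameworks[]
          | (.tasks[]?, .completed_tasks[]?)
          | select(.id == "1")
          | {slave_id, framework_id, executor_id}'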


On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg dsg123456...@gmail.com
wrote:

 I was trying to figure out how to programmatically access a task's stdout &
  stderr, and I don't fully understand how the URL is constructed. It seems
 to be of the form http://
 $slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something

 What is the $something? Is there an easier way, given just the task_id, to
 find where the output is?

 Thanks,
 David



Re: concepts clarification between framework,executor, and task?

2015-01-09 Thread Tim Chen
Hi Sujin,

A framework can either be a long-running service or run just for the
duration of its tasks.
A custom executor can also run longer than the tasks themselves.
Tasks have several states, and a few of them (TASK_KILLED, TASK_FAILED,
TASK_FINISHED, TASK_ERROR) are terminal states that make a task considered completed.

An executor by itself requires resources to run, just like any other task, as
it may need resources to fork and run arbitrary commands.

Tim

On Fri, Jan 9, 2015 at 7:17 AM, sujinzhao 43183...@qq.com wrote:

 According to my understanding, one framework may contain several
 executors, and one executor may include many tasks. One framework is
 considered completed iff all of its executors are completed, and one
 executor is considered completed iff all of its tasks are completed; one
 task is considered completed iff its state equals TASK_FINISHED. Please
 correct me if I am wrong.

 Another question: I know that one task may occupy resources, but how about an
 executor? While I am reading the code, it seems that an executor also needs
 resources? What's the reason?

 Thank you very much! if you could provide me with some documents that
 would be much helpful!


Re: Architecture question

2015-01-09 Thread Tim Chen
Hi Srinivas,

Can you elaborate more on what maintaining a dynamic count of
executors means?

You can always write a custom framework that provides the scheduling,
similar to what Marathon or Aurora does, if they don't fit your needs.

Tim

On Fri, Jan 9, 2015 at 1:18 PM, Srinivas Murthy srinimur...@gmail.com
wrote:

 Thanks Vinod. I need to deal with a very conservative management that
 needs a lot of selling for each additional open source framework. I have
 glossed over Marathon so far. I was hoping to hear there's some way I could
 override the Scheduler and work with what I have, but I hear you say that
 isn't the route I should be pursuing :-)


 On Fri, Jan 9, 2015 at 11:43 AM, Vinod Kone vinodk...@apache.org wrote:

 Have you looked at Aurora or Marathon? They have some (most?) of the
 features you are looking for.

 On Fri, Jan 9, 2015 at 10:59 AM, Srinivas Murthy srinimur...@gmail.com
 wrote:

 We have a legacy system with home-brewed workflows defined in XPDL,
 running across multiple dozens of nodes. Resources are mapped in XML
 definition files, and availability of resource to a given task at hand
 managed by a custom written job scheduler. Jobs communicate status with
 callback/JMS messages. Job completion decides steps in the workflow.

 Into this ecosystem now come some Hadoop/Spark jobs.
 I am tentatively exploring Mesos to manage this disparate set of
 clusters.
 How can I maintain a dynamic count of executors, and how can I provide
 dynamic workflow orchestration to pull off the above architecture in the Mesos
 world? Sorry for the noob question!






Re: Running services on all slaves

2015-01-08 Thread Tim Chen
Hi Itamar,

You can control the amount of CPU and memory that the slave advertises to
the master for scheduling via the slave's resources flag. So you can
advertise only 12 CPUs and leave 4 for your services if you want.
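
For example (a rough sketch; the master URL and numbers are just placeholders
for your setup):

$ mesos-slave --master=zk://zk.example.com:2181/mesos \
    --resources="cpus:12;mem:24576"

Mesos will then never offer more than 12 CPUs / 24 GB of memory from that host,
leaving the rest for your system-level services.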

That said, there have been discussions for a while about launching multiple
co-located tasks at once (aka pods), but it's not yet concrete how that will
really look in Mesos.

Tim



On Wed, Jan 7, 2015 at 11:30 PM, Itamar Ostricher ita...@yowza3d.com
wrote:

 Thanks everybody for all your insights!

 I totally agree with the last response from Tom.
 The per-node services definitely belong to the level that provisions the
 machine and the mesos-slave service itself (in our case, pre-configured GCE
 images).

 So I guess the problem I wanted to solve is more general - how can I make
 sure there are resources reserved for all of the system-level stuff that
 are running outside of the mesos context?
 To be more specific, if I have a machine with 16 CPUs, it is common that
 my framework will schedule 16 heavy number-crunching processes on it.
 This can starve anything else that's running on the machine... (like the
 logging aggregation service, and the mesos-slave service itself)
 (this probably explains the phenomenon of lost tasks we've been observing)
 What's the best-practice solution for this situation?

 On Wed, Jan 7, 2015 at 2:09 AM, Tom Arnfeld t...@duedil.com wrote:

 I completely agree with Charles, though I think I can appreciate what
 you're trying to do here. Take the log aggregation service as an example,
 you want that on every slave to aggregate logs, but want to avoid using yet
 another layer of configuration management to deploy it.

 I'm of the opinion that these kind of auxiliary services which all work
 together (the mesos-slave process included) to define what we mean by a
 slave are the responsibility of whoever/whatever is provisioning the
 mesos-slave process and possibly even the machine itself. In our case,
 that's Chef. IMO once a slave registers with the mesos cluster it's
 immediately ready to start doing work, and mesos will actually start
 offering that slave immediately.

 If you continue down this path you're also going to run into a variety of
 interesting timing issues when these services fail, or when you want to
 upgrade them. I'd suggest taking a look at some kind of more advanced
 process monitor to run these aux services like M/Monit instead of mesos
 (via Marathon).

 Think of it another way, would you want something running through mesos
 to install apt package updates once a day? That'd be super weird, so why
 would log aggregation be any different?

 --

 Tom Arnfeld
 Developer // DueDil


 On Tue, Jan 6, 2015 at 11:57 PM, Charles Baker cnob...@gmail.com wrote:

 It seems like an 'anti-pattern' (for lack of a better term) to attempt
 to force locality on a bunch of dependency services launched through
 Marathon. I thought the whole idea of Mesos (and Marathon) was to treat the
 data center as one giant computer in which it fundamentally should not
 matter where your services are launched. Although I obviously don't know
 the details of the use-case and may be grossly misunderstanding what you
 are trying to do but to me it sounds like you are attempting to shoehorn a
 non-distributed application into a distributed architecture. If this is the
 case, you may want to revisit your implementation and try to decouple the
 application's requirement of node-level dependency locality. It is also a
 good opportunity to possibly redesign a monolithic application into a
 distributed one.

 On Tue, Jan 6, 2015 at 12:53 PM, David Greenberg dsg123456...@gmail.com
  wrote:

 Tom is absolutely correct--you also need to ensure that your special
 tasks run as a user which is assigned a role w/ a special reservation to
 ensure they can always launch.

 On Tue, Jan 6, 2015 at 2:38 PM, Tom Arnfeld t...@duedil.com wrote:

 I'm not sure if I'm fully aware of the use case but if you use a
 different framework (aka Marathon) to launch these services, should the
 service die and need to be re-launched (or even the slave restarts) could
 you not be in a position where another framework has consumed all 
 resources
 on that slave and your core tasks cannot launch?

 Maybe if you're just using Marathon it might provide a sort of
 priority to decide who gets what resources first, but with multiple
 frameworks you might need to look into the slave resource reservations and
 framework roles.

 FWIW We're configuring these things out of band (via Chef to be
 specific).

 Hope this helps!

 --

 Tom Arnfeld
 Developer // DueDil

 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS


 On Tue, Jan 6, 2015 at 9:05 AM, Itamar Ostricher ita...@yowza3d.com
 wrote:

 Hi,

 I was wondering if the best approach to do what I want is to use
 mesos itself, or other Linux system tools.

 There are a bunch of services that our framework assumes are running
 on all participating slaves (e.g. 

Re: Running Spark on Mesos

2015-01-07 Thread Tim Chen
Hi John,

I'm not quite familiar with how SparkSQL thrift servers are started, but in
general you can't share one Mesos driver between two different Spark
applications. Each spark-shell or spark-submit creates a new framework that
independently gets offers and uses those resources from Mesos.

If you want your executors to be long running, then you will want to run in
coarse-grained mode, which also keeps your cache.
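
For example, to run the SparkSQL thrift server as its own long-lived
coarse-grained framework, something like this should work (the master URL and
core cap are just placeholders for your setup):

$ ./sbin/start-thriftserver.sh \
    --master mesos://zk://zk.example.com:2181/mesos \
    --conf spark.mesos.coarse=true \
    --conf spark.cores.max=8

Your interactive/iPython instance can then stay in fine-grained mode as a
separate framework.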

Tim

On Tue, Jan 6, 2015 at 5:40 AM, John Omernik j...@omernik.com wrote:

 I have Spark 1.2 running nicely with both the SparkSQL thrift server
 and running it in iPython.

 My question is this: I am running on Mesos in fine-grained mode; what
 is the appropriate way to manage the two instances? Should I run
 coarse-grained mode for the Spark SQL Thrift Server so that RDDs can
 persist? Should I run both as separate Spark instances in fine-grained
 mode (I'd have to change the port on one of them)? Is there a way to
 have one Spark driver serve both things so I only use resources for
 one driver? How would you run this in a production environment?

 Thanks!

 John



[RESULT][VOTE] Release Apache Mesos 0.21.1 (rc2)

2015-01-02 Thread Tim Chen
Hi all,

The vote for Mesos 0.21.1 (rc2) has passed with the
following votes.

+1 (Binding)
--
Niklas Nielsen
Timothy Chen
Till Toenshoff


+1 (Non-binding)
--
Tom Arnfeld
Ankur Chauhan

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/0.21.1

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.1

The mesos-0.21.1.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,

Tim & Till


Re: [VOTE] Release Apache Mesos 0.21.1 (rc2)

2014-12-30 Thread Tim Chen
Hi all,

Just a reminder that the vote is open for another 2 hours; let me know if any of
you have any objections.

Thanks,

Tim

On Mon, Dec 29, 2014 at 5:32 AM, Niklas Nielsen nik...@mesosphere.io
wrote:

 +1, Compiled and tested on Ubuntu Trusty, CentOS Linux 7 and Mac OS X

 Thanks guys!
 Niklas


 On 19 December 2014 at 22:02, Tim Chen t...@mesosphere.io wrote:

 Hi Ankur,

 Since MESOS-1711 is just a minor improvement I'm inclined to include it
 for the next major release which shouldn't be too far away from this
 release.

 If anyone else thinks otherwise please let me know.

 Tim

 On Fri, Dec 19, 2014 at 12:44 PM, Ankur Chauhan an...@malloc64.com
 wrote:

 Sorry for the late join-in. Can we get
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/MESOS-1711
 in too, or is it too late?
 -- ankur
 Sent from my iPhone

 On Dec 19, 2014, at 12:23, Tim Chen t...@mesosphere.io wrote:

 Hi all,

 Please vote on releasing the following candidate as Apache Mesos 0.21.1.


 0.21.1 includes the following:

 
 * This is a bug fix release.

 ** Bug
   * [MESOS-2047] Isolator cleanup failures shouldn't cause TASK_LOST.
   * [MESOS-2071] Libprocess generates invalid HTTP
   * [MESOS-2147] Large number of connections slows statistics.json
 responses.
   * [MESOS-2182] Performance issue in libprocess SocketManager.

 ** Improvement
   * [MESOS-1925] Docker kill does not allow containers to exit gracefully
   * [MESOS-2113] Improve configure to find apr and svn libraries/headers
 in OSX

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.1-rc2

 

 The candidate for Mesos 0.21.1 release is available at:

 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz

 The tag to be voted on is 0.21.1-rc2:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.21.1-rc2

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1046

 Please vote on releasing this package as Apache Mesos 0.21.1!

 The vote is open until Tue Dec 23 18:00:00 PST 2014 and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.21.1
 [ ] -1 Do not release this package because ...

 Thanks,

 Tim & Till






[VOTE] Release Apache Mesos 0.21.1 (rc2)

2014-12-19 Thread Tim Chen
Hi all,

Please vote on releasing the following candidate as Apache Mesos 0.21.1.


0.21.1 includes the following:

* This is a bug fix release.

** Bug
  * [MESOS-2047] Isolator cleanup failures shouldn't cause TASK_LOST.
  * [MESOS-2071] Libprocess generates invalid HTTP
  * [MESOS-2147] Large number of connections slows statistics.json
responses.
  * [MESOS-2182] Performance issue in libprocess SocketManager.

** Improvement
  * [MESOS-1925] Docker kill does not allow containers to exit gracefully
  * [MESOS-2113] Improve configure to find apr and svn libraries/headers in
OSX

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.1-rc2


The candidate for Mesos 0.21.1 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz

The tag to be voted on is 0.21.1-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.21.1-rc2

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1046
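
If you want to verify and test the candidate locally, a rough sketch (adjust
the make parallelism for your machine):

$ curl -s https://dist.apache.org/repos/dist/release/mesos/KEYS | gpg --import
$ wget https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz{,.md5,.asc}
$ gpg --verify mesos-0.21.1.tar.gz.asc mesos-0.21.1.tar.gz
$ md5sum mesos-0.21.1.tar.gz          # compare by hand against the .md5 file
$ tar xzf mesos-0.21.1.tar.gz && cd mesos-0.21.1
$ ./configure && make -j4 && make check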

Please vote on releasing this package as Apache Mesos 0.21.1!

The vote is open until Tue Dec 23 18:00:00 PST 2014 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 0.21.1
[ ] -1 Do not release this package because ...

Thanks,

Tim & Till


Re: [VOTE] Release Apache Mesos 0.21.1 (rc2)

2014-12-19 Thread Tim Chen
Hi Ankur,

Since MESOS-1711 is just a minor improvement I'm inclined to include it for
the next major release which shouldn't be too far away from this release.

If anyone else thinks otherwise please let me know.

Tim

On Fri, Dec 19, 2014 at 12:44 PM, Ankur Chauhan an...@malloc64.com wrote:

 Sorry for the late join-in. Can we get
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/MESOS-1711 in
 too, or is it too late?
 -- ankur
 Sent from my iPhone

 On Dec 19, 2014, at 12:23, Tim Chen t...@mesosphere.io wrote:

 Hi all,

 Please vote on releasing the following candidate as Apache Mesos 0.21.1.


 0.21.1 includes the following:

 
 * This is a bug fix release.

 ** Bug
   * [MESOS-2047] Isolator cleanup failures shouldn't cause TASK_LOST.
   * [MESOS-2071] Libprocess generates invalid HTTP
   * [MESOS-2147] Large number of connections slows statistics.json
 responses.
   * [MESOS-2182] Performance issue in libprocess SocketManager.

 ** Improvement
   * [MESOS-1925] Docker kill does not allow containers to exit gracefully
   * [MESOS-2113] Improve configure to find apr and svn libraries/headers
 in OSX

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.1-rc2

 

 The candidate for Mesos 0.21.1 release is available at:
 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz

 The tag to be voted on is 0.21.1-rc2:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.21.1-rc2

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc2/mesos-0.21.1.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1046

 Please vote on releasing this package as Apache Mesos 0.21.1!

 The vote is open until Tue Dec 23 18:00:00 PST 2014 and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.21.1
 [ ] -1 Do not release this package because ...

 Thanks,

 Tim & Till




[VOTE] Release Apache Mesos 0.21.1 (rc1)

2014-12-18 Thread Tim Chen
Hi all,

Please vote on releasing the following candidate as Apache Mesos 0.21.1.

0.21.1 includes the following:

* This is a bug fix release.

** Bug
  * [MESOS-2047] Isolator cleanup failures shouldn't cause TASK_LOST.
  * [MESOS-2071] Libprocess generates invalid HTTP
  * [MESOS-2147] Large number of connections slows statistics.json
responses.

** Improvement
  * [MESOS-1925] Docker kill does not allow containers to exit gracefully
  * [MESOS-2113] Improve configure to find apr and svn libraries/headers in
OSX


The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.1-rc1


The candidate for Mesos 0.21.1 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc1/mesos-0.21.1.tar.gz

The tag to be voted on is 0.21.1-rc1:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.21.1-rc1

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc1/mesos-0.21.1.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc1/mesos-0.21.1.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1044

Please vote on releasing this package as Apache Mesos 0.21.1!

The vote is open until Monday Dec 22 18:00:00 PST 2014 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 0.21.1
[ ] -1 Do not release this package because ...

Thanks,

Tim & Till


Re: [VOTE] Release Apache Mesos 0.21.1 (rc1)

2014-12-18 Thread Tim Chen
Hi Ben,

Sounds good, we'll take these patches for rc2.

I'll wait until tomorrow morning in case anyone else finds any issues or wants
any other patches to be part of 0.21.1.

Thanks,

Tim

On Thu, Dec 18, 2014 at 4:00 PM, Benjamin Mahler benjamin.mah...@gmail.com
wrote:

 While not a blocker (since it's a long-standing issue), I'd recommend
 cherry-picking MESOS-2182
 https://issues.apache.org/jira/browse/MESOS-2182.

 With very large clusters this issue appears to trigger false removals of
 slaves due to blocking of the SocketManager. For us it was about 5% of the
 slaves that were removed over the course of a failover, which is quite
 dangerous for production use-cases.

 I've attached the cherry-picks for getting these on top of 0.21.0 (not
 sure if apache will strip them, but your mesosphere email should get
 them).

 On Thu, Dec 18, 2014 at 3:36 PM, Tim Chen t...@mesosphere.io wrote:

 Hi all,

 Please vote on releasing the following candidate as Apache Mesos 0.21.1.

 0.21.1 includes the following:

 
 * This is a bug fix release.

 ** Bug
   * [MESOS-2047] Isolator cleanup failures shouldn't cause TASK_LOST.
   * [MESOS-2071] Libprocess generates invalid HTTP
   * [MESOS-2147] Large number of connections slows statistics.json
 responses.

 ** Improvement
   * [MESOS-1925] Docker kill does not allow containers to exit gracefully
   * [MESOS-2113] Improve configure to find apr and svn libraries/headers
 in OSX


 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.21.1-rc1

 

 The candidate for Mesos 0.21.1 release is available at:

 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc1/mesos-0.21.1.tar.gz

 The tag to be voted on is 0.21.1-rc1:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.21.1-rc1

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc1/mesos-0.21.1.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.21.1-rc1/mesos-0.21.1.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1044

 Please vote on releasing this package as Apache Mesos 0.21.1!

 The vote is open until Monday Dec 22 18:00:00 PST 2014 and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.21.1
 [ ] -1 Do not release this package because ...

 Thanks,

 Tim & Till




Re: DockerContainerizer error on two slaves

2014-12-16 Thread Tim Chen
Hi Tinco,

What OS/environment are you running mesos-slave on? You might need to
enable cgroups if they're not enabled/mounted by default.
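
If the 'cpu' hierarchy really isn't mounted, a rough sketch of mounting it
manually (this assumes /sys/fs/cgroup exists and nothing is mounted there for
'cpu' yet):

$ mkdir -p /sys/fs/cgroup/cpu
$ mount -t cgroup -o cpu cgroup /sys/fs/cgroup/cpu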

Tim

On Tue, Dec 16, 2014 at 11:26 AM, Ian Downes idow...@twitter.com wrote:

 Can you also please post the output of these commands for a working and a
 non-working host?

 $ cat /proc/cgroups

 $ cat /proc/mounts

 Are you running inside a Docker or systemd container?

 On Tue, Dec 16, 2014 at 11:22 AM, Benjamin Mahler 
 benjamin.mah...@gmail.com wrote:

 +Tim Chen (please chime in if I'm missing something)

 Sorry for the delay, from a quick glance it looks like the
 DockerContainerizer is a bit less liberal in setting up cgroups if
 they are not mounted on the machine. I'm curious, if you remove docker
 from the containerizers flag, does it work?

 Otherwise, you can try mount the cgroups manually, as suggested by the
 error message.

 Feel free to file a ticket to capture this!

 Hope this helps,
 Ben

 On Fri, Dec 12, 2014 at 1:55 AM, Tinco Andringa m...@tinco.nl wrote:

 Hi, I'm provisioning a mesos cluster and on two of my machines I get the
 following error when starting mesos-slave:

 root@web1:~# /usr/local/sbin/mesos-slave
 --master=zk://localhost:2181/mesos --log_dir=/var/log/mesos
 --isolation=cgroups/cpu,cgroups/mem --containerizers=docker,mesos
 --executor_registration_timeout=5mins --work_dir=/var/run/work
 I1212 10:46:30.782308 32590 logging.cpp:172] INFO level logging started!
 I1212 10:46:30.782580 32590 main.cpp:142] Build: 2014-11-22 05:29:13 by
 root
 I1212 10:46:30.782615 32590 main.cpp:144] Version: 0.21.0
 I1212 10:46:30.782640 32590 main.cpp:147] Git tag: 0.21.0
 I1212 10:46:30.782665 32590 main.cpp:151] Git SHA:
 ab8fa655d34e8e15a4290422df38a18db1c09b5b
 Failed to create a containerizer: Could not create DockerContainerizer:
 Failed to find a mounted cgroups hierarchy for the 'cpu' subsystem; you
 probably need to mount cgroups manually!

 I have four machines in total, on two machines, db1 and db2 everything
 runs fine and the slaves get added to the cluster. On web1 and web2 it
 fails and they don't appear in the cluster. I run the exact same command on
 each of the machines. Mesos-master runs fine on web1, web2 and db1.

 Obviously there's some difference between the web and db machines, but
 I'm really unclear on what that difference is specifically. Most of my chef
 scripts are run on both types of machine; there's just some extra webserver
 stuff on the web machines, and some extra db stuff on the db machines. Db2
 is the only node that doesn't run zookeeper or mesos-master.

 Any hints or tips to get closer to the root of the problem would be much
 appreciated; I'm not afraid to dive into the source a little if necessary.

 Kind regards,
 Tinco




Re: Mesos slaves keep disconnecting

2014-12-15 Thread Tim Chen
Is there anything in the ERROR/WARNING logs?
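
If you started the daemons with --log_dir (e.g. /var/log/mesos), glog writes
separate WARNING and ERROR files there; for example (paths are placeholders for
your setup):

$ tail -n 50 /var/log/mesos/mesos-master.WARNING /var/log/mesos/mesos-master.ERROR
$ tail -n 50 /var/log/mesos/mesos-slave.WARNING /var/log/mesos/mesos-slave.ERROR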

Tim

On Mon, Dec 15, 2014 at 4:22 PM, Arunabha Ghosh arunabha...@gmail.com
wrote:

 Hi,
 I've setup a test mesos cluster on a few VM's running locally. I have
 three masters and two slaves

 masters : 192.168.48.14[5 - 7]
 slaves : 192.168.48.15[0 - 1]

 The masters startup correctly and are able to elect a leader. The slaves
 can find the master and register, but for some reason they immediately
 disconnect.


 *On the master (mesos-master.INFO)*

 master.cpp:3122] Registered slave
 20141215-160321-2435885248-5050-20424-S68 at slave(1)@127.0.1.1:5051
 (192.168.48.150) with cpus(*):1; mem(*):489; disk(*):13901;
 ports(*):[31000-32000]
 I1215 16:15:51.970082 20448 hierarchical_allocator_process.hpp:442] Added
 slave 20141215-160321-2435885248-5050-20424-S68 (192.168.48.150) with
 cpus(*):1; mem(*):489; disk(*):13901; ports(*):[31000-32000] (and
 cpus(*):1; mem(*):489; disk(*):13901; ports(*):[31000-32000] available)
 I1215 16:15:51.970474 20454 master.cpp:839] Slave
 20141215-160321-2435885248-5050-20424-S68 at slave(1)@127.0.1.1:5051
 (192.168.48.150) disconnected
 I1215 16:15:51.970546 20454 master.cpp:1789] Disconnecting slave
 20141215-160321-2435885248-5050-20424-S68 at slave(1)@127.0.1.1:5051
 (192.168.48.150)
 I1215 16:15:51.970612 20454 master.cpp:1808] Deactivating slave
 20141215-160321-2435885248-5050-20424-S68 at slave(1)@127.0.1.1:5051
 (192.168.48.150)
 I1215 16:15:51.970772 20454 hierarchical_allocator_process.hpp:481] Slave
 20141215-160321-2435885248-5050-20424-S68 deactivated
 I1215 16:15:51.975980 20453 replica.cpp:655] Replica received learned
 notice for position 276
 I1215 16:15:51.977501 20453 leveldb.cpp:343] Persisting action (20 bytes)
 to leveldb took 1.475474ms
 I1215 16:15:51.977625 20453 leveldb.cpp:401] Deleting ~2 keys from leveldb
 took 50280ns

 *On the slave (mesos-slave.INFO)*

 Dec 15 16:06:09 ubuntu mesos-slave[18118]: 2014-12-15
 16:06:09,209:18118(0x7fa67d700700):ZOO_INFO@check_events@1750: session
 establishment complete on server [192.168.48.147:2181],
 sessionId=0x34a5067fd9e0001, negotiated timeout=1
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.210183 18140
 group.cpp:313] Group process (group(1)@127.0.1.1:5051) connected to
 ZooKeeper
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.210248 18140
 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas)
 = (0, 0, 0)
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.210270 18140
 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.213835 18140
 detector.cpp:138] Detected a new leader: (id='55')
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.214570 18140
 group.cpp:659] Trying to get '/mesos/info_55' in ZooKeeper
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.215833 18141
 detector.cpp:433] A new leading master (UPID=master@192.168.48.145:5050)
 is detected
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.220592 18141
 state.cpp:33] Recovering state from '/home/agh/mesos-work/meta'
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.220757 18141
 state.cpp:62] Failed to find the latest slave from
 '/home/agh/mesos-work/meta'
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.226416 18136
 status_update_manager.cpp:197] Recovering status update manager
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.226963 18134
 containerizer.cpp:281] Recovering containerizer
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.228973 18135
 slave.cpp:3466] Finished recovery
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.230242 18137
 status_update_manager.cpp:171] Pausing sending status updates
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.230450 18135
 slave.cpp:602] New master detected at master@192.168.48.145:5050
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.230873 18135
 slave.cpp:627] No credentials provided. Attempting to register without
 authentication
 Dec 15 16:06:09 ubuntu mesos-slave[18118]: I1215 16:06:09.231045 18135
 slave.cpp:638] Detecting new master
 Dec 15 16:07:09 ubuntu mesos-slave[18118]: I1215 16:07:09.225389 18141
 slave.cpp:3321] Current usage 12.01%. Max allowed age: 5.459239732780289days
 Dec 15 16:08:09 ubuntu mesos-slave[18118]: I1215 16:08:09.228869 18141
 slave.cpp:3321] Current usage 12.01%. Max allowed age: 5.459239732780289days
 Dec 15 16:09:09 ubuntu mesos-slave[18118]: I1215 16:09:09.252048 18141
 slave.cpp:3321] Current usage 12.01%. Max allowed age: 5.459239732780289days
 Dec 15 16:09:27 ubuntu mesos-slave[18118]: I1215 16:09:27.288277 18141
 http.cpp:330] HTTP request for '/slave(1)/state.json'
 Dec 15 16:10:09 ubuntu mesos-slave[18118]: I1215 16:10:09.271672 18138
 slave.cpp:3321] Current usage 12.01%. Max allowed age: 5.459239732780289days

 It does not look like the slave is disconnecting, so why does the 

Re: Question about External Containerizer

2014-12-03 Thread Tim Chen
Forgot to mention: the exception is if you have a custom executor that you
launch as a docker container (by putting DockerInfo in the ExecutorInfo in your
TaskInfo); in that case you can re-use that executor for multiple tasks.

Tim

On Wed, Dec 3, 2014 at 11:47 AM, Tim Chen t...@mesosphere.io wrote:

 Hi Sharma,

 Yes, currently docker doesn't really support (out of the box) launching
 multiple processes in the same container. They just recently added docker
 exec, but it's not quite clear yet how that best fits into the Mesos integration.

 So each task run by the Docker containerizer has to be a separate
 container for now.

 Tim

 On Wed, Dec 3, 2014 at 11:09 AM, Sharma Podila spod...@netflix.com
 wrote:

 Yes, although, there's a nuance to this specific situation. Here, the
 same executor is being used for multiple tasks, but, the executor is
 launching a different Docker container for each task. I was extending the
 coarse grain allocation concept to within the executor (which is in a fine
 grained allocation model).
 What you mention, we do use already for a different framework, not the
 one Diptanu is talking about.

 On Wed, Dec 3, 2014 at 11:04 AM, Connor Doyle con...@mesosphere.io
 wrote:

 You're right Sharma, it's dependent upon the framework.  If your
 scheduler sets a unique ExecutorID for each TaskInfo, then the executor
 will not be re-used and you won't have to worry about resizing the
 executor's container to accommodate subsequent tasks.  This might be a
 reasonable simplification to start with, especially if your executor adds
 relatively low resource overhead.
 --
 Connor


  On Dec 3, 2014, at 10:20, Sharma Podila spod...@netflix.com wrote:
 
  This may have to do with fine-grain Vs coarse-grain resource
 allocation. Things may be easier for you, Diptanu, if you are using one
 Docker container per task (sort of coarse grain). In that case, I believe
 there's no need to alter a running Docker container's resources. Instead,
 the resource update of your executor translates into the right Docker
 containers running. There's some details to be worked out there, I am sure.
  It sounds like Tom's strategy uses the same Docker container for
 multiple tasks. Tom, do correct me otherwise.
 
  On Wed, Dec 3, 2014 at 3:38 AM, Tom Arnfeld t...@duedil.com wrote:
  When Mesos is asked to a launch a task (with either a custom Executor
 or the built in CommandExecutor) it will first spawn the executor which
 _has_ to be a system process, launched via command. This process will be
 launched inside of a Docker container when using the previously mentioned
 containerizers.
 
  Once the Executor registers with the slave, the slave will send it a
 number of launchTask calls based on the number of tasks queued up for that
 executor. The Executor can then do as it pleases with those tasks, whether
 it's just a sleep(1) or to spawn a subprocess and do some other work. Given
 it is possible for the framework to specify resources for both tasks and
 executors, and the only thing which _has_ to be a system process is the
 executor, the mesos slave will limit the resources of the executor process
 to the sum of (TaskInfo.Executor.Resources + TaskInfo.Resources).
 
  Mesos also has the ability to launch new tasks on an already running
 executor, so it's important that mesos is able to dynamically scale the
 resource limits up and down over time. Designing a framework around this
 idea can lead to some complex and powerful workflows which would be a lot
 more complex to build without Mesos.
 
  Just for an example... Spark.
 
  1) User launches a job on spark to map over some data
  2) Spark launches a first wave of tasks based on the offers it
 received (let's say T1 and T2)
  3) Mesos launches executors for those tasks (let's say E1 and E2) on
 different slaves
  4) Spark launches another wave of tasks based on offers, and tells
 mesos to use the same executor (E1 and E2)
  5) Mesos will simply call launchTasks(T{3,4}) on the two already
 running executors
 
  At point (3) mesos is going to launch a Docker container and execute
 your executor. However at (5) the executor is already running so the tasks
 will be handed to the already running executor.
 
  Mesos will guarantee you (i'm 99% sure) that the resources for your
 container have been updated to reflect the limits set on the tasks before
 handing the tasks to you.
 
  I hope that makes some sense!
 
  --
 
  Tom Arnfeld
  Developer // DueDil
 
 
  On Wed, Dec 3, 2014 at 10:54 AM, Diptanu Choudhury dipta...@gmail.com
 wrote:
 
  Thanks for the explanation Tom, yeah I just figured that out by
 reading your code! You're touching the memory.soft_limit_in_bytes and
 memory.limit_in_bytes directly.
 
  Still curious to understand in which situations Mesos Slave would call
 the external containerizer to update the resource limits of a container? My
 understanding was that once resource allocation happens for a task,
 resources are not taken away until the task exits[fails, crashes or
 finishes

Re: Question about External Containerizer

2014-12-03 Thread Tim Chen
Hi Sharma,

Yes, currently docker doesn't really support (out of the box) launching multiple
processes in the same container. They just recently added docker exec, but it's
not quite clear yet how that best fits into the Mesos integration.

So each task run by the Docker containerizer has to be a separate container
for now.

Tim

On Wed, Dec 3, 2014 at 11:09 AM, Sharma Podila spod...@netflix.com wrote:

 Yes, although, there's a nuance to this specific situation. Here, the same
 executor is being used for multiple tasks, but, the executor is launching a
 different Docker container for each task. I was extending the coarse grain
 allocation concept to within the executor (which is in a fine grained
 allocation model).
 What you mention, we do use already for a different framework, not the one
 Diptanu is talking about.

 On Wed, Dec 3, 2014 at 11:04 AM, Connor Doyle con...@mesosphere.io
 wrote:

 You're right Sharma, it's dependent upon the framework.  If your
 scheduler sets a unique ExecutorID for each TaskInfo, then the executor
 will not be re-used and you won't have to worry about resizing the
 executor's container to accommodate subsequent tasks.  This might be a
 reasonable simplification to start with, especially if your executor adds
 relatively low resource overhead.
 --
 Connor


  On Dec 3, 2014, at 10:20, Sharma Podila spod...@netflix.com wrote:
 
  This may have to do with fine-grain Vs coarse-grain resource
 allocation. Things may be easier for you, Diptanu, if you are using one
 Docker container per task (sort of coarse grain). In that case, I believe
 there's no need to alter a running Docker container's resources. Instead,
 the resource update of your executor translates into the right Docker
 containers running. There's some details to be worked out there, I am sure.
  It sounds like Tom's strategy uses the same Docker container for
 multiple tasks. Tom, do correct me otherwise.
 
  On Wed, Dec 3, 2014 at 3:38 AM, Tom Arnfeld t...@duedil.com wrote:
  When Mesos is asked to a launch a task (with either a custom Executor
 or the built in CommandExecutor) it will first spawn the executor which
 _has_ to be a system process, launched via command. This process will be
 launched inside of a Docker container when using the previously mentioned
 containerizers.
 
  Once the Executor registers with the slave, the slave will send it a
 number of launchTask calls based on the number of tasks queued up for that
 executor. The Executor can then do as it pleases with those tasks, whether
 it's just a sleep(1) or to spawn a subprocess and do some other work. Given
 it is possible for the framework to specify resources for both tasks and
 executors, and the only thing which _has_ to be a system process is the
 executor, the mesos slave will limit the resources of the executor process
 to the sum of (TaskInfo.Executor.Resources + TaskInfo.Resources).
 
  Mesos also has the ability to launch new tasks on an already running
 executor, so it's important that mesos is able to dynamically scale the
 resource limits up and down over time. Designing a framework around this
 idea can lead to some complex and powerful workflows which would be a lot
 more complex to build without Mesos.
 
  Just for an example... Spark.
 
  1) User launches a job on spark to map over some data
  2) Spark launches a first wave of tasks based on the offers it received
 (let's say T1 and T2)
  3) Mesos launches executors for those tasks (let's say E1 and E2) on
 different slaves
  4) Spark launches another wave of tasks based on offers, and tells
 mesos to use the same executor (E1 and E2)
  5) Mesos will simply call launchTasks(T{3,4}) on the two already
 running executors
 
  At point (3) mesos is going to launch a Docker container and execute
 your executor. However at (5) the executor is already running so the tasks
 will be handed to the already running executor.
 
  Mesos will guarantee you (i'm 99% sure) that the resources for your
 container have been updated to reflect the limits set on the tasks before
 handing the tasks to you.
 
  I hope that makes some sense!
 
  --
 
  Tom Arnfeld
  Developer // DueDil
 
 
  On Wed, Dec 3, 2014 at 10:54 AM, Diptanu Choudhury dipta...@gmail.com
 wrote:
 
  Thanks for the explanation Tom, yeah I just figured that out by reading
 your code! You're touching the memory.soft_limit_in_bytes and
 memory.limit_in_bytes directly.
 
  Still curious to understand in which situations Mesos Slave would call
 the external containerizer to update the resource limits of a container? My
 understanding was that once resource allocation happens for a task,
 resources are not taken away until the task exits[fails, crashes or
 finishes] or Mesos asks the slave to kill the task.
 
  On Wed, Dec 3, 2014 at 2:47 AM, Tom Arnfeld t...@duedil.com wrote:
  Hi Diptanu,
 
  That's correct, the ECP has the responsibility of updating the resource
 for a container, and it will do as new tasks are launched and killed for an
 executor. Since 

Re: Timeline for 0.22.0?

2014-12-02 Thread Tim Chen
Hi Scott,

The patch for MESOS-1925 is already merged into master, so you should be
able to just grab master in the meantime.

As for the 0.22.0 timeline, I don't think we've set one yet; usually we
announce an estimated time to release once we have enough changes to warrant a
new version.

Tim

On Tue, Dec 2, 2014 at 2:08 PM, Scott Rankin sran...@motus.com wrote:

   Hi all,

  We’re very excited here for Mesos and are working on our first
 production deployment using Mesos/Marathon/Chronos.  One thing that we need
 for production readiness is MESOS-1925.  I see it’s been assigned to 0.22.0
 – I was wondering if there was any timeline yet for when that release will
 come out.  I can put together our own branch with 0.21.0 + the patch, but
 I’d rather wait for the release.

  Thanks!
 Scott




Rocket

2014-12-01 Thread Tim Chen
Hi all,

Per the announcement from CoreOS about Rocket (
https://coreos.com/blog/rocket/) , it seems to be an exciting containerizer
runtime that has composable isolation/components, better security and image
specification/distribution.

All of these design goals also fit very well with Mesos, where we also have a
pluggable isolator model and have been experiencing some pain points with our
existing containerizers around image distribution and security as well.

I'd like to propose integrating Rocket into Mesos with a new Rocket
containerizer, where I can see us potentially integrating our existing
isolators into the Rocket runtime.

I'd like to hear what you all think,

Thanks!


Re: Rocket

2014-12-01 Thread Tim Chen
Hi Jie,

I don't think they've published any API yet; the actual integration story
is TBD, but given the early stage we can help shape the API as well.

Tim

On Mon, Dec 1, 2014 at 12:01 PM, Jie Yu yujie@gmail.com wrote:

 Sounds great Tim!

 Do you know if they have published an API for the rocket toolset? Are we
 gonna rely on the command line interface?

 - Jie

 On Mon, Dec 1, 2014 at 11:10 AM, Tim Chen t...@mesosphere.io wrote:

 Hi all,

 Per the announcement from CoreOS about Rocket (
 https://coreos.com/blog/rocket/) , it seems to be an exciting
 containerizer runtime that has composable isolation/components, better
 security and image specification/distribution.

 All of these design goals also fit very well with Mesos, where we also
 have a pluggable isolator model and have been experiencing some pain
 points with our existing containerizers around image distribution and
 security as well.

 I'd like to propose integrating Rocket into Mesos with a new Rocket
 containerizer, where I can see us potentially integrating our existing
 isolators into the Rocket runtime.

 I'd like to hear what you all think,

 Thanks!





Re: Mesos killing Spark Driver

2014-12-01 Thread Tim Chen
There are different reasons, but most commonly it is when the framework asks to
kill the task.

Can you provide some easy repro steps/artifacts? I've been working on Spark
on Mesos these days and can help try this out.

Tim

On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas gerard.m...@gmail.com wrote:

 Hi,

 Sorry if this has been discussed before. I'm new to the list.

 We are currently running our Spark + Spark Streaming jobs on Mesos,
 submitting our jobs through Marathon.

 We see with some regularity that the Spark Streaming driver gets killed by
 Mesos and then restarted on some other node by Marathon.

 I've no clue why Mesos is killing the driver and looking at both the Mesos
 and Spark logs didn't make me any wiser.

 On the Spark Streaming driver logs, I find this entry of Mesos signing
 off my driver:

 Shutting down
 Sending SIGTERM to process tree at pid 17845
 Killing the following process trees:
 [
 -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
  \-+- 17846 sh ./run-mesos.sh application-ts.conf
\--- 17847 java -cp core-compute-job.jar
 -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
 ]
 Command terminated with signal Terminated (pid: 17845)


 What would be the reasons for Mesos to kill an executor?
 Have anybody seen something similar? Any hints on where to start digging?

 -kr, Gerard.
 .








Re: Mesos killing Spark Driver

2014-12-01 Thread Tim Chen
Hi Gerard,

I see. What will help diagnose your problem is if you can enable verbose
logging (GLOG_v=1) before running the slave, and share the slave logs when it
happens.
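
For example, a rough sketch (use whatever master and flags you normally pass to
the slave):

$ GLOG_v=1 mesos-slave --master=zk://zk.example.com:2181/mesos \
    --log_dir=/var/log/mesos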

Tim

On Mon, Dec 1, 2014 at 3:23 PM, Gerard Maas gerard.m...@gmail.com wrote:

 Hi Tim,

 It's quite hard to reproduce. It just happens... sometimes worse than
 others, mostly when the system is under load. We notice because the framework
 starts 'jumping' from one slave to another, but so far we have no clue why
 this is happening.

 What I'm currently looking for is some potential conditions that could
 cause Mesos to kill the executor (not the task) to validate whether any of
 those conditions apply to our case and try to narrow down the problem to
 some reproducible subset.

 -kr, Gerard.


 On Mon, Dec 1, 2014 at 11:57 PM, Tim Chen t...@mesosphere.io wrote:

 There are different reasons, but most commonly it is when the framework asks
 to kill the task.

 Can you provide some easy repro steps/artifacts? I've been working on
 Spark on Mesos these days and can help try this out.

 Tim

 On Mon, Dec 1, 2014 at 2:43 PM, Gerard Maas gerard.m...@gmail.com
 wrote:

 Hi,

 Sorry if this has been discussed before. I'm new to the list.

 We are currently running our Spark + Spark Streaming jobs on Mesos,
 submitting our jobs through Marathon.

 We see with some regularity that the Spark Streaming driver gets killed
 by Mesos and then restarted on some other node by Marathon.

 I've no clue why Mesos is killing the driver and looking at both the
 Mesos and Spark logs didn't make me any wiser.

 On the Spark Streaming driver logs, I find this entry of Mesos signing
 off my driver:

 Shutting down
 Sending SIGTERM to process tree at pid 17845
 Killing the following process trees:
 [
 -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
  \-+- 17846 sh ./run-mesos.sh application-ts.conf
\--- 17847 java -cp core-compute-job.jar
 -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
 ]
 Command terminated with signal Terminated (pid: 17845)


 What would be the reasons for Mesos to kill an executor?
 Have anybody seen something similar? Any hints on where to start digging?

 -kr, Gerard.
 .










Re: With docker containerizer enabled, How to check whether(or how) mesos successfully running tasks in docker container?

2014-11-29 Thread Tim Chen
Hi Sujinzhao,

Your steps s1-s3 are all correct for starting Mesos itself, but you also
need a framework that can get offers from Mesos and launch Tasks.

The easiest and simplest to use is the example framework that Mesos ships
with (mesos-execute), or you can use richer frameworks like
Marathon/Chronos as well. You can also implement your own if you want more
fine-grained control.

Marathon:
https://mesosphere.github.io/marathon/docs/

Chronos:
https://github.com/airbnb/chronos
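
For example, with Marathon running, a request along these lines launches a
command inside a Docker container (the Marathon host, image, and command are
just placeholders):

$ curl -X POST http://marathon.example.com:8080/v2/apps \
    -H 'Content-Type: application/json' \
    -d '{
          "id": "/docker-test",
          "cmd": "echo hello && sleep 300",
          "cpus": 0.1,
          "mem": 64,
          "instances": 1,
          "container": {
            "type": "DOCKER",
            "docker": { "image": "busybox" }
          }
        }'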

Tim

On Sat, Nov 29, 2014 at 1:59 AM, Billy Bones gael.ther...@gmail.com wrote:

 Did you try to launch it using marathon ?

 Le sam. 29 nov. 2014 10:41, sujinzhao sujinz...@gmail.com a écrit :

 For clarification, I just want to know how to start an application
 running in a docker container with mesos ?




Re: CFS for Docker Containers running on Mesos?

2014-11-23 Thread Tim Chen
Hi Andrew,

The Docker containerizer right now simply passes the docker CLI cpu and
memory parameters and lets the Docker daemon set the cgroup shares
accordingly, although we do go behind the docker daemon to perform updates on
the share amounts.

We didn't port the existing Mesos containerizer CFS support into the Docker
containerizer, as it wasn't something we did for v1, but it can be integrated
into the Docker containerizer as well. If you'd like to see this feature,
please file a JIRA ticket for it.

The workaround for now is to do exactly what you saw in the docker-user
forum: since 0.21.0 we support passing arbitrary key/value params to the
Docker containerizer, so you can launch the docker daemon with the lxc backend
and pass the CFS params as part of your docker task launch.
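
As a rough sketch of what that boils down to at the cgroup level (the container
id is a placeholder, and the exact cgroup path depends on how your docker
daemon is set up):

$ CID=<container-id>
$ echo 100000 > /sys/fs/cgroup/cpu/docker/$CID/cpu.cfs_period_us
$ echo 50000 > /sys/fs/cgroup/cpu/docker/$CID/cpu.cfs_quota_us   # hard cap at half a core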

Let me know if you have any more questions,

Tim

On Sun, Nov 23, 2014 at 10:14 PM, Andrew Ortman 
and...@impossibleventures.com wrote:

  Hi all!

  I’ve been toying around this evening with CPU resource isolation in
 Mesos - specifically with the Docker Containerizer.

  It seems the Docker Containerizer (via experimentation and peeking at
 the source) modifies the relative CPU weight for the docker containers
 running on the slave in order to isolate the container's CPU usage. This of
 course isn’t a hard limit, and allows services to burst up to the full
 resources of the slave as long as no other tasks are consuming resources
 running on that server. It looks like the cpu isolator for the Mesos
 containerizer already has support for cgroup/cpu isolation and even an
 ability to enable CFS support via the command line flag
 —cgroups_enable_cfs. When I enable this flag, the Docker Containerizer
 doesn’t seem to use it at all or pass it onto the container at all. After
 googling, I came across this post:
 https://groups.google.com/forum/#!topic/docker-user/UF0GxTp3NHI where
 Hamilton was able to set the the relevant cgroup parameters to the docker
 daemon in order to provide a hard CPU limit on a running docker container.

  I’m fairly new to Mesos - so I apologize if something obvious is flying
 right over my head :-) Does anyone have any advice / suggestions / thoughts
 on achieving such a hard CPU cap for docker containers running on Mesos?
 This isn’t a blocker for me - I am purely curious

  Thanks!




Re: Implementing an Executor

2014-11-20 Thread Tim Chen
Hi Janet,

Can you elaborate more on what you'd like to get back from the docker container
that you launched?

Thanks,

Tim

On Wed, Nov 19, 2014 at 5:22 PM, Tom Arnfeld t...@duedil.com wrote:

 Hi Janet,

 Oh sorry my mistake, I didn't read your email correctly, I thought you
 were using the containerizer. What you're doing here is actually going to
 be quite difficult to do, the mesos docker containerizer has some quite
 complex logic implemented to ensure the slave stays in sync with the
 containers that are running, and kills anything that goes rogue.

 It's going to be non-trivial for you to do that from the executor, though
 I guess you could make use of the docker events API or poll other endpoints
 in the API to check the status of your containers, and off the back of that
 send status updates to the cluster. Doing this however brings no guarantees
 that if your executor dies exceptionally (perhaps OOMd) the containers
 spawned will die... they'll keep running in the background and it'll be
 hard for you to know the state of your containers on the cluster.
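
 For instance, a bare-bones version of that polling from the executor could
 just shell out to the docker CLI (the container id is a placeholder):

 $ docker wait <container-id>     # blocks until the container exits, prints its exit code
 $ docker inspect --format '{{.State.Running}} {{.State.ExitCode}}' <container-id>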

 You probably want to be aware (if you don't know already) that the
 resource limits assigned to your tasks aren't going to be enforced by mesos
 because docker is running outside of its control. You'll need to pass the
 correct CPU/Memory limit parameters to your docker containers to ensure
 this happens correctly.

 Here are the docker API docs;
 https://docs.docker.com/reference/api/docker_remote_api_v1.15/

 Something you might want to consider, if all you're trying to do is allow
 your container access to details about itself (e.g `docker inspect`) is to
 open up the docker remote API to be queried by your containers on the
 slave, and switch to using the mesos docker containerizer.

 I hope that helps somewhat!

 Tom.

 --

 Tom Arnfeld
 Developer // DueDil

 (+44) 7525940046
 25 Christopher Street, London, EC2A 2BS


 On Wed, Nov 19, 2014 at 10:16 PM, Janet Borschowa 
 janet.borsch...@codefutures.com wrote:

  Hi,
 I'm implementing an executor which is used by the mesos slave to launch
 tasks. The tasks are to launch a docker container - this is because I need
 more info about the launched container than what the docker containerizer
 returns.

 Is it OK to block in the executor's launchTask method until the task
 completes? If not, how does the framework discover when that task
 completes? I could spawn a process which notifies my executor when the task
 completes and then have my executor send a status update. Or is there some
 other recommended way to deal with this when the task could run for an
 indefinite period of time before completing its work?

 Thanks!

 Janet

 --
  Janet Borschowa
 CodeFutures Corporation






Re: Why rely on url scheme for fetching?

2014-11-01 Thread Tim Chen
Hi Ankur,

There is a fetcher_tests.cpp in src/tests.

Tim

On Sat, Nov 1, 2014 at 7:27 PM, Ankur Chauhan an...@malloc64.com wrote:

 Hi Tim,

 I am trying to find/write some test cases. I couldn't find a
 fetcher_tests.{cpp|hpp} so once I have something, I'll post on review
 board. I am new to gmock/gtest so bear with me while i get up to speed.

 -- Ankur

 On 1 Nov 2014, at 19:23, Timothy Chen t...@mesosphere.io wrote:

 Hi Ankur,

 Can you post on reviewboard? We can discuss more about the code there.

 Tim

 Sent from my iPhone

 On Nov 1, 2014, at 6:29 PM, Ankur Chauhan an...@malloc64.com wrote:

 Hi Tim,

 I don't think there is an issue which is directly in line with what I
 wanted, but the closest one that I could find in JIRA is
 https://issues.apache.org/jira/browse/MESOS-1711

 I have a branch (
 https://github.com/ankurcha/mesos/compare/prefer_hadoop_fetcher ) that
 has a change that would enable users to specify whatever hdfs-compatible
 uris to the mesos-fetcher, but maybe you can weigh in on it. Do you think
 this is the right track? If so, I would like to pick this issue and submit
 a patch for review.

 -- Ankur


 On 1 Nov 2014, at 04:32, Tom Arnfeld t...@duedil.com wrote:

 Completely +1 to this. There are now quite a lot of hadoop compatible
 filesystem wrappers out in the wild and this would certainly be very useful.

 I'm happy to contribute a patch. Here's a few related issues that might be
 of interest;

 - https://issues.apache.org/jira/browse/MESOS-1887
 - https://issues.apache.org/jira/browse/MESOS-1316
 - https://issues.apache.org/jira/browse/MESOS-336
 - https://issues.apache.org/jira/browse/MESOS-1248

 On 31 October 2014 22:39, Tim Chen t...@mesosphere.io wrote:

 I believe there is already a JIRA ticket for this, if you search for
 fetcher in Mesos JIRA I think you can find it.

 Tim

 On Fri, Oct 31, 2014 at 3:27 PM, Ankur Chauhan an...@malloc64.com
 wrote:

 Hi,

 I have been looking at some of the stuff around the fetcher and saw
 something interesting. The code for the fetcher::fetch method depends on a
 hard-coded list of url schemes. No doubt this works, but it is very
 restrictive.
 Hadoop/HDFS in general is pretty flexible when it comes to being able to
 fetch stuff from urls: it can fetch a large number of types
 of urls and can be extended by adding configuration to
 conf/hdfs-site.xml and core-site.xml.

 What I am proposing is that we refactor fetcher.cpp to prefer to use
 hdfs (via hdfs/hdfs.hpp) to do all the fetching if HADOOP_HOME is set
 and $HADOOP_HOME/bin/hadoop is available. This logic already exists and we
 can just use it. The fallback logic for using net::download or local file
 copy may be left in place for installations that do not have hadoop
 configured. This means that if hadoop is present we can directly fetch urls
 such as tachyon://... snackfs://... cfs://... ftp://... s3://...
 http://... file://... with no extra effort. This makes for a much
 better experience when it comes to debugging and extensibility.
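
 In other words, the ordering I have in mind is roughly this (sketched in
 Java just for illustration -- the real change would live in fetcher.cpp, and
 the helper names here are made up):

 import java.io.File;

 public class FetcherSketch {
     // Prefer the hadoop client when it's usable; otherwise keep today's
     // scheme-specific fallbacks.
     static String fetch(String uri) {
         String hadoopHome = System.getenv("HADOOP_HOME");
         if (hadoopHome != null && new File(hadoopHome, "bin/hadoop").canExecute()) {
             return fetchWithHadoop(hadoopHome, uri);   // hdfs://, s3://, tachyon://, ...
         } else if (uri.startsWith("http://") || uri.startsWith("https://")
                 || uri.startsWith("ftp://")) {
             return downloadOverNetwork(uri);           // today's net::download path
         } else {
             return copyLocalFile(uri);                 // file:// or a plain path
         }
     }

     // Placeholders standing in for the existing fetch implementations.
     static String fetchWithHadoop(String home, String uri) { return uri; }
     static String downloadOverNetwork(String uri) { return uri; }
     static String copyLocalFile(String uri) { return uri; }
 }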

 What do others think about this?

 - Ankur








Re: Why rely on url scheme for fetching?

2014-10-31 Thread Tim Chen
I believe there is already a JIRA ticket for this, if you search for
fetcher in Mesos JIRA I think you can find it.

Tim

On Fri, Oct 31, 2014 at 3:27 PM, Ankur Chauhan an...@malloc64.com wrote:

 Hi,

 I have been looking at some of the stuff around the fetcher and saw
 something interesting. The code for the fetcher::fetch method depends on a
 hard-coded list of url schemes. No doubt this works, but it is very
 restrictive.
 Hadoop/HDFS in general is pretty flexible when it comes to being able to
 fetch stuff from urls: it can fetch a large number of types
 of urls and can be extended by adding configuration to
 conf/hdfs-site.xml and core-site.xml.

 What I am proposing is that we refactor fetcher.cpp to prefer to use
 hdfs (via hdfs/hdfs.hpp) to do all the fetching if HADOOP_HOME is set
 and $HADOOP_HOME/bin/hadoop is available. This logic already exists and we
 can just use it. The fallback logic for using net::download or local file
 copy may be left in place for installations that do not have hadoop
 configured. This means that if hadoop is present we can directly fetch urls
 such as tachyon://... snackfs://... cfs://... ftp://... s3://...
 http://... file://... with no extra effort. This makes for a much better
 experience when it comes to debugging and extensibility.

 What do others think about this?

 - Ankur


Re: Exposing host services in docker container

2014-10-27 Thread Tim Chen
Hi Ankur,

You can't access the host via 'localhost' in a docker container, since localhost
(127.0.0.1) is just the loopback inside the container, not your
host.
However, since docker automatically creates the bridge for you that connects
the container and host, you can access the host by its ip from the
container.

Try the following instead:
HOST_IP=`hostname -i`
docker run -it --rm dockerfile/mongodb bash -c "mongo --host $HOST_IP:27017"

Tim

On Sun, Oct 26, 2014 at 11:28 PM, Ankur Chauhan an...@malloc64.com wrote:

 Hi Tim,

 Thanks for the reply.

 Yes these are services that are running on all the nodes. Imagine it like
 this: All slave hosts have a mongos running on 27017, a kafka broker, etc
 always running. I think I just don't know how to access them from within
 the container? My perception is that accessing localhost:27017 from within
 the container doesn't connect to the hosts' localhost:27017. If that is
 possible, how do I do it? I started a mongos on 27017 but when I tried the
 command
 `docker run -it --rm dockerfile/mongodb bash -c 'mongo --host
 localhost:27017'` it was unable to connect.

 -- Ankur

 On 26 Oct 2014, at 21:12, Tim Chen t...@mesosphere.io wrote:

 Hi Ankur,

 Not sure I understand exactly: are these common services all running on
 the same host where you're running the container?

 If they're running on the same host, the docker container should be able to
 access any port on the host; if it's cross-host then you have to set up your
 own bridge and use the lxc-conf option, or use something like pipework.

 I'm adding the lxc-conf options into the next release of mesos.

 Tim

 On Sun, Oct 26, 2014 at 6:39 PM, Ankur Chauhan an...@malloc64.com wrote:

 Hi all,

 So I have been dabbling with docker containers and mesos. I have the
 following scenario and I was wondering if someone had experience with
 something like this.

 I have a bunch of dockerized apps running and the slave hosts have some
 common services running, something like mongos and kafka brokers etc. I was
 wondering if there was a way to expose these services (i.e. some tcp port)
 to the docker containers?

 Is this even a legitimate way of exposing services to apps? My main
 intention is to avoid going over the network or something.

 -- Ankur






Re: Docker: pull on app start?

2014-10-27 Thread Tim Chen
Originally we didn't want to always pull the latest, as that's what docker
run does as well (it skips the pull if the image exists), and that has its
own issues.

However, with MESOS-1886 we can make this optional; we're just figuring out
where this configuration should live, and the latest discussion is leaning
towards DockerInfo, which is per-task.

Tim



On Mon, Oct 27, 2014 at 10:19 AM, Ankur Chauhan an...@malloc64.com wrote:

 I had a similar issue and what I ended up doing was to explicitly set the
 version tag instead of just saying :latest. That makes the whole system
 much more stable/predictable. The cost of a single curl call when you
 push is well worth it for debugging issues and knowing the state of the system.
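 Concretely, that just means putting an explicit tag into DockerInfo,
 something like this (registry, repo and tag are made-up examples):

 import org.apache.mesos.Protos.ContainerInfo;

 class PinnedImageExample {
     // Every build pushes a fresh tag and the task names it exactly,
     // so :latest never has to be resolved at launch time.
     static ContainerInfo.DockerInfo pinnedImage() {
         return ContainerInfo.DockerInfo.newBuilder()
             .setImage("my-registry.example.com/myapp:build-2014-10-27-42")
             .build();
     }
 }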
 -- Ankur

 On 27 Oct 2014, at 09:58, Donald Laidlaw donlaid...@me.com wrote:

 I have Mesos 0.20.1 and Marathon 0.7.3

 When marathon is creating a new App, and asks Mesos to start an instance,
 what docker command is run by mesos?

 The reason I ask is that I am referencing a docker container without a
 tag, expecting the tag latest to be used. And that seems to work. But
 when I update the container with a new version, and set the latest tag to
 point to the new version in the docker registry, mesos does not pull down
 that new version. I suspect mesos is checking to see if there is an image
 of the same name already on the slave and not issuing a pull request in
 that case.

 Thanks,
 Don






Re: Exposing host services in docker container

2014-10-26 Thread Tim Chen
Hi Ankur,

Not sure I understand exactly: are these common services all running on the
same host where you're running the container?

If they're running on the same host, the docker container should be able to
access any port on the host; if it's cross-host then you have to set up your
own bridge and use the lxc-conf option, or use something like pipework.

I'm adding the lxc-conf options into the next release of mesos.

Tim

On Sun, Oct 26, 2014 at 6:39 PM, Ankur Chauhan an...@malloc64.com wrote:

 Hi all,

 So I have been dabbling with docker containers and mesos. I have the
 following scenario and I was wondering if someone had experience with
 something like this.

 I have a bunch of dockerized apps running and the slave hosts have some
 common services running, something like mongos and kafka brokers etc. I was
 wondering if there was a way to expose these services (i.e. some tcp port)
 to the docker containers?

 Is this even a legitimate way of exposing services to apps? My main
 intention is to avoid going over the network or something.

 -- Ankur


Re: Staging docker task KILLED after 1 minute

2014-10-17 Thread Tim Chen
The case where Mesos loses track of these killed containers is going to
be fixed soon; I have a reviewboard up, and once it's merged we shouldn't have
untracked containers.

Tim

On Fri, Oct 17, 2014 at 3:14 PM, Dick Davies d...@hellooperator.net wrote:

 good catch! Sorry, the docs are right I just had a brain fart :)

 On 17 October 2014 13:46, Nils De Moor nils.de.m...@gmail.com wrote:
  Hi guys,
 
  Thanks for the swift feedback. I can confirm that tweaking the
  task_launch_timeout setting in marathon and setting it to a value bigger
  than the executor_registration_timeout setting in mesos fixed our
 problem.
 
  One sidenote though: the task_launch_timeout setting is in
 milliseconds, so
  for 5 minutes it's 300000 (vs 300 in seconds).
  It will save you some hair pulling when seeing your tasks being killed
  immediately after being launched. ;)
 
  Thanks again!
 
  Kr,
  Nils
 
  On Thu, Oct 16, 2014 at 4:27 PM, Michael Babineau
  michael.babin...@gmail.com wrote:
 
  See also https://issues.apache.org/jira/browse/MESOS-1915
 
  On Thu, Oct 16, 2014 at 2:59 AM, Dick Davies d...@hellooperator.net
  wrote:
 
  One gotcha - the marathon timeout is in seconds, so pass '300' in your
  case.
 
  let us know if it works, I spotted this the other day and anecdotally
  it addresses
  the issue for some users, be good to get more feedback.
 
  On 16 October 2014 09:49, Grzegorz Graczyk gregor...@gmail.com
 wrote:
   Make sure you have --task_launch_timeout in marathon set to same
 value
   as
   executor_registration_timeout.
  
  
 https://github.com/mesosphere/marathon/blob/master/docs/docs/native-docker.md#configure-marathon
  
   On 16 October 2014 10:37, Nils De Moor nils.de.m...@gmail.com
 wrote:
  
   Hi,
  
   Environment:
   - Clean vagrant install, 1 master, 1 slave (same behaviour on
   production
   cluster with 3 masters, 6 slaves)
   - Mesos 0.20.1
   - Marathon 0.7.3
   - Docker 1.2.0
  
   Slave config:
   - containerizers: docker,mesos
   - executor_registration_timeout: 5mins
  
   When I start docker container tasks, they start being pulled from
 the
   HUB, but after 1 minute mesos kills them.
   In the background though the pull is still finishing and when
   everything
   is pulled in, the docker container is started, without mesos knowing
   about
   it.
   When I start the same task in mesos again (after I know the pull of
   the
   image is done), they run normally.
  
   So this leaves slaves with 'dirty' docker containers, as mesos has
 no
   knowledge about them.
  
   From the logs I get this:
   ---
   I1009 15:30:02.990291  1414 slave.cpp:1002] Got assigned task
   test-app.23755452-4fc9-11e4-839b-080027c4337a for framework
   20140904-160348-185204746-5050-27588-
   I1009 15:30:02.990979  1414 slave.cpp:1112] Launching task
   test-app.23755452-4fc9-11e4-839b-080027c4337a for framework
   20140904-160348-185204746-5050-27588-
   I1009 15:30:02.993341  1414 slave.cpp:1222] Queuing task
   'test-app.23755452-4fc9-11e4-839b-080027c4337a' for executor
   test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
   '20140904-160348-185204746-5050-27588-
   I1009 15:30:02.995818  1409 docker.cpp:743] Starting container
   '25ac3310-71e4-4d10-8a4b-38add4537308' for task
   'test-app.23755452-4fc9-11e4-839b-080027c4337a' (and executor
   'test-app.23755452-4fc9-11e4-839b-080027c4337a') of framework
   '20140904-160348-185204746-5050-27588-'
  
   I1009 15:31:07.033287  1413 slave.cpp:1278] Asked to kill task
   test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
   20140904-160348-185204746-5050-27588-
   I1009 15:31:07.034742  1413 slave.cpp:2088] Handling status update
   TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for task
   test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
   20140904-160348-185204746-5050-27588- from @0.0.0.0:0
   W1009 15:31:07.034881  1413 slave.cpp:1354] Killing the unregistered
   executor 'test-app.23755452-4fc9-11e4-839b-080027c4337a' of
 framework
   20140904-160348-185204746-5050-27588- because it has no tasks
   E1009 15:31:07.034945  1413 slave.cpp:2205] Failed to update
 resources
   for
   container 25ac3310-71e4-4d10-8a4b-38add4537308 of executor
   test-app.23755452-4fc9-11e4-839b-080027c4337a running task
   test-app.23755452-4fc9-11e4-839b-080027c4337a on status update for
   terminal
   task, destroying container: No container found
   I1009 15:31:07.035133  1413 status_update_manager.cpp:320] Received
   status
   update TASK_KILLED (UUID: a8ec88a1-1809-4108-b2ed-056a725ecd41) for
   task
   test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
   20140904-160348-185204746-5050-27588-
   I1009 15:31:07.035210  1413 status_update_manager.cpp:373]
 Forwarding
   status update TASK_KILLED (UUID:
 a8ec88a1-1809-4108-b2ed-056a725ecd41)
   for
   task test-app.23755452-4fc9-11e4-839b-080027c4337a of framework
   20140904-160348-185204746-5050-27588- to master@10.0.10.11:5050
   

Re: Connecting spark from a different Machine to mesos cluster

2014-10-15 Thread Tim Chen
Hi Johannes,

When you started your 2nd shell, what log output from the slave do you see
for that framework?

Master seems to think it's already terminated.

Tim

On Wed, Oct 15, 2014 at 6:31 AM, Johannes Schillinger (Intern) 
johannes.schillin...@citrix.com wrote:

  Hi Tim,



 We are running Spark 1.1.0 with Hadoop 2.4. Mesos is version 0.20.1, all
 from binary releases.



 The Spark console is running in default mode, which is fine grained.



 The Spark process is started from a physical Machine running Ubuntu, the
 Mesos nodes are running in VMs also in Ubuntu.



 This is the output of the Spark Shell:




 

 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath

 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties

 14/10/15 15:18:24 INFO SecurityManager: Changing view acls to: USERNAME,

 14/10/15 15:18:24 INFO SecurityManager: Changing modify acls to: USERNAME,

 14/10/15 15:18:24 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(USERNAME, );
 users with modify permissions: Set(USERNAME, )

 14/10/15 15:18:24 INFO HttpServer: Starting HTTP Server

 14/10/15 15:18:24 INFO Utils: Successfully started service 'HTTP class
 server' on port 42469.

 Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/



 Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_65)

 Type in expressions to have them evaluated.

 Type :help for more information.

 14/10/15 15:18:26 WARN Utils: Your hostname, karwjohannes01 resolves to a
 loopback address: 127.0.1.1; using CLIENT_IP instead (on interface eth0)

 14/10/15 15:18:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
 another address

 14/10/15 15:18:27 INFO SecurityManager: Changing view acls to: USERNAME,

 14/10/15 15:18:27 INFO SecurityManager: Changing modify acls to: USERNAME,

 14/10/15 15:18:27 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(USERNAME, );
 users with modify permissions: Set(USERNAME, )

 14/10/15 15:18:27 INFO Slf4jLogger: Slf4jLogger started

 14/10/15 15:18:27 INFO Remoting: Starting remoting

 14/10/15 15:18:27 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@CLIENT_IP:51879]

 14/10/15 15:18:27 INFO Remoting: Remoting now listens on addresses:
 [akka.tcp://sparkDriver@CLIENT_IP:51879]

 14/10/15 15:18:27 INFO Utils: Successfully started service 'sparkDriver'
 on port 51879.

 14/10/15 15:18:27 INFO SparkEnv: Registering MapOutputTracker

 14/10/15 15:18:27 INFO SparkEnv: Registering BlockManagerMaster

 14/10/15 15:18:27 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20141015151827-1a2e

 14/10/15 15:18:27 INFO Utils: Successfully started service 'Connection
 manager for block manager' on port 60963.

 14/10/15 15:18:27 INFO ConnectionManager: Bound socket to port 60963 with
 id = ConnectionManagerId(CLIENT_IP,60963)

 14/10/15 15:18:27 INFO MemoryStore: MemoryStore started with capacity
 265.4 MB

 14/10/15 15:18:27 INFO BlockManagerMaster: Trying to register BlockManager

 14/10/15 15:18:27 INFO BlockManagerMasterActor: Registering block manager
 CLIENT_IP:60963 with 265.4 MB RAM

 14/10/15 15:18:27 INFO BlockManagerMaster: Registered BlockManager

 14/10/15 15:18:27 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-b032c76c-93e1-473e-802c-c55e12e85d41

 14/10/15 15:18:27 INFO HttpServer: Starting HTTP Server

 14/10/15 15:18:27 INFO Utils: Successfully started service 'HTTP file
 server' on port 47989.

 14/10/15 15:18:27 INFO Utils: Successfully started service 'SparkUI' on
 port 4040.

 14/10/15 15:18:27 INFO SparkUI: Started SparkUI at http://CLIENT_IP:4040

 14/10/15 15:18:27 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable

 I1015 15:18:28.524736  4748 sched.cpp:139] Version: 0.20.1

 I1015 15:18:28.527180  4750 sched.cpp:235] New master detected at
 master@MESOS_MASTER_IP:5050

 I1015 15:18:28.527300  4750 sched.cpp:243] No credentials provided.
 Attempting to register without authentication


 



 Mesos master WARNING log:

 W1015 14:13:00.235213  1118 master.cpp:3452] Master returning resources
 offered to framework 20141007-102213-343139338-5050-1037-3490 because the
 framework has terminated or is inactive

 W1015 14:13:35.244055  1121 master.cpp:3452] Master returning resources
 offered to framework 20141007-102213-343139338-5050-1037-3525 because the
 framework has terminated or is inactive

 W1015 14:13:50.252436  

Re: HDFS Mesos Framework

2014-10-08 Thread Tim Chen
Brenden Matthews has a HDFS framework that is still in progress:
https://github.com/brndnmtthws/hdfs

Welcome to contribute as well!

Tim

On Wed, Oct 8, 2014 at 9:51 AM, Luke Amdor luke.am...@banno.com wrote:

 Has anyone started work on a Hadoop HDFS Mesos framework? I know many of
 us just run HDFS alongside the Mesos slaves, but I was looking for
 something a little simpler. Possibly with federation support and quorum
 journals?

 --
 *Luke Amdor* | Platform Lead Architect | Banno | *ProfitStars®*
 Des Moines IA 50309 | Cell 515.231.4033



Re: Mesos Slave gets registered with lower memory than available

2014-10-01 Thread Tim Chen
Hi Stefan,

Yes, it's a feature where we leave some space on each slave and don't fully
allocate all the memory and cpu.

You can override how much resource your slave advertises by passing in the
--resources flag when you start your slave.

Tim

On Wed, Oct 1, 2014 at 9:22 AM, Stefan Eder masta_...@gmx.net wrote:

 Hi,

 I'm relatively new to Mesos and trying to setup a cluster for Jenkins
 slaves. Currently I have three machines running master, slave and
 zookeepers and one only running a slave.

 Everything works fine so far, except that the slaves registered on the
 master have a lower memory value than available on the machine. For
 example if I use a computer with 8GB then the registered value of that
 slave will be around 6.7GB. So there are around 1.3GB which are
 missing.

 I guess that this is a feature and not a bug (probably to ensure that
 other processes and what not can also run on the machine). However, is
 there a way to decrease that value?

 So far I found nothing about this in the documentation or elsewhere. Am
 I missing something?

 Would be nice if someone could help me.

 Thanks,
 Stefan




Re: Mesos 0.20.1 still using -net=host when launching Docker containers

2014-10-01 Thread Tim Chen
Hi Andy,

The docs are sitting in the docs folder in the source tree, and there is a
docker containerization doc markdown file.

Simply modify it and put a patch on reviewboard, and assign to the mesos
group and me.

Let me know if you need more specific steps around this.

Tim

On Wed, Oct 1, 2014 at 4:08 PM, Andy Grove andy.gr...@codefutures.com
wrote:

 I'd be happy to submit some documentation patches if you can point me in
 the right direction. Is there a separate git repo for the docs?

 Thanks,

 Andy.

 --
 Andy Grove
 VP Engineering
 CodeFutures Corporation



 On Wed, Oct 1, 2014 at 3:30 PM, Timothy Chen t...@mesosphere.io wrote:

 Sorry about the documentation; I didn't update the docs as part of 0.20.1,
 in which we added the network modes.

 If you like, please submit a patch and I can help submit it.

 Tim

 On Oct 1, 2014, at 2:25 PM, Tim St Clair tstcl...@redhat.com wrote:

 inline below -

 *From: *Andy Grove andy.gr...@codefutures.com
 *To: *user@mesos.apache.org
 *Sent: *Wednesday, October 1, 2014 2:47:54 PM
 *Subject: *Mesos 0.20.1 still using -net=host when launching Docker
 containers

 Hi,

 I'm making better progress but have run into another issue that I need
 help tracking down.

 I've actually packaged up my code in a github repo now and will be
 writing up a tutorial on this once I have everything working.

 https://github.com/codefutures/mesos-docker-tutorial

 The README.md contains instructions on running the framework.

 I can use this to start a single instance of the fedora/apache container,
 but when I try to run multiple instances the first one works while the other
 containers start and then fail pretty quickly.

 I tracked down the error information in the sandbox: the other
 containers are failing with "cannot bind to port 80", so it looks like the
 containers are being launched with host networking (-net=host). I thought
 this was one of the issues that was fixed in 0.20.1. Do I have to do
 something to enable containerized networking?

 It's still HOST by default, you'll need to specify network=BRIDGE in the
 DockerInfo iirc.
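
 Roughly like this from the framework side, if I remember the 0.20.1 Java
 protos correctly (the host port should come out of the ports resource in the
 offer; 31000 and port 80 here are just example values):

 import org.apache.mesos.Protos.ContainerInfo;
 import org.apache.mesos.Protos.ContainerInfo.DockerInfo;

 class BridgeNetworkExample {
     // Bridge networking with an explicit host->container port mapping.
     static ContainerInfo bridgedApache() {
         DockerInfo docker = DockerInfo.newBuilder()
             .setImage("fedora/apache")
             .setNetwork(DockerInfo.Network.BRIDGE)
             .addPortMappings(DockerInfo.PortMapping.newBuilder()
                 .setHostPort(31000)
                 .setContainerPort(80)
                 .setProtocol("tcp"))
             .build();
         return ContainerInfo.newBuilder()
             .setType(ContainerInfo.Type.DOCKER)
             .setDocker(docker)
             .build();
     }
 }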

 Cheers,
 Tim


 Thanks,

 Andy.

 --
 Andy Grove
 VP Engineering
 CodeFutures Corporation





 --
 Cheers,
 Timothy St. Clair
 Red Hat Inc.





Re: Docker executor issue

2014-09-30 Thread Tim Chen
Hi Andy,

Good catch, I also missed that as I was just looking at the Docker
configurations.

 You'll set the Executor when you have a custom executor.

Let us know if you have any other problems.

Tim

On Tue, Sep 30, 2014 at 11:02 AM, Andy Grove andy.gr...@codefutures.com
wrote:

 OK. So I figured out the issue with this and it was my misunderstanding of
 executors and tasks.

 My task info had:

 .setExecutor(Protos.ExecutorInfo.newBuilder(executor))

 I should have had this:

 .setContainer(containerInfoBuilder)
 .setCommand(Protos.CommandInfo.newBuilder().setShell(false))

 I didn't have a mesos executor deployed inside my container which explains
 the timeout issue.
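
 For anyone who hits the same thing, the shape that ended up working for me
 looks roughly like this (sketched against the 0.20.1 Java protos; the task
 id, image and resource values are placeholders):

 import org.apache.mesos.Protos.*;

 class DockerTaskExample {
     // A command-style docker task: no custom executor, so nothing has to
     // register from inside the image.
     static TaskInfo dockerTask(Offer offer, String taskId) {
         ContainerInfo container = ContainerInfo.newBuilder()
             .setType(ContainerInfo.Type.DOCKER)
             .setDocker(ContainerInfo.DockerInfo.newBuilder()
                 .setImage("dockerfile/ubuntu"))
             .build();
         return TaskInfo.newBuilder()
             .setName("docker-task-" + taskId)
             .setTaskId(TaskID.newBuilder().setValue(taskId))
             .setSlaveId(offer.getSlaveId())
             .setContainer(container)
             // shell=false so the image's default entrypoint is run rather
             // than a wrapped shell command.
             .setCommand(CommandInfo.newBuilder().setShell(false))
             .addResources(Resource.newBuilder()
                 .setName("cpus")
                 .setType(Value.Type.SCALAR)
                 .setScalar(Value.Scalar.newBuilder().setValue(0.5)))
             .addResources(Resource.newBuilder()
                 .setName("mem")
                 .setType(Value.Type.SCALAR)
                 .setScalar(Value.Scalar.newBuilder().setValue(256)))
             .build();
     }
 }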

 Thanks again for the support.


 Thanks,

 Andy.

 --
 Andy Grove
 VP Engineering
 CodeFutures Corporation



 On Tue, Sep 30, 2014 at 10:20 AM, Andy Grove andy.gr...@codefutures.com
 wrote:

 Hi Tim,

 Thanks for helping with this. I am running mesos-master and mesos-slave
 natively on the same host (my desktop). The only container in use is the
 one being launched by the mesos-slave.

 I will try your suggestion of running a simple command next.

 Here is the output from the slave from this issue though:

 I0930 10:13:52.053177 30722 main.cpp:126] Build: 2014-09-29 15:35:37 by
 andy
 I0930 10:13:52.053228 30722 main.cpp:128] Version: 0.20.1
 I0930 10:13:53.055480 30722 containerizer.cpp:89] Using isolation:
 posix/cpu,posix/mem
 I0930 10:13:53.058353 30722 main.cpp:149] Starting Mesos slave
 I0930 10:13:53.059651 30722 slave.cpp:167] Slave started on 1)@
 127.0.1.1:5051
 I0930 10:13:53.060072 30722 slave.cpp:278] Slave resources: cpus(*):8;
 mem(*):14963; disk(*):1.85648e+06; ports(*):[31000-32000]
 I0930 10:13:53.060226 30722 slave.cpp:306] Slave hostname: davros
 I0930 10:13:53.060253 30722 slave.cpp:307] Slave checkpoint: true
 I0930 10:13:53.064975 30729 state.cpp:33] Recovering state from
 '/tmp/mesos/meta'
 I0930 10:13:53.065352 30725 status_update_manager.cpp:193] Recovering
 status update manager
 I0930 10:13:53.065626 30729 docker.cpp:577] Recovering Docker containers
 I0930 10:13:53.065690 30724 containerizer.cpp:252] Recovering
 containerizer
 I0930 10:13:54.055233 30723 slave.cpp:3198] Finished recovery
 I0930 10:13:54.055448 30723 slave.cpp:589] New master detected at
 master@127.0.0.1:5050
 I0930 10:13:54.055532 30723 slave.cpp:625] No credentials provided.
 Attempting to register without authentication
 I0930 10:13:54.055537 30730 status_update_manager.cpp:167] New master
 detected at master@127.0.0.1:5050
 I0930 10:13:54.02 30723 slave.cpp:636] Detecting new master
 I0930 10:13:54.928225 30724 slave.cpp:754] Registered with master
 master@127.0.0.1:5050; given slave ID
 20140930-101303-16777343-5050-30690-0
 I0930 10:13:54.928598 30724 slave.cpp:767] Checkpointing SlaveInfo to
 '/tmp/mesos/meta/slaves/20140930-101303-16777343-5050-30690-0/slave.info'
 I0930 10:14:17.330390 30725 slave.cpp:1002] Got assigned task 0 for
 framework 20140930-101303-16777343-5050-30690-
 I0930 10:14:17.330557 30725 slave.cpp:1112] Launching task 0 for
 framework 20140930-101303-16777343-5050-30690-
 I0930 10:14:17.331296 30725 slave.cpp:1222] Queuing task '0' for executor
 default of framework '20140930-101303-16777343-5050-30690-
 *I0930 10:14:17.333109 30730 docker.cpp:984] Starting container
 'ebb1dca6-cc9d-427f-8faa-f3f723f6ab81' for executor 'default' and framework
 '20140930-101303-16777343-5050-30690-'*
 I0930 10:14:20.062705 30730 slave.cpp:2538] Monitoring executor 'default'
 of framework '20140930-101303-16777343-5050-30690-' in container
 'ebb1dca6-cc9d-427f-8faa-f3f723f6ab81'

 The container is running quite happily at this point.

 I0930 10:14:53.061337 30724 slave.cpp:3053] Current usage 0.76%. Max
 allowed age: 6.247043850997720days
 *I0930 10:15:17.331712 30730 slave.cpp:3010] Terminating executor default
 of framework 20140930-101303-16777343-5050-30690- because it did not
 register within 1mins*
 I0930 10:15:17.332221 30728 docker.cpp:1473] Destroying container
 'ebb1dca6-cc9d-427f-8faa-f3f723f6ab81'
 I0930 10:15:17.332308 30728 docker.cpp:1568] Running docker kill on
 container 'ebb1dca6-cc9d-427f-8faa-f3f723f6ab81'
 I0930 10:15:18.109361 30730 docker.cpp:1646] Executor for container
 'ebb1dca6-cc9d-427f-8faa-f3f723f6ab81' has exited


 Thanks,

 Andy.

 --
 Andy Grove
 VP Engineering
 CodeFutures Corporation



 On Mon, Sep 29, 2014 at 6:25 PM, Tim Chen t...@mesosphere.io wrote:

 Hi Andy,

 You don't need to specify -d as the docker containerizer will set it
 for you since we run all docker images detached.

 It seems like the executor just simply can't register with the slave.
 Can you try just running a simple command without Docker that takes longer
 than the executor registration timeout to see if you see the same error?

 Also do you run the mesos slave in a docker container as well?

 Will be great if you can share the slave log as Vinod suggested too.

 Tim

Re: [VOTE] Release Apache Mesos 0.20.1 (rc3)

2014-09-19 Thread Tim Chen
+1 (non-binding)

Make check on Centos 5.5, docker tests all passed too.

Tim

On Fri, Sep 19, 2014 at 9:17 AM, Jie Yu yujie@gmail.com wrote:

 +1 (binding)

 Make check on centos5 and centos6 (gcc48)

 On Thu, Sep 18, 2014 at 4:05 PM, Adam Bordelon a...@mesosphere.io wrote:

 Hi all,

 Please vote on releasing the following candidate as Apache Mesos 0.20.1.


 0.20.1 includes the following:

 
 Minor bug fixes for docker integration, network isolation, build, etc.

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.20.1-rc3

 

 The candidate for Mesos 0.20.1 release is available at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc3/mesos-0.20.1.tar.gz

 The tag to be voted on is 0.20.1-rc3:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.20.1-rc3

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc3/mesos-0.20.1.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc3/mesos-0.20.1.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1036

 Please vote on releasing this package as Apache Mesos 0.20.1!

 The vote is open until Mon Sep 22 17:00:00 PDT 2014 and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.20.1
 [ ] -1 Do not release this package because ...

 Thanks,
 Adam and Bhuvan





Re: [VOTE] Release Apache Mesos 0.20.1 (rc2)

2014-09-18 Thread Tim Chen
-1

The docker test failed when I removed the image, and I found a problem in
the docker pull implementation.
I've created a reviewboard for a fix: https://reviews.apache.org/r/25758

Would like to get this fixed before releasing it.

Tim

On Wed, Sep 17, 2014 at 9:10 PM, Vinod Kone vinodk...@gmail.com wrote:

 +1 (binding)

 make check passes on CentOS 5.5 w/ gcc 4.8.2.



 On Wed, Sep 17, 2014 at 7:42 PM, Adam Bordelon a...@mesosphere.io wrote:

 Update: The vote is open until Mon Sep 22 10:00:00 PDT 2014 and passes
 if a majority of at least 3 +1 PMC votes are cast.

 On Wed, Sep 17, 2014 at 6:27 PM, Adam Bordelon a...@mesosphere.io
 wrote:

 Hi all,

 Please vote on releasing the following candidate as Apache Mesos 0.20.1.


 0.20.1 includes the following:

 
 Minor bug fixes for docker integration, network isolation, etc.

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.20.1-rc2

 

 The candidate for Mesos 0.20.1 release is available at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz

 The tag to be voted on is 0.20.1-rc2:
 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.20.1-rc2

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.20.1-rc2/mesos-0.20.1.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1034

 Please vote on releasing this package as Apache Mesos 0.20.1!

 The vote is open until  and passes if a majority of at least 3 +1 PMC
 votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.20.1
 [ ] -1 Do not release this package because ...

 Thanks,
 -Adam-






Re: Mesos 0.20.0 with Docker registry availability

2014-09-05 Thread Tim Chen
Hi Maxime,

It is a very valid concern and that's why I've added a patch that should go
out in 0.20.1 to not do a docker pull on every run anymore.

Mesos will still try to docker pull when the image isn't available locally
(via docker inspect), but only once.

The downside of course is that you're not able to automatically get the
latest tagged image, but I think it's a worthwhile price to pay to gain the
benefits of not depending on the registry, being able to run local images, and more.
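
In other words the behaviour is roughly this (just a sketch of the logic, not
the actual containerizer code):

import java.io.IOException;

class PullOnceSketch {
    // Only pull when the image isn't already known to the local docker daemon.
    static void ensureImage(String image) throws IOException, InterruptedException {
        if (run("docker", "inspect", image) != 0) {   // non-zero exit => image not present locally
            if (run("docker", "pull", image) != 0) {
                throw new IOException("docker pull failed for " + image);
            }
        }
    }

    static int run(String... cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}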

Tim


On Thu, Sep 4, 2014 at 10:50 PM, Maxime Brugidou maxime.brugi...@gmail.com
wrote:

 Hi,

 The current Docker integration in 0.20 does a docker pull from the
 registry before running any task. This means that your entire Mesos cluster
 becomes unusable if the registry goes down.

 The docs allow you to configure a custom .dockercfg for your tasks to
 point to a private docker registry.

 However it is not easy to run an HA docker registry. The docker-registry
 project recommends using S3 storage but this is definitely not an option for
 some people.

 I know that for regular artifacts, Mesos can use HDFS storage and you can
 run your HDFS datanodes as Mesos tasks.

 So even if I attempt to have a docker registry storage in HDFS (which is
 not supported by docker-registry at the moment), I am stuck on a chicken
 and egg problem. I want to have as little services outside of Mesos as
 possible and it is hard to maintain HA services (especially outside of
 Mesos).

 Is there anyone running Mesos with Docker in production without S3? I am
 trying to make all the services outside of Mesos (the infra services that
 are necessary to run Mesos like DNS, Haproxy, Chef server... etc) either HA
 or not critical for the cluster to run. The docker registry is a new piece
 of infra outside of Mesos that is critical...

 Best,
 Maxime



Re: Mesos 0.20.0 with Docker registry availability

2014-09-05 Thread Tim Chen
It is an option that I'm considering adding, but not sure about the failure
to pull part.
Can you create a jira for this? We can then discuss there.

Tim


On Thu, Sep 4, 2014 at 11:19 PM, Steven Schlansker 
sschlans...@opentable.com wrote:

 Would it be possible to have a mode where it tries to pull, but then does
 not fail solely due to the failure of a pull?  In particular, we use tags to
 indicate which build should be deployed e.g. “foo-server:production” tag vs
 “foo-server:staging” tags.

 On Sep 4, 2014, at 11:05 PM, Tim Chen t...@mesosphere.io wrote:

  Hi Maxime,
 
  It is a very valid concern and that's why I've added a patch that should
 go out in 0.20.1 to not do a docker pull on every run anymore.
 
  Mesos will still try to docker pull when the image isn't available
 locally (via docker inspect), but only once.
 
  The downside of course is that you're not able to automatically get the
 latest tagged image, but I think it's a worthwhile price to pay to gain the
 benefits of not depending on the registry, being able to run local images and more.
 
  Tim
 
 
  On Thu, Sep 4, 2014 at 10:50 PM, Maxime Brugidou 
 maxime.brugi...@gmail.com wrote:
  Hi,
 
  The current Docker integration in 0.20 does a docker pull from the
 registry before running any task. This means that your entire Mesos cluster
 becomes unusable if the registry goes down.
 
  The docs allow you to configure a custom .dockercfg for your tasks to
 point to a private docker registry.
 
  However it is not easy to run an HA docker registry. The docker-registry
 project recommends using S3 storage but this is definitely not an option for
 some people.
 
  I know that for regular artifacts, Mesos can use HDFS storage and you
 can run your HDFS datanodes as Mesos tasks.
 
  So even if I attempt to have a docker registry storage in HDFS (which is
 not supported by docker-registry at the moment), I am stuck on a chicken
 and egg problem. I want to have as little services outside of Mesos as
 possible and it is hard to maintain HA services (especially outside of
 Mesos).
 
  Is there anyone running Mesos with Docker in production without S3? I am
 trying to make all the services outside of Mesos (the infra services that
 are necessary to run Mesos like DNS, Haproxy, Chef server... etc) either HA
 or not critical for the cluster to run. The docker registry is a new piece
 of infra outside of Mesos that is critical...
 
  Best,
  Maxime
 
 




Re: Launching docker containers from private repos in docker hub

2014-09-05 Thread Tim Chen
The Docker Containerizer will automatically set the $HOME directory for
you, so all you need is to include the .dockercfg as Michael mentioned.
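
For example, from the framework side you can ship it as a task URI (the HDFS
path is just a placeholder for wherever you keep the file; sketched against
the 0.20.x Java protos):

import org.apache.mesos.Protos.CommandInfo;

class DockerCfgUriExample {
    // The slave fetches the URI into the sandbox, and the sandbox is the
    // $HOME the docker containerizer points docker at.
    static CommandInfo.Builder withDockerCfg(CommandInfo.Builder command) {
        return command.addUris(CommandInfo.URI.newBuilder()
            .setValue("hdfs://namenode/mesos/config/.dockercfg"));
    }
}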

Thanks,

Tim


On Fri, Sep 5, 2014 at 10:32 AM, Michael Babineau 
michael.babin...@gmail.com wrote:

 You'll need to put a .dockercfg file somewhere the slave can access.
 Storing it in a central place and passing it in as a task URI is a good
 approach.

 Docker looks for .dockercfg inside of $HOME. I don't recall whether $HOME
 is automatically set to the workspace -- if not, you'll need to set that on
 your task as well.

 Some more info here:
 http://mesos.apache.org/documentation/latest/docker-containerizer/


 On Fri, Sep 5, 2014 at 10:07 AM, Andy Grove andy.gr...@codefutures.com
 wrote:

 I now have mesos launching docker containers as tasks, which is great.

 I am using this code:

 final ContainerInfo.DockerInfo.Builder dockerInfo =
 ContainerInfo.DockerInfo.newBuilder()
 .setImage("registry.hub.docker.com/dockerfile/ubuntu");

 Is there a way I can provide docker hub credentials so I can launch a
 container from a private docker hub repo?

 Thanks,

 Andy.

 --
 Andy Grove
 VP Engineering
 CodeFutures Corporation






Re: Mesos 0.20.0 with Docker registry availability

2014-09-05 Thread Tim Chen
Hi Tom,

It's definitely a tradeoff between being able to automatically pick up the
latest changes and having more control over what your cluster is running.

The downside of always pulling the latest is that you might not
know exactly what's running, since someone might override latest and there
is no versioning you can roll back to.

I thought a good solution could be having the frameworks provide a way to
configure a default image setting that they put into DockerInfo,
vs having it baked into the slave; then you're much more flexible in how
you want to deal with default images/tags and don't need to rely on Mesos
changes.

Tim


On Fri, Sep 5, 2014 at 3:29 AM, Tom Arnfeld t...@duedil.com wrote:

  You can tag each image with your commit hash that way Mesos will always
 have to do a docker pull and you don't lose the fast iteration cycle in
 development.

 I mentioned this on one of the review requests the other day. The problem
 here is that, say I want to iterate quickly on installing things for our
 Hadoop on Mesos cluster, I need to now change all the hadoop configuration
 on my Job Trackers to point to the new image, which means a restart of the
 JT and jobs will die. This goes for pretty much every mesos framework that
 isn't for launching long running tasks.


 On 5 September 2014 11:14, Steve Domin st...@gocardless.com wrote:

 Hi Ryan,

 You can tag each image with your commit hash that way Mesos will always
 have to do a docker pull and you don't lose the fast iteration cycle in
 development.

 Steve

 On Friday, September 5, 2014, craig mcmillan mccraigmccr...@gmail.com
 wrote:

 hey ryan,

 there are two deployment use-cases i generally have :

 - production : i want to consider carefully what i deploy, and refer to
a specific image. a versioned tag works well here

 - development : i want to iterate quickly and something like a branch
 (movable tag) works really well here, à la heroku :
 git-push => commit-hook => build docker-image =>
 curl-to-marathon

 it's the development use-case that pull on every launch supports best

 would an option in the ContainerInfo to pull on every launch be
 reasonable ?

 i'm happy to do a PR if that would be helpful !

 :craig

 On 5 Sep 2014, at 9:07, Ryan Thomas wrote:

  Whilst this is somewhat unrelated to the mesos implementation, I think
 it
 is generally good practice to have immutable tags on the images; this is
 something I dislike about docker :)

 Whilst the gc of old images will eventually become a problem, it will
 really
 only be the layer delta that is consumed with each new tag. But I think
 yes, there would need to be some mechanism to clear out the images in
 the
 local registry.

 ryan
 On 5 Sep 2014 18:03, mccraig mccraig mccraigmccr...@gmail.com
 wrote:

  ah, so i will have to use a different tag to update an app

 one immediate problem i can see is that it makes garbage collecting old
 docker images from slaves harder : currently i update the image
 associated
 with a tag and restart tasks to update the running app, then
 occasionally a
 cron job to remove all docker images with no tag

 if every updated image has a new tag it will be harder to figure out
 which
 images to remove... perhaps any with no running container, though that
 could lead to unnecessary pulls and slower restarts of failed tasks

 :craig

 On 5 Sep 2014, at 08:43, Ryan Thomas r.n.tho...@gmail.com wrote:

 Hey Craig,

 docker run will attempt a pull of the image if it cannot find a
 matching
 image and tag in its local repository.

 So it should only pull on the first run of a given tag.

 ryan
 On 5 Sep 2014 17:41, mccraig mccraig mccraigmccr...@gmail.com
 wrote:

  hi tim,

 if it doesn't pull on every run, when will it pull ?

 :craig

 On 5 Sep 2014, at 07:05, Tim Chen t...@mesosphere.io wrote:

 Hi Maxime,

 It is a very valid concern and that's why I've added a patch that
 should
 go out in 0.20.1 to not do a docker pull on every run anymore.

 Mesos will still try to docker pull when the image isn't available
 locally (via docker inspect), but only once.

 The downside of course is that you're not able to automatically get the
 latest tagged image, but I think it's a worthwhile price to pay to gain the
 benefits of not depending on the registry, being able to run local images and
 more.
 more.

 Tim


 On Thu, Sep 4, 2014 at 10:50 PM, Maxime Brugidou 
 maxime.brugi...@gmail.com wrote:

  Hi,

 The current Docker integration in 0.20 does a docker pull from the
 registry before running any task. This means that your entire Mesos
 cluster
 becomes unusable if the registry goes down.

 The docs allow you to configure a custom .dockercfg for your tasks to
 point to a private docker registry.

 However it is not easy to run an HA docker registry. The
 docker-registry
 project recommends using S3 storage but this is definitely not an
 option for
 some people.

 I know that for regular artifacts, Mesos can use HDFS
