mesos agent not recovering after ZK init failure

2016-02-09 Thread Sharma Podila
We had a few mesos agents stuck in an unrecoverable state after a transient
ZK init error. Is this a known problem? I wasn't able to find an existing
jira item for this. We are on 0.24.1 at this time.

Most agents were fine, except for a handful. On those agents the mesos-slave
process was constantly restarting. The .INFO logfile contained the output
below before the process exited, with no error messages. The restarts kept
happening because of an existing service keep-alive strategy.

To fix it, we manually stopped the service, removed the data in the working
dir, and then restarted it. The mesos-slave process was then able to start
successfully. The manual intervention needed to resolve this is problematic.
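
For reference, here is a rough sketch of that manual remediation as a small
script. The systemd unit name "mesos-slave" and the use of systemctl are
assumptions (the thread does not say how the keep-alive is implemented); the
work dir /mnt/data/mesos is taken from the "Recovering state from
'/mnt/data/mesos/meta'" line in the log below. Wiping the work dir abandons
any recoverable tasks on that agent, so treat it as a last resort.

#!/usr/bin/env python3
"""Rough sketch of the manual agent recovery described above (assumptions noted)."""
import shutil
import subprocess

SERVICE = "mesos-slave"       # assumed systemd unit name; adjust for your setup
WORK_DIR = "/mnt/data/mesos"  # agent work_dir, per the log lines below

def recover_agent():
    # Stop the service first so the keep-alive mechanism cannot restart it mid-cleanup.
    subprocess.run(["systemctl", "stop", SERVICE], check=True)
    # Remove the checkpointed state ("removed the data in the working dir").
    shutil.rmtree(WORK_DIR, ignore_errors=True)
    # Start the agent again; it should come up cleanly and re-register with the master.
    subprocess.run(["systemctl", "start", SERVICE], check=True)

if __name__ == "__main__":
    recover_agent()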

Here's the contents of the various log files on the agent:

The .INFO logfile for one of the restarts before mesos-slave process exited
with no other error messages:

Log file created at: 2016/02/09 02:12:48
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by
builds
I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation:
posix/cpu,posix/mem,filesystem/posix
I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@
10.138.146.230:7101
I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup:
--appc_store_dir="/tmp/mesos/store/appc"
--attributes="region:us-east-1;" --authenticatee=""
--cgroups_cpu_enable_pids_and_tids_count="false"
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
--cgroups_limit_swap="false" --cgroups_root="mesos"
--container_disk_watch_interval="15secs" --containerizers="mesos" "
I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources:
ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: 
I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@
10.138.146.230:7101) connected to ZooKeeper
I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path
'/titus/main/mesos' in ZooKeeper
I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader:
(id='209')
I0209 02:12:48.520803 97284 group.cpp:674] Trying to get
'/titus/main/mesos/json.info_000209' in ZooKeeper
I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from
'/mnt/data/mesos/meta'
I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file
'/mnt/data/mesos/meta/resources/resources.info'
I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=
master@10.230.95.110:7103) is detected


The .FATAL log file when the original transient ZK error occurred:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper,
zookeeper_init: No such file or directory [2]


The .ERROR log file:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper,
zookeeper_init: No such file or directory [2]

The .WARNING file had the same content.


Re: mesos agent not recovering after ZK init failure

2016-02-09 Thread Vinod Kone
MESOS-1326 was fixed in 0.19.0 (I set the fix version just now). But I guess
you are saying it is somehow related but not exactly the same issue?

On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés  wrote:

> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>
>
> -rgs
>
>


Re: mesos agent not recovering after ZK init failure

2016-02-09 Thread Raúl Gutiérrez Segalés
On 9 February 2016 at 11:04, Sharma Podila  wrote:

> We had a few mesos agents stuck in an unrecoverable state after a
> transient ZK init error. Is this a known problem? I wasn't able to find an
> existing jira item for this. We are on 0.24.1 at this time.

Maybe related: https://issues.apache.org/jira/browse/MESOS-1326


-rgs


Re: mesos agent not recovering after ZK init failure

2016-02-09 Thread Sharma Podila
Maybe related, but maybe different, since a new process seems to find the
master leader and still aborts, never recovering across restarts until the
work dir data is removed.
It is happening in 0.24.1.




On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone  wrote:

> MESOS-1326 was fixed in 0.19.0 (I set the fix version just now). But I guess
> you are saying it is somehow related but not exactly the same issue?
>
> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
> r...@itevenworks.net> wrote:
>
>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>
>>
>> -rgs
>>
>>
>


Re: Apache Mesos Community Sync

2016-02-09 Thread Michael Park
Our next community sync will be on Thursday, February 11, 2016 at 9am PST.

To join in person, come to Mesosphere HQ at 88 Stevenson St. and see the
reception on the 2nd floor.

Please add your agenda items to the Google Doc!

P.S. Subscribe to the Mesos Events Calendar for future meetings and events.

Thanks,

MPark.

On 27 January 2016 at 19:01, Michael Park  wrote:

> Our next community sync will be tomorrow January 28, 2016 at 3pm PST.
>
> To join in person, come to Mesosphere HQ at 88 Stevenson St. and see the
> reception on the 2nd floor.
>
> Please add your agenda items to the Google Doc!
>
> P.S. Subscribe to the Mesos events calendar at
> https://calendar.google.com/calendar/embed?src=2hecvndc0mnaqlir34cqnfvtak%40group.calendar.google.com
> for future meetings and events.
>
> Thanks,
>
> MPark.
>
> On 13 January 2016 at 13:43, Michael Park  wrote:
>
>> Our next community sync will be tomorrow Jan 14 at 9pm PST (looks like I
>> mistakenly said Jan 15 in my last email, sorry!).
>>
>> We'll host this one online only, and will send out links to Hangouts and
>> OnAir tomorrow.
>>
>> Please add your agenda items to the Google Doc,
>> and subscribe to the Mesos events calendar at
>> https://calendar.google.com/calendar/embed?src=2hecvndc0mnaqlir34cqnfvtak%40group.calendar.google.com
>> for future meeting schedules.
>>
>> Thanks,
>>
>> MPark.
>>
>> On Tue, Dec 22, 2015 at 7:12 PM Michael Park  wrote:
>>
>>> Our last community sync was supposed to be on Dec 17, unfortunately we
>>> missed it. Sorry if anyone was looking for where it was happening.
>>>
>>> The next one would have been on Dec 31, but given that most people will
>>> likely be out for holidays I've updated the calendar to cancel it.
>>>
>>> Our next meeting is scheduled for Jan 15 at 9pm PST, and we'll send out
>>> more detailed information as it approaches.
>>>
>>> Thanks,
>>>
>>> Happy holidays!
>>>
>>> MPark.
>>>
>>> On Thu, Dec 3, 2015 at 2:46 AM Michael Park  wrote:
>>>
 Our next community sync will be tomorrow December 3 at 3pm PST.

 To join in person, come to Mesosphere HQ at 88 Stevenson St. and see
 the reception on the 2nd floor.

 Please add your agenda items to the Google Doc!

 Subscribe to our Mesos events calendar at
 https://calendar.google.com/calendar/embed?src=2hecvndc0mnaqlir34cqnfvtak%40group.calendar.google.com
 for future meeting schedules.

 Thanks,

 MPark.

 We will use Hangouts + YouTube OnAir, the links will be shared on IRC
 and via email shortly before the meeting.

 Please add agenda items to the Google Doc!

 Thanks,

 MPark.

 On Thu, Nov 19, 2015 at 12:01 PM Michael Park 
 wrote:

> Greg Mann and I will be hosting the community sync on the web today
> (November 19) at 9pm PST.
>
> We will use Hangouts + YouTube OnAir, the links will be shared on IRC
> and via email shortly before the meeting.
>
> Please add agenda items to the Google Doc!
>
> Thanks,
>
> MPark.
>
> On Thu, Nov 5, 2015 at 3:20 PM Adam Bordelon 
> wrote:
>
>> Sorry for the late link, but you can see the youtube stream (and
>> after-the-fact video) at: http://youtu.be/rJyT8xDzhcA
>>
>> We also have a hangout link for those with agenda items, or those who have
>> lengthy things to discuss (ask if you need the link). For brief questions,
>> you can add them to the agenda or ask in IRC and I will relay them for you.
>>
>> On Wed, Nov 4, 2015 at 4:42 PM, Adam Bordelon 
>> wrote:
>>
>> > Sounds great! Please join us at Mesosphere HQ, 88 Stevenson St., SF at
>> > 3pm Pacific tomorrow.
>> > We will use youtube-onair again, links to be posted to IRC/email shortly
>> > before the meeting.
>> >
>> > Please add agenda items:
>> >
>> > https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit#heading=h.za1f9dpxisdr
>> >
>> > On Wed, Nov 4, 2015 at 4:25 PM, Jie Yu  wrote:
>> >
>> >> Adam, since most of the Twitter 

Re: China Mesos Developers Wechat Group

2016-02-09 Thread haosdent
Thank you very much, Adam. Now I know why @Gilbert told me he could not see
the image last night. Let me send it through a link.

The QR code link: http://i13.tietuku.com/d95e584fa4815860.jpg



On Tue, Feb 9, 2016 at 3:33 PM, Adam Bordelon  wrote:

> Haosdent, I think Apache mail may strip out images, so you'll have to send
> a link to the QR image.
>
> On Sun, Feb 7, 2016 at 8:02 AM, haosdent  wrote:
>
> > Hi, our dear Chinese friends. Because of some interesting things related
> > to the China network, it is a bit difficult to communicate over Google
> > Hangouts or other tools. Thanks to Tommy Xiao, who created a WeChat group
> > named "mesos开发爱好者". Feel free to scan this QR code and join the group
> > if you have a WeChat account.
> >
> > [image: Inline image 1]
> > Happy Lunar New Year!
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
>



-- 
Best Regards,
Haosdent Huang


Re: Managing Persistency via Frameworks (HDFS, Cassandra)

2016-02-09 Thread tommy xiao
Hi Andreas,

I have recommended that my customers build an HDFS resource pool outside the
Mesos cluster, out of general concerns. But in a development or staging
environment, using Mesos to manage your HDFS cluster is an ideal fit. Once the
Mesos community provides more production cases, we can upgrade the development
cluster to a production cluster easily.


2016-02-09 14:50 GMT+08:00 Andreas Fritzler :

> Hi Klaus,
>
> thanks for your reply. I am aware of the frameworks provided by Mesosphere,
> and I already tried them out in a POC setup. From looking at the HDFS
> documentation [1], however, the framework seems to still be in beta.
>
> "HDFS is available at the beta level and not recommended for Mesosphere
> DCOS production systems."
>
> I think what my questions boil down to is the following: should I use a
> Mesos framework to manage persistency within my Mesos cluster, or should I
> do it outside with other means - e.g. using Ambari to set up a shared HDFS,
> etc.
>
> If I were to use those frameworks, what is your experience regarding their
> lifecycle management: scaling out instances, upgrading to newer versions, etc.?
>
> Regards,
> Andreas
>
> [1] https://docs.mesosphere.com/manage-service/hdfs/
>
> On Tue, Feb 9, 2016 at 1:05 AM, Klaus Ma  wrote:
>
>> Hi Andreas,
>>
>> I think Mesosphere has done some work on your questions; would you check
>> the related repos at https://github.com/mesosphere ?
>>
>>
>> On Mon, Feb 8, 2016 at 9:43 PM Andreas Fritzler <
>> andreas.fritz...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a couple of questions around the persistency topic within a Mesos
>>> cluster:
>>>
>>> 1. Any takes on the quality of the HDFS [1] and the Cassandra [2]
>>> frameworks? Does anybody have any experience running those frameworks
>>> in production?
>>>
>>> 2. How well do those frameworks perform if I want to use them to
>>> separate tenants on one Mesos cluster? (HDFS is not dockerized yet?)
>>>
>>> 3. How about scaling out/down existing framework instances? Is that even
>>> possible? Couldn't find anything in the docs/github.
>>>
>>> 4. Upgrading a running instance: I'm wondering how that is managed in
>>> those frameworks. There is an open issue for the HDFS [3] part. For
>>> Cassandra, the scheduler update seems to be smooth; however, changing the
>>> underlying Cassandra version seems to be tricky [4].
>>>
>>> Regards,
>>> Andreas
>>>
>>> [1] https://github.com/mesosphere/hdfs
>>> [2] https://github.com/mesosphere/cassandra-mesos
>>> [3] https://github.com/mesosphere/hdfs/issues/23
>>> [4] https://github.com/mesosphere/cassandra-mesos/issues/137
>>>
>> --
>>
>> Regards,
>> 
>> Da (Klaus), Ma (马达), PMP® | Advisory Software Engineer
>> IBM Platform Development & Support, STG, IBM GCG
>> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
>>
>
>


-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com