Hi Siva, it looks like you bumped into https://issues.apache.org/jira/browse/MESOS-2276. Feel free to upvote!
On Thu, Feb 5, 2015 at 1:56 PM, Sivaram Kannan <sivara...@gmail.com> wrote:

> Hi,
>
> In our deployments of mesos-slave, we are getting the following error
> during startup. I understand the slave is failing because of the large
> number of file descriptors being opened. I have increased the fd ulimit
> from 1024 to 4096, but the behavior is still the same. What can I do to
> solve this problem, and what should I do to prevent it?
>
> Thanks,
> ./Siva.
>
> Initiating client connection, host=11.0.190.1:2181 sessionTimeout=10000 watcher=0x7f6de4
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076289 15 slave.cpp:169] Slave started on 1)@11.1.6.1:5051
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076544 15 slave.cpp:289] Slave resources: cpus(*):24; mem(*):47336; disk(*):469416; ports(*):[31000-32000]
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076575 15 slave.cpp:318] Slave hostname: 11.1.6.1
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.076582 15 slave.cpp:319] Slave checkpoint: true
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078135 25 state.cpp:33] Recovering state from '/var/lib/mesos/slave/meta'
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078233 20 status_update_manager.cpp:197] Recovering status update manager
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.078333 20 docker.cpp:767] Recovering Docker containers
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: 2015-02-05 12:33:58,102:6(0x7f6dc3fff700):ZOO_INFO@check_events@1703: initiated connection to server [11.0.190.1:2181]
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: 2015-02-05 12:33:58,104:6(0x7f6dc3fff700):ZOO_INFO@check_events@1750: session establishment complete on server [11.0.190.1:2181], sessionId=0x14b3c82555299c7,
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104671 30 group.cpp:313] Group process (group(1)@11.1.6.1:5051) connected to ZooKeeper
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104708 30 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.104725 30 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.106376 22 detector.cpp:138] Detected a new leader: (id='3')
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.106477 25 group.cpp:659] Trying to get '/mesos/info_0000000003' in ZooKeeper
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: I0205 12:33:58.107293 30 detector.cpp:433] A new leading master (UPID=master@11.1.4.1:5050) is detected
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Failed to perform recovery: Collect failed: Failed to create pipe: Too many open files
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: To remedy this do as follows:
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: This ensures slave doesn't recover old live executors.
> Feb 05 12:33:58 node-d4856455ad5c sh[32162]: Step 2: Restart the slave.
> Feb 05 12:33:58 node-d4856455ad5c systemd[1]: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
> Feb 05 12:33:58 node-d4856455ad5c docker[3351]: mesos_slave
> Feb 05 12:33:58 node-d4856455ad5c systemd[1]: Unit mesos-slave.service entered failed state.
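
Regarding the ulimit change mentioned in the quoted message: a limit raised in a login shell does not necessarily reach a process started by systemd or inside a container, and the log above suggests both may be involved here. That is an assumption to verify, not a diagnosis. A minimal Linux-only sketch for checking which limit a running process actually has, and how many descriptors it currently holds, could look like this (the helper names are hypothetical, and the slave's PID has to be supplied by you):

```python
import os
import resource


def own_fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def process_fd_limits(pid="self"):
    """Read the 'Max open files' row of /proc/<pid>/limits (Linux only).

    Returns (soft, hard) as strings, because a value may be 'unlimited'.
    """
    with open(f"/proc/{pid}/limits") as limits_file:
        for line in limits_file:
            if line.startswith("Max open files"):
                fields = line.split()
                # fields: ['Max', 'open', 'files', <soft>, <hard>, 'files']
                return fields[3], fields[4]
    raise RuntimeError("no 'Max open files' row found")


def open_fd_count(pid="self"):
    """Count open descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    print("limits of this process:", own_fd_limits())
    print("limits from /proc:     ", process_fd_limits())
    print("open descriptors:      ", open_fd_count())
```

Passing the PID of the running slave instead of the default "self" shows the limits that process was really started with. Reading another user's /proc/<pid>/fd usually requires root.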