>but I noticed that the code added to fix Mesos-3834 appears in the master branch in github, but not the 0.26.0 branch.

The 0.26.0-rc1 branch was cut on Nov 13, 2015, while this patch was committed on Nov 24, 2015, so 0.26.0 does not contain it.
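If you want to double-check from a local clone of https://github.com/apache/mesos, something like the following should work (a rough sketch; I don't have the sha of the fix at hand, and the fix may span more than one commit):

    $ git log --oneline --grep=MESOS-3834 master   # find the sha(s) of the fix
    $ git tag --contains <sha>                     # no 0.26.0 tag should be listed

`git tag --contains` lists every tag that can reach the commit, so if the 0.26.0 tags are missing from the output, the release really does not include the patch.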
On Fri, Jan 22, 2016 at 7:19 AM, David Kesler <[email protected]> wrote:
> I'm attempting to test upgrading from our current version of mesos (0.22.1) to the latest. Even when going only one minor version at a time, I'm running into issues due to the lack of a framework id in the framework info.
>
> I've been able to replicate the issue reliably. I started with a single master and slave, with a fresh install of marathon 0.9.0 and mesos 0.22.1, wiping out /tmp/mesos on the slave and /mesos and /marathon in zookeeper. I started up a task. At this point, I can look at `/tmp/mesos/meta/slaves/latest/frameworks/<my current marathon framework id>/framework.info` and verify that there is no framework id present in the file. I then upgraded the master to mesos 0.23.1, restarted it, then the slave to 0.23.1 and restarted it, then marathon to 0.11.1 (which was built against mesos 0.23) and restarted it. The slave came up and recovered just fine. However, the framework.info file never gets updated with the framework id. If I then proceed to upgrade the master to 0.24, restart it, then the slave to 0.24 and restart it, the slave fails to come up with the following error:
>
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.409395 9527 main.cpp:187] Version: 0.24.1
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.409406 9527 main.cpp:190] Git tag: 0.24.1
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.409418 9527 main.cpp:194] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.513608 9527 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@716: Client environment:host.name=dev-sandbox-mesos-slave1
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-58-generic
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,514:9527(0x7f18d63e1700):ZOO_INFO@log_env@725: Client environment:os.version=#97-Ubuntu SMP Wed Jul 8 02:56:15 UTC 2015
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.514710 9527 main.cpp:272] Starting Mesos slave
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.516090 9542 slave.cpp:190] Slave started on 1)@10.100.25.112:5051
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.516180 9542 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --enforce_container_disk_quota="false" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --initialize_driver_logging="true" --ip="10.100.25.112" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: _level="INFO" --master="zk://dev-sandbox-mesos-zk1.nyc.dev.yodle.com:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos"
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.517006 9542 slave.cpp:354] Slave resources: cpus(*):2; mem(*):15025; disk(*):35818; ports(*):[31000-32000]
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.517315 9542 slave.cpp:384] Slave hostname: dev-sandbox-mesos-slave1.nyc.dev.yodle.com
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.517334 9542 slave.cpp:389] Slave checkpoint: true
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@log_env@753: Client environment:user.dir=/
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,517:9527(0x7f18d63e1700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=dev-sandbox-mesos-zk1.nyc.dev.yodle.com:2181 sessionTimeout=10000 watcher=0x7f18dfac6610 sessionId=0 sessionPasswd=<null> context=0x7f18b8002180 flags=0
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.520829 9544 state.cpp:54] Recovering state from '/tmp/mesos/meta'
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,521:9527(0x7f18d2d8d700):ZOO_INFO@check_events@1703: initiated connection to server [10.100.25.111:2181]
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.524245 9542 slave.cpp:4157] Recovering framework 20160121-172941-1847157770-5050-4782-0000
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: F0121 17:54:46.524288 9542 slave.cpp:4175] Check failed: frameworkInfo.has_id()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: *** Check failure stack trace: ***
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18dfe3091d google::LogMessage::Fail()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18dfe3275d google::LogMessage::SendToLog()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: 2016-01-21 17:54:46,528:9527(0x7f18d2d8d700):ZOO_INFO@check_events@1750: session establishment complete on server [10.100.25.111:2181], sessionId=0x14ec1fa6d1a263d, negotiated timeout=10000
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.528326 9549 group.cpp:331] Group process (group(1)@10.100.25.112:5051) connected to ZooKeeper
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.528370 9549 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.528455 9549 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18dfe3050c google::LogMessage::Flush()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18dfe33059 google::LogMessageFatal::~LogMessageFatal()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.532296 9549 detector.cpp:156] Detected a new leader: (id='2')
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.532524 9543 group.cpp:674] Trying to get '/mesos/info_0000000002' in ZooKeeper
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18df900ba8 mesos::internal::slave::Slave::recoverFramework()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: W0121 17:54:46.533833 9543 detector.cpp:444] Leading master [email protected]:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: I0121 17:54:46.534034 9543 detector.cpp:481] A new leading master (UPID=[email protected]:5050) is detected
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18df907193 mesos::internal::slave::Slave::recover()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18df938383 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS8_5state5StateEESD_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSK_FSI_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18dfde1681 process::ProcessManager::resume()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18dfde197f process::internal::schedule()
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18dec6da40 (unknown)
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18de48a182 start_thread
> Jan 21 17:54:46 dev-sandbox-mesos-slave1 mesos-slave[9527]: @ 0x7f18de1b747d (unknown)
>
> With 0.23.1 running, I've tried restarting the mesos-slave multiple times, I've tried deploying new tasks, and I've tried waiting, but the framework.info file never seems to get updated, so I have no clue how I'm supposed to actually get past 0.23.1 as part of the upgrade.
>
> Additionally, I saw https://issues.apache.org/jira/browse/MESOS-3834 which says it was fixed in 0.26.0 and resolved in November, so I tried going all the way to mesos 0.26.0. (Yes, I'm aware that it's not recommended to skip versions, but I wanted to see if I could get around the framework id issue.) Not only did it fail the same way, but I noticed that the code added to fix MESOS-3834 appears in the master branch in github, but not the 0.26.0 branch.
>
> One last thing I don't understand is that our current dev/qa/master cluster slaves appear to be writing the framework id to the framework.info file, despite running mesos 0.22.1 and marathon 0.9.0 and being set up via puppet just like the sandbox I've been testing in. So it's possible that there's some issue preventing the slave in the sandbox from writing the framework id to the file, but I can't find any difference in the setups that would cause that either.
>
> Any help you can provide would be greatly appreciated.
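One more thing you could check on your side: framework.info under the slave's meta directory is just a binary-serialized FrameworkInfo protobuf, so you can dump it with protoc from a Mesos source checkout to see exactly which fields were checkpointed. A rough sketch (the -I and .proto paths are illustrative and depend on your checkout layout):

    $ protoc --decode=mesos.FrameworkInfo \
        -I include/mesos include/mesos/mesos.proto \
        < /tmp/mesos/meta/slaves/latest/frameworks/<framework id>/framework.info

If the dump shows no id field, the checkpoint really was written without one, which matches the CHECK failure in Slave::recoverFramework above.

--
Best Regards,
Haosdent Huang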

