Hi, I'm having a bit of trouble running Hadoop on Mesos.
I've followed the instructions on https://github.com/mesos/hadoop, using the same CDH5 package and Mesos 0.19. The only difference I can think of is that I am using an existing HDFS cluster rather than a local one.

When starting the JobTracker, everything looks fine:

    14/07/03 13:51:57 INFO mapred.MesosScheduler: Starting MesosScheduler
    Warning: MESOS_NATIVE_LIBRARY is deprecated, use MESOS_NATIVE_JAVA_LIBRARY instead. Future releases will not support JNI bindings via MESOS_NATIVE_LIBRARY.
    I0703 13:51:57.096916 62004 sched.cpp:126] Version: 0.19.0
    I0703 13:51:57.100489 62055 sched.cpp:222] New master detected at [email protected]:5050
    I0703 13:51:57.100621 62055 sched.cpp:230] No credentials provided. Attempting to register without authentication
    14/07/03 13:51:57 INFO mapred.JobTracker: Starting the recovery process for 0 jobs ...
    14/07/03 13:51:57 INFO mapred.JobTracker: Recovery done! Recoverd 0 of 0 jobs.
    14/07/03 13:51:57 INFO mapred.JobTracker: Recovery Duration (ms):0
    14/07/03 13:51:57 INFO mapred.JobTracker: Refreshing hosts information
    14/07/03 13:51:57 INFO util.HostsFileReader: Setting the includes file to
    14/07/03 13:51:57 INFO util.HostsFileReader: Setting the excludes file to
    14/07/03 13:51:57 INFO util.HostsFileReader: Refreshing hosts (include/exclude) list
    14/07/03 13:51:57 INFO mapred.JobTracker: Decommissioning 0 nodes
    14/07/03 13:51:57 INFO ipc.Server: IPC Server listener on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server Responder: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 0 on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 1 on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 2 on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 3 on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 4 on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 5 on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 6 on 9001: starting
    14/07/03 13:51:57 INFO mapred.JobTracker: Starting RUNNING
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 7 on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 8 on 9001: starting
    14/07/03 13:51:57 INFO ipc.Server: IPC Server handler 9 on 9001: starting

But after connecting to Mesos and registering as a framework, it appears to terminate instantly, then try again. See the screenshot of the terminated frameworks table at https://db.tt/COLpSUIQ. Nothing more is posted to the JobTracker log, and I can't see much in the Mesos log on the master. Here are the messages for one framework:

    I0703 13:57:26.040679 51675 master.cpp:1059] Registering framework 20140620-174222-1209730570-5050-51658-0666 at scheduler(1)@127.0.1.1:53662
    I0703 13:57:26.040802 51675 hierarchical_allocator_process.hpp:331] Added framework 20140620-174222-1209730570-5050-51658-0666
    I0703 13:57:26.040817 51663 master.cpp:662] Framework 20140620-174222-1209730570-5050-51658-0666 disconnected
    I0703 13:57:26.041102 51663 master.cpp:1319] Deactivating framework 20140620-174222-1209730570-5050-51658-0666
    I0703 13:57:26.041127 51663 master.cpp:684] Giving framework 20140620-174222-1209730570-5050-51658-0666 0ns to failover
    W0703 13:57:26.041158 51663 master.cpp:2862] Master returning resources offered to framework 20140620-174222-1209730570-5050-51658-0666 because the framework has terminated or is inactive
    I0703 13:57:26.041177 51664 hierarchical_allocator_process.hpp:407] Deactivated framework 20140620-174222-1209730570-5050-51658-0666
    I0703 13:57:26.041251 51666 master.cpp:2849] Framework failover timeout, removing framework 20140620-174222-1209730570-5050-51658-0666
    I0703 13:57:26.041373 51666 master.cpp:3344] Removing framework 20140620-174222-1209730570-5050-51658-0666
    I0703 13:57:26.041261 51664 hierarchical_allocator_process.hpp:636] Recovered cpus(*):16; mem(*):192391; disk(*):1.51388e+06; ports(*):[31000-32000] (total allocatable: cpus(*):16; mem(*):192391; disk(*):1.51388e+06; ports(*):[31000-32000]) on slave 20140618-174325-1209730570-5050-4637-1 from framework 20140620-174222-1209730570-5050-51658-0666
    I0703 13:57:26.041502 51664 hierarchical_allocator_process.hpp:636] Recovered cpus(*):16; mem(*):192391; disk(*):1.51388e+06; ports(*):[31000-32000] (total allocatable: cpus(*):16; mem(*):192391; disk(*):1.51388e+06; ports(*):[31000-32000]) on slave 20140618-174325-1209730570-5050-4637-0 from framework 20140620-174222-1209730570-5050-51658-0666
    I0703 13:57:26.041566 51664 hierarchical_allocator_process.hpp:636] Recovered cpus(*):15; mem(*):191879; disk(*):1.51388e+06; ports(*):[31000-31643, 31645-32000] (total allocatable: cpus(*):15; mem(*):191879; disk(*):1.51388e+06; ports(*):[31000-31643, 31645-32000]) on slave 20140618-172514-1209730570-5050-1282-0 from framework 20140620-174222-1209730570-5050-51658-0666
    I0703 13:57:26.041592 51664 hierarchical_allocator_process.hpp:362] Removed framework 20140620-174222-1209730570-5050-51658-0666

Setting HADOOP_ROOT_LOGGER=DEBUG,console before running the JobTracker didn't give me any more Mesos-related messages.

Has anyone else come across this problem? Any ideas what might be causing it? Is there any way I can increase the logging for Mesos or for the Mesos Hadoop library?

This is Mesos 0.19 on Ubuntu 14.04. The Mesos cluster is running other frameworks (Marathon, Chronos) without problems. I'm new to both Mesos and Hadoop.

Thanks,
Andrew
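P.S. For reference, this is roughly how I start the JobTracker. The GLOG_* variables are an assumption on my part (Mesos logs through glog, which I believe reads them from the environment); the libmesos path is just what it is on my boxes:

```shell
# Environment for starting the JobTracker with more verbose logging.
# GLOG_* variables are an assumption: Mesos uses glog, which (I believe)
# honours GLOG_-prefixed environment variables.
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so  # path on my machines
export HADOOP_ROOT_LOGGER=DEBUG,console   # verbose Hadoop-side logging
export GLOG_v=2                           # glog verbosity for the native library
export GLOG_logtostderr=1                 # send glog output to the console
# hadoop jobtracker                       # then start the JobTracker as usual
```

Even with these set, I haven't seen anything beyond the sched.cpp lines above, so pointers to other logging knobs would be appreciated.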

