Thanks for the update, Marco. My understanding of the hadoop mesos framework was that the executor would download the Hadoop distro from mapred.mesos.executor.uri and launch the TaskTrackers (TTs). I didn't know that downloading from HDFS requires the `hdfs` binary in PATH. I don't have a Hadoop setup on the Mesos slaves. Should I go ahead and add it?
Regarding the line number mismatch: I installed the package through Mesosphere, so I'm not sure if that's the reason.

On Tue, Aug 18, 2015 at 1:22 PM, Marco Massenzio <[email protected]> wrote:

> Are you sure this is a 0.21.1 cluster? The line numbers in the logs match
> the code in Mesos 0.23.0.
>
> This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):
>
>   Try<bool> available = hdfs.available();
>
>   if (available.isError() || !available.get()) {
>     return Error("Skipping fetch with Hadoop Client as"
>                  " Hadoop Client not available: " + available.error());
>   }
>
> The root cause is that (probably) the HDFS client is not available on the
> slave; however, we do not 'error()' but rather return a 'false' - this is
> all good.
> The bug is exposed in the return line, where we try to retrieve
> available.error() (which we should not - it's just `false`).
>
> This was a 'latent' bug that *may* have been exposed by (my) recent
> refactoring of os::shell, which is used by hdfs.available() under the covers
> (this is a bit unclear, though, as that refactoring is post-0.23).
>
> Be that as it may, I've filed
> https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and
> I may be able to sneak it into 0.24 (which we're cutting now).
>
> Thanks for reporting!
>
> PS - bad code aside, the root cause is that the `hdfs` binary seems to be
> unreachable on the slave: is it installed in the PATH of the user under
> which the slave binary executes?
>
> *Marco Massenzio*
> *Distributed Systems Engineer*
> http://codetrips.com
>
> On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar <[email protected]> wrote:
>
>> We have a 20-node mesos cluster running mesos v0.21.1. We have been
>> running marathon on top of this setup without any problems for ~4 months now.
>> I'm now trying to get hadoop mesos <https://github.com/mesos/hadoop/>
>> integration working, but the TaskTrackers that get launched are failing
>> with the following error:
>>
>> I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info:
>> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd","user":"hadoop"}
>> I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
>> I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the sandbox directory
>> I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
>> *mesos-fetcher: /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90: const string& Try<T>::error() const [with T = bool; std::string = std::basic_string<char>]: Assertion `data.isNone()' failed.*
>> *** Aborted at 1439876195 (unix time) try "date -d @1439876195" if you are using GNU date ***
>> PC: @ 0x343ee32635 (unknown)
>> *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from PID 24428; stack trace: ***
>> @ 0x343f20f710 (unknown)
>> @ 0x343ee32635 (unknown)
>> @ 0x343ee33e15 (unknown)
>> @ 0x343ee2b75e (unknown)
>> @ 0x343ee2b820 (unknown)
>> @ 0x408b0a Try<>::error()
>> @ 0x40cbcf download()
>> @ 0x4098a3 main
>> @ 0x343ee1ed5d (unknown)
>> @ 0x40aeb5 (unknown)
>> Failed to synchronize with slave (it's probably exited)
>>
>> Environment
>> - EC2 Machines
>> - Output of lsb_release -a
>> LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>> Distributor ID: CentOS
>> Description: CentOS release 6.5 (Final)
>> Release: 6.5
>> Codename: Final
>>
>> Any ideas what I'm doing wrong?
>>
>> --
>> -- Ashwanth Kumar
>>
>
>

--
-- Ashwanth Kumar
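
For anyone following the thread, the failure mode discussed above can be reproduced in isolation. The block below is only a minimal sketch, not Mesos code: `Try` here is a simplified, hypothetical stand-in for stout's `Try<T>`, and `checkHadoopClient` is an illustrative helper name that does not exist in the fetcher. It assumes, as the assertion in the log suggests, that `error()` may only be called on a `Try` that is actually in the error state, and it shows one possible way to treat "client check succeeded but returned false" separately from "client check failed". It is not necessarily the change made for MESOS-3287.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for stout's Try<T> (assumption: the real
// class asserts in error() when no error is stored, as the log suggests).
template <typename T>
class Try {
public:
  static Try some(const T& value) { return Try(false, value, ""); }
  static Try failure(const std::string& message) { return Try(true, T(), message); }

  bool isError() const { return error_; }

  // Only valid when a value is stored.
  const T& get() const { assert(!error_); return value_; }

  // Only valid in the error state; calling it on a plain `false` value is
  // the situation that aborted mesos-fetcher in the log.
  const std::string& error() const { assert(error_); return message_; }

private:
  Try(bool error, const T& value, const std::string& message)
    : error_(error), value_(value), message_(message) {}

  bool error_;
  T value_;
  std::string message_;
};

// Illustrative helper (hypothetical name): returns the message to report, or
// an empty string when the Hadoop client can be used. The two failure cases
// are handled separately so error() is never called on a non-error Try.
std::string checkHadoopClient(const Try<bool>& available) {
  if (available.isError()) {
    return "Skipping fetch with Hadoop Client: " + available.error();
  }
  if (!available.get()) {
    return "Skipping fetch with Hadoop Client as Hadoop Client not available";
  }
  return "";
}
```

With this split, `checkHadoopClient(Try<bool>::some(false))` returns the "not available" message instead of tripping the assertion, which matches the case where the `hdfs` binary is simply missing from PATH on the slave.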

