For info, the patch was committed today and made the cut to 0.24-rc1.
Thanks to @vinodkone for super-quick turnaround.

On Tue, Aug 18, 2015 at 10:45 AM, Marco Massenzio <[email protected]> wrote:

> Hi Ashwanth,
>
> I've pushed a fix out for review <https://reviews.apache.org/r/37584/>,
> we'll see if it makes it in time for 0.24.
>
> As for the version, you can quickly verify that by running
> `mesos-master --version` (or just look at the very beginning of the logs,
> it will tell you a bunch of stuff about version, build, etc.)
>
> I am sorry, I don't really know enough about setting up Hadoop on Mesos to
> give you any useful guidance; from a quick glance at the code, it seems to
> me that, if the URI is a `hdfs://` one, the only way to retrieve the
> tarball is via HDFS (so you will need the hdfs client to be available on
> the Slave(s)).
> If you do use an HTTP URI (http://....) then it should work just fine.
>
> Hopefully others will be able to chime in with a more informed view.
>
> *Marco Massenzio*
> *Distributed Systems Engineer*
> http://codetrips.com
>
> On Tue, Aug 18, 2015 at 2:46 AM, Ashwanth Kumar <[email protected]> wrote:
>
>> Thanks Marco for the update.
>>
>> My understanding of the hadoop mesos framework was that the executor would
>> download the hadoop distro from mapred.mesos.executor.uri and execute the
>> TTs. I didn't know that to download from HDFS it needs `hdfs` binary in
>> PATH. I don't have a hadoop setup on the mesos slave. Should I go ahead and
>> add them?
>>
>> Regarding the line number mismatch, I installed the package through
>> mesosphere not sure if that's the reason.
>>
>>
>> On Tue, Aug 18, 2015 at 1:22 PM, Marco Massenzio <[email protected]>
>> wrote:
>>
>>> Are you sure this is a 0.21.1 cluster?
>>> The line numbers in the logs match the code in Mesos 0.23.0.
>>>
>>> This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):
>>>
>>>   Try<bool> available = hdfs.available();
>>>
>>>   if (available.isError() || !available.get()) {
>>>     return Error("Skipping fetch with Hadoop Client as"
>>>                  " Hadoop Client not available: " + available.error());
>>>   }
>>>
>>> The root cause is that (probably) the HDFS client is not available on the
>>> slave; however, we do not 'error()' but rather return a 'false' - this is
>>> all good.
>>> The bug is exposed in the return line, where we try to retrieve
>>> available.error() (which we should not - it's just `false`).
>>>
>>> This was a 'latent' bug that *may* have been exposed by (my) recent
>>> refactoring of os::shell which is used by hdfs.available() under the covers.
>>> (this is a bit unclear, though, as that refactoring is post-0.23)
>>>
>>> Be that as it may, I've filed
>>> https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and
>>> I may be able to sneak it into 0.24 (which we're cutting now).
>>>
>>> Thanks for reporting!
>>>
>>> PS - bad code aside, the root cause is that the `hdfs` binary seems to be
>>> unreachable on the slave: is it installed in the PATH of the user under
>>> which the slave binary executes?
>>>
>>>
>>>
>>> *Marco Massenzio*
>>>
>>> *Distributed Systems Engineer*
>>> http://codetrips.com
>>>
>>> On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar <[email protected]>
>>> wrote:
>>>
>>>> We've a 20 node mesos cluster running mesos v0.21.1, We run marathon on
>>>> top of this setup without any problems for ~4 months now.
>>>> I'm now trying to get hadoop mesos <https://github.com/mesos/hadoop/>
>>>> integration working but I see the TaskTrackers that gets launched are
>>>> failing with the following error
>>>>
>>>> I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info:
>>>> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd","user":"hadoop"}
>>>> I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI
>>>> 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
>>>> I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the
>>>> sandbox directory
>>>> I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI
>>>> 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
>>>> *mesos-fetcher:
>>>> /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90:
>>>> const string& Try<T>::error() const [with T = bool; std::string =
>>>> std::basic_string<char>]: Assertion `data.isNone()' failed.*
>>>> *** Aborted at 1439876195 (unix time) try "date -d @1439876195" if you
>>>> are using GNU date ***
>>>> PC: @ 0x343ee32635 (unknown)
>>>> *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from
>>>> PID 24428; stack trace: ***
>>>>     @ 0x343f20f710 (unknown)
>>>>     @ 0x343ee32635 (unknown)
>>>>     @ 0x343ee33e15 (unknown)
>>>>     @ 0x343ee2b75e (unknown)
>>>>     @ 0x343ee2b820 (unknown)
>>>>     @ 0x408b0a Try<>::error()
>>>>     @ 0x40cbcf download()
>>>>     @ 0x4098a3 main
>>>>     @ 0x343ee1ed5d (unknown)
>>>>     @ 0x40aeb5 (unknown)
>>>> Failed to synchronize with slave (it's probably exited)
>>>>
>>>> Environment
>>>> - EC2 Machines
>>>> - Output of lsb_release -a
>>>> LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>>>> Distributor ID: CentOS
>>>> Description: CentOS release 6.5 (Final)
>>>> Release: 6.5
>>>> Codename: Final
>>>>
>>>> Any ideas what I'm doing wrong?
>>>>
>>>> --
>>>> -- Ashwanth Kumar
>>>>
>>>
>>>
>>
>>
>> --
>> -- Ashwanth Kumar
>>
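
For anyone reading this thread later, here is a minimal sketch of the pattern behind the crash quoted above: calling `error()` on a `Try<bool>` that holds `false` rather than an error. The `Try` class below is a hypothetical, simplified stand-in written for this illustration, not stout's real class, and `hadoopClientProblem` is an invented helper name. It is not the patch committed for MESOS-3287; it only shows one way to keep the "errored" and "returned false" cases apart.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for stout's Try<T>. It is NOT the real
// class; it only mimics the behaviour relevant here: error() may be called
// only on a Try that actually holds an error.
template <typename T>
class Try {
public:
  static Try ok(const T& value) { return Try(true, value, ""); }
  static Try failure(const std::string& message) {
    return Try(false, T(), message);
  }

  bool isError() const { return !isOk_; }

  const T& get() const {
    assert(isOk_);  // A value exists only when there is no error.
    return value_;
  }

  const std::string& error() const {
    assert(!isOk_);  // Calling this on a non-error Try aborts.
    return message_;
  }

private:
  Try(bool isOk, const T& value, const std::string& message)
    : isOk_(isOk), value_(value), message_(message) {}

  bool isOk_;
  T value_;
  std::string message_;
};

// Hypothetical helper: returns an empty string when the client is usable,
// otherwise a human-readable reason. error() is read only in the branch
// where isError() is true, so a plain `false` never reaches it.
std::string hadoopClientProblem(const Try<bool>& available) {
  const std::string prefix = "Hadoop Client not available";
  if (available.isError()) {
    return prefix + ": " + available.error();
  }
  if (!available.get()) {
    return prefix;
  }
  return "";
}
```

With this stand-in, `hadoopClientProblem(Try<bool>::ok(false))` yields the bare message without touching `error()`, while `hadoopClientProblem(Try<bool>::failure("boom"))` appends the error text.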

