Are you sure this is a 0.21.1 cluster? The line numbers in the logs match
the code in Mesos 0.23.0.

This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):

  Try<bool> available = hdfs.available();

  if (available.isError() || !available.get()) {
    return Error("Skipping fetch with Hadoop Client as"
                 " Hadoop Client not available: " + available.error());
  }

The root cause is (probably) that the HDFS client is not available on the
slave. In that case hdfs.available() does not return an error but a plain
'false', which is fine in itself.
The bug is in the return line: we call available.error() even when the Try
holds 'false' rather than an error, and that trips the assertion in
Try<T>::error() that you see in your log.
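
To make the distinction concrete, here is a minimal sketch of how the two
cases could be handled separately. It is not the actual patch: the Try class
below is a simplified, hypothetical stand-in for stout's Try<T>, and
checkHadoopClient() and failure() are names made up for this illustration.

```cpp
#include <cassert>
#include <string>

// Simplified, hypothetical stand-in for stout's Try<T>: it holds either a
// value or an error message. This is NOT the real stout class.
template <typename T>
class Try
{
public:
  static Try<T> some(const T& value) { return Try<T>(true, value, ""); }
  static Try<T> failure(const std::string& message)
  {
    return Try<T>(false, T(), message);
  }

  bool isError() const { return !ok_; }

  // Calling the wrong accessor is a programming error and aborts via
  // assert, which is the behaviour seen in the reported log.
  const T& get() const { assert(ok_); return value_; }
  const std::string& error() const { assert(!ok_); return message_; }

private:
  Try(bool ok, const T& value, const std::string& message)
    : ok_(ok), value_(value), message_(message) {}

  bool ok_;
  T value_;
  std::string message_;
};

// Hypothetical helper mirroring the check in the fetcher. Returns an empty
// string when the client is usable, otherwise the reason for skipping.
std::string checkHadoopClient(const Try<bool>& available)
{
  // Only touch error() when the Try really holds an error.
  if (available.isError()) {
    return "Skipping fetch with Hadoop Client: " + available.error();
  }

  // A plain 'false' carries no error message, so do not ask for one.
  if (!available.get()) {
    return "Skipping fetch with Hadoop Client as Hadoop Client not available";
  }

  return "";
}
```

The point of splitting the condition is that each branch only calls the
accessor that is valid for the state it has just checked.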

This was a 'latent' bug that *may* have been exposed by (my) recent
refactoring of os::shell, which hdfs.available() uses under the covers.
(This is a bit unclear, though, as that refactoring is post-0.23.)

Be that as it may, I've filed
https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and I
may be able to sneak it into 0.24 (which we're cutting now).

Thanks for reporting!

PS - bad code aside, the root cause is that the `hdfs` binary seems to be
unreachable on the slave: is it installed in the PATH of the user under
which the slave binary executes?



*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com

On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar <[email protected]> wrote:

> We have a 20-node Mesos cluster running Mesos v0.21.1. We have been
> running Marathon on top of this setup without any problems for ~4 months
> now. I'm now trying to get the hadoop mesos
> <https://github.com/mesos/hadoop/> integration working, but the
> TaskTrackers that get launched fail with the following error:
>
> I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info:
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd","user":"hadoop"}
> I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI
> 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
> I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the
> sandbox directory
> I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI
> 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
> *mesos-fetcher:
> /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90:
> const string& Try<T>::error() const [with T = bool; std::string =
> std::basic_string<char>]: Assertion `data.isNone()' failed.*
> *** Aborted at 1439876195 (unix time) try "date -d @1439876195" if you are
> using GNU date ***
> PC: @       0x343ee32635 (unknown)
> *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from PID
> 24428; stack trace: ***
>     @       0x343f20f710 (unknown)
>     @       0x343ee32635 (unknown)
>     @       0x343ee33e15 (unknown)
>     @       0x343ee2b75e (unknown)
>     @       0x343ee2b820 (unknown)
>     @           0x408b0a Try<>::error()
>     @           0x40cbcf download()
>     @           0x4098a3 main
>     @       0x343ee1ed5d (unknown)
>     @           0x40aeb5 (unknown)
> Failed to synchronize with slave (it's probably exited)
>
> Environment
> - EC2 Machines
> - Output of lsb_release -a
> LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
> Distributor ID: CentOS
> Description:  CentOS release 6.5 (Final)
> Release:  6.5
> Codename: Final
>
> Any ideas what I'm doing wrong?
>
> --
> -- Ashwanth Kumar
>
