Thanks for the update, Marco.

My understanding of the Hadoop on Mesos framework was that the executor would
download the Hadoop distribution from mapred.mesos.executor.uri and launch the
TaskTrackers. I didn't know that fetching from HDFS requires the `hdfs` binary
in the PATH. I don't have Hadoop set up on the Mesos slaves. Should I go ahead
and install it there?
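
One way to check whether the `hdfs` binary is visible is to walk the PATH
roughly the way a shell would. The sketch below is only an illustration:
`isExecutableFile`, `findInPath` and `findInEnvPath` are hypothetical helpers,
not part of Mesos, and a POSIX system is assumed. The result is only meaningful
when it runs with the environment of the user the slave runs as.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <sys/stat.h>  // stat(), S_ISREG (POSIX)
#include <unistd.h>    // access(), X_OK (POSIX)

// True if `path` names a regular file that the current user may execute.
bool isExecutableFile(const std::string& path)
{
  struct stat info;
  return ::stat(path.c_str(), &info) == 0 &&
         S_ISREG(info.st_mode) &&
         ::access(path.c_str(), X_OK) == 0;
}

// Returns the full path of the first executable named `binary` found in
// `pathVar` (colon-separated, like $PATH), or "" if there is none.
std::string findInPath(const std::string& binary, const std::string& pathVar)
{
  std::istringstream dirs(pathVar);
  std::string dir;
  while (std::getline(dirs, dir, ':')) {
    if (dir.empty()) {
      dir = ".";  // An empty PATH entry means the current directory.
    }
    const std::string candidate = dir + "/" + binary;
    if (isExecutableFile(candidate)) {
      return candidate;
    }
  }
  return "";
}

// Same lookup, using $PATH from the current process environment.
std::string findInEnvPath(const std::string& binary)
{
  const char* path = std::getenv("PATH");
  return findInPath(binary, path == nullptr ? "" : path);
}
```

Calling `findInEnvPath("hdfs")` returns an empty string when no executable
named `hdfs` is found on the PATH of the calling process.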

Regarding the line number mismatch: I installed the package from Mesosphere;
I'm not sure whether that explains it.
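
The assertion in the quoted log can be reproduced in miniature. The sketch
below uses a minimal stand-in for stout's `Try<T>` (not the real class), and
`failure` and `checkHadoopClient` are hypothetical names. It only illustrates
the distinction between an error and a plain `false` described in the quoted
reply; it is not the actual MESOS-3287 patch.

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for stout's Try<T>; NOT the real class. It holds either
// a value or an error message. error() asserts that an error is present,
// which is the kind of assertion that fired in the log.
template <typename T>
class Try
{
public:
  Try(const T& value) : isError_(false), value_(value), message_() {}

  // Hypothetical factory for the error state.
  static Try<T> failure(const std::string& message)
  {
    return Try<T>(ErrorTag(), message);
  }

  bool isError() const { return isError_; }
  const T& get() const { assert(!isError_); return value_; }
  const std::string& error() const { assert(isError_); return message_; }

private:
  struct ErrorTag {};
  Try(ErrorTag, const std::string& message)
    : isError_(true), value_(), message_(message) {}

  bool isError_;
  T value_;
  std::string message_;
};

// Hypothetical helper: returns the message to report when the Hadoop client
// cannot be used, or "" when it is available. error() is only called when
// the Try really holds an error; a plain `false` gets its own branch.
std::string checkHadoopClient(const Try<bool>& available)
{
  const std::string prefix =
    "Skipping fetch with Hadoop Client as Hadoop Client not available";

  if (available.isError()) {
    return prefix + ": " + available.error();
  }
  if (!available.get()) {
    return prefix;
  }
  return "";
}
```

With this split, `checkHadoopClient(Try<bool>(false))` yields the message
without the error suffix instead of tripping the assertion.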


On Tue, Aug 18, 2015 at 1:22 PM, Marco Massenzio <[email protected]>
wrote:

> Are you sure this is a 0.21.1 cluster? The line numbers in the logs match
> the code in Mesos 0.23.0.
>
> This is, however, a genuine bug (src/launcher/fetcher.cpp#L99):
>
>   Try<bool> available = hdfs.available();
>
>   if (available.isError() || !available.get()) {
>     return Error("Skipping fetch with Hadoop Client as"
>                  " Hadoop Client not available: " + available.error());
>   }
>
> The root cause is that (probably) the HDFS client is not available on the
> slave; however, we do not 'error()' but rather return a 'false' - this is
> all good.
> The bug is exposed in the return line, where we try to retrieve
> available.error() (which we should not - it's just `false`).
>
> This was a 'latent' bug that *may* have been exposed by (my) recent
> refactoring of os::shell which is used by hdfs.available() under the covers.
> (this is a bit unclear, though, as that refactoring is post-0.23)
>
> Be that as it may, I've filed
> https://issues.apache.org/jira/browse/MESOS-3287: the fix is trivial and
> I may be able to sneak it into 0.24 (which we're cutting now).
>
> Thanks for reporting!
>
> PS - bad code aside, the root cause is that the `hdfs` binary seems to be
> unreachable on the slave: is it installed in the PATH of the user under
> which the slave binary executes?
>
>
>
> *Marco Massenzio*
>
> *Distributed Systems Engineer*
> http://codetrips.com
>
> On Mon, Aug 17, 2015 at 10:46 PM, Ashwanth Kumar <[email protected]>
> wrote:
>
>> We have a 20-node Mesos cluster running Mesos v0.21.1. We have run Marathon
>> on top of this setup without any problems for ~4 months now. I'm now trying
>> to get the hadoop mesos <https://github.com/mesos/hadoop/> integration
>> working, but the TaskTrackers that get launched are failing with the
>> following error:
>>
>> I0818 05:36:35.058688 24428 fetcher.cpp:409] Fetcher Info:
>> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/20150706-075218-1611773194-5050-28439-S473\/hadoop","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"hdfs:\/\/hdfs.prod:54310\/user\/ashwanth\/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/20150706-075218-1611773194-5050-28439-S473\/frameworks\/20150706-075218-1611773194-5050-28439-4532\/executors\/executor_Task_Tracker_4129\/runs\/c26f52d4-4055-46fa-b999-11d73f2096dd","user":"hadoop"}
>> I0818 05:36:35.059806 24428 fetcher.cpp:364] Fetching URI
>> 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
>> I0818 05:36:35.059821 24428 fetcher.cpp:238] Fetching directly into the
>> sandbox directory
>> I0818 05:36:35.059835 24428 fetcher.cpp:176] Fetching URI
>> 'hdfs://hdfs.prod:54310/user/ashwanth/hadoop-with-mesos-2.6.0-cdh5.4.4.tar.gz'
>> *mesos-fetcher:
>> /tmp/mesos-build/mesos-repo/3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:90:
>> const string& Try<T>::error() const [with T = bool; std::string =
>> std::basic_string<char>]: Assertion `data.isNone()' failed.*
>> *** Aborted at 1439876195 (unix time) try "date -d @1439876195" if you
>> are using GNU date ***
>> PC: @       0x343ee32635 (unknown)
>> *** SIGABRT (@0x5f6c) received by PID 24428 (TID 0x7f988832f820) from PID
>> 24428; stack trace: ***
>>     @       0x343f20f710 (unknown)
>>     @       0x343ee32635 (unknown)
>>     @       0x343ee33e15 (unknown)
>>     @       0x343ee2b75e (unknown)
>>     @       0x343ee2b820 (unknown)
>>     @           0x408b0a Try<>::error()
>>     @           0x40cbcf download()
>>     @           0x4098a3 main
>>     @       0x343ee1ed5d (unknown)
>>     @           0x40aeb5 (unknown)
>> Failed to synchronize with slave (it's probably exited)
>>
>> Environment
>> - EC2 Machines
>> - Output of lsb_release -a
>> LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
>> Distributor ID: CentOS
>> Description:  CentOS release 6.5 (Final)
>> Release:  6.5
>> Codename: Final
>>
>> Any ideas what I'm doing wrong?
>>
>> --
>> -- Ashwanth Kumar
>>
>
>


-- 
-- Ashwanth Kumar
