> On Jun 26, 2017, at 5:30 PM, James Peach <jor...@gmail.com> wrote:
> 
> 
>> On Jun 26, 2017, at 4:05 PM, Steven Schlansker <sschlans...@opentable.com> wrote:
>> 
>> 
>>> On Jun 25, 2017, at 11:24 PM, Benjamin Mahler <bmah...@apache.org> wrote:
>>> 
>>> As a data point, as far as I'm aware, most users are using a local work 
>>> directory, not an NFS mounted one. Would love to hear from anyone on the 
>>> list if they are doing this, and if there are any subtleties that should be 
>>> documented.
>> 
>> We don't run NFS in particular but we did originally use a SAN -- two 
>> observations:
>> 
>> NFS (historically, maybe it's better now, but doubtful...) has really bad 
>> failure modes.
>> Network failures can cause serious hangs both in user-space and 
>> kernel-space.  Such
>> hangs can be impossible to clear without rebooting the machine, and in some 
>> edge cases
>> can even make it difficult or impossible to reboot the machine via normal 
>> means.
> 
> You need to make sure to mount with the "intr" option.
> 
> https://speakerdeck.com/gnb/130-lca2008-nfs-tuning-secrets-d7

That's not without some caveats.  nfs(5):

The intr / nointr mount option is deprecated after kernel 2.6.25. Only SIGKILL 
can interrupt a pending NFS operation on these kernels, and if specified, this 
mount option is ignored to provide backwards compatibility with older kernels.
Using the intr option is preferred to using the soft option because it is 
significantly less likely to result in data corruption.

...

NB: A so-called "soft" timeout can cause silent data corruption in certain 
cases. As such, use the soft option only when client responsiveness is more 
important than data integrity. Using NFS over TCP or increasing the value of 
the retrans option may mitigate some of the risks of using the soft option.


So, 'intr' is deprecated / removed on any reasonable kernel, and 'soft' has 
silent data corruption issues.
Typical Linux: a broken implementation whose own documentation points you at a 
deprecated / removed one instead :)

I'm sure there's a way to get NFS working great.  Just pointing out that you'll 
need an expert to take ownership of it!
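
If someone does take that on, the usual starting point is a hard mount over 
TCP rather than a soft one.  Sketch only: the export path, mount point, and 
timeo/retrans values below are assumptions you'd tune for your own environment.

    # /etc/fstab -- illustrative entry for an agent work directory on NFS.
    # "hard" blocks on server outages instead of risking the silent
    # corruption nfs(5) warns about for "soft"; proto=tcp plus sane
    # timeo/retrans keeps retry behavior predictable on a fast LAN.
    fileserver:/export/mesos  /var/lib/mesos  nfs  hard,proto=tcp,timeo=600,retrans=2,noatime  0  0

With 'hard' the failure mode is the hang described above rather than silent 
corruption, which is exactly the trade-off that needs an owner to sign off on it.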

> 
>> 
>> Network attached drives (our SAN) are less reliable, slower, and more complex
>> (read: more failure modes) than local disk.  It's also a really big single 
>> point
>> of failure.  So far our only true cluster outages have been due to failure of
>> the SAN, since it took down all nodes at once -- once we removed the SAN, 
>> future
>> failures had islands of availability and any properly written application
>> could continue running (obviously without network resources) through the 
>> incident.
>> 
>> Maybe this isn't a huge deal for your use case, which might differ from ours.
>> For us, it was enough of a problem that we now purchase local SSD scratch 
>> space
>> for every node just so that we have some storage we can depend on a bit more
>> than network attached storage.
>> 
>>> 
>>> On Thu, Jun 22, 2017 at 11:13 PM, <thomas.kurm...@artorg.unibe.ch> wrote:
>>> Hi,
>>> 
>>> We have a couple of server nodes mainly used for computational tasks in
>>> our Mesos cluster. These servers have beefy CPUs, GPUs, etc. but only
>>> limited SSD space. We also have a 40GbE network and a decently fast
>>> file server.
>>> 
>>> My question is simple, but I didn't find an answer anywhere: What are the
>>> best practices for the working directory on mesos-agent nodes? Should
>>> we keep the working directory local, or is it reasonable to use an
>>> NFS-mounted folder? We implemented both and they seem to work fine, but I
>>> would prefer to follow "best practices".
>>> 
>>> Thanks and cheers
>>> 
>>> Tom
>>> 
>> 
> 
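
Re: the original question -- whichever storage you settle on, the directory in 
question is whatever you pass to the agent via --work_dir.  A minimal sketch, 
assuming a local SSD mounted at /data/mesos and a ZooKeeper master address 
(both are placeholders for your own setup):

    # hypothetical agent invocation keeping the work directory on local disk
    mesos-agent \
      --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
      --work_dir=/data/mesos \
      --log_dir=/var/log/mesos

Pointing --work_dir at the NFS mount instead is the only agent-side change 
between the two layouts you tried, so it's easy to benchmark both.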
