IGFS YARN setup

Haithem Turki Thu, 26 May 2016 14:57:12 -0700

Hello,

I'm interested in using IGFS as a Hadoop caching layer. The use case
revolves largely around Spark jobs running on a YARN cluster that persist
data to S3 (although I have some non-Spark stuff running too, so I would
ideally integrate at the Hadoop filesystem layer). I'm excited about the
potential speedups that this could bring :)


I took a stab at deploying this for the first time, and had some questions:

- Ideally, I was envisioning deploying nodes via YARN to take advantage of
dynamic scaling and use any available memory on the cluster. I wanted to
make sure that this is indeed a supported workflow / on the roadmap, as I
hit a few bumps along the way:
* I ended up needing to dump pretty much all of my Hadoop-related jars to
HDFS for my nodes to start up correctly (otherwise I was getting
ClassNotFoundExceptions for classes ranging from guava to hadoop to asm to
ignite). Am I doing something horribly wrong / have you guys considered
packaging a fat jar for the non-hadoop dependencies at least?
* I couldn't specify the YARN queue despite attempting to
set -Dmapreduce.job.queuename via the IGNITE_JVM_OPTS variable (
https://issues.apache.org/jira/browse/IGNITE-2738?)
* It seems like dynamic allocation isn't supported? I wanted to get a sense
of whether this is on the roadmap.
* Since YARN allocates containers at random, it's pretty onerous to figure
out which hostnames have Ignite nodes running on them and specify those
in the URL. For now I have TCP enabled (Ignite doesn't seem to die on port
conflicts if multiple nodes are running on the same machine), and I guess I
can set up a reverse proxy so that I can point to a stable URL, but
it's not great / doesn't scale well, so I was wondering if there are other
suggestions on how to configure discovery (maybe spin up a local node
outside of YARN that leverages the cluster discovery?)
* I also wasn't clear on how cluster routing/balancing works. If I
configure my Hadoop jobs to point at host1:10500 via TCP, will all
reads/writes route through that node, or do they somehow get balanced?

Or is this completely crazy / should I just deploy IGFS outside of YARN?

- Is there a way of configuring the local filesystem as a tiered storage
layer (or is it on the roadmap)? The use case is that even reading from an
SSD is much faster than S3.
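
For reference, the client-side setup behind the host1:10500 question above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (igfs_uri, core_site_fragment) are hypothetical, the IGFS instance name "igfs" is assumed and depends on how the file system is named in the Ignite configuration, and the two class names are the Hadoop FileSystem implementations from Ignite's Hadoop module. It builds the igfs:// URI a job would point at and the core-site.xml properties that register the scheme.

```python
import xml.etree.ElementTree as ET

# Hadoop FileSystem implementations from Ignite's Hadoop module.
IGFS_V1_IMPL = "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem"
IGFS_V2_IMPL = "org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem"


def igfs_uri(host, port=10500, igfs_name="igfs", path="/"):
    """Build an igfs:// URI targeting a single IGFS TCP endpoint.

    igfs_name is an assumption: it must match the name given to the
    IGFS instance in the Ignite node configuration.
    """
    if not path.startswith("/"):
        path = "/" + path
    return f"igfs://{igfs_name}@{host}:{port}{path}"


def core_site_fragment():
    """Render core-site.xml properties that register the igfs:// scheme."""
    root = ET.Element("configuration")
    for name, value in (
        ("fs.igfs.impl", IGFS_V1_IMPL),
        ("fs.AbstractFileSystem.igfs.impl", IGFS_V2_IMPL),
    ):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The single endpoint a job would be pointed at.
    print(igfs_uri("host1", 10500, path="/data/input"))
    print(core_site_fragment())
```

With this kind of setup every job names exactly one host and port in its URI, which is why the discovery and routing questions above matter.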

Thanks in advance!
- Haithem
