Hello, I'm interested in using IGFS as a Hadoop caching layer. The use case revolves largely around Spark jobs running on a YARN cluster that persist data to S3 (although I have some non-Spark jobs running too, so I would ideally integrate at the Hadoop filesystem layer). I'm excited about the potential speedups that this could bring :)
I took a stab at deploying this for the first time and had some questions:

- Ideally, I was envisioning deploying nodes via YARN to take advantage of dynamic scaling and to use any available memory on the cluster. I wanted to make sure that this is indeed a supported workflow / on the roadmap, as I hit a few bumps along the way:
  * I ended up needing to dump pretty much all of my Hadoop-related jars to HDFS for my nodes to start up correctly (otherwise I was getting ClassNotFoundExceptions for classes ranging from guava to hadoop to asm to ignite). Am I doing something horribly wrong / have you guys considered packaging a fat jar for the non-Hadoop dependencies at least?
  * I couldn't specify the YARN queue despite attempting to set -Dmapreduce.job.queuename via the IGNITE_JVM_OPTS variable (https://issues.apache.org/jira/browse/IGNITE-2738?).
  * It seems like dynamic allocation isn't supported? I wanted to get a sense of whether this is on the roadmap.
  * Since YARN allocates containers at random, it's pretty onerous to figure out which hostnames have Ignite nodes running on them and to specify those in the URL. For now I have TCP enabled (Ignite doesn't seem to die on port conflicts if multiple nodes are running on the same machine), and I guess I can set up a reverse proxy so that I can point at a stable URL, but it's not great / doesn't scale well. I was wondering if there are other suggestions on how to configure discovery (maybe spin up a local node outside of YARN that leverages the cluster discovery?).
  * I also wasn't clear on how cluster routing/balancing works. If I point my Hadoop jobs at host1:10500 via TCP, will all reads/writes route through that node, or do the reads/writes somehow get balanced?
  Or is this completely crazy / should I just deploy IGFS outside of YARN?
- Is there a way of configuring the local filesystem as a tiered storage layer (or is it on the roadmap)? The use case is that even reading from an SSD is much faster than reading from S3.

Thanks in advance! - Haithem
