I also had to create a "default-config.xml" file, point to its location in HDFS via "IGNITE_XML_CONFIG", and then add the following property to the "igfs-data" bean. Not sure if that's expected...
<property name="affinityMapper">
    <bean class="org.apache.ignite.igfs.IgfsGroupDataBlocksKeyMapper">
        <!-- How many sequential blocks will be stored on the same node. -->
        <constructor-arg value="512"/>
    </bean>
</property>

On Thu, May 26, 2016 at 5:56 PM, Haithem Turki <[email protected]> wrote:

> Hello,
>
> I'm interested in using IGFS as a Hadoop caching layer - the usecase
> revolves largely around Spark jobs running on a YARN cluster that persist
> data to S3 (although I have some non-Spark stuff running too so would
> ideally integrate at the Hadoop filesystem layer). I'm excited about the
> potential speedups that this could bring :)
>
> I took a stab at deploying this for the first time, and had some questions:
>
> - I ideally was envisioning deploying nodes via YARN to take advantage of
> dynamic scaling and use any available memory on the cluster, I wanted to
> make sure that this was indeed a supported workflow / on the roadmap as I
> hit a few bumps along the way:
>   * I ended up needing to dump pretty much all of my Hadoop-related jars to
>   HDFS for my nodes to startup correctly (or else I was getting
>   ClassNotFoundExceptions ranging from guava to hadoop to asm to ignite
>   classes not being there). Am I doing something horribly wrong / have you
>   guys considered package a fat jar for the non-hadoop dependencies at least?
>   * Couldn't specify the yarn queue despite attempting to
>   set -Dmapreduce.job.queuename via IGNITE_JVM_OPTS variable (
>   https://issues.apache.org/jira/browse/IGNITE-2738?)
>   * Seems like dynamic allocation isn't supported? Wanted to get a sense of
>   whether this was in the roadmap
>   * Since YARN allocates containers at random it's pretty onerous to figure
>   out which hostnames have Ignite nodes running on them and specifying those
>   in the URL. For now I have TCP enabled (Ignite doesn't seem to die on port
>   conflicts if multiple nodes are running on the same machine) and I guess I
>   can set up a reverse proxy so that I can point towards a stable URL but
>   it's not great / doesn't scale well so I was wondering if there were other
>   suggestions on how to configure discovery (maybe spin up a local node
>   outside of YARN that leverages the cluster discovery?)
>   * I also wasn't clear on how cluster routing/balancing worked. If I
>   specify my hadoop jobs to point at host1:10500 via TCP, will all
>   read/writes route through that node or do the reads/writes somehow get
>   balanced?
>
> Or is this completely crazy / should I just deploy IGFS outside of YARN?
>
> - Is there a way of configuring the local filesystem as a tiered storage
> layer (or is it on the roadmap)? Usecase is that even reading from an SSD
> is much faster than S3.
>
> Thanks in advance!
> - Haithem
>
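
For context on the affinityMapper property at the top of this message, here is a minimal sketch of where such a property could sit inside the data cache bean. The enclosing bean layout is an assumption (a plain Spring CacheConfiguration whose name is "igfs-data"); all other cache settings are left out, and an actual default-config.xml may be structured differently.

```xml
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <!-- Assumed: the data cache is identified by name; other cache settings omitted. -->
    <property name="name" value="igfs-data"/>

    <property name="affinityMapper">
        <bean class="org.apache.ignite.igfs.IgfsGroupDataBlocksKeyMapper">
            <!-- How many sequential blocks will be stored on the same node. -->
            <constructor-arg value="512"/>
        </bean>
    </property>
</bean>
```

With a group size of 512, runs of 512 consecutive file blocks map to the same node, so sequential reads of a file touch fewer nodes.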
