Hi,

I'm trying to run Ignite on AWS EMR as a YARN application, using ZooKeeper for node discovery. I compiled Ignite with

```
mvn clean package -DskipTests -Dignite.edition=hadoop -Dhadoop.version=2.7.3
```

I'm using this ignite_yarn.properties:

```
# The number of nodes in the cluster.
IGNITE_NODE_COUNT=3

# The number of CPU Cores for each Apache Ignite node.
IGNITE_RUN_CPU_PER_NODE=1

# The number of Megabytes of RAM for each Apache Ignite node.
IGNITE_MEMORY_PER_NODE=500

IGNITE_PATH=hdfs:///user/hadoop/ignite/apache-ignite-2.3.0-hadoop-2.7.3.zip
IGNITE_XML_CONFIG=hdfs:///user/hadoop/ignite/ignite_conf.xml

# Local path
IGNITE_WORK_DIR=/mnt

# Local path
IGNITE_RELEASES_DIR=/mnt

IGNITE_WORKING_DIR=/mnt
```

and this ignite_conf.xml:

```
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        http://www.springframework.org/schema/beans/spring-beans.xsd">
    <bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="cacheConfiguration">
            <list>
                <!-- Partitioned replicated cache configuration (Atomic mode). -->
                <bean class="org.apache.ignite.configuration.CacheConfiguration">
                    <property name="name" value="default"/>
                    <property name="atomicityMode" value="ATOMIC"/>
                    <property name="backups" value="3"/>
                    <property name="cacheMode" value="PARTITIONED"/>
                </bean>
            </list>
        </property>

        <!-- Explicitly configure TCP discovery SPI to provide list of initial nodes. -->
        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.zk.TcpDiscoveryZookeeperIpFinder">
                        <!-- FIXME change to master internal API (as used by YARN), e.g. ip-10-0-0-154.ec2.internal:2181 -->
                        <property name="zkConnectionString" value="ip-10-0-0-173.ec2.internal:2181"/>
                    </bean>
                </property>
            </bean>
        </property>

        <property name="gridLogger">
            <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
                <!-- Default path relative to IGNITE_HOME, assumming IGNITE_HOME is set to the root of the Ignite installation -->
                <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
            </bean>
        </property>
    </bean>
</beans>
```

Then I launch the YARN job with

```
IGNITE_YARN_JAR=/mnt/ignite/apache-ignite-2.3.0-src/modules/yarn/target/ignite-yarn-2.3.0.jar
yarn jar ${IGNITE_YARN_JAR} ${IGNITE_YARN_JAR} /mnt/ignite/ignite_yarn.properties
```

The app launches and the application master writes logs, but each container runs for only a few seconds, and the application keeps requesting new containers. For example, the application master log shows

```
Nov 20, 2017 3:08:30 AM org.apache.ignite.yarn.ApplicationMaster onContainersAllocated
INFO: Launching container: container_1511142795395_0005_01_017079.
17/11/20 03:08:30 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-10-0-0-230.ec2.internal:8041
Nov 20, 2017 3:08:30 AM org.apache.ignite.yarn.ApplicationMaster onContainersAllocated
INFO: Launching container: container_1511142795395_0005_01_017080.
17/11/20 03:08:30 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-10-0-0-78.ec2.internal:8041
Nov 20, 2017 3:08:30 AM org.apache.ignite.yarn.ApplicationMaster onContainersAllocated
INFO: Launching container: container_1511142795395_0005_01_017081.
17/11/20 03:08:30 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-10-0-0-193.ec2.internal:8041
Nov 20, 2017 3:08:31 AM org.apache.ignite.yarn.ApplicationMaster onContainersCompleted
INFO: Container completed. Container id: container_1511142795395_0005_01_017080. State: COMPLETE.
Nov 20, 2017 3:08:31 AM org.apache.ignite.yarn.ApplicationMaster onContainersCompleted
INFO: Container completed. Container id: container_1511142795395_0005_01_017081. State: COMPLETE.
Nov 20, 2017 3:08:31 AM org.apache.ignite.yarn.ApplicationMaster onContainersCompleted
INFO: Container completed. Container id: container_1511142795395_0005_01_017079. State: COMPLETE.
Nov 20, 2017 3:08:31 AM org.apache.ignite.yarn.ApplicationMaster onContainersAllocated
```

In the logs of a node manager, the containers appear to fail at launch because line 4 of the generated launch_container.sh, an `export BASH_FUNC_run_prestart()=...` statement, is not valid bash:

```
2017-11-20 03:08:47,810 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl (AsyncDispatcher event handler): Container container_1511142795395_0005_01_017281 transitioned from LOCALIZED to RUNNING
2017-11-20 03:08:47,811 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor (ContainersLauncher #4): launchContainer: [bash, /mnt/yarn/usercache/hadoop/appcache/application_1511142795395_0005/container_1511142795395_0005_01_017281/default_container_executor.sh]
2017-11-20 03:08:47,819 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor (ContainersLauncher #4): Exit code from container container_1511142795395_0005_01_017281 is : 2
2017-11-20 03:08:47,819 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor (ContainersLauncher #4): Exception from container-launch with container ID: container_1511142795395_0005_01_017281 and exit code: 2
ExitCodeException exitCode=2: /mnt/yarn/usercache/hadoop/appcache/application_1511142795395_0005/container_1511142795395_0005_01_017281/launch_container.sh: line 4: syntax error near unexpected token `('
/mnt/yarn/usercache/hadoop/appcache/application_1511142795395_0005/container_1511142795395_0005_01_017281/launch_container.sh: line 4: `export BASH_FUNC_run_prestart()="() { su -s /bin/bash $SVC_USER -c "cd $WORKING_DIR && $EXEC_PATH --config '$CONF_DIR' start $DAEMON_FLAGS"'
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2017-11-20 03:08:47,819 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #4): Exception from container-launch.
2017-11-20 03:08:47,819 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #4): Container id: container_1511142795395_0005_01_017281
2017-11-20 03:08:47,819 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #4): Exit code: 2
2017-11-20 03:08:47,819 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #4): Exception message: /mnt/yarn/usercache/hadoop/appcache/application_1511142795395_0005/container_1511142795395_0005_01_017281/launch_container.sh: line 4: syntax error near unexpected token `('
2017-11-20 03:08:47,819 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #4): /mnt/yarn/usercache/hadoop/appcache/application_1511142795395_0005/container_1511142795395_0005_01_017281/launch_container.sh: line 4: `export BASH_FUNC_run_prestart()="() { su -s /bin/bash $SVC_USER -c "cd $WORKING_DIR && $EXEC_PATH --config '$CONF_DIR' start $DAEMON_FLAGS"'
2017-11-20 03:08:47,819 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #4):
2017-11-20 03:08:47,819 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor (ContainersLauncher #4): Stack trace: ExitCodeException exitCode=2: /mnt/yarn/usercache/hadoop/appcache/application_1511142795395_0005/container_1511142795395_0005_01_017281/launch_container.sh: line 4: syntax error near unexpected token `('
```

When I launch Ignite manually on the master, it starts fine and connects to ZooKeeper, but I see a topology with just 1 node.

Any thoughts on what I might be doing wrong here?

Thanks in advance.

Juan Rodriguez
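
A note on the failing `export` line: bash stores exported shell functions as environment variables, and some bash builds name them `BASH_FUNC_<name>()`. Such a name is not a valid identifier for `export`, which matches the syntax error reported for launch_container.sh. The sketch below is only a diagnostic aid. It assumes, and the logs do not confirm this, that the environment of the shell running `yarn jar` is forwarded into the container launch script. The helper names `invalid_export_names` and `without_invalid_names` are hypothetical and not part of Ignite or YARN.

```python
import os
import re

# A name that bash accepts in `export NAME=value`: letters, digits and
# underscores, not starting with a digit.
_SHELL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def invalid_export_names(env):
    """Return the sorted variable names in env that `export` would reject."""
    return sorted(name for name in env if not _SHELL_NAME.fullmatch(name))


def without_invalid_names(env):
    """Return a copy of env that keeps only names `export` would accept."""
    return {name: value for name, value in env.items()
            if _SHELL_NAME.fullmatch(name)}


if __name__ == "__main__":
    # Run this in the same shell session that submits the YARN job.
    for name in invalid_export_names(dict(os.environ)):
        print("not exportable:", name)
```

If this prints a name such as `BASH_FUNC_run_prestart()`, removing that exported function from the submitting shell before running `yarn jar` would be one way to test whether it is the cause.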