Yes Jeff, thanks again. I was able to run a standalone TF training application with TensorBoard in a Docker container. I will definitely take care of passwordless ssh once I start with distributed TF.
On Tue, Feb 19, 2019 at 9:44 PM Jeff Hubbs <jhubbsl...@att.net> wrote:

> Great, Vinay - I'm glad that made a difference. When you get to the point
> where you are running a cluster, the same sort of thing will have to carry
> over to all nodes, with the added issue that ssh and keys must be
> configured such that each of those users can shell to other nodes without
> supplying a password.
>
> On 2/18/19 11:41 PM, Vinay Kashyap wrote:
>
> Perfect Jeff, I clearly understand.
> After changing the setup to the appropriate users and folder permissions,
> I can see some progress..
>
> Cheers..
>
> On Fri, Feb 15, 2019 at 10:05 AM Jeff Hubbs <jhubbsl...@att.net> wrote:
>
>> On 2/14/19 11:09 PM, Vinay Kashyap wrote:
>>
>> I am running hadoop on my mac and all the folders have *myuser:staff* as
>> the owner. I have verified the permissions for the local dirs to be 755.
>>
>> This doesn't sound right. By-the-book, there are supposed to be separate
>> "users" for hdfs, yarn, and mapred to run their respective daemons. The
>> directories they read/write in are supposed to be permed and owned to
>> expect that. One possible approach for purposes of log-writing etc. is to
>> put those user accounts in a group (perhaps named "hadoop") so that
>> read/written areas in common are owned by that group and permed accordingly.
>>
>> If you're going to ad-lib that arrangement then you'll have to ad-lib a
>> lot of the rest of how worker nodes and edge nodes behave accordingly.
>>
>> I run all hadoop services with myuser and I have configured
>> *yarn.nodemanager.linux-container-executor.group=staff* accordingly
>> both in *yarn-site.xml* and *container-executor.cfg*
>>
>> 1. Is the container-executor binary certified to work as expected on
>> OSX.?
>> 2. When linux container executor is configured, is there any hard
>> expectation that users of the running hadoop services to be part of [*root,
>> hdfs, yarn...*] and group to be *hadoop*.? So that the directory
>> permissions fall in line accordingly?
>>
>> Can you please help me understand this.? Could not find any write up on
>> this.
>>
>> On Thu, Feb 14, 2019 at 11:13 PM Prabhu Josephraj <pjos...@cloudera.com>
>> wrote:
>>
>>> In case of Distributed Shell Job - ApplicationMaster runs in normal
>>> linux container and the subsequent shell command runs inside Docker
>>> container. The job fails even before launching AM, that is before
>>> starting Docker Container. I think the Distributed Shell job will fail even
>>> without Docker Settings.
>>>
>>> As per the error code 20 , it is mostly related to accessing of NM local
>>> directory.
>>>
>>> https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_sg_yarn_container_exec_errors.html
>>>
>>> 20
>>>
>>> INITIALIZE_USER_FAILED
>>>
>>> Couldn't get, stat, or secure the per-user NodeManager directory.
>>>
>>> Can we try below steps on (all) NodeManager machine.
>>>
>>> Remove all contents under /data/yarn and make sure the /data and
>>> /data/yarn directory permission is 755 with owner root:root and local
>>> directory is owned by yarn:hadoop.
>>>
>>> [root@tparimi-tarunhdp26-4 ~]# ls -lrt /
>>> drwxr-xr-x. 5 root root 44 Oct 24 11:47 data
>>>
>>> [root@tparimi-tarunhdp26-4 ~]# ls -lrt /data/
>>> drwxr-xr-x. 4 root root 28 Oct 24 14:30 yarn
>>>
>>> [root@tparimi-tarunhdp26-4 ~]# ls -lrt /data/yarn/
>>> total 4
>>> drwxr-xr-x. 5 yarn hadoop 54 Feb 14 17:32 local
>>> drwxrwxr-x. 10 yarn hadoop 4096 Feb 14 17:32 log
>>>
>>> And also check if Distributed Shell jobs runs fine without Docker
>>> Settings.
>>>
>>> On Thu, Feb 14, 2019 at 10:15 PM Vinay Kashyap <vinu.k...@gmail.com>
>>> wrote:
>>>
>>>> Hi Prabhu,
>>>>
>>>> Thanks for your reply.
>>>> I tried the configurations as per your suggestion. But I get the
>>>> same error.
>>>> Is this related to container localization by any chance?.
>>>> Also, is there any log or out information which says that the docker
>>>> container runtime has been picked up.?
>>>>
>>>> On Thu, Feb 14, 2019 at 9:38 PM Prabhu Josephraj <pjos...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Hi Vinay,
>>>>>
>>>>> Can you try specifying below configs under Docker section in
>>>>> container-executor.cfg which will allow Docker Containers to use the NM
>>>>> Local Dirs.
>>>>>
>>>>> docker.allowed.ro-mounts=/data/yarn/local,,/usr/jdk64/jdk1.8.0_112/bin
>>>>> docker.allowed.rw-mounts=/data/yarn/local,/data/yarn/log
>>>>>
>>>>> Thanks,
>>>>> Prabhu Joseph
>>>>>
>>>>> On Thu, Feb 14, 2019 at 9:28 PM Vinay Kashyap <vinu.k...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I am using Hadoop 3.2.0 and trying to run a simple application in a
>>>>>> docker container and I have made the required configuration changes both
>>>>>> in *yarn-site.xml* and *container-executor.cfg* to choose
>>>>>> LinuxContainerExecutor and docker runtime.
>>>>>>
>>>>>> I use the example of distributed shell in one of the hortonworks
>>>>>> blog.
>>>>>> https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/
>>>>>>
>>>>>> The problem I face here is when the application is submitted to YARN
>>>>>> it fails with a reason related to directory creation issue with the below
>>>>>> error
>>>>>>
>>>>>> 2019-02-14 20:51:16,450 INFO distributedshell.Client: Got application
>>>>>> report from ASM for, appId=2, clientToAMToken=null,
>>>>>> appDiagnostics=Application application_1550156488785_0002 failed 2 times
>>>>>> due to AM Container for appattempt_1550156488785_0002_000002 exited with
>>>>>> exitCode: -1000 Failing this attempt.Diagnostics: [2019-02-14
>>>>>> 20:51:16.282]Application application_1550156488785_0002 initialization
>>>>>> failed (exitCode=20) with output: main : command provided 0 main : user
>>>>>> is myuser main : requested yarn user is myuser Failed to create directory
>>>>>> /data/yarn/local/nmPrivate/container_1550156488785_0002_02_000001.tokens/usercache/myuser
>>>>>> - Not a directory
>>>>>>
>>>>>> I have configured *yarn.nodemanager.local-dirs* in yarn-site.xml and
>>>>>> I can see the same reflected in YARN web ui *localhost:8088/conf*
>>>>>>
>>>>>> <property>
>>>>>> <name>yarn.nodemanager.local-dirs</name>
>>>>>> <value>/data/yarn/local</value>
>>>>>> <final>false</final>
>>>>>> <source>yarn-site.xml</source>
>>>>>> </property>
>>>>>>
>>>>>> I do not understand why is it trying to create usercache dir inside
>>>>>> the nmPrivate directory.
>>>>>>
>>>>>> Note : I have verified the permissions for myuser to the directories
>>>>>> and also have tried clearing the directories manually as suggested in a
>>>>>> related post. But no fruit. I do not see any additional information about
>>>>>> container launch failure in any other logs.
>>>>>>
>>>>>> How do I debug why the usercache dir is not resolved properly??
>>>>>>
>>>>>> Really appreciate any help on this.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Vinay Kashyap
>>>>>>
>>>>
>>>> --
>>>> *Thanks and regards*
>>>> *Vinay Kashyap*
>>>>
>>
>> --
>> *Thanks and regards*
>> *Vinay Kashyap*
>>
>
> --
> *Thanks and regards*
> *Vinay Kashyap*
>

--
*Thanks and regards*
*Vinay Kashyap*