It really depends on your use case. There are two problems: networking and configuration.
On my local machine I usually use docker-compose. With docker-compose all of the containers share the same network, so you can set a specific hostname for the namenode container and reference that hostname from Spark in core-site.xml as the HDFS root path (a minimal sketch of such a setup is at the end of this mail).

For a real (multi-host) cluster I use Docker host networking. With host networking I can use the data-locality feature of YARN easily, without any magic. In that case you use the hostname of the server where the namenode container has been started.

In both cases you don't need to map a Docker volume to the outside, as Spark uses HDFS over RPC, but it can help to persist the working data of the nodes. I usually set dfs.namenode.name.dir and dfs.datanode.data.dir and map these directories as volumes in Docker (a sketch of that is also at the end of this mail).

Marton

ps: The way I am using dockerized Hadoop is available here: https://github.com/elek/bigdata-docker
But this is not the easiest way to start, as it contains multiple ways to start a cluster (local vs. remote) and sometimes I use Consul for configuration management.

On Mar 14, 2017, at 6:17 PM, Adamantios Corais <[email protected]> wrote:

Hi,

I am trying to set up an HDFS cluster for development and testing using the following docker image: sequenceiq/hadoop-docker:2.7.1

My question is, which volume should I mount on the host machine in order to read and write from an external app (e.g. Spark)? What will be the HDFS path in that case?

--
// Adamantios Corais
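
A minimal sketch of the docker-compose case described above. The service/hostname "namenode", the image names, and port 9000 are illustrative assumptions (not taken from the sequenceiq image or the repo linked above); adjust them to whatever images and ports you actually run:

    # docker-compose.yml (sketch)
    version: "2"
    services:
      namenode:
        image: my-hadoop-namenode        # hypothetical image name
        hostname: namenode               # hostname other containers resolve on the compose network
        ports:
          - "9000:9000"                  # NameNode RPC (assumed port)
          - "50070:50070"                # NameNode web UI (Hadoop 2.x default)
      datanode:
        image: my-hadoop-datanode        # hypothetical image name
        hostname: datanode

    <!-- core-site.xml on the Spark / client side (sketch) -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode:9000</value>  <!-- the compose hostname above -->
      </property>
    </configuration>

With fs.defaultFS set like this, a Spark job on the same network would address files as hdfs://namenode:9000/path/to/data (or simply /path/to/data).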
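
For the multi-host / host-networking case, the idea is simply that the container reuses the host's network stack, so the NameNode is reachable under the physical server's hostname. A hedged example (the image name and hostname are placeholders):

    # run the namenode container with the host's network stack (sketch)
    docker run -d --net=host --name namenode my-hadoop-namenode

    # clients then point core-site.xml at the physical host, e.g.
    #   fs.defaultFS = hdfs://namenode-host.example.com:9000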
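
And for persisting the working data of the nodes, a sketch of the dfs.namenode.name.dir / dfs.datanode.data.dir volume mapping; the /hadoop/dfs/... container paths and ./data/... host paths are arbitrary examples:

    <!-- hdfs-site.xml inside the containers (sketch) -->
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/hadoop/dfs/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hadoop/dfs/data</value>
      </property>
    </configuration>

    # matching volume mappings, docker-compose.yml fragment (sketch)
    services:
      namenode:
        volumes:
          - ./data/namenode:/hadoop/dfs/name
      datanode:
        volumes:
          - ./data/datanode:/hadoop/dfs/data

This way the HDFS metadata and block data survive container restarts, while Spark still reads and writes through hdfs:// rather than through the mounted directories.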
