It really depends on your use case.

There are two problems: networking and configuration.

Usually I use docker-compose on my local machine. With docker-compose all of 
the containers share the same network, so you can set a specific hostname 
for the namenode container and use that hostname from Spark in core-site.xml 
as the HDFS root path.
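
As a rough sketch, a docker-compose.yml for this could look something like the 
following (the image name, service names and ports are only placeholders here, 
adjust them to the image you actually use):

  version: "2"
  services:
    namenode:
      image: my-hadoop-image    # placeholder image
      hostname: namenode        # resolvable by the other containers on the compose network
      ports:
        - "50070:50070"         # namenode web UI on Hadoop 2.x
    datanode:
      image: my-hadoop-image
      hostname: datanode

Then core-site.xml on the Spark/client side would point at that hostname (the 
RPC port depends on your image's config, often 8020 or 9000):

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode:8020</value>
    </property>
  </configuration>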

For a real (multi-host) cluster I use Docker host networking. With the host 
network I can use the data locality feature of YARN easily, without any magic. 
In that case you can use the hostname of the server where the namenode 
container has been started.
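
For example (the image name is again just a placeholder, the important part is 
the --network host flag, which makes the container use the host's hostname and 
ports directly):

  docker run -d --network host --name namenode my-hadoop-image

With that, fs.defaultFS can simply be hdfs://<hostname-of-that-server>:<rpc-port>.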

In both cases you don't need to map a Docker volume to the outside, as Spark 
uses HDFS over RPC, but it can help to persist the working data of the nodes. 
Usually I set dfs.namenode.name.dir and dfs.datanode.data.dir and map these 
directories as Docker volumes.
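
As an example (the paths here are only illustrative), the hdfs-site.xml 
fragment:

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/dfs/data</value>
  </property>

and the matching volume mappings in docker-compose.yml:

  namenode:
    volumes:
      - ./data/namenode:/hadoop/dfs/name
  datanode:
    volumes:
      - ./data/datanode:/hadoop/dfs/data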

Marton

ps:

The way I am using dockerized Hadoop is available here: 
https://github.com/elek/bigdata-docker But this is not the easiest place to 
start, as it contains multiple ways to start a cluster (local vs. remote), and 
sometimes I use Consul for configuration management.

On Mar 14, 2017, at 6:17 PM, Adamantios Corais <[email protected]> wrote:


Hi,

I am trying to setup an HDFS cluster for development and testing using the 
following docker image: sequenceiq/hadoop-docker:2.7.1

My question is: which volume should I mount on the host machine in order to 
read and write from an external app (e.g. Spark)? And what will the HDFS path 
be in that case?

--
// Adamantios Corais
