Hi Sathish,

The docker image is normal - no AWS profile is included in it.

When the driver container runs with --net=host, the driver host's AWS
profile (the EC2 instance's IAM role) takes effect, so the driver can
access the protected S3 files. Similarly, the Mesos slaves run the Spark
executor's docker container in --net=host mode, so the slaves' AWS
profiles take effect for the executors as well.
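A quick way to sanity-check this is to query the EC2 instance metadata
endpoint from inside a --net=host container (a sketch only -
<registry>/<image>:<tag> is a placeholder, and it assumes curl is in the
image, as in the Dockerfile below):

  docker run --rm --net=host <registry>/<image>:<tag> \
    curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/

If the instance's IAM role name is printed, the temporary credentials are
reachable from inside the container.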
Hope it helps,
Mao

> On Jan 26, 2016, at 9:15 PM, Sathish Kumaran Vairavelu
> <vsathishkuma...@gmail.com> wrote:
>
> Hi Mao,
>
> I want to check on accessing S3 from the Spark docker container on Mesos.
> The EC2 instance that I am using has an AWS profile/IAM role attached.
> Should we build the docker image with any AWS profile settings, or does
> the --net=host docker option take care of it?
>
> Please help.
>
> Thanks,
> Sathish
>
>> On Tue, Jan 26, 2016 at 9:04 PM Mao Geng <m...@sumologic.com> wrote:
>> Thank you very much, Jerry!
>>
>> I changed to "--jars
>> /opt/spark/lib/hadoop-aws-2.7.1.jar,/opt/spark/lib/aws-java-sdk-1.7.4.jar"
>> and then it worked like a charm!
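>>
>> In other words, the full invocation looked roughly like this (a sketch -
>> <registry>/<image>:<tag> and <master_host> are placeholders, as in my
>> first mail):
>>
>>   docker run --rm -it --net=host <registry>/<image>:<tag> \
>>     /opt/spark/bin/spark-shell --master mesos://<master_host>:5050 \
>>     --conf spark.mesos.executor.docker.image=<registry>/<image>:<tag> \
>>     --jars /opt/spark/lib/hadoop-aws-2.7.1.jar,/opt/spark/lib/aws-java-sdk-1.7.4.jar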
>>
>> From the Mesos task logs below, I saw that the Mesos executor downloaded
>> the jars from the driver, which is a bit unnecessary (the docker image
>> already has them), but that's OK - I am happy to see Spark + Mesos +
>> Docker + S3 working together!
>>
>> Thanks,
>> Mao
>>
>> 16/01/27 02:54:45 INFO Executor: Using REPL class URI: http://172.16.3.98:33771
>> 16/01/27 02:55:12 INFO CoarseGrainedExecutorBackend: Got assigned task 0
>> 16/01/27 02:55:12 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
>> 16/01/27 02:55:12 INFO Executor: Fetching http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar with timestamp 1453863280432
>> 16/01/27 02:55:12 INFO Utils: Fetching http://172.16.3.98:3850/jars/hadoop-aws-2.7.1.jar to /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp1518118694295619525.tmp
>> 16/01/27 02:55:12 INFO Utils: Copying /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/-19880839621453863280432_cache to /./hadoop-aws-2.7.1.jar
>> 16/01/27 02:55:12 INFO Executor: Adding file:/./hadoop-aws-2.7.1.jar to class loader
>> 16/01/27 02:55:12 INFO Executor: Fetching http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar with timestamp 1453863280472
>> 16/01/27 02:55:12 INFO Utils: Fetching http://172.16.3.98:3850/jars/aws-java-sdk-1.7.4.jar to /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/fetchFileTemp8868621397726761921.tmp
>> 16/01/27 02:55:12 INFO Utils: Copying /tmp/spark-7b8e1681-8a62-4f1d-9e11-fdf8062b1b08/8167072821453863280472_cache to /./aws-java-sdk-1.7.4.jar
>> 16/01/27 02:55:12 INFO Executor: Adding file:/./aws-java-sdk-1.7.4.jar to class loader
>>
>>> On Tue, Jan 26, 2016 at 5:40 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>> Hi Mao,
>>>
>>> Can you try --jars to include those jars?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> Sent from my iPhone
>>>
>>>> On 26 Jan, 2016, at 7:02 pm, Mao Geng <m...@sumologic.com> wrote:
>>>>
>>>> Hi there,
>>>>
>>>> I am trying to run Spark on Mesos using a docker image as the executor,
>>>> as described at
>>>> http://spark.apache.org/docs/latest/running-on-mesos.html#mesos-docker-support.
>>>>
>>>> I built a docker image using the following Dockerfile (based on
>>>> https://github.com/apache/spark/blob/master/docker/spark-mesos/Dockerfile):
>>>>
>>>> FROM mesosphere/mesos:0.25.0-0.2.70.ubuntu1404
>>>>
>>>> # Update the base ubuntu image with the dependencies needed for Spark
>>>> RUN apt-get update && \
>>>>     apt-get install -y python libnss3 openjdk-7-jre-headless curl
>>>>
>>>> RUN curl http://www.carfab.com/apachesoftware/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz | tar -xzC /opt && \
>>>>     ln -s /opt/spark-1.6.0-bin-hadoop2.6 /opt/spark
>>>> ENV SPARK_HOME /opt/spark
>>>> ENV MESOS_NATIVE_JAVA_LIBRARY /usr/local/lib/libmesos.so
>>>>
>>>> Then I successfully ran spark-shell via this docker command:
>>>>
>>>>   docker run --rm -it --net=host <registry>/<image>:<tag> \
>>>>     /opt/spark/bin/spark-shell --master mesos://<master_host>:5050 \
>>>>     --conf spark.mesos.executor.docker.image=<registry>/<image>:<tag>
>>>>
>>>> So far so good. Then I wanted to call sc.textFile to load a file from
>>>> S3, but I was blocked by some issues I couldn't figure out. I read
>>>> https://dzone.com/articles/uniting-spark-parquet-and-s3-as-an-alternative-to
>>>> and
>>>> http://blog.encomiabile.it/2015/10/29/apache-spark-amazon-s3-and-apache-mesos,
>>>> and learned that I need to add hadoop-aws-2.7.1 and aws-java-sdk-1.7.4
>>>> to the executor's and driver's classpaths in order to access s3 files.
>>>>
>>>> So I added the following lines to the Dockerfile and built a new image:
>>>>
>>>> RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar -o /opt/spark/lib/aws-java-sdk-1.7.4.jar
>>>> RUN curl http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.1/hadoop-aws-2.7.1.jar -o /opt/spark/lib/hadoop-aws-2.7.1.jar
>>>>
>>>> Then I started spark-shell again with the command below:
>>>>
>>>>   docker run --rm -it --net=host <registry>/<image>:<tag> \
>>>>     /opt/spark/bin/spark-shell --master mesos://<master_host>:5050 \
>>>>     --conf spark.mesos.executor.docker.image=<registry>/<image>:<tag> \
>>>>     --conf spark.executor.extraClassPath=/opt/spark/lib/hadoop-aws-2.7.1.jar:/opt/spark/lib/aws-java-sdk-1.7.4.jar \
>>>>     --conf spark.driver.extraClassPath=/opt/spark/lib/hadoop-aws-2.7.1.jar:/opt/spark/lib/aws-java-sdk-1.7.4.jar
>>>>
>>>> But the following command failed when I ran it in spark-shell:
>>>>
>>>> scala> sc.textFile("s3a://<bucket_name>/<file_name>").count()
>>>> [Stage 0:> (0 + 2) / 2]16/01/26 23:05:23 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-172-16-14-203.us-west-2.compute.internal):
>>>> java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>>>>     at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
>>>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
>>>>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>>>>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>>>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>>>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>>>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>>>     at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:107)
>>>>     at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>>>     at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>>>>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
>>>>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>     at java.lang.Thread.run(Thread.java:745)
>>>> Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
>>>>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
>>>>     ... 23 more
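>>>>
>>>> (As a quick sanity check of the jar contents - assuming unzip, or a
>>>> JDK's jar tool, is available in the image - one can run:
>>>>
>>>>   unzip -l /opt/spark/lib/hadoop-aws-2.7.1.jar | grep S3AFileSystem
>>>>
>>>> which should list org/apache/hadoop/fs/s3a/S3AFileSystem.class.)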
>>>>
>>>> I checked hadoop-aws-2.7.1.jar, and the
>>>> org.apache.hadoop.fs.s3a.S3AFileSystem class file is indeed in it. I
>>>> also checked the Environment page of the driver's web UI on port 4040:
>>>> both hadoop-aws-2.7.1.jar and aws-java-sdk-1.7.4.jar are in the
>>>> Classpath Entries (system path). And the following code ran fine in
>>>> spark-shell:
>>>>
>>>> scala> val clazz = Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
>>>> clazz: Class[_] = class org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>
>>>> scala> clazz.getClassLoader()
>>>> res2: ClassLoader = sun.misc.Launcher$AppClassLoader@770848b9
>>>>
>>>> So I am confused: why did the task fail with a
>>>> java.lang.ClassNotFoundException? Is there something wrong with the
>>>> command-line options I used to start spark-shell, with the docker
>>>> image, or with the "s3a://" URL? Or is it something related to the
>>>> Docker executor of Mesos? I studied
>>>> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala
>>>> a bit, but didn't understand it well...
>>>>
>>>> I'd appreciate it if anyone could shed some light on this.
>>>>
>>>> Thanks,
>>>> Mao Geng