Hi all, in one of my previous threads I described some problems with configuring Zeppelin + Spark on EC2. Now I'm a step further. On both servers I have Spark 1.6.1; I just updated to this version, but the version itself is not the problem.
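For context: Zeppelin points at the remote standalone master the usual way. My conf/zeppelin-env.sh is roughly the following (hostname and path are placeholders, not the real values):

    export MASTER=spark://<spark-master-host>:7077   # remote Spark standalone master
    export SPARK_HOME=/opt/spark-1.6.1               # same Spark 1.6.1 build on the Zeppelin box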
What happens...

1. Pure environment: Zeppelin connects to the remote Spark, and a small piece of code is executed:

    val NUM_SAMPLES = 10000000
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random()
      val y = Math.random()
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)

2. No problem at all. A step further: I need to read data from Amazon S3.

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "xxx")
    val rddFull = sc.textFile("xxx").zipWithIndex()

And...

    java.io.IOException: No FileSystem for scheme: s3n
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
        ...

3. Clear: hadoop-aws is missing from the classpath. I'm adding the following dependency: org.apache.hadoop:hadoop-aws:2.6.0
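For reference, one way to get that jar onto the interpreter's classpath is Zeppelin's %dep interpreter, run in a paragraph before the first Spark paragraph (the loading mechanism shouldn't matter for what follows):

    %dep
    z.reset()                                     // clear previously loaded artifacts
    z.load("org.apache.hadoop:hadoop-aws:2.6.0")  // contains the s3n filesystem implementation

If the Spark interpreter has already started, it has to be restarted before the new dependency is picked up.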
4. Executing the piece of code again:

    java.lang.NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer$.handledType()Ljava/lang/Class;
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<init>(ScalaNumberDeserializersModule.scala:49)
        at com.fasterxml.jackson.module.scala.deser.NumberDeserializers$.<clinit>(ScalaNumberDeserializersModule.scala)
        at com.fasterxml.jackson.module.scala.deser.ScalaNumberDeserializersModule$class.$init$(ScalaNumberDeserializersModule.scala:61)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule.<init>(DefaultScalaModule.scala:19)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<init>(DefaultScalaModule.scala:35)
        at com.fasterxml.jackson.module.scala.DefaultScalaModule$.<clinit>(DefaultScalaModule.scala)
        at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:81)
        at org.apache.spark.rdd.RDDOperationScope$.<clinit>(RDDOperationScope.scala)
        at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
        at org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
        at $iwC$$iwC$$iwC.<init>(<console>:43)
        at $iwC$$iwC.<init>(<console>:45)
        at $iwC.<init>(<console>:47)
        at <init>(<console>:49)
        at .<init>(<console>:53)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.zeppelin.spark.SparkInterpreter.interpretInput(SparkInterpreter.java:813)
        at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:756)
        at org.apache.zeppelin.spark.SparkInterpreter.interpret(SparkInterpreter.java:748)
        at org.apache.zeppelin.interpreter.ClassloaderInterpreter.interpret(ClassloaderInterpreter.java:57)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:331)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
        at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

5. A small investigation on Google shows that it's all about incompatible versions of the FasterXML Jackson libraries. And the question: how do I fix it?

What I've already done to try to fix it:

1. tried Spark 1.5.2 + Hadoop 2.6 - same error
2. tried Spark 1.5.2 + Hadoop 2.4 - same error
3. tried to recompile Spark with an upgraded version of Jackson (2.6) - same error

Moreover, if I log in to the Spark server and use the Spark shell directly, I can execute the same piece of code without any problems: after a few seconds my file is read, and a basic count() shows the correct result.

Regards,
Marcin
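PS: A workaround that seems to be commonly suggested for this kind of Jackson clash (I haven't verified it myself yet) is to load hadoop-aws with its transitive Jackson artifacts excluded, so that the Jackson 2.4.x bundled with Spark 1.6.1 stays the only copy on the classpath. A sketch, again via %dep, using the excludes() pattern form from Zeppelin's dependency-loading docs:

    %dep
    z.reset()
    // Keep hadoop-aws itself, but drop its transitive com.fasterxml.jackson.core
    // artifacts; Spark's own jackson-databind 2.4.x then remains the single version.
    z.load("org.apache.hadoop:hadoop-aws:2.6.0").excludes("com.fasterxml.jackson.core:*")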