I figured it out - I should be using textFile(...), not hadoopFile(...). And my HDFS URL should include the host:

  hdfs://host/user/kwilliams/corTable2/part-m-00000

I haven't figured out how to let the hostname default to the host mentioned in our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, but that's not so important.
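(My guess - unverified - is that the hostname default actually comes from fs.defaultFS, or the older fs.default.name, in core-site.xml rather than hdfs-site.xml, and that it only kicks in if HADOOP_CONF_DIR really ends up on the shell's classpath; then a host-less URL like hdfs:///user/... ought to resolve against it. I haven't tested that, though.)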
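For anyone who hits the same thing, here's roughly the call that works for me now (the path is just my own example, and "host" stands in for whatever your NameNode is called):

  scala> val lines = sc.textFile("hdfs://host/user/kwilliams/corTable2/part-m-00000")  // RDD[String], one element per line of the file
  scala> lines.count()

textFile() just hands back the lines of the file as strings, which is all I needed here.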
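As far as I can tell, the InstantiationException below came from calling hadoopFile() with no type parameters: Scala infers Nothing for the key, value and InputFormat types (hence the RDD[(Nothing, Nothing)]), and Hadoop then can't instantiate an InputFormat from that. If I actually needed hadoopFile(), I believe spelling the types out would look something like this (untested sketch, same made-up path as before):

  scala> import org.apache.hadoop.io.{LongWritable, Text}
  scala> import org.apache.hadoop.mapred.TextInputFormat
  scala> val hdf = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://host/user/kwilliams/dat/part-m-00000")
  scala> hdf.map(_._2.toString).count()  // keys are byte offsets into the file, values are the lines

But for plain text files, textFile() is the simpler route.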
-Ken

> -----Original Message-----
> From: Williams, Ken [mailto:ken.willi...@windlogics.com]
> Sent: Monday, April 21, 2014 2:04 PM
> To: Spark list
> Subject: Problem connecting to HDFS in Spark shell
>
> I'm trying to get my feet wet with Spark. I've done some simple stuff in the shell in standalone mode, and now I'm trying to connect to HDFS resources, but I'm running into a problem.
>
> I synced to git's master branch (c399baa - "SPARK-1456 Remove view bounds on Ordered in favor of a context bound on Ordering." (3 days ago) <Michael Armbrust>) and built like so:
>
>   SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly
>
> This created various jars in various places, including these (I think):
>
>   ./examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar
>   ./tools/target/scala-2.10/spark-tools-assembly-1.0.0-SNAPSHOT.jar
>   ./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar
>
> In `conf/spark-env.sh`, I added this (actually before I did the assembly):
>
>   export HADOOP_CONF_DIR=/etc/hadoop/conf
>
> Now I fire up the shell (bin/spark-shell) and try to grab data from HDFS, and get the following exception:
>
>   scala> var hdf = sc.hadoopFile("hdfs:///user/kwilliams/dat/part-m-00000")
>   hdf: org.apache.spark.rdd.RDD[(Nothing, Nothing)] = HadoopRDD[0] at hadoopFile at <console>:12
>
>   scala> hdf.count()
>   java.lang.RuntimeException: java.lang.InstantiationException
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
>     at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
>     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:209)
>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:207)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1064)
>     at org.apache.spark.rdd.RDD.count(RDD.scala:806)
>     at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
>     at $iwC$$iwC$$iwC.<init>(<console>:20)
>     at $iwC$$iwC.<init>(<console>:22)
>     at $iwC.<init>(<console>:24)
>     at <init>(<console>:26)
>     at .<init>(<console>:30)
>     at .<clinit>(<console>)
>     at .<init>(<console>:7)
>     at .<clinit>(<console>)
>     at $print(<console>)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
>     at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
>     at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
>     at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
>     at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
>     at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
>     at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
>     at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
>     at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
>     at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
>     at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
>     at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
>     at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>     at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>     at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>     at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
>     at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
>     at org.apache.spark.repl.Main$.main(Main.scala:31)
>     at org.apache.spark.repl.Main.main(Main.scala)
>   Caused by: java.lang.InstantiationException
>     at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
>     ... 41 more
>
> Is this recognizable to anyone as a build problem, or a config problem, or anything? Failing that, any way to get more information about where in the process it's failing?
>
> Thanks.
>
> --
> Ken Williams, Senior Research Scientist
> WindLogics
> http://windlogics.com
>
> ________________________________
>
> CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution of any kind is strictly prohibited. If you are not the intended recipient, please contact the sender via reply e-mail and destroy all copies of the original message. Thank you.