I figured it out: I should be using textFile(...), not hadoopFile(...), and
my HDFS URL should include the host:

  hdfs://host/user/kwilliams/corTable2/part-m-00000

I haven't figured out how to let the hostname default to the host mentioned in 
our /etc/hadoop/conf/hdfs-site.xml like the Hadoop command-line tools do, but 
that's not so important.
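
In case it's useful to anyone searching the archives later, the working shell
session looks roughly like this (the host and path are placeholders from my
own cluster; substitute your own):

  scala> val lines = sc.textFile("hdfs://host/user/kwilliams/corTable2/part-m-00000")
  scala> lines.count()

I believe the original hadoopFile() call failed because hadoopFile() wants
explicit key, value, and InputFormat type parameters; with none given they get
inferred as Nothing (hence the RDD[(Nothing, Nothing)] in the quoted output
below), and Hadoop's ReflectionUtils can't instantiate Nothing, which is where
the InstantiationException comes from. If you do want the hadoopFile() form,
something like this should work, though it's an untested sketch on my part:

  scala> import org.apache.hadoop.io.{LongWritable, Text}
  scala> import org.apache.hadoop.mapred.TextInputFormat
  scala> val hdf = sc.hadoopFile[LongWritable, Text, TextInputFormat](
       |   "hdfs://host/user/kwilliams/dat/part-m-00000")
  scala> hdf.map(_._2.toString).count()   // Text values are the file's lines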

 -Ken


> -----Original Message-----
> From: Williams, Ken [mailto:ken.willi...@windlogics.com]
> Sent: Monday, April 21, 2014 2:04 PM
> To: Spark list
> Subject: Problem connecting to HDFS in Spark shell
> 
> I'm trying to get my feet wet with Spark.  I've done some simple stuff in the
> shell in standalone mode, and now I'm trying to connect to HDFS resources,
> but I'm running into a problem.
> 
> I synced to the git master branch (c399baa - "SPARK-1456 Remove view bounds
> on Ordered in favor of a context bound on Ordering. (3 days ago) <Michael
> Armbrust>") and built like so:
> 
>     SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly
> 
> This created various jars in various places, including these (I think):
> 
>    ./examples/target/scala-2.10/spark-examples-assembly-1.0.0-SNAPSHOT.jar
>    ./tools/target/scala-2.10/spark-tools-assembly-1.0.0-SNAPSHOT.jar
>    ./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.2.0.jar
> 
> In `conf/spark-env.sh`, I added this (actually before I did the assembly):
> 
>     export HADOOP_CONF_DIR=/etc/hadoop/conf
> 
> Now I fire up the shell (bin/spark-shell) and try to grab data from HDFS, and
> get the following exception:
> 
> scala> var hdf = sc.hadoopFile("hdfs:///user/kwilliams/dat/part-m-00000")
> hdf: org.apache.spark.rdd.RDD[(Nothing, Nothing)] = HadoopRDD[0] at hadoopFile at <console>:12
> 
> scala> hdf.count()
> java.lang.RuntimeException: java.lang.InstantiationException
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
>         at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:155)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:168)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:209)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:207)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1064)
>         at org.apache.spark.rdd.RDD.count(RDD.scala:806)
>         at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
>         at $iwC$$iwC$$iwC.<init>(<console>:20)
>         at $iwC$$iwC.<init>(<console>:22)
>         at $iwC.<init>(<console>:24)
>         at <init>(<console>:26)
>         at .<init>(<console>:30)
>         at .<clinit>(<console>)
>         at .<init>(<console>:7)
>         at .<clinit>(<console>)
>         at $print(<console>)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
>         at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
>         at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
>         at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
>         at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
>         at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
>         at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
>         at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
>         at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
>         at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
>         at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
>         at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
>         at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>         at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>         at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>         at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
>         at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
>         at org.apache.spark.repl.Main$.main(Main.scala:31)
>         at org.apache.spark.repl.Main.main(Main.scala)
> Caused by: java.lang.InstantiationException
>         at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
>         ... 41 more
> 
> 
> Is this recognizable to anyone as a build problem, or a config problem, or
> anything?  Failing that, any way to get more information about where in the
> process it's failing?
> 
> Thanks.
> 
> --
> Ken Williams, Senior Research Scientist
> WindLogics
> http://windlogics.com
> 
