I'm not quite sure whether Spark will go into subdirectories and pick up
files from them. You could do something like the following to bring all the
files into one directory.

find . -type f -exec mv '{}' . \;
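
Alternatively, here is a rough, untested sketch of how you could keep the
directory entries out of wholeTextFiles on the Spark side (the glob and the
paths below are just taken from your example):

// 1) If the files you want share an extension, a glob is the simplest fix:
val htmlPages = sc.wholeTextFiles(
  "hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/*.html")

// 2) Otherwise, list the plain files yourself with the Hadoop FileSystem API
//    and union one RDD per file:
import org.apache.hadoop.fs.{FileSystem, Path}

val dir   = new Path("hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/")
val fs    = FileSystem.get(dir.toUri, sc.hadoopConfiguration)
val files = fs.listStatus(dir).filter(_.isFile).map(_.getPath.toString)
val allPages = files.map(p => sc.wholeTextFiles(p)).reduce(_ ++ _)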


Thanks
Best Regards

On Fri, Dec 12, 2014 at 6:34 PM, Karen Murphy <k.l.mur...@qub.ac.uk> wrote:
>
>
>  When I try to load text files from an HDFS path using
> sc.wholeTextFiles("hdfs://localhost:54310/graphx/
> anywebsite.com/anywebsite.com/")
>
>  I get the following error:
>
> java.io.FileNotFoundException: Path is not a file: /graphx/
> anywebsite.com/anywebsite.com/css
> (full stack trace at bottom of message).
>
>  If I switch my Scala code to read the input files from the local disk,
> wholeTextFiles doesn't pick up directories (such as css in this case) and
> no exception is raised.
>
>  The trace information in the 'local file' version shows that only plain
> text files are collected with sc.wholeTextFiles:
>
>  14/12/12 11:51:29 INFO WholeTextFileRDD: Input split: Paths:/tmp/
> anywebsite.com/anywebsite.com/index-2.html:0+6192,/tmp/anywebsite.com/anywebsite.com/gallery.html:0+3258,/tmp/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/tmp/anywebsite.com/anywebsite.com/jquery.html:0+326,/tmp/anywebsite.com/anywebsite.com/index.html:0+6174,/tmp/anywebsite.com/anywebsite.com/contact.html:0+3050,/tmp/anywebsite.com/anywebsite.com/archive.html:0+3247
>
>  Yet the trace information in the 'HDFS file' version shows that directories
> are also collected with sc.wholeTextFiles:
>
>  14/12/12 11:49:07 INFO WholeTextFileRDD: Input split: Paths:/graphx/
> anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
> 14/12/12 11:49:07 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID
> 1)
> java.io.FileNotFoundException: Path is not a file: /graphx/
> anywebsite.com/anywebsite.com/css
>         at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
>         at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
>
>  Should the HDFS version of wholeTextFiles behave the same as the local
> version as far as the treatment of directories and non-plain-text files is
> concerned?
>
>  Any help, advice or workaround suggestions would be much appreciated,
>
>  Thanks
> Karen
>
>  VERSION INFO
> Ubuntu 14.04
> Spark 1.1.1
> Hadoop 2.5.2
> Scala 2.10.4
>
>  FULL STACK TRACE
> 14/12/12 12:02:31 INFO WholeTextFileRDD: Input split: Paths:/graphx/
> anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
> 14/12/12 12:02:31 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID
> 1)
> java.io.FileNotFoundException: Path is not a file: /graphx/
> anywebsite.com/anywebsite.com/css
>         at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
>         at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
>         at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
>         at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
>         at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>
>          at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>         at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>         at
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1167)
>         at
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1155)
>         at
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1145)
>         at
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:268)
>         at
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:235)
>         at
> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:228)
>         at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1318)
>         at
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:293)
>         at
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:289)
>         at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:289)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>         at
> org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:60)
>         at
> org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
>         at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
>         at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>         at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>         at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>         at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>         at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
>         at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>         at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>         at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>         at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>         at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>         at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
>         at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
>         at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
>         at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
>         at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): Path
> is not a file: /graphx/anywebsite.com/anywebsite.com/css
>         at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
>         at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
>         at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
>         at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
>         at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
>
>          at org.apache.hadoop.ipc.Client.call(Client.java:1411)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
>         at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:225)
>         at
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1165)
>         ... 37 more
> 14/12/12 12:02:31 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1,
> localhost): java.io.FileNotFoundException: Path is not a file: /graphx/
> anywebsite.com/anywebsite.com/css
>         at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
>         at
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
>         at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
>         at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
>         at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>
>
