Thanks Akhil,

In line with your suggestion, I have used the following two commands to flatten the directory structure:

find . -type f -iname '*' -exec mv '{}' . \;
find . -type d -exec rm -rf '{}' \;
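
For the record, those commands print a few harmless errors because the second one also matches '.' and directories that have already been removed; assuming GNU find, a quieter equivalent would be something like:

find . -mindepth 1 -type d -prune -exec rm -rf '{}' +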

Kind Regards
Karen



On 12/12/14 13:25, Akhil Das wrote:
I'm not quite sure whether Spark will go inside subdirectories and pick up files from them. You could do something like the following to bring all the files into one directory.

        find . -iname '*' -exec mv '{}' . \;
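
        Alternatively (I haven't tested this), you might be able to avoid the subdirectories altogether by pointing wholeTextFiles at a glob that only matches files, something like:

        sc.wholeTextFiles("hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/*.html")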


Thanks
Best Regards

On Fri, Dec 12, 2014 at 6:34 PM, Karen Murphy <k.l.mur...@qub.ac.uk> wrote:


    When I try to load text files from an HDFS path using

    sc.wholeTextFiles("hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/")

    I get the following error:
    java.io.FileNotFoundException: Path is not a file:
    /graphx/anywebsite.com/anywebsite.com/css
    (full stack trace at the bottom of this message).

    If I switch my Scala code to read the input files from the local
    disk, wholeTextFiles doesn't pick up directories (such as css in
    this case) and no exception is raised.

    The trace information in the 'local file' version shows that only
    plain text files are collected with sc.wholeTextFiles:

    14/12/12 11:51:29 INFO WholeTextFileRDD: Input split:
    
Paths:/tmp/anywebsite.com/anywebsite.com/index-2.html:0+6192,/tmp/anywebsite.com/anywebsite.com/gallery.html:0+3258,/tmp/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/tmp/anywebsite.com/anywebsite.com/jquery.html:0+326,/tmp/anywebsite.com/anywebsite.com/index.html:0+6174,/tmp/anywebsite.com/anywebsite.com/contact.html:0+3050,/tmp/anywebsite.com/anywebsite.com/archive.html:0+3247
    

    Yet the trace information in the 'HDFS file' version shows that
    directories are also collected by sc.wholeTextFiles:

    14/12/12 11:49:07 INFO WholeTextFileRDD: Input split:
    
Paths:/graphx/anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
    
    14/12/12 11:49:07 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
    java.io.FileNotFoundException: Path is not a file: /graphx/anywebsite.com/anywebsite.com/css
            at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
            at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)

    Should the HDFS version behave the same as the local version of
    wholeTextFiles as far as the treatment of directories/non-plain-text
    files is concerned?

    Any help, advice or workaround suggestions would be much appreciated,

    Thanks
    Karen

    VERSION INFO
    Ubuntu 14.04
    Spark 1.1.1
    Hadoop 2.5.2
    Scala 2.10.4

    FULL STACK TRACE
    14/12/12 12:02:31 INFO WholeTextFileRDD: Input split:
    
Paths:/graphx/anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
    
    14/12/12 12:02:31 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
    java.io.FileNotFoundException: Path is not a file: /graphx/anywebsite.com/anywebsite.com/css
            at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
            at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
            at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
            at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
            at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
            at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:415)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
            at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

            at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
            at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
            at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
            at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
            at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1167)
            at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1155)
            at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1145)
            at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:268)
            at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:235)
            at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:228)
            at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1318)
            at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:293)
            at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:289)
            at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
            at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:289)
            at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
            at org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:60)
            at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
            at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
            at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
            at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
            at scala.collection.Iterator$class.foreach(Iterator.scala:727)
            at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
            at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
            at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
            at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
            at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
            at scala.collection.AbstractIterator.to(Iterator.scala:1157)
            at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
            at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
            at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
            at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
            at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
            at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
            at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
            at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
            at org.apache.spark.scheduler.Task.run(Task.scala:54)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException):
    Path is not a file: /graphx/anywebsite.com/anywebsite.com/css
            at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
            at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
            at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
            at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
            at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
            at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
            at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:415)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
            at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

            at org.apache.hadoop.ipc.Client.call(Client.java:1411)
            at org.apache.hadoop.ipc.Client.call(Client.java:1364)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
            at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:606)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
            at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
            at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
            at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:225)
            at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1165)
            ... 37 more
    14/12/12 12:02:31 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, localhost): java.io.FileNotFoundException: Path is not a file: /graphx/anywebsite.com/anywebsite.com/css
            at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
            at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
            at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
            at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
            at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)

