I'm not quite sure whether Spark will descend into subdirectories
and pick up the files inside them. You could do something like the
following to bring all the files up into one directory (-type f
restricts the move to regular files, and -mindepth 2 leaves files
that are already at the top level alone):
find . -mindepth 2 -type f -exec mv '{}' . \;
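Alternatively, since Hadoop input paths go through glob expansion,
you might be able to avoid the directory entries altogether by
matching only the files you want. A rough, untested sketch from the
spark-shell (assuming sc is the usual SparkContext, and assuming the
pages you care about all end in .html):

// Only .html files match the glob, so directories such as css/,
// js/ and images/ should never become input splits.
val pages = sc.wholeTextFiles(
  "hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/*.html")
pages.keys.collect().foreach(println)  // sanity-check which paths were read

I haven't tried this against your exact layout, but it would save
you from having to flatten the directory tree first.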
Thanks
Best Regards
On Fri, Dec 12, 2014 at 6:34 PM, Karen Murphy <k.l.mur...@qub.ac.uk> wrote:
When I try to load a text file from an HDFS path using
sc.wholeTextFiles("hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/")
I get the following error:
java.io.FileNotFoundException: Path is not a file:
/graphx/anywebsite.com/anywebsite.com/css
(full stack trace at bottom of message).
If I switch my Scala code to read the input file from the local
disk instead, wholeTextFiles doesn't pick up directories (such as
css in this case) and no exception is raised.
The trace information in the 'local file' version shows that only
plain text files are collected with sc.wholeTextFiles:
14/12/12 11:51:29 INFO WholeTextFileRDD: Input split:
Paths:/tmp/anywebsite.com/anywebsite.com/index-2.html:0+6192,/tmp/anywebsite.com/anywebsite.com/gallery.html:0+3258,/tmp/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/tmp/anywebsite.com/anywebsite.com/jquery.html:0+326,/tmp/anywebsite.com/anywebsite.com/index.html:0+6174,/tmp/anywebsite.com/anywebsite.com/contact.html:0+3050,/tmp/anywebsite.com/anywebsite.com/archive.html:0+3247
Yet the trace information in the 'HDFS file' version shows that
directories too are collected by sc.wholeTextFiles (the entries
with length 0+0, such as css, js and images):
14/12/12 11:49:07 INFO WholeTextFileRDD: Input split:
Paths:/graphx/anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
14/12/12 11:49:07 ERROR Executor: Exception in task 1.0 in stage
0.0 (TID 1)
java.io.FileNotFoundException: Path is not a file:
/graphx/anywebsite.com/anywebsite.com/css
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
Should the HDFS version of wholeTextFiles behave the same as the
local version as far as the treatment of directories and non-plain-text
files is concerned?
Any help, advice or workaround suggestions would be much appreciated,
Thanks
Karen
VERSION INFO
Ubuntu 14.04
Spark 1.1.1
Hadoop 2.5.2
Scala 2.10.4
FULL STACK TRACE
14/12/12 12:02:31 INFO WholeTextFileRDD: Input split:
Paths:/graphx/anywebsite.com/anywebsite.com/archive.html:0+3247,/graphx/anywebsite.com/anywebsite.com/contact.html:0+3050,/graphx/anywebsite.com/anywebsite.com/css:0+0,/graphx/anywebsite.com/anywebsite.com/exhibitions.html:0+6663,/graphx/anywebsite.com/anywebsite.com/gallery.html:0+3258,/graphx/anywebsite.com/anywebsite.com/highslide:0+0,/graphx/anywebsite.com/anywebsite.com/highslideIndex:0+0,/graphx/anywebsite.com/anywebsite.com/images:0+0,/graphx/anywebsite.com/anywebsite.com/index-2.html:0+6192,/graphx/anywebsite.com/anywebsite.com/index.html:0+6174,/graphx/anywebsite.com/anywebsite.com/jquery.html:0+326,/graphx/anywebsite.com/anywebsite.com/js:0+0
14/12/12 12:02:31 ERROR Executor: Exception in task 1.0 in stage
0.0 (TID 1)
java.io.FileNotFoundException: Path is not a file:
/graphx/anywebsite.com/anywebsite.com/css
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1167)
at
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1155)
at
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1145)
at
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:268)
at
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:235)
at
org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:228)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1318)
at
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:293)
at
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:289)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:289)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
at
org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:60)
at
org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:138)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1143)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by:
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException):
Path is not a file: /graphx/anywebsite.com/anywebsite.com/css
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at org.apache.hadoop.ipc.Client.call(Client.java:1411)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:225)
at
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1165)
... 37 more
14/12/12 12:02:31 WARN TaskSetManager: Lost task 1.0 in stage 0.0
(TID 1, localhost): java.io.FileNotFoundException: Path is not a
file: /graphx/anywebsite.com/anywebsite.com/css
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:68)
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:54)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1795)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1738)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1718)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1690)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:519)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:337)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)