Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:
I faced similar issue with "wholeTextFiles" function due to version compatibility. Spark 1.0 with Hadoop 2.4.1 worked. Did you try other function such as "textFile" to check if the issue is specific to "wholeTextFiles"? Spark needs to be re-compiled for different hadoop versions. However, you can keep multiple Spark directories in your system compiled with different versions. On Thu, Oct 9, 2014 at 1:15 PM, wrote: > I've tried to add / at the end of the path, but the result was exactly the > same. I also guess that there will be some problem on the level of Hadoop - > S3 comunication. Doy you know if there is some possibility of how tu run > scripts from Spark on for example different hadoom version from the > standard EC2 installation? > > __ > > Od: Sean Owen > > Komu: > > Datum: 08.10.2014 18:05 > > Předmět: Re: SparkContext.wholeTextFiles() > java.io.FileNotFoundException: File does not exist: > > > > > CC: "user@spark.apache.org" > > Take this as a bit of a guess, since I don't use S3 much and am only a > bit aware of the Hadoop+S3 integration issues. But I know that S3's > lack of proper directories causes a few issues when used with Hadoop, > which wants to list directories. > > According to > http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html > ... I wonder if you simply need to end the path with "/" to make it > clear you mean it as a directory. Hadoop S3 OutputFormats are going to > append ..._$folder$ files to mark directories too, although I don't > think it's required necessarily to read them as dirs. > > I still imagine there could be some problem between Hadoop in Spark in > this regard, but worth trying the path thing first. You do need s3n:// > for sure. > > On Wed, Oct 8, 2014 at 4:54 PM, wrote: > > One more update: I've realized that this problem is not only Python > related. > > I've tried it also in Scala, but I'm still getting the same error, my > scala > > code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first() > > > > __ > > > > > > My additional question is if this problem can be possibly caused by the > fact > > that my file is bigger than RAM memory across the whole cluster? > > > > > > > > __ > > > > Hi > > > > I'm trying to use sc.wholeTextFiles() on file that is stored amazon S3 > I'm > > getting following Error: > > > > > > > > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to > process : > > 1 > > > > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to > process : > > 1 > > > > Traceback (most recent call last): > > > > File "/root/distributed_rdd_test.py", line 27, in > > > > result = > > distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) > > > > File "/root/spark/python/pyspark/rdd.py", line 1126, in take > > > > totalParts = self._jrdd.partitions().size() > > > > File > "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > > line 538, in __call__ > > > > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", > line > > 300, in get_return_value > > > > py4j.protocol.Py4JJavaError: An error occurred while calling > o30.partitions. > > > > : java.io.FileNotFoundException: File does not exist: > /wikiinput/wiki.xml.gz > > > > at > > > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) > > > > at > > > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489) > > > > at > > > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280) > > > > at > > > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240) > > > > at > > > org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220) > > > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > > > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > > > > at scala.Option.getOrElse(Option.scala:120) > > > > at org.apache.spark.rdd
Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:
I've tried to add / at the end of the path, but the result was exactly the same. I also guess that there will be some problem on the level of Hadoop - S3 comunication. Doy you know if there is some possibility of how tu run scripts from Spark on for example different hadoom version from the standard EC2 installation? __ Od: Sean Owen Komu: Datum: 08.10.2014 18:05 Předmět: Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist: CC: "user@spark.apache.org" Take this as a bit of a guess, since I don't use S3 much and am only a bit aware of the Hadoop+S3 integration issues. But I know that S3's lack of proper directories causes a few issues when used with Hadoop, which wants to list directories. According to http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html <http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html> ... I wonder if you simply need to end the path with "/" to make it clear you mean it as a directory. Hadoop S3 OutputFormats are going to append ..._$folder$ files to mark directories too, although I don't think it's required necessarily to read them as dirs. I still imagine there could be some problem between Hadoop in Spark in this regard, but worth trying the path thing first. You do need s3n:// for sure. On Wed, Oct 8, 2014 at 4:54 PM, wrote: One more update: I've realized that this problem is not only Python related. I've tried it also in Scala, but I'm still getting the same error, my scala code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first() __ My additional question is if this problem can be possibly caused by the fact that my file is bigger than RAM memory across the whole cluster? __ Hi I'm trying to use sc.wholeTextFiles() on file that is stored amazon S3 I'm getting following Error: 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1 Traceback (most recent call last): File "/root/distributed_rdd_test.py", line 27, in result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) File "/root/spark/python/pyspark/rdd.py", line 1126, in take totalParts = self._jrdd.partitions().size() File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions. : java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240) at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50) at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) My code is following: sc = SparkContext(appName="Process wiki"
Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:
Take this as a bit of a guess, since I don't use S3 much and am only a bit aware of the Hadoop+S3 integration issues. But I know that S3's lack of proper directories causes a few issues when used with Hadoop, which wants to list directories. According to http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html ... I wonder if you simply need to end the path with "/" to make it clear you mean it as a directory. Hadoop S3 OutputFormats are going to append ..._$folder$ files to mark directories too, although I don't think it's required necessarily to read them as dirs. I still imagine there could be some problem between Hadoop in Spark in this regard, but worth trying the path thing first. You do need s3n:// for sure. On Wed, Oct 8, 2014 at 4:54 PM, wrote: > One more update: I've realized that this problem is not only Python related. > I've tried it also in Scala, but I'm still getting the same error, my scala > code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first() > > __ > > > My additional question is if this problem can be possibly caused by the fact > that my file is bigger than RAM memory across the whole cluster? > > > > __ > > Hi > > I'm trying to use sc.wholeTextFiles() on file that is stored amazon S3 I'm > getting following Error: > > > > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : > 1 > > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : > 1 > > Traceback (most recent call last): > > File "/root/distributed_rdd_test.py", line 27, in > > result = > distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) > > File "/root/spark/python/pyspark/rdd.py", line 1126, in take > > totalParts = self._jrdd.partitions().size() > > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 538, in __call__ > > File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line > 300, in get_return_value > > py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions. > > : java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz > > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) > > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489) > > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280) > > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240) > > at > org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220) > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > > at scala.Option.getOrElse(Option.scala:120) > > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > > at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56) > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > > at scala.Option.getOrElse(Option.scala:120) > > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > > at > org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50) > > at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:606) > > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > > at py4j.Gateway.invoke(Gateway.java:259) > > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > > at py4j.commands.CallCommand.execute(CallCommand.java:79) > > at py4j.GatewayConnection.run(GatewayConnection.java:207) > > > > at java.lang.Thread.run(Thread.java:745) > > > > My code is following: > > > > sc = SparkContext(appName="Process wiki") > > distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput') > > result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) > > for item in result: > > print item.getvalue() > > sc.stop() > > > > So my question is, is it possible to read whole files from S3? Based on the > documentation it shouold be possible, but it seems that it does not work for > me. > > > > When I do just: > > > > sc = SparkContext(appName="Process wiki") > > distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10) > > print distData > > > > Then the error that I'm getting is exactly the same. > > >
Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:
One more update: I've realized that this problem is not only Python related. I've tried it also in Scala, but I'm still getting the same error, my scala code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first() __ My additional question is if this problem can be possibly caused by the fact that my file is bigger than RAM memory across the whole cluster? __ Hi I'm trying to use sc.wholeTextFiles() on file that is stored amazon S3 I'm getting following Error: 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1 Traceback (most recent call last): File "/root/distributed_rdd_test.py", line 27, in result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) File "/root/spark/python/pyspark/rdd.py", line 1126, in take totalParts = self._jrdd.partitions().size() File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions. : java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240) at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50) at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) My code is following: sc = SparkContext(appName="Process wiki") distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput') result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) for item in result: print item.getvalue() sc.stop() So my question is, is it possible to read whole files from S3? Based on the documentation it shouold be possible, but it seems that it does not work for me. When I do just: sc = SparkContext(appName="Process wiki") distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10) print distData Then the error that I'm getting is exactly the same. Thank you in advance for any advice. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:
My additional question is if this problem can be possibly caused by the fact that my file is bigger than RAM memory across the whole cluster? __ Hi I'm trying to use sc.wholeTextFiles() on file that is stored amazon S3 I'm getting following Error: 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1 Traceback (most recent call last): File "/root/distributed_rdd_test.py", line 27, in result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) File "/root/spark/python/pyspark/rdd.py", line 1126, in take totalParts = self._jrdd.partitions().size() File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__ File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions. : java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240) at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50) at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) My code is following: sc = SparkContext(appName="Process wiki") distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput') result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10) for item in result: print item.getvalue() sc.stop() So my question is, is it possible to read whole files from S3? Based on the documentation it shouold be possible, but it seems that it does not work for me. When I do just: sc = SparkContext(appName="Process wiki") distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10) print distData Then the error that I'm getting is exactly the same. Thank you in advance for any advice. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org