I've tried adding / at the end of the path, but the result was exactly the
same. I also suspect there may be some problem at the level of Hadoop - S3
communication. Do you know if there is some way to run scripts from Spark
on, for example, a different Hadoop version than the standard EC2
installation?
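For what it's worth, the spark-ec2 launch script has historically accepted a flag to pick the Hadoop major version when the cluster is created. This is only a sketch; the exact flag names vary by Spark release, so verify against the script's --help output before relying on it:

```shell
# Hypothetical launch invocation; check ./spark-ec2 --help for your release.
# --hadoop-major-version selects which Hadoop the AMI is set up with.
./spark-ec2 --key-pair=mykey --identity-file=mykey.pem \
  --hadoop-major-version=2 \
  launch my-cluster
```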
______________________________________________________________
From: Sean Owen <so...@cloudera.com>
To: <jan.zi...@centrum.cz>
Date: 08.10.2014 18:05
Subject: Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File
does not exist:
CC: "user@spark.apache.org"
Take this as a bit of a guess, since I don't use S3 much and am only a
bit aware of the Hadoop+S3 integration issues. But I know that S3's
lack of proper directories causes a few issues when used with Hadoop,
which wants to list directories.
According to
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
... I wonder if you simply need to end the path with "/" to make it
clear you mean it as a directory. Hadoop S3 OutputFormats are going to
append ..._$folder$ files to mark directories too, although I don't
think it's required necessarily to read them as dirs.
I still imagine there could be some problem between Hadoop and Spark in
this regard, but it's worth trying the path change first. You do need
s3n:// for sure.
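The suggestion above can be sketched as a small plain-Python helper (the function name is made up for illustration) that normalizes a path before handing it to wholeTextFiles(): force the s3n:// scheme and append a trailing "/" so Hadoop treats it as a directory:

```python
def as_s3n_dir(path):
    """Normalize an S3 path for Hadoop: use the s3n:// scheme and
    end with '/' so Hadoop lists it as a directory."""
    # Rewrite a bare s3:// scheme to s3n:// (the native S3 filesystem).
    if path.startswith("s3://"):
        path = "s3n://" + path[len("s3://"):]
    # Mark the path explicitly as a directory.
    if not path.endswith("/"):
        path += "/"
    return path

# Usage (inside a Spark job):
#   distData = sc.wholeTextFiles(as_s3n_dir("s3://wiki-dump/wikiinput"))
print(as_s3n_dir("s3://wiki-dump/wikiinput"))  # s3n://wiki-dump/wikiinput/
```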
On Wed, Oct 8, 2014 at 4:54 PM, <jan.zi...@centrum.cz> wrote:
One more update: I've realized that this problem is not only Python-related.
I've tried it also in Scala, but I'm still getting the same error. My Scala
code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()
______________________________________________________________
My additional question is whether this problem could be caused by the fact
that my file is bigger than the RAM available across the whole cluster?
______________________________________________________________
Hi
I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon S3,
and I'm getting the following error:
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
  File "/root/distributed_rdd_test.py", line 27, in <module>
    result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
  at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
  at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
  at py4j.Gateway.invoke(Gateway.java:259)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:207)
  at java.lang.Thread.run(Thread.java:745)
My code is the following:

sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
for item in result:
    print item.getvalue()
sc.stop()
So my question is: is it possible to read whole files from S3? Based on the
documentation it should be possible, but it seems that it does not work for
me.
When I do just:
sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10)
print distData
Then the error that I'm getting is exactly the same.
Thank you in advance for any advice.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org