I've tried adding / at the end of the path, but the result was exactly the 
same. I also guess that there may be some problem at the level of the 
Hadoop - S3 communication. Do you know if there is some way to run scripts 
from Spark on, for example, a different Hadoop version than the standard EC2 
installation?
______________________________________________________________
From: Sean Owen <so...@cloudera.com>
To: <jan.zi...@centrum.cz>
CC: "user@spark.apache.org"
Date: 08.10.2014 18:05
Subject: Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File 
does not exist:

Take this as a bit of a guess, since I don't use S3 much and am only a
bit aware of the Hadoop+S3 integration issues. But I know that S3's
lack of proper directories causes a few issues when used with Hadoop,
which wants to list directories.

According to 
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
... I wonder if you simply need to end the path with "/" to make it
clear you mean it as a directory. Hadoop S3 OutputFormats are also going to
append ..._$folder$ files to mark directories, although I don't think
they're necessarily required in order to read them as directories.

I still imagine there could be some problem between Hadoop and Spark in
this regard, but it's worth trying the path change first. You do need s3n://
for sure.
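
If it helps, here is a minimal PySpark sketch of that suggestion. The bucket
and prefix are just the ones from your message, and the trailing slash is the
only thing being changed; whether it helps depends on how the objects were
written:

from pyspark import SparkContext

sc = SparkContext(appName="Trailing slash test")

# Same prefix as in your script, but spelled explicitly as a directory
# with a trailing "/" and the s3n:// scheme.
distData = sc.wholeTextFiles("s3n://wiki-dump/wikiinput/")

# If the directory listing works, this prints the path and the first
# 100 characters of one file.
path, content = distData.first()
print path, content[:100]

sc.stop()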

On Wed, Oct 8, 2014 at 4:54 PM,  <jan.zi...@centrum.cz> wrote:
One more update: I've realized that this problem is not only Python related.
I've also tried it in Scala, but I'm still getting the same error. My Scala
code:

val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()

______________________________________________________________


My additional question is whether this problem could be caused by the fact
that my file is bigger than the total RAM across the whole cluster.
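
In case memory does turn out to matter, here is a rough sketch of a
lighter-weight read, under the assumption that the input is the wiki.xml.gz
object under the same wikiinput prefix. textFile() yields one record per line
instead of one string per whole file, although a single .gz file is not
splittable and will still land in one partition:

from pyspark import SparkContext

sc = SparkContext(appName="Line-based read")

# One record per line of the decompressed XML instead of one string per
# whole file, so no single task has to hold the entire dump in memory.
# Assumed path: the wiki.xml.gz object under the same wikiinput prefix.
lines = sc.textFile("s3n://wiki-dump/wikiinput/wiki.xml.gz")
print lines.take(5)

sc.stop()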



______________________________________________________________

Hi

I'm trying to use sc.wholeTextFiles() on a file that is stored in Amazon S3,
and I'm getting the following error:



14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1

Traceback (most recent call last):
  File "/root/distributed_rdd_test.py", line 27, in <module>
    result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
    at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)



My code is the following:

from pyspark import SparkContext
import gensim.corpora.wikicorpus

sc = SparkContext(appName="Process wiki")

distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')

result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)

for item in result:
    print item.getvalue()

sc.stop()



So my question is, is it possible to read whole files from S3? Based on the
documentation it should be possible, but it seems that it does not work for
me.



When I do just:

sc = SparkContext(appName="Process wiki")

distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10)

print distData


Then the error that I'm getting is exactly the same.
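
One hedged way to narrow this down, sketched below: read the same prefix with
plain textFile() first. If that also fails, the problem is S3 access in
general, e.g. credentials, bucket name, or s3n:// support in the cluster's
Hadoop build, rather than wholeTextFiles() specifically; if it succeeds, the
problem is specific to how wholeTextFiles() lists the directory.

from pyspark import SparkContext

sc = SparkContext(appName="S3 sanity check")

# Plain textFile() over the same prefix: if even this fails, the issue is
# general S3 access, not wholeTextFiles().
print sc.textFile('s3n://wiki-dump/wikiinput').take(5)

sc.stop()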



Thank you in advance for any advice.



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
