
Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-09 Thread Rahul Kumar Singh

I faced a similar issue with the "wholeTextFiles" function due to a version
incompatibility. Spark 1.0 with Hadoop 2.4.1 worked. Did you try another
function, such as "textFile", to check whether the issue is specific to
"wholeTextFiles"?
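
The check suggested above could be sketched roughly as follows. This is a minimal sketch under assumptions, not a confirmed diagnostic procedure: `compare_readers` is a hypothetical helper (not part of Spark), and `sc` is assumed to be an already running SparkContext. It relies only on `textFile`, `wholeTextFiles`, and `first()` from the PySpark API.

```python
def compare_readers(sc, path):
    """Try both readers on the same path and record which ones fail.

    `sc` is assumed to be a live SparkContext; `compare_readers` itself is
    a hypothetical helper, not part of Spark.
    """
    outcome = {}
    for reader in ("textFile", "wholeTextFiles"):
        try:
            # first() forces Spark to compute the partitions, which is the
            # step where the traceback in this thread fails.
            getattr(sc, reader)(path).first()
            outcome[reader] = "ok"
        except Exception as exc:  # in PySpark, Py4JJavaError wraps the Java error
            outcome[reader] = "failed: " + type(exc).__name__
    return outcome
```

In a PySpark shell, `compare_readers(sc, "s3n://wiki-dump/wikiinput")` would show whether only `wholeTextFiles` fails on that path.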

Spark needs to be recompiled for each Hadoop version. However, you
can keep multiple Spark directories on your system, each compiled against
a different Hadoop version.

On Thu, Oct 9, 2014 at 1:15 PM,  wrote:

> I've tried adding / at the end of the path, but the result was exactly the
> same. I also suspect that the problem lies in the Hadoop - S3
> communication. Do you know whether it is possible to run Spark scripts
> against, for example, a different Hadoop version from the one in the
> standard EC2 installation?
>
> __
> > From: Sean Owen
> > To:
> > Date: 08.10.2014 18:05
> > Subject: Re: SparkContext.wholeTextFiles()
> java.io.FileNotFoundException: File does not exist:
> >
>
> > CC: "user@spark.apache.org"
>
> Take this as a bit of a guess, since I don't use S3 much and am only a
> bit aware of the Hadoop+S3 integration issues. But I know that S3's
> lack of proper directories causes a few issues when used with Hadoop,
> which wants to list directories.
>
> According to
> http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
> ... I wonder if you simply need to end the path with "/" to make it
> clear you mean it as a directory. Hadoop S3 OutputFormats are going to
> append ..._$folder$ files to mark directories too, although I don't
> think it's required necessarily to read them as dirs.
>
> I still imagine there could be some problem between Hadoop and Spark in
> this regard, but worth trying the path thing first. You do need s3n://
> for sure.
>
> On Wed, Oct 8, 2014 at 4:54 PM,   wrote:
> > One more update: I've realized that this problem is not specific to
> > Python. I also tried it in Scala and still get the same error. My Scala
> > code: val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()
> >
> > __
> >
> >
> > My additional question is whether this problem could be caused by my
> > file being larger than the total RAM across the whole cluster?
> >
> >
> >
> > __
> >
> > Hi
> >
> > I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon
> > S3, and I'm getting the following error:
> >
> >
> >
> > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to
> process :
> > 1
> >
> > 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to
> process :
> > 1
> >
> > Traceback (most recent call last):
> >
> >   File "/root/distributed_rdd_test.py", line 27, in 
> >
> > result =
> > distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
> >
> >   File "/root/spark/python/pyspark/rdd.py", line 1126, in take
> >
> > totalParts = self._jrdd.partitions().size()
> >
> >   File
> "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> > line 538, in __call__
> >
> >   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line
> > 300, in get_return_value
> >
> > py4j.protocol.Py4JJavaError: An error occurred while calling
> o30.partitions.
> >
> > : java.io.FileNotFoundException: File does not exist:
> /wikiinput/wiki.xml.gz
> >
> > at
> >
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
> >
> > at
> >
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489)
> >
> > at
> >
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
> >
> > at
> >
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
> >
> > at
> >
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
> >
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> >
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> >
> > at scala.Option.getOrElse(Option.scala:120)
> >
> > at org.apache.spark.rdd

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-09 Thread jan.zikes


I've tried adding / at the end of the path, but the result was exactly the
same. I also suspect that the problem lies in the Hadoop - S3
communication. Do you know whether it is possible to run Spark scripts
against, for example, a different Hadoop version from the one in the
standard EC2 installation?
__

From: Sean Owen
To:
Date: 08.10.2014 18:05
Subject: Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File
does not exist:

CC: "user@spark.apache.org"

Take this as a bit of a guess, since I don't use S3 much and am only a
bit aware of the Hadoop+S3 integration issues. But I know that S3's
lack of proper directories causes a few issues when used with Hadoop,
which wants to list directories.

According to
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
... I wonder if you simply need to end the path with "/" to make it
clear you mean it as a directory. Hadoop S3 OutputFormats are going to
append ..._$folder$ files to mark directories too, although I don't
think it's required necessarily to read them as dirs.

I still imagine there could be some problem between Hadoop and Spark in
this regard, but worth trying the path thing first. You do need s3n://
for sure.

On Wed, Oct 8, 2014 at 4:54 PM,   wrote:

One more update: I've realized that this problem is not specific to Python.
I also tried it in Scala and still get the same error. My Scala code:
val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()

__


My additional question is whether this problem could be caused by my
file being larger than the total RAM across the whole cluster?



__

Hi

I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon S3,
and I'm getting the following error:



14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
  File "/root/distributed_rdd_test.py", line 27, in <module>
    result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)



My code is as follows:



sc = SparkContext(appName="Process wiki"

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-08 Thread Sean Owen

Take this as a bit of a guess, since I don't use S3 much and am only a
bit aware of the Hadoop+S3 integration issues. But I know that S3's
lack of proper directories causes a few issues when used with Hadoop,
which wants to list directories.

According to 
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html
... I wonder if you simply need to end the path with "/" to make it
clear you mean it as a directory. Hadoop S3 OutputFormats are going to
append ..._$folder$ files to mark directories too, although I don't
think it's required necessarily to read them as dirs.

I still imagine there could be some problem between Hadoop and Spark in
this regard, but worth trying the path thing first. You do need s3n://
for sure.
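
The two path suggestions above (use the s3n:// scheme and end the path with "/") could be combined in a small helper. This is only a sketch: `s3n_dir` is a hypothetical name, the bucket and prefix are taken from the code quoted below, and whether the trailing slash actually resolves the error is not established here.

```python
def s3n_dir(bucket, prefix):
    """Build an s3n:// URI for a directory-like prefix, ending in "/".

    `s3n_dir` is a hypothetical helper for illustration only.
    """
    prefix = prefix.strip("/")
    # An empty prefix means the bucket root, which already ends in "/".
    return "s3n://{}/{}".format(bucket, prefix + "/" if prefix else "")


# Bucket and prefix as used in the code quoted in this thread.
path = s3n_dir("wiki-dump", "wikiinput")
print(path)
```

The resulting string would then be passed to `sc.wholeTextFiles(path)` in place of the path without the trailing slash.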

On Wed, Oct 8, 2014 at 4:54 PM,   wrote:
> One more update: I've realized that this problem is not specific to Python.
> I also tried it in Scala and still get the same error. My Scala code:
> val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()
>
> __
>
>
> My additional question is whether this problem could be caused by my
> file being larger than the total RAM across the whole cluster?
>
>
>
> __
>
> Hi
>
> I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon S3,
> and I'm getting the following error:
>
>
>
> 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process :
> 1
>
> 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process :
> 1
>
> Traceback (most recent call last):
>
>   File "/root/distributed_rdd_test.py", line 27, in 
>
> result =
> distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
>
>   File "/root/spark/python/pyspark/rdd.py", line 1126, in take
>
> totalParts = self._jrdd.partitions().size()
>
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 300, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
>
> : java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
>
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>
> at
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489)
>
> at
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
>
> at
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
>
> at
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>
> at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>
> at scala.Option.getOrElse(Option.scala:120)
>
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>
> at
> org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
>
> at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>
> at py4j.Gateway.invoke(Gateway.java:259)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
>
>
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> My code is as follows:
>
>
>
> sc = SparkContext(appName="Process wiki")
>
> distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
>
> result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
>
> for item in result:
>
> print item.getvalue()
>
> sc.stop()
>
>
>
> So my question is, is it possible to read whole files from S3? Based on the
> documentation it should be possible, but it seems that it does not work for
> me.
>
>
>
> When I do just:
>
>
>
> sc = SparkContext(appName="Process wiki")
>
> distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10)
>
> print distData
>
>
>
> Then the error that I'm getting is exactly the same.
>
>
>

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-08 Thread jan.zikes


One more update: I've realized that this problem is not specific to Python.
I also tried it in Scala and still get the same error. My Scala code:
val file = sc.wholeTextFiles("s3n://wiki-dump/wikiinput").first()
__
My additional question is whether this problem could be caused by my
file being larger than the total RAM across the whole cluster?
 
__
Hi
I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon S3,
and I'm getting the following error:
 
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
  File "/root/distributed_rdd_test.py", line 27, in <module>
    result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
 
My code is as follows:
 
sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput') 
result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
for item in result:
        print item.getvalue()
sc.stop()
 
So my question is, is it possible to read whole files from S3? Based on the 
documentation it should be possible, but it seems that it does not work for me.
 
When I do just:
 
sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10)
print distData
 
Then the error that I'm getting is exactly the same.
 
Thank you in advance for any advice.
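
One detail in the trace above may be worth noting: the failing call is DistributedFileSystem.getFileStatus, and the reported path /wikiinput/wiki.xml.gz carries neither the s3n scheme nor the wiki-dump bucket. The sketch below only illustrates, using Python's standard urllib.parse, that this is what remains of an S3 URI once the scheme and bucket are dropped. The full object URI is an assumption pieced together from the input path and the error message, and this is one reading of the trace, not a confirmed diagnosis.

```python
from urllib.parse import urlparse

# Assumed full URI of the object: the input path from the post plus the
# file name that appears in the error message.
uri = "s3n://wiki-dump/wikiinput/wiki.xml.gz"

parts = urlparse(uri)
print(parts.scheme)  # file system scheme
print(parts.netloc)  # bucket name
print(parts.path)    # what is left once scheme and bucket are dropped
```

If that reading is right, the path was looked up on the cluster's default file system rather than on S3, which would fit the earlier suspicion in this thread about the Hadoop and S3 interaction.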


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-08 Thread jan.zikes


My additional question is whether this problem could be caused by my
file being larger than the total RAM across the whole cluster?
 
__
Hi
I'm trying to use sc.wholeTextFiles() on a file that is stored on Amazon S3,
and I'm getting the following error:
 
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
  File "/root/distributed_rdd_test.py", line 27, in <module>
    result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
 
My code is as follows:
 
sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput') 
result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
for item in result:
        print item.getvalue()
sc.stop()
 
So my question is, is it possible to read whole files from S3? Based on the 
documentation it should be possible, but it seems that it does not work for me.
 
When I do just:
 
sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput').take(10)
print distData
 
Then the error that I'm getting is exactly the same.
 
Thank you in advance for any advice.


