Thank you, this seems to be the way to go, but unfortunately, when I try to
use sc.wholeTextFiles() on a file that is stored on Amazon S3, I get the
following error:
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
  File "/root/distributed_rdd_test.py", line 27, in <module>
    result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
My code is the following:
sc = SparkContext(appName='Process wiki')
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
for item in result:
    print item.getvalue()
sc.stop()
So my question is: is it possible to read whole files from S3? Based on the
documentation it should be possible, but it seems that it does not work for me.
Judging from the stack trace, the path /wikiinput/wiki.xml.gz is being looked
up on HDFS via DistributedFileSystem rather than on S3.
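One workaround I still plan to try is passing the S3 credentials explicitly
through the Hadoop configuration instead of relying on the environment. A
minimal sketch, untested; the fs.s3n.* keys are the standard Hadoop settings,
the placeholder credentials and the use of PySpark's internal sc._jsc handle
are my own assumptions:

from pyspark import SparkContext

sc = SparkContext(appName='Process wiki')

# sc._jsc is PySpark's internal handle to the underlying JavaSparkContext;
# set the standard Hadoop s3n credential keys on its configuration.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3n.awsAccessKeyId', 'YOUR_ACCESS_KEY_ID')    # placeholder
hadoop_conf.set('fs.s3n.awsSecretAccessKey', 'YOUR_SECRET_KEY')   # placeholder

# With credentials set, retry reading whole files from S3 and list
# which paths were actually found.
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
print distData.map(lambda kv: kv[0]).collect()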
__
From: Davies Liu dav...@databricks.com
To: jan.zi...@centrum.cz
Date: 07.10.2014 17:38
Subject: Re: Parsing one big multiple line .xml loaded in RDD using Python
CC: u...@spark.incubator.apache.org
Maybe sc.wholeTextFiles() is what you want; you can get the whole text
and parse it yourself.
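For example, something like this (just a sketch, untested: wholeTextFiles()
returns (filename, content) pairs, so the content needs unpacking first, and
wrapping it in StringIO assumes extract_pages wants a file-like object):

from StringIO import StringIO  # Python 2 file-like wrapper around a string

import gensim.corpora.wikicorpus

# Each RDD element is a (path, whole-file-content) pair.
pairs = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')

# Feed only the content to the parser, wrapped as a file-like object.
pages = pairs.flatMap(
    lambda kv: gensim.corpora.wikicorpus.extract_pages(StringIO(kv[1])))
print pages.take(1)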
On Tue, Oct 7, 2014 at 1:06 AM, jan.zi...@centrum.cz wrote:
Hi,
I have already unsuccessfully asked a quite similar question on Stack Overflow,
particularly here:
http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim
I've also tried a workaround, again without success; the problem with that
workaround can be found at
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html
In particular, what I'm trying to do: I have an .xml dump of Wikipedia as the
input. The .xml is quite big and it spreads across multiple lines. You can
check it out at
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
My goal is to parse this .xml in the same way as
gensim.corpora.wikicorpus.extract_pages does; the implementation is at
https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py
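For reference, here is a minimal local (non-Spark) sketch of the behaviour I'm
trying to replicate, assuming the bz2-compressed dump is available on disk
(the file name and the ten-page cut-off are just for illustration):

import bz2

from gensim.corpora import wikicorpus

# Stream pages straight out of the compressed dump, one <page> at a time,
# without decompressing the whole file first.
with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as f:
    for i, page in enumerate(wikicorpus.extract_pages(f)):
        print page       # one entry per <page> element in the dump
        if i >= 9:       # stop after the first ten pages
            break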