Re: Parsing one big multiple line .xml loaded in RDD using Python

2014-10-08 Thread jan.zikes

Thank you, this seems to be the way to go, but unfortunately, when I try to
use sc.wholeTextFiles() on a file that is stored on Amazon S3, I get the
following error:
 
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1
Traceback (most recent call last):
  File "/root/distributed_rdd_test.py", line 27, in <module>
    result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
  File "/root/spark/python/pyspark/rdd.py", line 1126, in take
    totalParts = self._jrdd.partitions().size()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.partitions.
: java.io.FileNotFoundException: File does not exist: /wikiinput/wiki.xml.gz
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:280)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:240)
at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:220)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:56)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:50)
at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
 
My code is the following:

sc = SparkContext(appName="Process wiki")
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
result = distData.flatMap(gensim.corpora.wikicorpus.extract_pages).take(10)
for item in result:
        print item.getvalue()
sc.stop()
 
So my question is: is it possible to read whole files from S3? Based on the
documentation it should be possible, but it does not seem to work for me.
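
In case credentials are a factor, here is a minimal sketch of passing them to
Hadoop's s3n filesystem explicitly (the fs.s3n.* property names are the
standard Hadoop ones, the key values are placeholders, and sc._jsc is the
underlying Java context that PySpark exposes). Notably, the traceback above
shows DistributedFileSystem handling the path, as if the s3n:// scheme were
being dropped somewhere:

from pyspark import SparkContext

sc = SparkContext(appName="Process wiki")
# Hand the S3 credentials to Hadoop's s3n filesystem explicitly.
conf = sc._jsc.hadoopConfiguration()
conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")      # placeholder
conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")  # placeholder
distData = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')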
 
__

From: Davies Liu dav...@databricks.com
To: jan.zi...@centrum.cz
Date: 07.10.2014 17:38
Subject: Re: Parsing one big multiple line .xml loaded in RDD using Python

CC: u...@spark.incubator.apache.org

Maybe sc.wholeTextFiles() is what you want; you can get the whole text
and parse it yourself.

On Tue, Oct 7, 2014 at 1:06 AM,  jan.zi...@centrum.cz wrote:

Hi,

I have already unsuccessfully asked a quite similar question on Stack Overflow,
particularly here:
http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim.
I've also tried a workaround, unsuccessfully; the workaround problem can be found at
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html.

Particularly, what I'm trying to do: I have an .xml dump of Wikipedia as the
input. The .xml is quite big and it spreads across multiple lines. You can
check it out at
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.

My goal is to parse this .xml in the same way as
gensim.corpora.wikicorpus.extract_pages does; the implementation is at
https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py.

Parsing one big multiple line .xml loaded in RDD using Python

2014-10-07 Thread jan.zikes

Hi,

I have already unsuccessfully asked a quite similar question on Stack Overflow,
particularly here:
http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim.
I've also tried a workaround, unsuccessfully; the workaround problem can be found at
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html.

Particularly, what I'm trying to do: I have an .xml dump of Wikipedia as the
input. The .xml is quite big and it spreads across multiple lines. You can
check it out at
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
My goal is to parse this .xml in the same way as
gensim.corpora.wikicorpus.extract_pages does; the implementation is at
https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py.
Unfortunately this method does not work out of the box, because RDD.flatMap()
processes the RDD line by line, as strings.
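
To make the problem concrete, here is a small sketch of the behaviour I mean
(the local path is just for illustration):

from pyspark import SparkContext
import gensim.corpora.wikicorpus

sc = SparkContext(appName="Process wiki")
# sc.textFile() yields one record per line, so every RDD element is a
# single line of XML such as '<mediawiki ...>' or '  <page>', never a
# complete document.
lines = sc.textFile('wiki.xml')  # illustrative local path
# extract_pages() expects a file-like object over the whole XML, so
# applying it to each line via flatMap() cannot produce any pages:
result = lines.flatMap(gensim.corpora.wikicorpus.extract_pages)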

Does anyone have a suggestion for how to parse the Wikipedia .xml loaded in an
RDD using Python?

Thank you in advance for any suggestions, advice, or hints.



Re: Parsing one big multiple line .xml loaded in RDD using Python

2014-10-07 Thread Davies Liu
Maybe sc.wholeTextFiles() is what you want; you can get the whole text
and parse it yourself.
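
Something along these lines might work (an untested sketch: wholeTextFiles()
returns (filename, content) pairs, and extract_pages() expects a file-like
object, so the content is wrapped in a StringIO; the bucket path is the one
from your code):

from StringIO import StringIO
from pyspark import SparkContext
import gensim.corpora.wikicorpus

sc = SparkContext(appName="Process wiki")
# One (filename, whole-file content) pair per file in the directory.
files = sc.wholeTextFiles('s3n://wiki-dump/wikiinput')
# Wrap each file's content in a file-like object for extract_pages().
pages = files.flatMap(
    lambda kv: gensim.corpora.wikicorpus.extract_pages(StringIO(kv[1])))
print pages.take(10)
sc.stop()

Note that each whole file becomes a single record, so this only scales to
files that fit in a single executor's memory.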

On Tue, Oct 7, 2014 at 1:06 AM,  jan.zi...@centrum.cz wrote:
 Hi,

 I have already unsuccessfully asked a quite similar question on Stack Overflow,
 particularly here:
 http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim.
 I've also tried a workaround, unsuccessfully; the workaround problem can be found at
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Python-using-generator-of-data-bigger-than-RAM-as-input-to-sc-parallelize-td15789.html.

 Particularly, what I'm trying to do: I have an .xml dump of Wikipedia as the
 input. The .xml is quite big and it spreads across multiple lines. You can
 check it out at
 http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.

 My goal is to parse this .xml in the same way as
 gensim.corpora.wikicorpus.extract_pages does; the implementation is at
 https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py.
 Unfortunately this method does not work out of the box, because RDD.flatMap()
 processes the RDD line by line, as strings.

 Does anyone have a suggestion for how to parse the Wikipedia .xml loaded in
 an RDD using Python?

 Thank you in advance for any suggestions, advice, or hints.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org