The file itself is for now just the Wikipedia dump that can be downloaded from here:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
It's basically one big .xml file that I need to parse so that each line of the data
holds a title + text pair. For this I currently use
gensim.corpora.wikicorpus.extract_pages, which can be seen here:
https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/wikicorpus.py.
This returns a generator from which I'd like to make the RDD.
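Roughly what I have in mind is something like the following (just a sketch; the path
and batch size are placeholders, and sc is the SparkContext). The problem is that
feeding the generator through the driver in batches still goes through a single
machine, which is what I'd like to avoid for data bigger than RAM:

    import bz2
    from gensim.corpora.wikicorpus import extract_pages

    dump_path = 'enwiki-latest-pages-articles.xml.bz2'  # placeholder path

    def page_lines():
        # Depending on the gensim version, extract_pages yields
        # (title, text) or (title, text, pageid) tuples.
        with bz2.BZ2File(dump_path) as f:
            for page in extract_pages(f):
                title, text = page[0], page[1]
                yield title + '\t' + (text or '').replace('\n', ' ')

    # Parallelize the generator in batches and union the pieces.
    batch_size = 10000  # placeholder
    rdd, batch = None, []
    for line in page_lines():
        batch.append(line)
        if len(batch) == batch_size:
            part = sc.parallelize(batch)
            rdd = part if rdd is None else rdd.union(part)
            batch = []
    if batch:
        part = sc.parallelize(batch)
        rdd = part if rdd is None else rdd.union(part)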
______________________________________________________________
From: Steve Lewis <lordjoe2...@gmail.com>
To: <jan.zi...@centrum.cz>
Date: 07.10.2014 01:25
Subject: Re: Spark and Python using generator of data bigger than RAM as input 
to sc.parallelize()

Say more about the one file you have - is the file itself large, and is it text?

Here are 3 samples - I have tested the first two in Spark like this:

    Class inputFormatClass = MGFInputFormat.class;
    Class keyClass = String.class;
    Class valueClass = String.class;
    JavaPairRDD<String, String> spectraAsStrings = ctx.newAPIHadoopFile(
            path,
            inputFormatClass,
            keyClass,
            valueClass,
            ctx.hadoopConfiguration()
    );

I have not tested with a non-local cluster or gigabyte-sized files on Spark, but the 
equivalent Hadoop code - like this but returning Hadoop Text - works well at those 
scales.
On Mon, Oct 6, 2014 at 2:33 PM, <jan.zi...@centrum.cz> wrote:
@Davies
I know that gensim.corpora.wikicorpus.extract_pages will for sure be the bottleneck 
on the master node.
Unfortunately I am using Spark on EC2 and I don't have enough space on my nodes to 
store the whole data that needs to be parsed by extract_pages. I have my data on S3, 
and I kind of hoped that after reading the data from S3 into an RDD 
(sc.textFile(file_on_s3)) it would be possible to pass the RDD to extract_pages; 
unfortunately, this does not work for me. If it worked, it would be by far the best 
way to go for me.
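To make it concrete, this is roughly what I tried (the S3 path is just a placeholder):

    # Rough sketch of the attempt described above; the S3 path is a placeholder.
    lines = sc.textFile('s3n://my-bucket/enwiki-latest-pages-articles.xml')

    # This is where it breaks down for me: extract_pages expects a file-like
    # object opened on the raw XML dump, not an RDD of lines, so something like
    #     extract_pages(lines)
    # does not work, and collecting the RDD back to the driver would defeat the
    # purpose because the data does not fit in RAM.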
 
@Steve
I can try a Hadoop Custom InputFormat. It'd be great if you could send me some 
samples. But if I understand it correctly, then I'm afraid it won't work for me, 
because I don't actually have any URL to Wikipedia; I only have a file that is 
opened, parsed, and returned as a generator that yields the parsed page name and 
text from Wikipedia (it can also be some non-public Wikipedia-like site).
______________________________________________________________
From: Steve Lewis <lordjoe2...@gmail.com>
To: Davies Liu <dav...@databricks.com>
Date: 06.10.2014 22:39
Subject: Re: Spark and Python using generator of data bigger than RAM as input 
to sc.parallelize()

CC: "user"
Try a Hadoop Custom InputFormat - I can give you some samples. While I have not 
tried this, an input split has only a length (which could be ignored if the format 
treats the file as non-splittable) and a String for a location. If the location is 
a URL into Wikipedia, the whole thing should work. Hadoop InputFormats seem to be 
the best way to get large (say multi-gigabyte) files into RDDs.
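If you end up calling this from PySpark rather than Java, the call would look 
roughly like this (untested sketch - the path and the fully qualified class names 
are placeholders for whatever InputFormat you write and the Writable types it emits):

    pages = sc.newAPIHadoopFile(
        's3n://my-bucket/enwiki-latest-pages-articles.xml',  # placeholder path
        'com.example.WikipediaPageInputFormat',              # placeholder InputFormat
        'org.apache.hadoop.io.Text',                         # key class
        'org.apache.hadoop.io.Text',                         # value class
    )
    # pages is then a pair RDD, e.g. (page title, page text).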

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

