@Davies
I know that gensim.corpora.wikicorpus.extract_pages will certainly be the
bottleneck on the master node.
Unfortunately, I am using Spark on EC2 and I don't have enough space on my
nodes to store the whole data set that extract_pages needs to parse. My data
is on S3, and I had hoped that after reading it into an RDD with
sc.textFile(file_on_s3) I could pass that RDD to extract_pages; unfortunately,
that does not work for me. If it worked, it would be by far the best option
for me.
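
For concreteness, this is roughly what I tried (a minimal sketch; the bucket
path is made up):

    from gensim.corpora.wikicorpus import extract_pages

    # RDD of lines read straight from S3 (path is hypothetical).
    lines = sc.textFile("s3n://my-bucket/wiki-dump.xml")

    # This is the part that fails: extract_pages expects a file-like
    # object over the raw XML dump, not an RDD of lines, so iterating
    # the result raises an error instead of yielding (title, text) pairs.
    pages = extract_pages(lines)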
@Steve
I can try a custom Hadoop InputFormat; it would be great if you could send me
some samples. But if I understand it correctly, I'm afraid it won't work for
me, because I don't actually have any URL to Wikipedia. I only have a file
that is opened and parsed, returning a generator that yields the parsed page
name and text from Wikipedia (it can also be some non-public Wikipedia-like
site).
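
If the dump can at least be streamed on the driver, one way to avoid holding
everything in RAM is to feed the generator to sc.parallelize() in bounded
chunks and union the pieces. A rough sketch (the file name and chunk size are
illustrative, not tested):

    from itertools import islice
    from gensim.corpora.wikicorpus import extract_pages

    def chunks(generator, size):
        # Yield lists of at most `size` items drawn from `generator`.
        while True:
            chunk = list(islice(generator, size))
            if not chunk:
                break
            yield chunk

    # Generator of (title, text) pairs from the wiki(-like) dump.
    pages = extract_pages(open("wiki-dump.xml"))

    # Parallelize one bounded chunk at a time, then union the pieces,
    # so the driver never materializes the whole data set at once.
    rdds = [sc.parallelize(chunk) for chunk in chunks(pages, 10000)]
    all_pages = sc.union(rdds)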
______________________________________________________________
From: Steve Lewis <lordjoe2...@gmail.com>
To: Davies Liu <dav...@databricks.com>
Date: 06.10.2014 22:39
Subject: Re: Spark and Python using generator of data bigger than RAM as input
to sc.parallelize()
CC: "user"
Try a Hadoop custom InputFormat - I can give you some samples. While I have
not tried this, an input split has only a length (which could be ignored if
the format treats the input as non-splittable) and a String for a location.
If the location is a URL into Wikipedia, the whole thing should work. Hadoop
InputFormats seem to be the best way to get large (say, multi-gigabyte) files
into RDDs.
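
For reference, reading through a custom InputFormat from PySpark would look
roughly like the sketch below; the InputFormat class is hypothetical and
would have to be implemented in Java/Scala and placed on the classpath:

    # Each record is a (page title, page text) pair produced by the
    # hypothetical InputFormat; key and value classes are Hadoop Text.
    pages = sc.newAPIHadoopFile(
        "s3n://my-bucket/wiki-dump.xml",    # hypothetical path
        "com.example.WikiPageInputFormat",  # hypothetical class
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.Text")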