Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Csaba Ragany
Thank you Holden, it works!

infile = sc.wholeTextFiles(sys.argv[1])   # RDD of (filename, file content) pairs
rdd = sc.parallelize(infile.collect())    # collect to the driver, then redistribute
rdd.saveAsSequenceFile(sys.argv[2])       # write out as a Hadoop SequenceFile
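
A quick way to check the result, assuming the same output path (sc.sequenceFile is PySpark's reader for SequenceFiles):

# Read the SequenceFile back and inspect one (filename, content) record.
check = sc.sequenceFile(sys.argv[2])
print(check.first())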

Csaba





Re: pySpark - convert log/txt files into sequenceFile

2014-10-29 Thread Davies Liu
Without the second line (the collect()/parallelize() round trip), it will be much faster:

 infile = sc.wholeTextFiles(sys.argv[1])
 infile.saveAsSequenceFile(sys.argv[2])
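
Spelled out as a complete script (the SparkContext setup here is an assumption, not shown in the thread):

import sys
from pyspark import SparkContext

sc = SparkContext(appName="TextToSequenceFile")  # placeholder app name

# wholeTextFiles already yields an RDD of (filename, content) pairs, so it
# can be written out directly; collect()/parallelize() would funnel every
# file through the driver for no benefit.
infile = sc.wholeTextFiles(sys.argv[1])
infile.saveAsSequenceFile(sys.argv[2])

sc.stop()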





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba,

It sounds like the API you are looking for is sc.wholeTextFiles :)

Cheers,

Holden :)
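
A minimal sketch of what that call returns, run in the PySpark shell where sc is predefined (the path is a placeholder):

# Each element is a (filename, content) pair -- exactly the key/value
# layout asked for below.
pairs = sc.wholeTextFiles("/path/to/logs")  # placeholder input directory
print(pairs.keys().first())                 # full path of one input file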

On Tuesday, October 28, 2014, Csaba Ragany rag...@gmail.com wrote:

 Dear Spark Community,

 Is it possible to convert text files (.log or .txt files) into
 SequenceFiles in Python?

 Using PySpark I can create an RDD with rdd = sc.parallelize([('key1',
 1.0)]) and save it as a SequenceFile with rdd.saveAsSequenceFile(). But
 how can I put the whole content of my text files into the value for
 'key1'?

 I want a SequenceFile where the keys are the filenames of the text files
 and the values are their content.

 Thank you for any help!
 Csaba



-- 
Cell : 425-233-8271