Union of multiple RDDs

2016-06-21 Thread Apurva Nandan
Hello,

I am trying to combine several small text files (each file is roughly a
few hundred MB to 2-3 GB) into one big Parquet file.

I am loading each one of them and taking a union, however this leads to
an enormous number of partitions, as union keeps adding the partitions
of the input RDDs together.

I also tried loading all the files via a wildcard, but that behaves
almost the same as union, i.e. it generates a lot of partitions.

One approach I considered was to repartition the RDD generated after
each union and then continue the process, but I don't know how
efficient that is.
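For what it's worth, here is a minimal PySpark sketch of the wildcard-load
approach with a coalesce before the write, so the output file count is driven
by total input size rather than by how many small inputs there were. The
paths, the `spark` session variable, and the 128 MB target partition size are
all hypothetical assumptions, not anything from the original thread:

```python
def target_partitions(total_bytes, partition_bytes=128 * 1024 * 1024):
    """Pick a partition count so each output partition holds ~128 MB.

    Uses ceiling division so even a tiny input gets at least 1 partition.
    (The 128 MB default is an assumed, commonly used target size.)
    """
    return max(1, -(-total_bytes // partition_bytes))

# Example: ~6 GB of input -> 48 partitions of ~128 MB each.
# n = target_partitions(6 * 1024**3)

# Hypothetical usage with a live SparkSession named `spark`:
# df = spark.read.text("hdfs:///data/small_files/*")      # wildcard load
# df.coalesce(target_partitions(total_input_bytes)) \
#   .write.parquet("hdfs:///data/combined.parquet")
```

Note that coalesce only merges existing partitions (no full shuffle), which
is usually cheaper here than repartition after every union.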

Has anyone come across this kind of thing before?

- Apurva


RDD generated from Dataframes

2016-04-21 Thread Apurva Nandan
Hello everyone,

Generally speaking, I guess it's well known that DataFrames are much
faster than RDDs when it comes to performance.
My question is how you handle transforming a DataFrame using map.
I mean, the map converts the DataFrame into an RDD, so do you then
convert this RDD back into a new DataFrame for better performance?
Further, if you have a process that involves a series of
transformations, i.e. from one RDD to another, do you keep converting
each RDD back to a DataFrame first, every time?

It's also possible that I might be missing something here, please share
your experiences.


Thanks and Regards,
Apurva