Union of multiple RDDs
Hello, I am trying to combine several text files (each approximately hundreds of MB to 2-3 GB) into one big Parquet file. I am loading each one and taking a union, but this leads to an enormous number of partitions, since union keeps adding the partitions of the input RDDs together. I also tried loading all the files via a wildcard, but that behaves almost the same as union, i.e. it generates a lot of partitions. One approach I thought of was to repartition the RDD generated after each union and then continue the process, but I don't know how efficient that is. Has anyone come across this kind of thing before? - Apurva
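One common answer on threads like this: read everything in one pass and reduce partitions a single time with coalesce just before the write, rather than repartitioning after every union (repartition triggers a full shuffle; coalesce, when shrinking the partition count, does not). A minimal sketch, assuming Spark 2.x with a SparkSession; the paths and the target partition count are illustrative, not from the original post:

```scala
import org.apache.spark.sql.SparkSession

object CombineToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("combine-to-parquet")
      .getOrCreate()

    // A wildcard read picks up all files in one job; Spark still creates
    // roughly one partition per input block, so the count starts large.
    val lines = spark.read.text("/data/input/*.txt") // illustrative path

    // Shrink the partition count once, just before writing. coalesce(n)
    // merges existing partitions without a full shuffle, so the output is
    // written as ~n Parquet part-files instead of one per input block.
    lines.coalesce(64) // 64 is an illustrative target, tune to your cluster
      .write
      .parquet("/data/output/combined.parquet") // illustrative path

    spark.stop()
  }
}
```

If you are working at the RDD level instead, `sc.union(Seq(rdd1, rdd2, ...))` followed by a single `coalesce` achieves the same effect as chaining pairwise unions, without repartitioning at every step.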
RDDs generated from DataFrames
Hello everyone, Generally speaking, I guess it's well known that DataFrames perform much better than RDDs. My question is how you handle transforming a DataFrame using map: the map call converts the DataFrame into an RDD, so do you then convert that RDD back into a new DataFrame for better performance? Further, if you have a process involving a series of transformations, i.e. from one RDD to another, do you keep converting each intermediate RDD back to a DataFrame every time? It's also possible that I am missing something here; please share your experiences. Thanks and Regards, Apurva
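A typical reply: you usually don't need to drop to RDDs at all. Built-in column expressions stay inside the DataFrame API so Catalyst can optimize the whole chain, and when per-row logic is unavoidable, `Dataset.map` keeps you in the typed Dataset world with no RDD round-trip. A minimal sketch, assuming Spark 2.x; the sample data and column names are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StayInDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-transforms")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data, not from the original discussion.
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

    // Prefer built-in column expressions over df.rdd.map(...): the whole
    // chain stays a DataFrame, so Catalyst optimizes it end to end.
    val doubled = df.withColumn("value", col("value") * 2)

    // When row-by-row logic really is needed, use Dataset.map: the result
    // is still a Dataset, so there is no RDD-to-DataFrame conversion step.
    val upper = doubled.map { row =>
      (row.getString(0).toUpperCase, row.getInt(1))
    }.toDF("key", "value")

    upper.show()
    spark.stop()
  }
}
```

In other words, converting back and forth after every transformation is usually a sign the same logic could be expressed once within the DataFrame/Dataset API.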