Hi

In that case, read the entire folder as an RDD and give it a reasonable number of partitions.
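
Roughly what I mean, as a minimal sketch rather than a tested recipe. It assumes a PySpark version that provides `SparkSession`, `spark.read.csv` and `input_file_name`. The function names, the `source_file` column, the paths and the 128 MB target size are all hypothetical choices of mine. Note that Parquet output is a directory of part files, typically one per partition, so the partition count is what controls how many files land on HDFS.

```python
def reasonable_partition_count(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Heuristic only (an assumption, not a Spark rule): roughly one
    partition per 128 MB of input, and never fewer than one."""
    # Integer ceiling division, so no floating point rounding is involved.
    needed = (total_bytes + target_partition_bytes - 1) // target_partition_bytes
    return max(1, needed)


def csv_folder_to_parquet(spark, csv_dir, parquet_path, total_bytes):
    """Read every CSV file under csv_dir as one DataFrame and write it out
    as a small number of Parquet part files.

    `spark` is assumed to be an existing SparkSession. All CSV files are
    assumed to share one schema and to have a header row."""
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark.sql.functions import input_file_name

    df = (
        spark.read.csv(csv_dir, header=True, inferSchema=True)
        # Record which file each row came from, so a single original CSV
        # can be recovered later by filtering on this column.
        .withColumn("source_file", input_file_name())
    )
    num_partitions = reasonable_partition_count(total_bytes)
    # repartition() shuffles the rows into num_partitions partitions.
    df.repartition(num_partitions).write.parquet(parquet_path)
    return num_partitions
```

Reading one of the original files back would still need a filter, for example `spark.read.parquet(parquet_path).filter(col("source_file").endswith("some_file.csv"))`, where `some_file.csv` is a placeholder name.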

Best
Ayan

On 11 May 2015 01:35, "Peter Aberline" <peter.aberl...@gmail.com> wrote:

> Hi
>
> Thanks for the quick response.
>
> No I'm not using Streaming. Each DataFrame represents tabular data read
> from a CSV file. They have the same schema.
>
> There is also the option of appending each DF to the parquet file, but
> then I can't maintain them as separate DF when reading back in without
> filtering.
>
> I'll rethink maintaining each CSV file as a single DF.
>
> Thanks,
> Peter
>
>
> On 10 May 2015 at 15:51, ayan guha <guha.a...@gmail.com> wrote:
>
>> How did you end up with thousands of df? Are you using streaming? In
>> that case you can do foreachRDD and keep merging incoming rdds to single
>> rdd and then save it through your own checkpoint mechanism.
>>
>> If not, please share your use case.
>> On 11 May 2015 00:38, "Peter Aberline" <peter.aberl...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I have many thousands of small DataFrames that I would like to save to
>>> the one Parquet file to avoid the HDFS 'small files' problem. My
>>> understanding is that there is a 1:1 relationship between DataFrames and
>>> Parquet files if a single partition is used.
>>>
>>> Is it possible to have multiple DataFrames within the one Parquet File
>>> using PySpark?
>>> Or is the only way to achieve this to union the DataFrames into one?
>>>
>>> Thanks,
>>> Peter