Hi,

Thanks for the quick response.
No, I'm not using streaming. Each DataFrame represents tabular data read from a CSV file, and they all have the same schema.

There is also the option of appending each DF to the Parquet file, but then I can't maintain them as separate DFs when reading back in without filtering. I'll rethink maintaining each CSV file as a single DF; a sketch of the union-and-filter approach is at the bottom of this message.

Thanks,
Peter

On 10 May 2015 at 15:51, ayan guha <guha.a...@gmail.com> wrote:

> How did you end up with thousands of DFs? Are you using streaming? In that
> case you can do foreachRDD and keep merging the incoming RDDs into a single
> RDD, and then save it through your own checkpoint mechanism.
>
> If not, please share your use case.
>
> On 11 May 2015 00:38, "Peter Aberline" <peter.aberl...@gmail.com> wrote:
>
>> Hi,
>>
>> I have many thousands of small DataFrames that I would like to save to
>> one Parquet file to avoid the HDFS 'small files' problem. My
>> understanding is that there is a 1:1 relationship between DataFrames and
>> Parquet files if a single partition is used.
>>
>> Is it possible to have multiple DataFrames within one Parquet file
>> using PySpark?
>> Or is the only way to achieve this to union the DataFrames into one?
>>
>> Thanks,
>> Peter
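
A minimal sketch of the union-and-filter approach discussed above, assuming a newer PySpark than this thread (SparkSession and the built-in CSV reader); the paths, the "source_file" column name, and the output location are illustrative only, not from the thread:

    import os
    from functools import reduce

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("merge-small-csvs").getOrCreate()

    # Illustrative input paths; in practice this would be the full list of CSVs.
    csv_paths = ["data/file_0001.csv", "data/file_0002.csv"]

    # Read each CSV as its own DataFrame and tag every row with the file it
    # came from, so each original DataFrame can be recovered later by filtering.
    dfs = [
        spark.read.csv(path, header=True, inferSchema=True)
             .withColumn("source_file", F.lit(os.path.basename(path)))
        for path in csv_paths
    ]

    # Union them into a single DataFrame (all inputs share the same schema).
    merged = reduce(lambda a, b: a.union(b), dfs)

    # Write one Parquet dataset, partitioned by the tag column so each original
    # DataFrame lands in its own subdirectory.
    merged.write.partitionBy("source_file").parquet("output/all_csvs.parquet")

    # Reading a single original DataFrame back is a partition-pruned filter.
    one_df = (spark.read.parquet("output/all_csvs.parquet")
                   .filter(F.col("source_file") == "file_0001.csv"))

With thousands of inputs, unioning one DataFrame per file builds a very deep query plan; passing the whole list of paths to a single spark.read.csv call and tagging rows with F.input_file_name() is a lighter-weight variant of the same idea.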