Hi all,

Can anyone share their experiences of storing and organising larger datasets
with Spark?

I've got a dataframe stored as Parquet on Amazon S3 (via EMRFS) with a fairly
complex nested schema derived from JSON files. I can query it in Spark, but the
initial setup takes a few minutes, as we've got roughly 5000 partitions and
150GB of compressed Parquet part files.

Generally things work, but now that we're dealing with 100+GB of data we seem
to be hitting various limitations, such as the 2GB block size limit in Spark
(which forces us to use a large number of partitions), slow startup due to
partition discovery, and so on.
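
To stay under that 2GB limit we end up writing with a high partition count,
something like the following (continuing from the sketch above; the figure of
5000 is illustrative rather than a recommendation):

    // Repartition before writing so each partition, and therefore each block,
    // stays comfortably below the ~2GB ceiling.
    events.repartition(5000)
      .write
      .mode("overwrite")
      .parquet("s3://example-bucket/events-repartitioned/")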

Storing data in one big dataframe has worked well so far, but do we need to 
look at breaking it out into multiple dataframes?

Has anyone got any advice on how to structure this?

Thanks,
Ewan
