I'm trying to use SparkSQL to efficiently query structured data from datasets in S3. The data is naturally partitioned by date, so I've laid it out in S3 as follows:
s3://bucket/dataset/dt=2015-07-05/
s3://bucket/dataset/dt=2015-07-04/
s3://bucket/dataset/dt=2015-07-03/
etc.

In each directory, the data for the given date is written as Parquet (although that's not a hard requirement if there's a more efficient way to do it). I have data going back a couple of decades, so we're looking at thousands of partitions. In each partition, the day's data is anywhere from 100MB to 5GB.

Most queries only use data from one to five partitions (i.e., they filter on the partition column very tightly). Ideally, such queries would run quickly, ignoring irrelevant partitions and only accessing data in the partitions they actually need.

However, early tests seem to show Spark reading Parquet metadata for all files in the dataset, even those excluded by a predicate filter (and pruned by DataSourceStrategy). The result is that a query that should take a second or two spends a minute or more reading metadata for irrelevant partitions before extracting the data of interest. This greatly limits my ability to use the system, since a very common use case is to run a script that downloads the query results into a file and returns; running that script now takes about a minute instead of 2-3 seconds.

Am I missing something here? It feels like this is something that should be easy, but I'm not finding it to be. Can anyone suggest why this might be happening? Am I approaching this problem the wrong way?

Thanks for your help.
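
For reference, this is roughly the pattern I'm using (a minimal sketch, not my exact code; it assumes spark-shell, where sqlContext is predefined, and the dt values in the filter are just placeholders):

    // Point Spark at the root of the dataset; the dt=... directories are
    // discovered as a partition column.
    val df = sqlContext.read.parquet("s3://bucket/dataset")

    // Filter tightly on the partition column -- I expected only the matching
    // partitions (and their Parquet footers) to be touched.
    val recent = df.filter("dt >= '2015-07-03' AND dt <= '2015-07-05'")

    // Pull the results down to the driver (in practice the script writes
    // them out to a file).
    recent.collect()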