> On 30 Apr 2017, at 09:19, Zeming Yu <zemin...@gmail.com> wrote:
>
> Hi,
>
> We're building a parquet based data lake. I was under the impression that
> flat files are more efficient than deeply nested files (say 3 or 4 levels
> down). Is that correct?
>
> Thanks,
> Zeming

Where's the data going to live: HDFS or an object store? If it's somewhere like Amazon S3, I'd be biased towards the flatter structure. The client libraries mimic a directory tree walk with a series of HTTP calls, which gets expensive as the tree deepens. Those calls all take place during the initial, serialized query planning stage, so the time they take is added directly to the start of every query.
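
To put a rough number on the tree-walk cost, here is a small Python sketch. It is only a model under stated assumptions: it charges one simulated listing call per directory visited, the `year=/month=/day=` layout is an invented example, and `count_list_calls` is a hypothetical helper, not part of any client library. Real object-store clients may issue more than one HTTP request per directory and page through large listings, so treat the counts as illustrative rather than as measurements.

```python
from collections import defaultdict


def count_list_calls(paths):
    """Count simulated listing calls for a walk from the root.

    Assumption: discovering the contents of a directory costs exactly
    one listing call, and the walk visits every directory once.
    """
    # Map each directory to the set of its immediate subdirectories.
    children = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # The last component is the file name, so it is not a directory.
        for i in range(len(parts) - 1):
            parent = "/".join(parts[:i])
            child = "/".join(parts[:i + 1])
            children[parent].add(child)

    calls = 0
    stack = [""]  # "" stands for the root of the dataset
    while stack:
        directory = stack.pop()
        calls += 1  # one listing call per directory visited
        stack.extend(children.get(directory, ()))
    return calls


# Hypothetical nested layout: 2 years x 12 months x 28 days, one file each.
nested = [
    f"year={y}/month={m:02d}/day={d:02d}/part-0.parquet"
    for y in (2016, 2017)
    for m in range(1, 13)
    for d in range(1, 29)
]

# The same number of files kept directly under the root.
flat = [f"part-{i}.parquet" for i in range(len(nested))]

nested_calls = count_list_calls(nested)
flat_calls = count_list_calls(flat)
print("nested:", nested_calls, "flat:", flat_calls)
```

In this model the nested layout needs one call for the root plus one per year, month and day directory, while the flat layout needs a single call for the same number of files. Against HDFS each listing is a fast RPC, so the gap matters much less there than over HTTP.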