I am posting this on behalf of John Humphreys, who asked it in the MapR
community, because I think the answers may benefit all users:

https://community.mapr.com/thread/22719-re-how-can-i-partition-data-in-drill


   1. If I have Spark repartition a data frame on a column and then save
   the data frame to Parquet, the linked post indicates that Drill would
   query on that column faster, correct?
   2. Does the coalesce number (the number of .snappy.parquet files inside
   the Parquet output directory) make a big difference?  Spark defaults to
   200.
   3. Does sorting the data help too?  Or does partitioning sort it
   implicitly?
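
To make question 1 concrete, here is a toy model in plain Python. It is not
Spark or Drill code, and the names (file_index, repartition_by_column,
files_holding, the "host" column) are made up for illustration. It assumes a
hash-style repartition on a column, so every row with the same key ends up in
the same output file, and then counts how many files actually hold rows for one
key. Whether Drill really skips the other files is exactly what is being asked.

```python
import zlib
from collections import defaultdict


def file_index(value, num_files):
    # Deterministic hash, so the example behaves the same on every run.
    return zlib.crc32(str(value).encode("utf-8")) % num_files


def repartition_by_column(rows, column, num_files):
    # Toy stand-in for a hash repartition: rows with equal keys share a file.
    files = defaultdict(list)
    for row in rows:
        files[file_index(row[column], num_files)].append(row)
    return dict(files)


def files_holding(files, column, wanted):
    # Indexes of the files that contain at least one row with column == wanted.
    return [
        index
        for index, rows in sorted(files.items())
        if any(row[column] == wanted for row in rows)
    ]


# Hypothetical data: four hosts, four readings each.
rows = [{"host": h, "value": v} for v in range(4) for h in ("a", "b", "c", "d")]
files = repartition_by_column(rows, "host", num_files=3)
scan = files_holding(files, "host", "a")
print(len(scan), "file(s) hold rows for host 'a'")
```

In this model a filter on the repartition column touches one file no matter how
many files are written, which is the behaviour question 1 hopes Drill can
exploit. The file count from question 2 only changes how many keys share a file.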


Thanks,
Saurabh
