Drill querying S3

Richard Santoso Mon, 07 Mar 2016 09:36:19 -0800

Hi,

Our company is evaluating Drill for querying data in S3 storage, which will 
consist of several TB of files. There will be a single file of about 1.5 GB 
per day, which is basically a concatenation of ~1500 small JSON files. We 
will be backfilling for the last several years. Users will run many queries 
across time spans: aggregates, trends over time, etc. We were 
hoping to get some help answering some questions, like:
1. Can we have a Drill cluster running on AWS that reads directly from S3?
2. Are there any performance advantages or disadvantages to going via 
MapR/EMRFS?
3. How do the costs work with this type of setup? The Amazon documentation 
says that the costs for the data transfer portion for S3 are "The amount of 
data transferred out of the Amazon S3 region". Does this mean only the results 
coming out of the Drill queries, or does it also include the data Drill 
reads while querying?
4. We need to give our users interactive speeds in this setup. Is this 
achievable?


Any help or insight into the above would be greatly appreciated.

Regards,
Richard
