Hi,
Our company is evaluating Drill for querying data stored on S3, which will
amount to several TB of files. There will be a single file of about 1.5 GB
per day, which is essentially a concatenation of ~1500 small JSON files. We
will be backfilling the last several years. Users will run many queries
across time spans: aggregates, trends over time, etc. We were hoping to get
some help with a few questions:
1. Can we have a Drill cluster running on AWS that reads directly from S3?
2. Are there any performance advantages or disadvantages to going via
MapR/EMRFS?
3. How do the costs work with this type of setup? The Amazon documentation
says that the data transfer portion of the S3 costs is based on "The amount
of data transferred out of the Amazon S3 region". Does this cover only the
results coming out of the Drill queries, or does it also include the data
Drill reads while querying?
4. We need to give our users interactive query speeds in this setup. Is this
achievable?
Any help or insight into the above would be greatly appreciated.
Regards,
Richard