1. Yes, you can query S3 from a Drill cluster running in AWS.
2. You will likely see better performance with the data as close to Drill
as possible. Depending on your workload and dataset this may matter less,
but most typical cases benefit from avoiding remote scans. Reading from S3
does not mean reading more data from disk than reading from EmrFS, HDFS or
MapR-FS. The difference is that when the data is local to a node, Drill
performs a first-level filter or aggregate (where applicable) before
sending anything over the network. If your workload is compute-intensive
rather than read-intensive, this won't have much impact.
3. I don't have a definite answer on this, but I assume you would be
charged for any data read out of S3. I believe "out of the Amazon S3
region" would include reading it into your own AWS instances. If they
instead mean that reads are free or cheap as long as the data doesn't hit
the public web, you might save money by co-locating your S3 storage and
AWS nodes in the same region.
4. The best thing to do here is to test it. We have some docs on tuning
Drill settings for different workloads, and if you run into any issues you
can post them on the list and we can try to help you out. For starters,
what counts as interactive speed for your use case: something like web
page load times? How many users are you expecting?
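
If it helps with the testing in point 4, here is a rough sketch of timing a
query through Drill's REST interface (a POST to /query.json on the web port,
8047 by default). The host, the helper names, and the repeat count are my
own assumptions for illustration, not something from your setup, so adjust
them to whatever you actually deploy.

```python
import json
import time
import urllib.request

# Assumption: a Drillbit's web/REST port is reachable at this address.
# 8047 is Drill's default web port; the host is a placeholder.
DRILL_URL = "http://localhost:8047/query.json"


def build_payload(sql):
    """Return the JSON body that Drill's REST query endpoint expects."""
    return {"queryType": "SQL", "query": sql}


def run_on_drill(sql, url=DRILL_URL):
    """POST one query to Drill and return the parsed JSON response."""
    body = json.dumps(build_payload(sql)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def time_query(sql, runner=run_on_drill, repeats=3):
    """Run the query `repeats` times; return wall-clock seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        runner(sql)
        timings.append(time.perf_counter() - start)
    return timings
```

You would call something like time_query("SELECT COUNT(*) FROM ...") once
against the S3-backed data and once against a local copy of the same files,
then compare the numbers with whatever you decide "interactive" means. The
first run is usually worth looking at separately from the later ones.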

- Jason

On Mon, Mar 7, 2016 at 9:34 AM, Richard Santoso <[email protected]>
wrote:

> Hi,
>
> Our company is evaluating using Drill to query data on S3 storage, which
> will consist of several TB's of files. There will be a single file about
> 1.5 GB in size per day, which is basically a concatenation of ~1500 small
> JSON files. We will be backfilling for the last several years. There will
> be many queries by users across time spans - getting aggregates, trends
> over time etc. We were hoping to get some help answering some questions,
> like:
> 1. Can we have a drill cluster running on AWS which goes directly to S3 ?
> 2. Are there any performance advantages/ disadvantages to going via MapR/
> EmrFS ?
> 3. How do the costs work with this type of setup - the Amazon
> documentation says that the costs for the data transfer portion for S3 are
> "The amount of data transferred out of the Amazon S3 region". Does this
> equate to the results coming out of the drill queries - or would this
> include the data Drill will be querying on ?
> 4. We need interactive speeds to give to our users in this setup, is this
> achievable ?
>
> Any help or insight into above would be greatly appreciated.
>
> regards
> Richard
>
