We have been using Drill directly against S3; we generate between 1 and 2 TB a week of uncompressed JSON data files, and hosting all of our data on HDFS was going to be too expensive for us.
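For anyone setting this up, pointing Drill at S3 is just the file storage plugin with an s3a:// connection. A minimal sketch of the plugin config is below; the bucket name and workspace layout are placeholders, and credentials go separately in conf/core-site.xml (fs.s3a.access.key / fs.s3a.secret.key):

```json
{
  "type": "file",
  "enabled": true,
  "connection": "s3a://my-bucket",
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null },
    "tmp":  { "location": "/tmp", "writable": true, "defaultInputFormat": null }
  },
  "formats": {
    "json":    { "type": "json" },
    "parquet": { "type": "parquet" }
  }
}
```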
I highly recommend that you make sure your JSON data is very clean with respect to schema changes and data type changes. I run all of my data through a "cleaner" that enforces a strict schema and fixes or removes non-conforming records. I also highly recommend converting your raw JSON to Parquet format after it is clean; for us, Parquet is about 9x smaller than the JSON data files. This is very simple in Drill with a CTAS statement.

I don't have a lot of end users, and my end users are all technical people; they are OK with learning the ins and outs of Drill. I wouldn't recommend putting a large number of random people on a "production" system. We have experienced instances where a perfectly formed SQL query causes Drill to crash. Granted, these are sometimes complicated cases, but they have resulted in restarting the drillbit as the only recourse. We have tested on 1.5 and 1.4. We have found the community to be very helpful.

If you have a finite number of parameterized queries, you may find Drill suitable for production use. Just make sure to test each one with a wide range of parameters across a large cross-section of your data. I'm going to wave my hands again and say "make sure that your data is clean." Drill isn't in charge of your schema, so you will have to be.

We have non-critical parts of our application connect to a Drill cluster (a cluster of one drillbit in this case) to run a limited set of analytical queries, with an Amazon auto scaling group that makes sure it's up and running all the time. We have a different cluster of one for running ad hoc queries in case we crash it, backed by the same kind of AWS auto scaling group.

Within some constraints, Drill is awesome for querying JSON data off of S3. Moving data out of a region will incur charges, so we co-locate our drillbits in the same region as the S3 storage; reading from S3 in the same region is free in terms of bandwidth charges. Also note that different EC2 instance types have differing amounts of bandwidth.
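And for reference, the JSON-to-Parquet conversion mentioned above really is a one-statement CTAS in Drill. The workspace and table names here are placeholders for illustration:

```sql
-- CTAS writes in the session's store.format; set it to parquet first.
ALTER SESSION SET `store.format` = 'parquet';

-- Hypothetical paths: read cleaned JSON, write Parquet to a writable workspace.
CREATE TABLE s3.tmp.`events_parquet` AS
SELECT * FROM s3.root.`clean/events/2016-03-01/`;
```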
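To make the "cleaner" idea above concrete, here is a minimal sketch of the approach: enforce a fixed field-to-type mapping, coerce values where possible, and drop anything that can't be fixed. The schema and field names are made up for the example; a real cleaner would be more involved:

```python
import json

# Hypothetical strict schema: required field name -> target Python type.
SCHEMA = {"user_id": int, "event": str, "value": float}

def clean_record(raw):
    """Coerce one raw JSON line to the schema; return None if unfixable."""
    try:
        rec = json.loads(raw)
    except ValueError:
        return None  # drop lines that aren't valid JSON
    out = {}
    for field, ftype in SCHEMA.items():
        if field not in rec:
            return None  # drop records missing a required field
        try:
            out[field] = ftype(rec[field])  # coerce, e.g. "42" -> 42
        except (TypeError, ValueError):
            return None  # drop records whose value can't be coerced
    return out

def clean_lines(lines):
    """Yield one normalized JSON string per conforming record."""
    for line in lines:
        rec = clean_record(line)
        if rec is not None:
            yield json.dumps(rec, sort_keys=True)
```

The point of sorting keys and coercing types is that every surviving record presents an identical shape to Drill, so schema-change errors can't surface at query time.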
For our use, we run r3.xlarge machines because we have queries that use quite a bit of RAM, and that instance type has adequate bandwidth. Drill doesn't necessarily like one big file per day: we write multiple files per day, and when querying across a single day we get better performance from the parallel loads.

> On Mar 7, 2016, at 2:59 PM, Jason Altekruse <[email protected]> wrote:
>
> 1. Yes you can query S3 from a drill cluster in AWS
> 2. There is a good chance you would see better performance putting the data
> as close to Drill as possible. Depending on your workload and dataset you
> might encounter a case where this doesn't matter as much, but most typical
> cases will benefit from avoiding remote scans. You won't be reading any
> more data from disk hitting S3 rather than EmrFS, HDFS or MapR-FS, but
> Drill will perform a first-level filter or aggregate (where applicable)
> before sending it over the network if you are reading data local to a node.
> If you have a compute-intensive rather than a read-intensive workload this
> won't have much impact.
> 3. I don't have a definite answer on this, but I assume you would be
> charged for any data read out of S3. I believe "out of the Amazon S3
> region" would include reading it into your own AWS instances. If they
> instead mean that as long as it's not hitting the public web you get
> to do reads for free/cheap, you might save money by co-locating your S3
> storage and AWS nodes in the same region.
> 4. The best thing to do here would be to test things out. We have some docs
> about tuning Drill settings for different workloads, and if you run into
> any issues you can post them on the list and we can try to help you out.
> For starters, what are interactive speeds for your use case, like web page
> loads? What volume of users are you expecting?
>
> - Jason
>
> On Mon, Mar 7, 2016 at 9:34 AM, Richard Santoso <[email protected]>
> wrote:
>
>> Hi,
>>
>> Our company is evaluating using Drill to query data on S3 storage, which
>> will consist of several TB's of files. There will be a single file about
>> 1.5 GB in size per day, which is basically a concatenation of ~1500 small
>> JSON files. We will be backfilling for the last several years. There will
>> be many queries by users across time spans - getting aggregates, trends
>> over time etc. We were hoping to get some help answering some questions,
>> like:
>> 1. Can we have a drill cluster running on AWS which goes directly to S3?
>> 2. Are there any performance advantages/disadvantages to going via MapR/
>> EmrFS?
>> 3. How do the costs work with this type of setup - the Amazon
>> documentation says that the costs for the data transfer portion for S3 are
>> "The amount of data transferred out of the Amazon S3 region". Does this
>> equate to the results coming out of the drill queries, or would this
>> include the data Drill will be querying on?
>> 4. We need interactive speeds to give to our users in this setup, is this
>> achievable?
>>
>> Any help or insight into above would be greatly appreciated.
>>
>> regards
>> Richard
