Hi Manu,

To add a bit more background: Drill uses local storage only for spilling result sets that are too large for memory. Otherwise, data never touches disk once it is read from S3.

Unlike Snowflake, Drill does not cache S3 data locally. This means that if you query the same file multiple times, Drill will hit S3 for each query. Adding Snowflake-like S3 caching is an open project looking for volunteers.

Spilling can be configured to go to the DFS (distributed file system). Presumably this could be S3, though I don't think anyone has tried it. Information about configuring the spill directory is in [1].
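For reference, the spill settings live in drill-override.conf. A minimal sketch, assuming local disk and a placeholder path (see [1] for the full set of options):

  # drill-override.conf -- spill configuration; the path is a placeholder
  drill.exec.spill: {
    # File system to spill to; a DFS URI could go here instead of local disk
    fs: "file:///",
    # One or more spill directories; spreading them across disks adds I/O bandwidth
    directories: [ "/tmp/drill/spill" ]
  }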
Drill does not need Hadoop; it only needs ZooKeeper (and, as Nitin pointed out, the proper configuration for your cloud vendor).

As it turns out, there is some information on AWS and S3 setup in the "Learning Apache Drill" book. Probably not as much detail as you would like, but enough to get you started. The book does not include GCE setup, but the details should be similar. Drill uses the HDFS client (not server) to access a cloud vendor, so as long as you install the correct HDFS client libraries, you are mostly good to go. Note that the S3 libraries have evolved over time; the book explains the most recent library at the time we wrote it last year. Please check the HDFS project for the library you need for GCE access.

Now a request: you will learn quite a number of important details as you set up your cloud-agnostic solution. Please post your findings here, and/or file JIRA tickets, so we can update the documentation or fix any issues that you discover. You are benefiting from the work of others who created Drill; please share your findings with the community so others can benefit from your work.

Thanks,
- Paul

[1] https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/

On Friday, August 16, 2019, 05:10:00 AM PDT, Nitin Pawar <[email protected]> wrote:

From my learning (and I could be wrong in a few things, so wait for others to answer as well):

1. When setting up the drill cluster in a prod environment to query data ranging from several gigabytes to a few terabytes hosted in s3/blob storage/cloud storage, what are the considerations for disk space? I understand drill bits make use of data locality, but how does that work in the case of cloud storage like s3? Will the entire data from s3 be moved to the drill cluster before starting the query processing?

It is advised to use Parquet as your file format; it improves your performance a lot. Drill will bring in all the data it needs to process a given query. This can be reduced if you arrange your folder structure around filterable columns such as dates. When you are using Parquet files, each file or block is downloaded separately by the drillbit servers, and then data localization happens based on your query patterns (for example, when you group by, or filter and then sum). All the data generally resides in memory and starts spilling to disk based on your query patterns.

2. Is it possible to use s3 or other cloud storage solutions for the Sort, Hash Aggregate, and Hash Join operators' spill data rather than using local disk?

As per my understanding, only local disks are used for non-memory-based aggregations. Using cloud-based storage systems for intermediate outputs adds heavy network I/O and causes huge delays in queries.

3. Is it ok to run a drill production cluster without hadoop? Is just a zookeeper quorum enough?

You do NOT need to set up a Hadoop cluster. Apache Drill has no prerequisite on Hadoop for execution purposes unless you are using those feature sets of Apache Drill. To run a Drill cluster, a ZooKeeper quorum is more than sufficient.
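For illustration, a minimal drill-override.conf for a ZooKeeper-backed cluster might look like the sketch below (the hostnames and cluster-id are placeholders; the cluster-id just has to match on every drillbit):

  # drill-override.conf -- cluster coordination; hostnames are placeholders
  drill.exec: {
    # All drillbits with the same cluster-id join the same cluster
    cluster-id: "drillbits1",
    # Your ZooKeeper quorum
    zk.connect: "zk1:2181,zk2:2181,zk3:2181"
  }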
From there on, based on which storage systems you use, you will need to create storage plugins and use them.

On Fri, Aug 16, 2019 at 10:38 AM Manu Mukundan <[email protected]> wrote:

> Hi,
>
> My name is Manu and I am working as a Bigdata architect in a small startup
> company in Kochi, India. Our new project handles visualizing large volumes
> of unstructured data in cloud storage (it can be S3, Azure blob storage, or
> Google cloud storage). We are planning to use Apache Drill as the SQL query
> execution engine so that we will be cloud agnostic. Unfortunately, we are
> finding some key questions unanswered before moving ahead with Drill as
> our platform. Hoping you can provide some clarity; it will be much
> appreciated.
>
> 1. When setting up the drill cluster in a prod environment to query data
> ranging from several gigabytes to a few terabytes hosted in s3/blob
> storage/cloud storage, what are the considerations for disk space? I
> understand drill bits make use of data locality, but how does that work in
> the case of cloud storage like s3? Will the entire data from s3 be moved
> to the drill cluster before starting the query processing?
> 2. Is it possible to use s3 or other cloud storage solutions for the Sort,
> Hash Aggregate, and Hash Join operators' spill data rather than using
> local disk?
> 3. Is it ok to run a drill production cluster without hadoop? Is just a
> zookeeper quorum enough?
>
> I totally understand how busy you can be, but if you get a chance, please
> help me get clarity on these items. It will be really helpful.
>
> Thanks again!
> Manu Mukundan
> Bigdata Architect,
> Prevalent AI,
> [email protected]

--
Nitin Pawar
