Hi,

My name is Manu and I work as a Big Data architect at a small startup in Kochi, India. Our new project visualizes large volumes of unstructured data held in cloud storage (S3, Azure Blob Storage, or Google Cloud Storage). We are planning to use Apache Drill as the SQL query execution engine so that we remain cloud agnostic. Unfortunately, a few key questions remain unanswered before we can move ahead with Drill as our platform. I am hoping you can provide some clarity; it would be much appreciated.
1. When setting up a Drill cluster in a production environment to query data ranging from several gigabytes to a few terabytes hosted in S3, Azure Blob Storage, or Google Cloud Storage, what are the considerations for disk space? I understand that Drillbits make use of data locality, but how does that work with cloud storage such as S3? Will the entire data set be moved from S3 to the Drill cluster before query processing starts?
2. Is it possible to use S3 or another cloud storage service, rather than local disk, for the spill data of the Sort, Hash Aggregate, and Hash Join operators?
3. Is it OK to run a production Drill cluster without Hadoop? Is a ZooKeeper quorum alone enough?

I totally understand how busy you must be, but if you get a chance, please help me get clarity on these items. It would be really helpful.

Thanks again!

Manu Mukundan
Bigdata Architect, Prevalent AI
manu.mukun...@prevalent.ai