Hi Manu,

To add a bit more background: Drill uses local storage only for spilling result sets that are too large for memory. Otherwise, data never touches disk once it is read from S3.

Unlike Snowflake, Drill does not cache S3 data locally. This means that if you query the same file multiple times, Drill will hit S3 for each query. Adding Snowflake-like S3 caching is an open project looking for volunteers.

Spilling can be configured to go to the DFS (distributed file system). Presumably this could be S3, though I don't think anyone has tried it. Information about configuring the spill directory is in [1].
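For reference, the spill settings live in drill-override.conf. A minimal sketch, assuming local disk and a placeholder path (see [1] for the full set of options):

  # drill-override.conf -- spill configuration; the path is a placeholder
  drill.exec.spill: {
    # File system to spill to; a DFS URI could go here instead of local disk
    fs: "file:///",
    # One or more spill directories; spreading them across disks adds I/O bandwidth
    directories: [ "/tmp/drill/spill" ]
  }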
Drill does not need Hadoop; it only needs ZooKeeper (and, as Nitin pointed out, the proper configuration for your cloud vendor).

As it turns out, there is some information on AWS and S3 setup in the "Learning Apache Drill" book. Probably not as much detail as you would like, but enough to get you started. The book does not include GCE setup, but the details should be similar. Drill uses the HDFS client (not server) to access a cloud vendor, so as long as you install the correct HDFS client libraries, you are mostly good to go. Note that the S3 libraries have evolved over time; the book explains the most recent library at the time we wrote it last year. Please check the HDFS project for the library you need for GCE access.

Now a request: you will learn quite a number of important details as you set up your cloud-agnostic solution. Please post your findings here, and/or file JIRA tickets, so we can update the documentation or fix any issues that you discover. You are benefiting from the work of others who created Drill; please share your findings with the community so others can benefit from your work.

Thanks,
- Paul

[1] https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/

On Friday, August 16, 2019, 05:10:00 AM PDT, Nitin Pawar <[email protected]> wrote:

From my learning (and I could be wrong in a few things, so wait for others to answer as well):

1. When setting up the drill cluster in a prod environment to query data ranging from several gigabytes to a few terabytes hosted in s3/blob storage/cloud storage, what are the considerations for disk space? I understand drill bits make use of data locality, but how does that work in the case of cloud storage like s3? Will the entire data from s3 be moved to the drill cluster before starting the query processing?

It is advised to use Parquet as your file format; it improves your performance a lot. Drill will bring in all the data it needs to process a given query. This can be reduced if you arrange your folder structure around filterable columns such as dates. When you are using Parquet files, each file or block is downloaded separately by the drillbit servers, and then data localization happens based on your query patterns (for example, when you group by, or filter and then sum). All the data generally resides in memory and starts spilling to disk based on your query patterns.

2. Is it possible to use s3 or other cloud storage solutions for the Sort, Hash Aggregate, and Hash Join operators' spill data rather than using local disk?

As per my understanding, only local disks are used for non-memory-based aggregations. Using cloud-based storage systems for intermediate outputs adds heavy network I/O and causes huge delays in queries.

3. Is it ok to run a drill production cluster without hadoop? Is just a zookeeper quorum enough?

You do NOT need to set up a Hadoop cluster. Apache Drill has no prerequisite on Hadoop for execution purposes unless you are using those feature sets of Apache Drill. To run a Drill cluster, a ZooKeeper quorum is more than sufficient.
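For illustration, a minimal drill-override.conf for a ZooKeeper-backed cluster might look like the sketch below (the hostnames and cluster-id are placeholders; the cluster-id just has to match on every drillbit):

  # drill-override.conf -- cluster coordination; hostnames are placeholders
  drill.exec: {
    # All drillbits with the same cluster-id join the same cluster
    cluster-id: "drillbits1",
    # Your ZooKeeper quorum
    zk.connect: "zk1:2181,zk2:2181,zk3:2181"
  }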
From there on, based on which storage systems you use, you will need to create storage plugins and use them.

On Fri, Aug 16, 2019 at 10:38 AM Manu Mukundan <[email protected]> wrote:

> Hi,
>
> My name is Manu and I am working as a Bigdata architect in a small startup
> company in Kochi, India. Our new project handles visualizing large volumes
> of unstructured data in cloud storage (it can be S3, Azure blob storage, or
> Google cloud storage). We are planning to use Apache Drill as the SQL query
> execution engine so that we will be cloud agnostic. Unfortunately, we are
> finding some key questions unanswered before moving ahead with Drill as
> our platform. Hoping you can provide some clarity; it will be much
> appreciated.
>
> 1. When setting up the drill cluster in a prod environment to query data
> ranging from several gigabytes to a few terabytes hosted in s3/blob
> storage/cloud storage, what are the considerations for disk space? I
> understand drill bits make use of data locality, but how does that work in
> the case of cloud storage like s3? Will the entire data from s3 be moved
> to the drill cluster before starting the query processing?
> 2. Is it possible to use s3 or other cloud storage solutions for the Sort,
> Hash Aggregate, and Hash Join operators' spill data rather than using
> local disk?
> 3. Is it ok to run a drill production cluster without hadoop? Is just a
> zookeeper quorum enough?
>
> I totally understand how busy you can be, but if you get a chance, please
> help me get clarity on these items. It will be really helpful.
>
> Thanks again!
> Manu Mukundan
> Bigdata Architect,
> Prevalent AI,
> [email protected]

--
Nitin Pawar
