RDBMSs process one record at a time
Drill processes more than one record at a time
–*sets of column values* from multiple records
•called Record Batches
–Logical Vectorization
–Modern CPU friendly
•Optimized for modern CPU architectures
•SIMD (single instruction, multiple data) instructions
•Avoid branching to keep the CPU pipeline moving
•Keeping all pipelines full to achieve efficiency
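The branching point above can be sketched in a few lines. This is a toy illustration in Python, not Drill internals: it contrasts a per-value branch with a branch-free predicated form. Compiled engines favor the latter shape over tight column arrays because the CPU pipeline stays full and the compiler can emit SIMD instructions.

```python
# Hypothetical example: sum the values in a column that exceed a threshold.
qty = [5, 12, 7, 20, 11]

def branchy_sum(col, threshold):
    # Data-dependent branch taken once per value; unpredictable branches
    # stall deep CPU pipelines in compiled engines.
    total = 0
    for v in col:
        if v > threshold:
            total += v
    return total

def branch_free_sum(col, threshold):
    # Multiply by a 0/1 predicate instead of branching; this pattern
    # auto-vectorizes well when compiled over a contiguous column array.
    return sum(v * (v > threshold) for v in col)

assert branchy_sum(qty, 10) == branch_free_sum(qty, 10) == 43
```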

Vectorization: Rather than operating on single values from a single table
record at a time, vectorization in Drill allows the CPU to operate on
vectors, referred to as Record Batches. A Record Batch holds arrays of
values drawn from many different records. The technical basis for the
efficiency of vectorized processing is modern chip technology with
deep-pipelined CPU designs. Keeping all pipelines full to run near peak
performance is something traditional database engines cannot achieve,
primarily due to code complexity.
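To make the Record Batch idea concrete, here is a minimal sketch using hypothetical names (plain Python dicts and lists, not Drill's actual Java ValueVector API): a batch stores column values from many records as contiguous arrays, so an operator touches a whole column per call instead of one record per call.

```python
# One record batch = arrays of column values from multiple records.
# Column names echo the Mongo document from the thread below.
batch = {
    "device": ["nexus", "ipad", "nexus", "galaxy"],
    "idtype": ["ca", "b", "b", "b"],
    "total":  [56263, 120, 300, 45],
}

def filter_batch(batch, column, value):
    """Columnar filter: scan one column array, keep matching positions
    across every column. Only the filter column is read to decide."""
    keep = [i for i, v in enumerate(batch[column]) if v == value]
    return {col: [vals[i] for i in keep] for col, vals in batch.items()}

out = filter_batch(batch, "idtype", "b")
assert out["device"] == ["ipad", "nexus", "galaxy"]
assert out["total"] == [120, 300, 45]
```

Note the filter decision reads only the `idtype` array; columns not referenced by a query need not be touched at all, which is the storage-side benefit the replies below describe.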

On Tue, Feb 17, 2015 at 10:37 AM, Carol McDonald <[email protected]>
wrote:

> Drill allows for columnar *execution* as well as storage.  Other tools
> will flatten the file into rows before executing. For non-columnar data
> storage, Drill optimizes with columnar execution.  This makes for much
> more efficient processing.
>
>
> Drill optimizes for both columnar storage and execution by using an
> in-memory data model that is hierarchical and columnar. When working with
> data stored in columnar formats such as Parquet, Drill avoids disk access
> for columns that are not involved in an analytic query. Drill also provides
> an execution layer that performs SQL processing directly on columnar data
> without row materialization. The combination of optimizations for columnar
> storage and direct columnar execution significantly lowers memory
> footprints and provides faster execution of BI/analytic workloads.
>
> Drill tackles rapidly evolving, application-driven schemas and nested data
> structures with a unique hierarchical columnar representation of data
> allowing for high performance queries on such evolving data structures.
>
>
> On Sat, Feb 14, 2015 at 2:52 PM, Aditya <[email protected]> wrote:
>
>> One of Drill's goals is to allow direct access, i.e. without the need
>> for transformation, to any data source in any format, which is why you
>> see so many choices of data sources and data formats.
>>
>> Having said that, it is also Drill's primary goal to provide low-latency
>> queries by taking advantage of storage formats that provide optimizations
>> like the one you mentioned.
>>
>> Please take a look at Parquet <http://parquet.incubator.apache.org/>,
>> which is a columnar storage format and is used by Drill as the
>> de facto format for optimized access to the data.
>>
>> On Sat, Feb 14, 2015 at 5:56 AM, Tamil selvan R.S <[email protected]>
>> wrote:
>>
>> > Hi,
>> > As the project description says, I understand Drill to be an open-source
>> > implementation of Dremel. Basically, Dremel optimizes ad hoc queries on
>> > unstructured data by storing it in a columnar way instead of record-wise.
>> > I assume Drill does the same. I see Drill supporting a wide variety of
>> > data sources like JSON, Mongo, etc. How does Drill achieve the
>> > transformation of source data into a columnar representation so that
>> > it can optimize the queries?
>> >
>> > For Example:
>> > Data [Assume it to be in mongo]:
>> >
>> >
>> > {"idtype":"ca","id":3,"metric":"purchases","time":"Y14/M0/D0","device":"nexus","devicegrp":"tablet","source":"minewhat","sourcegrp":"email","dofw":"weekend","tofd":"morning","browser":"chrome","engage":"return","location":"mumbai","locationgrp":"maharashtra","usertag":"frequent","search":"sony tab","total":56263}
>> >
>> > And for a query like below:
>> > select test.device, count(*) from mongo.mydata test where
>> > test.idtype='b' and test.id=10 group by test.device, test.idtype,
>> > test.id;
>> >
>> > Will Drill load *all documents* from the mydata collection every time
>> > this query is fired and later map the data to columnar form? I'm 100%
>> > sure this won't be the implementation, as it looks like it would worsen
>> > the situation [loading the data, transforming it row by row, and then
>> > querying the transformed data].
>> >
>> > It would be really helpful if someone could shed some light on this
>> > area, as there is no material about it in the documentation.
>> >
>> > Regards,
>> > Tamil.s
>> >
>>
>
>
