Re: Some questions on Drill.

Ted Dunning Mon, 26 Sep 2016 08:13:10 -0700

On Mon, Sep 26, 2016 at 4:05 PM, ramki <[email protected]> wrote:

> Below are some initial (+ some dumb) questions that I have.
>


Dumb questions are rare.

Naive questions are common and often quite valuable.


> 1) Does Drill have a way to represent the queries in JSON format? For
> example, "select ... where name = 'x' and age = 10 " can be written in JSON
> as {name = 'x', age ='10' }. You can think of it like Mongo queries. If
> this is not already there, can we implement the same on Drill to expose
> both SQL & JSON represented way to query?
>

Yes, but no.

There is an internal representation of the logical query that can be
injected in JSON form, but it isn't as simple as the Mongo query language.
Of course, the Mongo query language isn't nearly as simple as it appears,
either.

It wouldn't be hard to build something that converts a simple form of JSON
query into something suitable for Drill (this is a purposeful integration
point), but it doesn't quite exist just now.
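
A minimal sketch of such a converter, in Python, assuming the "simple form"
is a flat JSON object of equality conditions combined with AND. The names
json_filter_to_sql and sql_literal and the example table path are made up
for illustration; nothing like this ships with Drill. The output is plain
SQL text to submit through whatever client you already use.

```python
import json


def sql_literal(value):
    """Render a JSON scalar (bool, number, or string) as a SQL literal."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        # Double any single quotes so the string stays one literal.
        return "'" + value.replace("'", "''") + "'"
    raise ValueError(f"unsupported value: {value!r}")


def json_filter_to_sql(table, query_json):
    """Turn a flat JSON object such as {"name": "x", "age": 10} into
    SELECT * FROM <table> WHERE `name` = 'x' AND `age` = 10.

    Assumption: every key is a column name and every value is a scalar
    or null. Nested operators (ranges, OR, and so on) are out of scope.
    """
    conditions = json.loads(query_json)
    if not isinstance(conditions, dict):
        raise ValueError("query must be a JSON object")
    clauses = []
    for column, value in conditions.items():
        if "`" in column:
            raise ValueError("backticks are not allowed in column names")
        ident = f"`{column}`"  # backtick-quoted identifier
        if value is None:
            clauses.append(f"{ident} IS NULL")
        elif isinstance(value, (dict, list)):
            raise ValueError("only flat equality conditions are supported")
        else:
            clauses.append(f"{ident} = {sql_literal(value)}")
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql


if __name__ == "__main__":
    print(json_filter_to_sql("dfs.`/data/people.json`",
                             '{"name": "x", "age": 10}'))
```

Anything beyond equality (ranges, OR, nested fields) is where the Mongo
query language stops being simple, and a real converter would need a
proper grammar for it.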



>
> 2) Does Drill have any expectations from the data-sources in any ways? Can
> I plug any data-source to Drill by implementing the driver for it? Like if
> I want to add support for ElasticSearch and CouchBase, is it easily
> possible?
>

It is pretty easy if you don't allow any push-down or clever optimization.

It isn't hugely harder if you support simple forms of push down.

This is very doable for the cases you mention.
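
To make the push-down distinction concrete, here is a toy model in Python.
It is not Drill's storage plugin API (which is Java); ToySource and both
query functions are invented for illustration. Without push-down the source
ships every row and the engine filters; with simple push-down the source
receives the predicate and ships only matching rows.

```python
class ToySource:
    """Stands in for an external store such as a search index."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_shipped = 0  # how many rows left the source

    def scan(self, equals=None):
        """Yield rows. If `equals` (column -> value) is given, the filter
        is applied at the source before anything is shipped."""
        for row in self.rows:
            if equals is None or all(row.get(k) == v for k, v in equals.items()):
                self.rows_shipped += 1
                yield row


def query_without_pushdown(source, equals):
    """Full scan; the engine filters after the rows arrive."""
    return [r for r in source.scan()
            if all(r.get(k) == v for k, v in equals.items())]


def query_with_pushdown(source, equals):
    """The predicate is handed to the source, which filters for us."""
    return list(source.scan(equals=equals))


if __name__ == "__main__":
    rows = [{"name": "x", "age": 10},
            {"name": "y", "age": 20},
            {"name": "x", "age": 30}]
    plain, pushed = ToySource(rows), ToySource(rows)
    print(query_without_pushdown(plain, {"name": "x"}), plain.rows_shipped)
    print(query_with_pushdown(pushed, {"name": "x"}), pushed.rows_shipped)
```

Both paths return the same rows; the only difference is how much data
crosses the wire, which is why the plain version is the easy starting point.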


>
> 3) Does Drill have abilities to "stream" the results and so we can build
> some sort of pipelines? For example, Reactive Streams?
>

Internally, Drill works as a streaming engine, but only in the sense of
data streaming, not in the sense of an engine like Flink or Apex that
supports checkpoints and event time.

The memory management also strongly assumes that queries have a finite
lifetime.

There has been some talk about making Drill into a true streaming engine,
but I don't think that there has been much progress in that direction.


> 4) Are there any characterization of resource usage like CPU, memory...
> on data source containing over many tera-bytes data?
>

I think it is impossible to answer this in general other than to say that
Drill will usually spill to disk when it can't keep everything in memory.
There are a number of production use cases that process much larger amounts
of data than just terabytes with much smaller amounts of memory. There are
also knobs to turn that will strictly limit memory usage.

But saying anything more specific than that is probably impossible unless
you can give specifics.



>
> 5) We can use Drill for querying only and not for ingestion, right?
>

Yes and no.

Drill has very good capabilities for [create table as ...]. This can be
used to create files in directories, and the directories can be queried by
Drill, thus reflecting a growing dataset. This works really well for
ingestion that works in fairly substantial chunks.
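
A sketch of that chunked pattern in Python, building only the statement
text. The dfs.tmp workspace, the dataset and chunk names, and the staging
path are assumptions for illustration, as are the helper names. Check that
your workspace is writable and that your Drill version accepts a path-like
table name before relying on this.

```python
def ctas_for_chunk(workspace, dataset, chunk_id, source_path):
    """Build a CREATE TABLE AS statement that writes one ingestion chunk
    into its own subdirectory under the dataset directory."""
    for part in (dataset, chunk_id, source_path):
        if "`" in part:
            raise ValueError("backticks are not allowed in names or paths")
    target = f"{workspace}.`{dataset}/{chunk_id}`"
    return f"CREATE TABLE {target} AS SELECT * FROM dfs.`{source_path}`"


def query_all_chunks(workspace, dataset):
    """Query the parent directory so every chunk written so far is read."""
    return f"SELECT * FROM {workspace}.`{dataset}`"


if __name__ == "__main__":
    print(ctas_for_chunk("dfs.tmp", "events", "2016-09-26",
                         "/staging/events-2016-09-26.json"))
    print(query_all_chunks("dfs.tmp", "events"))
```

Each chunk lands in a fresh subdirectory, so nothing already written is
modified; the dataset grows only by adding directories.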

There is no support for replace or update, but I could be a bit out of date
on that.
