Querying json files from multiple subdirectories

2020-01-16 Thread Prabhakar Bhosaale
Hi All,

I am new to Apache Drill and am trying to retrieve data from JSON files by
querying the directories.

The directory structure is

transactions --+--> Year2012 --> trans.json
               |
               +--> Year2013 --> trans.json

I would like to query trans.json from both the sub-directories as one table
and then join the resultant table with another table in a single query.
Please help with possible options. thx

Regards

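A minimal sketch of one approach, assuming a dfs workspace that points at the
parent of the transactions directory, plus a hypothetical second table and an
id join key. Given a directory name, Drill reads the files in all of its
subdirectories as one table and exposes each subdirectory name through the
implicit dir0 column:

SELECT t.dir0 AS yr, t.amount, o.name
FROM dfs.root.`transactions` t
JOIN dfs.root.`other_table` o ON t.id = o.id
WHERE t.dir0 IN ('Year2012', 'Year2013');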

Writing storage plugins in Kotlin

2020-01-16 Thread Andy Grove
I've been implementing a storage plugin in Java. This is the first time I have
written any substantial Java in quite a while, and I would love to be writing
this in Kotlin instead!

My plugin is in its own project (I am not building as part of the Drill
project) and I am just adding dependencies for Drill jars that I need.

In theory I shouldn't run into issues, since Kotlin is supposedly 100%
interoperable with Java, but I'm wondering if anyone has tried this yet?

Thanks,

Andy.


Re: Looking for advice on integrating with a custom data source

2020-01-16 Thread Andy Grove
Hi Charles,

I would like to be able to contribute something out of this effort. The PoC
I am working on is quite fluid at the moment but one possible outcome is
that this storage engine ends up supporting Arrow Flight, but I'm not sure
yet.

Andy.

On Wed, Jan 15, 2020 at 7:19 AM Charles Givre  wrote:

> Andy,
> Glad to hear you got it working!!   Can you share what data source you are
> working with?  Is it completely custom to your organization?  If not, would
> you consider submitting this as a pull request?
> Best,
> -- C
>
>
>
> > On Jan 15, 2020, at 9:07 AM, Andy Grove  wrote:
> >
> > And boom! With just 3 extra lines of code to adjust the CBO to make the
> > row count inversely proportional to the number of predicates, my little
> > PoC works :-)
> >
> > Now that I've achieved the instant gratification (relatively speaking!)
> > of making something work, I think it's time to step back and start doing
> > this the right way with the PR you mentioned.
> >
> > I would not have been able to get this working at all without all the
> > fantastic support!
> >
> > Thanks,
> >
> > Andy.
> >
> >
> >
> > On Tue, Jan 14, 2020 at 11:43 PM Paul Rogers 
> > wrote:
> >
> >> Hi Andy,
> >>
> >> Congratulations on making such fast progress!
> >>
> >> The code to do filter pushdowns is rather complex and, it seems, most
> >> plugins copy/paste the same wad of code (with the same bugs). PR 1914
> >> provides a layer that converts the messy Drill logical plan into a nice,
> >> simple set of predicates. You can then pick and choose which to push
> >> down, allowing the framework to do the rest.
> >>
> >> Note that most of the plugins do push-down as part of physical planning.
> >> While this works in most cases, it WILL NOT work if you are doing
> >> push-down in order to shard the scan. For example, in order to divide a
> >> time range up into pieces for a time series scan. The PR thus does
> >> push-down in the logical phase so that we can "do the right thing."
> >>
> >> When you say that getNewWithChildren() is for an earlier instance, it is
> >> very likely because Calcite gave up on your filter-push-down version
> >> because there was no cost reduction.
> >>
> >>
> >> The Wiki page mentioned earlier explains all the copies a bit.
> >> Basically, Drill creates many copies of your GroupScan as it proceeds.
> >> First a "blank" one, then another with projected columns, then another
> >> full copy as Calcite explores planning options, and so on.
> >>
> >> One key trick is that if you implement filter push-down, you MUST
> >> return a lower cost estimate after the push-down than before. Else,
> >> Calcite decides that it is not worth the hassle of doing the push-down
> >> if the costs remain the same. See the Wiki for details. This is what
> >> getScanStats() does: report stats that must get lower as you improve
> >> the scan.
> >>
> >> That is, one cost at the start, a lower cost after projection push-down
> >> (reflecting the fact that we presumably now read less data per row) and
> >> a lower cost again after filter push-down (because we read fewer rows).
> >> There is a "Dummy" storage plugin in PR 1914 that illustrates all of
> >> this.
> >>
> >> Don't worry about getDigest(), it is just Calcite trying to get a label
> >> to use for its internal objects. You will need to implement getString(),
> >> using Drill's "EXPLAIN PLAN" format, so your scan can appear in the text
> >> plan output. EXPLAIN PLAN output is:
> >>
> >> ClassName [field1=x, field2=y]
> >>
> >> There is a little builder in PR 1914 to do this for you.
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >>
> >>On Tuesday, January 14, 2020, 7:07:58 PM PST, Andy Grove <
> >> andygrov...@gmail.com> wrote:
> >>
> >> With some extra debugging I can see that the getNewWithChildren call is
> >> made to an earlier instance of GroupScan and not the instance created by
> >> the filter push-down rule. I'm wondering if this is some kind of
> >> hashCode/equals/toString/getDigest issue?
> >>
> >> On Tue, Jan 14, 2020 at 7:52 PM Andy Grove 
> wrote:
> >>
> >>> I'm now working on predicate push down ... I have a filter rule that is
> >>> correctly extracting the predicates that the backend database supports
> >>> and I am creating a new GroupScan containing these predicates, using
> >>> the Kafka plugin as a reference. I see the GroupScan constructor being
> >>> called after this, with the predicates populated. So far so good ...
> >>> but then I see calls to getDigest, getScanStats, and getNewWithChildren,
> >>> and then I see calls to the GroupScan constructor with the predicates
> >>> missing.
> >>>
> >>> Any pointers on what I might be missing? Is there more magic I need to
> >>> know?
> >>>
> >>> Thanks!
> >>>
> >>> On Sun, Jan 12, 2020 at 5:34 PM Paul Rogers  >
> >>> wrote:
> >>>
>  Hi Andy,
> 
>  Congrats! You are making good progress. Yes, the BatchCreator is a bit of
>  magic: Drill looks for a subclass 

Re: Non-Relational Data

2020-01-16 Thread Paul Rogers
As a follow-up, to avoid us rehashing the same solutions we've discussed for 
some time, here are just a few:

* Provide a very light-weight schema "hint" only for those columns that need 
the hint. In the example, only provide schema for the ambiguous version column. 
If a column tends to have lots of nulls, tell Drill what type it will be when 
data actually appears. Arina has a good start on this approach. (A sketch 
follows this list.)


* Populate the above automagically as queries run. (Ted has long advocated this 
one.)

* Populate the schema by converting existing specs, such as a Swagger spec.

* Make the existing UNION and LIST (repeated UNION) types work fully (whatever 
that means; we'd need type conversion rules for all type combinations.)

* Add true dynamic types. Drill half-way supports vectors that hold Java 
objects. Make this a first-class feature. Auto-convert conflicting fields. Also 
needs type conversion rules for all combinations. (Might be more handy for 
special cases, too heavy-weight for the typical schema-evolution case. Might be 
an easy way to handle complex structures such as images, complex JSON 
structures, etc.)

* (Insert your ideas here.)
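For the first idea above, Drill 1.16's provided-schema support (DRILL-6835) is
one concrete form the hint could take. A minimal sketch, assuming a
hypothetical dfs.tmp.`trans` table whose ambiguous column is version:

CREATE OR REPLACE SCHEMA (version VARCHAR) FOR TABLE dfs.tmp.`trans`;

Readers that honor the provided schema can then coerce early values (including
leading nulls) to VARCHAR instead of guessing a type.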


Point is, we're not short on ideas. What we need to know is what folks want; 
then we can pick the idea that works best.


Thanks,
- Paul

 


Re: Non-Relational Data

2020-01-16 Thread Paul Rogers
Hi Jiang,

Thanks for taking the time to explain your use case. In my experience, the 
scenario you describe is not unusual: the desire to integrate application data 
behind a common SQL interface. IMHO, this is where Drill differs from the "data 
lake guys" such as Impala and Presto. It would be helpful if you can help us 
get the solution right.


You have hit upon one of the key challenges in making this model work: the 
question of how to use SQL to work with a column that has a varying data type. 
Your product version use case is a classic example.


In the non-relational world, types are messy. That's why we have Python: we can 
write code that forces the values to one type or another, we can do "if" 
statements based on types, or we can use "duck typing" to do dynamic method 
dispatch.


But, you want to use the convenience (and speed) of SQL. We want to hide these 
details somehow. So we have to be clever.

Hive (then Presto, Impala and even Drill, when used with HMS) solved the 
problem by requiring a schema stored in HMS. Simple idea, but HMS has become 
heavyweight. With HMS, we can declare that our version field is a VARCHAR and 
conversions can be done at read time (schema-on-read.) Problem solved. HMS 
works best for files. It does not work for the app integration use case: 
reading data from ad-hoc files, accessing APIs and so on.

Schemas really help with distributed systems. Suppose I say SELECT a + b, SQL 
can work out that a is an INT, b is a BIGINT and the result should be a BIGINT. 
Work can be distributed across many independent nodes and all will make the 
same type decisions. Data can then be merged and types will agree. Basic stuff. 
In fact, types are part of the very definition of the relational theory on 
which SQL is based. 


But, if data types vary (no schema, as in Drill without HMS), things get messy. 
Drill can distribute filters. Suppose I say WHERE version = 10. Drillbit 1 
reads the INT-valued version fields, Drillbit 2 reads the VARCHAR valued 
fields. How do we make sure that both nodes make the same decisions?

Later, when data is merged, should there be a common type? What if I say ORDER 
BY version? What is the intended result? Sort the INT values before (or after) 
VARCHAR? Convert VARCHAR to INT (or vice versa)?

Another factor is speed. Python is slow because it does dynamic type handling 
on every operation. Java and C are fast because they are statically typed. 
Similarly, Impala is fast because of static types. Drill tries to be fast by 
having a fixed vector type for each column. But, Drill also tries to be 
flexible, which is when things start to get "interesting."

The crude-but-effective solution, without a schema, is to require users to 
explicitly include the CAST and other statements in every query. It seems this 
was standard Hadoop practice before HMS came along. But, since you want to use 
SQL, you presumably want to shield users from this complexity.
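A minimal sketch of that crude-but-effective style, assuming a hypothetical
products table with a name column and the mixed-type version column from
Jiang's example:

SELECT name, CAST(version AS VARCHAR) AS version
FROM dfs.`products`
WHERE CAST(version AS VARCHAR) = '10'
ORDER BY CAST(version AS VARCHAR);

Every reference to the ambiguous column is wrapped in a CAST, so all fragments
agree on one type, at the cost of repeating the boilerplate in every query.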

See? You have nailed the key challenge we must solve to make the app 
integration idea work.


We can make up lots of things we *could* do. They pretty much fall into three 
buckets:

* Give the user or app developer a way to declare the type so all nodes do the 
same thing. Declare the type either per-query (with a CAST) or per-table (using 
some kind of schema or view.)

* Discover or impose a uniform set of conversion rules that work in all cases. 
(Always convert to VARCHAR, say.)

* Embrace dynamic types. Add dynamic types to Drill to make it more 
Python-like, with "duck typing" to decide, say, that "+" means different things 
for different type combinations. Accept the resulting performance hit.


Before we brainstorm specific alternatives, it would be very helpful to 
understand what you actually *want* to do. What was the intent when the type 
changed? How do you want users to work with such data in a SQL context? How 
would your ideal user experience look?


Thanks,
- Paul

 


Non-Relational Data

2020-01-16 Thread Jiang Wu
Moving the topic on non-relational data to this dedicated thread.  First a
bit of context based on our use case:

* We want to do ad-hoc analysis of data coming from diverse sources like APIs,
document stores, and relational stores.
* Data are not limited to relational structures, e.g. APIs returning complex
object collections.
* Data may change its structure over time, e.g. due to implementation
upgrades.
* We want to use high-level declarative query languages such as SQL.

Various techniques exist to tackle non-relational data analysis, such as
mapping to a relational schema or running custom code in a distributed compute
cluster (map-reduce, Spark jobs, etc.) on blob data.  These have their
drawbacks: data latency and structure-transformation effort for the former,
and query latency and compute cost for the latter.

We built a columnar data store for non-relational data without a pre-defined
schema.  For querying this data, technologies like Drill made it almost
possible to directly work with non-relational data using array and map data
types.  However, we feel more can be done to truly make non-relational data
a first-class citizen:

1) functions on array and map -- e.g. sizeOf(person.addresses) where
person.addresses is an array.  Using FLATTEN is not the same as working
with complex objects directly. (A sketch follows below.)
2) heterogeneous types -- better handling of heterogeneous data types within
the same column, e.g. product.version started as numbers, but some are
strings.  Treating every value as a String is a workaround.
3) better storage plugin support for complex types -- we had to regenerate
objects from our columnar vectors to give to Drill, rather than
feeding vectors directly.

I don't think any of these are easy to do.  Much research and thinking will
be needed for a cohesive solution.

-- Jiang
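On item 1, Drill does ship a few array helpers today. A small sketch using the
built-in REPEATED_COUNT function on Jiang's person.addresses example (the file
path is hypothetical):

SELECT t.person.name AS name,
       REPEATED_COUNT(t.person.addresses) AS address_count
FROM dfs.`/data/people.json` t;

This covers the sizeOf case, but a richer set of array and map functions is
exactly the gap described above.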


Re: About integration of drill and arrow

2020-01-16 Thread Jiang Wu
We experienced the same issue: the first "null" value trips up Drill's
schema-on-demand logic, which assumes a certain data type for "null", causing
a subsequent exception when a non-null value is encountered.  We added logic
inside our storage plugin to make this work.  But good to hear there is a
general solution being worked on.  Logically, one can defer the type
determination until the first non-null value.

Regarding "If Drill had a schema", I do fully understand the need for
schemas for purposes of vectorization.  Still having schema on demand has
been a major benefit for our use case.  So the tricky part is how to
achieve the best of both worlds.

-- Jiang


On Wed, Jan 15, 2020 at 7:09 PM Paul Rogers 
wrote:

> Hi Ted,
>
> Thanks much for the feedback! Good test cases indeed. The good news is
> that we're close to finishing a "V2" JSON reader that smooths over a few
> more JSON quirks like the "first column null" issue that can cause problems:
>
> {n: 1, v: null}
> {n: 2, v: "Gotcha! Wasn't Int, is actually VARCHAR!"}
>
> Glad your queries work. You gave an example that had fooled me multiple
> times:
>
> select nest.u from dfs.root.`/Users/tdunning/x.json`;
>
>
> The trick here is that Drill has no schema. All the parser can tell is,
> "hey, I've got a two-part name, `nest.u`. For me a two part name means
> schema.table or table.column, so, since `nest` isn't a schema, it must be a
> table. Oh, look, no such table exists. FAIL!" Using a three-part name works
> (IIRC):
>
> select t.nest.u from dfs.root.`/Users/tdunning/x.json` t;
>
>
> Now Drill sees that `t` is a table name, and works its way down from there.
>
>
> If Drill had a schema, then the planner could first check if `nest` is a
> schema, then if it is a table, then if it is a structured field in the
> query. Impala can do this because it has a schema; Drill can't. We can hope
> that, with the new schema work being added to Drill, that your query will
> "do the right thing" in the future.
>
> Adding `columns` to your query won't help: the `columns` name is valid in
> only one place: when working with CSV (or, more generally, delimited) data
> with no headers.
>
> This gets back to Jiang's point: we could really use better/more
> documentation. We're good at the bare basics, "such-and-so syntax exists",
> but we're not as good at explaining how to solve problems using Drill
> features. The Learning Apache Drill book tries to address some holes.
> Clearly, if you have a hard time with this, being part of the team that
> created Drill, we've got a bit of work to do! (To be honest, neither Impala
> nor Presto are much better in the "how to" department.)
>
>
> Additional use cases/frustrations are very welcome as you find them.
>
>
> Thanks,
> - Paul
>
>
>
> On Wednesday, January 15, 2020, 3:44:09 PM PST, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  On Wed, Jan 15, 2020 at 2:58 PM Paul Rogers 
> wrote:
>
> > ...
> >
> > For example, Ted, you mention lack of nullability on structure members.
> > But, Drill represents structures as MAPs, and MAPs can have nullable
> > members. So, there is likely more to your request than the short summary
> > suggests. Perhaps you can help us understand this a bit more.
> >
>
> This was quite a while ago.
>
> I was reading JSON data with substructures of variable form.
>
> I think, however, that this impression is old news. I just tried it and it
> works the way I wanted.
>
> Here is my data:
>
> {"top":"a","nest":{"u":1, "v":"other"}}
> {"top":"b","nest":{"v":"this", "w":"that"}}
>
> And here are some queries that behave just the way that I wanted:
>
> apache drill> select * from dfs.root.`/Users/tdunning/x.json`;
>
> +-----+-------------------------+
> | top | nest                    |
> +-----+-------------------------+
> | a   | {"u":1,"v":"other"}     |
> | b   | {"v":"this","w":"that"} |
> +-----+-------------------------+
>
> 2 rows selected (0.079 seconds)
>
> apache drill> select nest from dfs.root.`/Users/tdunning/x.json`;
>
> +-------------------------+
> | nest                    |
> +-------------------------+
> | {"u":1,"v":"other"}     |
> | {"v":"this","w":"that"} |
> +-------------------------+
>
> 2 rows selected (0.114 seconds)
>
> apache drill> select nest.u from dfs.root.`/Users/tdunning/x.json`;
>
> Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 11: Table
> 'nest' not found
>
> [Error Id: b2100faf-adf7-453e-957f-56726b96e06f ] (state=,code=0)
>
> apache drill> select columns.nest.u from dfs.root.`/Users/tdunning/x.json`;
>
> Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 14: Table
> 'columns' not found
>
> [Error Id: a793e6bd-c2ed-477a-9f23-70d67b2b85df ] (state=,code=0)
>
> apache drill> select x.nest.u from dfs.root.`/Users/tdunning/x.json` x;
>
> +--------+
> | EXPR$0 |
> +--------+
> | 1      |
> | null   |
> +--------+
>
> 2 rows selected (0.126 seconds)
> apache drill>
>


Re: RDBMS Storage Plugin Configurations

2020-01-16 Thread Jiang Wu
Thank you all for the replies!  Yes, looks like DRILL-7467 is what I am
looking for.  The use case is to do an occasional join across two storage
plugins, with one of them being the JDBC storage plugin.  Afterward, we would
like to release the JDBC connection that was used.
-- Jiang
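A sketch of that kind of cross-plugin join, with hypothetical plugin, table,
and column names (a JDBC-backed pg plugin joined against a JSON file):

SELECT o.order_id, c.name
FROM pg.public.orders o
JOIN dfs.`/data/customers.json` c ON o.customer_id = c.id;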





Re: Regarding EXCEPT clause of drill

2020-01-16 Thread Ted Dunning
I think that set subtraction is considerably harder than union. It is
unlikely to be a simple hack to an operator.

But it should be possible to build planner rules that expand an EXCEPT
expression into a plan that does an outer join and filters away results that
have non-null right-hand sides.

The real problem is that virtually nobody even knows that EXCEPT exists.
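A sketch of that rewrite in plain SQL, assuming hypothetical single-column
tables a and b (and glossing over the NULL-handling details a real planner
rule would need; EXCEPT also has set semantics, hence the DISTINCT):

SELECT DISTINCT a.id
FROM a
LEFT OUTER JOIN b ON a.id = b.id
WHERE b.id IS NULL;

The comments on DRILL-4232 describe equivalent workarounds along these lines,
e.g. using NOT EXISTS.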







Re: Regarding EXCEPT clause of drill

2020-01-16 Thread Paul Rogers
Thanks Arina.

I believe that EXCEPT is set subtraction, A - B. Drill does (I believe) already 
support UNION distinct, which is set union (A U B). If so, it would be a nice 
little project for someone to implement EXCEPT by reversing the polarity of 
comparison: discard the non-dups rather than the dups. I'm sure it is a bit 
harder than that, but still would be a good project for someone who wants to 
get experience with query engine internals.

Thanks,
- Paul

 

On Thursday, January 16, 2020, 04:16:49 AM PST, Arina Yelchiyeva 
 wrote:  
 
Paul is right, Drill does not support the EXCEPT syntax.
In the comments on this JIRA you can find ways to rewrite the query to achieve 
the same result: https://issues.apache.org/jira/browse/DRILL-4232 


Kind regards,
Arina

> On Jan 16, 2020, at 1:09 AM, Paul Rogers  wrote:
> 
> Hi Itisha,
> 
> Welcome to the Drill user mail list!
> 
> For those readers (like me) who are not familiar with the keyword, EXCEPT 
> works like UNION to join two SELECT statements: SELECT ... EXCEPT SELECT ... 
> See [1]. Seems pretty useful.
> 
> A quick check of our parser source code suggests that Drill does not support 
> the EXCEPT syntax. Drill does not even treat EXCEPT as a keyword.
> 
> Please file a JIRA ticket to request this feature.
> 
> Thanks,
> - Paul
> 
> 
> [1] https://www.tutorialspoint.com/sql/sql-except-clause.htm
> 
> 
> 
>    On Wednesday, January 15, 2020, 12:50:36 AM PST, itisha Gupta 
> wrote:  
> 
> Hey team,
> 
> I am trying to use the EXCEPT operator of Apache Drill but it's not working.
> Does Apache Drill support EXCEPT?
> 
> Regards,
> Itisha
  

Invitation: Drill Hangout @ Fri Jan 24, 2020 9:30am - 10:30am (EST) (user@drill.apache.org)

2020-01-16 Thread charles . givre
Hello Drillers,
I'd like to propose the following time for a Drill Hangout.

When: Friday, January 24, 2020, 9:30am - 10:30am (EST)
Where: Google Hangout
Organizer: charles.gi...@gtkcyber.com

Topics to discuss include:
1. Drill / Arrow Integration
2. Future integrations & functionality


Re: RDBMS Storage Plugin Configurations

2020-01-16 Thread Arina Yelchiyeva
In the scope of DRILL-7467, one of the enhancements will be the ability to 
control basic data source setup through the storage plugin config.
You will be able to indicate basic data source parameters and their values: 
https://commons.apache.org/proper/commons-dbcp/configuration.html 


The feature will be available in Drill 1.18, or in the master branch as soon 
as it is committed.

Kind regards,
Arina

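A purely hypothetical sketch of how such a plugin config might look once
DRILL-7467 lands; the sourceParameters block is an assumption, and the
parameter names come from the commons-dbcp page above:

{
  "type": "jdbc",
  "driver": "org.postgresql.Driver",
  "url": "jdbc:postgresql://localhost/mydb",
  "username": "drill",
  "password": "drill",
  "sourceParameters": {
    "maxTotal": 8,
    "maxIdle": 2,
    "minIdle": 0
  },
  "enabled": true
}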
> On Jan 16, 2020, at 1:28 AM, Charles Givre  wrote:
> 
> Hi Jiang, 
> Welcome to Drill!
> Just as an FYI, there are several improvements underway for the JDBC plugin:
> https://issues.apache.org/jira/browse/DRILL-7467 
> 
> https://issues.apache.org/jira/projects/DRILL/issues/DRILL-7490 
> 
> 
> With respect to the non-relational model, I'd echo Ted's question and ask 
> what are you looking for specifically?  There is work underway to get Drill 
> to natively support additional non-relational source systems as well as the 
> ability to natively query rest endpoints. 
> 
> Best,
> -- C
> 
> 
>> On Jan 15, 2020, at 5:51 PM, Paul Rogers  wrote:
>> 
>> Hi Jiang,
>> 
>> Welcome to the Drill mailing list.
>> 
>> I think you may be making some assumptions about how Drill works, perhaps 
>> based on how other DB-driven applications work.
>> 
>> Drill is not primarily a front-end for an RDBMS. Instead, it is primarily 
>> designed to scan distributed data as fast as possible to extract records of 
>> interest. Drill does support JDBC data sources, but this is not the main use 
>> case.
>> 
>> In Drill, each query is stand-alone: Drill opens connections as needed to 
>> whatever data source you use; reads data, and releases all resources. Since 
>> Drill is distributed, this happens on each node. Since Drill is 
>> multi-threaded, this work also happens for each "minor fragment" (thread of 
>> execution) on each node. Drill is also multi-user; each user might have 
>> their own DB security restrictions.
>> 
>> This makes sense: if we want to read at maximum speed across 10 minor 
>> fragments (say) then all 10 need their own DB connections and all will try 
>> to keep those connections 100% busy.
>> 
>> As a result, Drill has no DB connection pool: not within a query and not 
>> across queries. So, there is no idle timeout. The maximum number of 
>> connections is set by the maximum "slice width" (number of fragments per 
>> node) and the total number of nodes. Slice width is, by default, 70% of your 
>> CPU count. So, if you have 10 nodes with 8 cores each, you will have roughly 
>> 60 open DB connections for the duration of the query (assuming that the DB 
>> storage plugin knows how to shard queries across all those minor fragments. 
>> I'm not sure that the JDBC storage plugin knows how to do this. Can anyone 
>> clarify this point?)
>> 
>> It sounds like you have a particular use-case in mind that might benefit 
>> from connection caching. Can you share that use case to help us understand? 
>> And, of course, Drill is open source; if you find you need this ability, it 
>> can certainly be added.
>> 
>> Drillers: please offer corrections if I've overlooked something; I'm not 
>> super familiar with the details of the JDBC data source.
>> 
>> Thanks,
>> - Paul
>> 
>> 
>> 
>>   On Wednesday, January 15, 2020, 01:49:21 PM PST, Jiang Wu 
>>  wrote:  
>> 
>> Question on the RDBMS Storage Plugin: is it possible to set various options
>> for the database connection pool used for this storage plugin?  For
>> example, max number of connections, idle timeout, etc?
>> 
>> Thanks.
>> 
>> -- Jiang
>