Re: About integration of drill and arrow

2020-01-16 Thread Jiang Wu
We experienced the same issue: the first "null" value trips up Drill's
schema-on-demand logic, which assumes a certain data type for "null" and then
throws an exception when a non-null value of a different type is encountered.
We added logic inside our storage plugin to make this work, but it is good to
hear that a general solution is being worked on.  Logically, one can defer the
type determination until the first non-null value.
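
That deferral can be sketched in a few lines (a toy illustration with
hypothetical names, not our actual plugin code or Drill's vector code):

```python
from typing import Any, List, Optional

class DeferredColumn:
    """Toy column that defers its type until the first non-null value,
    rather than guessing a type when the first value seen is null."""

    def __init__(self) -> None:
        self.type_name: Optional[str] = None  # unknown while only nulls seen
        self.values: List[Any] = []

    def append(self, value: Any) -> None:
        if value is None:
            self.values.append(None)  # no type committed yet
            return
        observed = type(value).__name__
        if self.type_name is None:
            self.type_name = observed  # first non-null fixes the type
        elif self.type_name != observed:
            raise TypeError(f"schema change: {self.type_name} -> {observed}")
        self.values.append(value)

col = DeferredColumn()
col.append(None)       # type still undecided -- no INT guess to regret
col.append("gotcha")   # column now resolves to str (VARCHAR-like)
```

The point is only that a leading null never commits the column to a type, so
the "first column null" trap described later in this thread cannot occur.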

Regarding "If Drill had a schema": I fully understand the need for schemas
for purposes of vectorization.  Still, schema on demand has been a major
benefit for our use case.  So the tricky part is how to achieve the best of
both worlds.

-- Jiang


On Wed, Jan 15, 2020 at 7:09 PM Paul Rogers 
wrote:

> Hi Ted,
>
> Thanks much for the feedback! Good test cases indeed. The good news is
> that we're close to finishing a "V2" JSON reader that smooths over a few
> more JSON quirks like the "first column null" issue that can cause problems:
>
> {n: 1, v: null}{n: 2, v: "Gotcha! Wasn't Int, is actually VARCHAR!"}
>
> Glad your queries work. You gave an example that had fooled me multiple
> times:
>
> select nest.u from dfs.root.`/Users/tdunning/x.json`;
>
>
> The trick here is that Drill has no schema. All the parser can tell is,
> "hey, I've got a two-part name, `nest.u`. For me a two-part name means
> schema.table or table.column, so, since `nest` isn't a schema, it must be a
> table. Oh, look, no such table exists. FAIL!" Using a three-part name works
> (IIRC):
>
> select t.nest.u from dfs.root.`/Users/tdunning/x.json` t;
>
>
> Now Drill sees that `t` is a table name, and works its way down from there.
>
>
> If Drill had a schema, then the planner could first check if `nest` is a
> schema, then if it is a table, then if it is a structured field in the
> query. Impala can do this because it has a schema; Drill can't. We can hope
> that, with the new schema work being added to Drill, that your query will
> "do the right thing" in the future.
>
> Adding `columns` to your query won't help: the `columns` name is valid in
> only one place: when working with CSV (or, more generally, delimited) data
> with no headers.
>
> This gets back to Jiang's point: we could really use better/more
> documentation. We're good at the bare basics, "such-and-so syntax exists",
> but we're not as good at explaining how to solve problems using Drill
> features. The Learning Apache Drill book tries to address some holes.
> Clearly, if you have a hard time with this, being part of the team that
> created Drill, we've got a bit of work to do! (To be honest, neither Impala
> nor Presto are much better in the "how to" department.)
>
>
> Additional use cases/frustrations are very welcome as you find them.
>
>
> Thanks,
> - Paul
>
>
>
> On Wednesday, January 15, 2020, 3:44:09 PM PST, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  On Wed, Jan 15, 2020 at 2:58 PM Paul Rogers 
> wrote:
>
> > ...
> >
> > For example, Ted, you mention lack of nullability on structure members.
> > But, Drill represents structures as MAPs, and MAPs can have nullable
> > members. So, there is likely more to your request than the short summary
> > suggests. Perhaps you can help us understand this a bit more.
> >
>
> This was quite a while ago.
>
> I was reading JSON data with substructures of variable form.
>
> I think, however, that this impression is old news. I just tried it and it
> works the way I wanted.
>
> Here is my data:
>
> {"top":"a","nest":{"u":1, "v":"other"}}
> {"top":"b","nest":{"v":"this", "w":"that"}}
>
> And here are some queries that behave just the way that I wanted:
>
> apache drill> select * from dfs.root.`/Users/tdunning/x.json`;
>
> +-----+-------------------------+
> | top | nest                    |
> +-----+-------------------------+
> | a   | {"u":1,"v":"other"}     |
> | b   | {"v":"this","w":"that"} |
> +-----+-------------------------+
>
> 2 rows selected (0.079 seconds)
>
> apache drill> select nest from dfs.root.`/Users/tdunning/x.json`;
>
> +-------------------------+
> | nest                    |
> +-------------------------+
> | {"u":1,"v":"other"}     |
> | {"v":"this","w":"that"} |
> +-------------------------+
>
> 2 rows selected (0.114 seconds)
>
> apache drill> select nest.u from dfs.root.`/Users/tdunning/x.json`;
>
> Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 11: Table
> 'nest' not found
>
> [Error Id: b2100faf-adf7-453e-957f-56726b96e06f ] (state=,code=0)
>
> apache drill> select columns.nest.u from dfs.root.
> `/Users/tdunning/x.json`;
>
> Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 14: Table
> 'columns' not found
>
> [Error Id: a793e6bd-c2ed-477a-9f23-70d67b2b85df ] (state=,code=0)
>
> apache drill> select x.nest.u from dfs.root.`/Users/tdunning/x.json` x;
>
> +--------+
> | EXPR$0 |
> +--------+
> | 1      |
> | null   |
> +--------+
>
> 2 rows selected (0.126 seconds)
> apache drill>
>


Re: About integration of drill and arrow

2020-01-15 Thread Paul Rogers
Hi Ted,

Thanks much for the feedback! Good test cases indeed. The good news is that 
we're close to finishing a "V2" JSON reader that smooths over a few more JSON 
quirks like the "first column null" issue that can cause problems:

{n: 1, v: null}{n: 2, v: "Gotcha! Wasn't Int, is actually VARCHAR!"}

Glad your queries work. You gave an example that had fooled me multiple times:

select nest.u from dfs.root.`/Users/tdunning/x.json`;


The trick here is that Drill has no schema. All the parser can tell is, "hey,
I've got a two-part name, `nest.u`. For me a two-part name means schema.table
or table.column, so, since `nest` isn't a schema, it must be a table. Oh, look,
no such table exists. FAIL!" Using a three-part name works (IIRC):

select t.nest.u from dfs.root.`/Users/tdunning/x.json` t;


Now Drill sees that `t` is a table name, and works its way down from there.
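
The resolution order described here can be mimicked with a toy resolver
(purely illustrative; Drill's actual planner is Calcite-based and far more
involved, and the names below are hypothetical):

```python
def resolve(parts, schemas, tables):
    """Toy dotted-name resolver mirroring the behavior described above:
    with no schema information, a two-part name like nest.u is read as
    table.column, so the lookup of table `nest` fails; t.nest.u works
    because `t` is a known table alias."""
    head = parts[0]
    if head in schemas:
        return ("schema", parts[1:])       # schema.table...
    if head in tables:
        return ("column-path", parts[1:])  # table.column... -> walk down
    raise LookupError(f"Table '{head}' not found")

schemas, tables = {"dfs"}, {"t"}              # alias t from "... x.json` t"
resolve(["t", "nest", "u"], schemas, tables)  # -> ("column-path", ["nest", "u"])
```

Calling `resolve(["nest", "u"], ...)` raises the analogue of the
"Table 'nest' not found" error shown later in the thread.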


If Drill had a schema, then the planner could first check if `nest` is a 
schema, then if it is a table, then if it is a structured field in the query. 
Impala can do this because it has a schema; Drill can't. We can hope that, with 
the new schema work being added to Drill, that your query will "do the right 
thing" in the future.

Adding `columns` to your query won't help: the `columns` name is valid in only 
one place: when working with CSV (or, more generally, delimited) data with no 
headers.

This gets back to Jiang's point: we could really use better/more documentation. 
We're good at the bare basics, "such-and-so syntax exists", but we're not as 
good at explaining how to solve problems using Drill features. The Learning 
Apache Drill book tries to address some holes. Clearly, if you have a hard time 
with this, being part of the team that created Drill, we've got a bit of work 
to do! (To be honest, neither Impala nor Presto are much better in the "how to" 
department.)


Additional use cases/frustrations are very welcome as you find them.


Thanks,
- Paul

 

On Wednesday, January 15, 2020, 3:44:09 PM PST, Ted Dunning 
 wrote:  
 
 On Wed, Jan 15, 2020 at 2:58 PM Paul Rogers 
wrote:

> ...
>
> For example, Ted, you mention lack of nullability on structure members.
> But, Drill represents structures as MAPs, and MAPs can have nullable
> members. So, there is likely more to your request than the short summary
> suggests. Perhaps you can help us understand this a bit more.
>

This was quite a while ago.

I was reading JSON data with substructures of variable form.

I think, however, that this impression is old news. I just tried it and it
works the way I wanted.

Here is my data:

{"top":"a","nest":{"u":1, "v":"other"}}
{"top":"b","nest":{"v":"this", "w":"that"}}

And here are some queries that behave just the way that I wanted:

apache drill> select * from dfs.root.`/Users/tdunning/x.json`;

+-----+-------------------------+
| top | nest                    |
+-----+-------------------------+
| a   | {"u":1,"v":"other"}     |
| b   | {"v":"this","w":"that"} |
+-----+-------------------------+

2 rows selected (0.079 seconds)

apache drill> select nest from dfs.root.`/Users/tdunning/x.json`;

+-------------------------+
| nest                    |
+-------------------------+
| {"u":1,"v":"other"}     |
| {"v":"this","w":"that"} |
+-------------------------+

2 rows selected (0.114 seconds)

apache drill> select nest.u from dfs.root.`/Users/tdunning/x.json`;

Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 11: Table
'nest' not found

[Error Id: b2100faf-adf7-453e-957f-56726b96e06f ] (state=,code=0)

apache drill> select columns.nest.u from dfs.root.
`/Users/tdunning/x.json`;

Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 14: Table
'columns' not found

[Error Id: a793e6bd-c2ed-477a-9f23-70d67b2b85df ] (state=,code=0)

apache drill> select x.nest.u from dfs.root.`/Users/tdunning/x.json` x;

+--------+
| EXPR$0 |
+--------+
| 1      |
| null   |
+--------+

2 rows selected (0.126 seconds)
apache drill>
  

Re: About integration of drill and arrow

2020-01-15 Thread Ted Dunning
On Wed, Jan 15, 2020 at 2:58 PM Paul Rogers 
wrote:

> ...
>
> For example, Ted, you mention lack of nullability on structure members.
> But, Drill represents structures as MAPs, and MAPs can have nullable
> members. So, there is likely more to your request than the short summary
> suggests. Perhaps you can help us understand this a bit more.
>

This was quite a while ago.

I was reading JSON data with substructures of variable form.

I think, however, that this impression is old news. I just tried it and it
works the way I wanted.

Here is my data:

{"top":"a","nest":{"u":1, "v":"other"}}
{"top":"b","nest":{"v":"this", "w":"that"}}

And here are some queries that behave just the way that I wanted:

apache drill> select * from dfs.root.`/Users/tdunning/x.json`;

+-----+-------------------------+
| top | nest                    |
+-----+-------------------------+
| a   | {"u":1,"v":"other"}     |
| b   | {"v":"this","w":"that"} |
+-----+-------------------------+

2 rows selected (0.079 seconds)

apache drill> select nest from dfs.root.`/Users/tdunning/x.json`;

+-------------------------+
| nest                    |
+-------------------------+
| {"u":1,"v":"other"}     |
| {"v":"this","w":"that"} |
+-------------------------+

2 rows selected (0.114 seconds)

apache drill> select nest.u from dfs.root.`/Users/tdunning/x.json`;

Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 11: Table
'nest' not found

[Error Id: b2100faf-adf7-453e-957f-56726b96e06f ] (state=,code=0)

apache drill> select columns.nest.u from dfs.root.
`/Users/tdunning/x.json`;

Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 14: Table
'columns' not found

[Error Id: a793e6bd-c2ed-477a-9f23-70d67b2b85df ] (state=,code=0)

apache drill> select x.nest.u from dfs.root.`/Users/tdunning/x.json` x;

+--------+
| EXPR$0 |
+--------+
| 1      |
| null   |
+--------+

2 rows selected (0.126 seconds)
apache drill>
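
For comparison, reading the same two records by hand shows the same behavior
as the last query above: a missing nested key surfaces as null (plain Python,
not Drill; just the data from this message):

```python
import json

records = [
    '{"top":"a","nest":{"u":1, "v":"other"}}',
    '{"top":"b","nest":{"v":"this", "w":"that"}}',
]

# Analogue of: select x.nest.u from ... x
# A missing "u" in the second record surfaces as None (SQL null).
values = [json.loads(line).get("nest", {}).get("u") for line in records]
print(values)  # [1, None]
```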


Re: About integration of drill and arrow

2020-01-15 Thread Charles Givre

Jiang, 
Thanks again for the feedback. See inline responses.



> On Jan 15, 2020, at 5:30 PM, Jiang Wu  wrote:
> 
> An interesting set of perspectives.  The market has many systems similar to
> Drill dealing with the relational data model.  However, there is a large set
> of non-relational data from various APIs.  An efficient and extensible
> query engine for this type of non-relational, schema-on-demand data is what
> we are looking for.
> 
> Here are our perspectives on developing and using Drill:
> 
> 1) Schema on-demand and non-relational model: this is the primary reason.
> We use Drill to interface with a schema-less columnar object store, where
> objects in a collection don't need to have uniform schema.
> 2) Small footprint: we use both embedded and clustered modes.

Thank you for sharing this.  It is really good to know what features users 
actually find useful. 


> 
> What we find lacking in Drill:
> 
> 1) Support for the non-relational data model is still very limited, e.g.,
> lacking functions that work directly on non-relational values.

Again, if you could tell the community what you are looking for, we can 
probably help you out.  I've found that, unfortunately (and this relates to 
your second point), there is often a non-obvious way of getting Drill to do what 
you want it to do.  Drill is EXTREMELY flexible, and often there is a way of 
getting the job done. 

> 2) Documentation.  It requires a lot of expertise and experience to figure
> out how things work.

Completely agree with you here.  There is an O'Reilly book (Learning Apache 
Drill: https://amzn.to/2srhNs8) by me and Paul Rogers.  I believe the book was 
translated into Chinese as well, but I'm not sure of the current status.  Paul 
and I have written a few tutorials about developing storage plugins, format 
plugins, as well as user-defined functions. 

> 3) Not widely adopted, causing issues with finding experts to continue our
> work.

If you are looking for expert-level assistance, training, or consulting, I'd be 
willing to help you out.  Please email me at cgi...@apache.org and we can take 
that conversation off the alias.

> 
> -- Jiang
> 
> 
> On Fri, Jan 10, 2020 at 3:48 AM Igor Guzenko 
> wrote:
> 
>> ---------- Forwarded message ---------
>> From: Igor Guzenko 
>> Date: Fri, Jan 10, 2020 at 1:46 PM
>> Subject: Re: About integration of drill and arrow
>> To: dev 
>> 
>> 
>> Hello Drill Developers and Drill Users,
>> 
>> This discussion started as a migration to Arrow but uncovered questions of
>> strategic plans for moving towards Apache Drill 2.0.
>> Below are my personal thoughts on what we, as developers, should do to
>> offer Drill users a better experience:
>> 
>> 1. Highly performant bulk insertions into as many data sources as possible.
>> There is a whole bunch of different tools for data pipelining to use...
>> But why should people who know SQL spend time learning something new
>> simply to move data between tools?
>> 
>> 2. Improve the efficiency of memory management (EVF, resource management,
>> improved costs planning using meta store, etc.). Since we're dealing with
>> big data alongside other tools installed on data nodes we should utilize
>> memory very economically and effectively.
>> 
>> 3. Make integration with all other tools and formats as stable as possible.
>> The high amount of bugs in the area tells that we have lots to improve.
>> Every user is happy when he gets a tool and it simply works as expected.
>> Also, analyze user requirements and provide integration with the most
>> popular new tools.  Querying a high variety of
>> data sources was and still is one of the biggest selling points.
>> 
>> 4. Make code highly extensible and extremely friendly for contributions. No
>> one wants to spend years learning just to make a contribution. This is
>> why I want to see a lot of modules that are highly cohesive and define
>> clear APIs for interaction with each other. This is also about paying old
>> technical debts related to fat JDBC client, copy of web server in Drill on
>> YARN, mixing everything in exec module, etc.
>> 
>> 5. Focus on performance improvements of every component, from query
>> planning to execution.
>> 
>> These are my thoughts from a developer's perspective. Since I'm just a
>> developer from Ukraine and far, far away from Drill users, I believe that
>> Charles Givre is the one who can build a strong Drill user community and
>> collect their requirements for us.
>> 
>> 
>> As for Volodymyr's suggestion about 

Re: About integration of drill and arrow

2020-01-15 Thread Paul Rogers
Hi Ted and Jiang,

Thanks much for sharing your actual needs. As Ted noted, it can be VERY hard to 
learn what users need. Not much different from the problem that a product 
manager has on a commercial product.

A good place to start on these issues is to file a JIRA. Of particular value is 
the use case: the background of how the feature would be used.

For example, Ted, you mention lack of nullability on structure members. But, 
Drill represents structures as MAPs, and MAPs can have nullable members. So, 
there is likely more to your request than the short summary suggests. Perhaps 
you can help us understand this a bit more.

Also, Jiang, it would be helpful to understand what you mean by non-relational 
data. One of our ongoing questions is this: Drill is based on SQL, and SQL 
works with relational data. How might you use (relational) SQL to work with 
non-relational data? Maybe flatten tree-structured data into a flat table 
(lateral join was added for this). It would be super helpful if you could help 
us understand this a bit more.
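
The flattening idea can be illustrated with a tiny example (hypothetical data;
Drill exposes this at the SQL level via FLATTEN and lateral join):

```python
# Tree-structured rows: each order carries a nested list of items.
orders = [
    {"id": 1, "items": [{"sku": "a"}, {"sku": "b"}]},
    {"id": 2, "items": [{"sku": "c"}]},
]

# Flatten: one output row per nested element, parent columns repeated --
# roughly what a lateral join / FLATTEN produces.
rows = [
    {"id": o["id"], "sku": it["sku"]}
    for o in orders
    for it in o["items"]
]
print(rows)  # three flat rows: (1, a), (1, b), (2, c)
```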

Thanks,
- Paul

 

On Wednesday, January 15, 2020, 02:38:12 PM PST, Ted Dunning 
 wrote:  
 
 Jiang,

It is sooo cool to hear from actual users in the real world.

I can confirm that I have had real problems using Drill on nested data.
My particular problem wasn't lack of functions, however. It had to do with
the fact that, without nullable members of structures, I couldn't tell when
fields were missing.



On Wed, Jan 15, 2020 at 2:31 PM Jiang Wu 
wrote:

> An interesting set of perspectives.  The market has many systems similar to
> Drill dealing with the relational data model.  However, there is a large set
> of non-relational data from various APIs.  An efficient and extensible
> query engine for this type of non-relational, schema-on-demand data is what
> we are looking for.
>
> Here are our perspectives on developing and using Drill:
>
> 1) Schema on-demand and non-relational model: this is the primary reason.
> We use Drill to interface with a schema-less columnar object store, where
> objects in a collection don't need to have uniform schema.
> 2) Small footprint: we use both embedded and clustered modes.
>
> What we find lacking in Drill:
>
> 1) Support for the non-relational data model is still very limited, e.g.,
> lacking functions that work directly on non-relational values.
> 2) Documentation.  It requires a lot of expertise and experience to figure
> out how things work.
> 3) Not widely adopted, causing issues with finding experts to continue our
> work.
>
> -- Jiang
>
>
> On Fri, Jan 10, 2020 at 3:48 AM Igor Guzenko 
> wrote:
>
> > ---------- Forwarded message ---------
> > From: Igor Guzenko 
> > Date: Fri, Jan 10, 2020 at 1:46 PM
> > Subject: Re: About integration of drill and arrow
> > To: dev 
> >
> >
> > Hello Drill Developers and Drill Users,
> >
> > This discussion started as a migration to Arrow but uncovered questions of
> > strategic plans for moving towards Apache Drill 2.0.
> > Below are my personal thoughts on what we, as developers, should do to
> > offer Drill users a better experience:
> >
> > 1. Highly performant bulk insertions into as many data sources as possible.
> > There is a whole bunch of different tools for data pipelining to use...
> > But why should people who know SQL spend time learning something new
> > simply to move data between tools?
> >
> > 2. Improve the efficiency of memory management (EVF, resource management,
> > improved costs planning using meta store, etc.). Since we're dealing with
> > big data alongside other tools installed on data nodes we should utilize
> > memory very economically and effectively.
> >
> > 3. Make integration with all other tools and formats as stable as
> possible.
> > The high amount of bugs in the area tells that we have lots to improve.
> > Every user is happy when he gets a tool and it simply works as expected.
> > Also, analyze user requirements and provide integration with the most
> > popular new tools.  Querying a high variety of
> > data sources was and still is one of the biggest selling points.
> >
> > 4. Make code highly extensible and extremely friendly for contributions.
> > No one wants to spend years learning just to make a contribution. This is
> > why I want to see a lot of modules that are highly cohesive and define
> > clear APIs for interaction with each other. This is also about paying old
> > technical debts related to fat JDBC client, copy of web server in Drill
> on
> > YARN, mixing everything in exec module, etc.
> >
> > 5. Focus on performance improvements of every component, from query
> > planning to execution.
> >

Re: About integration of drill and arrow

2020-01-15 Thread Ted Dunning
Jiang,

It is sooo cool to hear from actual users in the real world.

I can confirm that I have had real problems using Drill on nested data.
My particular problem wasn't lack of functions, however. It had to do with
the fact that, without nullable members of structures, I couldn't tell when
fields were missing.



On Wed, Jan 15, 2020 at 2:31 PM Jiang Wu 
wrote:

> An interesting set of perspectives.  The market has many systems similar to
> Drill dealing with the relational data model.  However, there is a large set
> of non-relational data from various APIs.  An efficient and extensible
> query engine for this type of non-relational, schema-on-demand data is what
> we are looking for.
>
> Here are our perspectives on developing and using Drill:
>
> 1) Schema on-demand and non-relational model: this is the primary reason.
> We use Drill to interface with a schema-less columnar object store, where
> objects in a collection don't need to have uniform schema.
> 2) Small footprint: we use both embedded and clustered modes.
>
> What we find lacking in Drill:
>
> 1) Support for the non-relational data model is still very limited, e.g.,
> lacking functions that work directly on non-relational values.
> 2) Documentation.  It requires a lot of expertise and experience to figure
> out how things work.
> 3) Not widely adopted, causing issues with finding experts to continue our
> work.
>
> -- Jiang
>
>
> On Fri, Jan 10, 2020 at 3:48 AM Igor Guzenko 
> wrote:
>
> > ---------- Forwarded message ---------
> > From: Igor Guzenko 
> > Date: Fri, Jan 10, 2020 at 1:46 PM
> > Subject: Re: About integration of drill and arrow
> > To: dev 
> >
> >
> > Hello Drill Developers and Drill Users,
> >
> > This discussion started as a migration to Arrow but uncovered questions of
> > strategic plans for moving towards Apache Drill 2.0.
> > Below are my personal thoughts on what we, as developers, should do to
> > offer Drill users a better experience:
> >
> > 1. Highly performant bulk insertions into as many data sources as possible.
> > There is a whole bunch of different tools for data pipelining to use...
> > But why should people who know SQL spend time learning something new
> > simply to move data between tools?
> >
> > 2. Improve the efficiency of memory management (EVF, resource management,
> > improved costs planning using meta store, etc.). Since we're dealing with
> > big data alongside other tools installed on data nodes we should utilize
> > memory very economically and effectively.
> >
> > 3. Make integration with all other tools and formats as stable as
> possible.
> > The high amount of bugs in the area tells that we have lots to improve.
> > Every user is happy when he gets a tool and it simply works as expected.
> > Also, analyze user requirements and provide integration with the most
> > popular new tools.  Querying a high variety of
> > data sources was and still is one of the biggest selling points.
> >
> > 4. Make code highly extensible and extremely friendly for contributions.
> > No one wants to spend years learning just to make a contribution. This is
> > why I want to see a lot of modules that are highly cohesive and define
> > clear APIs for interaction with each other. This is also about paying old
> > technical debts related to fat JDBC client, copy of web server in Drill
> on
> > YARN, mixing everything in exec module, etc.
> >
> > 5. Focus on performance improvements of every component, from query
> > planning to execution.
> >
> > These are my thoughts from a developer's perspective. Since I'm just a
> > developer from Ukraine and far, far away from Drill users, I believe that
> > Charles Givre is the one who can build a strong Drill user community and
> > collect their requirements for us.
> >
> >
> > As for Volodymyr's suggestion about adapting Arrow and Drill
> > vectors to work together (the same step would be required to implement an
> > Arrow client, as suggested by Paul):
> > I'm totally against the idea because it brings a huge amount of unnecessary
> > complexity just to gain small insights into the integration. First,
> > this is against the whole idea of Arrow, since the main idea of Arrow
> > is to provide a unified columnar memory layout between different tools
> > without any data conversions. But this step requires exactly such data
> > conversions: at least our nullability vector and their validity bitmaps are
> > not the same, and the Dict vector and their meaning of Dict may also require
> > data conversion.
> > Another 

Re: About integration of drill and arrow

2020-01-15 Thread Jiang Wu
An interesting set of perspectives.  The market has many systems similar to
Drill dealing with the relational data model.  However, there is a large set
of non-relational data from various APIs.  An efficient and extensible
query engine for this type of non-relational, schema-on-demand data is what
we are looking for.

Here are our perspectives on developing and using Drill:

1) Schema on-demand and non-relational model: this is the primary reason.
We use Drill to interface with a schema-less columnar object store, where
objects in a collection don't need to have uniform schema.
2) Small footprint: we use both embedded and clustered modes.

What we find lacking in Drill:

1) Support for the non-relational data model is still very limited, e.g.,
lacking functions that work directly on non-relational values.
2) Documentation.  It requires a lot of expertise and experience to figure
out how things work.
3) Not widely adopted, causing issues with finding experts to continue our
work.

-- Jiang


On Fri, Jan 10, 2020 at 3:48 AM Igor Guzenko 
wrote:

> ---------- Forwarded message ---------
> From: Igor Guzenko 
> Date: Fri, Jan 10, 2020 at 1:46 PM
> Subject: Re: About integration of drill and arrow
> To: dev 
>
>
> Hello Drill Developers and Drill Users,
>
> This discussion started as a migration to Arrow but uncovered questions of
> strategic plans for moving towards Apache Drill 2.0.
> Below are my personal thoughts on what we, as developers, should do to
> offer Drill users a better experience:
>
> 1. Highly performant bulk insertions into as many data sources as possible.
> There is a whole bunch of different tools for data pipelining to use...
> But why should people who know SQL spend time learning something new
> simply to move data between tools?
>
> 2. Improve the efficiency of memory management (EVF, resource management,
> improved costs planning using meta store, etc.). Since we're dealing with
> big data alongside other tools installed on data nodes we should utilize
> memory very economically and effectively.
>
> 3. Make integration with all other tools and formats as stable as possible.
> The high amount of bugs in the area tells that we have lots to improve.
> Every user is happy when he gets a tool and it simply works as expected.
> Also, analyze user requirements and provide integration with the most
> popular new tools.  Querying a high variety of
> data sources was and still is one of the biggest selling points.
>
> 4. Make code highly extensible and extremely friendly for contributions. No
> one wants to spend years learning just to make a contribution. This is
> why I want to see a lot of modules that are highly cohesive and define
> clear APIs for interaction with each other. This is also about paying old
> technical debts related to fat JDBC client, copy of web server in Drill on
> YARN, mixing everything in exec module, etc.
>
> 5. Focus on performance improvements of every component, from query
> planning to execution.
>
> These are my thoughts from a developer's perspective. Since I'm just a
> developer from Ukraine and far, far away from Drill users, I believe that
> Charles Givre is the one who can build a strong Drill user community and
> collect their requirements for us.
>
>
> As for Volodymyr's suggestion about adapting Arrow and Drill
> vectors to work together (the same step would be required to implement an
> Arrow client, as suggested by Paul):
> I'm totally against the idea because it brings a huge amount of unnecessary
> complexity just to gain small insights into the integration. First,
> this is against the whole idea of Arrow, since the main idea of Arrow
> is to provide a unified columnar memory layout between different tools
> without any data conversions. But this step requires exactly such data
> conversions: at least our nullability vector and their validity bitmaps are
> not the same, and the Dict vector and their meaning of Dict may also require
> data conversion.
> Another obstacle is the difference in metadata contracts; who knows whether
> it's even possible to combine them. Another problem, as I already
> mentioned, is the huge complexity of the work.
> To do it, I would have to overcome all the underlying pitfalls of both
> projects; in addition, I would have to cover all the untestable code with a
> comprehensive amount of tests to show that back-and-forth conversion is done
> correctly for every single unit of data in both vectors. The idea of adapters
> and clients is about 4 years old or more, and no one has done practical work
> to implement it. I think I explained why.
>
> What I really like in Volodymyr's and Paul's suggestions is that we can
> extract a clear API from the existing EVF implementation and in practice provide
> Arrow or any other impleme
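
To illustrate the kind of conversion Igor is pointing at above: suppose one
engine stores validity as one byte per value while another packs validity into
a bitmap; an adapter would have to re-encode every column that crosses the
boundary (a simplified sketch, not exactly Drill's bits vector or Arrow's
validity bitmap):

```python
def pack_validity(flags):
    """Pack one-byte-per-value validity flags (1 = non-null) into a
    little-endian bit-packed validity bitmap -- the kind of wholesale
    re-encoding a vector adapter would have to perform per column."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

# values [10, null, 30, 40] -> flags [1, 0, 1, 1] -> bitmap 0b00001101
pack_validity([1, 0, 1, 1])  # b'\r', i.e. 0x0d
```

Even this toy direction touches every value; the reverse direction, plus
offsets, lengths, and nested types, is where the complexity Igor describes
comes from.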

Re: About integration of drill and arrow

2020-01-10 Thread Igor Guzenko
Hi Charles,

Thanks for the quick answer. It's nice to know that this is not a discussion
only between me, Volodymyr, and Paul. I'm fine with having a call
next week, but first, it will be very interesting to hear your thoughts in
this email thread. Perhaps you'll uncover many more interesting questions
to think about before the live call.

Best regards,
Igor

On Fri, Jan 10, 2020 at 10:30 PM Charles Givre  wrote:

> Hi Igor,
> Thanks for your thoughts.  I'm a little swamped today, but will send a response
> over the weekend.  Perhaps you and your team in Ukraine would be
> interested in doing a virtual get-together to discuss further?  I'm based
> on Eastern time in the US, so it's a little more convenient than California
> time.
> Thanks,
> -- C
>
>
>
>
>
>
> > On Jan 10, 2020, at 6:47 AM, Igor Guzenko 
> wrote:
> >
> > ---------- Forwarded message ---------
> > From: Igor Guzenko 
> > Date: Fri, Jan 10, 2020 at 1:46 PM
> > Subject: Re: About integration of drill and arrow
> > To: dev 
> >
> >
> > Hello Drill Developers and Drill Users,
> >
> > This discussion started as a migration to Arrow but uncovered questions of
> > strategic plans for moving towards Apache Drill 2.0.
> > Below are my personal thoughts on what we, as developers, should do to
> > offer Drill users a better experience:
> >
> > 1. Highly performant bulk insertions into as many data sources as possible.
> > There is a whole bunch of different tools for data pipelining to use...
> > But why should people who know SQL spend time learning something new
> > simply to move data between tools?
> >
> > 2. Improve the efficiency of memory management (EVF, resource management,
> > improved costs planning using meta store, etc.). Since we're dealing with
> > big data alongside other tools installed on data nodes we should utilize
> > memory very economically and effectively.
> >
> > 3. Make integration with all other tools and formats as stable as
> possible.
> > The high amount of bugs in the area tells that we have lots to improve.
> > Every user is happy when he gets a tool and it simply works as expected.
> > Also, analyze user requirements and provide integration with the most
> > popular new tools.  Querying a high variety of
> > data sources was and still is one of the biggest selling points.
> >
> > 4. Make code highly extensible and extremely friendly for contributions.
> > No one wants to spend years learning just to make a contribution. This is
> > why I want to see a lot of modules that are highly cohesive and define
> > clear APIs for interaction with each other. This is also about paying old
> > technical debts related to fat JDBC client, copy of web server in Drill
> on
> > YARN, mixing everything in exec module, etc.
> >
> > 5. Focus on performance improvements of every component, from query
> > planning to execution.
> >
> > These are my thoughts from developer's perspective. Since I'm just
> > developer from Ukraine and far far away from Drill users, I believe that
> > Charles Givre is the one who can build a strong Drill user community and
> > collect their requirements for us.
> >
> >
> > What relates to Volodymyr's suggestion about adapting Arrow and Drill
> > vectors to work together (the same step is required to implement an Arrow
> > client, suggested by Paul).
> > I'm totally against the idea because it brings a huge amount of
> unnecessary
> > complexity just to uncover small insides into the integration. First is
> > that this is against the whole idea of Arrow since the main idea of Arrow
> > is to provide unified columnar memory layout between different tools
> > without any data conversions. But the step exactly requires data
> > conversions, at least our nullability vector and their validity bitmaps
> are
> > not the same, also Dict vector and their meaning of Dict may also cause
> > data conversion.
> > Another waste is the difference in metadata contracts, who knows whether
> > it's even possible to combine them. Another problem, like I already
> > mentioned is the huge complexity of the work,
> > To do the work I should overcome all underlying pitfalls of both
> projects,
> > in addition, I should cover all the untestable code with a comprehensive
> > amount of tests to show that back and forth conversion is done correctly
> > for every single unit of data in both vectors. The idea of adapters and
> > clients is about 4 years old or more and no one did practical work to
> > implement it. I

Re: About integration of drill and arrow

2020-01-10 Thread Charles Givre
Hi Igor,
Thanks for your thoughts.  I'm a little swamped today, but I'll send a response 
over the weekend.  Perhaps you and your team in Ukraine would be interested in 
doing a virtual get-together to discuss further?  I'm based on Eastern Time in 
the US, so it's a little more convenient than California time.
Thanks,
-- C






> On Jan 10, 2020, at 6:47 AM, Igor Guzenko  wrote:
> 
> -- Forwarded message -
> From: Igor Guzenko 
> Date: Fri, Jan 10, 2020 at 1:46 PM
> Subject: Re: About integration of drill and arrow
> To: dev 
> 
> 
> Hello Drill Developers and Drill Users,
> 
> This discussion started as a migration to Arrow but uncovered questions
> about strategic plans for moving towards Apache Drill 2.0.
> Below are my personal thoughts on what we, as developers, should do to
> offer Drill users a better experience:
> 
> 1. High-performance bulk insertion into as many data sources as possible.
> There is a whole bunch of different tools for data pipelining to choose
> from, but why should people who know SQL spend time learning something new
> simply to move data between tools?
> 
> 2. Improve the efficiency of memory management (EVF, resource management,
> improved cost planning using the Metastore, etc.). Since we're dealing with
> big data alongside other tools installed on the data nodes, we should use
> memory very economically and effectively.
> 
> 3. Make integration with all other tools and formats as stable as possible.
> The high number of bugs in this area shows that we have a lot to improve.
> Every user is happy when a tool simply works as expected.
> Also, analyze user requirements and provide integration with the most
> popular new tools. Querying a wide variety of data sources was, and still
> is, one of Drill's biggest selling points.
> 
> 4. Make the code highly extensible and extremely friendly to contributions.
> No one wants to spend years learning before making a contribution. This is
> why I want to see many modules that are highly cohesive and define clear
> APIs for interacting with each other. This is also about paying off old
> technical debt: the fat JDBC client, the copy of the web server in
> Drill-on-YARN, mixing everything into the exec module, etc.
> 
> 5. Focus on performance improvements of every component, from query
> planning to execution.
> 
> These are my thoughts from a developer's perspective. Since I'm just a
> developer from Ukraine, far away from Drill users, I believe that
> Charles Givre is the one who can build a strong Drill user community and
> collect their requirements for us.
> 
> 
> Regarding Volodymyr's suggestion to adapt Arrow and Drill vectors to work
> together (the same step would be required to implement the Arrow client
> Paul suggested): I'm strongly against the idea because it brings a huge
> amount of unnecessary complexity for only small gains in the integration.
> First, it goes against the whole idea of Arrow, whose main purpose is to
> provide a unified columnar memory layout between different tools without
> any data conversion, whereas this step requires exactly such conversions:
> at the very least, our nullability vector and Arrow's validity bitmaps are
> not the same, and our Dict vector and Arrow's notion of a Dict may also
> force a conversion.
> Another obstacle is the difference in metadata contracts; who knows whether
> it's even possible to combine them. And, as I already mentioned, the sheer
> complexity of the work is a problem: to do it, I would have to overcome all
> the underlying pitfalls of both projects and, in addition, cover all the
> hard-to-test code with a comprehensive suite of tests showing that
> back-and-forth conversion is done correctly for every single unit of data
> in both vectors. The idea of adapters and clients is four years old or
> more, and no one has done the practical work to implement it. I think I
> have explained why.
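The byte-flags-versus-bitmap mismatch described above can be made concrete. The following is a hedged sketch (the class and method names are invented for illustration; this is not Drill's or Arrow's actual API) of the per-value copy an adapter would need just to translate null information between a byte-per-value "bits" layout and a packed, little-endian validity bitmap:

```java
// Illustrative only: Drill's nullable vectors track nulls with roughly one
// byte per value, while Arrow packs validity into one bit per value. An
// adapter therefore cannot share the buffer; it must rewrite every flag.
public class NullabilityConversion {

    // Pack byte flags (non-zero = value present) into a bitmap where bit
    // (i % 8) of byte (i / 8) is set when value i is valid.
    static byte[] toValidityBitmap(byte[] byteFlags) {
        byte[] bitmap = new byte[(byteFlags.length + 7) / 8];
        for (int i = 0; i < byteFlags.length; i++) {
            if (byteFlags[i] != 0) {
                bitmap[i / 8] |= (1 << (i % 8));
            }
        }
        return bitmap;
    }

    public static void main(String[] args) {
        byte[] byteFlags = {1, 0, 1, 1, 0, 0, 0, 1, 1};  // 9 values
        byte[] validity = toValidityBitmap(byteFlags);
        // values 0, 2, 3, 7 set bits in byte 0; value 8 sets bit 0 of byte 1
        System.out.printf("%02x %02x%n", validity[0] & 0xFF, validity[1] & 0xFF);
        // prints "8d 01"
    }
}
```

Note that this sketch covers only one direction and only nullability; a real adapter would also have to reconcile offsets, metadata, and the Dict representations, which is exactly the complexity argued against above.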
> 
> What I really like in Volodymyr's and Paul's suggestions is that we can
> extract a clear API from the existing EVF implementation and then, in
> practice, provide Arrow or any other implementation behind it. Who knows,
> maybe with the new, improved garbage collectors, using direct memory is not
> necessary at all? It is quite clear that we need a middle layer between
> the operators and memory, along with extensive benchmarks and experiments
> over that layer to show what the best underlying memory model for Drill is.
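The "clear API over EVF" idea can be sketched in miniature. Everything below is hypothetical (the interface and class names are invented, not Drill's actual EVF types); it only illustrates how an operator could write through an interface while the backing memory stays swappable:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical writer contract an operator codes against. A Drill-vector,
// Arrow, or heap-backed implementation could all satisfy it, which is the
// point: the operator never sees the underlying memory model.
interface IntColumnWriter {
    void writeInt(int value);   // append a value to the column
    int rowCount();             // number of values written so far
}

// One possible backing: a plain heap-based implementation, standing in for
// whatever buffer strategy benchmarks show to be best.
class HeapIntColumnWriter implements IntColumnWriter {
    private final List<Integer> values = new ArrayList<>();
    @Override public void writeInt(int value) { values.add(value); }
    @Override public int rowCount() { return values.size(); }
}

public class EvfApiSketch {
    public static void main(String[] args) {
        IntColumnWriter writer = new HeapIntColumnWriter();
        writer.writeInt(10);
        writer.writeInt(20);
        System.out.println(writer.rowCount());  // prints 2
    }
}
```

With such a seam in place, the benchmarking suggested above becomes a matter of running the same operator workload against each implementation of the interface.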
> 
> As for client-tool compatibility, the only solution I can see is to
> provide new clients for Drill 2.0. Although I agree that this is a
> tremendous amount of work, there is no other way to make major steps into
> the future. Without it, we should lay back and watch whil

Re: About integration of drill and arrow

2020-01-10 Thread Paul Rogers
Hi Igor,

+1! Very well said! This is exactly the discussion we should have.

Perhaps we can drive towards an overall project goal: a vision that helps us 
choose which of the many options we should select.

Here is a suggestion: Drill should become to queries what Python is to data 
science: a flexible, high-quality platform on which many projects can build 
their applications. This means, as Igor suggests, easy-to-use APIs, excellent 
documentation, many high-quality clients, and a continuing effort to improve 
performance and functionality. Even more important: the ability for others to 
extend Drill with custom operators and maybe even some clever new memory model. 
Charles is our prototypical target user, as he would benefit from many of these 
directions.

Why this concept? The classic on-prem data lake space is shrinking, and is 
dominated by Impala and Hive (which have large companies behind them). The pure 
big-data SQL space is dominated by Presto (which originated at Facebook and now 
has a foundation behind it). The cloud is dominated by Athena (AWS, based on 
Presto) and BigQuery (Google, derived from Dremel, the grand-daddy of big-data 
query engines). There are dozens of smaller players (including the commercial 
concern that sponsored Arrow, whose product started with Drill and has quickly 
evolved far beyond what Drill can do).

The opportunity, which I have seen firsthand, is for a single tool that works 
from laptop to (on-prem or cloud) cluster; that integrates with many data 
sources; and to which I can add custom adapters, security models, operators, 
and so on. This is what Python does quite well without there being a "Python 
Company" to back it.

Frankly, Presto seems close to serving these needs already, but my experience 
is very limited. Hence the crazy notion that "Drill 2.0 is Presto with 
enhancements."

Thanks,
- Paul

 
