Nice! So, to clarify, is this accurate?
1) login, alter session set to define store.format for that user
2) session stickiness (i.e. HAProxy or whatever else will support this)
will ensure that user gets same planner/drillbit, which contains the
session info
3) planner will farm out work to additional drillbits respecting all of
the defined session values
John Omernik wrote on 2018-09-05 12:03 PM:
You can have one user logged in using store.format of CSV in one query,
while another user uses store.format of parquet at the same time. The work
from query one, whether bit 1 or 2 will know to store that query as csv and
the work from query two, where ever it is, will be parquet.
Essentially, store.format is checked by the planner at query time for THAT
query. You have the cluster default, and then you have the ability to
change that at a session level. When you run a query that uses it (Query
time of CREATE TABLE as) , the planner reads the value, session first (if
it is different from the system) and then system if the session is not set,
and then completes the query based on the value read.
John
On Wed, Sep 5, 2018 at 10:41 AM, Joe Auty <[email protected]> wrote:
Hmmm....
So, if user 1 sets the store.format to CSV on Drillbit 1 and work gets
farmed out to Drillbit 2, this session setting will "travel" with the user
from drillbit to drillbit? We were originally thinking that this would be
the case if the session information was retained in Zookeeper, but we
weren't so sure about this. How is session information tracked and
maintained across multiple drillbits?
Thanks so much for taking the time to engage on this!
John Omernik wrote on 2018-09-05 9:40 AM:
Rereading your post, I think there is some concern between embedded/single
drillbit mode, and distributed mode.
When you run multiple drillbits in distributed mode, you will (should) be
enabling authentication. Thus each user will log in to "a" drill bit.
There is no concern on which one, it's not the only one doing the work,
You
could have two users on the same drill bit, or on different bits. The
"System" options will be set by default for all users on logon and at this
point, a session for the user is created. You can see what the current
options for that user is with "Select * from sys.options" If you are a
user who is in the administrative group (security.admin.users in select *
from sys.options) then you can use
ALTER SYSTEM set `somekey` = `someval`
And that will set the default option for all users when they logon.
A user can use
ALTER SESSION set `somekey = `someval`
To alter what say their store.format is. So User 1 can authentication and
connect to Drillbit 1 or 2, and set store.format (at the session level) to
parquet and user 2 can logon it Drillbit 1 or 2 (doesn't matter, even if
user 1 is logged into the same drill bit with a different store.format)
and
set it to csv.
I think the key here is authenticated distributed mode in Drill and that
will be how you will do what you need to do.
John
On Tue, Sep 4, 2018 at 7:30 PM, John Omernik <[email protected]> wrote:
The session is the users session, not the drill bit. Since you are
connected to a specific drill bit, when you alter session it will work.
Try
to use session stickiness or pinning on your HA solution and you will be
go. With my DNS round robin it picks a "connecting" drill bit and sticks
to
it until the session is done. The settings apply to the drill cluster
while
in distributed mode.
On Tue, Sep 4, 2018 at 3:56 PM, Joe Auty <[email protected]> wrote:
Thanks for your response John!
We are using Drill both in an ETL context, as well as for general
warehouse queries. One Drill user uses store format set to Parquet while
the other uses store format set to CSV to read and write from HDFS. We
are
currently using Kubernetes Services rather than DNS round robin, but we
only have one drillbit in the cluster while we try to sort out this
issue.
I'm not clear though how alter session would work with a session with
multiple drillbits involved? We need to ensure that the right store
format
option is set, so like I said we are using different Drill usernames and
sessions to accommodate this, but how would alter session commands
apply to
preserve these different settings across multiple drillbits?
John Omernik wrote on 2018-09-04 2:48 PM:
Are these ETL ish type queries? store.format should only apply when
Drill
is writing data, when it is reading, it uses the filenames and other
hints
to read.
Thus, if you do HA, say with DNS (like like in the other thread) and
prior
to running your CREATE TABLE AS (I Am assuming this is what you are
doing)
you can do ALTER SESSION set store.format = 'parquet'
Instead of setting the ALTER SYSTEM, you can use ALTER SESSION so it
only
applies to the current session, regardless of foreman.
John
On Tue, Sep 4, 2018 at 1:00 PM, Joe Auty <[email protected]>
wrote:
Hello,
We need to have some queries executed with store.format set to parquet
and
some with this option set to CSV. To date we have experimented with
setting
the store format for sessions controlled by using two separate user
logins
as a sort of context switch, but I'm wondering if the group here might
have
suggestions for a better way to handle this, particularly one that
will
scale a little better for us?
The main problem we have with this approach is in introducing multiple
drillbits/HA and assuring that the session and the settings we need
are
respected across all drillbits (whether with an HAProxy + sticky
session
approach or any other approach). There is a more general thread (which
I've
chosen not to hijack) about HA Drill from a more general standpoint,
you
might think of my question here as being similar, but with the need
for
a
context switch to support multiple Drill configurations/session
options.
Here are the various attempts and approaches we have come up with so
far.
I'm wondering if you'd have any general advice as to which approach
would
be best for us to take, considering future plans for Drill itself. For
example, if need be we can write our own plugin(s) if this is the
smartest
approach:
- embedded the store.format option into the query itself by chaining
multiple queries together separated by a comma (it appears that this
doesn't work at all)
- look into writing some sort of plugin to allow us to scale our
current
approach somehow (I realize that this is vague)
- a "foreman" approach where we stick with our current approach and
direct
all requests to our "foreman"/master with the hope and expectation
that
it
will farm out work to the workers/slaves
- multiple clusters set with different settings
Each of these approaches seems to have its pros and cons. To
reiterate:
what approach do you think would be the smartest and most future-proof
approach for us to take?
Thanks in advance!