Hello - and collect_set UDF?

2020-06-08 Thread John Omernik
Hey all,

I know it's been a while; I was in a job where there was no Drill to be
found, just a bunch of Hive. In that role, I got very used to using the
Hive collect_set function. It lets you return, as a single list, all the
unique values within an aggregate group.   Curious, is there anything like
that in Drill? I remember something like it, and when I query
sys.functions there are collect_list and collect_to_list, neither of which
works in this context (collect_list gave me an array index out of bounds,
and collect_to_list told me the column I was collecting wasn't in the
aggregate).
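
For reference, the Hive usage I'm trying to replicate looks something like
this (the table and columns here are made up):

select user_id, collect_set(page) as unique_pages
from web_visits
group by user_id;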

Any UDFs that will achieve this? Even ones you've seen on GitHub etc.
(i.e. not built into Drill) would be welcome.

Thanks!

John Omernik


Documentation Issue

2019-01-03 Thread John Omernik
Hey all, I was looking for some tools to help me add/subtract dates, and I
went to:

https://drill.apache.org/docs/date-time-functions-and-arithmetic/


TIMESTAMPDIFF and TIMESTAMPADD were there, and I got excited. I organized
my query and tried to run it, and got the error below. So I tried
troubleshooting. Thirty minutes later, after reviewing my query, reviewing
the docs, reviewing all the things, I came here, and as I did I saw there
is a new Drill release... hmmm. On a hunch I hit command+F, and sure
enough, the function I was trying to use was just added in 1.15 (I am
running 1.14).

So, A: I think that anytime a doc is updated to reflect a new function, we
should add "Added in Drill version 1.15". The error I got didn't tell me
the function didn't exist and was not helpful. This sent me down a
30-minute rabbit hole. This is a bad user experience.

B: I see that sys.functions is in the 1.15 release. This is awesome. I
think we should add a min_drill_version column to it, though. For
functions added after 1.15, we should know where they sit. This will
also help us create a process that can auto-update the documentation.
Anytime a new function is added, we can run a query that shows the ones
not yet in the docs.
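
For example, something like this (hypothetical, assuming sys.functions
grew a min_drill_version column):

select name, min_drill_version
from sys.functions
where min_drill_version > '1.15'
order by min_drill_version, name;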

Thoughts?

SYSTEM ERROR: AssertionError: todo: implement syntax
SPECIAL(Reinterpret(-($0, ITEM($2, 'charge_state_ts'


Nested Window Queries

2019-01-03 Thread John Omernik
Is there a limitation on nesting of window queries?  I have a query
where I am using an event stream, and the change of a value indicates
an event. (The state goes from disconnected, to charging, to complete; it
reports many times in each of those states, but I am using lag(state, 1)
over (order by event_ts) to find those records where the current record
is, say, complete and the previous record was charging, indicating that
the state has changed.)

This works well.
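
Roughly, that working first stage looks like this (simplified; the table
name is a stand-in for my real source):

select * from (
    select event_ts, state,
           lag(state, 1) over (order by event_ts) as prev_state
    from `mytable`
) t
where state = 'complete' and prev_state = 'charging';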

Now I am trying to take that result set and do another window over it to
find the delta time between when the charging started and when the
charging completed. You would think that a nested query showing the state
change times, with a select lag(event_ts, 1) over (order by event_ts) on
top of it, would get me both the current time (when the charging was
complete) and the previous record's event_ts (when the charging started,
as output by the subquery). However, I am getting a verbose, confusing
error that perhaps we can help figure out...

The first part is:

SYSTEM ERROR: CannotPlanException: Node
[rel#92358:Subset#11.LOGICAL.ANY([]).[3]] could not be implemented;
planner state:

Root: rel#92358:Subset#11.LOGICAL.ANY([]).[3]
Original rel:


The rest is pages of information related to the query, the data etc. I
can send that separately if you are interested.


I guess, before sending all of that, are nested window queries just
not allowed?


If that is the case, what are some alternative approaches?


Thanks,

John


Re: Time for a fun Query Question

2018-12-04 Thread John Omernik
So one approach I have is this:



select ds, ts_win, opt_string, opt_id, max(eday) as max_eday
from (
    select a.ds, a.ts_win, a.opt_string, b.opt_id, b.eday,
           b.ts_win as rpt_ts_win, b.ts
    from (
        -- a: distinct (day, string, device, 10-minute window) combinations
        select distinct ds, opt_string, opt_id,
               concat(substr(ts, 1, 4), '0:00') as ts_win
        from dfs.prod.view_optdata
        where ds = '2018-12-02'
        order by ds, ts_win, opt_string
    ) a
    join (
        -- b: every reading for the day, labeled with its 10-minute window
        select opt_id, ds, opt_string,
               concat(substr(ts, 1, 4), '0:00') as ts_win,
               ts, eday
        from dfs.prod.view_optdata
        where ds = '2018-12-02'
    ) b on a.ds = b.ds and a.opt_string = b.opt_string
    -- keep only readings at or before each window boundary
    where b.ts <= a.ts_win
) c
group by ds, ts_win, opt_string, opt_id
order by ds, ts_win ASC, opt_string ASC, opt_id ASC

This breaks up my day into 10-minute intervals (based on string slicing)
and then shows me what has been reported before each one... I think :)

This is limited in that I can only do it for time intervals that can be
substringed out of a time field in string format.  Still open to other
ideas :) One arithmetic alternative is sketched below.
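
For example, here is a rough sketch of computing an arbitrary N-minute
bucket arithmetically rather than by string slicing (assuming ts casts
cleanly to TIME; 10 is the bucket size in minutes):

select opt_id, ds, eday,
       cast(extract(hour from cast(ts as time)) * 60
            + 10 * floor(extract(minute from cast(ts as time)) / 10)
            as int) as bucket_start_min  -- minutes past midnight of the bucket start
from dfs.prod.view_optdata
where ds = '2018-12-02';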


On Tue, Dec 4, 2018 at 10:05 AM John Omernik  wrote:

> Time for a fun question: How to be clever with queries!
>
>
> I have a table that takes readings from an IoT type device
>
>
> opt_id dt   ts  eday   opt_string
> 2.1.1   2018-12-01   10:43:43   12.5   1
> 2.1.2   2018-12-01   10:32:43   5.5    1
> 2.1.3   2018-12-01   10:55:02   20.5   1
> 2.2.1   2018-12-01   10:43:43   12.2   2
> 2.2.2   2018-12-01   10:12:56   43.7   2
> 2.2.3   2018-12-01   10:50:23   12.2   2
> 2.1.1   2018-12-01   13:43:43   45.5   1
> 2.1.2   2018-12-01   13:32:43   20.5   1
> 2.1.3   2018-12-01   13:55:02   45.5   1
> 2.2.1   2018-12-01   13:43:43   30.2   2
> 2.2.2   2018-12-01   13:12:56   81.7   2
> 2.2.3   2018-12-01   13:50:23   50.2   2
>
>
> Pretty simple. Essentially, each "device" (opt_id) reports throughout the
> day what its cumulative value (eday) is at that time, for that day.
>
> These devices belong to groups, which is its own column (opt_string).
>
>
> So if I run a query like this:
>
> select opt_id, ds, max(opt_string) as opt_string, max(ts) as max_ts,
> max(Eday) as max_Eday
> from `mydata`
> where ds = '2018-12-01'
> group by ds, opt_id order by opt_id ASC
>
> I get:
>
> opt_id ds opt_string max_ts max_Eday
> 2.1.1 2018-12-02 1 15:56:33 181.000
> 2.1.3 2018-12-02 1 15:47:02 162.500
> 2.1.4 2018-12-02 1 15:54:19 122.250
> 2.1.5 2018-12-02 1 15:40:15 132.000
> 2.1.6 2018-12-02 1 15:45:04 183.250
> 2.1.7 2018-12-02 1 15:41:15 161.000
> 2.1.8 2018-12-02 1 15:42:10 166.250
> 2.1.9 2018-12-02 1 15:39:23 126.250
> 2.2.1 2018-12-02 2 16:02:54 163.250
> 2.2.2 2018-12-02 2 15:57:39 178.750
> 2.2.3 2018-12-02 2 15:46:53 148.250
> 2.2.4 2018-12-02 2 15:40:27 163.500
> 2.2.5 2018-12-02 2 15:50:49 147.500
> 2.2.6 2018-12-02 2 15:40:24 156.000
> 2.2.7 2018-12-02 2 15:43:00 104.750
> 2.2.8 2018-12-02 2 15:41:55 170.750
> 2.2.9 2018-12-02 2 15:41:04 127.250
>
> Which shows me the last time each device reported on a day, and what its
> cumulative value was at that point. Great.
>
>
> So what if I wanted to show the values through the day? Is there a way
> from a query perspective, to perhaps use Windowing or something to show
> results like the above raw table, but instead of just having the random
> times returned, group by opt_id and get the max_ts as I am doing, but do
> that for each say 5 minute, or 30 minute, or 60 minute block in a day?
>
> The best example to explain what I am looking for is using 3 groups of 1
> hour blocks: I could use a UNION like this:
>
> select * from (
> select opt_id, ds,  max(opt_string) as opt_string, '10:00:00' as max_ts,
> max(Eday) as max_Eday
> from dfs.prod.view_optdata where ds = '2018-12-02' and ts <= '10:00:00'
> group by ds, opt_id
> order by opt_id ASC
> )
> UNION
> (select opt_id, ds,  max(opt_string) as opt_string, '11:00:00' as max_ts,
> max(Eday) as max_Eday
> from dfs.prod.view_optdata where ds = '2018-12-02' and ts <= '11:00:00'
> group by ds, opt_id
> order by opt_id ASC)
> UNION
> (select opt_id, ds,  max(opt_string) as opt_string, '12:00:00' as max_ts,
> max(Eday) as max_Eday
> from dfs.prod.view_optdata where ds = '2018-12-02' and ts <= '12:00:00'
> group by ds, opt_id
> order by opt_id ASC)
>
>
> Where I would get something like:
>
> index opt_id ds opt_string max_ts max_Eday
> 0 2.1.1 12/2/18 1 10:00:00 18
> 1 2.1.3 12/2/18 1 10:00:00 15.75
> 2 2.1.4 12/2/18 1 10:00:00 12.5
> 3 2.1.5 12/2/18 1 10:00:00 12.75
> 4 2.1.6 12/2/18 1 10:00:00 18.75
> 5 2.1.7 12/2/18 1 10:00:00 7.25
> 6 2.1.8 12/2/18 1 10:00:00 16.25
> 7 2.1.9 12/2/18 1 10:00:00 9.75
> 8 2.2.1

Time for a fun Query Question

2018-12-04 Thread John Omernik
Time for a fun question: How to be clever with queries!


I have a table that takes readings from an IoT type device


opt_id dt   ts  eday   opt_string
2.1.1   2018-12-01   10:43:43   12.5   1
2.1.2   2018-12-01   10:32:43   5.5    1
2.1.3   2018-12-01   10:55:02   20.5   1
2.2.1   2018-12-01   10:43:43   12.2   2
2.2.2   2018-12-01   10:12:56   43.7   2
2.2.3   2018-12-01   10:50:23   12.2   2
2.1.1   2018-12-01   13:43:43   45.5   1
2.1.2   2018-12-01   13:32:43   20.5   1
2.1.3   2018-12-01   13:55:02   45.5   1
2.2.1   2018-12-01   13:43:43   30.2   2
2.2.2   2018-12-01   13:12:56   81.7   2
2.2.3   2018-12-01   13:50:23   50.2   2


Pretty simple. Essentially, each "device" (opt_id) reports throughout the
day what its cumulative value (eday) is at that time, for that day.

These devices belong to groups, which is its own column (opt_string).


So if I run a query like this:

select opt_id, ds, max(opt_string) as opt_string, max(ts) as max_ts,
max(Eday) as max_Eday
from `mydata`
where ds = '2018-12-01'
group by ds, opt_id order by opt_id ASC

I get:

opt_id ds opt_string max_ts max_Eday
2.1.1 2018-12-02 1 15:56:33 181.000
2.1.3 2018-12-02 1 15:47:02 162.500
2.1.4 2018-12-02 1 15:54:19 122.250
2.1.5 2018-12-02 1 15:40:15 132.000
2.1.6 2018-12-02 1 15:45:04 183.250
2.1.7 2018-12-02 1 15:41:15 161.000
2.1.8 2018-12-02 1 15:42:10 166.250
2.1.9 2018-12-02 1 15:39:23 126.250
2.2.1 2018-12-02 2 16:02:54 163.250
2.2.2 2018-12-02 2 15:57:39 178.750
2.2.3 2018-12-02 2 15:46:53 148.250
2.2.4 2018-12-02 2 15:40:27 163.500
2.2.5 2018-12-02 2 15:50:49 147.500
2.2.6 2018-12-02 2 15:40:24 156.000
2.2.7 2018-12-02 2 15:43:00 104.750
2.2.8 2018-12-02 2 15:41:55 170.750
2.2.9 2018-12-02 2 15:41:04 127.250

Which shows me the last time each device reported on a day, and what its
cumulative value was at that point. Great.


So what if I wanted to show the values through the day? Is there a way, from
a query perspective, to perhaps use windowing or something to show results
like the above raw table, but instead of just having the random times
returned, group by opt_id and get the max_ts as I am doing, but do that for
each, say, 5-minute, 30-minute, or 60-minute block in a day?

The best way to explain what I am looking for is with three 1-hour
blocks: I could use a UNION like this:

select * from (
select opt_id, ds,  max(opt_string) as opt_string, '10:00:00' as max_ts,
max(Eday) as max_Eday
from dfs.prod.view_optdata where ds = '2018-12-02' and ts <= '10:00:00'
group by ds, opt_id
order by opt_id ASC
)
UNION
(select opt_id, ds,  max(opt_string) as opt_string, '11:00:00' as max_ts,
max(Eday) as max_Eday
from dfs.prod.view_optdata where ds = '2018-12-02' and ts <= '11:00:00'
group by ds, opt_id
order by opt_id ASC)
UNION
(select opt_id, ds,  max(opt_string) as opt_string, '12:00:00' as max_ts,
max(Eday) as max_Eday
from dfs.prod.view_optdata where ds = '2018-12-02' and ts <= '12:00:00'
group by ds, opt_id
order by opt_id ASC)


Where I would get something like:

index opt_id ds opt_string max_ts max_Eday
0 2.1.1 12/2/18 1 10:00:00 18
1 2.1.3 12/2/18 1 10:00:00 15.75
2 2.1.4 12/2/18 1 10:00:00 12.5
3 2.1.5 12/2/18 1 10:00:00 12.75
4 2.1.6 12/2/18 1 10:00:00 18.75
5 2.1.7 12/2/18 1 10:00:00 7.25
6 2.1.8 12/2/18 1 10:00:00 16.25
7 2.1.9 12/2/18 1 10:00:00 9.75
8 2.2.1 12/2/18 2 10:00:00 18
9 2.2.2 12/2/18 2 10:00:00 15.25
10 2.2.3 12/2/18 2 10:00:00 2.75
11 2.2.4 12/2/18 2 10:00:00 6.5
12 2.2.5 12/2/18 2 10:00:00 16.25
13 2.2.6 12/2/18 2 10:00:00 11.25
14 2.2.7 12/2/18 2 10:00:00 13.5
15 2.2.8 12/2/18 2 10:00:00 13.75
16 2.2.9 12/2/18 2 10:00:00 14.5
17 2.1.1 12/2/18 1 11:00:00 41
18 2.1.3 12/2/18 1 11:00:00 32
19 2.1.4 12/2/18 1 11:00:00 25
20 2.1.5 12/2/18 1 11:00:00 26.75
21 2.1.6 12/2/18 1 11:00:00 42.25
22 2.1.7 12/2/18 1 11:00:00 27
23 2.1.8 12/2/18 1 11:00:00 35.75
24 2.1.9 12/2/18 1 11:00:00 22
25 2.2.1 12/2/18 2 11:00:00 34.5
26 2.2.2 12/2/18 2 11:00:00 42.5
27 2.2.3 12/2/18 2 11:00:00 19
28 2.2.4 12/2/18 2 11:00:00 29.25
29 2.2.5 12/2/18 2 11:00:00 32.25
30 2.2.6 12/2/18 2 11:00:00 30.25
31 2.2.7 12/2/18 2 11:00:00 26.75
32 2.2.8 12/2/18 2 11:00:00 33.5
33 2.2.9 12/2/18 2 11:00:00 28.5
34 2.1.1 12/2/18 1 12:00:00 81.5
35 2.1.3 12/2/18 1 12:00:00 62.75
36 2.1.4 12/2/18 1 12:00:00 48.5
37 2.1.5 12/2/18 1 12:00:00 48.25
38 2.1.6 12/2/18 1 12:00:00 80.5
39 2.1.7 12/2/18 1 12:00:00 62
40 2.1.8 12/2/18 1 12:00:00 66.25
41 2.1.9 12/2/18 1 12:00:00 43.75
42 2.2.1 12/2/18 2 12:00:00 62.75
43 2.2.2 12/2/18 2 12:00:00 83.5
44 2.2.3 12/2/18 2 12:00:00 49
45 2.2.4 12/2/18 2 12:00:00 66.25
46 2.2.5 12/2/18 2 12:00:00 56.5
47 2.2.6 12/2/18 2 12:00:00 68
48 2.2.7 12/2/18 2 12:00:00 43
49 2.2.8 12/2/18 2 12:00:00 72.5
50 2.2.9 12/2/18 2 12:00:00 49.25


That's great, but if I wanted to say all the hours in a day, that Union
would get ugly, or what about 30 minute blocks? 5 

Re: [DISCUSS] Add Hadoop Credentials API support for Drill S3 storage plugin

2018-09-11 Thread John Omernik
I think in general Drill should consider the issue of credentials across
storage plugins. Credentials exist for S3, but also Mongo, JDBC, and others
as they get added.  These can be a pain to manage, and that leads to
insecurely set up Drill clusters.

One option may be to allow a generic integration with a secrets store like
HashiCorp Vault or Kubernetes Secrets.  I am not sure of the best approach
here, but one back-of-napkin idea is to have a plugin interface where
users can retrieve passwords from stores (or perhaps even build it into
Apache Drill itself). The goal would be to let users set a single
token/password (think password safe) once, at the session level, on login.
Drill could then use that session key to open the password store
(whether K8s, Vault, or something built into Drill itself) and retrieve
all the passwords for the user. This would allow a one-time session setting
of passwords for users, and still allow for secure use, storage, and
accountability of passwords in third-party systems.



Process would be something like this:

Each storage plugin that could use a username/password would be able to set
a flag for username/access ID and for password/secret.

The user experience would be something like this (a consolidated sketch
follows the list):


1. I log on.
2. I now have a session and can issue SQL. SQL for filesystem stuff can
already just work.
3. If I want to access, say, an S3 bucket, I need to do something like ALTER
SESSION set session.key = ''
   * NOTE: Whatever this is, we have to find a way to allow it to NOT BE
LOGGED in queries. This process needs to allow for user accountability,
thus it should not be a normal query for logging purposes.
4. Now that my session key is set, I could do something like
  ALTER SESSION set storage.plugin.s3.accessid = '' and ALTER SESSION set
storage.plugin.s3.secret = ''
  * NOTE: like the session.key, this would have to NOT BE LOGGED. In
addition, it should not print in plain text in select * from sys.options
(perhaps this should live in a special table such as sys.secrets rather
than sys.options?)
5. Now, whenever an S3 query is made, it can use that (as long as
session.key is set correctly to unlock the password safe)
  * Note: storage.plugin.s3.x may not be enough; we may need to allow each
user to have multiple passwords per storage plugin based on defined
workspaces/plugins... or do we just define the plugin twice? Should a
user have multiple passwords per data store? Could we just spoof that by
having a new plugin for each instance of a store a user may need? Open to
discussion.

6. When the session closes, the session.key is automatically ended. Any
plugins that require a key will require this to happen again.
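
Put together, a session might look something like this (all of the option
names here are hypothetical placeholders, not existing Drill options):

ALTER SESSION set `session.key` = '<one-time key, never logged>';
ALTER SESSION set `storage.plugin.s3.accessid` = '<access id>';
ALTER SESSION set `storage.plugin.s3.secret` = '<secret>';
-- from here on, S3 queries in this session can use the stored credentials
select count(*) from s3.`some-bucket/path/`;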


This is just chicken scratch of an idea, but I think we need to think about
passwords holistically, and ensure that passwords are stored securely and
allow for accountability and auditing of connections to the various data
stores. I think this is a bigger problem than just S3 or JDBC, and
something that should be built into Drill's core.

John






On Thu, Aug 2, 2018 at 7:24 PM, Bohdan Kazydub 
wrote:

> Hi all,
>
> Currently, to access S3A filesystem, `fs.s3a.secret.key` and
> `fs.s3a.access.key` properties should be configured either in S3 Storage
> Plugin or in core-site.xml in plaintext. This approach is considered
> unsecure. To eliminate a need to store passwords in plaintext,
> CredentialProvider API [1] may be used to extract secret keys from
> encrypted store.
>
> Here is a document with implementation details:
> https://docs.google.com/document/d/1ow4v5HOh0qJh-
> 5KsZHqSjohM2ukGSayEd9360tHZZvo/edit#
> .
> And here is an open issue for the improvement:
> https://issues.apache.org/jira/browse/DRILL-6662
>
> Any thoughts?
>
> [1]
> https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/
> CredentialProviderAPI.html
>
> Kind regards,
> Bohdan
>


Re: Session handling with multiple drillbits

2018-09-05 Thread John Omernik
Paul, let's talk about this race condition you mention

Let's use a real option here for clarity: store.format.

SYSTEM store.format is parquet


Scenario 1: I log on, I set SESSION store.format to csv, and run CREATE
TABLE foo as select * from bar.   The SESSION variable is read from my
session and passed to all drillbits that may get a fragment, correct?  I
won't have some fragments on other drillbits reading the SYSTEM value of
parquet because I set a SESSION value of csv.
Scenario 2 (your described race condition): I log on, I do NOT set a
SESSION variable for store.format (thus it's parquet), and run CREATE
TABLE foo as select * from bar.  As it is running, and before the final
output, I change SYSTEM store.format to csv. And now I will run into this
race condition?

Scenario 2 is odd to me. Using the principle of least surprise, I would
expect my variables to be read once during planning and projected to all
downstream fragments, rather than a situation where variables could be
read at different times. This also seems like a lot of unnecessary ZK
reads... does each fragment have to look up ZK settings? Shouldn't the
foreman of the query do that once?   Are you sure this race condition
exists? This just doesn't sit right with me.






On Wed, Sep 5, 2018 at 11:35 AM, Paul Rogers 
wrote:

> Hi Joe,
>
> To answer our question about how options "travel"...
>
> Drill maintains system options in ZK. Session options are maintained per
> connection on the Foreman Drillbit to which the user connects. This is why
> a simple round-robin load balancer does not work: load balancing has to
> be session-oriented, as are the Drill JDBC and ODBC clients. (It is also why
> the non-secure REST API does not preserve session options.) If you query
> the session options table, you get a merger of the system options from ZK
> overlaid with the session options on the Foreman.
>
> When a query executes, the query planner copies session options into the
> query plan that it sends to Drillbits for execution. The operators in each
> Drillbit merge the session options (from the query plan) and the system
> options (from ZK) to get the full set of options for the query.
>
> Note that this does, in fact, create a race condition: if we change a
> system option while queries run, some fragments of the query may see the
> old value, others the new value. The workaround is simple: don't change
> system options while queries are running.
>
> Thanks,
> - Paul
>
>
>
> On Wednesday, September 5, 2018, 8:42:02 AM PDT, Joe Auty <
> j...@thinkdataworks.com> wrote:
>
>  Hmmm
>
> So, if user 1 sets the store.format to CSV on Drillbit 1 and work gets
> farmed out to Drillbit 2, this session setting will "travel" with the
> user from drillbit to drillbit? We were originally thinking that this
> would be the case if the session information was retained in Zookeeper,
> but we weren't so sure about this. How is session information tracked
> and maintained across multiple drillbits?
>
> Thanks so much for taking the time to engage on this!
>
>
>
> John Omernik wrote on 2018-09-05 9:40 AM:
> > Rereading your post, I think there is some concern between
> embedded/single
> > drillbit mode, and distributed mode.
> >
> > When you run multiple drillbits in distributed mode, you will (should) be
> > enabling authentication.  Thus each user will log in to "a" drill bit.
> > There is no concern on which one, it's not the only one doing the work,
> You
> > could have two users on the same drill bit, or on different bits.  The
> > "System" options will be set by default for all users on logon and at
> this
> > point, a session for the user is created.  You can see what the current
> > options for that user is with "Select * from sys.options"  If you are a
> > user who is in the administrative group (security.admin.users in select *
> > from sys.options) then you can use
> >
> > ALTER SYSTEM set `somekey` = `someval`
> >
> > And that will set the default option for all users when they logon.
> >
> > A user can use
> >
> > ALTER SESSION set `somekey` = `someval`
> >
> > To alter what say their store.format is.  So User 1 can authentication
> and
> > connect to Drillbit 1 or 2, and set store.format (at the session level)
> to
> > parquet and user 2 can logon it Drillbit 1 or 2 (doesn't matter, even if
> > user 1 is logged into the same drill bit with a different store.format)
> and
> > set it to csv.
> >
> > I think the key here is authenticated distributed mode in Drill and that
> > will be how you will do what you need to do.
> >
> > John
> >
> >
> >

Re: Session handling with multiple drillbits

2018-09-05 Thread John Omernik
You can have one user logged in using a store.format of CSV for one query
while another user uses a store.format of parquet at the same time. The
work from query one, whether on bit 1 or 2, will know to store that
query's output as CSV, and the work from query two, wherever it runs, will
be parquet.

Essentially, store.format is checked by the planner at query time for THAT
query.  You have the cluster default, and then you have the ability to
override it at the session level. When you run a query that uses it (query
time of a CREATE TABLE AS), the planner reads the value, session first (if
one is set) and then system if the session value is not set, and then
completes the query based on the value read.
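
For example (the dfs.tmp tables here are just placeholders):

-- User 1's session:
ALTER SESSION set `store.format` = 'csv';
create table dfs.tmp.`out_csv` as select * from dfs.tmp.`source_data`;

-- User 2's session, running at the same time on the same cluster:
ALTER SESSION set `store.format` = 'parquet';
create table dfs.tmp.`out_parquet` as select * from dfs.tmp.`source_data`;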

John

On Wed, Sep 5, 2018 at 10:41 AM, Joe Auty  wrote:

> Hmmm
>
> So, if user 1 sets the store.format to CSV on Drillbit 1 and work gets
> farmed out to Drillbit 2, this session setting will "travel" with the user
> from drillbit to drillbit? We were originally thinking that this would be
> the case if the session information was retained in Zookeeper, but we
> weren't so sure about this. How is session information tracked and
> maintained across multiple drillbits?
>
> Thanks so much for taking the time to engage on this!
>
>
>
>
> John Omernik wrote on 2018-09-05 9:40 AM:
>
>> Rereading your post, I think there is some concern between embedded/single
>> drillbit mode, and distributed mode.
>>
>> When you run multiple drillbits in distributed mode, you will (should) be
>> enabling authentication.   Thus each user will log in to "a" drill bit.
>> There is no concern on which one, it's not the only one doing the work,
>> You
>> could have two users on the same drill bit, or on different bits.  The
>> "System" options will be set by default for all users on logon and at this
>> point, a session for the user is created.   You can see what the current
>> options for that user is with "Select * from sys.options"  If you are a
>> user who is in the administrative group (security.admin.users in select *
>> from sys.options) then you can use
>>
>> ALTER SYSTEM set `somekey` = `someval`
>>
>> And that will set the default option for all users when they logon.
>>
>> A user can use
>>
>> ALTER SESSION set `somekey` = `someval`
>>
>> To alter what say their store.format is.  So User 1 can authentication and
>> connect to Drillbit 1 or 2, and set store.format (at the session level) to
>> parquet and user 2 can logon it Drillbit 1 or 2 (doesn't matter, even if
>> user 1 is logged into the same drill bit with a different store.format)
>> and
>> set it to csv.
>>
>> I think the key here is authenticated distributed mode in Drill and that
>> will be how you will do what you need to do.
>>
>> John
>>
>>
>>
>> On Tue, Sep 4, 2018 at 7:30 PM, John Omernik  wrote:
>>
>> The session is the users session, not the drill bit. Since you are
>>> connected to a specific drill bit, when you alter session it will work.
>>> Try
>>> to use session stickiness or pinning on your HA solution and you will be
>>> go. With my DNS round robin it picks a "connecting" drill bit and sticks
>>> to
>>> it until the session is done. The settings apply to the drill cluster
>>> while
>>> in distributed mode.
>>>
>>> On Tue, Sep 4, 2018 at 3:56 PM, Joe Auty  wrote:
>>>
>>> Thanks for your response John!
>>>>
>>>> We are using Drill both in an ETL context, as well as for general
>>>> warehouse queries. One Drill user uses store format set to Parquet while
>>>> the other uses store format set to CSV to read and write from HDFS. We
>>>> are
>>>> currently using Kubernetes Services rather than DNS round robin, but we
>>>> only have one drillbit in the cluster while we try to sort out this
>>>> issue.
>>>>
>>>> I'm not clear though how alter session would work with a session with
>>>> multiple drillbits involved? We need to ensure that the right store
>>>> format
>>>> option is set, so like I said we are using different Drill usernames and
>>>> sessions to accommodate this, but how would alter session commands
>>>> apply to
>>>> preserve these different settings across multiple drillbits?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> John Omernik wrote on 2018-09-04 2:48 PM:
>>>>
>>>> Are these ETL ish type queries?  store.format shoul

Re: Session handling with multiple drillbits

2018-09-05 Thread John Omernik
Rereading your post, I think there is some confusion between
embedded/single-drillbit mode and distributed mode.

When you run multiple drillbits in distributed mode, you will (should) be
enabling authentication.   Thus each user will log in to "a" drillbit.
There is no concern about which one; it's not the only one doing the work.
You could have two users on the same drillbit, or on different bits.  The
"system" options will be set by default for all users on logon, and at this
point a session for the user is created.   You can see what the current
options for that user are with "select * from sys.options".  If you are a
user who is in the administrative group (security.admin.users in select *
from sys.options) then you can use

ALTER SYSTEM set `somekey` = `someval`

And that will set the default option for all users when they logon.

A user can use

ALTER SESSION set `somekey` = `someval`

To alter, say, their store.format.  So user 1 can authenticate and
connect to Drillbit 1 or 2 and set store.format (at the session level) to
parquet, and user 2 can log on to Drillbit 1 or 2 (it doesn't matter, even
if user 1 is logged into the same drillbit with a different store.format)
and set it to csv.

I think the key here is authenticated distributed mode in Drill and that
will be how you will do what you need to do.

John



On Tue, Sep 4, 2018 at 7:30 PM, John Omernik  wrote:

> The session is the users session, not the drill bit. Since you are
> connected to a specific drill bit, when you alter session it will work. Try
> to use session stickiness or pinning on your HA solution and you will be
> good. With my DNS round robin it picks a "connecting" drill bit and sticks to
> it until the session is done. The settings apply to the drill cluster while
> in distributed mode.
>
> On Tue, Sep 4, 2018 at 3:56 PM, Joe Auty  wrote:
>
>> Thanks for your response John!
>>
>> We are using Drill both in an ETL context, as well as for general
>> warehouse queries. One Drill user uses store format set to Parquet while
>> the other uses store format set to CSV to read and write from HDFS. We are
>> currently using Kubernetes Services rather than DNS round robin, but we
>> only have one drillbit in the cluster while we try to sort out this issue.
>>
>> I'm not clear though how alter session would work with a session with
>> multiple drillbits involved? We need to ensure that the right store format
>> option is set, so like I said we are using different Drill usernames and
>> sessions to accommodate this, but how would alter session commands apply to
>> preserve these different settings across multiple drillbits?
>>
>>
>>
>>
>>
>>
>> John Omernik wrote on 2018-09-04 2:48 PM:
>>
>>> Are these ETL ish type queries?  store.format should only apply when
>>> Drill
>>> is writing data, when it is reading, it uses the filenames and other
>>> hints
>>> to read.
>>>
>>> Thus, if you do HA, say with DNS (like like in the other thread) and
>>> prior
>>> to running your CREATE TABLE AS (I Am assuming this is what you are
>>> doing)
>>> you can do ALTER SESSION set store.format = 'parquet'
>>>
>>> Instead of setting the ALTER SYSTEM, you can use ALTER SESSION so it only
>>> applies to the current session, regardless of foreman.
>>>
>>> John
>>>
>>>
>>> On Tue, Sep 4, 2018 at 1:00 PM, Joe Auty  wrote:
>>>
>>> Hello,
>>>>
>>>> We need to have some queries executed with store.format set to parquet
>>>> and
>>>> some with this option set to CSV. To date we have experimented with
>>>> setting
>>>> the store format for sessions controlled by using two separate user
>>>> logins
>>>> as a sort of context switch, but I'm wondering if the group here might
>>>> have
>>>> suggestions for a better way to handle this, particularly one that will
>>>> scale a little better for us?
>>>>
>>>> The main problem we have with this approach is in introducing multiple
>>>> drillbits/HA and assuring that the session and the settings we need are
>>>> respected across all drillbits (whether with an HAProxy + sticky session
>>>> approach or any other approach). There is a more general thread (which
>>>> I've
>>>> chosen not to hijack) about HA Drill from a more general standpoint, you
>>>> might think of my question here as being similar, but with the need for
>>>> a
>>>> context switch to support multiple Drill configurations/session optio

Re: Session handling with multiple drillbits

2018-09-04 Thread John Omernik
Are these ETL-ish type queries?  store.format should only apply when Drill
is writing data; when it is reading, it uses the filenames and other hints
to read.

Thus, if you do HA, say with DNS (like in the other thread), then prior
to running your CREATE TABLE AS (I am assuming this is what you are doing)
you can do ALTER SESSION set store.format = 'parquet'

Instead of ALTER SYSTEM, you can use ALTER SESSION so it only
applies to the current session, regardless of foreman.

John


On Tue, Sep 4, 2018 at 1:00 PM, Joe Auty  wrote:

> Hello,
>
> We need to have some queries executed with store.format set to parquet and
> some with this option set to CSV. To date we have experimented with setting
> the store format for sessions controlled by using two separate user logins
> as a sort of context switch, but I'm wondering if the group here might have
> suggestions for a better way to handle this, particularly one that will
> scale a little better for us?
>
> The main problem we have with this approach is in introducing multiple
> drillbits/HA and assuring that the session and the settings we need are
> respected across all drillbits (whether with an HAProxy + sticky session
> approach or any other approach). There is a more general thread (which I've
> chosen not to hijack) about HA Drill from a more general standpoint, you
> might think of my question here as being similar, but with the need for a
> context switch to support multiple Drill configurations/session options.
>
> Here are the various attempts and approaches we have come up with so far.
> I'm wondering if you'd have any general advice as to which approach would
> be best for us to take, considering future plans for Drill itself. For
> example, if need be we can write our own plugin(s) if this is the smartest
> approach:
>
> - embedded the store.format option into the query itself by chaining
> multiple queries together separated by a comma (it appears that this
> doesn't work at all)
> - look into writing some sort of plugin to allow us to scale our current
> approach somehow (I realize that this is vague)
> - a "foreman" approach where we stick with our current approach and direct
> all requests to our "foreman"/master with the hope and expectation that it
> will farm out work to the workers/slaves
> - multiple clusters set with different settings
>
> Each of these approaches seems to have its pros and cons. To reiterate:
> what approach do you think would be the smartest and most future-proof
> approach for us to take?
>
> Thanks in advance!
>


Re: Apache Drill High Availability using HAproxy

2018-08-27 Thread John Omernik
This is a great topic, and one I have run into running Drill on Apache
Mesos, since my drillbits sit behind what is essentially a DNS load
balancer (one DNS name, multiple drillbit IPs assigned to it).   That said,
I've run into a few issues and have a few workarounds. Note, I am talking
about the REST API here, not the other interfaces; I am not sure how those
would work (perhaps the same).

So the best way, if you are using HAProxy, is to use sticky connections.
Essentially, when a user connects to HAProxy, the connection to the
backend drillbit will stay sticky until a timeout period passes or the
session is closed.  This should give you the best user experience while
keeping HA.  I am not sure how HAProxy balances things; however, with a
decent Drill cluster size, it shouldn't be an issue.

I didn't have HAProxy set up, so what I did in my jupyter_drill module (
https://github.com/johnomernik/jupyter_drill) is, at the application level,
prior to connecting to Drill, do a DNS lookup and grab the first IP
returned. Then I connect directly to that drillbit for the duration
of the session. It's not perfect, and I have not tested it at scale, but
it has worked on a small scale. I even used some Python requests module
magic to use the host name in the SSL verification even though I am
connecting by IP.

So, a few options; if you are already looking at HAProxy, check into
sticky connections.


John


On Mon, Aug 20, 2018 at 1:01 PM, Paul Rogers 
wrote:

> Hi Satish,
>
> You did not say if you are using HAProxy for the RESTful API or the native
> Drill RPC (as used by the Drill client, JDBC and ODBC.)
>
> To understand the use of proxies and load balancers, it is helpful to
> remember that Drill is a stateful SQL engine. Drill encourages the use of
> many stateful commands such as USE, CTTAS, and ALTER SESSION.
>
> Session state is lost when connecting to a new Drillbit, or reconnecting
> to the same Drillbit. Thus, a query that runs fine before the reconnect can
> fail afterwards.
>
> This issue is not unique to Drill; it is a common constraint of all
> old-school SQL engines.
>
> If state were not an issue, then the Drill client itself could handle HA.
> The client is given a list of ZK nodes. The client, on encountering a
> disconnect, could ask ZK for a new node and reconnect. Since ZK is HA, the
> client can also recover from a ZK node failure by trying another.
>
> We discussed this client-based HA approach multiple times, but each time,
> the SQL state has been a show-stopper.
>
> In short, the issue is not whether to use HAProxy to solve the problem;
> Drill can do it internally in the client. The issue is how to handle
> session state.
>
> A possible solution would be to store user session state in ZK so that we
> could re-establish the same logical session after a physical reconnection.
> In particular a unique session ID could be used to key connections to
> session state in ZK.
>
> Making this change would be a good contributor project: it involves
> detailed knowledge of how the Drill session and ZK state work, but is
> pretty isolated to just those specific areas.
> Thanks,
> - Paul
>
>
>
> On Monday, August 20, 2018, 8:26:09 AM PDT, drill <
> ganesh.satis...@gmail.com> wrote:
>
>  Hi Team,
>
> Good evening. I am Satish, working as a big data developer. I need your help
> regarding Drill high availability using the HAProxy load balancer.
> Does Apache Drill support high availability? If yes, please let me know the
> process.
>
>
> -Thanks,
> Satish
>
> Sent from Mail for Windows 10
>
>


Functions in Drill but not in documentation

2018-08-02 Thread John Omernik
Hey all,

I was going to create a JIRA for this, but realized it's a symptom rather
than a problem.

I had a friend ask me "Hey, how do I do a regex search in Drill?"

I thought, well, look in the docs. I went to string functions and it's
not there... (doc page last updated Jan 14, 2016)

Then I thought... is there a way to do it? Something was tickling my brain
about this... and I searched and found a thread where Charles Givre
helpfully pointed out that REGEX_MATCHES would work for me...
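
In case it helps anyone searching the archives later, the usage ends up
being something like this (the table and column are made up, and I believe
the built-in is spelled regexp_matches):

select * from dfs.tmp.`logs`
where regexp_matches(msg, '.*error [0-9]+.*');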

It all came back to me: we need a better way to do function documentation.
Now, yes, the documentation needs to be updated, no doubt.  However, it's
more than that: what functions should a doc person add to the docs?   The
opportunity for misses exists.

So then I realized, hey I suggested a solution for this problem too. I
added a request for sys.functions here:

https://issues.apache.org/jira/browse/DRILL-4258

and here

https://issues.apache.org/jira/browse/DRILL-3988

So while this is a bit muddied by other requests as well, I want to point
out that this issue here is important.

Why? Well, today I was able to help my friend do a regex search, but he
was my friend, and while I am a good guy, not everyone is my friend.  Thus,
what happens when Drill could be the perfect tool for someone's use case,
with all the right things for them, but they give up on it because of a
documentation miss?

Yes, we could say "let's document better", or we could say "let's build
documentation into the code so it CAN'T be missed and saves time on
documenting". This is one of those things where the doc pages could be
regenerated after every release. If we included a field in sys.functions
called "added in", we could even keep a ledger of which functions exist in
which versions!

This email hopefully is a call to arms to renew interest on my idea for
sys.functions. Please feel free to poke holes here... my goals are to help
make Drill easier for everyone!

John


Re: Apache Drill on Kubernetes

2018-07-26 Thread John Omernik
Absolutely it can! I don't have a Dockerfile for it handy, but I know it
works well!

On Thu, Jul 26, 2018, 10:45 AM Arjun Rao  wrote:

> Hi,
> I am new to this forum and am excited to use Drill. This might have been
> discussed in the past but I wanted to know if Apache Drill can be run on
> Kubernetes and if not, if it's on the roadmap for Drill?
>
> Appreciate the help!
>
> Best,
> Arjun
>


Re: Array Index Out of Bounds in String Binary

2018-07-18 Thread John Omernik
Interesting, so the 256 limit is on output, not on input?

Is DRILL-6607 enough to track this?  If so, I have one more "feature" to
add; I'm not sure if I should include it on 6607 or create a new JIRA and
link them. Basically, I'd like the ability to pass an int to the function.
(Does Java have optional arguments? It must, because of substr(data, start)
and substr(data, start, nochars).)  Basically, string_binary(data) would
work as intended (with the limitation fixed), and string_binary(data, 1)
would work on the binary but replace EVERY character with its hex
representation.

An optional third argument would be a format string of some sort so the
user could pick the output, but I like the idea of having every character
as hex for analysis.

John

On Tue, Jul 17, 2018 at 3:40 PM, Vlad Rozov  wrote:

> A. +1.
>
> B. Every byte in a binary data may require up to 4 bytes (0xXX) in the
> string representation, so 80 may work, 60 should reliably work.
>
> Thank you,
>
> Vlad
>
>
> On 7/17/18 13:14, John Omernik wrote:
>
>> Yet this works?
>>
>> string_binary(byte_substr(`data`, 1, 80))
>>
>> On Tue, Jul 17, 2018 at 3:12 PM, John Omernik  wrote:
>>
>> So on B. I found the BYTE_SUBSTR and only send 200 bytes to the
>>> string_binary function, I still get an error.  Something else is
>>> happening
>>> here...
>>>
>>> select `type`, `timestamp`, `src_ip`, `dst_ip`, `src_port`, `dst_port`,
>>> string_binary(byte_substr(`data`, 1, 200)) as mydata from
>>> `user/jomernik/bf2_7306.pcap` limit 10
>>>
>>> I get the same
>>>
>>> Error Id: 213075e7-378a-437f-a5dc-408326f123f3 on
>>> zeta3.brewingintel.com:20005]
>>>
>>> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
>>> IndexOutOfBoundsException: index: 0, length: 379 (expected: range(0,
>>> 256))
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jul 17, 2018 at 12:56 PM, John Omernik  wrote:
>>>
>>> Thanks Vlad a couple of thoughts.
>>>>
>>>>
>>>> A. I think that should be fixed. That seems like a limitation that is
>>>> both unexpected and undocumented.
>>>>
>>>> B.  Is there a way, if my data in the table is returned as binary to
>>>> start with, for me to return the first 256 bytes? I tried substring, and
>>>> tries to force to UTF-8 and I am getting some issues there.
>>>>
>>>> On Tue, Jul 17, 2018 at 10:33 AM, Vlad Rozov  wrote:
>>>>
>>>> In case of DRILL-6607 the issue lies in the implementation of
>>>>> "string_binary" function: it is not prepared to handle incoming data
>>>>> that
>>>>> when converted to a binary string would exceed 256 bytes as it does not
>>>>> reallocate the output buffer. Until the function code is fixed, the
>>>>> only
>>>>> way to avoid the error is either not to use "string_binary" or to use
>>>>> it
>>>>> with the data that meets "string_binary" limitation.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> Vlad
>>>>>
>>>>>
>>>>> On 7/13/18 14:01, Ted Dunning wrote:
>>>>>
>>>>> There are bounds for acceptable behavior for a function like this.
>>>>>> Array
>>>>>> index out of bounds is not acceptable. Aborting with a clean message
>>>>>> about
>>>>>> to true problem might be fine, as would be to return a null.
>>>>>>
>>>>>> On Fri, Jul 13, 2018, 13:46 John Omernik  wrote:
>>>>>>
>>>>>> So, as to the actual problem, I opened a JIRA here:
>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/DRILL-6607
>>>>>>>
>>>>>>> The reason I brought this here is my own curiosity:  Does an issue in
>>>>>>> using
>>>>>>> this function most likely lie in the function code itself not
>>>>>>> handling
>>>>>>> good
>>>>>>> data, or is the issue in the pcap pluglin which produces the data for
>>>>>>> this
>>>>>>> function to consume, I am just curious on how something like this
>>>>>>> could be
>>>>>>> avoided.
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>>
>


Re: Array Index Out of Bounds in String Binary

2018-07-17 Thread John Omernik
Yet this works?

string_binary(byte_substr(`data`, 1, 80))

On Tue, Jul 17, 2018 at 3:12 PM, John Omernik  wrote:

> So on B. I found the BYTE_SUBSTR and only send 200 bytes to the
> string_binary function, I still get an error.  Something else is happening
> here...
>
> select `type`, `timestamp`, `src_ip`, `dst_ip`, `src_port`, `dst_port`,
> string_binary(byte_substr(`data`, 1, 200)) as mydata from
> `user/jomernik/bf2_7306.pcap` limit 10
>
> I get the same
>
> Error Id: 213075e7-378a-437f-a5dc-408326f123f3 on
> zeta3.brewingintel.com:20005]
>
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
> IndexOutOfBoundsException: index: 0, length: 379 (expected: range(0, 256))
>
>
>
>
>
> On Tue, Jul 17, 2018 at 12:56 PM, John Omernik  wrote:
>
>>
>> Thanks Vlad a couple of thoughts.
>>
>>
>> A. I think that should be fixed. That seems like a limitation that is
>> both unexpected and undocumented.
>>
>> B.  Is there a way, if my data in the table is returned as binary to
>> start with, for me to return the first 256 bytes? I tried substring, and
>> tries to force to UTF-8 and I am getting some issues there.
>>
>> On Tue, Jul 17, 2018 at 10:33 AM, Vlad Rozov  wrote:
>>
>>> In case of DRILL-6607 the issue lies in the implementation of
>>> "string_binary" function: it is not prepared to handle incoming data that
>>> when converted to a binary string would exceed 256 bytes as it does not
>>> reallocate the output buffer. Until the function code is fixed, the only
>>> way to avoid the error is either not to use "string_binary" or to use it
>>> with the data that meets "string_binary" limitation.
>>>
>>> Thank you,
>>>
>>> Vlad
>>>
>>>
>>> On 7/13/18 14:01, Ted Dunning wrote:
>>>
>>>> There are bounds for acceptable behavior for a function like this.
>>>> Array
>>>> index out of bounds is not acceptable. Aborting with a clean message
>>>> about
>>>> to true problem might be fine, as would be to return a null.
>>>>
>>>> On Fri, Jul 13, 2018, 13:46 John Omernik  wrote:
>>>>
>>>> So, as to the actual problem, I opened a JIRA here:
>>>>>
>>>>> https://issues.apache.org/jira/browse/DRILL-6607
>>>>>
>>>>> The reason I brought this here is my own curiosity:  Does an issue in
>>>>> using
>>>>> this function most likely lie in the function code itself not handling
>>>>> good
>>>>> data, or is the issue in the pcap pluglin which produces the data for
>>>>> this
>>>>> function to consume, I am just curious on how something like this
>>>>> could be
>>>>> avoided.
>>>>>
>>>>> John
>>>>>
>>>>>
>>>
>>
>


Re: Array Index Out of Bounds in String Binary

2018-07-17 Thread John Omernik
So on B: I found BYTE_SUBSTR and only send 200 bytes to the
string_binary function, and I still get an error.  Something else is
happening here...

select `type`, `timestamp`, `src_ip`, `dst_ip`, `src_port`, `dst_port`,
string_binary(byte_substr(`data`, 1, 200)) as mydata from
`user/jomernik/bf2_7306.pcap` limit 10

I get the same

Error Id: 213075e7-378a-437f-a5dc-408326f123f3 on
zeta3.brewingintel.com:20005]

org.apache.drill.common.exceptions.UserException: SYSTEM ERROR:
IndexOutOfBoundsException: index: 0, length: 379 (expected: range(0, 256))





On Tue, Jul 17, 2018 at 12:56 PM, John Omernik  wrote:

>
> Thanks Vlad a couple of thoughts.
>
>
> A. I think that should be fixed. That seems like a limitation that is both
> unexpected and undocumented.
>
> B.  Is there a way, if my data in the table is returned as binary to start
> with, for me to return the first 256 bytes? I tried substring, and tries to
> force to UTF-8 and I am getting some issues there.
>
> On Tue, Jul 17, 2018 at 10:33 AM, Vlad Rozov  wrote:
>
>> In case of DRILL-6607 the issue lies in the implementation of
>> "string_binary" function: it is not prepared to handle incoming data that
>> when converted to a binary string would exceed 256 bytes as it does not
>> reallocate the output buffer. Until the function code is fixed, the only
>> way to avoid the error is either not to use "string_binary" or to use it
>> with the data that meets "string_binary" limitation.
>>
>> Thank you,
>>
>> Vlad
>>
>>
>> On 7/13/18 14:01, Ted Dunning wrote:
>>
>>> There are bounds for acceptable behavior for a function like this.  Array
>>> index out of bounds is not acceptable. Aborting with a clean message
>>> about
>>> to true problem might be fine, as would be to return a null.
>>>
>>> On Fri, Jul 13, 2018, 13:46 John Omernik  wrote:
>>>
>>> So, as to the actual problem, I opened a JIRA here:
>>>>
>>>> https://issues.apache.org/jira/browse/DRILL-6607
>>>>
>>>> The reason I brought this here is my own curiosity:  Does an issue in
>>>> using
>>>> this function most likely lie in the function code itself not handling
>>>> good
>>>> data, or is the issue in the pcap pluglin which produces the data for
>>>> this
>>>> function to consume, I am just curious on how something like this could
>>>> be
>>>> avoided.
>>>>
>>>> John
>>>>
>>>>
>>
>


Re: Array Index Out of Bounds in String Binary

2018-07-17 Thread John Omernik
Thanks Vlad, a couple of thoughts.


A. I think that should be fixed. That seems like a limitation that is both
unexpected and undocumented.

B.  Is there a way, if my data in the table is returned as binary to start
with, for me to return just the first 256 bytes? I tried substring, and
tried to force it to UTF-8, and I am getting some issues there.

On Tue, Jul 17, 2018 at 10:33 AM, Vlad Rozov  wrote:

> In case of DRILL-6607 the issue lies in the implementation of
> "string_binary" function: it is not prepared to handle incoming data that
> when converted to a binary string would exceed 256 bytes as it does not
> reallocate the output buffer. Until the function code is fixed, the only
> way to avoid the error is either not to use "string_binary" or to use it
> with the data that meets "string_binary" limitation.
>
> Thank you,
>
> Vlad
>
>
> On 7/13/18 14:01, Ted Dunning wrote:
>
>> There are bounds for acceptable behavior for a function like this.  Array
>> index out of bounds is not acceptable. Aborting with a clean message about
>> to true problem might be fine, as would be to return a null.
>>
>> On Fri, Jul 13, 2018, 13:46 John Omernik  wrote:
>>
>> So, as to the actual problem, I opened a JIRA here:
>>>
>>> https://issues.apache.org/jira/browse/DRILL-6607
>>>
>>> The reason I brought this here is my own curiosity:  Does an issue in
>>> using
>>> this function most likely lie in the function code itself not handling
>>> good
>>> data, or is the issue in the pcap pluglin which produces the data for
>>> this
>>> function to consume, I am just curious on how something like this could
>>> be
>>> avoided.
>>>
>>> John
>>>
>>>
>


Array Index Out of Bounds in String Binary

2018-07-13 Thread John Omernik
So, as to the actual problem, I opened a JIRA here:

https://issues.apache.org/jira/browse/DRILL-6607

The reason I brought this here is my own curiosity:  does an issue in using
this function most likely lie in the function code itself not handling good
data, or in the pcap plugin which produces the data for this function to
consume? I am just curious how something like this could be avoided.

John


Re: Way to "pivot"

2018-03-06 Thread John Omernik
Looks as though there is a JIRA, I added some of the notes here to the
comments:

https://issues.apache.org/jira/browse/DRILL-4223



On Tue, Mar 6, 2018 at 1:59 PM, Saurabh Mahapatra <
saurabhmahapatr...@gmail.com> wrote:

> Looks like SQL Server supports it, not sure if this is in the SQL standard:
>
> https://stackoverflow.com/questions/15931607/convert-
> rows-to-columns-using-pivot-in-sql-server
>
>
>
> On Tue, Mar 6, 2018 at 11:47 AM, Kunal Khatua <kunalkha...@gmail.com>
> wrote:
>
> > Not until now  :)
> >
> > Can you file a JIRA so that we can track it?
> >
> > On Tue, Mar 6, 2018 at 11:40 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > Perfect.  That works for me because I have a limited number of values,
> I
> > > could see that getting out of hand if the values were unknown. Has
> there
> > > been any talk of a Pivot function in Drill? That would be helpful so
> you
> > > didn't have to know the column names ahead of time.
> > >
> > > John
> > >
> > > On Tue, Mar 6, 2018 at 10:50 AM, Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > >
> > > > Arjun's approach works even if the timestamps are not unique.
> > Especially
> > > if
> > > > you use avg instead of max.
> > > >
> > > > On Mar 6, 2018 8:47 AM, "Arjun kr" <arjun...@outlook.com> wrote:
> > > >
> > > > > If each timestamp has only one set of values for (x,y,z) , you can
> > try
> > > > > something like below.
> > > > >
> > > > > select dt ,
> > > > >  max(case when source='X' THEN `value` else 0.0 end) as X,
> > > > >  max(case when source='Y' THEN `value` else 0.0 end) as Y,
> > > > >  max(case when source='Z' THEN `value` else 0.0 end) as Z
> > > > > from
> > > > >  
> > > > > group by dt;
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Arjun
> > > > >
> > > > >
> > > > >
> > > > > 
> > > > > From: Andries Engelbrecht <aengelbre...@mapr.com>
> > > > > Sent: Tuesday, March 6, 2018 9:11 PM
> > > > > To: user@drill.apache.org
> > > > > Subject: Re: Way to "pivot"
> > > > >
> > > > > If the X, Y and Z is unique for each timestamp you can perhaps use
> > > group
> > > > > by (dt, X, Y , Z)  and case to make the X, Y , Z columns. May be
> > worth
> > > > > looking into, but is going to be expensive to execute. Just an
> idea,
> > > but
> > > > > have not tested it.
> > > > >
> > > > > --Andries
> > > > >
> > > > >
> > > > > On 3/6/18, 6:46 AM, "John Omernik" <j...@omernik.com> wrote:
> > > > >
> > > > > I am not sure if this is the right thing for what I am trying
> to
> > > do,
> > > > > but I
> > > > > have data in this formate
> > > > >
> > > > >
> > > > > source   dtvalue
> > > > > X2018-03-06 11:00 0.31
> > > > > X2018-03-06 12:00 0.94
> > > > > X2018-03-06 13:00 0.89
> > > > > X2018-03-06 14:00 0.01
> > > > > X2018-03-06 15:00 0.43
> > > > > Y2018-03-06 11:00 1.43
> > > > > Y2018-03-06 12:00 0.50
> > > > > Y2018-03-06 13:00 0.10
> > > > > Y2018-03-06 14:00 0.42
> > > > > Y2018-03-06 15:00 0.41
> > > > > Z2018-03-06 11:00 5.34
> > > > > Z2018-03-06 12:00 4.32
> > > > > Z2018-03-06 13:00 4.20
> > > > > Z2018-03-06 14:00 0.89
> > > > > Z2018-03-06 15:00 0.01
> > > > >
> > > > > I'd like to graph it as three lines (X, Y and Z) over time, so
> > the
> > > > > graph
> > > > > tool I am using asks for it this format:
> > > > >
> > > > >
> > > > >
> > > > > dt   X Y
> > > > >Z
> > > > >
> > > > > 2018-03-06 11:00 0.31   1.43 5.34
> > > > > 2018-03-06 12:00 0.94   0.50 4.32
> > > > > 2018-03-06 13:00 0.89   0.10 4.20
> > > > > 2018-03-06 14:00 0.01   0.42 0.89
> > > > > 2018-03-06 15:00 0.43   0.41 0.01
> > > > >
> > > > >
> > > > > So I think that would be a PIVOT like function right (which I
> > don't
> > > > > think
> > > > > Drill has) Is there a way to "fake" this in Drill using some
> > other
> > > > > built in
> > > > > functions?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > John
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>


Re: Way to "pivot"

2018-03-06 Thread John Omernik
Perfect.  That works for me because I have a limited number of values; I
could see it getting out of hand if the values were unknown. Has there
been any talk of a PIVOT function in Drill? That would be helpful so you
didn't have to know the column names ahead of time.
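
For reference, applying Arjun's suggestion (quoted below) to my data looks
roughly like this (the table name is a placeholder):

select dt,
       max(case when source = 'X' then `value` end) as X,
       max(case when source = 'Y' then `value` end) as Y,
       max(case when source = 'Z' then `value` end) as Z
from dfs.tmp.`readings`
group by dt
order by dt;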

John

On Tue, Mar 6, 2018 at 10:50 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Arjun's approach works even if the timestamps are not unique. Especially if
> you use avg instead of max.
>
> On Mar 6, 2018 8:47 AM, "Arjun kr" <arjun...@outlook.com> wrote:
>
> > If each timestamp has only one set of values for (x,y,z) , you can try
> > something like below.
> >
> > select dt ,
> >  max(case when source='X' THEN `value` else 0.0 end) as X,
> >  max(case when source='Y' THEN `value` else 0.0 end) as Y,
> >  max(case when source='Z' THEN `value` else 0.0 end) as Z
> > from
> >  
> > group by dt;
> >
> > Thanks,
> >
> > Arjun
> >
> >
> >
> > 
> > From: Andries Engelbrecht <aengelbre...@mapr.com>
> > Sent: Tuesday, March 6, 2018 9:11 PM
> > To: user@drill.apache.org
> > Subject: Re: Way to "pivot"
> >
> > If the X, Y and Z is unique for each timestamp you can perhaps use group
> > by (dt, X, Y , Z)  and case to make the X, Y , Z columns. May be worth
> > looking into, but is going to be expensive to execute. Just an idea, but
> > have not tested it.
> >
> > --Andries
> >
> >
> > On 3/6/18, 6:46 AM, "John Omernik" <j...@omernik.com> wrote:
> >
> > I am not sure if this is the right thing for what I am trying to do,
> > but I
> > have data in this formate
> >
> >
> > source   dtvalue
> > X2018-03-06 11:00 0.31
> > X2018-03-06 12:00 0.94
> > X2018-03-06 13:00 0.89
> > X2018-03-06 14:00 0.01
> > X2018-03-06 15:00 0.43
> > Y2018-03-06 11:00 1.43
> > Y2018-03-06 12:00 0.50
> > Y2018-03-06 13:00 0.10
> > Y2018-03-06 14:00 0.42
> > Y2018-03-06 15:00 0.41
> > Z2018-03-06 11:00 5.34
> > Z2018-03-06 12:00 4.32
> > Z2018-03-06 13:00 4.20
> > Z2018-03-06 14:00 0.89
> > Z2018-03-06 15:00 0.01
> >
> > I'd like to graph it as three lines (X, Y and Z) over time, so the
> > graph
> > tool I am using asks for it this format:
> >
> >
> >
> > dt   X Y
> >Z
> >
> > 2018-03-06 11:00 0.31   1.43 5.34
> > 2018-03-06 12:00 0.94   0.50 4.32
> > 2018-03-06 13:00 0.89   0.10 4.20
> > 2018-03-06 14:00 0.01   0.42 0.89
> > 2018-03-06 15:00 0.43   0.41 0.01
> >
> >
> > So I think that would be a PIVOT like function right (which I don't
> > think
> > Drill has) Is there a way to "fake" this in Drill using some other
> > built in
> > functions?
> >
> > Thanks!
> >
> > John
> >
> >
> >
>


Way to "pivot"

2018-03-06 Thread John Omernik
I am not sure if this is the right thing for what I am trying to do, but I
have data in this format


source   dtvalue
X2018-03-06 11:00 0.31
X2018-03-06 12:00 0.94
X2018-03-06 13:00 0.89
X2018-03-06 14:00 0.01
X2018-03-06 15:00 0.43
Y2018-03-06 11:00 1.43
Y2018-03-06 12:00 0.50
Y2018-03-06 13:00 0.10
Y2018-03-06 14:00 0.42
Y2018-03-06 15:00 0.41
Z2018-03-06 11:00 5.34
Z2018-03-06 12:00 4.32
Z2018-03-06 13:00 4.20
Z2018-03-06 14:00 0.89
Z2018-03-06 15:00 0.01

I'd like to graph it as three lines (X, Y and Z) over time, so the graph
tool I am using asks for it in this format:



dt   X Y Z

2018-03-06 11:00 0.31   1.43 5.34
2018-03-06 12:00 0.94   0.50 4.32
2018-03-06 13:00 0.89   0.10 4.20
2018-03-06 14:00 0.01   0.42 0.89
2018-03-06 15:00 0.43   0.41 0.01


So I think that would be a PIVOT-like function, right? (Which I don't think
Drill has.) Is there a way to "fake" this in Drill using some other built-in
functions?

Thanks!

John


Re: Looking for user feedback on DRILL-5741

2018-02-18 Thread John Omernik
I like this idea in general.  When running under orchestrators like
Yarn, Marathon, or Kubernetes, it's true that the thing that starts Drill
"manages" memory; however, there are issues in that you need to set up the
variables in Drill so they do not exceed the amount the orchestrator has
allocated.  Once an orchestrator sees a managed process overtake what it
allocated, it often kills the process. In Drill that can mean drillbits
that get killed during queries, which leads to a bad user
experience.  Folks configuring Drill in the field have had to set the Heap,
Direct and other settings and hope that they got it right to ensure this
didn't happen.

This option provides a way for people to start working with reasonable
settings, and I like being able to set it by percentage or by absolute
value. This is important in multi-tenant environments.

I think I saw in the JIRA that Drill will indicate at startup what
allocation was used, based on which variables.  I think this is important.
Log it at drillbit start, both to stdout and to the drillbit.log file.  Indicate
what method was used for allocation, what the user-provided values were,
and, for auto allocations, the split provided. (Maybe even provide it in such
a way that, if a user read it and wanted to tweak, they could take the
auto-allocated output message and cut and paste that into drill_env.sh.
I.e., print the variables and the values that got auto allocated. That way,
as an administrator, if I felt the need to tweak settings, I can take
exactly what the auto-allocation outputted, put it into my env script, and
then tweak to my heart's desire.)


This is pretty cool.

John



On Fri, Feb 16, 2018 at 1:15 PM, Kunal Khatua  wrote:

> Hi everyone
>
> We're working on simplifying the Drill memory configuration process, where
> by, users no more have the need for getting into the specifics of Heap and
> Direct memory allocations.
> Here is the original JIRA https://issues.apache.org/jira/browse/DRILL-5741
>
> The idea is to simply provide the Drill process with a single dimensional
> of total process memory (either in absolute values, or as a percentage of
> the total system memory), while Drill's startup scripts automatically do an
> optimal allocations. This, of course, can be overridden.
>
> What I'm looking for feed back for which, you're welcome to try the
> Commit (1ad11ee44902c11efa69cde908002f59169f61d7) specified in the
> following
> https://github.com/apache/drill/pull/1082
>
> You can try building Drill with this private branch (to which the pull
> request is linked): https://github.com/kkhatua/drill/commit/
> 1ad11ee44902c11efa69cde908002f59169f61d7
>
> Once you've done a clean setup, you should only need to edit
>
> and uncomment the line having the property - "DRILLBIT_MAX_PROC_MEM" with
> a setting like (say) 50%.
> export DRILLBIT_MAX_PROC_MEM=50%
>
> After that, Drill should start up successfully. Log messages should appear
> in drillbit.out showing that Drill has auto-configured the memory.
>
> I'm looking forward to hearing back from folks who've tried this.
>
> TIA
> ~ Kunal
>


Re: Documentation Update for Drill-4286

2018-02-12 Thread John Omernik
Ah cool. Thanks for the update!

On Mon, Feb 12, 2018 at 3:14 PM, Kunal Khatua <kkha...@mapr.com> wrote:

> This made it into Drill 1.12, but there is some additional work to be
> done. We are hoping to have it completed in time for 1.13 release, along
> with documentation.
>
> -Original Message-
> From: John Omernik [mailto:j...@omernik.com]
> Sent: Monday, February 12, 2018 12:44 PM
> To: user <user@drill.apache.org>
> Subject: Documentation Update for Drill-4286
>
> I see in 1.12 this feature was added("Have an ability to put server in
> quiescent mode of operation")
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.
> apache.org_jira_browse_DRILL-2D4286=DwIBaQ=cskdkSMqhcnjZxdQVpwTXg=-
> cT6otg6lpT_XkmYy7yg3A=PrOyVv1V67549FcefWPJBS2OsTgami-UA0o63UL7gwk=
> VrD0nSaVQysOqvIJjjEcmRcWME2mbMMZ5CRiywVla9o=
>
> Was the documentation updated? I couldn't find it.  Perhaps we need a new
> section on the Drill Docs page "Administrate Drill" for this and other
> related items to how to administrate your Drill cluster (I could also see
> the argument for putting it under configure drill)
>
>
> John
>


Documentation Update for Drill-4286

2018-02-12 Thread John Omernik
I see in 1.12 this feature was added("Have an ability to put server in
quiescent mode of operation")

https://issues.apache.org/jira/browse/DRILL-4286

Was the documentation updated? I couldn't find it.  Perhaps we need a new
section on the Drill Docs page, "Administrate Drill", for this and other
related items on how to administer your Drill cluster (I could also see
the argument for putting it under Configure Drill).


John


Re: MapR Drill 1.12 Mismatch between Native and Library Versions

2018-02-09 Thread John Omernik
It is a Secure 6.0 MapR Cluster.

It's running drill-1.12.0. I initially used the RPM directory, but have now
moved to the drill.tar.gz archive inside the drill-yarn package from the
same version of Drill. I am running this in Marathon on Mesos.

I am using the same config that worked just fine with Drill 1.10.  I have
made changes based on reviewing the scripts in the MapR install to ensure I
pass the right flags to Drill to start it in secure mode.

Well, this is interesting: I started it and stepped away... and this is
what I came back to: a running drillbit. (It appears the web server isn't
failing to start, it's just slow to start... is there a good way to
troubleshoot this aspect?)

2018-02-09 12:33:18,575 [main] INFO  o.apache.drill.exec.server.Drillbit -
Startup completed (600534 ms).

On Fri, Feb 9, 2018 at 12:44 PM, Sorabh Hamirwasia <shamirwa...@mapr.com>
wrote:

> Hi John,
>
> Can you please share all your config files (*.conf and *.env) ? If I
> understand correctly below is your environment setting:
>
> 1) Secure MapR cluster.
>
> 2) Running drill-1.12.0 installed from MapR released RPM.
>
> 3) Re-using the configuration from drill-1.10.0.
>
>
> Thanks,
> Sorabh
>
> 
> From: John Omernik <j...@omernik.com>
> Sent: Friday, February 9, 2018 6:40:02 AM
> To: user
> Subject: Re: MapR Drill 1.12 Mismatch between Native and Library Versions
>
> I've made progress. I've pulled out the drill-yarn rpm, and realized there
> was a drill.tar.gz file in there. I had a hunch with the work that Paul
> Rogers was doing on Drill on Yarn, that this was a executor package and
> thus had the MapR stuff, but also the complete MapR stuff so it could be
> downloaded to a node and ran.  Unpacking this, I think I am correct in that
> assumption.  (I am running Drill on Mesos, so this  is pretty good for me).
> (Suggestion to MapR: Make an RPM that is drill-executor that is only the
> prepackaged drill directory with all the things, and have that be a
> separate rpm, and have drill-yarn be all the additional stuff needed to run
> drill on yarn, that way, when I get around to contributing a drill on mesos
> rpm, I can just use your executor package!)
>
> That said, it's starting, but now I am getting no errors, not in the
> drillbit.log file, not in std out or std err. The RPC ports are listening,
> but the web port is not. I am wondering if something changed in the configs
> that I need to account for (I am using my configs from 1.10)
>
> I am running on a secure cluster in mapr, and using authentication and SSL
> in Drill.   The only "MAYBE" related message I saw was below.  However, my
> only concern is "Encryption disabled" it still doesn't explain why the web
> server is not starting without an error. Any ideas on where to look to
> trouble shoot?
>
> Thanks!
>
>
>
>
> 2018-02-09 08:34:03,915 [main] INFO  o.a.d.c.s.persistence.ScanResult -
> loading 2 classes for
> org.apache.drill.exec.rpc.user.security.UserAuthenticator took 2ms
>
> 2018-02-09 08:34:03,925 [main] INFO  o.a.d.e.r.s.AuthenticatorProviderImpl
> - Configured authentication mechanisms: [plain]
>
> 2018-02-09 08:34:04,352 [main] INFO  o.a.d.e.s.s.PersistentStoreRegistry -
> Using the configured PStoreProvider class:
> 'org.apache.drill.exec.store.sys.store.provider.
> ZookeeperPersistentStoreProvider'.
>
> 2018-02-09 08:34:05,788 [main] INFO  o.a.d.e.r.user.UserConnectionConfig -
> Configured all user connections to require authentication with encryption:
> Encryption: disabled , MaxWrappedSize: 65536 , WrapSizeLimit: 0 using:
> [plain]
>
> 2018-02-09 08:34:05,896 [main] INFO  o.apache.drill.exec.server.Drillbit -
> Construction completed (3461 ms).
>
>
> On Fri, Feb 9, 2018 at 8:10 AM, John Omernik <j...@omernik.com> wrote:
>
> > So already, you have given me some things to work with. Knowing that
> there
> > may links to jars/3rdparty was very helpful.  In opening that folder on
> > drill-1.12.0 I found there were no links/files related to mapr.  In my
> > drill-1.10.0 version (both of them from MapR) there were three files,
> > maprfs, maprdb, and mapr-hbase.
> >
> > So I added only those three files from /opt/mapr/lib to $DRILL_CLASSPATH
> > in my drill_env.sh.   This allowed drill to start without that same
> error.
> >
> > Now, I am being a little different. Instead of "installing" drill via
> > RPMs, I download the RPMs (and I did this for both 1.10 and 1.12 from
> MapR)
> > The difference I think is in 1.10 there was a "drill" package and now in
> > 1.12, there is a drill-internal package.  Perhaps the drill in 1.10 moved
> > some things aroun

Re: MapR Drill 1.12 Mismatch between Native and Library Versions

2018-02-09 Thread John Omernik
I've made progress. I've pulled out the drill-yarn rpm, and realized there
was a drill.tar.gz file in there. I had a hunch, given the work that Paul
Rogers was doing on Drill on YARN, that this was an executor package and
thus had the MapR stuff, but also the complete MapR stuff, so it could be
downloaded to a node and run.  Unpacking this, I think I am correct in that
assumption.  (I am running Drill on Mesos, so this is pretty good for me.)
(Suggestion to MapR: make an RPM called drill-executor that is only the
prepackaged drill directory with all the things, and have that be a
separate rpm, and have drill-yarn be all the additional stuff needed to run
Drill on YARN; that way, when I get around to contributing a Drill on Mesos
rpm, I can just use your executor package!)

That said, it's starting, but now I am getting no errors: not in the
drillbit.log file, not in stdout or stderr. The RPC ports are listening,
but the web port is not. I am wondering if something changed in the configs
that I need to account for (I am using my configs from 1.10).

I am running on a secure cluster in MapR, and using authentication and SSL
in Drill.   The only "MAYBE" related message I saw was below.  However, my
only concern there is "Encryption: disabled", and it still doesn't explain
why the web server is not starting without an error. Any ideas on where to
look to troubleshoot?

Thanks!




2018-02-09 08:34:03,915 [main] INFO  o.a.d.c.s.persistence.ScanResult -
loading 2 classes for
org.apache.drill.exec.rpc.user.security.UserAuthenticator took 2ms

2018-02-09 08:34:03,925 [main] INFO  o.a.d.e.r.s.AuthenticatorProviderImpl
- Configured authentication mechanisms: [plain]

2018-02-09 08:34:04,352 [main] INFO  o.a.d.e.s.s.PersistentStoreRegistry -
Using the configured PStoreProvider class:
'org.apache.drill.exec.store.sys.store.provider.ZookeeperPersistentStoreProvider'.

2018-02-09 08:34:05,788 [main] INFO  o.a.d.e.r.user.UserConnectionConfig -
Configured all user connections to require authentication with encryption:
Encryption: disabled , MaxWrappedSize: 65536 , WrapSizeLimit: 0 using:
[plain]

2018-02-09 08:34:05,896 [main] INFO  o.apache.drill.exec.server.Drillbit -
Construction completed (3461 ms).


On Fri, Feb 9, 2018 at 8:10 AM, John Omernik <j...@omernik.com> wrote:

> So already, you have given me some things to work with. Knowing that there
> may links to jars/3rdparty was very helpful.  In opening that folder on
> drill-1.12.0 I found there were no links/files related to mapr.  In my
> drill-1.10.0 version (both of them from MapR) there were three files,
> maprfs, maprdb, and mapr-hbase.
>
> So I added only those three files from /opt/mapr/lib to $DRILL_CLASSPATH
> in my drill_env.sh.   This allowed drill to start without that same error.
>
> Now, I am being a little different. Instead of "installing" drill via
> RPMs, I download the RPMs (and I did this for both 1.10 and 1.12 from MapR)
> The difference I think is in 1.10 there was a "drill" package and now in
> 1.12, there is a drill-internal package.  Perhaps the drill in 1.10 moved
> some things around better. For what ever reason that changed (I think it
> changed in 1.11, but I was having cluster issues during that).
>
> That said, I have a drill bit running, but it's not starting it's web
> service. I am going to go reverse engineer what the RPM actually does. In
> my case, for 1.10 and every version previous, I just took the mapr RPM,
> unpacked it, and grabbed the drill directory and it worked great. This is
> obviously no longer the case, and I will have to dig deeper.
>
>
>
> On Thu, Feb 8, 2018 at 4:01 PM, Abhishek Girish <agir...@apache.org>
> wrote:
>
>> Can you also share the contents of (1) MapR build version on the cluster
>> nodes (cat /opt/mapr/MapRBuildVersion) (2) Drill RPM version installed
>> (rpm
>> -qa |grep -i mapr-drill)
>>
>> And also verify if the maprfs and maprdb jars inside
>> $DRILL_HOME/jars/3rdparty are links to the corresponding jars in
>> /opt/mapr/lib?
>>
>> On Thu, Feb 8, 2018 at 1:50 PM, Kunal Khatua <kkha...@mapr.com> wrote:
>>
>> > It might be to do with the way you've installed Drill.
>> >
>> > If you've built and deployed Drill, odds are that the client will be
>> > different. With the RPM installation, however, the installer has
>> symlinks
>> > to make the mapr-client libraries required by Drill be pointing to the
>> > libraries available in /opt/mapr/lib/
>> >
>> > I don't know the exact details of what all gets symlinked, but this step
>> > should have ensured that you don't see mismatch between the versions.
>> >
>> > That said... Support would be better equipped to help you with this.
>> >
>> > -

Re: MapR Drill 1.12 Mismatch between Native and Library Versions

2018-02-09 Thread John Omernik
So already, you have given me some things to work with. Knowing that there
may be links in jars/3rdparty was very helpful.  In opening that folder in
drill-1.12.0, I found there were no links/files related to mapr.  In my
drill-1.10.0 version (both of them from MapR) there were three files:
maprfs, maprdb, and mapr-hbase.

So I added only those three files from /opt/mapr/lib to $DRILL_CLASSPATH in
my drill_env.sh.   This allowed drill to start without that same error.

Now, I am doing things a little differently. Instead of "installing" Drill via
RPMs, I download the RPMs (and I did this for both 1.10 and 1.12 from MapR). The
difference, I think, is that in 1.10 there was a "drill" package and now in 1.12
there is a drill-internal package.  Perhaps the drill package in 1.10 moved some
things around better. For whatever reason that changed (I think it changed
in 1.11, but I was having cluster issues during that).

That said, I have a drillbit running, but it's not starting its web
service. I am going to go reverse engineer what the RPM actually does. In
my case, for 1.10 and every version previous, I just took the MapR RPM,
unpacked it, and grabbed the drill directory, and it worked great. This is
obviously no longer the case, and I will have to dig deeper.



On Thu, Feb 8, 2018 at 4:01 PM, Abhishek Girish <agir...@apache.org> wrote:

> Can you also share the contents of (1) MapR build version on the cluster
> nodes (cat /opt/mapr/MapRBuildVersion) (2) Drill RPM version installed (rpm
> -qa |grep -i mapr-drill)
>
> And also verify if the maprfs and maprdb jars inside
> $DRILL_HOME/jars/3rdparty are links to the corresponding jars in
> /opt/mapr/lib?
>
> On Thu, Feb 8, 2018 at 1:50 PM, Kunal Khatua <kkha...@mapr.com> wrote:
>
> > It might be to do with the way you've installed Drill.
> >
> > If you've built and deployed Drill, odds are that the client will be
> > different. With the RPM installation, however, the installer has symlinks
> > to make the mapr-client libraries required by Drill be pointing to the
> > libraries available in /opt/mapr/lib/
> >
> > I don't know the exact details of what all gets symlinked, but this step
> > should have ensured that you don't see mismatch between the versions.
> >
> > That said... Support would be better equipped to help you with this.
> >
> > -Original Message-
> > From: John Omernik [mailto:j...@omernik.com]
> > Sent: Thursday, February 08, 2018 1:38 PM
> > To: user <user@drill.apache.org>
> > Subject: MapR Drill 1.12 Mismatch between Native and Library Versions
> >
> > I am running MapR's 1.12 drill on a node that only has posix client
> > installed (and thus has a MapR client from that).
> >
> > I've recently had to work with MapR Support to get a fix to posix, and
> > that fixed one issue, but now when I try to start a drill bit, I get this
> > error.
> > The fix was a small patch that only updated the posix library... I guess
> I
> > am confused why I am seeing one "non-patched" version (in the java
> library)
> > and one "patched" version in the native library and why I can't start
> > drill. I was debating where to post this, MapR community or here, but
> it's
> > only Drill I having an issue... any thoughts?
> >
> > John
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > 2018-02-08 15:20:54,3305 ERROR JniCommon
> > fs/client/fileclient/cc/jni_MapRClient.cc:687 Thread: 71 Mismatch found
> > for java and native libraries java build version 6.0.0.20171109191718.GA
> ,
> > native build version 6.0.0.20171229015939.GA java patch vserion $Id:
> > mapr-version: 6.0.0.20171109191718.GA e892229b271c98c75ccb, native patch
> > version $Id: mapr-version:
> > 6.0.0.20171229015939.GA bd8dae73f45572194c89
> > 2018-02-08 15:20:54,3305 ERROR JniCommon
> > fs/client/fileclient/cc/jni_MapRClient.cc:704 Thread: 71 Client
> > initialization failed.
> > Exception in thread "main"
> > org.apache.drill.exec.exception.DrillbitStartupException: Failure while
> > initializing values in Drillbit.
> >
>


MapR Drill 1.12 Mismatch between Native and Library Versions

2018-02-08 Thread John Omernik
I am running MapR's 1.12 drill on a node that only has posix client
installed (and thus has a MapR client from that).

I've recently had to work with MapR Support to get a fix to posix, and that
fixed one issue, but now when I try to start a drillbit, I get this error.
The fix was a small patch that only updated the posix library... I guess I
am confused why I am seeing one "non-patched" version (in the Java library)
and one "patched" version in the native library, and why I can't start
Drill. I was debating where to post this, the MapR community or here, but it's
only Drill I'm having an issue with... any thoughts?

John












2018-02-08 15:20:54,3305 ERROR JniCommon
fs/client/fileclient/cc/jni_MapRClient.cc:687 Thread: 71 Mismatch
found for java and native libraries java build version
6.0.0.20171109191718.GA, native build version 6.0.0.20171229015939.GA
java patch vserion $Id: mapr-version: 6.0.0.20171109191718.GA
e892229b271c98c75ccb, native patch version $Id: mapr-version:
6.0.0.20171229015939.GA bd8dae73f45572194c89
2018-02-08 15:20:54,3305 ERROR JniCommon
fs/client/fileclient/cc/jni_MapRClient.cc:704 Thread: 71 Client
initialization failed.
Exception in thread "main"
org.apache.drill.exec.exception.DrillbitStartupException: Failure
while initializing values in Drillbit.


Using a plugin on files with wrong extensions

2018-01-04 Thread John Omernik
Hello all -

I was looking at using the PCAP plugin here, and I set up the type for pcap
in the storage plugin. I went to query the data, and realized that my TB or
so of packet captures from my malware lab were all recorded with
Daemonlogger, and the extension on each file is the 10-digit Unix timestamp
instead of .pcap.

So, aside from renaming all my files. Is there a good way to tell Drill
that a certain directory, file, or workspace should use X plugin outside of
the file extension?

Thanks!

John


Re: Proposed Slack Channel for Drill Users & Devs

2018-01-04 Thread John Omernik
Yes Please!

On Thu, Jan 4, 2018 at 11:36 AM, Robert Wu  wrote:

> Hi,
>
> I think someone created one a while back (under "drillers.slack.com").
>
> Best regards,
>
> Rob
>
> -Original Message-
> From: Charles Givre [mailto:cgi...@gmail.com]
> Sent: Thursday, January 04, 2018 8:10 AM
> To: d...@drill.apache.org; user 
> Subject: Proposed Slack Channel for Drill Users & Devs
>
> Hello everyone,
> I’m curious, if I were to start a Slack channel for Drill users and Devs,
> would there be interest?
> — C
>


Timeframe on Apache Drill 1.12 in MapR Package?

2018-01-03 Thread John Omernik
Hey all, just wondering if MapR has a timeframe on getting 1.12 into the
MEP?

Thanks!

John


Re: dotnet integration

2017-10-11 Thread John Omernik
Drill is a great tool, I am glad you are looking into it!

I would say it highly depends on what you will be returning in individual
queries.   If your result sets are going to be huge (mine rarely are), then
using the MapR ODBC driver will likely be your best bet, and posting on the MapR
Community Forums may help you resolve any issues with that. I've recently
been using the Rest API with great success. If your result sets are small
you may want to look at that.

John

On Tue, Oct 10, 2017 at 9:20 PM, Duraisamy, Ram 
wrote:

>
> We are thinking to build a asset systems with help of Drill using file
> system approach for our company huge products. Aseets along with meta data
> as json based.
> Basically we are Microsoft shop company. Wants to integrate drill with
> dotnet app and build restful service to feed asset stream to brand sites.
> Is there any drivers available to connect to. We tried with MapR Odbc
> driver but we are unable to connect successfully with dotnet code. But able
> to configure through windows 7 machine.
>
> It would be really helpful if someone provide us code snippet for dotnet
> integration.
>
> Regards
> Ram
> Stanley Black & Decker, Inc.
> www.stanleyblackanddecker.com
> 701 East Joppa Rd., Towson, MD 21286
> Business :   410-716-3167 Ext 23167
> Mobile:  410-790-0677
>
>


Re: Error Messages that are difficult to parse.

2017-09-26 Thread John Omernik
Hmm, I updated my data set and cleared the offending data by accident. Now
it's not reproducing the issue. I will keep an eye out for it...

On Mon, Sep 25, 2017 at 1:54 PM, Kunal Khatua <kkha...@mapr.com> wrote:

> Not sure what is going on, but my hunch is that the outermost wrapping SQL
> is probably using the final projections to eliminate some of the columns
> early on, which "helps" avoid the NumberFormat exception.
>
> Perhaps adding back the other columns, one by one, should narrow down the
> source of the NumberFormatException.
>
> I think this is a simple but important bug, since it seems we don't know
> which column is triggering the Exception. In case there isn't an existing
> bug, could you file one for this?
>
>
>
> -Original Message-
> From: John Omernik [mailto:j...@omernik.com]
> Sent: Monday, September 25, 2017 11:44 AM
> To: user <user@drill.apache.org>
> Subject: Re: Error Messages that are difficult to parse.
>
> This is so odd...
>
>
>
> select a.sa, a.sp, a.da, a.dp, subject_name from ( select b.sa, b.sp,
> b.da, b.dp, CASE WHEN b.tls.SNI[0] is null then '' else b.tls.SNI[0]  END
> as subject_name from `ciscojoy/2017-09-22/*.json` b  ) a where subject_name
> <> ''
> limit 10
>
> This query works
>
> select subject_name from (
> select a.sa, a.sp, a.da, a.dp, subject_name from ( select b.sa, b.sp,
> b.da, b.dp, CASE WHEN b.tls.SNI[0] is null then '' else b.tls.SNI[0]  END
> as subject_name from `ciscojoy/2017-09-22/*.json` b  ) a where subject_name
> <> ''
> ) x
> limit 10
> Fails with the
>
>
> Error Returned - Code: 500
> Error Text:
> SYSTEM ERROR: NumberFormatException:
>
>
>
> On Mon, Sep 25, 2017 at 1:40 PM, John Omernik <j...@omernik.com> wrote:
>
> > So I think I addressed the first one with
> >
> > select CASE when tbl.field.subfield[0] is null then '' else
> > tbl.field.subfield[0] end as myfield from table as tbl
> >
> > That seems to work on small queries
> >
> > but when I try to group by myfield I get
> >
> > Error Returned - Code: 500
> > Error Text:
> > SYSTEM ERROR: NumberFormatException:
> >
> >
> >
> > Which I am confused by because '' or the field isn't a number.. so not
> sure how to troubleshoot this one either..
> >
> >
> > John
> >
> >
> > On Mon, Sep 25, 2017 at 1:14 PM, John Omernik <j...@omernik.com> wrote:
> >
> >>
> >> So as a user, I got this
> >>
> >>
> >> Error Returned - Code: 500
> >> Error Text:
> >> SYSTEM ERROR: IllegalArgumentException: You tried to read a
> [RepeatedInt] type when you are using a field reader of type
> [NullableIntReaderImpl].
> >>
> >>
> >> It's a JSON dataset, the record exists in some row, and not in others,
> but I have no idea how to interpret this or how to handle it.  I feel like
> we should try to curate error messages like this to help improve them for
> the end user.
> >>
> >>
> >> John
> >>
> >>
> >
>


Re: Error Messages that are difficult to parse.

2017-09-25 Thread John Omernik
This is so odd...



select a.sa, a.sp, a.da, a.dp, subject_name from (
select b.sa, b.sp, b.da, b.dp, CASE WHEN b.tls.SNI[0] is null then '' else
b.tls.SNI[0]  END as subject_name from `ciscojoy/2017-09-22/*.json` b  ) a
where subject_name <> ''
limit 10

This query works

select subject_name from (
select a.sa, a.sp, a.da, a.dp, subject_name from (
select b.sa, b.sp, b.da, b.dp, CASE WHEN b.tls.SNI[0] is null then '' else
b.tls.SNI[0]  END as subject_name from `ciscojoy/2017-09-22/*.json` b  ) a
where subject_name <> ''
) x
limit 10
Fails with the following:


Error Returned - Code: 500
Error Text:
SYSTEM ERROR: NumberFormatException:



On Mon, Sep 25, 2017 at 1:40 PM, John Omernik <j...@omernik.com> wrote:

> So I think I addressed the first one with
>
> select CASE when tbl.field.subfield[0] is null then '' else
> tbl.field.subfield[0] end as myfield from table as tbl
>
> That seems to work on small queries
>
> but when I try to group by myfield I get
>
> Error Returned - Code: 500
> Error Text:
> SYSTEM ERROR: NumberFormatException:
>
>
>
> Which I am confused by because '' or the field isn't a number.. so not sure 
> how to troubleshoot this one either..
>
>
> John
>
>
> On Mon, Sep 25, 2017 at 1:14 PM, John Omernik <j...@omernik.com> wrote:
>
>>
>> So as a user, I got this
>>
>>
>> Error Returned - Code: 500
>> Error Text:
>> SYSTEM ERROR: IllegalArgumentException: You tried to read a [RepeatedInt] 
>> type when you are using a field reader of type [NullableIntReaderImpl].
>>
>>
>> It's a JSON dataset, the record exists in some row, and not in others, but I 
>> have no idea how to interpret this or how to handle it.  I feel like we 
>> should try to curate error messages like this to help improve them for the 
>> end user.
>>
>>
>> John
>>
>>
>


Re: Error Messages that are difficult to parse.

2017-09-25 Thread John Omernik
So I think I addressed the first one with

select CASE when tbl.field.subfield[0] is null then '' else
tbl.field.subfield[0] end as myfield from table as tbl

That seems to work on small queries

but when I try to group by myfield I get

Error Returned - Code: 500
Error Text:
SYSTEM ERROR: NumberFormatException:



Which I am confused by, because neither '' nor the field is a number... so
I'm not sure how to troubleshoot this one either.
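
One thing I may try next, purely as a sketch to rule out an implicit numeric
cast (same placeholder table/field names as above), is forcing the field to
VARCHAR before the group by:

select cast(t.myfield as varchar(255)) as myfield_str,
       count(*) as cnt
from (
  select case when tbl.field.subfield[0] is null then ''
              else tbl.field.subfield[0] end as myfield
  from `mytable` tbl
) t
group by cast(t.myfield as varchar(255));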


John


On Mon, Sep 25, 2017 at 1:14 PM, John Omernik <j...@omernik.com> wrote:

>
> So as a user, I got this
>
>
> Error Returned - Code: 500
> Error Text:
> SYSTEM ERROR: IllegalArgumentException: You tried to read a [RepeatedInt] 
> type when you are using a field reader of type [NullableIntReaderImpl].
>
>
> It's a JSON dataset, the record exists in some row, and not in others, but I 
> have no idea how to interpret this or how to handle it.  I feel like we 
> should try to curate error messages like this to help improve them for the 
> end user.
>
>
> John
>
>


Error Messages that are difficult to parse.

2017-09-25 Thread John Omernik
So as a user, I got this


Error Returned - Code: 500
Error Text:
SYSTEM ERROR: IllegalArgumentException: You tried to read a
[RepeatedInt] type when you are using a field reader of type
[NullableIntReaderImpl].


It's a JSON dataset; the field exists in some rows and not in others,
but I have no idea how to interpret this or how to handle it.  I feel
like we should try to curate error messages like this to help improve
them for the end user.


John


Re: Workaround for drill queries during node failure

2017-09-13 Thread John Omernik
As long as the nodes are "up" during the planning phase, they will be
included in the planning. If they go "down" after planning (i.e., during
execution) and fragments are requested, they will not report, and the query
will fail.

So if you start off with 5 nodes, but node 4 is down for patches, and then
you submit a query, the remaining 4 nodes will execute the query with no
issues.

If you start out with 5 nodes, all up and running, you submit a query, and
during execution, you take down node 4, that query will fail.

There is work being done to help with "taking down" nodes sanely to help
with patching.

Resending work expected from a certain node to another node is "hard" if I
understand past email chains correctly. You can't just retry fragments on
other nodes once Node 4 goes down. It will take a Drill expert to explain
that better than I can though.

John


On Mon, Sep 11, 2017 at 12:30 PM, Padma Penumarthy 
wrote:

> Did you mean to say “we could not execute any queries” ?
>
> Need more details about configuration you have.
> When you say data is available on other nodes, is it because you
> have replication configured (assuming it is DFS) ?
>
> What exactly are you trying and what error you see when you try to
> execute the query ?
>
> Thanks,
> Padma
>
>
> On Sep 11, 2017, at 9:40 AM, Kshitija Shinde > wrote:
>
> Hi,
>
> We have installed drill in distributed mode. While testing drillbit we have
> observed that if one of node is done then we could execute any queries
> against the drill even if data is available on other nodes.
>
>
>
> Is there any workaround for this?
>
>
>
> Thanks,
>
> Kshitija
>
>


Re: Does Drill Use Apache Struts

2017-09-08 Thread John Omernik
Also, thank you for the pointer to the pom.xml

On Fri, Sep 8, 2017 at 9:41 AM, John Omernik <j...@omernik.com> wrote:

> So, I thought I was clear that it was unverified, but I also I am in cyber
> security research, and this is what is being discussed in closed circles. I
> agree, it may not be just struts, it's not spreading rumors to say, this
> struts vulnerability is serious, and it's something that should be
> considered in a massive breech like this. Also, as with most security
> incidents, it is likely only a part of the story. It could be SQLi and it
> could be Struts and it could be both or neither. To imply it was unrelated
> SQLi is just as presumptuous as saying it was struts. Some folks are
> talking about attackers using Struts to get to a zone where SQLi was
> possible.  I will be clear(er): I have not verified that Equifax is wholly
> struts, or even related to Struts, but my fear right now is focused on open
> source projects that may use Struts and I think this is legitimate. Putting
> it into context, I want to learn more how to ensure vulnerabilities in one
> project/library are handled from a cascading point of view.
>
> John
>
> On Fri, Sep 8, 2017 at 9:15 AM, Bob Rudis <b...@rud.is> wrote:
>
>> Equifax was likely unrelated SQL injection. Don't spread rumors.
>>
>> Struts had yet-another-remote exploit (three of 'em, actually).
>>
>> I do this for a living (cybersecurity research).
>>
>> Drill is not impacted which can be verified by looking at dependencies
>> in https://github.com/apache/drill/blob/master/pom.xml
>>
>> On Fri, Sep 8, 2017 at 10:12 AM, John Omernik <j...@omernik.com> wrote:
>> > Rumors are pointing to it being related to the Equifax breech (no
>> > confirmation from me on that, just seeing it referenced as a
>> possibility)
>> >
>> > http://thehackernews.com/2017/09/apache-struts-vulnerability.html
>> >
>> >
>> >
>> >
>> > On Fri, Sep 8, 2017 at 9:07 AM, Ted Dunning <ted.dunn...@gmail.com>
>> wrote:
>> >
>> >> Almost certainly not.
>> >>
>> >> What issues are you referring to? I don't follow struts.
>> >>
>> >>
>> >> On Sep 8, 2017 16:00, "John Omernik" <j...@omernik.com> wrote:
>> >>
>> >> Hey all, given the recent issues related to Struts, can we confirm that
>> >> Drill doesn't use this Apache component for anything? I am not good
>> enough
>> >> at code reviews to see what may be used.
>> >>
>> >> John
>> >>
>>
>
>


Re: Does Drill Use Apache Struts

2017-09-08 Thread John Omernik
So, I thought I was clear that it was unverified, but I am also in cyber
security research, and this is what is being discussed in closed circles. I
agree it may not be just Struts, but it's not spreading rumors to say this
Struts vulnerability is serious and that it's something that should be
considered in a massive breach like this. Also, as with most security
incidents, it is likely only a part of the story. It could be SQLi, it
could be Struts, and it could be both or neither. To imply it was unrelated
SQLi is just as presumptuous as saying it was Struts. Some folks are
talking about attackers using Struts to get to a zone where SQLi was
possible.  I will be clear(er): I have not verified that Equifax was wholly
Struts, or even related to Struts, but my fear right now is focused on open
source projects that may use Struts, and I think this is legitimate. Putting
it into context, I want to learn more about how to ensure vulnerabilities in
one project/library are handled from a cascading point of view.

John

On Fri, Sep 8, 2017 at 9:15 AM, Bob Rudis <b...@rud.is> wrote:

> Equifax was likely unrelated SQL injection. Don't spread rumors.
>
> Struts had yet-another-remote exploit (three of 'em, actually).
>
> I do this for a living (cybersecurity research).
>
> Drill is not impacted which can be verified by looking at dependencies
> in https://github.com/apache/drill/blob/master/pom.xml
>
> On Fri, Sep 8, 2017 at 10:12 AM, John Omernik <j...@omernik.com> wrote:
> > Rumors are pointing to it being related to the Equifax breech (no
> > confirmation from me on that, just seeing it referenced as a possibility)
> >
> > http://thehackernews.com/2017/09/apache-struts-vulnerability.html
> >
> >
> >
> >
> > On Fri, Sep 8, 2017 at 9:07 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
> >
> >> Almost certainly not.
> >>
> >> What issues are you referring to? I don't follow struts.
> >>
> >>
> >> On Sep 8, 2017 16:00, "John Omernik" <j...@omernik.com> wrote:
> >>
> >> Hey all, given the recent issues related to Struts, can we confirm that
> >> Drill doesn't use this Apache component for anything? I am not good
> enough
> >> at code reviews to see what may be used.
> >>
> >> John
> >>
>


Re: Does Drill Use Apache Struts

2017-09-08 Thread John Omernik
Rumors are pointing to it being related to the Equifax breach (no
confirmation from me on that, just seeing it referenced as a possibility).

http://thehackernews.com/2017/09/apache-struts-vulnerability.html




On Fri, Sep 8, 2017 at 9:07 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Almost certainly not.
>
> What issues are you referring to? I don't follow struts.
>
>
> On Sep 8, 2017 16:00, "John Omernik" <j...@omernik.com> wrote:
>
> Hey all, given the recent issues related to Struts, can we confirm that
> Drill doesn't use this Apache component for anything? I am not good enough
> at code reviews to see what may be used.
>
> John
>


Does Drill Use Apache Struts

2017-09-08 Thread John Omernik
Hey all, given the recent issues related to Struts, can we confirm that
Drill doesn't use this Apache component for anything? I am not good enough
at code reviews to see what may be used.

John


Re: Performance on Unions - Some ideas please!

2017-08-23 Thread John Omernik
So, some disadvantages I've already thought of (mine are more logistical; I
am still very curious about performance, etc.):

1. I will have to update my view every night. I don't believe Drill
has a way to address the folder date at run time.  That said, I could also
easily, instead of the current day, use "current" as my folder name.  I
am reluctant though, because dir0 would not be correct in the query... or
would it be? I could use tsday (which is also the date) as dir0 for the
Avro data, and dir0 would just be dir0 for the Parquet... that would ensure I
would not have to update the view... however, it would be harder to delineate
when one day started and another ended in the Avro data... (see the sketch
after point 2 below)

2. I would have to orchestrate both the CTAS and the view update outside of
Drill; not a huge pain, but I like self-contained setups :)
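
For point 1, a sketch of what I mean by the "current" folder idea (col1 and
col2 are stand-ins, since I'm not spelling out the whole Bro conn schema
here):

create or replace view view_broconn as
select a.tsday as dir0, a.col1, a.col2
from `broconnavro/current` a
union all
select b.dir0, b.col1, b.col2
from `broconnparq` b;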

John


On Wed, Aug 23, 2017 at 7:48 AM, John Omernik <j...@omernik.com> wrote:

> I have a streaming process that writing to an avro table, with Schema etc.
> It's coming from BroIDS Connection logs, so my table name is like this:
>
>
> broconnavro/-MM-DD
>
>
> Basically I take any data that has come in on that day and put it into a
> dated folder in Avro format.
>
> 1. Avro support is hard, and select * from the table is weird.
> 2. I'd like to get my data into Parquet long term.
>
>
> So I came up with an idea. I have a table called broconnparq.  That has
> the same format. (broconnparq/-MM-DD).
>
> My idea this, I have a view that is essentially
> CREATE OR REPLACE VIEW view_broconn
> select * from (
>   select a.*, 'curdate' as dir0 from broconavro/curdate a
>   UNION ALL
>   select * from broconparq b
> ) c
>
> Then every evening, once my days roll over I do a CTAS from the current
> day Avro to the Parquet...
>
> So essentially I'd be Querying PArquet table + today's Avro
>
> My main question is this... what am I losing here (optimizations etc) Is
> this going to kill me on performance at scale? I.e. if I had 4 TB a day of
> data, will I regret this?
>
>
> For some reference, I am looking the lagging Avro support and lagging
> INSERT support (https://issues.apache.org/jira/browse/DRILL-3534) as
> reasons for this work around...
>
> I'd be open to any ideas here!
>
> John
>
>


Performance on Unions - Some ideas please!

2017-08-23 Thread John Omernik
I have a streaming process that writes to an Avro table, with a schema, etc.
It's coming from Bro IDS connection logs, so my table name is like this:


broconnavro/YYYY-MM-DD


Basically I take any data that has come in on that day and put it into a
dated folder in Avro format.

1. Avro support is hard, and select * from the table is weird.
2. I'd like to get my data into Parquet long term.


So I came up with an idea. I have a table called broconnparq.  That has the
same format (broconnparq/YYYY-MM-DD).

My idea is this: I have a view that is essentially

CREATE OR REPLACE VIEW view_broconn AS
select * from (
  select a.*, 'curdate' as dir0 from `broconnavro/curdate` a
  UNION ALL
  select * from `broconnparq` b
) c

Then every evening, once my day rolls over, I do a CTAS from the current
day's Avro to Parquet...

So essentially I'd be querying the Parquet table + today's Avro.
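
For concreteness, the nightly CTAS step would be something like this sketch
(the date path is a placeholder; one day's folder at a time):

-- convert one day of Avro to Parquet once the day has rolled over
alter session set `store.format` = 'parquet';
create table `broconnparq/2017-08-22` as
select * from `broconnavro/2017-08-22`;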

My main question is this... what am I losing here (optimizations, etc.)? Is
this going to kill me on performance at scale? I.e., if I had 4 TB a day of
data, will I regret this?


For some reference, I am looking at the lagging Avro support and lagging
INSERT support (https://issues.apache.org/jira/browse/DRILL-3534) as
reasons for this workaround...

I'd be open to any ideas here!

John


Re: Avro - Let's talk Avro again

2017-08-22 Thread John Omernik
This is all great news. Seeing movement, and seeing it articulated at a
higher level what has led to the discrepancies in Avro, is also very
helpful (thanks Paul).

On Sat, Aug 19, 2017 at 6:29 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Thank you Saurabh,
>
> I was not really expecting a constructive reply to my previous email, was
> appreciated.
>
> I guess some old frustration got the better of me.
>
> All the best,
>  -Stefan
>
> On Fri, Aug 18, 2017 at 10:55 PM, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com> wrote:
>
> > Thank you for this candid feedback, Stefan. The fact that you even
> decided
> > to write an email offering this feedback despite moving away from Drill
> > just suggests to me that you are still a supporter. We need all the help
> > that we can get from every member in this community to make Drill provide
> > value to all users that include you.
> >
> > I am new to the community but I have looked at your emails where your
> past
> > attempts to doing this have not taken you anywhere. We have to change
> that.
> >
> > We cannot undo the past as far as addressing your needs are concerned
> but I
> > want to assure you that we are bringing reform to the community in
> general.
> > The stakeholders who are impacted by Drill have increased beyond the
> small
> > group that existed a couple of years ago. So be rest assured that you
> have
> > a voice here.
> >
> > I think the biggest challenge we have in the community is that there are
> > users who could get a lot of value if some work was done to support
> > integrations. I know for sure that there are many developers who would
> love
> > to participate in this community and do the work for a modest fee. It
> helps
> > them get interested in the project, helps them provide support beyond
> just
> > the open source aspect and also helps users such as you to get the value
> > that you need where you need it.
> >
> > Please let me know if you would be willing to pursue that route.
> >
> > On the Avro front, I do hear a lot of users asking for it but I hear a
> lot
> > more requests on Parquet. Plus, there are core issues in Drill that needs
> > to be addressed first. The community is definitely trying to prioritize
> > given what we have. But we do not have to feel constrained. We can get
> more
> > developers to participate in this and help out. And I am very positive
> > about that approach-I know that I helped a user here to get help on using
> > Apache Drill inside a commercial setting where there asks were very
> > specific.
> >
> > Those are my thoughts but please do not give up on us. Your critical
> > feedback may not sound nice to the ears but is exactly the kind of
> feedback
> > that will make this project truly successful.
> >
> > Best,
> > Saurabh
> >
> >
> >
> > I
> >
> > On Fri, Aug 18, 2017 at 1:42 PM, Stefán Baxter <
> ste...@activitystream.com>
> > wrote:
> >
> > > Hi John,
> > >
> > > Love Drill but we no longer use it in production as our main query
> tool.
> > >
> > > I do have a fairly long list of pet peeves but I also have a long list
> of
> > > features that I love and would not want to be without.
> > >
> > > In my opinion it's time for Drill to decide where its commitment lies
> > > regarding evolving schema and ETL elimination and if it wants to be
> > > something more than a cogil in a Hadoop distribution wheel or an effort
> > > some see as a way to their startup stardom.
> > >
> > > There is no denying the great effect it has had and its usefulness
> (Arrow
> > > also making waves now). I am, as I have been, just frustrated by
> > > shortcomings I feel are not addressed because they are addressed else
> > where
> > > (where the true loyalties lie)
> > >
> > > I can name a few (I have not upgraded to 1.11):
> > >- Empty values still default to double for partial/segment lists
> which
> > > triggers all sorts of problems  (no attempt is made to convert values
> to
> > > lowest common denominator (string))
> > >- Two NullableX values both containing nothing (Null) still produce
> > > schema change errors instead of waiting for a type to become apparent
> > >- Syntax error reporting is terrible
> > >- Schema change reporting is almost absent
> > >- Avro schema is fixed/strict even though text formats support
> > > evol

Re: Merge and save parquet files in Drill

2017-08-17 Thread John Omernik
Also, what is the cardinality of the partition field? If you have lots of
partitions, you will have lots of files...
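
For what it's worth, a rough sketch of a CTAS along the lines Andries
suggests below, re-partitioning and sorting in one pass (the workspace,
table, path and column names are made up, not from your schema):

use dfs.tmp;

-- rewrite the data as a new table, partitioned and sorted on the predicate columns
create table merged_table
partition by (col_a)
as
select col_a, col_b, col_c
from dfs.`/path/to/original_table`
order by col_a, col_b;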


On Thu, Aug 17, 2017 at 9:55 AM, Andries Engelbrecht 
wrote:

> Do you partition the table?
> You may want to sort (order by) on the columns you partition, or just
> order by in any case on the column(s) you are most likely going to use for
> predicates. It increases the CTAS time, but normally will improve the query
> performance quite a bit.
>
> Yes a large number of files does affect the query performance, using
> metadata caching helps improve the query planning time a lot.
>
> --Andries
>
>
> On 8/16/17, 11:12 PM, "Divya Gehlot"  wrote:
>
> Hi,
> I have CTAS with partition on 4 columns and when I save it it creates
> lots
> of small files ~ 102290 where size of each file is in KBs .
>
> My queries are :
> 1.Does the lots of small files reduce the performance while reading the
> data in Drill ?
> 2.If yes ,How can I merge the small parquet files ?
>
>
>
> Thanks,
> Divya
>
>
>


Re: Avro - Let's talk Avro again

2017-08-17 Thread John Omernik
I was guessing you would chime in with a response ;)

Are you still using Drill w/ Avro? How have things been lately?

On Thu, Aug 17, 2017 at 8:00 AM, Stefán Baxter <ste...@activitystream.com>
wrote:

> woha!!!
>
>
> (sorry, I just had to)
>
>
> Best of luck with that!
>
> Regards,
>  -Stefán
>
> On Thu, Aug 17, 2017 at 12:37 PM, John Omernik <j...@omernik.com> wrote:
>
> > I know Avro is the unwanted child of the Drill world. (I know others have
> > tried to mature the Avro support and that has been something that still
> is
> > in a "experiemental" state.
> >
> > That said, isn't it time for us to clean it up?
> >
> > I am sure I there are some open JIRAs out there, (last Doc update on the
> > Avro Page, Nov 21, 2016) points to this
> > https://issues.apache.org/jira/browse/DRILL/component/
> > 12328941/?selectedTab=com.atlassian.jira.jira-projects-
> > plugin:component-summary-panel
> >
> > And I just ran into a issue... I am going to run it by here to see if
> it's
> > JIRA worthy or known:
> >
> > I have two directories, one json (brodns) and one avro (brodnsavro)
> >
> > The both have subdirectories that are -MM-DD dates.
> >
> > Where I run
> >
> > select dir0, count(*) from `brodns` group by dir0  - This works great!
> >
> > when I run
> >
> > select dir0, count(*) from `brodnsavro` group by dir0 - I get:
> >
> > VALIDATION ERROR: From line 1, column 58 to line 1, column 61: Column
> > 'dir0' not found in any table
> >
> >
> > If I run
> >
> >
> > select count(*) from `brodnsavro/2017-08-17` this works
> >
> > if I run
> >
> >
> > select count(*) from `brodnsavro` this also works
> >
> >
> > But dir0 doesn't appear to be applied to Avro.
> >
> >
> >
> > I really feel this should be consistent (in addition to fixing the
> > other issues in Avro) and lets make Avro o a
> >
> > first class citizen of the Drill world.
> >
> >
> > (If folks are interested, I'd be happy to discuss my use case, it
> involves
> >
> > applying a schema to json records on kafka/maprstreams in streamsets, and
> > then
> >
> > outputting to avro files... from there I hope to convert to parquet, but
> >
> > don't want to use mapreduce, hence drill!
> >
> > )
> >
>


Avro - Let's talk Avro again

2017-08-17 Thread John Omernik
I know Avro is the unwanted child of the Drill world. (I know others have
tried to mature the Avro support, and it is something that is still in an
"experimental" state.)

That said, isn't it time for us to clean it up?

I am sure there are some open JIRAs out there; the last doc update on the
Avro page (Nov 21, 2016) points to this:
https://issues.apache.org/jira/browse/DRILL/component/12328941/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

And I just ran into an issue... I am going to run it by folks here to see if
it's JIRA-worthy or known:

I have two directories, one JSON (brodns) and one Avro (brodnsavro).

They both have subdirectories that are YYYY-MM-DD dates.

When I run

select dir0, count(*) from `brodns` group by dir0  - This works great!

when I run

select dir0, count(*) from `brodnsavro` group by dir0 - I get:

VALIDATION ERROR: From line 1, column 58 to line 1, column 61: Column
'dir0' not found in any table


If I run


select count(*) from `brodnsavro/2017-08-17` this works

if I run


select count(*) from `brodnsavro` this also works


But dir0 doesn't appear to be applied to Avro.



I really feel this should be consistent (in addition to fixing the
other issues in Avro), and let's make Avro a
first-class citizen of the Drill world.

(If folks are interested, I'd be happy to discuss my use case. It involves
applying a schema to JSON records on Kafka/MapR Streams in StreamSets, and then
outputting to Avro files... from there I hope to convert to Parquet, but
don't want to use MapReduce, hence Drill!)


Re: Querying Data with period in name

2017-08-11 Thread John Omernik
So, I added a comment on that JIRA... I have a parquet file, I need to
explore it... it comes from another user, I want to be able to explore with
select * but I don't know the field names... how can I do this? I know it
seems like we should always know the column names, but I don't :)


On Fri, Aug 11, 2017 at 1:33 PM, Vova Vysotskyi <vvo...@gmail.com> wrote:

> Hi John,
>
> Fix for the DRILL-4264
> <https://issues.apache.org/jira/browse/DRILL-4264> should
> solve this issue. This error appears when you try to do select *. But
> while DRILL-4264 <https://issues.apache.org/jira/browse/DRILL-4264> is not
> fixed, you can try to do select `id.orig_h`. It should not throw the
> error.
>
> Kind regards,
> Volodymyr Vysotskyi
>
> 2017-08-11 21:07 GMT+03:00 John Omernik <j...@omernik.com>:
>
> > Hey all,
> >
> > I am querying some json and parquet data that has dots in the name. Not
> all
> > the data I may be querying will come from Drill, thus dot is a valid
> > character... when I go to initially explore my data, Drill throws the
> error
> > below when I run a select * query.
> >
> > I understand the error, and I can create a view, selecting each column
> out
> > and renaming it for easier select *  in the future. However, as a user,
> if
> > I get a new data set, this could (unless I am informed of another way
> here)
> > force me to leave drill to explore my data.
> >
> > I get how using periods as field qualifiers causes issues... but if we
> had
> > had a way to read a file to get the schema, to either produce the all the
> > fields in a select query for easy view creation or a way to query with
> > periods in the name that would be awesome! It would keep users IN drill
> > instead of going elsewhere to explore their data.
> >
> > I am open to ideas!
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > Error Returned - Code: 500
> > Error Text:
> > SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference
> > "id.orig_h"; a field reference identifier must not have the form of a
> > qualified name (i.e., with ".").
> >
> > Fragment 0:0
> >
> > [Error Id: 88acd3d8-4e44-49f6-b587-24bf26f89a3b on
> > zeta4.brewingintel.com:20005]
> >
>


Querying Data with period in name

2017-08-11 Thread John Omernik
Hey all,

I am querying some JSON and Parquet data that has dots in the field names.
Not all the data I may be querying will come from Drill, thus a dot is a valid
character... when I go to initially explore my data, Drill throws the error
below when I run a select * query.

I understand the error, and I can create a view, selecting each column out
and renaming it for easier select *  in the future. However, as a user, if
I get a new data set, this could (unless I am informed of another way here)
force me to leave drill to explore my data.
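
(For reference, the view workaround I mean is roughly this sketch; the
column name comes from the error below, the path and view name are
placeholders, and whether the backtick-quoted dotted column parses is part
of the question:)

create or replace view dfs.tmp.v_conn as
select t.`id.orig_h` as id_orig_h
       -- ...one aliased column per dotted field...
from dfs.`/path/to/data` t;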

I get how using periods as field qualifiers causes issues... but if we had
a way to read a file to get the schema, to either produce all the
fields in a select query for easy view creation, or a way to query with
periods in the name, that would be awesome! It would keep users IN Drill
instead of going elsewhere to explore their data.

I am open to ideas!










Error Returned - Code: 500
Error Text:
SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference
"id.orig_h"; a field reference identifier must not have the form of a
qualified name (i.e., with ".").

Fragment 0:0

[Error Id: 88acd3d8-4e44-49f6-b587-24bf26f89a3b on zeta4.brewingintel.com:20005]


Elastic Search Plugins

2017-07-28 Thread John Omernik
Is there any work being done on an Elastic Search plugin? That would be a
huge benefit to the community!

I see there are some older JIRAs... anything else?


https://issues.apache.org/jira/browse/DRILL-3637

https://issues.apache.org/jira/browse/DRILL-3790


Re: How much more information can error messages provide?

2017-07-28 Thread John Omernik
That's good to know, Paul, on the UserException. Even then, there may be
exceptions sent to the user that may not prove as helpful, and I think we
should strive to improve those.  With the UserException class, would it be
reasonable to have a documentation link of some sort? Perhaps not a hard
URL (because keeping that in sync can be hard), but if a specific
UserException has been defined, doesn't it make sense to also have a page
added to the documentation site with some more verbose remedies?

Saurabh - I think the nice thing about using a master JIRA, with each
specific instance of a bad error message having its own specific issue, is
that we CAN fix them in a way where we prioritize the issues that are most
impactful, while leaving the master JIRA open as an aggregator. Say a
developer wants to work on bad error messages: they can use the master JIRA
to find the individual issues they want to work on. They don't close the
master JIRA, just the individual ones.  It's holistic AND incremental... I
guess? :)



On Thu, Jul 27, 2017 at 7:23 PM, Paul Rogers <prog...@mapr.com> wrote:

> Hi All,
>
> As it turns out, many places in the code already use a class called
> UserException. (Read that as “an exception to send to the user” rather than
> “an exception that the user caused.”) Each use of UserException includes a
> message and context. So, the good news is, many places in the code already
> provide a useful message.
>
> Unfortunately, many places in the code do not provide such messages. Still
> we can certainly track those that are missing and add them incrementally.
> That is a pretty low cost, low risk change if we do them gradually.
>
> The question then becomes, why isn’t this information passed along to the
> user as a nice, simple text message? That would be worth investigating.
>
> Other SQL engines use standard SqlCode and SqlWarning codes. A bit old
> school, but allows tools to understand the meaning of the error without
> parsing a vendor-specific message. We might want to add that.
>
> So, for priority, again from a code perspective:
>
> 1) Better display the messages we already have.
> 2) Write a stack trace to the log, but not to the user’s console (in
> Sqlline, etc.)
> 3) As part of fixes and code reviews, look for spots without such messages
> and add them.
>
> Are there better, simpler ways we could solve this?
>
> - Paul
>
>
> > On Jul 27, 2017, at 3:55 PM, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com> wrote:
> >
> > I completely agree with John. I think we need to make error/warning
> > messages more friendly moving forward with any new features that ship.
> >
> > Please share the JIRA that you create. But a holistic approach scares me.
> > How would we prioritize the ones that would impact most users? Any
> thoughts
> > on that.
> >
> > Saurabh
> >
> > On Thu, Jul 27, 2017 at 6:43 AM, John Omernik <j...@omernik.com> wrote:
> >
> >> Hey Dan, I created the master issue for tracking improvements to error
> >> messages here: https://issues.apache.org/jira/browse/DRILL-5692
> >>
> >> If you want to create your individual error and link to it in your
> message,
> >> we'll see how it works.  I like the idea of a holistic effort to
> improve
> >> error messages!
> >>
> >> John
> >>
> >>
> >>
> >> On Thu, Jul 27, 2017 at 8:25 AM, John Omernik <j...@omernik.com> wrote:
> >>
> >>> I want to bump this up. I've had a number of troubleshooting times
> where
> >>> getting more concise error messages would really help me deal with my
> >> data.
> >>> It looks like Dan found verbose mode, but sometimes verbose isn't what
> we
> >>> need, but concise. Hey Dan, maybe we could come up with a Jira that is
> >>> the "tracker" JIRA for error messages that are not helpful. Then each
> >> time
> >>> someone comes up with an error message that is frustrating or
> unhelpful,
> >> we
> >>> could create an individual JIRA for that message and link it to the
> >> master
> >>> JIRA. That would allow us to find, throughout the code any error
> messages
> >>> that are returning with limited information.
> >>>
> >>> My thought is this, in the Master JIRA (which I will be happy to
> create)
> >> I
> >>> will explain the purpose of the Master JIRA, and how best to describe
> >> error
> >>> messages that don't help the analyst (what information to include,
> >> showing
> >>> the error with both normal and verbose mode, if the problem 

Re: How much more information can error messages provide?

2017-07-27 Thread John Omernik
Hey Dan, I created the master issue for tracking improvements to error
messages here: https://issues.apache.org/jira/browse/DRILL-5692

If you want to create your individual error and link to it in your message,
we'll see how it works.  I like the idea of a holistic effort to improve
error messages!

John



On Thu, Jul 27, 2017 at 8:25 AM, John Omernik <j...@omernik.com> wrote:

> I want to bump this up. I've had a number of troubleshooting times where
> getting more concise error messages would really help me deal with my data.
> It looks like Dan found verbose mode, but sometimes verbose isn't what we
> need, but concise. Hey Dan, maybe we could come up with a Jira that is
> the "tracker" JIRA for error messages that are not helpful. Then each time
> someone comes up with an error message that is frustrating or unhelpful, we
> could create an individual JIRA for that message and link it to the master
> JIRA. That would allow us to find, throughout the code any error messages
> that are returning with limited information.
>
> My thought is this, in the Master JIRA (which I will be happy to create) I
> will explain the purpose of the Master JIRA, and how best to describe error
> messages that don't help the analyst (what information to include, showing
> the error with both normal and verbose mode, if the problem is resolved,
> what resolved the error etc).
>
> I think we should focus on errors, not issues. What I mean by that, is
> there is an error like "Null Pointer Exception" which is a horrible error
> in Java that is hard for a user to make heads or tails of; we shouldn't
> open a "bad error message" JIRA for each Null Pointer Exception issue that
> occurs. Instead, perhaps we could open one for the error "Null Pointer
> Exception" (regardless of what caused it) and provide details how we can
> better trap that and communicate it to the user... does that make sense?
>
> I will start working on a master JIRA for this.
>
>
>
> On Thu, Jul 27, 2017 at 8:19 AM, Dan Holmes <dhol...@revenueanalytics.com>
> wrote:
>
>> With verbose mode on the following is the error message.  I guess this
>> confirms it is a date conversion but there isn't any help as to which file
>> or which expression (in this statement there is only one TO_DATE function
>> so it is obvious).
>>
>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>> IllegalArgumentException: Invalid format: " "
>> Fragment 1:0
>> [Error Id: 22129e51-acbd-41f2-84f5-721541d9843b on ubuntu:31010]
>> (java.lang.IllegalArgumentException) Invalid format: " "
>>   org.joda.time.format.DateTimeFormatter.parseDateTime():899
>>   org.joda.time.DateMidnight.parse():152
>>   org.apache.drill.exec.test.generated.ProjectorGen367.doEval():76
>>   org.apache.drill.exec.test.generated.ProjectorGen367.projectRecords():67
>>   org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():199
>>   org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
>>   org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>   org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>   org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>   org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>   org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>>   org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>>   org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>   org.apache.drill.exec.record.AbstractRecordBatch.next():119
>>   org.apache.drill.exec.record.AbstractRecordBatch.next():109
>>   org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>>   org.apache.drill.exec.record.AbstractRecordBatch.next():162
>>   org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>>   org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>>   org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>>   org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>>   org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>>   java.security.AccessController.doPrivileged():-2
>>   javax.security.auth.Subject.doAs():422
>>   org.apache.hadoop.security.UserGroupInformation.doAs():1657
>>   org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>>   org.apache.drill.common.SelfCleaningRunnable.run():38
>>   java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>>   java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>>   java.lang.Thread.run():748
>>
>> Dan Holmes | Revenue Analytics, Inc.
>> Direct: 7

Re: How much more information can error messages provide?

2017-07-27 Thread John Omernik
I want to bump this up. I've had a number of troubleshooting times where
getting more concise error messages would really help me deal with my data.
It looks like Dan found verbose mode, but sometimes verbose isn't what we
need, but concise. Hey Dan, maybe we could come up with a Jira that is
the "tracker" JIRA for error messages that are not helpful. Then each time
someone comes up with an error message that is frustrating or unhelpful, we
could create an individual JIRA for that message and link it to the master
JIRA. That would allow us to find, throughout the code any error messages
that are returning with limited information.

My thought is this, in the Master JIRA (which I will be happy to create) I
will explain the purpose of the Master JIRA, and how best to describe error
messages that don't help the analyst (what information to include, showing
the error with both normal and verbose mode, if the problem is resolved,
what resolved the error etc).

I think we should focus on errors, not issues. What I mean by that, is
there is an error like "Null Pointer Exception" which is a horrible error
in Java that is hard for a user to make heads or tails of; we shouldn't
open a "bad error message" JIRA for each Null Pointer Exception issue that
occurs. Instead, perhaps we could open one for the error "Null Pointer
Exception" (regardless of what caused it) and provide details how we can
better trap that and communicate it to the user... does that make sense?

I will start working on a master JIRA for this.



On Thu, Jul 27, 2017 at 8:19 AM, Dan Holmes 
wrote:

> With verbose mode on the following is the error message.  I guess this
> confirms it is a date conversion but there isn't any help as to which file
> or which expression (in this statement there is only one TO_DATE function
> so it is obvious).
>
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: Invalid format: " "
> Fragment 1:0
> [Error Id: 22129e51-acbd-41f2-84f5-721541d9843b on ubuntu:31010]
> (java.lang.IllegalArgumentException) Invalid format: " "
>   org.joda.time.format.DateTimeFormatter.parseDateTime():899
>   org.joda.time.DateMidnight.parse():152
>   org.apache.drill.exec.test.generated.ProjectorGen367.doEval():76
>   org.apache.drill.exec.test.generated.ProjectorGen367.projectRecords():67
>   org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():199
>   org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
>   org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>   org.apache.drill.exec.record.AbstractRecordBatch.next():162
>   org.apache.drill.exec.record.AbstractRecordBatch.next():119
>   org.apache.drill.exec.record.AbstractRecordBatch.next():109
>   org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>   org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>   org.apache.drill.exec.record.AbstractRecordBatch.next():162
>   org.apache.drill.exec.record.AbstractRecordBatch.next():119
>   org.apache.drill.exec.record.AbstractRecordBatch.next():109
>   org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>   org.apache.drill.exec.record.AbstractRecordBatch.next():162
>   org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>   org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>   org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>   org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>   org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>   java.security.AccessController.doPrivileged():-2
>   javax.security.auth.Subject.doAs():422
>   org.apache.hadoop.security.UserGroupInformation.doAs():1657
>   org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>   org.apache.drill.common.SelfCleaningRunnable.run():38
>   java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>   java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>   java.lang.Thread.run():748
>
> Dan Holmes | Revenue Analytics, Inc.
> Direct: 770.859.1255
> www.revenueanalytics.com
>
> -Original Message-
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 9:11 AM
> To: user@drill.apache.org
> Subject: How much more information can error messages provide?
>
> I am querying a 6GB data set in a directory.  There are 34 files in the
> directory.  Here is the statement.
>
> CREATE TABLE dfs.root.`twm/sales_p` AS
> SELECT to_date(columns[0], 'MM/dd/yy') transaction_date , CAST(columns[1]
> as INT) store_key , CAST(columns[2] AS BIGINT) item_code , columns[3]
> position_name , CAST(CASE WHEN columns[4] = 'NULL' THEN 0 ELSE columns[4]
> END as INT) quantity_per_sku , CAST(columns[5] as INT) sales_quantity,
> CAST(columns[6] as float) sales_dollars, columns[7] channel FROM
> dfs.root.`sales`
>
> I get the following error.  I was able to determine that columns[2] needed
> to be a BIGINT based on an 
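
For anyone else hitting the same Invalid format: " " failure, a minimal sketch of a
pre-flight check (not the actual fix from this thread): it assumes a drillbit REST
endpoint at a hypothetical localhost:8047 with no authentication, and it assumes
blank strings in the date column are the culprit, which is what the error text suggests.

import requests

# Count the rows whose date column is empty or a single space before re-running
# the CTAS; dfs.root.`sales` is the source directory from the statement above.
DRILL_QUERY_URL = "http://localhost:8047/query.json"

probe_sql = """
SELECT count(*) AS bad_rows
FROM dfs.root.`sales`
WHERE columns[0] = '' OR columns[0] = ' '
"""

resp = requests.post(DRILL_QUERY_URL, json={"queryType": "SQL", "query": probe_sql})
print(resp.json().get("rows"))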

Re: Rest API - Need to Improve

2017-07-12 Thread John Omernik
My Java is sitting on my desk right now... on the plus side, when Java is
on my desk, my Python is incredible...

(Translation: Java scares me)

:)


On Tue, Jul 11, 2017 at 4:23 PM, Paul Rogers <prog...@mapr.com> wrote:

> Hi John,
>
> All good points!
>
> How comfortable are you with Java? The current REST API evolved
> organically based on the needs of the web UI. While we could wait for a
> REST re-design, we could also move forward just adding the messages you
> need — especially if you can contribute the code.
>
> The code is in org.apache.drill.exec.server.rest.LogInLogOutResources.
> This is a mercifully short file. The login page message itself is one line
> of actual code:
>
>   @POST
>   @Path("/login")
>   @Produces(MediaType.TEXT_HTML)
>   public Viewable getLoginPageAfterValidationError() {
> return ViewableWithPermissions.createLoginPage("Invalid
> username/password credentials.");
>   }
>
> Seems simple enough to change the above method to add the HTTP status to the
> generated web page; the browser won’t care.
>
> The web framework Drill uses is quite rich (mucked about with it a year
> ago, but have gotten rusty since.) There is an easy way to indicate the
> HTTP status; I just can’t remember what it is…
>
> Anyone else remember how to set the return status in a Jetty response?
>
> Thanks,
>
> - Paul
>
>
> > On Jul 7, 2017, at 5:48 AM, John Omernik <j...@omernik.com> wrote:
> >
> > Hello all, I recently setup some notebooks using the Rest API.
> >
> > I found that I was using Drill 1.8, and my code for determining
> > authentication in Python, while hacky, worked...  What I found is using
> > python requests, when I posted to j_security check, the requests object
> > almost always returned a HTTP 200.
> >
> > If it was a bad username/password, I parsed the page and looked for that
> > text.  In 1.8, if login was successful the string "Number of Drill Bits"
> > appeared in the response text. From the requests module perspective,
> > both used HTTP 200 as a status code.  However, on a successful login,
> > Drill actually sends an HTTP 303 that redirects to /, and apparently
> > requests follows that redirect like nothing happened, never telling me,
> > the programmer, about the 303.
> >
> > However, in Drill 1.10, the UI for the / page improved, and likely, the
> > string "Number of Drill Bits" was removed.  This made it so I now had to
> > reprogram my auth code to handle it better.
> >
> > So here I am: what IS the best way to determine programmatically, in a
> > way that will remain stable across releases, whether login was successful?
> > Obviously string parsing is prone to error and can change.  I tried
> > checking for the presence of a JSESSIONID, but that can show up either way.
> >
> > So what is the "right" way to indicate login is successful? Could we do
> > something better with response codes? I know having the Form seems to
> make
> > sense, but could we just use basic authentication and set the header with
> > the form? That way, we could issue single queries with basic auth and
> get a
> > Unauthorized if the authentication didn't work.  I don't know if I have
> an
> > answer, but I do know that working with the Rest API isn't all that
> > intuitive, it should be easier to tell if login was successful...  I am
> > interested in the thoughts of others here.
> >
> > John
> >
> >
> >
> >
> >
> >
> >
> > (Code)
> >    def authDrill(self):
> >        url = self.drill_base_url + "/j_security_check"
> >        login = {'j_username': self.drill_user, 'j_password': self.drill_pass}
> >
> >        verify = "/etc/ssl/certs/ca-certificates.crt"
> >
> >        if self.drill_pin_to_ip == True:
> >            verify = False
> >            requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
> >        else:
> >            verify = "/etc/ssl/certs/ca-certificates.crt"
> >
> >        r = self.session.post(url, data=login, headers=self.drill_headers,
> >                              verify=verify)
> >        if r.status_code == 200:
> >            if r.text.find("Invalid username/password credentials") >= 0:
> >                raise Exception("Invalid username/password credentials")
> >            elif r.text.find("Number of Drill Bits") >= 0:
> >                pass
> >            else:
> >                raise Exception("Unknown HTTP 200 Code: %s" % r.text)
> >        else:
> >            raise Exception("Status Code: %s - Error" % r.status_code)
> >        return self.session
>
>


Rest API - Need to Improve

2017-07-07 Thread John Omernik
Hello all, I recently setup some notebooks using the Rest API.

I found that I was using Drill 1.8, and my code for determining
authentication in Python, while hacky, worked...  What I found is using
python requests, when I posted to j_security_check, the requests object
almost always returned a HTTP 200.

If it was a bad username/password, I parsed the page and looked for that
text.  In 1.8, if login was successful, the string "Number of Drill Bits"
appeared in the response text. From the requests module perspective,
both used HTTP 200 as a status code.  However, on a successful login, Drill
actually sends an HTTP 303 that redirects to /, and apparently requests
follows that redirect like nothing happened, never telling me, the
programmer, about the 303.

However, in Drill 1.10, the UI for the / page improved, and likely, the
string "Number of Drill Bits" was removed.  This made it so I now had to
reprogram my auth code to handle it better.

So here I am: what IS the best way to determine programmatically, in a way
that will remain stable across releases, whether login was successful?
Obviously string parsing is prone to error and can change.  I tried checking
for the presence of a JSESSIONID, but that can show up either way.

So what is the "right" way to indicate login is successful? Could we do
something better with response codes? I know having the Form seems to make
sense, but could we just use basic authentication and set the header with
the form? That way, we could issue single queries with basic auth and get a
Unauthorized if the authentication didn't work.  I don't know if I have an
answer, but I do know that working with the Rest API isn't all that
intuitive, it should be easier to tell if login was successful...  I am
interested in the thoughts of others here.

John







(Code)
def authDrill(self):
    url = self.drill_base_url + "/j_security_check"
    login = {'j_username': self.drill_user, 'j_password': self.drill_pass}

    verify = "/etc/ssl/certs/ca-certificates.crt"

    if self.drill_pin_to_ip == True:
        verify = False
        requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    else:
        verify = "/etc/ssl/certs/ca-certificates.crt"

    r = self.session.post(url, data=login, headers=self.drill_headers,
                          verify=verify)
    if r.status_code == 200:
        if r.text.find("Invalid username/password credentials") >= 0:
            raise Exception("Invalid username/password credentials")
        elif r.text.find("Number of Drill Bits") >= 0:
            pass
        else:
            raise Exception("Unknown HTTP 200 Code: %s" % r.text)
    else:
        raise Exception("Status Code: %s - Error" % r.status_code)
    return self.session
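
A minimal sketch of a less brittle check, leaning on the 303 behavior described
above instead of string parsing (the base URL and TLS handling are left as
assumptions):

import requests

def drill_login(base_url, user, password, verify=True):
    # Post the login form but do NOT follow redirects, so the 303 that Drill
    # sends on a successful login stays visible instead of being swallowed.
    session = requests.Session()
    r = session.post(base_url + "/j_security_check",
                     data={"j_username": user, "j_password": password},
                     allow_redirects=False,
                     verify=verify)
    if r.status_code == 303:
        return session                      # success: redirected to /
    if r.status_code == 200:
        raise Exception("Login rejected: the form was re-rendered with an error")
    raise Exception("Unexpected status code: %s" % r.status_code)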


Re: Drill Session ID between Nodes

2017-06-23 Thread John Omernik
That makes sense, ya, I would love to hear about the challenges of this in
general from the Drill folks.

Also, I wonder if Paul R at MapR has any thoughts on how something like
this would be handled in the Drill-on-YARN setup.


John

On Fri, Jun 23, 2017 at 10:58 AM, Keys Botzum <kbot...@mapr.com> wrote:

> I think we are on the same page regarding SSL.
>
> Regarding (1) it's best that I defer to the drill experts but I will
> mention that sharing session state can greatly complicate scalability.
> Since switching drillbits should be a rare event, it is probably more
> scalable to send back to the client a token which represents the
> authenticated identity (encrypted and signed of course). Then should that
> show up at another drillbit, the user authentication state can be
> reestablished. Other state such as caches would be lost of course. I don't
> know enough about Drill internals of course - there may be other state
> issues beyond just authentication.
>
> Keys
> ___
> Keys Botzum
> Distinguished Engineer, Field Engineering
> kbot...@maprtech.com<mailto:kbot...@maprtech.com>
> 443-718-0098
> MapR Technologies
> http://www.mapr.com
>
>
>
> On Jun 23, 2017, at 11:33 AM, John Omernik <j...@omernik.com<mailto:john@
> omernik.com>> wrote:
>
> So a few things
>
> 1. The issue is that as is,  SSL stuff works fine, but when the IP address
> that DNS returns for the hostname changes, the session is invalidated and I
> am forced to logon again... this is annoying and loses session context
> information.  If I try to lay out my cluster differently, i.e. using the
> wildcard certs and the different marathon layout, I then have different
> issues. I can connect by IP, but then I lose the SSL Validity.  That's
> where the context for SSL comes into play.  My main issue is with the IP
> returned for a DNS request changing during the course of a session,
> invalidating it.  I think what it comes down to for me is this statement:
>
> As a user I connect to a drill cluster
>
> A simple statement, but what that means is as a user or an admin, my
> users/code accessing the cluster shouldn't have to care which individual
> node they connect to; they are connecting to a cluster.  This is
> oversimplifying things, but session ids managed by the cluster via Zookeeper
> would solve this.
>
>
> 2. I am looking at doing the SSL handling overrides in my python code;
> requests has some handlers for SSL and I was looking to address this.
> However, there is a bug in how it works because it drops my custom port
> value... I am working on this now with the python requests folks.  (i.e.
> the custom handlers would work, but only if I was connecting on port 443)
>
>
>
>
>
> On Fri, Jun 23, 2017 at 9:52 AM, Keys Botzum <kbot...@mapr.com kbot...@mapr.com>> wrote:
>
> There is something here I'm not understanding. In the below the hostname
> is always the same so there should be no problem as long as all drillbits
> share a common signer.
>
> I'm also just not following how certificate authentication issues are even
> linked to the Drill session issues. Whether or not there is a Drill
> session, the SSL handshake rules still apply. Or there is something here I
> just don't understand - quite possibly of course. I'm just focused on the
> SSL issue as this I understand very well.
>
> Incidentally, regarding hostname verification, I'm not familiar with what
> controls you have but many libraries (including Java) give you the ability
> to write your own SSL verifier which is called only when the default
> hostname verification fails. In that code you can implement different
> rules. Perhaps you can find a rule that meets your needs (such as a common
> signer for all Drillbits). Remember that certificate hostname validation is
> just a convention. There is nothing about SSL that makes this necessary.
> Here's the Java version: https://docs.oracle.com/
> javase/7/docs/api/javax/net/ssl/HostnameVerifier.html. In case you are
> curious, this is how MapR's maprlogin works with HTTPS even though we use
> IP address by default.
>
> Keys
> ___
> Keys Botzum
> Distinguished Engineer, Field Engineering
> kbot...@maprtech.com<mailto:kbot...@maprtech.com><mailto:kbo
> t...@maprtech.com>
> 443-718-0098
> MapR Technologies
> http://www.mapr.com
>
>
>
> On Jun 23, 2017, at 10:22 AM, John Omernik <j...@omernik.com<mailto:john@
> omernik.com>> wrote:
>
> The wild card certificate isn't a problem on it's own, it's using it in a
> manner that allows me to maintain all of the various features I want.  Let
> me lay this out,
>
> In marathon I have a ta

Re: Drill Session ID between Nodes

2017-06-23 Thread John Omernik
So a few things

1. The issue is that as is,  SSL stuff works fine, but when the IP address
that DNS returns for the hostname changes, the session is invalidated and I
am forced to logon again... this is annoying and loses session context
information.  If I try to lay out my cluster differently, i.e. using the
wildcard certs and the different marathon layout, I then have different
issues. I can connect by IP, but then I lose the SSL Validity.  That's
where the context for SSL comes into play.  My main issue is with the IP
returned for a DNS request changing during the course of a session,
invalidating it.  I think what it comes down to for me is this statement:

As a user I connect to a drill cluster

A simple statement, but what that means is as a user or an admin, my
users/code accessing the cluster shouldn't have to care which individual
node they connect to; they are connecting to a cluster.  This is
oversimplifying things, but session ids managed by the cluster via Zookeeper
would solve this.


2. I am looking at doing the SSL handling overrides in my python code;
requests has some handlers for SSL and I was looking to address this.
However, there is a bug in how it works because it drops my custom port
value... I am working on this now with the python requests folks.  (i.e.
the custom handlers would work, but only if I was connecting on port 443)
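
One possible shape of that kind of override, sketched with python-requests (an
assumption-laden sketch: it presumes the installed urllib3 honors
assert_hostname; it keeps certificate-chain validation and only relaxes the
hostname match):

import requests
from requests.adapters import HTTPAdapter

class RelaxedHostnameAdapter(HTTPAdapter):
    # Keep chain-of-trust validation but skip strict hostname matching, so a
    # cluster-wide cert can be presented by whichever drillbit IP DNS hands back.
    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        pool_kwargs['assert_hostname'] = False
        return super(RelaxedHostnameAdapter, self).init_poolmanager(
            connections, maxsize, block=block, **pool_kwargs)

session = requests.Session()
session.mount('https://', RelaxedHostnameAdapter())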





On Fri, Jun 23, 2017 at 9:52 AM, Keys Botzum <kbot...@mapr.com> wrote:

> There is something here I'm not understanding. In the below the hostname
> is always the same so there should be no problem as long as all drillbits
> share a common signer.
>
> I'm also just not following how certificate authentication issues are even
> linked to the Drill session issues. Whether or not there is a Drill
> session, the SSL handshake rules still apply. Or there is something here I
> just don't understand - quite possibly of course. I'm just focused on the
> SSL issue as this I understand very well.
>
> Incidentally, regarding hostname verification, I'm not familiar with what
> controls you have but many libraries (including Java) give you the ability
> to write your own SSL verifier which is called only when the default
> hostname verification fails. In that code you can implement different
> rules. Perhaps you can find a rule that meets your needs (such as a common
> signer for all Drillbits). Remember that certificate hostname validation is
> just a convention. There is nothing about SSL that makes this necessary.
> Here's the Java version: https://docs.oracle.com/
> javase/7/docs/api/javax/net/ssl/HostnameVerifier.html. In case you are
> curious, this is how MapR's maprlogin works with HTTPS even though we use
> IP address by default.
>
> Keys
> ___
> Keys Botzum
> Distinguished Engineer, Field Engineering
> kbot...@maprtech.com<mailto:kbot...@maprtech.com>
> 443-718-0098
> MapR Technologies
> http://www.mapr.com
>
>
>
> On Jun 23, 2017, at 10:22 AM, John Omernik <j...@omernik.com<mailto:john@
> omernik.com>> wrote:
>
> The wild card certificate isn't a problem on it's own, it's using it in a
> manner that allows me to maintain all of the various features I want.  Let
> me lay this out,
>
> In marathon I have a task, it runs a drill bit.  Since that task is located
> at the node prod/drillprod (for my env it's role/instanceid) the domain
> name is set up to be
>
> drillprod-prod.marathon.mesos.
>
> I can run X number of instances of that task. I tell Marathon to make them
> "host unique" so no two drill bits end up on the same node.  This gives me
> a few things
>
> 1. If I choose to run 3 drillbits, they go and run, and I don't
> have to worry about them.  If I have to reboot the node one of them is on,
> Marathon says "oh look, I am only running 2, let's spin up another," and then
> I get my required 3 bits running automatically.
>
> 2. They use a common config directory located in MapR-FS this is really
> nice because I don't have to maintain separate configurations for each
> drill bit.
>
> 3. The name above, drillprod-prod.marathon.mesos, using nslookup returns
>
> Name: drillprod-prod.marathon.mesos
>
> Address: 192.168.0.105
>
> Name: drillprod-prod.marathon.mesos
>
> Address: 192.168.0.103
>
> Name: drillprod-prod.marathon.mesos
>
> Address: 192.168.0.104
>
>
> Which is desired. When I have a client connect, I can program in a single
> name (drillprod-prod.marathon.mesos) into my script and never have to
> worry
> about where the bits run on the cluster.  It looks it up and works great.
> This has been my standard MO for scripts that do short lived things... I
> haven't had an issue until this new use case came up. (long running
> sessions f

Re: Drill Session ID between Nodes

2017-06-23 Thread John Omernik
 drill1, drill2, drill3, drill4
>
>
> Keys
> ___
> Keys Botzum
> MapR Technologies
>
>
>
> On Jun 22, 2017, at 8:24 PM, John Omernik <j...@omernik.com<mailto:john@
> omernik.com>> wrote:
>
> Would there be interest in finding a way to globalize this? This is
> challenging for me and others that may run drill with multi-tenant
> orchestrators.  In my particular setup, each node running drill gets added
> to an A record automatically, giving me HA and distribution of Rest API
> queries.  It also allows me to have a single certificate for my cluster
> rather than managing certificates on an individual basis.  I set things up
> to connect via IP but then I had certificate mismatch warnings. My goal is
> to find a way to connect to the REST API, while maintaining a session to a
> single node, without sacrificing HA and balancing and without compromising
> SSL security.  I know it's a tall order, but if there are ideas outside of
> global state management, I am all ears.
>
> Note some ideas I've also considered:
>
> 1.  using a load balancer that would allow me to pin connections.  Not
> ideal because it's another service to manage but it would work.
>
> 2. There may be a way to hack things with a wildcard cert but it seems
> complicated and fragile.
>
> On Jun 22, 2017 5:47 PM, "Sorabh Hamirwasia" <shamirwa...@mapr.com shamirwa...@mapr.com>> wrote:
>
> Hi John,
> As Paul mentioned session ID's are not global. Each session is part of the
> BitToUserConnection instance created for a connection between Drillbit and
> client. Hence it's local to that Drillbit only and the lifetime of the
> session is tied to lifetime of the connection. You can find the code here<
> https://github.com/apache/drill/blob/master/exec/
> java-exec/src/main/java/org/apache/drill/exec/rpc/user/
> UserServer.java#L102>.
>
> Thanks,
> Sorabh
>
> 
> From: Paul Rogers <prog...@mapr.com>
> Sent: Thursday, June 22, 2017 2:19:50 PM
> To: user@drill.apache.org
> Subject: Re: Drill Session ID between Nodes
>
> Hi John,
>
> I do not believe that session IDs are global. Each Drillbit maintains its
> own concept of sessions. A global session would require some centralized
> registry of sessions, which Drill does not have.
>
> Would be great if someone can confirm…
>
> - Paul
>
> On Jun 22, 2017, at 12:14 PM, John Omernik <j...@omernik.com> wrote:
>
> When I log onto a drill node, and get Session Id, if I connect to another
> drill node in the cluster will the session id be valid?
>
> I am guessing not, but want to validate.
>
> My conundrum: I have my Drill cluster running in such a way that the
> connections to the nodes are load balanced via DNS. However, if I get a
> DNS
> IP while in session it appears to invalidate, and thus forces me to log
> on...
>
>
>
>


Re: Drill Session ID between Nodes

2017-06-22 Thread John Omernik
Would there be interest in finding a way to globalize this? This is
challenging for me and others that may run drill with multi-tenant
orchestrators.  In my particular setup, each node running drill gets added
to an A record automatically, giving me HA and distribution of Rest API
queries.  It also allows me to have a single certificate for my cluster
rather than managing certificates on an individual basis.  I set things up
to connect via IP but then I had certificate mismatch warnings. My goal is
to find a way to connect to the REST API, while maintaining a session to a
single node, without sacrificing HA and balancing and without compromising
SSL security.  I know it's a tall order, but if there are ideas outside of
global state management, I am all ears.

Note some ideas I've also considered:

1.  using a load balancer that would allow me to pin connections.  Not
ideal because it's another service to manage but it would work.

2. There may be a way to hack things with a wildcard cert but it seems
complicated and fragile.
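
For context, the round-robin behavior behind that A record is easy to see from
the client side; a tiny sketch (the cluster name is just the one from my
Marathon setup, yours will differ):

import socket

# Each lookup of the load-balanced name can hand the drillbit IPs back in a
# different order, which is why a client can silently land on a different
# node mid-session.
name = "drillprod-prod.marathon.mesos"
hostname, aliases, addresses = socket.gethostbyname_ex(name)
print(addresses)   # e.g. ['192.168.0.103', '192.168.0.104', '192.168.0.105']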

On Jun 22, 2017 5:47 PM, "Sorabh Hamirwasia" <shamirwa...@mapr.com> wrote:

> Hi John,
> As Paul mentioned session ID's are not global. Each session is part of the
> BitToUserConnection instance created for a connection between Drillbit and
> client. Hence it's local to that Drillbit only and the lifetime of the
> session is tied to lifetime of the connection. You can find the code here<
> https://github.com/apache/drill/blob/master/exec/
> java-exec/src/main/java/org/apache/drill/exec/rpc/user/
> UserServer.java#L102>.
>
> Thanks,
> Sorabh
>
> 
> From: Paul Rogers <prog...@mapr.com>
> Sent: Thursday, June 22, 2017 2:19:50 PM
> To: user@drill.apache.org
> Subject: Re: Drill Session ID between Nodes
>
> Hi John,
>
> I do not believe that session IDs are global. Each Drillbit maintains its
> own concept of sessions. A global session would require some centralized
> registry of sessions, which Drill does not have.
>
> Would be great if someone can confirm…
>
> - Paul
>
> > On Jun 22, 2017, at 12:14 PM, John Omernik <j...@omernik.com> wrote:
> >
> > When I log onto a drill node, and get Session Id, if I connect to another
> > drill node in the cluster will the session id be valid?
> >
> > I am guessing not, but want to validate.
> >
> > My conundrum: I have my Drill cluster running in such a way that the
> > connections to the nodes are load balanced via DNS. However, if I get a
> DNS
> > IP while in session it appears to invalidate, and thus forces me to log
> > on...
>
>


Drill Session ID between Nodes

2017-06-22 Thread John Omernik
When I log onto a drill node, and get Session Id, if I connect to another
drill node in the cluster will the session id be valid?

I am guessing not, but want to validate.

My conundrum: I have my Drill cluster running in such a way that the
connections to the nodes are load balanced via DNS. However, if DNS returns a
different IP while I am in a session, the session appears to be invalidated,
and thus forces me to log on...


Re: Using Apache Drill with AirBnB SuperSet

2017-06-14 Thread John Omernik
Be warned, it's a hacked-together thing that I and some others (waves at
Charles) did to prove it out... we'd love it if MapR sought to extend it and
put some programming rigor around it... *hint hint* :)



On Wed, Jun 14, 2017 at 3:04 PM, Kunal Khatua <kkha...@mapr.com> wrote:

> The Superset project looks pretty neat.
>
> Didn't realize that there is also a Python Driver for Drill. I'd think
> that would be useful too.
>
> -Original Message-----
> From: John Omernik [mailto:j...@omernik.com]
> Sent: Wednesday, June 14, 2017 3:45 AM
> To: user <user@drill.apache.org>
> Subject: Using Apache Drill with AirBnB SuperSet
>
> Hey all, I've had some success getting Drill working with Superset
> (visualization tool) via work done by a few people (not just me). I thought
> I'd share what I was using and how it was working, and see if others benefit
> from it; if issues occur, they can be posted as issues to improve things.
>
>
>
> https://github.com/JohnOmernik/sadrill
>
>
> Super set Git:
>
> https://github.com/airbnb/superset
>
>
> Have fun, let me know how it goes
>
> John
>


Using Apache Drill with AirBnB SuperSet

2017-06-14 Thread John Omernik
Hey all, I've had some success getting Drill working with Superset
(visualization tool) via work done by a few people (not just me). I thought
I'd share what I was using and how it was working, and see if others benefit
from it; if issues occur, they can be posted as issues to improve things.



https://github.com/JohnOmernik/sadrill


Super set Git:

https://github.com/airbnb/superset


Have fun, let me know how it goes

John


Re: Parquet, Arrow, and Drill Roadmap

2017-05-04 Thread John Omernik
I've created a JIRA on this request. The idea here is to have some
higher-level descriptions of these projects (I included Calcite in the JIRA
too): what they do for the project, what the current state of integration is,
what options we have for future states, and what benefits those future states
bring.  For Parquet, I think we could go deeper into some of the
settings/tweaks, with real-world examples to help folks work with their data
better.

Thanks!


https://issues.apache.org/jira/browse/DRILL-5471
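
As a starting point for that kind of write-up, a small sketch that pulls the
current parquet-related options out of sys.options over the REST API (the
drillbit address and open, unauthenticated access are assumptions):

import requests

DRILL_QUERY_URL = "http://localhost:8047/query.json"  # hypothetical drillbit

sql = "SELECT * FROM sys.options WHERE name LIKE 'store.parquet.%'"
rows = requests.post(DRILL_QUERY_URL,
                     json={"queryType": "SQL", "query": sql}).json().get("rows", [])
for row in rows:
    # Each row is a dict keyed by the sys.options column names.
    print(row.get("name"), row.get("status"))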

On Tue, May 2, 2017 at 1:46 PM, Padma Penumarthy 
wrote:

> One thing I want to add: use_new_reader uses the reader from the parquet-mr
> library, whereas the default one is Drill's native reader, which is supposed
> to be better performance-wise. But the native reader does not support complex
> types, and we automatically switch to the reader from the parquet library
> when we have to read complex types.
>
> Thanks,
> Padma
>
>
> On May 2, 2017, at 11:09 AM, Jinfeng Ni  apache.org>> wrote:
>
>
> - What the two readers are (is one a special drill thing, is the other  a
> standard reader from the parquet project?)
> - What is the eventual goal here... to be able to use and switch between
> both? To provide the option? To have code parity with another project?
>
> Both readers were for reading parquet data into Drill's value vectors.
> The default one (when store.parquet.use_new_reader is false) was
> faster (based on measurements done by the people who worked on the two
> readers), but it could not support complex types like map/array.  The
> new reader would be used by Drill either if you change the option to
> true, or when the parquet data you are querying contains complex types
> (even with the default option being false). Therefore, both readers
> might be used by Drill code.
>
> There was a Parquet hackathon some time ago, which aimed to make
> people in different projects using parquet work together to
> standardize a vectorized reader. I did not keep track of that effort.
> People with better knowledge of that may share their inputs.
>
>
> - Do either of the readers work with Arrow?
>
> For now, neither works with Arrow, since Drill has not integrated with
> Arrow yet. See DRILL-4455 for the latest discussion
> (https://issues.apache.org/jira/browse/DRILL-4455).  I would expect
> Drill's parquet reader will work with Arrow, once the integration is
> done.
>
>


Re: Discussion: Comments in Drill Views

2017-05-02 Thread John Omernik
I created  a JIRA for this based on the Hangout today!

https://issues.apache.org/jira/browse/DRILL-5461



On Mon, Mar 6, 2017 at 7:55 AM, John Omernik <j...@omernik.com> wrote:

> I can see both sides. But Ted is right, this won't hurt anything from a
> performance perspective, even if they put War and Peace in there 30 times,
> that's 100mb of information to serve. People may choose to use formatting
> languages like Markup or something. I do think we should have a limit so we
> know what happens if someone tries to break that limit (from a security
> perspective) but we could set that quite high, and then just test putting
> data that exceeds that as a unit test.
>
>
>
> On Fri, Mar 3, 2017 at 8:28 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> All of War and Peace is only 3MB.
>>
>> Let people document however they want. Don't over-optimize for problems
>> that have never occurred.
>>
>>
>>
>> On Fri, Mar 3, 2017 at 3:19 PM, Kunal Khatua <kkha...@mapr.com> wrote:
>>
>> > It might be, in case someone begins to dump a massive design doc into the
>> > comment field for a view's JSON.
>> >
>> >
>> > I'm also not sure about how this information can be consumed. If it is
>> > through CLI, either we rely on the SQLLine shell to trim the output, or
>> not
>> > worry at all. I'm assuming we'd also probably want something like a
>> >
>> > DESCRIBE VIEW ...
>> >
>> > to be enhanced to something like
>> >
>> > DESCRIBE VIEW WITH COMMENTARY ...
>> >
>> >
>> > A 1KB field is quite generous IMHO. That's more than 7 tweets to
>> > describe something!
>> >
>> >
>> > Kunal Khatua
>> >
>> > 
>> > From: Ted Dunning <ted.dunn...@gmail.com>
>> > Sent: Friday, March 3, 2017 12:56:44 PM
>> > To: user
>> > Subject: Re: Discussion: Comments in Drill Views
>> >
>> > It it really necessary to put a technical limit in to prevent people
>> from
>> > OVER-documenting views?
>> >
>> >
>> > What is the last time you saw code that had too many comments in it?
>> >
>> >
>> >
>> > On Thu, Mar 2, 2017 at 8:42 AM, John Omernik <j...@omernik.com> wrote:
>> >
>> > > So I think on your worry that's an easily definable "abuse"
>> condition...
>> > > i.e. if we set a limit of say 1024 characters, that provides ample
>> space
>> > > for descriptions, but at 1kb per view, that's an allowable condition,
>> > i.e.
>> > > it would be hard to abuse it ... or am I missing something?
>> > >
>> > > On Wed, Mar 1, 2017 at 8:08 PM, Kunal Khatua <kkha...@mapr.com>
>> wrote:
>> > >
>> > > > +1
>> > > >
>> > > >
>> > > > I this this can be very useful. The only worry is of someone abusing
>> > it,
>> > > > so we probably should have a limit on the size of this? Not sure
>> else
>> > it
>> > > > could be exposed and consumed.
>> > > >
>> > > >
>> > > > Kunal Khatua
>> > > >
>> > > > Engineering
>> > > >
>> > > > [MapR]<http://www.mapr.com/>
>> > > >
>> > > > www.mapr.com<http://www.mapr.com/>
>> > > >
>> > > > 
>> > > > From: John Omernik <j...@omernik.com>
>> > > > Sent: Wednesday, March 1, 2017 9:55:27 AM
>> > > > To: user
>> > > > Subject: Re: Discussion: Comments in Drill Views
>> > > >
>> > > > Sorry, I let this idea drop (I didn't follow up and found when
>> > searching
>> > > > for something else...)  Any other thoughts on this idea?
>> > > >
>> > > > Should I open a JIRA if people think it would be handy?
>> > > >
>> > > > On Thu, Jun 23, 2016 at 4:02 PM, Ted Dunning <ted.dunn...@gmail.com
>> >
>> > > > wrote:
>> > > >
>> > > > > This is very interesting.  I love docstrings in Lisp and Python
>> and
>> > > > Javadoc
>> > > > > in Java.
>> > > > >
>> > > > > Basically this is like that, but for SQL. Very helpful.
>> > > > >
>> > > > > On Thu, Jun 23, 2

Re: Clarification on Drill Options

2017-05-02 Thread John Omernik
Looks like some work has been done here, any chance we can move this along?

https://issues.apache.org/jira/browse/DRILL-4699


Thanks!

On Tue, May 31, 2016 at 12:51 PM, John Omernik <j...@omernik.com> wrote:

> I added a JIRA related to this:
>
> https://issues.apache.org/jira/browse/DRILL-4699
>
> On Sun, May 29, 2016 at 6:55 AM, John Omernik <j...@omernik.com> wrote:
>
>> Hey all, when looking at the drill options, and specifically as I was
>> trying to understand the parquet options, I realized that the naming of the
>> options was forming "question" as I looked at them. What do I mean?
>> Consider:
>>
>> +---------------------------------------------+
>> | name                                        |
>> +---------------------------------------------+
>> | store.parquet.block-size                    |
>> | store.parquet.compression                   |
>> | store.parquet.dictionary.page-size          |
>> | store.parquet.enable_dictionary_encoding    |
>> | store.parquet.page-size                     |
>> | store.parquet.use_new_reader                |
>> | store.parquet.vector_fill_check_threshold   |
>> | store.parquet.vector_fill_threshold         |
>> +---------------------------------------------+
>>
>>
>>
>> So I will remove "store.parquet" as I refer to them here:
>>
>>
>> use_new_reader - This seems fairly obviously an "on read" option and
>> (maybe?) does affect the Parquet writer, yet "enable_dictionary_encoding"
>> is likely ONLY an on write option correct? I mean, if the Parquet file
>> was written somewhere else, and written with Dictionary encoding, Drill
>> will still read it ok, regardless of this setting. Compression as well, if
>> the Parquet file was created with gzip, and this setting is snappy, it will
>> still read it, same goes for block size. Thus, those seem to be "writer"
>> settings, rather than reader settings.
>>
>>
>> So what about the vector settings? Write or Read (or both?) For json
>> there is this setting: store.json.writer.uglify, which is obviously
>> writer-focused; but for other settings, knowing what
>> the setting applies to, on write, on read, neither, or both, could be very
>> useful for troubleshooting and knowing which settings to play with.
>>
>>
>> Now, changing these settings as they are is not recommended, even in my
>> test clusters, I have scripts that alter them for specific ETLs, and I
>> would hate to have things break, but how hard would it be to add a string
>> column to sys.options something like "applies_to" with write, read, both,
>> neither, n/a as options?   I think this could be valuable for users and
>> administrators of Drill.
>>
>>
>> One other note, in addition to the applies_to,  would it be horrifically
>> difficult to add a  "description" field for options?  Self documenting
>> settings sure would be handy  :)
>>
>>
>> John
>>
>>
>>
>


Re: Dealing with bad data when trying to do date computations

2017-05-02 Thread John Omernik
I just want to say, there is a great JIRA already opened here:

https://issues.apache.org/jira/browse/DRILL-4258

I added a comment; I would encourage others to add comments if they think
this idea would be beneficial.

On Wed, Mar 1, 2017 at 8:50 AM, John Omernik <j...@omernik.com> wrote:

> So what would need to be done to get this process kick started?  I see a
> few components here:
>
> 1. Develop the table in sys (sys.functions) that stores the information
> about the function.  For this, I propose the following for discussion:
>
> name - The name of the function
> description - The Description of the function
> docgroup - This should be the groupings under "SQL Functions" in the
> documentation. See * below for proposal
> tags - A way to tag functions for easy searching, like select * from
> sys.functions where tags like '%string%'
> arguments - The Ordered list of arguments and their types
> return - What is returned (and its type)
> Examples: Usage examples that can display nicely in both select * from
> sys.functions as well as provide enough information for use in HTML Docs.
>
>
> * docgroup.
>
> Currently, in the documentation, the nesting is as below (I didn't expand
> all the stuff on SQL Window and Nested Data). I want to outline my proposal
> for how to handle this... So I propose moving "SQL Window Functions",
> "Nested Data Functions" and "Query Directory Functions" all under SQL
> Functions and adding another group here "Common Functions".  This will now
> be the "Functions root".  The Docgroup field in sys.function will have
> levels and grouping separated by ":".  Thus,  firstlevel:second:level  or
> firstlevel:secondlevel:thirdlevel  This will allow us to keep the
> grouping in the sys.functions table, and make it so that on every release,
> the documentation could be updated with a query.  Note the level "SQL
> Functions" would not be represented in the docgroup.
>
> Here are some examples of functions
>
> LOWER()
> Current Doc Placement: SQL Functions -> String Manipulation
> Proposed docgroup: 'Common Functions:String Manipulation'
> How it would appear on the Doc page: SQL Functions -> Common Functions ->
> String Manipulation
>
> COUNT()
> Current Doc Placement: SQL Functions -> SQL Window Functions -> Aggregate
> Window Functions
> Proposed docgroup: 'SQL Window Functions:Aggregate Window Functions'
> How it would appear on the Doc pages: SQL Functions -> SQL Window
> Functions -> Aggregate Window Functions
>
> Basically it would be used to add the raw HTML to the final page in the
> docgroup (Aggregate Window Functions)
>
>
> Current Doc Layout:
>
> SQL Reference
> SQL Reference Introduction
> -> Data Types
> Lexical Structure
> Operators
> -> SQL Functions
> ->-> About SQL Function Examples
> ->-> Math and Trig
> ->->Data Type Conversion
> ->->Date/Time Functions and Arithmetic
> ->->String Manipulation
> ->->Aggregate and Aggregate Statistical
> ->-> Functions for handling Nulls
> -> SQL Window Functions
> ->-> ...
> ->-> ...
> -> Nested Data Functions
> ->->...
> ->->...
> -> Query Directory Functions
>
> 2. Once we get to a certain point in development on sys.functions, we need
> a call to arms.  We need to come up with an initial "fill" of
> sys.functions.  For that we'll need to take the current data and fill it in,
> as well as getting a list of all the ninja functions that have been added to
> drill and not documented... not sure how to get those without an intense
> code/JIRA review.
>
> 3.  Come up with new function proposal guidelines.  If you do a pull
> request with a function, what will need to be included for your pull
> request to be approved? We should not allow functions to be added to Drill
> without a basic doc addition.
>
> 4.  Update procedures?
>
> This is complicated, but done well, it could really put the knowledge and
> analyst needs right in the system itself!
>
> John
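
Since the proposal above says the docs could be regenerated with a query at
each release, here is a rough sketch of what that generator could look like.
The description/docgroup/examples columns are the *proposed* schema from this
thread, not columns that exist in sys.functions today, and the drillbit
address is hypothetical.

import requests
from collections import defaultdict

DRILL_QUERY_URL = "http://localhost:8047/query.json"  # hypothetical drillbit

# NOTE: description, docgroup and examples are proposed columns, not real ones yet.
sql = ("SELECT name, description, docgroup, examples "
       "FROM sys.functions ORDER BY docgroup, name")
rows = requests.post(DRILL_QUERY_URL,
                     json={"queryType": "SQL", "query": sql}).json().get("rows", [])

pages = defaultdict(list)
for row in rows:
    pages[row["docgroup"]].append(row)   # one doc page per docgroup

for group, funcs in sorted(pages.items()):
    # 'Common Functions:String Manipulation' -> "Common Functions -> String Manipulation"
    print("==", " -> ".join(group.split(":")), "==")
    for f in funcs:
        print(f["name"], "-", f["description"])
        print("  Example:", f["examples"])
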
>
>
>
>
> On Tue, Feb 28, 2017 at 12:20 PM, John Omernik <j...@omernik.com> wrote:
>
>> You could also generate documentation updates via query at each release.
>> This would be a great feature, moving the information closer to the
>> analysts' hands; I love how that would work.  (I think I remember some talk
>> about extending sys.options to be self-documenting as well.)
>>
>>
>>
>> On Tue, Feb 28, 2017 at 12:11 PM, Jinfeng Ni <j...@apache.org> wrote:
>>
>>> Regarding the list of functions (build-in or UDF), someone once
>>> suggested that 

Parquet, Arrow, and Drill Roadmap

2017-05-01 Thread John Omernik
Hey all - I posted this to both dev and user, as I could mentally make the
argument for both.

Sorry if this is answered somewhere already. I know in the past, there have
been discussions around using two different readers for Parquet, and
performance gains/losses, issues, etc.

Right now, the store.parquet.use_new_reader is set to false (1.10) and I
was trying to get more information about this and what the eventual roadmap
for drill will be...

So, in the docs I see use_new_reader is listed as not supported in this
release.  What I am looking for is a little information on:

- What the two readers are (is one a special drill thing, is the other  a
standard reader from the parquet project?)
- What is the eventual goal here... to be able to use and switch between
both? To provide the option? To have code parity with another project?
- Do either of the readers work with Arrow?
- How does Arrow and Parquet readers fit together?
- Will the readers ever converge?

I have other questions too, but these are examples of where I am coming
from. Is there a good starting place for my research on this subject?

Thanks,
John


Re: NPE When Selecting from MapR-DB Table

2017-04-12 Thread John Omernik
Drill 1.10.0 (from MapR, with the patch) fixed my problem. Thanks!

On Thu, Apr 6, 2017 at 4:16 PM, Padma Penumarthy <ppenumar...@mapr.com>
wrote:

> BTW, the option to use to change slice target is planner.slice_target.
> For example,  alter session set `planner.slice_target` = 6;
>
> Thanks,
> Padma
>
>
> On Apr 6, 2017, at 1:38 PM, Padma Penumarthy <ppenumar...@mapr.com ppenumar...@mapr.com>> wrote:
>
> Can you send the query profile ? How is your data distributed i.e.
> how big is the table and how many regions and avg row count per region ?
>
> This problem can happen when we don’t have a minor fragment (for scanning)
> scheduled on
> a node which is hosting one or more hbase regions.  That can happen if we
> do not have
> enough work to do (based on total rowCount and slice target) to schedule
> fragments on all nodes.
> One thing you can try is lower the slice target so we create fragments on
> all nodes.
> Depending upon your configuration, making it close to average rowCount per
> region might
> be the ideal thing to do.
>
> Thanks,
> Padma
>
>
> On Apr 6, 2017, at 12:03 PM, John Omernik <j...@omernik.com<mailto:john@
> omernik.com>> wrote:
>
> By the way, are there any ways to manually prod the data to make it so the
> queries work again? It seems like an off-by-one type issue; can I add
> something to my data to make it work?
>
> On Thu, Apr 6, 2017 at 2:03 PM, John Omernik <j...@omernik.com j...@omernik.com>> wrote:
>
> Oh nice, 1.10 from MapR has the fix? Great... Looking forward to that!
> Thanks!
>
> On Thu, Apr 6, 2017 at 1:59 PM, Abhishek Girish <agir...@apache.org
> <mailto:agir...@apache.org>>
> wrote:
>
> I'm guessing the next release of Apache Drill could be a few months away.
> MapR Drill 1.10.0 release (which does contain the fix for DRILL-5395)
> should be out shortly, within the next week or so.
>
> On Thu, Apr 6, 2017 at 11:50 AM, John Omernik <j...@omernik.com j...@omernik.com>> wrote:
>
> Nope no other issues. I was waiting on 1.10 to be available from MapR,
> do
> we know the release date for 1.11?
>
> On Thu, Apr 6, 2017 at 1:31 PM, Abhishek Girish <agir...@apache.org
> <mailto:agir...@apache.org>>
> wrote:
>
> Are there any other issues apart from the one being discussed? Builds
> from
> Apache should work with MapR-DB tables (when built with mapr profile).
> Let
> us know if you are having any trouble.
>
> The fix for DRILL-5395 should be available this week, afaik. You could
> also
> build off Padma's branch if you need it urgently.
>
> On Thu, Apr 6, 2017 at 11:25 AM, John Omernik <j...@omernik.com j...@omernik.com>>
> wrote:
>
> Is there any work around except wait? This is unfortunate...  I
> wonder
> if I
> could beg a MapR build off the MapR team if I offered them
> beer/cookies.
> (
> I have been unable to get MapR Tables to work with Builds off the
> Apache
> main line)
>
>
>
>
>
> On Thu, Apr 6, 2017 at 1:06 PM, Abhishek Girish <agir...@apache.org
> <mailto:agir...@apache.org>
>
> wrote:
>
> Could be related to DRILL-5395
> <https://issues.apache.org/jira/browse/DRILL-5395>. Once
> committed,
> the
> fix
> should be available in Apache master.
>
> On Thu, Apr 6, 2017 at 10:56 AM, John Omernik <j...@omernik.com j...@omernik.com>>
> wrote:
>
> Hello all, I am using Drill 1.8 and MapR 5.2. I just finished a
> large
> load
> of data into a mapr table. I was able to confirm that the table
> returns
> data from the c api for hbase, so no issue there, however, when
> I
> select
> from the table in Drill, either from the table directly, or
> from a
> view I
> created, then I get the NPE as listed below. Any advice on how
> to
> troubleshoot further would be appreciated!
>
>
> 0: jdbc:drill:zk:zeta2.brewingintel.com<http://brewingintel.com>:5181,>
> select * from
> maprpcaps
> limit 1;
>
> Error: SYSTEM ERROR: NullPointerException
>
>
>
> [Error Id: 6abaeadb-6e1b-4dce-9d1f-54b99a40becb on
> zeta4.brewingintel.com<http://zeta4.brewingintel.com>:20005]
>
>
> (org.apache.drill.exec.work.foreman.ForemanException)
> Unexpected
> exception during fragment initialization: null
>
>   org.apache.drill.exec.work.foreman.Foreman.run():281
>
>   java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>
>   java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>
>   java.lang.Thread.run():745
>
> Caused By (java.lang.NullPointerException) null
>
>
> org.apache.drill.exec.store.mapr.db.MapRDBGroupScan.
>

Re: NPE When Selecting from MapR-DB Table

2017-04-06 Thread John Omernik
By the way, are there any ways to manually prod the data to make it so the
queries work again? It seems like an off-by-one type issue; can I add
something to my data to make it work?

On Thu, Apr 6, 2017 at 2:03 PM, John Omernik <j...@omernik.com> wrote:

> Oh nice, 1.10 from MapR has the fix? Great... Looking forward to that!
> Thanks!
>
> On Thu, Apr 6, 2017 at 1:59 PM, Abhishek Girish <agir...@apache.org>
> wrote:
>
>> I'm guessing the next release of Apache Drill could be a few months away.
>> MapR Drill 1.10.0 release (which does contain the fix for DRILL-5395)
>> should be out shortly, within the next week or so.
>>
>> On Thu, Apr 6, 2017 at 11:50 AM, John Omernik <j...@omernik.com> wrote:
>>
>> > Nope no other issues. I was waiting on 1.10 to be available from MapR,
>> do
>> > we know the release date for 1.11?
>> >
>> > On Thu, Apr 6, 2017 at 1:31 PM, Abhishek Girish <agir...@apache.org>
>> > wrote:
>> >
>> > > Are there any other issues apart from the one being discussed? Builds
>> > from
>> > > Apache should work with MapR-DB tables (when built with mapr profile).
>> > Let
>> > > us know if you are having any trouble.
>> > >
>> > > The fix for DRILL-5395 should be available this week, afaik. You could
>> > also
>> > > build off Padma's branch if you need it urgently.
>> > >
>> > > On Thu, Apr 6, 2017 at 11:25 AM, John Omernik <j...@omernik.com>
>> wrote:
>> > >
>> > > > Is there any work around except wait? This is unfortunate...  I
>> wonder
>> > > if I
>> > > > could beg a MapR build off the MapR team if I offered them
>> > beer/cookies.
>> > > (
>> > > > I have been unable to get MapR Tables to work with Builds off the
>> > Apache
>> > > > main line)
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Apr 6, 2017 at 1:06 PM, Abhishek Girish <agir...@apache.org
>> >
>> > > > wrote:
>> > > >
>> > > > > Could be related to DRILL-5395
>> > > > > <https://issues.apache.org/jira/browse/DRILL-5395>. Once
>> committed,
>> > > the
>> > > > > fix
>> > > > > should be available in Apache master.
>> > > > >
>> > > > > On Thu, Apr 6, 2017 at 10:56 AM, John Omernik <j...@omernik.com>
>> > > wrote:
>> > > > >
>> > > > > > Hello all, I am using Drill 1.8 and MapR 5.2. I just finished a
>> > large
>> > > > > load
>> > > > > > of data into a mapr table. I was able to confirm that the table
>> > > returns
>> > > > > > data from the c api for hbase, so no issue there, however, when
>> I
>> > > > select
>> > > > > > from the table in Drill, either from the table directly, or
>> from a
>> > > > view I
>> > > > > > created, then I get the NPE as listed below. Any advice on how
>> to
>> > > > > > troubleshoot further would be appreciated!
>> > > > > >
>> > > > > >
>> > > > > > 0: jdbc:drill:zk:zeta2.brewingintel.com:5181,> select * from
>> > > maprpcaps
>> > > > > > limit 1;
>> > > > > >
>> > > > > > Error: SYSTEM ERROR: NullPointerException
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > [Error Id: 6abaeadb-6e1b-4dce-9d1f-54b99a40becb on
>> > > > > > zeta4.brewingintel.com:20005]
>> > > > > >
>> > > > > >
>> > > > > >   (org.apache.drill.exec.work.foreman.ForemanException)
>> Unexpected
>> > > > > > exception during fragment initialization: null
>> > > > > >
>> > > > > > org.apache.drill.exec.work.foreman.Foreman.run():281
>> > > > > >
>> > > > > > java.util.concurrent.ThreadPoolExecutor.runWorker():1142
>> > > > > >
>> > > > > > java.util.concurrent.ThreadPoolExecutor$Worker.run():617
>> > > > > >
>> > > > > > java.lang.Thread.run():745
>> > > > > >
>> > >

Re: NPE When Selecting from MapR-DB Table

2017-04-06 Thread John Omernik
Oh nice, 1.10 from MapR has the fix? Great... Looking forward to that!
Thanks!

On Thu, Apr 6, 2017 at 1:59 PM, Abhishek Girish <agir...@apache.org> wrote:

> I'm guessing the next release of Apache Drill could be a few months away.
> MapR Drill 1.10.0 release (which does contain the fix for DRILL-5395)
> should be out shortly, within the next week or so.
>
> On Thu, Apr 6, 2017 at 11:50 AM, John Omernik <j...@omernik.com> wrote:
>
> > Nope no other issues. I was waiting on 1.10 to be available from MapR, do
> > we know the release date for 1.11?
> >
> > On Thu, Apr 6, 2017 at 1:31 PM, Abhishek Girish <agir...@apache.org>
> > wrote:
> >
> > > Are there any other issues apart from the one being discussed? Builds
> > from
> > > Apache should work with MapR-DB tables (when built with mapr profile).
> > Let
> > > us know if you are having any trouble.
> > >
> > > The fix for DRILL-5395 should be available this week, afaik. You could
> > also
> > > build off Padma's branch if you need it urgently.
> > >
> > > On Thu, Apr 6, 2017 at 11:25 AM, John Omernik <j...@omernik.com>
> wrote:
> > >
> > > > Is there any work around except wait? This is unfortunate...  I
> wonder
> > > if I
> > > > could beg a MapR build off the MapR team if I offered them
> > beer/cookies.
> > > (
> > > > I have been unable to get MapR Tables to work with Builds off the
> > Apache
> > > > main line)
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Apr 6, 2017 at 1:06 PM, Abhishek Girish <agir...@apache.org>
> > > > wrote:
> > > >
> > > > > Could be related to DRILL-5395
> > > > > <https://issues.apache.org/jira/browse/DRILL-5395>. Once
> committed,
> > > the
> > > > > fix
> > > > > should be available in Apache master.
> > > > >
> > > > > On Thu, Apr 6, 2017 at 10:56 AM, John Omernik <j...@omernik.com>
> > > wrote:
> > > > >
> > > > > > Hello all, I am using Drill 1.8 and MapR 5.2. I just finished a
> > large
> > > > > load
> > > > > > of data into a mapr table. I was able to confirm that the table
> > > returns
> > > > > > data from the c api for hbase, so no issue there, however, when I
> > > > select
> > > > > > from the table in Drill, either from the table directly, or from
> a
> > > > view I
> > > > > > created, then I get the NPE as listed below. Any advice on how to
> > > > > > troubleshoot further would be appreciated!
> > > > > >
> > > > > >
> > > > > > 0: jdbc:drill:zk:zeta2.brewingintel.com:5181,> select * from
> > > maprpcaps
> > > > > > limit 1;
> > > > > >
> > > > > > Error: SYSTEM ERROR: NullPointerException
> > > > > >
> > > > > >
> > > > > >
> > > > > > [Error Id: 6abaeadb-6e1b-4dce-9d1f-54b99a40becb on
> > > > > > zeta4.brewingintel.com:20005]
> > > > > >
> > > > > >
> > > > > >   (org.apache.drill.exec.work.foreman.ForemanException)
> Unexpected
> > > > > > exception during fragment initialization: null
> > > > > >
> > > > > > org.apache.drill.exec.work.foreman.Foreman.run():281
> > > > > >
> > > > > > java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> > > > > >
> > > > > > java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> > > > > >
> > > > > > java.lang.Thread.run():745
> > > > > >
> > > > > >   Caused By (java.lang.NullPointerException) null
> > > > > >
> > > > > >
> > > > > > org.apache.drill.exec.store.mapr.db.MapRDBGroupScan.
> > > > > applyAssignments():205
> > > > > >
> > > > > >
> > > > > > org.apache.drill.exec.planner.fragment.Wrapper$
> > > > > > AssignEndpointsToScanAndStore.visitGroupScan():116
> > > > > >
> > > > > >
> > > > > > org.apache.drill.exec.planner.fragment.Wrapper$
> > > > > > AssignEndpointsToScanAndStore.visitGroupScan():103

Re: NPE When Selecting from MapR-DB Table

2017-04-06 Thread John Omernik
Nope no other issues. I was waiting on 1.10 to be available from MapR, do
we know the release date for 1.11?

On Thu, Apr 6, 2017 at 1:31 PM, Abhishek Girish <agir...@apache.org> wrote:

> Are there any other issues apart from the one being discussed? Builds from
> Apache should work with MapR-DB tables (when built with mapr profile). Let
> us know if you are having any trouble.
>
> The fix for DRILL-5395 should be available this week, afaik. You could also
> build off Padma's branch if you need it urgently.
>
> On Thu, Apr 6, 2017 at 11:25 AM, John Omernik <j...@omernik.com> wrote:
>
> > Is there any work around except wait? This is unfortunate...  I wonder
> if I
> > could beg a MapR build off the MapR team if I offered them beer/cookies.
> (
> > I have been unable to get MapR Tables to work with Builds off the Apache
> > main line)
> >
> >
> >
> >
> >
> > On Thu, Apr 6, 2017 at 1:06 PM, Abhishek Girish <agir...@apache.org>
> > wrote:
> >
> > > Could be related to DRILL-5395
> > > <https://issues.apache.org/jira/browse/DRILL-5395>. Once committed,
> the
> > > fix
> > > should be available in Apache master.
> > >
> > > On Thu, Apr 6, 2017 at 10:56 AM, John Omernik <j...@omernik.com>
> wrote:
> > >
> > > > Hello all, I am using Drill 1.8 and MapR 5.2. I just finished a large
> > > load
> > > > of data into a mapr table. I was able to confirm that the table
> returns
> > > > data from the c api for hbase, so no issue there, however, when I
> > select
> > > > from the table in Drill, either from the table directly, or from a
> > view I
> > > > created, then I get the NPE as listed below. Any advice on how to
> > > > troubleshoot further would be appreciated!
> > > >
> > > >
> > > > 0: jdbc:drill:zk:zeta2.brewingintel.com:5181,> select * from
> maprpcaps
> > > > limit 1;
> > > >
> > > > Error: SYSTEM ERROR: NullPointerException
> > > >
> > > >
> > > >
> > > > [Error Id: 6abaeadb-6e1b-4dce-9d1f-54b99a40becb on
> > > > zeta4.brewingintel.com:20005]
> > > >
> > > >
> > > >   (org.apache.drill.exec.work.foreman.ForemanException) Unexpected
> > > > exception during fragment initialization: null
> > > >
> > > > org.apache.drill.exec.work.foreman.Foreman.run():281
> > > >
> > > > java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> > > >
> > > > java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> > > >
> > > > java.lang.Thread.run():745
> > > >
> > > >   Caused By (java.lang.NullPointerException) null
> > > >
> > > >
> > > > org.apache.drill.exec.store.mapr.db.MapRDBGroupScan.
> > > applyAssignments():205
> > > >
> > > >
> > > > org.apache.drill.exec.planner.fragment.Wrapper$
> > > > AssignEndpointsToScanAndStore.visitGroupScan():116
> > > >
> > > >
> > > > org.apache.drill.exec.planner.fragment.Wrapper$
> > > > AssignEndpointsToScanAndStore.visitGroupScan():103
> > > >
> > > > org.apache.drill.exec.physical.base.
> AbstractGroupScan.accept():63
> > > >
> > > >
> > > > org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.
> > > > visitChildren():138
> > > >
> > > >
> > > > org.apache.drill.exec.planner.fragment.Wrapper$
> > > > AssignEndpointsToScanAndStore.visitOp():134
> > > >
> > > >
> > > > org.apache.drill.exec.planner.fragment.Wrapper$
> > > > AssignEndpointsToScanAndStore.visitOp():103
> > > >
> > > >
> > > > org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.
> > > > visitLimit():92
> > > >
> > > > org.apache.drill.exec.physical.config.Limit.accept():57
> > > >
> > > >
> > > > org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.
> > > > visitChildren():138
> > > >
> > > >
> > > > org.apache.drill.exec.planner.fragment.Wrapper$
> > > > AssignEndpointsToScanAndStore.visitOp():134
> > > >
> > > >
> > > > org.apache.drill.exec.planner.fragment.Wrapper$
> > > > AssignEndpointsToScanAndStore.visitOp():103
> > > >

Re: NPE When Selecting from MapR-DB Table

2017-04-06 Thread John Omernik
Is there any work around except wait? This is unfortunate...  I wonder if I
could beg a MapR build off the MapR team if I offered them beer/cookies. (
I have been unable to get MapR Tables to work with Builds off the Apache
main line)





On Thu, Apr 6, 2017 at 1:06 PM, Abhishek Girish <agir...@apache.org> wrote:

> Could be related to DRILL-5395
> <https://issues.apache.org/jira/browse/DRILL-5395>. Once committed, the
> fix
> should be available in Apache master.
>
> On Thu, Apr 6, 2017 at 10:56 AM, John Omernik <j...@omernik.com> wrote:
>
> > Hello all, I am using Drill 1.8 and MapR 5.2. I just finished a large
> load
> > of data into a mapr table. I was able to confirm that the table returns
> > data from the c api for hbase, so no issue there, however, when I select
> > from the table in Drill, either from the table directly, or from a view I
> > created, then I get the NPE as listed below. Any advice on how to
> > troubleshoot further would be appreciated!
> >
> >
> > 0: jdbc:drill:zk:zeta2.brewingintel.com:5181,> select * from maprpcaps
> > limit 1;
> >
> > Error: SYSTEM ERROR: NullPointerException
> >
> >
> >
> > [Error Id: 6abaeadb-6e1b-4dce-9d1f-54b99a40becb on
> > zeta4.brewingintel.com:20005]
> >
> >
> >   (org.apache.drill.exec.work.foreman.ForemanException) Unexpected
> > exception during fragment initialization: null
> >
> > org.apache.drill.exec.work.foreman.Foreman.run():281
> >
> > java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> >
> > java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> >
> > java.lang.Thread.run():745
> >
> >   Caused By (java.lang.NullPointerException) null
> >
> >
> > org.apache.drill.exec.store.mapr.db.MapRDBGroupScan.
> applyAssignments():205
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitGroupScan():116
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitGroupScan():103
> >
> > org.apache.drill.exec.physical.base.AbstractGroupScan.accept():63
> >
> >
> > org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.
> > visitChildren():138
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitOp():134
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitOp():103
> >
> >
> > org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.
> > visitLimit():92
> >
> > org.apache.drill.exec.physical.config.Limit.accept():57
> >
> >
> > org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.
> > visitChildren():138
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitOp():134
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitOp():103
> >
> > org.apache.drill.exec.physical.config.SelectionVectorRemover.accept(
> > ):42
> >
> >
> > org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.
> > visitChildren():138
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitOp():134
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitExchange():108
> >
> >
> > org.apache.drill.exec.planner.fragment.Wrapper$
> > AssignEndpointsToScanAndStore.visitExchange():103
> >
> > org.apache.drill.exec.physical.base.AbstractExchange.accept():123
> >
> > org.apache.drill.exec.planner.fragment.Wrapper.assignEndpoints():148
> >
> >
> > org.apache.drill.exec.planner.fragment.SoftAffinityFragmentParalleliz
> > er.parallelizeFragment():86
> >
> >
> > org.apache.drill.exec.planner.fragment.SimpleParallelizer.
> > parallelizeFragment():251
> >
> >
> > org.apache.drill.exec.planner.fragment.SimpleParallelizer.
> > parallelizeFragment():243
> >
> >
> > org.apache.drill.exec.planner.fragment.SimpleParallelizer.
> > getFragmentsHelper():167
> >
> >
> > org.apache.drill.exec.planner.fragment.SimpleParallelizer.
> > getFragments():126
> >
> > org.apache.drill.exec.work.foreman.Foreman.getQueryWorkUnit():596
> >
> > org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan():426
> >
> > org.apache.drill.exec.work.foreman.Foreman.runSQL():1010
> >
> > org.apache.drill.exec.work.foreman.Foreman.run():264
> >
> > java.util.concurrent.ThreadPoolExecutor.runWorker():1142
> >
> > java.util.concurrent.ThreadPoolExecutor$Worker.run():617
> >
> > java.lang.Thread.run():745 (state=,code=0)
> >
>


NPE When Selecting from MapR-DB Table

2017-04-06 Thread John Omernik
Hello all, I am using Drill 1.8 and MapR 5.2. I just finished a large load
of data into a mapr table. I was able to confirm that the table returns
data from the c api for hbase, so no issue there, however, when I select
from the table in Drill, either from the table directly, or from a view I
created, then I get the NPE as listed below. Any advice on how to
troubleshoot further would be appreciated!


0: jdbc:drill:zk:zeta2.brewingintel.com:5181,> select * from maprpcaps
limit 1;

Error: SYSTEM ERROR: NullPointerException



[Error Id: 6abaeadb-6e1b-4dce-9d1f-54b99a40becb on
zeta4.brewingintel.com:20005]


  (org.apache.drill.exec.work.foreman.ForemanException) Unexpected
exception during fragment initialization: null

org.apache.drill.exec.work.foreman.Foreman.run():281

java.util.concurrent.ThreadPoolExecutor.runWorker():1142

java.util.concurrent.ThreadPoolExecutor$Worker.run():617

java.lang.Thread.run():745

  Caused By (java.lang.NullPointerException) null


org.apache.drill.exec.store.mapr.db.MapRDBGroupScan.applyAssignments():205


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitGroupScan():116


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitGroupScan():103

org.apache.drill.exec.physical.base.AbstractGroupScan.accept():63


org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitChildren():138


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitOp():134


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitOp():103


org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitLimit():92

org.apache.drill.exec.physical.config.Limit.accept():57


org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitChildren():138


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitOp():134


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitOp():103

org.apache.drill.exec.physical.config.SelectionVectorRemover.accept():42


org.apache.drill.exec.physical.base.AbstractPhysicalVisitor.visitChildren():138


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitOp():134


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitExchange():108


org.apache.drill.exec.planner.fragment.Wrapper$AssignEndpointsToScanAndStore.visitExchange():103

org.apache.drill.exec.physical.base.AbstractExchange.accept():123

org.apache.drill.exec.planner.fragment.Wrapper.assignEndpoints():148


org.apache.drill.exec.planner.fragment.SoftAffinityFragmentParallelizer.parallelizeFragment():86


org.apache.drill.exec.planner.fragment.SimpleParallelizer.parallelizeFragment():251


org.apache.drill.exec.planner.fragment.SimpleParallelizer.parallelizeFragment():243


org.apache.drill.exec.planner.fragment.SimpleParallelizer.getFragmentsHelper():167


org.apache.drill.exec.planner.fragment.SimpleParallelizer.getFragments():126

org.apache.drill.exec.work.foreman.Foreman.getQueryWorkUnit():596

org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan():426

org.apache.drill.exec.work.foreman.Foreman.runSQL():1010

org.apache.drill.exec.work.foreman.Foreman.run():264

java.util.concurrent.ThreadPoolExecutor.runWorker():1142

java.util.concurrent.ThreadPoolExecutor$Worker.run():617

java.lang.Thread.run():745 (state=,code=0)


Re: Rest API Authentication (1.5 Feature)

2017-04-04 Thread John Omernik
I'd like to open this discussion back up. I am working with a coworker who
pointed out that a REST service should be stateless (
https://en.wikipedia.org/wiki/Representational_state_transfer)  (relevant
quote: using a uniform and predefined set of stateless
<https://en.wikipedia.org/wiki/Stateless_protocol> operations.) That said,
I was explaining to him that I can see why Drill Rest was implemented like
this. It provides better functionality and allows for features more on par
with the JDBC and ODBC interfaces.  I also pointed out that, if you do an
auth call, you don't really have to mess with cookies: as long as you grab
the JSESSIONID returned in the cookies, you can just append it via
;jsessionid=$ID to every request and not have to format your cookies, which
seems to work pretty well.

That said, I do think it makes sense to be able to provide a basic auth
header, and then a single request similar to what Venki suggested. It would
then be stateless (if you have to do a login request prior to your query
request, it's not really a stateless call)  and work as people would expect
a rest request to work.

Thoughts?

John


On Wed, Dec 23, 2015 at 12:26 PM, John Omernik <j...@omernik.com> wrote:

> Ah that makes sense to me (to keep the session open for calls that may
> utilize more than a single request or change settings that need to persist
> from one call to the next).  Ya, I think the basic auth could be nice for
> single calls (in addition to the form based auth) Should I open a JIRA?
>
> On Wed, Dec 23, 2015 at 11:48 AM, Venki Korukanti <
> venki.koruka...@gmail.com> wrote:
>
>> I think it makes sense to have basic auth along with the form auth.
>> Currently we maintain the session which has DrillClient so that all user
>> options are preserved in the login session. In case of basic auth (which
>> are usually point API calls), we can create the DrillClient and destroy
>> after the call.
>>
>> On Wed, Dec 23, 2015 at 5:44 AM, John Omernik <j...@omernik.com> wrote:
>>
>> > Hey all,
>> >
>> > I am not sure if I am jumping the gun on this, or if it's ok to ask a
>> > question based on the MapR 1.4 Developer preview.
>> >
>> > The feature that I am talking about is Authentication on the Rest API.
>> > Basically, in the 1.4 Dev preview, the way it's done is through forms
>> based
>> > authentication. If this is how it is going to land in 1.5, I'd like to
>> > discuss here, and perhaps open a JIRA for a small alteration.
>> >
>> > Basically, the way it's setup now, if I want to say use Python
>> Requests, I
>> > have to create a cookie jar, auth with one request, and then keep the
>> state
>> > in the request.
>> >
>> > What would be nice is using basic auth instead of forms auth. Then I can
>> > include the auth credentials with the request, allowing for single
>> request
>> > API calls.  This will also play out as we use the API more for
>> > administrative tasks (updating/creating storage plugins etc).
>> >
>> > I don't mind keeping the forms based auth if no auth is provided
>> (perhaps a
>> > redirect to the login form) to keep it looking nice for Web UI users,
>> but
>> > if basic auth is supplied to allow that to be used instead of the forms
>> > auth?
>> >
>> > Basically, if there is a reason for not doing this, I wanted to test the
>> > waters here, if no reason, I'll open a JIRA. I think it would simplify
>> > programatic access down the road, and make the Rest API easier to
>> program
>> > to.
>> >
>> > John
>> >
>>
>
>


Re: Drill Parquet Partitioning Method

2017-04-04 Thread John Omernik
When I post questions like that I take a very user-centric mindset. For me,
abstracting what needs to be done here so that users can seamlessly work
with other tools makes Drill a friendly addition to any Data Science team.
If admins have to do more, or train users to handle how Drill does stuff
differently than, say, Hive or Impala, it's not greeted as warmly.  Perhaps
we could have an option in Drill to limit how many levels it recurses into
the subdirectories. Even Hive folks say any more than two gets really hairy...

On Mon, Apr 3, 2017 at 5:23 PM, Jesse Yates <jesse.k.ya...@gmail.com> wrote:

> I filed a JIRA for this a while back:
> https://issues.apache.org/jira/browse/DRILL-4615
>
> And even more recently finished a patch on our internal branch (pegged at
> 1.6 right now). Since 1.6, things have shifted a bit in the abstraction of
> managing directory partitions, so the patch wouldn't be directly
> applicable. Where I got caught up was in the directory depth issue - there
> is/was a bunch of expectation in the Easy reader, and fs partition rule,
> around directory names being tied to known depth in the directory tree. It
> makes it easy to do dir0, dir1, but hard for any other case :-/
>
> The workaround I used was having a now config for the workspace, a
> DirectoryStrategy, that uses at least a known directory name for each level
> of the hierarchy and a parsing strategy for the value. So you would have a
> directory list of, for your example, ['day'], or for a more complicated
> structure like:
>   /mytable
>  /day=1
>/hour=2
>
> a directory list like ['day', 'hour']. Not the best, but OK if you don't
> have changing directory structures for a workspace.
>
> Happy to post the patch if someone is interested, but it would take a bit
> of work (for which I don't have time right now) to finagle it into master.
>
> --Jesse
>
> On Mon, Apr 3, 2017 at 1:56 PM Jinfeng Ni <j...@apache.org> wrote:
>
> That's a good idea.
>
> Let me clarify one thing first. Drill has two kinds of partitions:
> auto partition, or directory-based partition.
>
> The first one is a result of using drill's CTAS partition by statement
> [1].  Both partition column name and column value are written and
> encoded in the output file. When querying such data, you can directly
> specify those partition columns in a WHERE/HAVING clause, and Drill
> will do partition pruning.
>
> The second one, directory-based partition, is to handle the cases
> where the input files are generated by other tools. It could be
> created by hive's or impala's partition by clause, or it could
> generated by some other ecosystem. For hive generated files, you are
> right the directory name has the pattern of "key=value", for files
> generated by other tools, we may see "value" only as the directory
> name. For the hive generated files, what we recommend today to user do
> is to created a view ( i.e. substring(dir0, 4) as dateCol), to extract
> the value from the directory name.
>
> Your suggestion makes sense to me. We probably may consider adding an
> option to the file system format plugin, to 'auto' parse the
> "key=value" pattern, and replace "dirN" with "key".   That option
> would work in a similar way as the options (skipFirstLine,
> extractHeader, etc) ) we added for text file [2]. I agree with you
> that it would make Drill more seamlessly work with other ecosystems.
>
> If we do that, I'm not sure if Drill could error out for "select *
> from mytable where day = '2017-04-01' ", if there is no "day" field in
> the directory names. The thing is day could come from either
> directory, or from data files. That probably need long discussion
> about how Drill should handle a possible non-exist column.
>
>
> 1. https://drill.apache.org/docs/partition-by-clause/
> 2. https://drill.apache.org/docs/text-files-csv-tsv-psv/
>
> On Mon, Apr 3, 2017 at 11:08 AM, John Omernik <j...@omernik.com> wrote:
> > So as a user of Drill now for a while, I have gotten used to the idea of
> > partitions just being values, instead of key=value like other things
> (hive,
> > impala, others).
> >
> > From a user/analyst perspective, the dir0, dir1, dirN methodology
> provides
> > quite a bit of flexibility, but to be intuitive, we have to know what
> that
> > field is... thus there has to be some transfer of knowledge on what that
> > value for directory names are.
> >
> > With other methods, the key is right there in the directory name.
> >
> > Now, I am really getting to the nitty gritty here, I know we could do
> > things like create a view to name the dir0 to be something.
> >
>

Drill Parquet Partitioning Method

2017-04-03 Thread John Omernik
So as a user of Drill now for a while, I have gotten used to the idea of
partitions just being values, instead of key=value like other things (hive,
impala, others).

From a user/analyst perspective, the dir0, dir1, dirN methodology provides
quite a bit of flexibility, but to be intuitive, we have to know what that
field is... thus there has to be some transfer of knowledge on what that
value for directory names are.

With other methods, the key is right there in the directory name.

Now, I am really getting to the nitty gritty here, I know we could do
things like create a view to name the dir0 to be something.

For example:

Drill Method:
mytable
- 2017-04-01
- 2017-04-02
- 2017-04-03

Vs.

Hive method:
mytable
- day=2017-04-01
- day=2017-04-02
- day=2017-04-03


However, it takes extra admin effort, and hive, spark, etc all know the
key=value method.
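
(For concreteness, the view workaround I'm referring to looks roughly like
this today; the workspace and column names below are just placeholders:

create or replace view dfs.tmp.mytable_by_day as
select dir0 as `day`, col_a, col_b
from dfs.`/data/mytable`;

-- and for hive-style key=value directory names you end up string-parsing:
-- select substr(dir0, 5) as `day`, ... from dfs.`/data/mytable_hive`;

It works, but every admin has to hand-maintain those views.)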

Drill stands on its own here.  So, my thought is this: dir0 is nice, it
provides flexibility.  But why not have Drill be able to infer key=value,
and, when writing partitions (although I don't think Drill does this yet),
write using the alias specified?

The more important part is the reading, as the writing doesn't really work
yet. (We don't INSERT INTO mytable PARTITION BY day like we do in Hive; if
we want to write a partition, we create table mytable/partition and thus
could easily put the key=value in there as needed.)

So the reading.  A. This could not break anything existing.  Thus, dir0
must always work. B. Can we use a select option to enable/disable? (Would
we even need this?).

Basically, if there is an = in the partition name, split on the =, treat
the right side as the value and the left side as the alias.

The hard parts:

The planner would have to be aware of this, so when a scan of the directory
occurs, the field name as an alias could be valid...

If I did "select * from mytable where day = '2017-04-01' but that field
didn't exist, it could error out, that said, we should know that the
directories have Key=Value format when we scan for files... it's not like
that is impossible (especially since we don't know what fields are in the
parquet files unless we do metadata).

This would also be something we should include in metadata... If we do
Key=value then boom, write to the metadata cache, and speed up planning!

So why do I think we need this?

It would sure make data created by other sources easier/quicker to read. We
wouldn't be string parsing directory names at query time, and it would just
be another avenue to make Drill a natural fit in the ecosystem...


I would be interested in community thoughts here, if there is interest I
will make a Jira


John


Re: Discussion: Comments in Drill Views

2017-03-06 Thread John Omernik
I can see both sides. But Ted is right, this won't hurt anything from a
performance perspective; even if they put War and Peace in there 30 times,
that's about 100 MB of information to serve. People may choose to use
formatting languages like Markdown or something. I do think we should have
a limit so we know what happens if someone tries to break that limit (from
a security perspective), but we could set it quite high, and then just test
putting data that exceeds it as a unit test.



On Fri, Mar 3, 2017 at 8:28 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> All of War and Peace is only 3MB.
>
> Let people document however they want. Don't over-optimize for problems
> that have never occurred.
>
>
>
> On Fri, Mar 3, 2017 at 3:19 PM, Kunal Khatua <kkha...@mapr.com> wrote:
>
> > It might be, incase someone begins to dump a massive design doc into the
> > comment field for a view's JSON.
> >
> >
> > I'm also not sure about how this information can be consumed. If it is
> > through CLI, either we rely on the SQLLine shell to trim the output, or
> not
> > worry at all. I'm assuming we'd also probably want something like a
> >
> > DESCRIBE VIEW ...
> >
> > to be enhanced to something like
> >
> > DESCRIBE VIEW WITH COMMENTARY ...
> >
> >
> > A 1KB field is quite generous IMHO. That's more than 7 tweets to describe
> > something !
> >
> >
> > Kunal Khatua
> >
> > 
> > From: Ted Dunning <ted.dunn...@gmail.com>
> > Sent: Friday, March 3, 2017 12:56:44 PM
> > To: user
> > Subject: Re: Discussion: Comments in Drill Views
> >
> > It it really necessary to put a technical limit in to prevent people from
> > OVER-documenting views?
> >
> >
> > What is the last time you saw code that had too many comments in it?
> >
> >
> >
> > On Thu, Mar 2, 2017 at 8:42 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > So I think on your worry that's an easily definable "abuse"
> condition...
> > > i.e. if we set a limit of say 1024 characters, that provides ample
> space
> > > for descriptions, but at 1kb per view, that's an allowable condition,
> > i.e.
> > > it would be hard to abuse it ... or am I missing something?
> > >
> > > On Wed, Mar 1, 2017 at 8:08 PM, Kunal Khatua <kkha...@mapr.com> wrote:
> > >
> > > > +1
> > > >
> > > >
> > > > I this this can be very useful. The only worry is of someone abusing
> > it,
> > > > so we probably should have a limit on the size of this? Not sure else
> > it
> > > > could be exposed and consumed.
> > > >
> > > >
> > > > Kunal Khatua
> > > >
> > > > Engineering
> > > >
> > > > [MapR]
> > > >
> > > > www.mapr.com
> > > >
> > > > 
> > > > From: John Omernik <j...@omernik.com>
> > > > Sent: Wednesday, March 1, 2017 9:55:27 AM
> > > > To: user
> > > > Subject: Re: Discussion: Comments in Drill Views
> > > >
> > > > Sorry, I let this idea drop (I didn't follow up and found when
> > searching
> > > > for something else...)  Any other thoughts on this idea?
> > > >
> > > > Should I open a JIRA if people think it would be handy?
> > > >
> > > > On Thu, Jun 23, 2016 at 4:02 PM, Ted Dunning <ted.dunn...@gmail.com>
> > > > wrote:
> > > >
> > > > > This is very interesting.  I love docstrings in Lisp and Python and
> > > > Javadoc
> > > > > in Java.
> > > > >
> > > > > Basically this is like that, but for SQL. Very helpful.
> > > > >
> > > > > On Thu, Jun 23, 2016 at 11:48 AM, John Omernik <j...@omernik.com>
> > > wrote:
> > > > >
> > > > > > I am looking for discussion here. A colleague was asking me how
> to
> > > add
> > > > > > comments to the metadata of a view.  (He's new to Drill, thus the
> > > idea
> > > > of
> > > > > > not having metadata for a table is one he's warming up to).
> > > > > >
> > > > > > That got me thinking... why couldn't we use Drill Views to store
> > > > > > table/field comments?  This could be a great way to help add
> > > c

Re: [Drill 1.9.0] : [CONNECTION ERROR] :- (user client) closed unexpectedly. Drillbit down?

2017-03-06 Thread John Omernik
Have you tried disabling hash joins or hash agg on the query or changing
the planning width? Here are some docs to check out:

https://drill.apache.org/docs/configuring-resources-for-a-shared-drillbit/

https://drill.apache.org/docs/guidelines-for-optimizing-aggregation/

https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/

Let us know if any of these have an effect on the queries...

Also, the three links I posted here cover query-based changes, so an ALTER
SESSION should address them. The memory suggestion above, on the other
hand, WOULD have to be made on all running Drillbits, and would require a
restart of each Drillbit to take effect.
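
For reference, the session-level changes I have in mind look something like
this (the option names are standard Drill options; treat the values as
starting points rather than gospel):

ALTER SESSION SET `planner.enable_hashagg` = false;
ALTER SESSION SET `planner.enable_hashjoin` = false;
-- lower per-node parallelism so each fragment gets more memory
ALTER SESSION SET `planner.width.max_per_node` = 4;

Then re-run the problem query in the same session and compare the profiles.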



On Sat, Mar 4, 2017 at 1:01 PM, Anup Tiwari <anup.tiw...@games24x7.com>
wrote:

> Hi John,
>
> I have tried above config as well but still getting this issue.
> And please note that we were using similar configuration params for Drill
> 1.6 where this issue was not coming.
> Anything else which i can try?
>
> Regards,
> *Anup Tiwari*
>
> On Fri, Mar 3, 2017 at 11:01 PM, Abhishek Girish <agir...@apache.org>
> wrote:
>
> > +1 on John's suggestion.
> >
> > On Fri, Mar 3, 2017 at 6:24 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > So your node has 32G of ram yet you are allowing Drill to use 36G.  I
> > would
> > > change your settings to be 8GB of Heap, and 22GB of Direct Memory. See
> if
> > > this helps with your issues.  Also, are you using a distributed
> > filesystem?
> > > If so you may want to allow even more free ram...i.e. 8GB of Heap and
> > 20GB
> > > of Direct.
> > >
> > > On Fri, Mar 3, 2017 at 8:20 AM, Anup Tiwari <anup.tiw...@games24x7.com
> >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Please find our configuration details :-
> > > >
> > > > Number of Nodes : 4
> > > > RAM/Node : 32GB
> > > > Core/Node : 8
> > > > DRILL_MAX_DIRECT_MEMORY="20G"
> > > > DRILL_HEAP="16G"
> > > >
> > > > And all other variables are set to default.
> > > >
> > > > Since we have tried some of the settings suggested above but still
> > facing
> > > > this issue more frequently, kindly suggest us what is best
> > configuration
> > > > for our environment.
> > > >
> > > > Regards,
> > > > *Anup Tiwari*
> > > >
> > > > On Thu, Mar 2, 2017 at 1:26 AM, John Omernik <j...@omernik.com>
> wrote:
> > > >
> > > > > Another thing to consider is ensure you have a Spill Location
> setup,
> > > and
> > > > > then disable hashagg/hashjoin for the query...
> > > > >
> > > > > On Wed, Mar 1, 2017 at 1:25 PM, Abhishek Girish <
> agir...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hey Anup,
> > > > > >
> > > > > > This is indeed an issue, and I can understand that having an
> > unstable
> > > > > > environment is not something anyone wants. DRILL-4708 is still
> > > > > unresolved -
> > > > > > hopefully someone will get to it soon. I've bumped up the
> priority.
> > > > > >
> > > > > > Unfortunately we do not publish any sizing guidelines, so you'd
> > have
> > > to
> > > > > > experiment to settle on the right load for your cluster. Please
> > > > decrease
> > > > > > the concurrency (number of queries running in parallel). And try
> > > > bumping
> > > > > up
> > > > > > Drill DIRECT memory. Also, please set the system options
> > recommended
> > > by
> > > > > > Sudheesh. While this may not solve the issue, it may help reduce
> > it's
> > > > > > occurrence.
> > > > > >
> > > > > > Can you also update the JIRA with your configurations, type of
> > > queries
> > > > > and
> > > > > > the relevant logs?
> > > > > >
> > > > > > -Abhishek
> > > > > >
> > > > > > On Wed, Mar 1, 2017 at 10:17 AM, Anup Tiwari <
> > > > anup.tiw...@games24x7.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Can someone look into it? As we are now gett

Re: [Drill 1.9.0] : [CONNECTION ERROR] :- (user client) closed unexpectedly. Drillbit down?

2017-03-03 Thread John Omernik
So your node has 32G of RAM yet you are allowing Drill to use 36G.  I would
change your settings to be 8GB of Heap and 22GB of Direct Memory. See if
this helps with your issues.  Also, are you using a distributed filesystem?
If so, you may want to leave even more RAM free, i.e. 8GB of Heap and 20GB
of Direct.
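
Once the settings are changed and the Drillbits restarted, it may also be
worth sanity-checking what each node actually picked up. Assuming your
build exposes the sys.memory table (column names can vary a bit between
releases), something like this should show it:

select hostname, heap_max, direct_max from sys.memory;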

On Fri, Mar 3, 2017 at 8:20 AM, Anup Tiwari <anup.tiw...@games24x7.com>
wrote:

> Hi,
>
> Please find our configuration details :-
>
> Number of Nodes : 4
> RAM/Node : 32GB
> Core/Node : 8
> DRILL_MAX_DIRECT_MEMORY="20G"
> DRILL_HEAP="16G"
>
> And all other variables are set to default.
>
> Since we have tried some of the settings suggested above but still facing
> this issue more frequently, kindly suggest us what is best configuration
> for our environment.
>
> Regards,
> *Anup Tiwari*
>
> On Thu, Mar 2, 2017 at 1:26 AM, John Omernik <j...@omernik.com> wrote:
>
> > Another thing to consider is ensure you have a Spill Location setup, and
> > then disable hashagg/hashjoin for the query...
> >
> > On Wed, Mar 1, 2017 at 1:25 PM, Abhishek Girish <agir...@apache.org>
> > wrote:
> >
> > > Hey Anup,
> > >
> > > This is indeed an issue, and I can understand that having an unstable
> > > environment is not something anyone wants. DRILL-4708 is still
> > unresolved -
> > > hopefully someone will get to it soon. I've bumped up the priority.
> > >
> > > Unfortunately we do not publish any sizing guidelines, so you'd have to
> > > experiment to settle on the right load for your cluster. Please
> decrease
> > > the concurrency (number of queries running in parallel). And try
> bumping
> > up
> > > Drill DIRECT memory. Also, please set the system options recommended by
> > > Sudheesh. While this may not solve the issue, it may help reduce it's
> > > occurrence.
> > >
> > > Can you also update the JIRA with your configurations, type of queries
> > and
> > > the relevant logs?
> > >
> > > -Abhishek
> > >
> > > On Wed, Mar 1, 2017 at 10:17 AM, Anup Tiwari <
> anup.tiw...@games24x7.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Can someone look into it? As we are now getting this more frequently
> in
> > > > Adhoc queries as well.
> > > > And for automation jobs, we are moving to Hive as in drill we are
> > getting
> > > > this more frequently.
> > > >
> > > > Regards,
> > > > *Anup Tiwari*
> > > >
> > > > On Sat, Dec 31, 2016 at 12:11 PM, Anup Tiwari <
> > anup.tiw...@games24x7.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are getting this issue bit more frequently. can someone please
> > look
> > > > > into it and tell us that why it is happening since as mention in
> > > earlier
> > > > > mail when this query gets executed no other query is running at
> that
> > > > time.
> > > > >
> > > > > Thanks in advance.
> > > > >
> > > > > Regards,
> > > > > *Anup Tiwari*
> > > > >
> > > > > On Sat, Dec 24, 2016 at 10:20 AM, Anup Tiwari <
> > > anup.tiw...@games24x7.com
> > > > >
> > > > > wrote:
> > > > >
> > > > >> Hi Sudheesh,
> > > > >>
> > > > >> Please find below ans :-
> > > > >>
> > > > >> 1. Total 4,(3 Datanodes, 1 namenode)
> > > > >> 2. Only one query, as this query is part of daily dump and runs in
> > > early
> > > > >> morning.
> > > > >>
> > > > >> And as @chun mentioned , it seems similar to DRILL-4708 , so any
> > > update
> > > > >> on progress of this ticket?
> > > > >>
> > > > >>
> > > > >> On 22-Dec-2016 12:13 AM, "Sudheesh Katkam" <skat...@maprtech.com>
> > > > wrote:
> > > > >>
> > > > >> Two more questions..
> > > > >>
> > > > >> (1) How many nodes in your cluster?
> > > > >> (2) How many queries are running when the failure is seen?
> > > > >>
> > > > >> If you have multiple large queries running at the same time, the
> > load
> > > on
> > >

Re: Minimise query plan time for dfs plugin for local file system on tsv file

2017-03-03 Thread John Omernik
Can you help me understand what "local to the cluster" means in the context
of a 5-node cluster? In the plan, the files are all file://. Are the files
replicated to each node? Is it a common shared filesystem?  Do all 5 nodes
have equal access to the 10 files? I wonder if using a local FS in a
distributed cluster is having some effect on the planning...

On Fri, Mar 3, 2017 at 6:08 AM, PROJJWAL SAHA  wrote:

> I did not change the default values used by drill.
> Are you talking of changing planner.memory_limit
> and planner.memory.max_query_memory_per_node ?
> If there are any other debug work that I can do, pls suggest
>
> Regards
>
> On Fri, Mar 3, 2017 at 5:14 PM, Nitin Pawar 
> wrote:
>
> > how much memory have you set for planner ?
> >
> > On Fri, Mar 3, 2017 at 5:06 PM, PROJJWAL SAHA 
> wrote:
> >
> > > Hello all,
> > >
> > > I am quering select * from dfs.xxx where yyy (filter condition)
> > >
> > > I am using dfs storage plugin that comes out of the box from drill on a
> > > 1GB file, local to the drill cluster.
> > > The 1GB file is split into 10 files of 100 MB each.
> > > As expected I see 11 minor and 2 major fagments.
> > > The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
> > >
> > > One observation is that the query plan time is more than 30 seconds. I
> > ran
> > > the explain plan query to validate this.
> > > The query execution time is 2 secs.
> > > total time taken is 32secs
> > >
> > > I wanted to understand how can i minimise the query plan time.
> > Suggestions
> > > ?
> > > Is the time taken described above expected ?
> > > Attached is result from explain plan query
> > >
> > > Regards,
> > > Projjwal
> > >
> > >
> >
> >
> > --
> > Nitin Pawar
> >
>


Re: Discussion: Comments in Drill Views

2017-03-02 Thread John Omernik
So I think, on your worry, that's an easily definable "abuse" condition...
i.e. if we set a limit of, say, 1024 characters, that provides ample space
for descriptions, but at 1 KB per view, that's an allowable condition, i.e.
it would be hard to abuse it... or am I missing something?

On Wed, Mar 1, 2017 at 8:08 PM, Kunal Khatua <kkha...@mapr.com> wrote:

> +1
>
>
> I this this can be very useful. The only worry is of someone abusing it,
> so we probably should have a limit on the size of this? Not sure else it
> could be exposed and consumed.
>
>
> Kunal Khatua
>
> Engineering
>
> [MapR]
>
> www.mapr.com
>
> 
> From: John Omernik <j...@omernik.com>
> Sent: Wednesday, March 1, 2017 9:55:27 AM
> To: user
> Subject: Re: Discussion: Comments in Drill Views
>
> Sorry, I let this idea drop (I didn't follow up and found when searching
> for something else...)  Any other thoughts on this idea?
>
> Should I open a JIRA if people think it would be handy?
>
> On Thu, Jun 23, 2016 at 4:02 PM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > This is very interesting.  I love docstrings in Lisp and Python and
> Javadoc
> > in Java.
> >
> > Basically this is like that, but for SQL. Very helpful.
> >
> > On Thu, Jun 23, 2016 at 11:48 AM, John Omernik <j...@omernik.com> wrote:
> >
> > > I am looking for discussion here. A colleague was asking me how to add
> > > comments to the metadata of a view.  (He's new to Drill, thus the idea
> of
> > > not having metadata for a table is one he's warming up to).
> > >
> > > That got me thinking... why couldn't we use Drill Views to store
> > > table/field comments?  This could be a great way to help add contextual
> > > information for users. Here's some current observations when I issue a
> > > describe view_myview
> > >
> > >
> > > 1. I get three columns ,COLUMN_NAME, DATA_TYPE, and IS_NULLABLE
> > > 2. Even thought the underlying parquet table has types, the view does
> not
> > > pass the types for the underlying parquet files through.  (The type is
> > ANY)
> > > 3. The data for the view is all just a json file that could be easily
> > > extended.
> > >
> > >
> > > So, a few things would be a nice to have
> > >
> > > 1. Table comments.  when I issue a describe table, if the view has a
> > > "Description" field, then having that print out as a description for
> the
> > > whole view would be nice.  This is harder, I think because it's not
> just
> > > extending the view information.
> > >
> > > 2. Column comments:  A text field that could be added to the view, and
> > just
> > > print out another column with description.  This would be very helpful.
> > > While Drill being schemaless is awesome, the ability to add information
> > to
> > > known data, is huge.
> > >
> > > 3. Ability to to use the types from the Parquet files (without manually
> > > specifying each type).  If we could provide an option to View creation
> to
> > > attempt to infer type, that would be handy. I realize that folks are
> > using
> > > the LIMIT 0 to get metadata, but describe could be done well too.
> > >
> > > 4. Ability, using ANSI Sql to update the view column descriptions and
> the
> > > description for the view itself.
> > >
> > > 5. I believe Avro has the ability to add this information to the files,
> > so
> > > if the data exists outside of views (such as in AVRO files) should we
> > > present it to the user in describe table events as well?
> > >
> > > Curious if folks think this would be valuable, how much work an
> addition
> > > like this would be to Drill, and other thoughts in general.
> > >
> > >
> > > John
> > >
> >
>


Re: [Drill 1.9.0] : [CONNECTION ERROR] :- (user client) closed unexpectedly. Drillbit down?

2017-03-01 Thread John Omernik
Another thing to consider is ensure you have a Spill Location setup, and
then disable hashagg/hashjoin for the query...

On Wed, Mar 1, 2017 at 1:25 PM, Abhishek Girish  wrote:

> Hey Anup,
>
> This is indeed an issue, and I can understand that having an unstable
> environment is not something anyone wants. DRILL-4708 is still unresolved -
> hopefully someone will get to it soon. I've bumped up the priority.
>
> Unfortunately we do not publish any sizing guidelines, so you'd have to
> experiment to settle on the right load for your cluster. Please decrease
> the concurrency (number of queries running in parallel). And try bumping up
> Drill DIRECT memory. Also, please set the system options recommended by
> Sudheesh. While this may not solve the issue, it may help reduce it's
> occurrence.
>
> Can you also update the JIRA with your configurations, type of queries and
> the relevant logs?
>
> -Abhishek
>
> On Wed, Mar 1, 2017 at 10:17 AM, Anup Tiwari 
> wrote:
>
> > Hi,
> >
> > Can someone look into it? As we are now getting this more frequently in
> > Adhoc queries as well.
> > And for automation jobs, we are moving to Hive as in drill we are getting
> > this more frequently.
> >
> > Regards,
> > *Anup Tiwari*
> >
> > On Sat, Dec 31, 2016 at 12:11 PM, Anup Tiwari  >
> > wrote:
> >
> > > Hi,
> > >
> > > We are getting this issue bit more frequently. can someone please look
> > > into it and tell us that why it is happening since as mention in
> earlier
> > > mail when this query gets executed no other query is running at that
> > time.
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > *Anup Tiwari*
> > >
> > > On Sat, Dec 24, 2016 at 10:20 AM, Anup Tiwari <
> anup.tiw...@games24x7.com
> > >
> > > wrote:
> > >
> > >> Hi Sudheesh,
> > >>
> > >> Please find below ans :-
> > >>
> > >> 1. Total 4,(3 Datanodes, 1 namenode)
> > >> 2. Only one query, as this query is part of daily dump and runs in
> early
> > >> morning.
> > >>
> > >> And as @chun mentioned , it seems similar to DRILL-4708 , so any
> update
> > >> on progress of this ticket?
> > >>
> > >>
> > >> On 22-Dec-2016 12:13 AM, "Sudheesh Katkam" 
> > wrote:
> > >>
> > >> Two more questions..
> > >>
> > >> (1) How many nodes in your cluster?
> > >> (2) How many queries are running when the failure is seen?
> > >>
> > >> If you have multiple large queries running at the same time, the load
> on
> > >> the system could cause those failures (which are heartbeat related).
> > >>
> > >> The two options I suggested decrease the parallelism of stages in a
> > >> query, this implies lesser load but slower execution.
> > >>
> > >> System level option affect all queries, and session level affect
> queries
> > >> on a specific connection. Not sure what is preferred in your
> > environment.
> > >>
> > >> Also, you may be interested in metrics. More info here:
> > >>
> > >> http://drill.apache.org/docs/monitoring-metrics/ <
> > >> http://drill.apache.org/docs/monitoring-metrics/>
> > >>
> > >> Thank you,
> > >> Sudheesh
> > >>
> > >> > On Dec 21, 2016, at 4:31 AM, Anup Tiwari  >
> > >> wrote:
> > >> >
> > >> > @sudheesh, yes drill bit is running on datanodeN/10.*.*.5:31010).
> > >> >
> > >> > Can you tell me how this will impact to query and do i have to set
> > this
> > >> at
> > >> > session level OR system level?
> > >> >
> > >> >
> > >> >
> > >> > Regards,
> > >> > *Anup Tiwari*
> > >> >
> > >> > On Tue, Dec 20, 2016 at 11:59 PM, Chun Chang 
> > >> wrote:
> > >> >
> > >> >> I am pretty sure this is the same as DRILL-4708.
> > >> >>
> > >> >> On Tue, Dec 20, 2016 at 10:27 AM, Sudheesh Katkam <
> > >> skat...@maprtech.com>
> > >> >> wrote:
> > >> >>
> > >> >>> Is the drillbit service (running on datanodeN/10.*.*.5:31010)
> > actually
> > >> >>> down when the error is seen?
> > >> >>>
> > >> >>> If not, try lowering parallelism using these two session options,
> > >> before
> > >> >>> running the queries:
> > >> >>>
> > >> >>> planner.width.max_per_node (decrease this)
> > >> >>> planner.slice_target (increase this)
> > >> >>>
> > >> >>> Thank you,
> > >> >>> Sudheesh
> > >> >>>
> > >>  On Dec 20, 2016, at 12:28 AM, Anup Tiwari <
> > anup.tiw...@games24x7.com
> > >> >
> > >> >>> wrote:
> > >> 
> > >>  Hi Team,
> > >> 
> > >>  We are running some drill automation script on a daily basis and
> we
> > >> >> often
> > >>  see that some query gets failed frequently by giving below error
> ,
> > >> >> Also i
> > >>  came across DRILL-4708 <https://issues.apache.org/jira/browse/DRILL-4708>
> > >> 
> > >>  which seems similar, Can anyone give me update on that OR
> > workaround
> > >> to
> > >>  avoid such issue ?
> > >> 
> > >>  *Stack Trace :-*
> > >> 
> > >>  Error: CONNECTION ERROR: Connection /10.*.*.1:41613 <-->
> > >>  datanodeN/10.*.*.5:31010 (user 

Re: Drill Views and Typed Columns

2017-03-01 Thread John Omernik
Ok, as you can see I am reviving old threads :)
I am back to working with Caravel (now called Superset) with Charles, and
we've moved to a Python interface based on the REST API (much cleaner,
easier to work with, and working better out of the box!)

That said, the issue of typed description of tables is still a challenge.

For views that are typed, great, this is easy. The standard "DESCRIBE" of
the view produces the results in a format that makes sense to the dbapi and
to Caravel. I am having great success on tables that I create a view for.
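
(By a "typed" view I mean roughly the following sketch; the workspace and
column names are placeholders for whatever is in the underlying parquet:

create or replace view dfs.tmp.charging_v as
select
  cast(event_ts as timestamp)  as event_ts,
  cast(`state` as varchar(32)) as `state`,
  cast(amount as double)       as amount
from dfs.`/data/parquet_folder`;

With explicit casts like that, DESCRIBE on the view comes back with real
types instead of ANY.)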

However, I "feel" like DESCRIBE table "should" be able to handle some other
situations that given a few tunables would be better than the limit 0
method.

So to demonstrate: the REST API provides no data on DESCRIBE for a parquet
folder, and a LIMIT 0 returns nothing at all (see my results below).

Thus, while ODBC may provide more with Limit 0, instead of hacking Rest to
use the weird limit 0 like ODBC, why don't we fix DESCRIBE with tunables?

I.e.

DESCRIBE parquet_folder... read the first parquet file in there, get the
metadata, return the columns and types if other files change schema...
well, the query will break too?  Perhaps a read of cached meta data?
Perhaps a tunable that would say read X files ( -1 being all) to return the
schema of the folder.

Also, DESCRIBE csv folder: Well, if columns/headers are set up, then just
come back with columns[0], columns[1], etc.; if header reading for describe
is turned on, then have describe read the first line and return the data...
Types? ANY... we can default there; just knowing the column names would be
a huge win.  Same for other formats...

Potentially we could allow each storage plugin to override DESCRIBE table
so that it returns the basics Drill needs to return... name, type (ANY or
...) and then other fields that people who know the data better than I do
could handle...

The reason this is frustrating when looking at superset (and I would
imagine other tools) is that we are now providing multiple behaviors
dependent on the access method to Drill... I.e. if your initial response to
my post is "John is silly, just tell him to use LIMIT 0 with ODBC", then
explain to me how JDBC works? How Rest works? How it differs?  Why not have
the interface be the same, i.e. DESCRIBE table, which is commonly known,
and provide overridable data based on the storage plugin... perhaps some
JDBC connections or Mongo may provide more columns, that's fine!  Whatever
is querying Drill may be able to handle those, but what is the basic
information required?  By putting it into a limit 0 we have to rely on
metadata which isn't consistent or exposed via all access methods in Drill,
so let's be consistent and produce an interface for describe!

I feel like I am rallying troops for some sort of revolution, but in
reality, just trying to sort out my thoughts after living in SQLAlchemy for
too long!

Also, if you are interested in the work with superset, I am mid doc update.
I used to use pyodbc and the odbc interface, but Charles pushed me to look
at the rest interface, and using some of his code, and code from drillpy
and pydrill, I think this is the best way going forward (and I've gotten
the most things in superset to work this way too!)

https://github.com/JohnOmernik/sqlalchemy-drill

DESCRIBE parquet_folder
200
{
  "columns" : [ ],
  "rows" : [ { } ]
}


select * from parquet_folder limit 0
200
{
  "columns" : [ ],
  "rows" : [ { } ]
}


select * from parquet_folder limit 1
200
{
"columns": ["all", "my","columns"],
"rows": [{"all":1, "my":2, "columns":"fun!"}]
}



On Mon, May 16, 2016 at 7:25 PM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> John,
>
> Using a simple tool like squirrel and running a select * limit 0 against
> parquet it is able to retrieve the column names, and if you look at the
> returned metadata it identified the data type. Obviously schema change will
> be a challenge to deal with with limit 0 queries. This is where views can
> be used to force data types for the end tool, skip rows, substitute values,
> skip added columns, etc, which makes it much better for most tools to work
> with the data.
>
> A function like typeof does require a row to be returned to identify the
> data type.
>
>
> Ideally it will be great to do initial schema discovery with a tool,
> associate the metadata with the data sources and then make it available for
> wider use. A typeof histogram will be brilliant on data structures with
> schema changes, that way it will be possible to see how many records are
> affected of a certain type/change/etc. A endless list of possibilities. It
> should be feasible to utilize Drill as the execution engine with a smart
> tool on top of it to p

Re: Discussion: Comments in Drill Views

2017-03-01 Thread John Omernik
Sorry, I let this idea drop (I didn't follow up and found when searching
for something else...)  Any other thoughts on this idea?

Should I open a JIRA if people think it would be handy?

On Thu, Jun 23, 2016 at 4:02 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> This is very interesting.  I love docstrings in Lisp and Python and Javadoc
> in Java.
>
> Basically this is like that, but for SQL. Very helpful.
>
> On Thu, Jun 23, 2016 at 11:48 AM, John Omernik <j...@omernik.com> wrote:
>
> > I am looking for discussion here. A colleague was asking me how to add
> > comments to the metadata of a view.  (He's new to Drill, thus the idea of
> > not having metadata for a table is one he's warming up to).
> >
> > That got me thinking... why couldn't we use Drill Views to store
> > table/field comments?  This could be a great way to help add contextual
> > information for users. Here's some current observations when I issue a
> > describe view_myview
> >
> >
> > 1. I get three columns ,COLUMN_NAME, DATA_TYPE, and IS_NULLABLE
> > 2. Even thought the underlying parquet table has types, the view does not
> > pass the types for the underlying parquet files through.  (The type is
> ANY)
> > 3. The data for the view is all just a json file that could be easily
> > extended.
> >
> >
> > So, a few things would be a nice to have
> >
> > 1. Table comments.  when I issue a describe table, if the view has a
> > "Description" field, then having that print out as a description for the
> > whole view would be nice.  This is harder, I think because it's not just
> > extending the view information.
> >
> > 2. Column comments:  A text field that could be added to the view, and
> just
> > print out another column with description.  This would be very helpful.
> > While Drill being schemaless is awesome, the ability to add information
> to
> > known data, is huge.
> >
> > 3. Ability to to use the types from the Parquet files (without manually
> > specifying each type).  If we could provide an option to View creation to
> > attempt to infer type, that would be handy. I realize that folks are
> using
> > the LIMIT 0 to get metadata, but describe could be done well too.
> >
> > 4. Ability, using ANSI SQL, to update the view column descriptions and the
> > description for the view itself.
> >
> > 5. I believe Avro has the ability to add this information to the files,
> so
> > if the data exists outside of views (such as in AVRO files) should we
> > present it to the user in describe table events as well?
> >
> > Curious if folks think this would be valuable, how much work an addition
> > like this would be to Drill, and other thoughts in general.
> >
> >
> > John
> >
>


Re: debugging bad input

2017-03-01 Thread John Omernik
The first thing I would try is turning on verbose errors.

The setting for that is exec.errors.verbose

I use select * from sys.options quite a bit when determining how to
approach problems.
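
For example, something like this narrows it down quickly (adjust the LIKE
pattern to whatever you are hunting for):

select * from sys.options where name like 'exec.errors%';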

To alter your session to use verbose errors type

ALTER SESSION set `exec.errors.verbose` = true;

Then you may get more data back on your error.

On Wed, Mar 1, 2017 at 10:34 AM, Wesley Chow  wrote:

> Is there any guidance on finding a needle-in-a-haystack input error in a
> ton of data? For example, I've got one row in a csv file amongst thousands
> containing tens of millions of rows that has as its first column a string
> instead of a number like it should be. Is there some way to get Drill to
> tell me which file that row is in? Something like the dirX columns would
> work, since I can select for the row.
>
> Note, these are CSV files hosted in S3.
>
> Thanks,
> Wes
>


Re: Dealing with bad data when trying to do date computations

2017-03-01 Thread John Omernik
So what would need to be done to get this process kick started?  I see a
few components here:

1. Develop the table in sys (sys.functions) that stores the information
about the function.  For this, I propose the following for discussion:

name - The name of the function
description - The description of the function
docgroup - This should be the groupings under "SQL Functions" in the
documentation. See * below for the proposal
tags - A way to tag functions for easy searching, like select * from
sys.functions where tags like '%string%'
arguments - The ordered list of arguments and their types
return - What is returned (and its type)
examples - Usage examples that can display nicely in both select * from
sys.functions as well as provide enough information for use in HTML docs.
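
To make that concrete, an analyst lookup against the proposed table might
look something like this (the columns are just the proposal above; none of
this exists in Drill yet):

select name, description, docgroup
from sys.functions
where tags like '%string%';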


* docgroup.

Currently, in the documentation, the nesting is as below (I didn't expand
all the stuff on SQL Window and Nested Data). I want to outline my proposal
for how to handle this... So I propose moving "SQL Window Functions",
"Nested Data Functions" and "Query Directory Functions" all under SQL
Functions and adding another group here "Common Functions".  This will now
be the "Functions root".  The Docgroup field in sys.function will have
levels and grouping separated by ":".  Thus, firstlevel:secondlevel or
firstlevel:secondlevel:thirdlevel.  This will allow us to keep the grouping
in the sys.functions table, and make it so that on every release, the
documentation could be updated with a query.  Note the level "SQL
Functions" would not be represented in the docgroup.

Here are some examples of functions

LOWER()
Current Doc Placement: SQL Functions -> String Manipulation
Proposed docgroup: 'Common Functions:String Manipulation'
How it would appear on the Doc page: SQL Functions -> Common Functions ->
String Manipulation

COUNT()
Current Doc Placement: SQL Functions -> SQL Window Functions -> Aggregate
Window Functions
Proposed docgroup: 'SQL Window Functions:Aggregate Window Functions'
How it would appear on the Doc pages: SQL Functions -> SQL Window Functions
-> Aggregate Window Functions

Basically it would be used to add the raw HTML to the final page in the
docgroup (Aggregate Window Functions)


Current Doc Layout:

SQL Reference
-> SQL Reference Introduction
-> Data Types
-> Lexical Structure
-> Operators
-> SQL Functions
->-> About SQL Function Examples
->-> Math and Trig
->-> Data Type Conversion
->-> Date/Time Functions and Arithmetic
->-> String Manipulation
->-> Aggregate and Aggregate Statistical
->-> Functions for handling Nulls
-> SQL Window Functions
->-> ...
->-> ...
-> Nested Data Functions
->-> ...
->-> ...
-> Query Directory Functions

2. Once we get to a certain point in development on sys.functions, we need
a call to arms.  We need to come up with an initial "fill" of
sys.functions.  For that we'll need to take current data fill it in, as
well as getting a list of all the ninja functions that have been added to
drill and not documented... not sure how get those with out an intense
code/jira review.

3.  Come up with new function proposal guidelines.  If you do a pull
request with a function, what will need to be included for your pull
request to be approved? We should not allow functions to be added to Drill
without a basic doc addition.

4.  Update procedures?

This is complicated, but done well, it could really put the knowledge and
analyst needs right in the system itself!

John




On Tue, Feb 28, 2017 at 12:20 PM, John Omernik <j...@omernik.com> wrote:

> You could also generate documentation updates via query at each release.
> This would be a great feature, move the information close to the analysts
> hands, I love how that would work.  (I think I remember some talk about
> extending sys.options to be self documenting as well )
>
>
>
> On Tue, Feb 28, 2017 at 12:11 PM, Jinfeng Ni <j...@apache.org> wrote:
>
>> Regarding the list of functions (build-in or UDF), someone once
>> suggested that we make the functions self-documented by adding a
>> sys.functions table.
>>
>> select * from sys.functions where name like '%SPLIT%';
>>
>> return function_name, parameter_list, description etc.
>>
>> This way, users could simply query sys.functions using Drill.
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Feb 28, 2017 at 9:02 AM, Jinfeng Ni <j...@apache.org> wrote:
>> >> 4.  I think as part of developer review and pull requests that add
>> >> functions/functionality should require a pull request to also provide a
>> >> documentation update. This helps to ensure that the docs keep up to
>> date,
>> >> as well as keeping users appraised of what is happening... i.e. it

Re: Odd Issue with to_date function

2017-02-28 Thread John Omernik
I am hitting this error again... I really don't want to restart my cluster
with a UTC timezone...

Let's put ourselves into the shoes of an analyst using a Drill Cluster: all
they are going to feel is frustration that they can't run a query like
this. It doesn't provide a great error message, and even if they find this
thread, they may not have the ability to make this change (ranging from not
having the permissions, to admins just saying no because of the other
effects).

We need a way to handle this at the session level.
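
Something along these lines is what I have in mind (purely hypothetical,
since no such option exists in Drill today):

-- hypothetical option name, nothing like this is implemented yet
ALTER SESSION SET `drill.exec.timezone` = 'UTC';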


On Fri, Apr 1, 2016 at 6:15 PM, John Omernik <j...@omernik.com> wrote:

> Yes, that worked, I restarted my drill cluster using -Duser.timezone=UTC
> on startup.  Per the other thread that I found after the fact (I was
> traveling) this works, but I wonder if we can't have a less intrusive fix.
> I.e. restarting the drill cluster to fix this (by adding the system
> property) is a pretty far reaching fix, and could have other downstream
> affects.  I see how this can come up, I just wonder if there could be some
> sort of option (using alter session), which per the other thread didn't
> appear to be the case, that we could use.  Changing the Drill bit time zone
> affects all users, alter session would be more localized (and if we had
> this, we should update the error message to possibly suggest the DST thing,
> with the proper command.)
>
> Could we as a long term fix try to add a synonym for the time? I think
> that's the proper way to handle things as the time truly doesn't exist, so
> instead of just throwing an error add an hour and call it good. :)
>
>
>
> On Fri, Apr 1, 2016 at 3:32 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> On Fri, Apr 1, 2016 at 6:54 AM, John Omernik <j...@omernik.com> wrote:
>>
>> > SYSTEM ERROR: IllegalInstantException: Cannot parse "2003-04-06
>> 02:34:20":
>> > Illegal instant due to time zone offset transition (America/Chicago)
>> > Fragment 3:9
>> >
>>
>>
>> Can you set your timezone to +0600 to avoid DST transitions? Will that
>> break things?
>>
>
>


Spill Location, permissions and Authentication

2017-02-28 Thread John Omernik
I am using 1.8 as of this time, if this is fixed in a newer version, please
let me know.

I am running drill as a master user (mapr) and have enabled
authentication. When I authenticate to drill, and try to run a query,
specifically one with hash joins turned off (thus using spill to disk), I
am getting error that my user (that I authenticated to sqlline with) does
not have access to my drill spill location. (They are MapR "local" volumes
that I have setup myself).

So, I was going to go adjust permissions, but I started thinking more...
Why isn't the spill location, when authentication is running, or when it's
not, always reading/writing as the drillbit user?  It doesn't make a lot
of sense that users would get to read this information; why not remove the
impersonation on requests to read and write?

Now, I am trying to think this through, so there may be tons of reasons,
but I would love to understand this process better.

Thanks!

John



Caused by: org.apache.hadoop.security.AccessControlException: User
zetasvcprod(user id 200) does not have access to
/var/local/zeta8/local/drillspill/drillprod/274a41a8-ae92-6b0f-8dab-3b16841c0e35_majorfragment1_minorfragment11_operator10/0

at com.mapr.fs.MapRClientImpl.create(MapRClientImpl.java:197)
~[maprfs-5.2.0-mapr.jar:5.2.0-mapr]

at com.mapr.fs.MapRFileSystem.create(MapRFileSystem.java:849)
~[maprfs-5.2.0-mapr.jar:5.2.0-mapr]

at com.mapr.fs.MapRFileSystem.create(MapRFileSystem.java:891)
~[maprfs-5.2.0-mapr.jar:5.2.0-mapr]


Re: Dealing with bad data when trying to do date computations

2017-02-28 Thread John Omernik
You could also generate documentation updates via query at each release.
This would be a great feature, move the information close to the analysts
hands, I love how that would work.  (I think I remember some talk about
extending sys.options to be self documenting as well )



On Tue, Feb 28, 2017 at 12:11 PM, Jinfeng Ni  wrote:

> Regarding the list of functions (build-in or UDF), someone once
> suggested that we make the functions self-documented by adding a
> sys.functions table.
>
> select * from sys.functions where name like '%SPLIT%';
>
> return function_name, parameter_list, description etc.
>
> This way, users could simply query sys.functions using Drill.
>
>
>
>
>
>
>
> On Tue, Feb 28, 2017 at 9:02 AM, Jinfeng Ni  wrote:
> >> 4.  I think as part of developer review and pull requests that add
> >> functions/functionality should require a pull request to also provide a
> >> documentation update. This helps to ensure that the docs keep up to
> date,
> >> as well as keeping users appraised of what is happening... i.e. it's a
> good
> >> "feeling" to see a great tool like Drill "improving" with new
> >> functionality.
> >>
> >> Please, folks, we need to do some one time clean up (go back through
> pull
> >> requests to ensure all functions are documented up to now) and then
> >> get processes in place to ensure ongoing updates.
> >>
> >
> > That's a good suggestion. We should try our best to keep the code and
> > doc in sync.
> >
> > +1
>


Re: Dealing with bad data when trying to do date computations

2017-02-28 Thread John Omernik
Thanks Charles, that worked even on my 1.8.
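
For anyone who finds this thread later, the ugly LIKE chain from my original
mail collapses to roughly this (the table path is a placeholder, and the
pattern only guards the month, just like the LIKE version did):

select extract(year from age(dob)) as age
from dfs.data.`people`
where regexp_matches(dob, '[0-9]{4}-(0?[1-9]|1[0-2])-[0-9]{1,2}');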

Drill folks: We need to do some documentation updates.   We've added
functions (like REGEXP_MATCHES, and it's in 1.8, so I am not sure where it
was added) and other functions like SPLIT and yet no mention in
https://drill.apache.org/docs/string-manipulation/

So, yes, this is "meh" work compared to programming all the cool things in
Drill.  But there are a number of reasons that this needs to be done
besides common practices.

1.  Users, and more importantly POTENTIAL users get frustrated when trying
to use drill for the first time. Coming from other Big Data systems like
Hive, not having Regex, split, and other functions is frustrating. But what
is more frustrating is to find that they actually exist, and are just not
documented.  Nothing will turn people off faster.

2.  Without the knowledge of these functions, people try "hacky" work
arounds like what I did, killing performance, and setting Drill in a bad
light.

3.  It provides an overall feeling of lack of effort by the community.  I
know that resources are not unlimited, and these things need to be
addressed by "someone" but issues like this are really important for
getting more people into the community who may be able to help contribute!

4.  I think as part of developer review and pull requests that add
functions/functionality should require a pull request to also provide a
documentation update. This helps to ensure that the docs keep up to date,
as well as keeping users appraised of what is happening... i.e. it's a good
"feeling" to see a great tool like Drill "improving" with new
functionality.

Please, folks, we need to do some one time clean up (go back through pull
requests to ensure all functions are documented up to now) and then
get processes in place to ensure ongoing updates.

Thanks

John Omernik


On Tue, Feb 28, 2017 at 10:15 AM, Charles Givre <cgi...@gmail.com> wrote:

> Hi John,
> I believe that Drill 1.9 includes a REGEXP_MATCHES(source, pattern)
> function which does what you'd expect it to.  I'm not sure when this was
> introduced, so it maybe in earlier versions of Drill.
> Best,
> -- C
>
> On Tue, Feb 28, 2017 at 11:03 AM, John Omernik <j...@omernik.com> wrote:
>
> > I have a data set that has birthdays in YYYY-MM-DD format.
> >
> > Most of this data is great. I am trying to compute the age using
> >
> > EXTRACT(year from age(dob))
> >
> >
> > But some of my data is crapola... let's call it alternative data...
> >
> >
> > When I try to run the Extract function, I get
> >
> > Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
> > must be in the range [1,12]
> >
> > Fragment 5:17
> >
> > [Error Id: 62f90784-c9f4-4362-9710-a37464fc801a on drillnode:20005]
> >
> >
> > I've tried an ugly where clause, and this works:
> >
> > where
> >
> > (dob LIKE '%-01-%' or dob LIKE '%-02-%' or dob LIKE '%-03-%' or dob LIKE
> > '%-04-%' or dob LIKE '%-05-%' or dob LIKE '%-06-%' or dob LIKE '%-07-%'
> or
> > dob LIKE '%-08-%' or dob LIKE '%-09-%' or
> >
> > dob LIKE '%-1-%' or dob LIKE '%-2-%' or dob LIKE '%-3-%' or dob LIKE
> > '%-4-%' or dob LIKE '%-5-%' or dob LIKE '%-6-%' or dob LIKE '%-7-%' or
> dob
> > LIKE '%-8-%' or dob LIKE '%-9-%' or
> >
> > dob LIKE '%-10-%' or dob LIKE '%-11-%' or dob LIKE '%-12-%')
> >
> >
> > But WOW is that ugly. I could add the jar for regex contains, and make it
> > much easier (do we have a regex search function built into drill? I think
> > we should at this point...)
> >
> >
> > Is there another way to say try the extract function, and catch a failure,
> >  and ignore on failure? What if we had a cast function that returned NULL
> > on failure so we could use it in the where clause?  Any other more
> elegant
> > ways to handle this?
> >
> >
> > Thanks!
> >
> >
> > John
> >
>


Dealing with bad data when trying to do date computations

2017-02-28 Thread John Omernik
I have a data set that has birthdays in YYYY-MM-DD format.

Most of this data is great. I am trying to compute the age using

EXTRACT(year from age(dob))


But some of my data is crapola... let's call it alternative data...


When I try to run the Extract function, I get

Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear
must be in the range [1,12]

Fragment 5:17

[Error Id: 62f90784-c9f4-4362-9710-a37464fc801a on drillnode:20005]


I've tried an ugly where clause, and this works:

where

(dob LIKE '%-01-%' or dob LIKE '%-02-%' or dob LIKE '%-03-%' or dob LIKE
'%-04-%' or dob LIKE '%-05-%' or dob LIKE '%-06-%' or dob LIKE '%-07-%' or
dob LIKE '%-08-%' or dob LIKE '%-09-%' or

dob LIKE '%-1-%' or dob LIKE '%-2-%' or dob LIKE '%-3-%' or dob LIKE
'%-4-%' or dob LIKE '%-5-%' or dob LIKE '%-6-%' or dob LIKE '%-7-%' or dob
LIKE '%-8-%' or dob LIKE '%-9-%' or

dob LIKE '%-10-%' or dob LIKE '%-11-%' or dob LIKE '%-12-%')


But WOW is that ugly. I could add the jar for regex contains, and make it
much easier (do we have a regex search function built into drill? I think
we should at this point...)


Is there another way to say try the extract function, and catch a failure,
 and ignore on failure? What if we had a cast function that returned NULL
on failure so we could use it in the where clause?  Any other more elegant
ways to handle this?


Thanks!


John


Re: Drill and Elasticsearch

2017-02-23 Thread John Omernik
Nice, I will check that out Uwe.  (pyarrow)

Ted, that was my first approach, but getting the right way to get the json
file(s) out of 3+GB of parquet was challenging. I tried pushing from Drill's
REST API to the ES REST API, but that was not performant at all...
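
For what it's worth, Drill itself can write JSON back out with a CTAS and the
JSON writer, which might be a less painful export path. A sketch, with
placeholder paths:

ALTER SESSION SET `store.format` = 'json';
-- writes *.json files under the target workspace, ready for a bulk loader
CREATE TABLE dfs.tmp.`es_export` AS SELECT * FROM dfs.data.`my_parquet_table`;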

I did stumble across this: https://github.com/moshe/elasticsearch_loader

Which was interesting, it would skip drill, read the parquet files directly
and stream them (in definable batches) into Elastic search. I ran into some
issues (that I posted and the author is aware of).

First, you have to specify all the files, so if you have a directory of
Parquet, it's a bit clumsy. Some fancy BASHing works around that

Second, if you specify a large number of files, due to the way a status bar
is implemented, it forces a HUGE preread of the files. The author has
promised the ability to disable this in the future. (when I say HUGE I mean
YUGE, the BIGGEST preread).  Basically to calculate the status bar, on a
3GB table of Parquet read from MapR FS, it required 36GB of ram. (Not
reasonable). Basically I used by hacky BASHing to run his tool on every
file one at a time to avoid that. (I  want to see if I can distribute on
mesos too :)

Third... I did see lots of elastic search load during the import.  I was
running three nodes using MapR FS via NFS/Fuse as the storage location (a
volume for each node).  And between Elasticsearch, mfs, and the fuse
client, there was a lot of CPU usage to do the load... I wish I was a FS
expert to figure out how to tune things on that front... but alas, just a
Bash/Python happy scripter.

I am excited to see the pyarrow stuff!

John



On Thu, Feb 23, 2017 at 8:29 AM, Uwe L. Korn <uw...@xhochy.com> wrote:

> On Thu, Feb 23, 2017, at 03:19 PM, Ted Dunning wrote:
> > On Tue, Feb 21, 2017 at 10:32 PM, John Omernik <j...@omernik.com> wrote:
> >
> > > I guess, I am just looking for ideas, how would YOU get data from
> Parquet
> > > files into Elastic Search? I have Drill and Spark at the ready, but
> want to
> > > be able to handle it as efficiently as possible.  Ideally, if we had a
> well
> > > written ES plugin, I could write a query that inserted into an index
> and
> > > streamed stuff in... but barring that, what other methods have people
> used?
> > >
> >
> > My traditional method has been to use Python's version of the ES batch
> > load
> > API. This runs ES pretty hard, but you would need more to saturate a
> > really
> > large ES cluster. Often I export a JSON file using whatever tool (Drill
> > would work) and then use the python on that file. Avoids questions of
> > Python reading obscure stuff. I think that Python is now able to read and
> > write Parquet, but that is pretty new stuff, so I would stay old school
> > there.
>
> If you want to try it, see
> https://pyarrow.readthedocs.io/en/latest/parquet.html
>
> You can use `conda install pyarrow` to get it, probably next monday it
> will also be pip-installable. It's based on Apache Arrow and Apache
> Parquet C++, we're happy about any feedback!
>


Drill and Elasticsearch

2017-02-21 Thread John Omernik
I know there are conversations about an Elasticsearch plugin, however, I
had a recent need to take some data that was accessible in Drill (stored as
a Parquet table) and move it into an Elasticsearch index.  There are about
1 million rows in the source data.

I was learning the technologies here, and so I thought I could use the Rest
API in Drill and then push the data into Elasticsearch.

I found that in using Drill, I couldn't run a single query and then batch
the results into Elasticsearch. Too many rows returned, timeouts, etc. So I
used LIMIT 1000 and OFFSET.  This allowed the ES side of things to work
well, but it required quite a few runs of the same query in Drill, wasting
time and resources.

I guess, I am just looking for ideas, how would YOU get data from Parquet
files into Elastic Search? I have Drill and Spark at the ready, but want to
be able to handle it as efficiently as possible.  Ideally, if we had a well
written ES plugin, I could write a query that inserted into an index and
streamed stuff in... but barring that, what other methods have people used?

Thanks

John


Re: ideal drill node size

2017-02-06 Thread John Omernik
I think you would be wasting quite a bit of your server if you split it up
into multiple VMs. Instead, I am thinking a single, larger drillbit
(ensure you are upping your RAM as high as you can) would be best.  Note I
am not an expert on this stuff, I would like an expert's take as well. Here
is a link on configuring Drill memory:
https://drill.apache.org/docs/configuring-drill-memory/

Another thing with such a heavyweight server is you will likely need to
adjust memory defaults to take advantage of more of the memory (Drill
folks, correct me if I am wrong). Settings like
planner.memory.max_query_memory_per_node will need to be set up to take
advantage of more of your memory.  It will be very interesting to see where
the bottleneck in a setup like yours is... please share results!
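
For example, something along these lines (the value is only an illustration,
in bytes):

ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 10737418240;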



On Sat, Feb 4, 2017 at 11:37 AM, Christian Hotz-Behofsits <
chris.hotz.behofs...@gmail.com> wrote:

> Hi,
>
> I have a big server (512gb RAM, 32 cores, storage is connected by FC) and
> want to use it for drill. Should I split it into several VMs and build a
> cluster or should I use it as a single node? I know that splitting would
> introduce overhead (guest OS). But a cluster might provide better
> parallelization of the tasks.
>
>
>- Are there any best practices in terms of node size (memory, CPU)?
>- Does drill favor a lot of small nodes or few (one) big node?
>
>
> cheers
>


Re: IndexR, a new storage plugin for Drill

2017-01-03 Thread John Omernik
This looks very interesting! Can't wait to see some how-to's to get the
server nodes set up, and Kafka pipelines set up.  I'd be very interested in
trying this once it's setup.

Thanks!



On Tue, Jan 3, 2017 at 2:35 AM, WeiWan  wrote:

> IndexR is a distributed, columnar storage system based on HDFS, which
> focuses on fast analysis, both for massive static (historical) data and rapidly
> ingesting realtime data. IndexR is designed for OLAP.
>
> Fast analyze on large dataset
> Realtime ingestion with zero delay for query
> Deep integration with Hadoop ecosystem
> Hardware efficiency
> Highly available, scalable, manageable and simple
> Adapted with popular query engines like Apache Drill, Apache Hive, etc.
>
> And now it is open source.
>
> Project: https://github.com/shunfei/indexr
> Wiki: https://github.com/shunfei/indexr/wiki
>
> IndexR was originally developed by Sunteng Tech. This project started a year
> ago and has now been deployed to several production environments in our company. The
> whole cluster consumes over 30 billion events each day in realtime from
> Kafka. The largest table contains over 10 billion rows (after rollup) and is
> rapidly increasing. Most of the statistic/analytic queries' latency is less
> than 3 seconds in a real world production environment.
>
> Currently it is mainly used as a Drill and Hive storage plugin. It should be
> quite easy to master.
>
> We hope IndexR be a favor to you and make it better.
>
> Regards
> Flow Wei
>
>
>
>


Re: Batch load of unstructured data in Drill

2016-12-08 Thread John Omernik
Sure... I believe you could CTAS from your json directory into a tmp
parquet directory and then move the resultant files into the final parquet
directory

i.e.

Drill Query: Create table `.mytempparq` as select * from `.mytempjson`
Filesystem command: mv ./mytempparq/* ./myfinalparq

It would be great of Drill could do this for us :)



On Thu, Dec 8, 2016 at 11:13 AM, Alexander Reshetov <
alexander.v.reshe...@gmail.com> wrote:

> By the way, is it possible to append data to parquet data source?
> I'm looking for the possibility to append new rows to existing data so
> every query execution will see the new rows.
>
> Surely it's possible with plain JSON, but I want a more efficient binary
> format which will give quicker reads (and query executions).
>
> On Wed, Dec 7, 2016 at 4:08 PM, Alexander Reshetov
>  wrote:
> > Hello,
> >
> > I want to load batches of unstructured data in Drill. Mostly JSON data.
> >
> > Is there any batch API or other options to do so?
> >
> >
> > Thanks.
>


NPE on Select with Options on CSV File

2016-12-08 Thread John Omernik
Hey all, I am trying to do a select with options on a CSV file.  select
columns[0], columns[1] already works for this data.  Ideally, I am trying
to do a

select * from table(dfs.root.`path/to/data.csv`(type => 'text',
extractHeader => true, fieldDelimiter => ',')) limit 10 and have it work and
read the column names... however, when I do that I get this odd NPE... (The
NPE happens for mytable/*.csv or mytable/myfile.csv)

This is Drill 1.8.  Thanks for any help!

John


> select * from table(`mytable/*.csv`(type => 'text', extractHeader =>
true, fieldDelimiter => ','))  limit 10;

Error: SYSTEM ERROR: NullPointerException


Fragment 1:0


[Error Id: dc3f675d-6cca-4ede-8364-8b4bcd0a7c42 on mynode:20045]


  (com.fasterxml.jackson.databind.JsonMappingException) Instantiation of
[simple type, class org.apache.drill.exec.store.dfs.easy.EasySubScan] value
failed (java.lang.NullPointerException): null

 at [Source: {

  "pop" : "single-sender",

  "@id" : 0,

  "receiver-major-fragment" : 0,

  "receiver-minor-fragment" : 0,

  "child" : {

"pop" : "selection-vector-remover",

"@id" : 1,

"child" : {

  "pop" : "limit",

  "@id" : 2,

  "child" : {

"pop" : "fs-sub-scan",

"@id" : 3,

"userName" : "jomernik",

"files" : [ {

  "start" : 268435456,

  "length" : 117805309,

  "path" : "maprfs:/data/datatest/mytable/load1.csv"

} ],

"storage" : {

  "type" : "file",

  "enabled" : true,

  "connection" : "maprfs:///",

  "config" : null,

  "workspaces" : {

"root" : {

  "location" : "/",

  "writable" : false,

  "defaultInputFormat" : null

},

"datatest" : {

  "location" : "/data/datatest",

  "writable" : true,

  "defaultInputFormat" : null

}

  },

  "formats" : {

"psv" : {

  "type" : "text",

  "extensions" : [ "tbl" ],

  "delimiter" : "|"

},

"csv" : {

  "type" : "text",

  "extensions" : [ "csv" ],

  "delimiter" : ","

},

"tsv" : {

  "type" : "text",

  "extensions" : [ "tsv" ],

  "delimiter" : "\t"

},

"parquet" : {

  "type" : "parquet"

},

"json" : {

  "type" : "json",

  "extensions" : [ "json" ]

},

"maprdb" : {

  "type" : "maprdb"

}

  }

},

"format" : {

  "type" : "named",

  "name" : "text"

},

"columns" : [ "`*`" ],

"selectionRoot" : "maprfs:/data/datatest/mytable",

"initialAllocation" : 100,

"maxAllocation" : 100,

"cost" : 0.0

  },

  "first" : 0,

  "last" : 10,

  "initialAllocation" : 100,

  "maxAllocation" : 100,

  "cost" : 10.0

},

"initialAllocation" : 100,

"maxAllocation" : 100,

"cost" : 10.0

  },

  "destination" : "CiVnaXN6ZXRhYWdlbnQwMi5lbGFiLm
JhbmtvZmFtZXJpY2EuY29tEM2cARjOnAEgz5wB",

  "initialAllocation" : 100,

  "maxAllocation" : 100,

  "cost" : 10.0

}; line: 75, column: 7] (through reference chain: org.apache.drill.exec.
physical.config.SingleSender["child"]->org.apache.drill.
exec.physical.config.SelectionVectorRemover["child"]->org.apache.drill.exec.
physical.config.Limit["child"])

com.fasterxml.jackson.databind.JsonMappingException.from():223

com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.
wrapAsJsonMappingException():445

com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.
rewrapCtorProblem():464

com.fasterxml.jackson.databind.deser.std.StdValueInstantiator.
createFromObjectWith():258

com.fasterxml.jackson.databind.deser.impl.PropertyBasedCreator.build():
135

com.fasterxml.jackson.databind.deser.BeanDeserializer._
deserializeUsingPropertyBased():444

com.fasterxml.jackson.databind.deser.BeanDeserializerBase.
deserializeFromObjectUsingNonDefault():1123

com.fasterxml.jackson.databind.deser.BeanDeserializer.
deserializeFromObject():298

com.fasterxml.jackson.databind.deser.BeanDeserializerBase.
deserializeWithObjectId():1094

com.fasterxml.jackson.databind.deser.BeanDeserializer._
deserializeOther():166

com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize():135

com.fasterxml.jackson.databind.jsontype.impl.
AsPropertyTypeDeserializer._deserializeTypedForId():120

com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.
deserializeTypedFromObject():91

com.fasterxml.jackson.databind.deser.AbstractDeserializer.
deserializeWithType():142


SQL Line Formating

2016-12-08 Thread John Omernik
Hey all,

I have a puzzler (I think).

I have a directory with JSON, it's great, it queries well, it's well
formatted.

I created a view on that directory. Added some columns (like a timestamp
version of the EPOCH time field)

When I run a query in SQL Line of the view, I get a well formatted table:

select * from mytable limit 10;

+---------+---------+---------+---------+
|  col1   |  col2   |  col3   |  col 4  |

Even though the data appears in different sizes, it takes the results and
pads them if it can and makes it look good.

However, when I run select col1, min(col2), count(distinct col3) as
num_col3 from mytable group by col1 order by num_col3 DESC limit 100;

Then all the columns are squished with no padding.

I guess, why does SQL Line output the data from the select * from mytable
limit 100 by sending the result set through something that determines
min/max column widths, but not for select col1, agg(col2) from mytable group by
col1 order by agg(col2) desc limit 100?

 Also interesting: If I remove the order by, the formatting is much
better... why is this the case?

Thanks


Re: Slow query on parquet imported from SQL Server while the external SQL server is down.

2016-12-01 Thread John Omernik
@Abhishek,

Do you think the issue is related to any storage plugin that is enabled and
not available as it applies to all queries?  I guess if it's an issue where
all queries are slow because the foreman is waiting to initialize ALL
storage plugins, regardless of their applicability to the queried data,
then that is a more general issue (that should still be resolved, does the
foreman need to initialize all plugins before querying specific data?)
 However, I am still concerned that the query on the CTAS parquet data is
specifically slower because of its source.  @Rahul could you test a
different Parquet table, NOT loaded from the SQL server, to see if
enabling or disabling the JDBC storage plugin (with the server unavailable)
has any impact?  Basically, I want to ensure that data that is created as a
Parquet table via CTAS is 100% free of any links to the source data. This
is EXTREMELY important.

John



On Thu, Dec 1, 2016 at 12:46 AM, Abhishek Girish <abhishek.gir...@gmail.com>
wrote:

> Thanks for the update, Rahul!
>
> On Wed, Nov 30, 2016 at 9:45 PM Rahul Raj <rahul@option3consulting.com
> >
> wrote:
>
> > Abhishek,
> >
> > Your observation is correct, we just verified that:
> >
> >1. The queries run as expected(faster) with Jdbc plugin disabled.
> >2. Queries run as expected when the plugin's datasource is running.
> >3. With the datasource down, queries run very slow waiting for the
> >connection to fail
> >
> > Rahul
> >
> > On Thu, Dec 1, 2016 at 10:07 AM, Abhishek Girish <
> > abhishek.gir...@gmail.com>
> > wrote:
> >
> > > @John,
> > >
> > > I agree that this should work. While I am not certain, I don't think
> the
> > > issue is specific to a particular plugin, but the way in a query's
> > > lifecycle, the foreman attempts to initialize every enabled storage
> > plugin
> > > before proceeding to execute the query. So when a particular plugin
> isn't
> > > configured correctly or the underlying datasource is not up, this could
> > > drastically slow down the query execution time.
> > >
> > > I'll look up to see if we have a JIRA for this already - if not will
> file
> > > one.
> > >
> > > On Wed, Nov 30, 2016 at 8:12 AM, John Omernik <j...@omernik.com>
> wrote:
> > >
> > > > So just my opinion in reading this thread.  (sorry for swooping in an
> > > > opining)
> > > >
> > > > If a CTAS is done from any data source into Parquet files there
> > > should
> > > > be NO dependency on the original data source to query the resultant
> > > Parquet
> > > > files.   As a Drill user, as a Drill admin, this breaks the concept
> of
> > > > least surprise.  If I take data from one source, and create Parquet
> > files
> > > > in a distributed file system, it should just work.  If there are
> > "issues"
> > > > with JDBC plugins or the HBase/Hive plugins in a similar manner,
> these
> > > > need to be hunted down by a large group of villagers with pitchforks
> > and
> > > > torches.  I just can't see how this could be acceptable at any level.
> > The
> > > > whole idea of Parquet files is they are self describing, schema
> > included
> > > > files thus a read of a directory of Parquet files should have NO
> > > > dependencies on anything but the parquet files... even the Parquet
> > > > "additions" (such as the METADATA Cache) should be a fail open
> thing...
> > > if
> > > > it exists great, use it, speed things up, but if it doesn't read the
> > > > parquet files as normal (Which I believe is how it operates)
> > > >
> > > > John
> > > >
> > > > On Wed, Nov 30, 2016 at 12:12 AM, Abhishek Girish <
> > > > abhishek.gir...@gmail.com
> > > > > wrote:
> > > >
> > > > > Can you attempt to disable to jdbc plugin (configured with
> SQLServer)
> > > and
> > > > > try the query (on parquet) when SQL Server is offline?
> > > > >
> > > > > I've seen a similar issue previously when the HBase / Hive plugin
> was
> > > > > enabled but either the plugin configuration was wrong or the
> > underlying
> > > > > data source was down.
> > > > >
> > > > > On Fri, Nov 25, 2016 at 3:21 AM, Rahul Raj
> > > <rahul.raj@option3consulting.
> > > > > com>
> > > > > wrote:
>

Re: Slow query on parquet imported from SQL Server while the external SQL server is down.

2016-11-30 Thread John Omernik
So just my opinion in reading this thread.  (sorry for swooping in an
opining)

If a CTAS is done from any data source into Parquet files there should
be NO dependency on the original data source to query the resultant Parquet
files.   As a Drill user, as a Drill admin, this breaks the concept of
least surprise.  If I take data from one source, and create Parquet files
in a distributed file system, it should just work.  If there are "issues"
with JDBC plugins or the HBase/Hive plugins in a similar manner, these
need to be hunted down by a large group of villagers with pitchforks and
torches.  I just can't see how this could be acceptable at any level. The
whole idea of Parquet files is they are self describing, schema-included
files; thus a read of a directory of Parquet files should have NO
dependencies on anything but the parquet files... even the Parquet
"additions" (such as the METADATA Cache) should be a fail open thing... if
it exists, great, use it, speed things up, but if it doesn't, read the
parquet files as normal (Which I believe is how it operates)

John

On Wed, Nov 30, 2016 at 12:12 AM, Abhishek Girish  wrote:

> Can you attempt to disable to jdbc plugin (configured with SQLServer) and
> try the query (on parquet) when SQL Server is offline?
>
> I've seen a similar issue previously when the HBase / Hive plugin was
> enabled but either the plugin configuration was wrong or the underlying
> data source was down.
>
> On Fri, Nov 25, 2016 at 3:21 AM, Rahul Raj  com>
> wrote:
>
> > I have created a parquet file using CTAS from a MS SQL Server. The query
> on
> > parquet is getting stuck in STARTING state for a long time before
> returning
> > the results.
> >
> > We could see that drill was trying to connect to the MS SQL server from
> > which the data was imported. The MSSQL server was down, drill threw an
> > exception "Failure while attempting to load JDBC schema", and then
> returned
> > the results. While SQL server is running, the query executes without
> > issues.
> >
> > Why is drill querying the DB metadata externally and not the imported
> > parquets?
> >
> > Rahul.
> >
> > --
> >  This email and any files transmitted with it are confidential and
> > intended solely for the use of the individual or entity to whom it is
> > addressed. If you are not the named addressee then you should not
> > disseminate, distribute or copy this e-mail. Please notify the sender
> > immediately and delete this e-mail from your system.
> >
>

