RE: Drill performance tuning parquet

2017-07-27 Thread Dan Holmes
Let's pretend there is only this one query, or a similar-style aggregation 
perhaps with a WHERE clause.  I am trying to understand how to get more out of 
the Drill instance I have.  That set of parquet files is ~900MB.  It was 6.25 
GB as PSV.  There are 13 million records.

How can I tell if I am IO bound and need more reader threads?  If there were 
more files, would that be better?
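
For example, here is the kind of rewrite I have in mind to get more, smaller 
files (a sketch only; `sales_p2` is a hypothetical target, `channel` is one of 
my columns, and I am assuming the `store.parquet.block-size` option controls 
the output file size):

ALTER SESSION SET `store.parquet.block-size` = 134217728; -- assumption: ~128MB row groups yield more, smaller files
CREATE TABLE dfs.root.`sales_p2`
PARTITION BY (channel)  -- writes separate files per partition value
AS SELECT * FROM dfs.root.`sales_p`;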

I don't think I am CPU bound, based on the stats that EC2 gave me.  I am using 
this example both to learn how to scale Drill and to understand it better.

We are considering using it for our text file processing as an exploratory 
tool, for ETL (since it will convert to parquet), and, because of its ability 
to join disparate data sources, as a db layer for tools like Tableau.

Another tool we have considered is Athena.  It is crazy fast: that same query 
against the text files runs in ~3 seconds.  (My 4 vCPU Drill instance did that 
same query against S3 text files in 180 seconds.)  But it does have drawbacks.  
It only works on AWS, so my on-premise solutions would have to be designed 
differently.

I don't need performance parity, but to do this right I need to understand Drill 
better.  That is the essence of this inquiry.

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Saurabh Mahapatra [mailto:saurabhmahapatr...@gmail.com] 
Sent: Thursday, July 27, 2017 6:52 PM
To: user@drill.apache.org
Subject: Re: Drill performance tuning parquet

Hi Dan,

Here are some thoughts from my end.

So this is just one query and you have the numbers. But how about a 
representative collection? Do you have the use cases? Now, I know from 
experience that if you can predict the pattern of the queries to about 60%, 
that would be great. The rest could be ad hoc and you could plan for it.

For that 60%, it would be good to share some numbers along these lines:

1. SQL query, response time measured, response time expected, size of the 
tables that are part of the query
2. Do you have any data skew?
3. What is the EC2 configuration you have: memory, CPU cores?

So the approach would be to tune it for the entire set (which means you will 
end up trading off the various parameters) and then scale out. (Scale out is 
not cheap).

Thanks,
Saurabh

On Thu, Jul 27, 2017 at 1:37 PM, Kunal Khatua  wrote:

> You haven't specified what kind of query you are running.
>
> The Async Parquet Reader tuning should be more than sufficient in your 
> use case, since you seem to be only processing 3 files.
>
> The feature introduces a small fixed pool of threads that are 
> responsible for the actual fetching of bytes from the disk, without 
> blocking the fragments that already have some data available to work on.
>
> The "store.parquet.reader.pagereader.buffersize" might be of interest.
> The default for this is 4MB and can be tuned to match the parquet page 
> size (usually 1MB). This can reduce memory pressure and improve the 
> pipeline behavior.
>
> Apart from this, the primary factors affecting your query performance 
> are the number of cores (which is what you seem to be tuning) and memory.
> By design, the parallelization level is a function of the num-of-cores.
> From the look of things, that is helping. You can try 
> further tuning it with this:
> planner.width.max_per_node (default is 70% of num-of-cores)
>
> For memory,
> planner.memory.max_query_memory_per_node (default is 2GB)
>
>
> This is where you'll find more about this:
> https://drill.apache.org/docs/performance-tuning/
>
> ~ Kunal
>
> -Original Message-
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 1:06 PM
> To: user@drill.apache.org
> Subject: RE: Drill performance tuning parquet
>
> I did not partition the data when I created the parquet files (CTAS 
> without a PARTITION BY).
>
> Here is the file list.
>
> Thank you.
>
>
> [dholmes@ip-10-20-49-40 sales_p]$ ll
> total 1021372
> -rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
> -rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
> -rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet
>
> Dan Holmes | Revenue Analytics, Inc.
> Direct: 770.859.1255
> www.revenueanalytics.com
>
> -Original Message-
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 3:59 PM
> To: user@drill.apache.org
> Subject: Drill performance tuning parquet
>
> I am performance testing a single drill instance with different vCPU 
> configurations in AWS.  I have parquet files on an EFS volume and 
> use the same data for each EC2 instance.
>
> I have used 4 vCPUs, 8, and 16.  Drill performance is ~25 seconds, 15, and 12
> respectively.  I have not changed any of the options.  This is an
> out-of-the-box 1.11 installation.
>
> What Drill tuning options should I experiment with?  I have read 
> https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is so 
> technical that I can't consume it; it reads like the default options are 
> the best ones.

Re: How much more information can error messages provide?

2017-07-27 Thread Paul Rogers
Hi All,

As it turns out, many places in the code already use a class called 
UserException. (Read that as “an exception to send to the user” rather than “an 
exception that the user caused.”) Each use of UserException includes a message 
and context. So, the good news is, many places in the code already provide a 
useful message.

Unfortunately, many places in the code do not provide such messages. Still, we 
can certainly track those that are missing and add them incrementally. That is 
a pretty low-cost, low-risk change if we do it gradually.

The question then becomes, why isn’t this information passed along to the user 
as a nice, simple text message? That would be worth investigating.

Other SQL engines use standard SqlCode and SqlWarning codes. A bit old school, 
but it allows tools to understand the meaning of the error without parsing a 
vendor-specific message. We might want to add that.

So, for priority, again from a code perspective:

1) Better display the messages we already have.
2) Write a stack trace to the log, but not to the user’s console (in Sqlline, 
etc.)
3) As part of fixes and code reviews, look for spots without such messages and 
add them.

Are there better, simpler ways we could solve this?

- Paul


> On Jul 27, 2017, at 3:55 PM, Saurabh Mahapatra  
> wrote:
> 
> I completely agree with John. I think we need to make error/warning
> messages more friendly moving forward with any new features that ship.
> 
> Please share the JIRA that you create. But a holistic approach scares me.
> How would we prioritize the ones that would impact most users? Any thoughts
> on that?
> 
> Saurabh
> 
> On Thu, Jul 27, 2017 at 6:43 AM, John Omernik  wrote:
> 
>> Hey Dan, I created the master issue for tracking improvements to error
>> messages here: https://issues.apache.org/jira/browse/DRILL-5692
>> 
>> If you want to create your individual error and link to it in your message,
>> we'll see how it works.  I like the idea of a holistic effort to improve
>> error messages!
>> 
>> John
>> 
>> 
>> 
>> On Thu, Jul 27, 2017 at 8:25 AM, John Omernik  wrote:
>> 
>>> I want to bump this up. I've had a number of troubleshooting times where
>>> getting more concise error messages would really help me deal with my
>> data.
>>> It looks like Dan found verbose mode, but sometimes verbose isn't what we
>>> need, but concise. Hey Dan, maybe we could come up with a Jira that is
>>> the "tracker" JIRA for error messages that are not helpful. Then each
>> time
>>> someone comes up with an error message that is frustrating or unhelpful,
>> we
>>> could create an individual JIRA for that message and link it to the
>> master
>>> JIRA. That would allow us to find, throughout the code, any error messages
>>> that are returning with limited information.
>>> 
>>> My thought is this, in the Master JIRA (which I will be happy to create)
>> I
>>> will explain the purpose of the Master JIRA, and how best to describe
>> error
>>> messages that don't help the analyst (what information to include,
>> showing
>>> the error with both normal and verbose mode, if the problem is resolved,
>>> what resolved the error etc).
>>> 
>>> I think we should focus on errors, not issues. What I mean by that, is
>>> there is an error like "Null Pointer Exception" which is a horrible error
>>> in Java that is hard for a user to make heads or tails of, we shouldn't
>>> open a "bad error message" for each Null Pointer Exception issue that
>>> occurs. Instead, perhaps we could open one for the error "Null Pointer
>>> Exception" (regardless of what caused it) and provide details how we can
>>> better trap that and communicate it to the user... does that make sense?
>>> 
>>> I will start working on a master JIRA for this.
>>> 
>>> 
>>> 
>>> On Thu, Jul 27, 2017 at 8:19 AM, Dan Holmes <
>> dhol...@revenueanalytics.com>
>>> wrote:
>>> 
 With verbose mode on the following is the error message.  I guess this
 confirms it is a date conversion but there isn't any help as to which
>> file
 or which expression (in this statement there is only one TO_DATE
>> function
 so it is obvious).
 
 org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
 IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id:
 22129e51-acbd-41f2-84f5-721541d9843b on ubuntu:31010]
 (java.lang.IllegalArgumentException) Invalid format: " "
 org.joda.time.format.DateTimeFormatter.parseDateTime():899
 org.joda.time.DateMidnight.parse():152 org.apache.drill.exec.test.gen
 erated.ProjectorGen367.doEval():76 org.apache.drill.exec.test.gen
 erated.ProjectorGen367.projectRecords():67
>> org.apache.drill.exec.physical
 .impl.project.ProjectRecordBatch.doWork():199
 org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
 org.apache.drill.exec.physical.impl.project.
>> ProjectRecordBatch.innerNext():135
 

Re: How much more information can error messages provide?

2017-07-27 Thread Saurabh Mahapatra
I completely agree with John. I think we need to make error/warning
messages more friendly moving forward with any new features that ship.

Please share the JIRA that you create. But a holistic approach scares me.
How would we prioritize the ones that would impact most users? Any thoughts
on that?

Saurabh

On Thu, Jul 27, 2017 at 6:43 AM, John Omernik  wrote:

> Hey Dan, I created the master issue for tracking improvements to error
> messages here: https://issues.apache.org/jira/browse/DRILL-5692
>
> If you want to create your individual error and link to it in your message,
> we'll see how it works.  I like the idea of a holistic effort to improve
> error messages!
>
> John
>
>
>
> On Thu, Jul 27, 2017 at 8:25 AM, John Omernik  wrote:
>
> > I want to bump this up. I've had a number of troubleshooting times where
> > getting more concise error messages would really help me deal with my
> data.
> > It looks like Dan found verbose mode, but sometimes verbose isn't what we
> > need, but concise. Hey Dan, maybe we could come up with a Jira that is
> > the "tracker" JIRA for error messages that are not helpful. Then each
> time
> > someone comes up with an error message that is frustrating or unhelpful,
> we
> > could create an individual JIRA for that message and link it to the
> master
> > JIRA. That would allow us to find, throughout the code, any error messages
> > that are returning with limited information.
> >
> > My thought is this, in the Master JIRA (which I will be happy to create)
> I
> > will explain the purpose of the Master JIRA, and how best to describe
> error
> > messages that don't help the analyst (what information to include,
> showing
> > the error with both normal and verbose mode, if the problem is resolved,
> > what resolved the error etc).
> >
> > I think we should focus on errors, not issues. What I mean by that, is
> > there is an error like "Null Pointer Exception" which is a horrible error
> > in Java that is hard for a user to make heads or tails of, we shouldn't
> > open a "bad error message" for each Null Pointer Exception issue that
> > occurs. Instead, perhaps we could open one for the error "Null Pointer
> > Exception" (regardless of what caused it) and provide details how we can
> > better trap that and communicate it to the user... does that make sense?
> >
> > I will start working on a master JIRA for this.
> >
> >
> >
> > On Thu, Jul 27, 2017 at 8:19 AM, Dan Holmes <
> dhol...@revenueanalytics.com>
> > wrote:
> >
> >> With verbose mode on the following is the error message.  I guess this
> >> confirms it is a date conversion but there isn't any help as to which
> file
> >> or which expression (in this statement there is only one TO_DATE
> function
> >> so it is obvious).
> >>
> >> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> >> IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id:
> >> 22129e51-acbd-41f2-84f5-721541d9843b on ubuntu:31010]
> >> (java.lang.IllegalArgumentException) Invalid format: " "
> >> org.joda.time.format.DateTimeFormatter.parseDateTime():899
> >> org.joda.time.DateMidnight.parse():152 org.apache.drill.exec.test.gen
> >> erated.ProjectorGen367.doEval():76 org.apache.drill.exec.test.gen
> >> erated.ProjectorGen367.projectRecords():67
> org.apache.drill.exec.physical
> >> .impl.project.ProjectRecordBatch.doWork():199
> >> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
> >> org.apache.drill.exec.physical.impl.project.
> ProjectRecordBatch.innerNext():135
> >> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> >> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> >> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> >> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> >> org.apache.drill.exec.physical.impl.project.
> ProjectRecordBatch.innerNext():135
> >> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> >> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> >> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> >> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
> >> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> >> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> >> org.apache.drill.exec.physical.impl.SingleSenderCreator$Sing
> >> leSenderRootExec.innerNext():92 org.apache.drill.exec.
> physical.impl.BaseRootExec.next():94
> >> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
> >> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
> >> java.security.AccessController.doPrivileged():-2
> >> javax.security.auth.Subject.doAs():422 org.apache.hadoop.security.
> UserGroupInformation.doAs():1657
> >> org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
> >> org.apache.drill.common.SelfCleaningRunnable.run():38
> >> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> >> 

Re: Drill performance tuning parquet

2017-07-27 Thread Saurabh Mahapatra
Hi Dan,

Here are some thoughts from my end.

So this is just one query and you have the numbers. But how about a
representative collection? Do you have the use cases? Now, I know from
experience that if you can predict the pattern of the queries to about 60%,
that would be great. The rest could be ad hoc and you could plan for it.

For that 60%, it would be good to share some numbers along these lines:

1. SQL query, response time measured, response time expected, size of the
tables that are part of the query
2. Do you have any data skew?
3. What is the EC2 configuration you have: memory, CPU cores?

So the approach would be to tune it for the entire set (which means you
will end up trading off the various parameters) and then scale out. (Scale
out is not cheap).

Thanks,
Saurabh

On Thu, Jul 27, 2017 at 1:37 PM, Kunal Khatua  wrote:

> You haven't specified what kind of query you are running.
>
> The Async Parquet Reader tuning should be more than sufficient in your
> use case, since you seem to be only processing 3 files.
>
> The feature introduces a small fixed pool of threads that are responsible
> for the actual fetching of bytes from the disk, without blocking the
> fragments that already have some data available to work on.
>
> The "store.parquet.reader.pagereader.buffersize" might be of interest.
> The default for this is 4MB and can be tuned to match the parquet page size
> (usually 1MB). This can reduce memory pressure and improve the pipeline
> behavior.
>
> Apart from this, the primary factors affecting your query performance are
> the number of cores (which is what you seem to be tuning) and memory.
> By design, the parallelization level is a function of the num-of-cores.
> From the look of things, that is helping. You can try further
> tuning it with this:
> planner.width.max_per_node (default is 70% of num-of-cores)
>
> For memory,
> planner.memory.max_query_memory_per_node (default is 2GB)
>
>
> This is where you'll find more about this:
> https://drill.apache.org/docs/performance-tuning/
>
> ~ Kunal
>
> -Original Message-
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 1:06 PM
> To: user@drill.apache.org
> Subject: RE: Drill performance tuning parquet
>
> I did not partition the data when I created the parquet files (CTAS
> without a PARTITION BY).
>
> Here is the file list.
>
> Thank you.
>
>
> [dholmes@ip-10-20-49-40 sales_p]$ ll
> total 1021372
> -rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
> -rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
> -rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet
>
> Dan Holmes | Revenue Analytics, Inc.
> Direct: 770.859.1255
> www.revenueanalytics.com
>
> -Original Message-
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 3:59 PM
> To: user@drill.apache.org
> Subject: Drill performance tuning parquet
>
> I am performance testing a single drill instance with different vCPU
> configurations in AWS.  I have parquet files on an EFS volume and use the
> same data for each EC2 instance.
>
> I have used 4 vCPUs, 8, and 16.  Drill performance is ~25 seconds, 15, and 12
> respectively.  I have not changed any of the options.  This is an
> out-of-the-box 1.11 installation.
>
> What Drill tuning options should I experiment with?  I have read
> https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is so
> technical that I can't consume it; it reads like the default options are
> the best ones.
>
> The query looks like this:
> SELECT store_key, SUM(sales_dollars) sd
> FROM dfs.root.sales_p
> GROUP BY store_key
> ORDER BY sd DESC
> LIMIT 10
>
> Dan Holmes | Architect | Revenue Analytics, Inc.
>
>


RE: Drill performance tuning parquet

2017-07-27 Thread Kunal Khatua
You haven't specified what kind of query you are running.

The Async Parquet Reader tuning should be more than sufficient in your use case, 
since you seem to be only processing 3 files.

The feature introduces a small fixed pool of threads that are responsible for 
the actual fetching of bytes from the disk, without blocking the fragments that 
already have some data available to work on. 

The "store.parquet.reader.pagereader.buffersize" might be of interest. The 
default for this is 4MB and can be tuned to match the parquet page size 
(usually 1MB). This can reduce memory pressure and improve the pipeline 
behavior. 

Apart from this, the primary factors affecting your query performance are the 
number of cores (which is what you seem to be tuning) and memory.  
By design, the parallelization level is a function of the num-of-cores. From 
the look of things, that is helping. You can try further tuning 
it with this:
planner.width.max_per_node (default is 70% of num-of-cores)

For memory, 
planner.memory.max_query_memory_per_node (default is 2GB)
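
These options can be set per session; a minimal sketch using the option names 
above (the values are illustrative only):

ALTER SESSION SET `planner.width.max_per_node` = 12;
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296; -- 4GB
ALTER SESSION SET `store.parquet.reader.pagereader.buffersize` = 1048576;   -- ~1MB, to match the page size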


This is where you'll find more about this:
https://drill.apache.org/docs/performance-tuning/

~ Kunal

-Original Message-
From: Dan Holmes [mailto:dhol...@revenueanalytics.com] 
Sent: Thursday, July 27, 2017 1:06 PM
To: user@drill.apache.org
Subject: RE: Drill performance tuning parquet 

I did not partition the data when I created the parquet files (CTAS without a 
PARTITION BY).

Here is the file list.

Thank you.


[dholmes@ip-10-20-49-40 sales_p]$ ll
total 1021372
-rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
-rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
-rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Dan Holmes [mailto:dhol...@revenueanalytics.com] 
Sent: Thursday, July 27, 2017 3:59 PM
To: user@drill.apache.org
Subject: Drill performance tuning parquet 

I am performance testing a single drill instance with different vCPU 
configurations in AWS.  I have parquet files on an EFS volume and use the 
same data for each EC2 instance.

I have used 4 vCPUs, 8, and 16.  Drill performance is ~25 seconds, 15, and 12 
respectively.  I have not changed any of the options.  This is an out-of-the-box 
1.11 installation.

What Drill tuning options should I experiment with?  I have read 
https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is so 
technical that I can't consume it; it reads like the default options are the 
best ones.

The query looks like this:
SELECT store_key, SUM(sales_dollars) sd
FROM dfs.root.sales_p
GROUP BY store_key
ORDER BY sd DESC
LIMIT 10

Dan Holmes | Architect | Revenue Analytics, Inc.



RE: Drill performance tuning parquet

2017-07-27 Thread Dan Holmes
I did not partition the data when I created the parquet files (CTAS without a 
PARTITION BY).

Here is the file list.

Thank you.


[dholmes@ip-10-20-49-40 sales_p]$ ll
total 1021372
-rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
-rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
-rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Dan Holmes [mailto:dhol...@revenueanalytics.com] 
Sent: Thursday, July 27, 2017 3:59 PM
To: user@drill.apache.org
Subject: Drill performance tuning parquet 

I am performance testing a single drill instance with different vCPU 
configurations in AWS.  I have parquet files on an EFS volume and use the 
same data for each EC2 instance.

I have used 4 vCPUs, 8, and 16.  Drill performance is ~25 seconds, 15, and 12 
respectively.  I have not changed any of the options.  This is an out-of-the-box 
1.11 installation.

What Drill tuning options should I experiment with?  I have read 
https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is so 
technical that I can't consume it; it reads like the default options are the 
best ones.

The query looks like this:
SELECT store_key, SUM(sales_dollars) sd
FROM dfs.root.sales_p
GROUP BY store_key
ORDER BY sd DESC
LIMIT 10

Dan Holmes | Architect | Revenue Analytics, Inc.



Drill performance tuning parquet

2017-07-27 Thread Dan Holmes
I am performance testing a single drill instance with different vCPU 
configurations in AWS.  I have parquet files on an EFS volume and use the 
same data for each EC2 instance.

I have used 4 vCPUs, 8, and 16.  Drill performance is ~25 seconds, 15, and 12 
respectively.  I have not changed any of the options.  This is an out-of-the-box 
1.11 installation.

What Drill tuning options should I experiment with?  I have read 
https://drill.apache.org/docs/asynchronous-parquet-reader/ but it is so 
technical that I can't consume it; it reads like the default options are the 
best ones.

The query looks like this:
SELECT store_key, SUM(sales_dollars) sd
FROM dfs.root.sales_p
GROUP BY store_key
ORDER BY sd DESC
LIMIT 10

Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com



Re: predicate push down behavior

2017-07-27 Thread Jinfeng Ni
Your interpretation of the physical plan is correct. The "IS NOT NULL"
predicate is pushed down to Oracle, while the aggregation / sort / limit are
handled by Drill, although they could be pushed down as well.  The JDBC
storage plugin rules may need some enhancements.
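
As a side note, a plan like the one you pasted can be regenerated at any time 
with Drill's EXPLAIN support; a minimal sketch using your query:

EXPLAIN PLAN FOR
SELECT `YEAR`, `MONTH`, SUM(PREP_IN) as rainfall
FROM yoda.SEA_PLATFORM.WEATHER_CF
WHERE prep_in IS NOT NULL
GROUP BY `YEAR`, `MONTH`
ORDER BY rainfall DESC
LIMIT 10;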

On Thu, Jul 27, 2017 at 5:15 AM, Dan Holmes 
wrote:

> I have the following query against an Oracle table.
> SELECT `YEAR`, `MONTH`, SUM(PREP_IN) as rainfall
> FROM yoda.SEA_PLATFORM.WEATHER_CF
> WHERE prep_in IS NOT NULL
> GROUP BY `YEAR`, `MONTH`
> ORDER BY rainfall DESC
> LIMIT 10
>
> I am attempting to determine what part of the query is pushed into
> Oracle.  In the physical plan I see the following (widen the window to see
> this more easily):
> 00-00Screen : rowType = RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0)
> MONTH, ANY rainfall): rowcount = 10.0, cumulative cost = {219.0 rows,
> 2789.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247714
> 00-01  Project(YEAR=[$0], MONTH=[$1], rainfall=[$2]) : rowType =
> RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount
> = 10.0, cumulative cost = {218.0 rows, 2788.589411415945 cpu, 0.0 io, 0.0
> network, 2376.0 memory}, id = 247713
> 00-02SelectionVectorRemover : rowType = RecordType(DECIMAL(0, 0)
> YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 10.0, cumulative cost
> = {218.0 rows, 2788.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory},
> id = 247712
> 00-03  Limit(fetch=[10]) : rowType = RecordType(DECIMAL(0, 0)
> YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 10.0, cumulative cost
> = {208.0 rows, 2778.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory},
> id = 247711
> 00-04SelectionVectorRemover : rowType = RecordType(DECIMAL(0,
> 0) YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 9.0, cumulative
> cost = {198.0 rows, 2738.589411415945 cpu, 0.0 io, 0.0 network, 2376.0
> memory}, id = 247710
> 00-05  TopN(limit=[10]) : rowType = RecordType(DECIMAL(0, 0)
> YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 9.0, cumulative cost =
> {189.0 rows, 2729.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id
> = 247709
> 00-06HashAgg(group=[{0, 1}], rainfall=[SUM($2)]) : rowType
> = RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, ANY rainfall):
> rowcount = 9.0, cumulative cost = {180.0 rows, 2610.0 cpu, 0.0 io, 0.0
> network, 2376.0 memory}, id = 247708
> 00-07  Project(YEAR=[$2], MONTH=[$3], PREP_IN=[$5]) :
> rowType = RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, DECIMAL(0, 0)
> PREP_IN): rowcount = 90.0, cumulative cost = {90.0 rows, 90.0 cpu, 0.0 io,
> 0.0 network, 0.0 memory}, id = 247707
> 00-08Jdbc(sql=[SELECT *
> FROM "SEA_PLATFORM"."WEATHER_CF"
> WHERE "PREP_IN" IS NOT NULL]) : rowType = RecordType(VARCHAR(10) LOCATION,
> TIMESTAMP(0) RECORDDATE, DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH,
> DECIMAL(0, 0) DAY, DECIMAL(0, 0) PREP_IN, DECIMAL(0, 0) TAVG_F, DECIMAL(0,
> 0) TMAX_F, DECIMAL(0, 0) TMIN_F): rowcount = 90.0, cumulative cost = {90.0
> rows, 90.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 247626
>
> I think, based on 00-08, that the predicate "PREP_IN IS NOT NULL" is what
> is pushed to Oracle.  Am I reading that correctly?  The aggregation,
> grouping, and limit are all handled by Drill?
>
> This is drill 1.10
>
>
> Dan Holmes | Architect | Revenue Analytics, Inc.
> 300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
> Direct: 770.859.1255 Cell: 404.617.3444
> www.revenueanalytics.com
>
>


Re: How much more information can error messages provide?

2017-07-27 Thread John Omernik
Hey Dan, I created the master issue for tracking improvements to error
messages here: https://issues.apache.org/jira/browse/DRILL-5692

If you want to create your individual error and link to it in your message,
we'll see how it works.  I like the idea of a holistic effort to improve
error messages!

John



On Thu, Jul 27, 2017 at 8:25 AM, John Omernik  wrote:

> I want to bump this up. I've had a number of troubleshooting times where
> getting more concise error messages would really help me deal with my data.
> It looks like Dan found verbose mode, but sometimes verbose isn't what we
> need, but concise. Hey Dan, maybe we could come up with a Jira that is
> the "tracker" JIRA for error messages that are not helpful. Then each time
> someone comes up with an error message that is frustrating or unhelpful, we
> could create an individual JIRA for that message and link it to the master
> JIRA. That would allow us to find, throughout the code, any error messages
> that are returning with limited information.
>
> My thought is this, in the Master JIRA (which I will be happy to create) I
> will explain the purpose of the Master JIRA, and how best to describe error
> messages that don't help the analyst (what information to include, showing
> the error with both normal and verbose mode, if the problem is resolved,
> what resolved the error etc).
>
> I think we should focus on errors, not issues. What I mean by that, is
> there is an error like "Null Pointer Exception" which is a horrible error
> in Java that is hard for a user to make heads or tails of, we shouldn't
> open a "bad error message" for each Null Pointer Exception issue that
> occurs. Instead, perhaps we could open one for the error "Null Pointer
> Exception" (regardless of what caused it) and provide details how we can
> better trap that and communicate it to the user... does that make sense?
>
> I will start working on a master JIRA for this.
>
>
>
> On Thu, Jul 27, 2017 at 8:19 AM, Dan Holmes 
> wrote:
>
>> With verbose mode on the following is the error message.  I guess this
>> confirms it is a date conversion but there isn't any help as to which file
>> or which expression (in this statement there is only one TO_DATE function
>> so it is obvious).
>>
>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
>> IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id:
>> 22129e51-acbd-41f2-84f5-721541d9843b on ubuntu:31010]
>> (java.lang.IllegalArgumentException) Invalid format: " "
>> org.joda.time.format.DateTimeFormatter.parseDateTime():899
>> org.joda.time.DateMidnight.parse():152 org.apache.drill.exec.test.gen
>> erated.ProjectorGen367.doEval():76 org.apache.drill.exec.test.gen
>> erated.ProjectorGen367.projectRecords():67 org.apache.drill.exec.physical
>> .impl.project.ProjectRecordBatch.doWork():199
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
>> org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>> org.apache.drill.exec.physical.impl.SingleSenderCreator$Sing
>> leSenderRootExec.innerNext():92 
>> org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
>> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
>> java.security.AccessController.doPrivileged():-2
>> javax.security.auth.Subject.doAs():422 
>> org.apache.hadoop.security.UserGroupInformation.doAs():1657
>> org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
>> org.apache.drill.common.SelfCleaningRunnable.run():38
>> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
>> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
>> java.lang.Thread.run():748
>>
>> Dan Holmes | Revenue Analytics, Inc.
>> Direct: 770.859.1255
>> www.revenueanalytics.com
>>
>> -Original Message-
>> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
>> Sent: Thursday, July 27, 2017 9:11 AM
>> To: user@drill.apache.org
>> Subject: How much more information can error messages provide?
>>
>> I am querying a 6GB data set in a directory.  There are 34 files in the
>> directory.  Here is the statement.
>>
>> CREATE TABLE dfs.root.`twm/sales_p` AS

Re: How much more information can error messages provide?

2017-07-27 Thread John Omernik
I want to bump this up. I've had a number of troubleshooting times where
getting more concise error messages would really help me deal with my data.
It looks like Dan found verbose mode, but sometimes verbose isn't what we
need, but concise. Hey Dan, maybe we could come up with a Jira that is
the "tracker" JIRA for error messages that are not helpful. Then each time
someone comes up with an error message that is frustrating or unhelpful, we
could create an individual JIRA for that message and link it to the master
JIRA. That would allow us to find, throughout the code, any error messages
that are returning with limited information.

My thought is this, in the Master JIRA (which I will be happy to create) I
will explain the purpose of the Master JIRA, and how best to describe error
messages that don't help the analyst (what information to include, showing
the error with both normal and verbose mode, if the problem is resolved,
what resolved the error etc).

I think we should focus on errors, not issues. What I mean by that, is
there is an error like "Null Pointer Exception" which is a horrible error
in Java that is hard for a user to make heads or tails of, we shouldn't
open a "bad error message" for each Null Pointer Exception issue that
occurs. Instead, perhaps we could open one for the error "Null Pointer
Exception" (regardless of what caused it) and provide details how we can
better trap that and communicate it to the user... does that make sense?

I will start working on a master JIRA for this.



On Thu, Jul 27, 2017 at 8:19 AM, Dan Holmes 
wrote:

> With verbose mode on the following is the error message.  I guess this
> confirms it is a date conversion but there isn't any help as to which file
> or which expression (in this statement there is only one TO_DATE function
> so it is obvious).
>
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id:
> 22129e51-acbd-41f2-84f5-721541d9843b on ubuntu:31010] 
> (java.lang.IllegalArgumentException)
> Invalid format: " " org.joda.time.format.DateTimeFormatter.parseDateTime():899
> org.joda.time.DateMidnight.parse():152 org.apache.drill.exec.test.
> generated.ProjectorGen367.doEval():76 org.apache.drill.exec.test.
> generated.ProjectorGen367.projectRecords():67 org.apache.drill.exec.
> physical.impl.project.ProjectRecordBatch.doWork():199
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.record.AbstractRecordBatch.next():119
> org.apache.drill.exec.record.AbstractRecordBatch.next():109
> org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
> org.apache.drill.exec.record.AbstractRecordBatch.next():162
> org.apache.drill.exec.physical.impl.BaseRootExec.next():104
> org.apache.drill.exec.physical.impl.SingleSenderCreator$
> SingleSenderRootExec.innerNext():92 org.apache.drill.exec.
> physical.impl.BaseRootExec.next():94 org.apache.drill.exec.work.
> fragment.FragmentExecutor$1.run():232 org.apache.drill.exec.work.
> fragment.FragmentExecutor$1.run():226 java.security.
> AccessController.doPrivileged():-2 javax.security.auth.Subject.doAs():422
> org.apache.hadoop.security.UserGroupInformation.doAs():1657
> org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
> org.apache.drill.common.SelfCleaningRunnable.run():38
> java.util.concurrent.ThreadPoolExecutor.runWorker():1149
> java.util.concurrent.ThreadPoolExecutor$Worker.run():624
> java.lang.Thread.run():748
>
> Dan Holmes | Revenue Analytics, Inc.
> Direct: 770.859.1255
> www.revenueanalytics.com
>
> -Original Message-
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 9:11 AM
> To: user@drill.apache.org
> Subject: How much more information can error messages provide?
>
> I am querying a 6GB data set in a directory.  There are 34 files in the
> directory.  Here is the statement.
>
> CREATE TABLE dfs.root.`twm/sales_p` AS
> SELECT to_date(columns[0], 'MM/dd/yy') transaction_date , CAST(columns[1]
> as INT) store_key , CAST(columns[2] AS BIGINT) item_code , columns[3]
> position_name , CAST(CASE WHEN columns[4] = 'NULL' THEN 0 ELSE columns[4]
> END as INT) quantity_per_sku , CAST(columns[5] as INT) sales_quantity,
> CAST(columns[6] as float) sales_dollars, columns[7] channel FROM
> dfs.root.`sales`
>
> I get the following error.  I was able to determine that columns[2] needed
> to be a BIGINT based on an error.

RE: How much more information can error messages provide?

2017-07-27 Thread Dan Holmes
With verbose mode on, the following is the error message.  I guess this confirms 
it is a date conversion, but there isn't any help as to which file or which 
expression (in this statement there is only one TO_DATE function, so it is 
obvious).
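
For reference, verbose output was enabled per session; a minimal sketch, 
assuming the standard option name:

ALTER SESSION SET `exec.errors.verbose` = true;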

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id: 
22129e51-acbd-41f2-84f5-721541d9843b on ubuntu:31010] 
(java.lang.IllegalArgumentException) Invalid format: " " 
org.joda.time.format.DateTimeFormatter.parseDateTime():899 
org.joda.time.DateMidnight.parse():152 
org.apache.drill.exec.test.generated.ProjectorGen367.doEval():76 
org.apache.drill.exec.test.generated.ProjectorGen367.projectRecords():67 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():199 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135 
org.apache.drill.exec.record.AbstractRecordBatch.next():162 
org.apache.drill.exec.record.AbstractRecordBatch.next():119 
org.apache.drill.exec.record.AbstractRecordBatch.next():109 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135 
org.apache.drill.exec.record.AbstractRecordBatch.next():162 
org.apache.drill.exec.record.AbstractRecordBatch.next():119 
org.apache.drill.exec.record.AbstractRecordBatch.next():109 
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91 
org.apache.drill.exec.record.AbstractRecordBatch.next():162 
org.apache.drill.exec.physical.impl.BaseRootExec.next():104 
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
 org.apache.drill.exec.physical.impl.BaseRootExec.next():94 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226 
java.security.AccessController.doPrivileged():-2 
javax.security.auth.Subject.doAs():422 
org.apache.hadoop.security.UserGroupInformation.doAs():1657 
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226 
org.apache.drill.common.SelfCleaningRunnable.run():38 
java.util.concurrent.ThreadPoolExecutor.runWorker():1149 
java.util.concurrent.ThreadPoolExecutor$Worker.run():624 
java.lang.Thread.run():748

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Dan Holmes [mailto:dhol...@revenueanalytics.com] 
Sent: Thursday, July 27, 2017 9:11 AM
To: user@drill.apache.org
Subject: How much more information can error messages provide?

I am querying a 6GB data set in a directory.  There are 34 files in the 
directory.  Here is the statement.

CREATE TABLE dfs.root.`twm/sales_p` AS
SELECT to_date(columns[0], 'MM/dd/yy') transaction_date
, CAST(columns[1] as INT) store_key
, CAST(columns[2] AS BIGINT) item_code
, columns[3] position_name
, CAST(CASE WHEN columns[4] = 'NULL' THEN 0 ELSE columns[4] END as INT) 
quantity_per_sku
, CAST(columns[5] as INT) sales_quantity, CAST(columns[6] as float) 
sales_dollars, columns[7] channel
FROM dfs.root.`sales`

I get the following error.  I was able to determine that columns[2] needed to 
be a BIGINT based on an error and that columns[4] sometimes had the word "NULL" 
in it.  In the case of NULL I had to run a statement across every column I was 
converting to INT (col1 = NULL or col2 = NULL ...) to figure out which one it 
was.  Now I have a similar problem with the TO_DATE() function.

Can the error message include the filename and record number?  Or at least the 
value of what it found and which expression failed?  Those kinds of clues would 
go a long way towards making this type of detective work significantly faster.  
The larger the dataset, the greater the savings would be.

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id: 
3557e893-e848-4503-bf6a-73110fbd6207 on ubuntu:31010]

v1.10

Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com

How much more information can error messages provide?

2017-07-27 Thread Dan Holmes
I am querying a 6GB data set in a directory.  There are 34 files in the 
directory.  Here is the statement.

CREATE TABLE dfs.root.`twm/sales_p` AS
SELECT to_date(columns[0], 'MM/dd/yy') transaction_date
, CAST(columns[1] as INT) store_key
, CAST(columns[2] AS BIGINT) item_code
, columns[3] position_name
, CAST(CASE WHEN columns[4] = 'NULL' THEN 0 ELSE columns[4] END as INT) 
quantity_per_sku
, CAST(columns[5] as INT) sales_quantity, CAST(columns[6] as float) 
sales_dollars, columns[7] channel
FROM dfs.root.`sales`

I get the following error.  I was able to determine that columns[2] needed to 
be a BIGINT based on an error and that columns[4] sometimes had the word "NULL" 
in it.  In the case of NULL I had to run a statement across every column I was 
converting to INT (col1 = NULL or col2 = NULL ...) to figure out which one it 
was.  Now I have a similar problem with the TO_DATE() function.
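
A sketch of the kind of hunt-the-bad-value query this takes today, reusing the 
columns[] indexes from the CTAS above (the LIKE pattern and LIMIT are 
illustrative only):

SELECT columns[0]
FROM dfs.root.`sales`
WHERE columns[0] NOT LIKE '__/__/__'  -- rows whose first column isn't shaped like MM/dd/yy
LIMIT 10;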

Can the error message include the filename and record number?  Or at least the 
value of what it found and which expression failed?  Those kinds of clues would 
go a long way towards making this type of detective work significantly faster.  
The larger the dataset, the greater the savings would be.

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id: 
3557e893-e848-4503-bf6a-73110fbd6207 on ubuntu:31010]

v1.10

Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com



RE: append data to already existing table saved in parquet format

2017-07-27 Thread Dan Holmes
I have done something similar.  Each directory represents a year and the 
content of each directory is a parquet file.  I SELECT at the directory level 
("FROM dfs.otp").  As I added directories, Drill just figured it out.  I didn't 
have to do anything special.  When I created the parquet files I used 
PARTITION BY `YEAR` (a sketch of that CTAS follows the listing below).  I can 
now use dir0 = 'year' or `year` = 2008 to get pruning.

dan@ubuntu:~/otp$ ls
1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008
dan@ubuntu:~/otp$ ls 2000
0_0_1.parquet
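
A sketch of the CTAS form I mean, with hypothetical source and target names 
(`otp_staging` would hold the raw data, `otp_p` the partitioned parquet):

CREATE TABLE dfs.`otp_p`
PARTITION BY (`YEAR`)
AS SELECT * FROM dfs.`otp_staging`;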

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Divya Gehlot [mailto:divya.htco...@gmail.com] 
Sent: Thursday, July 27, 2017 1:56 AM
To: user@drill.apache.org
Subject: Re: append data to already existing table saved in parquet format

Hi Paul,
Let me try your approach of CTAS and saving to a partitioned directory structure.

Thanks for the suggestion.

Thanks,
Divya

On 27 July 2017 at 11:57, Paul Rogers  wrote:

> Hi All,
>
> Saurabh, you are right. But, since Parquet does not allow appending to 
> existing files, we have to do the logical equivalent which is to 
> create a new Parquet file. For it to be part of the same “table” it 
> must be part of an existing partition structure as Divya described.
>
> The trick here is to choose a proper time grain. Too small and you end 
> up with a very large number of files, and performance will suffer. 
> (Once a second, for example, is too frequent.) Too slow and people 
> don’t get near-real-time results. But with the hour grain Divya is 
> using, the number of files will not be too large per directory, 
> and each file might be of a reasonable size.
>
> Using Kafka to batch data would be a fine idea.
>
> Of course, this is still not as good as Saurabh's former project, 
> Druid, which builds aggregated cubes on the fly and has a lambda 
> architecture to allow querying both immediate and historical data. 
> Still, Divya’s design can work fine for some use cases when latency is 
> not an issue and data volume is reasonable.
>
> It would help if Drill had INSERT INTO support. But, I wonder, can it 
> be made to work with Drill today? Perhaps the query can simply include 
> the proper target directory in the CTAS statement. That is, data for 
> 2017-07-25
> 02:00 would go into “2017/07/26/2000.parquet”, say. That is, 
> do-it-yourself partitioning. I hope Drill won’t care how the Parquet 
> files got into the directories, only that the directories have the 
> expected structure. (Is this accurate? Haven’t tried it myself…)
>
> With single-threaded, hourly updates, there is no worry about the name 
> collisions and other tricky issues that INSERT INTO will have to solve.
>
> Divya, have you tried this solution?
>
> Thanks,
>
> - Paul
>
> > On Jul 26, 2017, at 7:32 PM, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com> wrote:
> >
> > But append-only means you are adding event records to a table (forget 
> > the
> layout for a while). That means you have to write to the end of a 
> table. If the writes are too many, you have to batch them and then 
> convert them into a column format.
> >
> > This to me sounds like a Kafka workflow where you keep ingesting
> event data, then batch process it (or stream process it). Writing or 
> appending to a columnar store when your data is in a row-like format 
> does not sound efficient at all. I have not seen such a design in 
> systems that actually work. I know there are query engines that try to 
> do that but the use is limited. You cannot scale.
> >
> > I always think of Parquet or a columnar data store as the repository 
> > of
> historical data that came from the OLTP world. You do not want to 
> touch it once you created it. You want to have a strategy where you 
> batch the recent data, create the historical data and move on.
> >
> > My 2 cents.
> >
> > Saurabh
> >
> > On Jul 26, 2017, at 6:58 PM, Divya Gehlot 
> wrote:
> >
> > Yes Paul, I am looking for the insert-into-partition feature.
> > In this way we just have to create the file for that particular 
> > partition when new data comes in, or update it if required.
> > Otherwise, every time new data comes in we have to rerun the view and 
> > recreate the parquet files for the whole data set, which is very time 
> > consuming, especially when the data is being visualized in a real-time dashboard.
> >
> > Thanks,
> > Divya
> >
> >> On 27 July 2017 at 08:40, Paul Rogers  wrote:
> >>
> >> Hi Divya,
> >>
> >> Seems that you are asking for an “INSERT INTO” feature 
> >> (DRILL-3534). The idea would be to create new Parquet files into an 
> >> existing partition structure. That feature has not yet been 
> >> started. So, the workarounds provided might help you for now.
> >>
> >> - Paul
> >>
> >>> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <
> >> saurabhmahapatr...@gmail.com> wrote:
> >>>
> >>> Does Drill provide that 

predicate push down behavior

2017-07-27 Thread Dan Holmes
I have the following query against an Oracle table.
SELECT `YEAR`, `MONTH`, SUM(PREP_IN) as rainfall
FROM yoda.SEA_PLATFORM.WEATHER_CF
WHERE prep_in IS NOT NULL
GROUP BY `YEAR`, `MONTH`
ORDER BY rainfall DESC
LIMIT 10

I am attempting to determine what part of the query is pushed into Oracle.  In 
the physical plan I see the following (widen the window to see this more easily):
00-00  Screen : rowType = RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, 
ANY rainfall): rowcount = 10.0, cumulative cost = {219.0 rows, 
2789.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247714
00-01  Project(YEAR=[$0], MONTH=[$1], rainfall=[$2]) : rowType = 
RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 
10.0, cumulative cost = {218.0 rows, 2788.589411415945 cpu, 0.0 io, 0.0 
network, 2376.0 memory}, id = 247713
00-02  SelectionVectorRemover : rowType = RecordType(DECIMAL(0, 0) YEAR, 
DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 10.0, cumulative cost = {218.0 
rows, 2788.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247712
00-03  Limit(fetch=[10]) : rowType = RecordType(DECIMAL(0, 0) YEAR, 
DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 10.0, cumulative cost = {208.0 
rows, 2778.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247711
00-04  SelectionVectorRemover : rowType = RecordType(DECIMAL(0, 0) 
YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 9.0, cumulative cost = 
{198.0 rows, 2738.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 
247710
00-05  TopN(limit=[10]) : rowType = RecordType(DECIMAL(0, 0) YEAR, 
DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 9.0, cumulative cost = {189.0 
rows, 2729.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247709
00-06  HashAgg(group=[{0, 1}], rainfall=[SUM($2)]) : rowType = 
RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 
9.0, cumulative cost = {180.0 rows, 2610.0 cpu, 0.0 io, 0.0 network, 2376.0 
memory}, id = 247708
00-07  Project(YEAR=[$2], MONTH=[$3], PREP_IN=[$5]) : rowType = 
RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, DECIMAL(0, 0) PREP_IN): 
rowcount = 90.0, cumulative cost = {90.0 rows, 90.0 cpu, 0.0 io, 0.0 network, 
0.0 memory}, id = 247707
00-08  Jdbc(sql=[SELECT *
FROM "SEA_PLATFORM"."WEATHER_CF"
WHERE "PREP_IN" IS NOT NULL]) : rowType = RecordType(VARCHAR(10) LOCATION, 
TIMESTAMP(0) RECORDDATE, DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, DECIMAL(0, 0) 
DAY, DECIMAL(0, 0) PREP_IN, DECIMAL(0, 0) TAVG_F, DECIMAL(0, 0) TMAX_F, 
DECIMAL(0, 0) TMIN_F): rowcount = 90.0, cumulative cost = {90.0 rows, 90.0 cpu, 
0.0 io, 0.0 network, 0.0 memory}, id = 247626

I think, based on 00-08, that the predicate "PREP_IN IS NOT NULL" is what is 
pushed to Oracle.  Am I reading that correctly?  The aggregation, grouping, and 
limit are all handled by Drill?

This is drill 1.10


Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com



Re: regex replace in string

2017-07-27 Thread Khurram Faraaz
Drill docs mention that the REGEXP_REPLACE function uses Java regex patterns: 
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html


Link to REGEXP_REPLACE doc - 
https://drill.apache.org/docs/string-manipulation/#regexp_replace



From: Paul Rogers 
Sent: Thursday, July 27, 2017 11:38:42 AM
To: user@drill.apache.org
Subject: Re: regex replace in string

DRILL-4645: “Regex_replace() function is broken”?

I still wonder, the characters you are looking for are those used to format a 
“columns” column on output. Can you show a couple of lines of your CSV file?

Or, take a look at SELECT * FROM … LIMIT 10 to see if the data is being split 
into columns, or is somehow using the “columns” column…

That aside, the docs don’t state the regex syntax used in the REGEXP_REPLACE 
function. Is it Java regex? If so, then you’d have to format it according to 
the Java syntax ([1]). So:

‘\[|,|\.|]’

Note that “[“ and “.” are part of the regex syntax, so must be escaped.

- Paul

[1] https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

> On Jul 26, 2017, at 10:49 PM, Divya Gehlot  wrote:
>
> Hi,
> I have already set the plugin configuration to extractheader :true .
> and I followed the below link
> https://drill.apache.org/docs/lesson-2-run-queries-with-ansi-sql/
>
> SELECT REGEXP_REPLACE(CAST(`Column1` AS VARCHAR(100)), '[,".]', '') AS
> `Col1` FROM
> dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`
>
> Just extracting a column which has special characters, including the delimiter
> as one of the special characters, gives me an empty result set.
>
> Am I missing something ?
>
> Appreciate the help.
>
> Thanks,
> Divya
>



Re: regex replace in string

2017-07-27 Thread Paul Rogers
DRILL-4645: “Regex_replace() function is broken”?

I still wonder, the characters you are looking for are those used to format a 
“columns” column on output. Can you show a couple of lines of your CSV file?

Or, take a look at SELECT * FROM … LIMIT 10 to see if the data is being split 
into columns, or is somehow using the “columns” column…

That aside, the docs don’t state the regex syntax used in the REGEXP_REPLACE 
function. Is it Java regex? If so, then you’d have to format it according to 
the Java syntax ([1]). So:

‘\[|,|\.|]’

Note that “[“ and “.” are part of the regex syntax, so must be escaped.
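
Applied to the query below, the escaped pattern would look something like this 
(a sketch only; path and column name as in the original message):

SELECT REGEXP_REPLACE(CAST(`Column1` AS VARCHAR(100)), '\[|,|\.|]', '') AS `Col1`
FROM dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`;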

- Paul

[1] https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

> On Jul 26, 2017, at 10:49 PM, Divya Gehlot  wrote:
> 
> Hi,
> I have already set the plugin configuration to extractheader :true .
> and I followed the below link
> https://drill.apache.org/docs/lesson-2-run-queries-with-ansi-sql/
> 
> SELECT REGEXP_REPLACE(CAST(`Column1` AS VARCHAR(100)), '[,".]', '') AS
> `Col1` FROM
> dfs.`installedsoftwares/ApacheDrill/apache-drill-1.10.0.tar/apache-drill-1.10.0/sample-data/sample_data.csv`
> 
> Just extracting a column which has special characters, including the delimiter
> as one of the special characters, gives me an empty result set.
> 
> Am I missing something ?
> 
> Appreciate the help.
> 
> Thanks,
> Divya
>