RE: Drill performance tuning parquet

2017-07-31 Thread Dan Holmes
When is it right to add nodes vs. adding CPUs?  Since my installation is on AWS, 
adding CPUs is relatively easy, so when does it make sense to add nodes instead?

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com] 
Sent: Friday, July 28, 2017 12:51 PM
To: user@drill.apache.org
Subject: RE: Drill performance tuning parquet

I also forgot to mention... within drill-override.conf there is a parameter 
you'd need to set to constrain the async parquet reader's scan pool size. If 
you have just 4 cores, the pool's 4 threads will compete with your other 
fragments for CPU. Of course, all of this depends on what the metrics in the 
query profile reveal. 
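
For reference, one way to confirm which scan-pool settings the Drillbit 
actually booted with is to query the sys.boot system table; the '%scan%' 
filter below is only a guess at how the relevant options are named.

-- List boot-level (drill-override.conf) settings whose names mention "scan";
-- the '%scan%' pattern is an assumption about the option naming.
SELECT *
FROM sys.boot
WHERE name LIKE '%scan%';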

-Original Message-
From: Jinfeng Ni [mailto:j...@apache.org]
Sent: Friday, July 28, 2017 7:41 AM
To: user <user@drill.apache.org>
Subject: Re: Drill performance tuning parquet

The numbers you posted seem to show that the query elapsed time is strongly 
affected by the number of scan minor fragments (the scan parallelization degree).

In Drill, the scan parallelization degree is capped at the minimum of the number 
of parquet row groups and 70% of the CPU cores. In your original configuration, 
since you only have 3 files, each with one row group, Drill will have at most 3 
scan minor fragments (you can confirm that by looking at the query profile).
With a decreased block size you get more parquet files, and hence a higher scan 
parallelization degree and better performance. In the case of 4 cores, the scan 
parallelization degree is capped at 70% of 4 cores (about 2), which probably 
explains why reducing the block size does not help.

The 900MB total parquet file size is relatively small. If you want to tune 
Drill for such a small dataset, you probably need a smaller parquet file size.
In the case of 4 cores, you may consider bumping up the following parameter.

`planner.width.max_per_node`
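
For example, a session-scoped change along these lines (the value 8 is purely 
illustrative and should be matched to the actual core count):

-- Illustrative only: raise the per-node parallelization cap for this session.
ALTER SESSION SET `planner.width.max_per_node` = 8;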





On Fri, Jul 28, 2017 at 7:03 AM, Dan Holmes <dhol...@revenueanalytics.com>
wrote:

> Thank you for the tips.  I have used 4 different block sizes.  It appears
> to scale linearly, and anything less than the 512MB block size performed 
> similarly.  I rounded the numbers to whole seconds.  The data is local to 
> the EC2 instance; I did not put the data on EFS.  I used the same data 
> files: after I created them the first time, I put them on S3 and copied 
> them to the other instances.
>
> If there are other configurations that someone is interested in, I 
> would be willing to try them out.  I have something to gain in that too.
>
> Here's the data for those interested.
>
> Query time in seconds, by instance (vCPU count) and Parquet block size (MB):
>
>                         64   128   256   512
> m3.xlarge  (16 vCPU)     6     6     5    12
> c3.2xlarge  (8 vCPU)    11    11    11    20
> c4.4xlarge  (4 vCPU)    20    20    20    20
>
>
> Dan Holmes | Revenue Analytics, Inc.
> Direct: 770.859.1255
> www.revenueanalytics.com
>
> -Original Message-
> From: Kunal Khatua [mailto:kkha...@mapr.com]
> Sent: Friday, July 28, 2017 2:38 AM
> To: user@drill.apache.org
> Subject: RE: Drill performance tuning parquet
>
> Look at the query profile's (in the UI) "operator  profiles - overview"
> section. The % Query Time is a good indicator of which operator 
> consumes the most CPU. Changing the planner.width.max_per_node 
> actually will affect this (positively or negatively, depending on the load).
>
> Within the same Operator Profile, also look at Average and Max Wait times.
> See if the numbers are unusually high.
>
> Scrolling further down (since you are working with parquet), the 
> Parquet RowGroup Scan operator can be expanded to show the minor 
> fragment (worker/leaf fragment) metrics. Since you have 3 files, you 
> will probably see only 3 entries, since each fragment will scan 1 
> row group in the parquet file. (I'm making the assumption that you have 
> only 1 rowgroup per file.) Just at the end of that table, you'll see 
> "OperatorMetrics", which gives you the time (in nanoseconds) and other 
> metrics for how these fragments are handed data by the pool of Async 
> Parquet Reader threads.
>
> Most likely, producing a larger number of Parquet files from CTAS 
> (i.e. with smaller file sizes) will help leverage the surplus CPU 
> capacity you have. For that, you'll need to tweak (and experiment) 
> with parquet block sizes of something less than 512MB. You could try 
> 128MB (or even 64MB). All of this depends on the nature of the data, 
> so some experimenting is needed from here on.
>
> Beyond that, see the memory (in "Fragment Profiles - Overview") , and 
> whether you need to bump up that setting.
>
> It is also possible that since you're running on AWS, the compute and 
> storage layers are not as tightly coupled as Athena is with their own 
> S3, which would make sense, since they need an incentive for users to 
> try Athena on their AWS infrastructure. :)

RE: Drill performance tuning parquet

2017-07-28 Thread Dan Holmes
Thank you for the tips.  I have used 4 different block sizes.  It appears to 
scale linearly, and anything less than the 512MB block size performed similarly.  
I rounded the numbers to whole seconds.  The data is local to the EC2 instance; 
I did not put the data on EFS.  I used the same data files: after I created 
them the first time, I put them on S3 and copied them to the other instances.

If there are other configurations that someone is interested in, I would be 
willing to try them out.  I have something to gain in that too.

Here's the data for those interested.

Query time in seconds, by instance (vCPU count) and Parquet block size (MB):

                        64   128   256   512
m3.xlarge  (16 vCPU)     6     6     5    12
c3.2xlarge  (8 vCPU)    11    11    11    20
c4.4xlarge  (4 vCPU)    20    20    20    20


Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com] 
Sent: Friday, July 28, 2017 2:38 AM
To: user@drill.apache.org
Subject: RE: Drill performance tuning parquet

Look at the query profile's (in the UI) "operator  profiles - overview" 
section. The % Query Time is a good indicator of which operator consumes the 
most CPU. Changing the planner.width.max_per_node actually will affect this 
(positively or negatively, depending on the load).

Within the same Operator Profile, also look at Average and Max Wait times. See 
if the numbers are unusually high. 

Scrolling further down (since you are working with parquet), the Parquet 
RowGroup Scan operator can be expanded to show the minor fragment (worker/leaf 
fragment) metrics. Since you have 3 files, you will probably see only 3 
entries, since each fragment will scan 1 row group in the parquet file. (I'm 
making the assumption that you have only 1 rowgroup per file.) Just at the end 
of that table, you'll see "OperatorMetrics", which gives you the time (in 
nanoseconds) and other metrics for how these fragments are handed data by the 
pool of Async Parquet Reader threads. 

Most likely, producing a larger number of Parquet files from CTAS (i.e. with 
smaller file sizes) will help leverage the surplus CPU capacity you have. For 
that, you'll need to tweak (and experiment) with parquet block sizes of 
something less than 512MB. You could try 128MB (or even 64MB). All of this 
depends on the nature of the data, so some experimenting is needed from here on. 
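
As a rough sketch of that experiment: the store.parquet.block-size option 
controls the row-group size Drill writes during CTAS; the 128MB value and the 
target table name below are illustrative only.

-- Write smaller row groups so more scan fragments can run in parallel.
ALTER SESSION SET `store.parquet.block-size` = 134217728;  -- 128 MB, in bytes

-- Hypothetical target table name; re-create the data from the existing table.
CREATE TABLE dfs.root.`sales_p_128` AS
SELECT * FROM dfs.root.`sales_p`;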

Beyond that, look at the memory usage (in "Fragment Profiles - Overview") and 
see whether you need to bump up that setting. 
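
For example (the 4GB value is illustrative only; the option takes bytes):

-- Give each query more memory per node; the value is in bytes (4 GB here).
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296;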

It is also possible that since you're running on AWS, the compute and storage 
layers are not as tightly coupled as Athena is with their own S3, which would 
make sense, since they need an incentive for users to try Athena on their AWS 
infrastructure. :)

Happy Drilling! 

-----Original Message-
From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
Sent: Thursday, July 27, 2017 6:23 PM
To: user@drill.apache.org
Subject: RE: Drill performance tuning parquet

Let's pretend there is only this one query, or a similar style of aggregation, 
perhaps with a WHERE clause.  I am trying to understand how to get more out of 
the Drill instance I have.  That set of parquet files is ~900MB.  It was 6.25 
GB as PSV.  There are 13 million records.

How can I tell whether I am IO bound and need more reader threads?  Would it 
be better if there were more files?

I don't think I am CPU bound, based on the stats that EC2 gave me.  I am using 
this example both to learn how to scale Drill and to understand it better.

We are considering using it for our text file processing as an exploratory 
tool, for ETL (since it will convert to parquet), and, because of its ability 
to join disparate data sources, as a DB layer for tools like Tableau.

Another tool we have considered is Athena.  It is crazy fast: that same query 
against the text files runs in ~3 seconds.  (My 4 vCPU Drill instance did the 
same query against S3 text files in 180 seconds.)  But it does have drawbacks.  
It only works on AWS, so my on-premise solutions would have to be designed 
differently.  

I don't need performance parity, but to do this right I need to understand 
Drill better.  That is the essence of this inquiry.

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Saurabh Mahapatra [mailto:saurabhmahapatr...@gmail.com]
Sent: Thursday, July 27, 2017 6:52 PM
To: user@drill.apache.org
Subject: Re: Drill performance tuning parquet

Hi Dan,

Here are some thoughts from my end.

So this is just one query and you have the numbers. But how about a 
representative collection? Do you have the use cases? Now, I know from 
experience that if you can predict the pattern of the queries to about 60%, 
that would be great. The rest could be ad hoc and you could plan for it.

For that 60%, it would be good to share some numbers along these lines:

1. SQL query, response time measured, response time expected, size of the 
   tables that are part of the query
2. Do you have any data skew?
3. What is the EC2 configuration you have: memory, CPU cores?

RE: Drill performance tuning parquet

2017-07-27 Thread Dan Holmes
Let's pretend there is only this one query, or a similar style of aggregation, 
perhaps with a WHERE clause.  I am trying to understand how to get more out of 
the Drill instance I have.  That set of parquet files is ~900MB.  It was 6.25 
GB as PSV.  There are 13 million records.

How can I tell whether I am IO bound and need more reader threads?  Would it 
be better if there were more files?

I don't think I am CPU bound, based on the stats that EC2 gave me.  I am using 
this example both to learn how to scale Drill and to understand it better.

We are considering using it for our text file processing as an exploratory 
tool, for ETL (since it will convert to parquet), and, because of its ability 
to join disparate data sources, as a DB layer for tools like Tableau.

Another tool we have considered is Athena.  It is crazy fast: that same query 
against the text files runs in ~3 seconds.  (My 4 vCPU Drill instance did the 
same query against S3 text files in 180 seconds.)  But it does have drawbacks.  
It only works on AWS, so my on-premise solutions would have to be designed 
differently.  

I don't need performance parity, but to do this right I need to understand 
Drill better.  That is the essence of this inquiry.

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Saurabh Mahapatra [mailto:saurabhmahapatr...@gmail.com] 
Sent: Thursday, July 27, 2017 6:52 PM
To: user@drill.apache.org
Subject: Re: Drill performance tuning parquet

Hi Dan,

Here are some thoughts from my end.

So this is just one query and you have the numbers. But how about a 
representative collection? Do you have the use cases? Now, I know from 
experience that if you can predict the pattern of the queries to about 60%, 
that would be great. The rest could be ad hoc and you could plan for it.

For that 60%, it would be good to share some numbers along these lines:

1. SQL query, response time measured, response time expected, size of the 
   tables that are part of the query
2. Do you have any data skew?
3. What is the EC2 configuration you have: memory, CPU cores?

So the approach would be to tune it for the entire set (which means you will 
end up trading off the various parameters) and then scale out. (Scale out is 
not cheap).

Thanks,
Saurabh

On Thu, Jul 27, 2017 at 1:37 PM, Kunal Khatua <kkha...@mapr.com> wrote:

> You haven't specified what kind of query are you running.
>
> The Async Parquet Reader tuning should be more than sufficient in your 
> use case, since you seem to be processing only 3 files.
>
> The feature introduces a small fixed pool of threads that are 
> responsible for the actual fetching of bytes from the disk, without 
> blocking the fragments that already have some data available to work on.
>
> The "store.parquet.reader.pagereader.buffersize" might be of interest.
> The default for this is 4MB and can be tuned to match the parquet page 
> size (usually 1MB). This can reduce memory pressure and improve the 
> pipeline behavior.
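
(As a session-scoped sketch of that change, with 1 MB chosen only because it 
matches the page size mentioned above:)

-- Shrink the async page reader's buffer to roughly the parquet page size.
ALTER SESSION SET `store.parquet.reader.pagereader.buffersize` = 1048576;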
>
> Apart from this, the primary factors affecting your query performance 
> are the number of cores (which is what you seem to be tuning) and memory.
> By design, the parallelization level is a function of the number of cores.
> From the look of things, that is helping. You can try 
> further tuning it with this:
> planner.width.max_per_node (default is 70% of num-of-cores)
>
> For memory,
> planner.memory.max_query_memory_per_node (default is 2GB)
>
>
> This is where you'll find more about this:
> https://drill.apache.org/docs/performance-tuning/
>
> ~ Kunal
>
> -Original Message-
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 1:06 PM
> To: user@drill.apache.org
> Subject: RE: Drill performance tuning parquet
>
> I did not partition the data when I created the parquet files (CTAS 
> without a PARTITION BY).
>
> Here is the file list.
>
> Thank you.
>
>
> [dholmes@ip-10-20-49-40 sales_p]$ ll
> total 1021372
> -rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
> -rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
> -rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet
>
> Dan Holmes | Revenue Analytics, Inc.
> Direct: 770.859.1255
> www.revenueanalytics.com
>
> -Original Message-
> From: Dan Holmes [mailto:dhol...@revenueanalytics.com]
> Sent: Thursday, July 27, 2017 3:59 PM
> To: user@drill.apache.org
> Subject: Drill performance tuning parquet
>
> I am performance testing a single Drill instance with different vCPU 
> configurations in AWS.  I have parquet files on an EFS volume and 
> use the same data for each EC2 instance.
>
> I have used 4 vCPUs, 8 and 16.  Drill performance is ~25 seconds, 15 and 12
> respectively.  

RE: Drill performance tuning parquet

2017-07-27 Thread Dan Holmes
I did not partition the data when I created the parquet files (CTAS without a 
PARTITION BY).

Here is the file list.

Thank you.


[dholmes@ip-10-20-49-40 sales_p]$ ll
total 1021372
-rw-rw-r-- 1 dholmes dholmes 393443418 Jul 27 19:05 1_0_0.parquet
-rw-rw-r-- 1 dholmes dholmes 321665234 Jul 27 19:06 1_1_0.parquet
-rw-rw-r-- 1 dholmes dholmes 330758061 Jul 27 19:06 1_2_0.parquet

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Dan Holmes [mailto:dhol...@revenueanalytics.com] 
Sent: Thursday, July 27, 2017 3:59 PM
To: user@drill.apache.org
Subject: Drill performance tuning parquet 

I am performance testing a single Drill instance with different vCPU 
configurations in AWS.  I have parquet files on an EFS volume and use the 
same data for each EC2 instance.

I have used 4 vCPUs, 8 and 16.  Drill performance is ~25 seconds, 15 and 12 
respectively.  I have not changed any of the options.  This is an 
out-of-the-box 1.11 installation.

What Drill tuning options should I experiment with?  I have read 
https://drill.apache.org/docs/asynchronous-parquet-reader/, but it is so 
technical that I can't fully digest it; it reads like the default options are 
the best ones.

The query looks like this:
SELECT store_key, SUM(sales_dollars) sd
FROM dfs.root.sales_p
GROUP BY store_key
ORDER BY sd DESC
LIMIT 10

Dan Holmes | Architect | Revenue Analytics, Inc.



Drill performance tuning parquet

2017-07-27 Thread Dan Holmes
I am performance testing a single Drill instance with different vCPU 
configurations in AWS.  I have parquet files on an EFS volume and use the 
same data for each EC2 instance.

I have used 4 vCPUs, 8 and 16.  Drill performance is ~25 seconds, 15 and 12 
respectively.  I have not changed any of the options.  This is an 
out-of-the-box 1.11 installation.

What Drill tuning options should I experiment with?  I have read 
https://drill.apache.org/docs/asynchronous-parquet-reader/, but it is so 
technical that I can't fully digest it; it reads like the default options are 
the best ones.
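
One way to see what those out-of-the-box settings actually are is to query 
Drill's sys.options table; the LIKE patterns below simply narrow the list to 
the parquet- and planner-related options discussed in this thread.

-- Inspect current option values before experimenting with any of them.
SELECT *
FROM sys.options
WHERE name LIKE '%parquet%'
   OR name LIKE 'planner.%';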

The query looks like this:
SELECT store_key, SUM(sales_dollars) sd
FROM dfs.root.sales_p
GROUP BY store_key
ORDER BY sd DESC
LIMIT 10

Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com
LinkedIn<https://www.linkedin.com/company/revenue-analytics-inc-> | 
Twitter<https://twitter.com/Rev_Analytics>



RE: How much more information can error messages provide?

2017-07-27 Thread Dan Holmes
With verbose mode on, the following is the error message.  I guess this 
confirms it is a date conversion failure, but there isn't any help as to which 
file or which expression caused it (in this statement there is only one TO_DATE 
function, so it is obvious).

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id: 
22129e51-acbd-41f2-84f5-721541d9843b on ubuntu:31010] 
(java.lang.IllegalArgumentException) Invalid format: " " 
org.joda.time.format.DateTimeFormatter.parseDateTime():899 
org.joda.time.DateMidnight.parse():152 
org.apache.drill.exec.test.generated.ProjectorGen367.doEval():76 
org.apache.drill.exec.test.generated.ProjectorGen367.projectRecords():67 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():199 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135 
org.apache.drill.exec.record.AbstractRecordBatch.next():162 
org.apache.drill.exec.record.AbstractRecordBatch.next():119 
org.apache.drill.exec.record.AbstractRecordBatch.next():109 
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135 
org.apache.drill.exec.record.AbstractRecordBatch.next():162 
org.apache.drill.exec.record.AbstractRecordBatch.next():119 
org.apache.drill.exec.record.AbstractRecordBatch.next():109 
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91 
org.apache.drill.exec.record.AbstractRecordBatch.next():162 
org.apache.drill.exec.physical.impl.BaseRootExec.next():104 
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
 org.apache.drill.exec.physical.impl.BaseRootExec.next():94 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226 
java.security.AccessController.doPrivileged():-2 
javax.security.auth.Subject.doAs():422 
org.apache.hadoop.security.UserGroupInformation.doAs():1657 
org.apache.drill.exec.work.fragment.FragmentExecutor.run():226 
org.apache.drill.common.SelfCleaningRunnable.run():38 
java.util.concurrent.ThreadPoolExecutor.runWorker():1149 
java.util.concurrent.ThreadPoolExecutor$Worker.run():624 
java.lang.Thread.run():748

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-----
From: Dan Holmes [mailto:dhol...@revenueanalytics.com] 
Sent: Thursday, July 27, 2017 9:11 AM
To: user@drill.apache.org
Subject: How much more information can error messages provide?

I am querying a 6GB data set in a directory.  There are 34 files in the 
directory.  Here is the statement.

CREATE TABLE dfs.root.`twm/sales_p` AS
SELECT to_date(columns[0], 'MM/dd/yy') transaction_date
, CAST(columns[1] as INT) store_key
, CAST(columns[2] AS BIGINT) item_code
, columns[3] position_name
, CAST(CASE WHEN columns[4] = 'NULL' THEN 0 ELSE columns[4] END as INT) 
quantity_per_sku
, CAST(columns[5] as INT) sales_quantity, CAST(columns[6] as float) 
sales_dollars, columns[7] channel
FROM dfs.root.`sales`

I get the following error.  I was able to determine that columns[2] needed to 
be a BIGINT based on an error and that columns[4] sometimes had the word "NULL" 
in it.  In the case of NULL I had to run a statement across every column I was 
converting to INT (col1 = NULL or col2 = NULL ...) to figure out which one it 
was.  Now I have a similar problem with the TO_DATE() function.

Can the error message include the filename and record number?  Or at least the 
value it found and which expression failed?  Those kinds of clues would go a 
long way towards making this type of detective work significantly faster, and 
the larger the dataset, the bigger the payoff.

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id: 
3557e893-e848-4503-bf6a-73110fbd6207 on ubuntu:31010]

v1.10

Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com
LinkedIn<https://www.linkedin.com/company/revenue-analytics-inc-> | 
Twitter<https://twitter.com/Rev_Analytics>

How much more information can error messages provide?

2017-07-27 Thread Dan Holmes
I am querying a 6GB data set in a directory.  There are 34 files in the 
directory.  Here is the statement.

CREATE TABLE dfs.root.`twm/sales_p` AS
SELECT to_date(columns[0], 'MM/dd/yy') transaction_date
, CAST(columns[1] as INT) store_key
, CAST(columns[2] AS BIGINT) item_code
, columns[3] position_name
, CAST(CASE WHEN columns[4] = 'NULL' THEN 0 ELSE columns[4] END as INT) 
quantity_per_sku
, CAST(columns[5] as INT) sales_quantity, CAST(columns[6] as float) 
sales_dollars, columns[7] channel
FROM dfs.root.`sales`

I get the following error.  I was able to determine that columns[2] needed to 
be a BIGINT based on an error and that columns[4] sometimes had the word "NULL" 
in it.  In the case of NULL I had to run a statement across every column I was 
converting to INT (col1 = NULL or col2 = NULL ...) to figure out which one it 
was.  Now I have a similar problem with the TO_DATE() function.

Can the error message include the filename and record number?  Or at least the 
value it found and which expression failed?  Those kinds of clues would go a 
long way towards making this type of detective work significantly faster, and 
the larger the dataset, the bigger the payoff.

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
IllegalArgumentException: Invalid format: " " Fragment 1:0 [Error Id: 
3557e893-e848-4503-bf6a-73110fbd6207 on ubuntu:31010]
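
Until the error itself carries that context, one workaround is to hunt for the 
offending rows directly. The query below assumes the implicit `filename` column 
available in recent Drill releases, and uses a crude NOT LIKE pattern as a 
stand-in for "does not look like MM/dd/yy".

-- Report which file and value break the TO_DATE conversion; the filter is a
-- rough heuristic, not an exact date validation.
SELECT filename, columns[0] AS bad_value
FROM dfs.root.`sales`
WHERE columns[0] NOT LIKE '%/%/%'
LIMIT 20;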

v1.10

Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com
LinkedIn<https://www.linkedin.com/company/revenue-analytics-inc-> | 
Twitter<https://twitter.com/Rev_Analytics>



RE: append data to already existing table saved in parquet format

2017-07-27 Thread Dan Holmes
I have done something similar.  Each directory represents a year and the 
content of each directory is a parquet file.  I SELECT at the directory level 
("FROM dfs.otp").  As I added directories, Drill just figured it out; I didn't 
have to do anything special.  When I created the parquet files I used 
PARTITION BY `YEAR`.  I can now filter on dir0 (the year directory) or on 
`year` = 2008 to get pruning.

dan@ubuntu:~/otp$ ls
1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008
dan@ubuntu:~/otp$ ls 2000
0_0_1.parquet
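
A rough sketch of that kind of CTAS (only `YEAR` and the otp directory come 
from the message above; the source table and the other column names are 
placeholders):

-- Write one parquet sub-directory per year; filters on `YEAR` (or dir0) can
-- then prune the directories that are not needed.
CREATE TABLE dfs.root.`otp` PARTITION BY (`YEAR`) AS
SELECT `YEAR`, flight_date, dep_delay   -- placeholder columns
FROM dfs.root.`otp_staging`;            -- placeholder source

-- Example of a query that can prune down to a single year directory.
SELECT COUNT(*) FROM dfs.root.`otp` WHERE `YEAR` = 2008;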

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Divya Gehlot [mailto:divya.htco...@gmail.com] 
Sent: Thursday, July 27, 2017 1:56 AM
To: user@drill.apache.org
Subject: Re: append data to already existing table saved in parquet format

Hi Paul,
Let me try your approach of CTAS and saving to a partitioned directory structure.

Thanks for the suggestion.

Thanks,
Divya

On 27 July 2017 at 11:57, Paul Rogers <prog...@mapr.com> wrote:

> Hi All,
>
> Saurabh, you are right. But, since Parquet does not allow appending to 
> existing files, we have to do the logical equivalent which is to 
> create a new Parquet file. For it to be part of the same “table” it 
> must be part of an existing partition structure as Divya described.
>
> The trick here is to choose a proper time grain. Too fine and you end 
> up with a very large number of files, and performance will suffer. 
> (Once a second, for example, is too frequent.) Too coarse and people 
> don’t get near-real-time results. But with the hour grain Divya is 
> using, the number of files per directory will not be too large, and 
> each file might be of a reasonable size.
>
> Using Kafka to batch data would be a fine idea.
>
> Of course, this is still not as good as Saurabh's former project, 
> Druid, which builds aggregated cubes on the fly and has a lambda 
> architecture to allow querying both immediate and historical data. 
> Still, Divya’s design can work fine for some use cases when latency is 
> not an issue and data volume is reasonable.
>
> It would help if Drill had INSERT INTO support. But, I wonder, can it 
> be made to work with Drill today? Perhaps the query can simply include 
> the proper target directory in the CTAS statement. That is, data for 
> 2017-07-25
> 02:00 would go into “2017/07/26/2000.parquet”, say. That is, 
> do-it-yourself partitioning. I hope Drill won’t care how the Parquet 
> files got into the directories, only that the directories have the 
> expected structure. (Is this accurate? Haven’t tried it myself…)
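
(A rough sketch of that do-it-yourself layout; the paths and the batch filter 
below are purely illustrative:)

-- Each hourly batch lands in its own dated sub-directory, so the directory
-- tree itself acts as the partition structure.
CREATE TABLE dfs.root.`events/2017/07/26/h02` AS
SELECT *
FROM dfs.root.`events_staging`           -- placeholder staging location
WHERE batch_hour = '2017-07-26 02:00';   -- placeholder batch filter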
>
> With single-threaded, hourly updates, there is no worry about the name 
> collisions and other tricky issues that INSERT INTO will have to solve.
>
> Divya, have you tried this solution?
>
> Thanks,
>
> - Paul
>
> > On Jul 26, 2017, at 7:32 PM, Saurabh Mahapatra <
> saurabhmahapatr...@gmail.com> wrote:
> >
> > But append-only means you are adding event records to a table (forget 
> > the layout for a while). That means you have to write to the end of a 
> > table. If the writes are too many, you have to batch them and then 
> > convert them into a column format.
> >
> > This to me sounds like a Kafka workflow where you keep ingesting 
> > event data, then batch-process it (or stream-process it). Writing or 
> > appending to a columnar store when your data is in a row-like format 
> > does not sound efficient at all. I have not seen such a design in 
> > systems that actually work. I know there are query engines that try to 
> > do that, but their use is limited. You cannot scale.
> >
> > I always think of Parquet or a columnar data store as the repository 
> > of
> historical data that came from the OLTP world. You do not want to 
> touch it once you created it. You want to have a strategy where you 
> batch the recent data, create the historical data and move on.
> >
> > My 2 cents.
> >
> > Saurabh
> >
> > On Jul 26, 2017, at 6:58 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
> >
> > Yes Paul, I am looking for the INSERT INTO partition feature.
> > That way we just have to create the file for that particular 
> > partition when new data comes in, or whenever an update is required.
> > Otherwise, every time data comes in we have to run the view and recreate 
> > the parquet files for the whole data set, which is very time consuming, 
> > especially when the data is being visualized in a real-time dashboard.
> >
> > Thanks,
> > Divya
> >
> >> On 27 July 2017 at 08:40, Paul Rogers <prog...@mapr.com> wrote:
> >>
> >> Hi Divya,
> >>
> >> Seems that you are

predicate push down behavior

2017-07-27 Thread Dan Holmes
I have the following query against an Oracle table.
SELECT `YEAR`, `MONTH`, SUM(PREP_IN) as rainfall
FROM yoda.SEA_PLATFORM.WEATHER_CF
WHERE prep_in IS NOT NULL
GROUP BY `YEAR`, `MONTH`
ORDER BY rainfall DESC
LIMIT 10

I am attempting to determine what part of the query is pushed into Oracle.  In 
the physical plan I see the following (widen the window to read it more easily):
00-00Screen : rowType = RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, 
ANY rainfall): rowcount = 10.0, cumulative cost = {219.0 rows, 
2789.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247714
00-01  Project(YEAR=[$0], MONTH=[$1], rainfall=[$2]) : rowType = 
RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 
10.0, cumulative cost = {218.0 rows, 2788.589411415945 cpu, 0.0 io, 0.0 
network, 2376.0 memory}, id = 247713
00-02SelectionVectorRemover : rowType = RecordType(DECIMAL(0, 0) YEAR, 
DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 10.0, cumulative cost = {218.0 
rows, 2788.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247712
00-03  Limit(fetch=[10]) : rowType = RecordType(DECIMAL(0, 0) YEAR, 
DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 10.0, cumulative cost = {208.0 
rows, 2778.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247711
00-04SelectionVectorRemover : rowType = RecordType(DECIMAL(0, 0) 
YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 9.0, cumulative cost = 
{198.0 rows, 2738.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 
247710
00-05  TopN(limit=[10]) : rowType = RecordType(DECIMAL(0, 0) YEAR, 
DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 9.0, cumulative cost = {189.0 
rows, 2729.589411415945 cpu, 0.0 io, 0.0 network, 2376.0 memory}, id = 247709
00-06HashAgg(group=[{0, 1}], rainfall=[SUM($2)]) : rowType = 
RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, ANY rainfall): rowcount = 
9.0, cumulative cost = {180.0 rows, 2610.0 cpu, 0.0 io, 0.0 network, 2376.0 
memory}, id = 247708
00-07  Project(YEAR=[$2], MONTH=[$3], PREP_IN=[$5]) : rowType = 
RecordType(DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, DECIMAL(0, 0) PREP_IN): 
rowcount = 90.0, cumulative cost = {90.0 rows, 90.0 cpu, 0.0 io, 0.0 network, 
0.0 memory}, id = 247707
00-08Jdbc(sql=[SELECT *
FROM "SEA_PLATFORM"."WEATHER_CF"
WHERE "PREP_IN" IS NOT NULL]) : rowType = RecordType(VARCHAR(10) LOCATION, 
TIMESTAMP(0) RECORDDATE, DECIMAL(0, 0) YEAR, DECIMAL(0, 0) MONTH, DECIMAL(0, 0) 
DAY, DECIMAL(0, 0) PREP_IN, DECIMAL(0, 0) TAVG_F, DECIMAL(0, 0) TMAX_F, 
DECIMAL(0, 0) TMIN_F): rowcount = 90.0, cumulative cost = {90.0 rows, 90.0 cpu, 
0.0 io, 0.0 network, 0.0 memory}, id = 247626

I think, based on 00-08, that the predicate "PREP_IN IS NOT NULL" is what is 
pushed to Oracle.  Am I reading that correctly?  Are the aggregation, grouping, 
and limit all handled by Drill?

This is Drill 1.10.
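
For reference, one way to regenerate this kind of plan text, and to re-check 
what ends up inside the Jdbc scan after each change, is Drill's EXPLAIN 
statement:

-- Print the query plan, including the SQL pushed into the JDBC (Oracle) scan.
EXPLAIN PLAN FOR
SELECT `YEAR`, `MONTH`, SUM(PREP_IN) AS rainfall
FROM yoda.SEA_PLATFORM.WEATHER_CF
WHERE prep_in IS NOT NULL
GROUP BY `YEAR`, `MONTH`
ORDER BY rainfall DESC
LIMIT 10;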


Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com
LinkedIn<https://www.linkedin.com/company/revenue-analytics-inc-> | 
Twitter<https://twitter.com/Rev_Analytics>



RE: Add oracle as a storage fails: Please retry: error (unable to create/ update storage)

2017-07-21 Thread Dan Holmes
Turns out that after you put the jar in the directory, you have to restart Drill.

Works now.

Dan Holmes | Revenue Analytics, Inc.
Direct: 770.859.1255
www.revenueanalytics.com 

-Original Message-
From: Dan Holmes [mailto:dhol...@revenueanalytics.com] 
Sent: Friday, July 21, 2017 3:51 PM
To: user@drill.apache.org
Subject: Add oracle as a storage fails: Please retry: error (unable to create/ 
update storage)

I have followed the instructions here:  
https://drill.apache.org/docs/rdbms-storage-plugin/#Example-Oracle-Configuration

I get the error in the subject.  I don't see anything in the logs about it.

{
  type: "jdbc",
  enabled: true,
  driver: "oracle.jdbc.OracleDriver",
  url:"jdbc:oracle:thin:user/pwd@yoda:1523/ORCL"
}

I have the thin driver installed.
dan@ubuntu:~/apache-drill-1.10.0/jars/3rdparty$ ll o*
-rwxrwxr-x 1 dan dan 3698857 Jul 21 15:25 ojdbc7.jar*

I can telnet to the server.
dan@ubuntu:~/apache-drill-1.10.0/jars/3rdparty$ telnet yoda 1523
Trying 10.10.10.10...
Connected to yoda.RADOM.local.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

I don't know how to troubleshoot this further.

Thank you for any help.

Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com
LinkedIn<https://www.linkedin.com/company/revenue-analytics-inc-> | 
Twitter<https://twitter.com/Rev_Analytics>



Add oracle as a storage fails: Please retry: error (unable to create/ update storage)

2017-07-21 Thread Dan Holmes
I have followed the instructions here:  
https://drill.apache.org/docs/rdbms-storage-plugin/#Example-Oracle-Configuration

I get the error in the subject.  I don't see anything in the logs about it.

{
  type: "jdbc",
  enabled: true,
  driver: "oracle.jdbc.OracleDriver",
  url:"jdbc:oracle:thin:user/pwd@yoda:1523/ORCL"
}

I have the thin driver installed.
dan@ubuntu:~/apache-drill-1.10.0/jars/3rdparty$ ll o*
-rwxrwxr-x 1 dan dan 3698857 Jul 21 15:25 ojdbc7.jar*

I can telnet to the server.
dan@ubuntu:~/apache-drill-1.10.0/jars/3rdparty$ telnet yoda 1523
Trying 10.10.10.10...
Connected to yoda.RADOM.local.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

I don't know how to troubleshoot this further.

Thank you for any help.

Dan Holmes | Architect | Revenue Analytics, Inc.
300 Galleria Parkway, Suite 1900 | Atlanta, Georgia 30339
Direct: 770.859.1255 Cell: 404.617.3444
www.revenueanalytics.com
LinkedIn<https://www.linkedin.com/company/revenue-analytics-inc-> | 
Twitter<https://twitter.com/Rev_Analytics>



Index out of bounds for SELECT * from 'directory'

2017-07-13 Thread Dan Holmes
I am getting the following error with this query.

SELECT COUNT(*)
FROM dfs.`/home/dan/twm/sales`

version 1.10.0

All the files are .txt.  Here is the relevant part of the dfs storage plugin 
configuration:
"txt": {
  "type": "text",
  "extensions": [
"txt"
  ],
  "extractHeader": true,
  "delimiter": "|"

How do I diagnose this?

thank you
dan
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
IndexOutOfBoundsException: index: 32384, length: 4 (expected: range(0, 16384))
Fragment 1:0
[Error Id: 7cacf366-21dc-4528-9f4c-eda3c2e28a8b on ubuntu:31010]