Re: Merging files

2016-06-23 Thread Jinfeng Ni
This hash_distribute option should only matter when you have CTAS "partition by". If you do not do partition in CTAS, there should be no impact at all (in theory). Essentially, this option is to re-distribute the data according to the partition key, before Drill writes to target tables. See

Re: Is this normal view behavior?

2016-06-23 Thread Ted Dunning
On Thu, Jun 23, 2016 at 12:33 PM, John Omernik wrote: > Am I over thinking minutia again? :) > > I htink that this counts as basics, not minutiae.

Re: Is this normal view behavior?

2016-06-23 Thread Jinfeng Ni
Tried on a commit on 1.7.0-SNAPSHOT. Looks like I could not re-produce the problem. Which version are u using? create view dfs.tmp.myview as select dir0 as p_day, l_partkey, l_orderkey, l_suppkey from dfs.tmp.t2; +---+-+ | ok |

Re: Drill taking way too long to plan query

2016-06-23 Thread Tanmay Solanki
yes tables/stats/iad/mmdd is the structure and each of those directories is a day so when I run a query on tables/stats/iad/201604* I am running it on the whole month of april 2016. By refreshing metadata for tables/stats/iad, it is trying to do it for all of the data which goes back until

Re: Discussion: Comments in Drill Views

2016-06-23 Thread Ted Dunning
This is very interesting. I love docstrings in Lisp and Python and Javadoc in Java. Basically this is like that, but for SQL. Very helpful. On Thu, Jun 23, 2016 at 11:48 AM, John Omernik wrote: > I am looking for discussion here. A colleague was asking me how to add >

Question about querying array fields in parquet files

2016-06-23 Thread David Kincaid
I'm very new to Drill and just learning how everything works. I had a question about a query when one of the fields in an array (or list) of values. To simplify, I have a Parquet file of records where each record has just two fields. "name" is a string value and "lastName" is an array of strings.

Re: Drill taking way too long to plan query

2016-06-23 Thread Tanmay Solanki
Yeah, so I tried to cache the metadata for 1 day earlier and saw that that greatly improved the performance. Then I tried doing this for a full 1 month of data but unfortunately that was not allowed. I had to cache run "refresh table metadata" on the full iad folder since it does it by

Re: Drill taking way too long to plan query

2016-06-23 Thread Neeraja Rentachintala
You might want to enable metadata caching and see if it helps. https://drill.apache.org/docs/optimizing-parquet-metadata-reading/ On Thu, Jun 23, 2016 at 1:36 PM, Tanmay Solanki wrote: > Below is the plan. The amount of files is ~213000 files of parquet data. > >

Re: Drill taking way too long to plan query

2016-06-23 Thread Tanmay Solanki
Below is the plan. The amount of files is ~213000 files of parquet data. 0: jdbc:drill:> explain plan for select count(*) from s3.`tables/stats/iad/201604*/`; +--+--+ |   text | json | +--+--+ | 00-00    Screen 00-01  Project(EXPR$0=[$0]) 00-02   

Re: Is this normal view behavior?

2016-06-23 Thread Neeraja Rentachintala
This is a bug. On Thu, Jun 23, 2016 at 1:32 PM, rahul challapalli < challapallira...@gmail.com> wrote: > This looks like a bug. If you renamed the dir0 column as p_day, then you > should see that in sqlline as well. And I have never seen > "_DEFAULT_COL_TO_READ_" > before. Can you file a jira? >

Re: Is this normal view behavior?

2016-06-23 Thread rahul challapalli
This looks like a bug. If you renamed the dir0 column as p_day, then you should see that in sqlline as well. And I have never seen "_DEFAULT_COL_TO_READ_" before. Can you file a jira? - Rahul On Thu, Jun 23, 2016 at 12:33 PM, John Omernik wrote: > I have a table that is a

Is this normal view behavior?

2016-06-23 Thread John Omernik
I have a table that is a directory of parquet files, each row had say 3 columns, and the table is split into subdirectories that allow me to use dir0 partitioning. so if I select * from `table` I get col1, col2, col3, and dir0 as my fields returned. So if I create a view CREATE VIEW

Re: Merging files

2016-06-23 Thread John Omernik
It's basically a two level grouping that has a LEFT JOIN so select a.field1, a.field2, sum(b.somefield) as new_thing from table1 a LEFT JOIN table2 b on a.id = b.id where a.field1 = '2015-05-05' group by a.field1, b.field2 It's not a very complicated query, but it doesn't like the

Discussion: Comments in Drill Views

2016-06-23 Thread John Omernik
I am looking for discussion here. A colleague was asking me how to add comments to the metadata of a view. (He's new to Drill, thus the idea of not having metadata for a table is one he's warming up to). That got me thinking... why couldn't we use Drill Views to store table/field comments? This

Re: Merging files

2016-06-23 Thread Jinfeng Ni
I looked at the code. 1. Drill did log this long CanNotPlan msg in error level. 2) It was replaced with a much shorter version msg only when CanNotPlan was caused by cartesian join. I guess your query probably did not have cartesian join, and CanNotPlan was caused by other reasons. Either way, I

Drill taking way too long to plan query

2016-06-23 Thread Tanmay Solanki
I am trying to run a query on Apache drill to simply count the number of rows in a table stored in parquet format in S3. I am running this on a 20 node r3.8xlarge EC2 instance cluster and I have my direct memory set to 80GB, heap memory set to 32GB and set the planner.memory.max_memory_per_node

Re: Merging files

2016-06-23 Thread Jinfeng Ni
This "CannotPlanException" definitely is a bug in query planner. I thought we had put code to show that extremely long error msg "only" in debug mode. Looks like it's not that case. Could you please open a JIRA and post your query, if possible? thx. On Thu, Jun 23, 2016 at 10:45 AM, John Omernik

Re: Merging files

2016-06-23 Thread Ted Dunning
On Thu, Jun 23, 2016 at 10:41 AM, John Omernik wrote: > First of all, I feel like that as a "perhaps knows just enough to be > dangerous" power/intermediate user, it barely registered that I should, for > optimal performance, do something about 400 files. I saw them and almost

Re: Merging files

2016-06-23 Thread John Omernik
Jinfeng - I wrote my item prior to reading yours. Just an FYI, when I ran with that settting, I got a "CannotPlanException" (with an error that is easily the longest "non-verbose"( heck this beats all the verbose errors I've had) I've ever seen. I'd post it here, but I am not unsure if my Google

Re: Merging files

2016-06-23 Thread John Omernik
This worked perfectly. Thanks Jason. (It also made my small 1.5m per day table into 600K per day... so double win) So, I like this approach and will use it. It makes sense how it works, but obviously this is something I had to come to the Drill User group for, and you, being an expert on Drill

Re: Merging files

2016-06-23 Thread Jinfeng Ni
Do you partition by day in your CTAS? If that's the case, CTAS will produce at least one parquet file for each value of "day". If you have 100 days, then you will end up at least 100 files. However, in case the query is executed in distributed mode, there could be more than one file per value.

Re: Merging files

2016-06-23 Thread Jason Altekruse
Apply a sort in your CTAS, this will force the data down to a single stream before writing. Jason Altekruse Software Engineer at Dremio Apache Drill Committer On Thu, Jun 23, 2016 at 10:23 AM, John Omernik wrote: > When have a small query writing smaller data (like aggregate

Merging files

2016-06-23 Thread John Omernik
When have a small query writing smaller data (like aggregate tables for faster aggregates for Dashboards etc). It appears to write a ton of small files. Not sure why, maybe its just how the join worked out etc. I have a "day" that is 1.5M in total size, but 400 files total. This seems excessive.

Re: Issue with join query having same storage plugin name and same datatype

2016-06-23 Thread Magnus Pierre
Please look at the plan generated by Drill: explain plan with implementation for SELECT… (An idea is to test the generated SQL in the plan in the DB to see what it does) One idea is to try to put one of the tables in a sub select table where you actively rename the column. WITH s0 as (

Issue with join query having same storage plugin name and same datatype

2016-06-23 Thread SanjiV SwaraJ
I want to join two table using same storage plugin. But One Of the Column showing null value.I am using drill in window 8 system with latest version of drill(i.e;1.6) . I am using this query:- SELECT T2.ID AS T_ID,T1.ID AS T1_ID ,T1.ProviderID AS ProviderID,

Re: Pivot in Apache Drill

2016-06-23 Thread Sanjiv Kumar
I have tables : Sl No Name Status Time Taken 1 Description In Progress 2 2 StockNumber In Progress 3 3 SpecSheet Completed 3 I want to display all the row of NAME column(i.e:- Description. StockNumber, SpecSheet) as Column name in SQL SERVER using Drill. EXAMPLE:- Description

Re: DRILL-4199: Add Support for HBase 1.X - planning to merge

2016-06-23 Thread qiang li
Hi all, Just to report my test progress. As the SQL always throw schema changes exception. I reformatted my SQL, now it will not throw exception. I changed SQL *from * select CONVERT_FROM(BYTE_SUBSTR(`ref0`.row_key,-8,8),'BIGINT_BE') as uid, convert_from(`ref0`.`v`.`v`,'UTF8') as v from