Re: [HANGOUT] Topics for 10/31/2017

2017-10-30 Thread Timothy Farkas
I'll speak about unit testing:
- Common mistakes made in unit tests
- The soon-to-be-merged temp directory test watcher classes
- Making our Travis build run smoke tests with code coverage
- Other misc improvements to be made
Thanks, Tim

Re: Drill performance question

2017-10-30 Thread Andries Engelbrecht
As Ted touched on, CSV requires a lot of text parsing and data type conversion, while Parquet doesn’t. That typically saves a ton of CPU when you query, even if you can’t leverage partitioning or other columnar advantages. --Andries
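A minimal sketch of the parsing cost Andries describes, assuming a headerless CSV that Drill exposes as the `columns` array (the paths and column positions are hypothetical, not from this thread):

    -- Over CSV: every field arrives as text and must be cast row by row
    SELECT CAST(columns[0] AS DATE) AS `date`,
           CAST(columns[1] AS INT)  AS user_id
    FROM   dfs.`/data/events_csv`;

    -- Over Parquet: types are stored with the data, so no per-row parsing
    SELECT `date`, user_id
    FROM   dfs.`/data/events_parquet`;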

Re: Drill performance question

2017-10-30 Thread Saurabh Mahapatra
As Kunal points out, Parquet and partitioning should help. The concept of partitioning is not new to warehouse-style queries. To get performance, you aggregate and partition at the granularity of the queries you expect: for example, by day, by week, by month, or even by quarter. The goal is to
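In Drill, that idea maps to CTAS with PARTITION BY. A sketch under assumed table and column names (none of these appear in the thread; Drill requires the partition column to appear in the select list):

    CREATE TABLE dfs.tmp.`events_by_day`
    PARTITION BY (event_day) AS
    SELECT CAST(`date` AS DATE) AS event_day,
           user_id, city, state
    FROM   dfs.tmp.`events_view`;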

RE: Drill performance question

2017-10-30 Thread Kunal Khatua
I second Ted's suggestion! Since we haven't seen your profile's operator overview, we can't say for sure why the performance isn't good. Off the top of my head, these are the most likely things making your performance so bad: 1. All the CSV files are being read and rows rejected
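A hedged illustration of that first point, using Drill's implicit dir0/dir1 directory columns (the directory layout and paths are assumptions):

    -- No partitioning: every CSV file is opened and parsed, and most
    -- rows are rejected only after the filter is evaluated
    SELECT COUNT(*)
    FROM   dfs.`/data/events_csv`
    WHERE  CAST(columns[0] AS DATE) >= DATE '2017-10-01';

    -- Directory-partitioned layout (/year/month): Drill can prune whole
    -- directories via dir0/dir1 before reading a single file
    SELECT COUNT(*)
    FROM   dfs.`/data/events_by_month`
    WHERE  dir0 = '2017' AND dir1 = '10';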

Re: looking for dotnet drivers to access the drill

2017-10-30 Thread Saurabh Mahapatra
Hi Ram, I am curious whether you are trying to access Drill from a .NET application. Is this a C# or Java app? I am not sure about your specifics, but you can most probably use the ODBC driver. See this: https://msdn.microsoft.com/en-us/library/ms228366(v=vs.90).aspx I do not know enough about what you

looking for dotnet drivers to access the drill

2017-10-30 Thread Duraisamy, Ram
Regards, Ram. Stanley Black & Decker, Inc. www.stanleyblackanddecker.com 701 East Joppa Rd., Towson, MD 21286 Business: 410-716-3167 Ext 23167 Mobile: 410-790-0677

Re: Drill performance question

2017-10-30 Thread Charles Givre
The data itself contains 6 or so columns: date, user_id, city, state, lat, long. I’m looking to aggregate by week, by day of week, etc. So the general pattern would look something like: SELECT EXTRACT( day FROM `date` ) AS `day`, COUNT( DISTINCT id ) AS distinct_id, COUNT( id ) as
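One plausible completion of that pattern, purely as a sketch (the GROUP BY, the table name, and the use of user_id are assumptions, not Charles's actual query):

    SELECT EXTRACT(day FROM `date`)  AS `day`,
           COUNT(DISTINCT user_id)   AS distinct_id,
           COUNT(user_id)            AS total_ids
    FROM   dfs.tmp.`events_view`   -- hypothetical view name
    GROUP  BY EXTRACT(day FROM `date`)
    ORDER  BY `day`;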

Re: Drill performance question

2017-10-30 Thread Ted Dunning
Also, on a practical note, Parquet will likely crush CSV on performance. Columnar. Compressed. Binary. All that.

Re: Drill performance question

2017-10-30 Thread Saurabh Mahapatra
Hi Charles, Can you share some query patterns on this data? More specifically: the number of columns you are retrieving out of the total, the filter on the time dimension itself (ranges and granularities), and how much is ad hoc and how much is not. Best, Saurabh

Drill performance question

2017-10-30 Thread Charles Givre
Hello all, I have a dataset consisting of about 16 GB of CSV files. I am looking to do some time-series analysis of this data, and I created a view, but when I started running aggregate queries using components of the date, the performance was disappointing. Would it be better to do a CTAS and
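For reference, the conversion being asked about would look roughly like this in Drill. A sketch with hypothetical paths and columns, since the real view and data are not shown in the thread:

    -- Write subsequent CTAS output as Parquet
    ALTER SESSION SET `store.format` = 'parquet';

    -- Materialize the CSV data into a typed Parquet table
    CREATE TABLE dfs.tmp.`events_parquet` AS
    SELECT CAST(columns[0] AS DATE)  AS `date`,
           columns[1]                AS user_id,
           columns[2]                AS city,
           columns[3]                AS state
    FROM   dfs.`/data/events_csv`;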

[HANGOUT] Topics for 10/31/2017

2017-10-30 Thread Gautam Parai
Hi, We will have a Drill hangout tomorrow (Tuesday Oct 31) at 10 AM Pacific Time. Please suggest topics by replying to this thread or bring them up during the hangout. Hangout link: https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc Thanks, Gautam