I second Ted's suggestion!

Since we haven't seen your query profile's operator overview, we can't say for 
sure why the performance is poor. 

Off the top of my head, these are the most likely things happening that make 
your performance so bad:

1. All the CSV files are being read in full and rows rejected afterwards, 
because there is no way for Drill to know which segments of the data contain 
the time ranges you are looking at. 
2. Your CSV data has many columns, but you only care about a few; the CSV 
reader still has to parse the irrelevant ones too. 
3. There is a cost to reading and casting/converting the data into a date/time 
format. 
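
To illustrate point 3: a view over raw CSV typically has to repeat the 
text-to-timestamp conversion on every single query. A minimal sketch (the file 
path, column positions, and date format here are hypothetical):

```sql
-- With a view over CSV, Drill re-parses the text and re-runs this
-- TO_TIMESTAMP conversion on every query that touches event_time.
CREATE OR REPLACE VIEW dfs.tmp.`events_view` AS
SELECT TO_TIMESTAMP(columns[0], 'yyyy-MM-dd HH:mm:ss') AS event_time,
       columns[1] AS x,
       columns[2] AS y
FROM dfs.`/data/events/*.csv`;
```

Writing the data out once with the cast already applied (e.g. via CTAS to 
Parquet) pays that conversion cost a single time instead of per query.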

So, as Ted suggested, writing the data out as a Parquet file will give you the 
most bang for the buck. 
Partitioning on, say, a date helps, but you also don't want it too granular.
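
As a rough sketch of what that CTAS could look like (table names, paths, and 
column positions are all hypothetical; note that in Drill the PARTITION BY 
column must appear in the SELECT list):

```sql
-- One-time conversion: CSV -> Parquet, partitioned by date.
-- Partitioning by day (not hour/minute) keeps the file count manageable.
CREATE TABLE dfs.tmp.`events_parquet`
PARTITION BY (event_date)
AS
SELECT TO_TIMESTAMP(columns[0], 'yyyy-MM-dd HH:mm:ss') AS event_time,
       CAST(TO_TIMESTAMP(columns[0], 'yyyy-MM-dd HH:mm:ss') AS DATE)
         AS event_date,
       columns[1] AS x,
       columns[2] AS y
FROM dfs.`/data/events/*.csv`;
```

Queries that filter on event_date can then prune whole partitions instead of 
scanning all 16 GB.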

Last but not least, if you are running a query of the form

select X, Y, Z from <table> where time between <startTime> and <endTime>

you will benefit immensely from the data being sorted on that time field. 
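
Sorting can be done once, at write time. A minimal sketch (table and column 
names are hypothetical, assuming the Parquet table from the earlier CTAS):

```sql
-- Rewrite the Parquet data sorted on the time field, so Parquet
-- row-group min/max statistics let Drill skip non-matching ranges.
CREATE TABLE dfs.tmp.`events_sorted` AS
SELECT *
FROM dfs.tmp.`events_parquet`
ORDER BY event_time;
```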

Hope that helps. 

~ Kunal

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Monday, October 30, 2017 9:34 AM
To: user <user@drill.apache.org>
Subject: Re: Drill performance question

Also, on a practical note, Parquet will likely crush CSV on performance.
Columnar. Compressed. Binary.  All that.



On Mon, Oct 30, 2017 at 9:30 AM, Saurabh Mahapatra < 
saurabhmahapatr...@gmail.com> wrote:

> Hi Charles,
>
> Can you share some query patterns on this data? More specifically, the 
> number of columns you are retrieving out of the total, and the filter on 
> the time dimension itself (ranges and granularities).
>
> How much is ad hoc and how much is not.
>
> Best,
> Saurabh
>
> On Mon, Oct 30, 2017 at 9:27 AM, Charles Givre <cgi...@gmail.com> wrote:
>
> > Hello all,
> > I have a dataset consisting of about 16 GB of CSV files.  I am 
> > looking to do some time series analysis of this data, and created a 
> > view but when I started doing aggregate queries using components of 
> > the date, the performance was disappointing.  Would it be better to 
> > do a CTAS and partition by components of the date?  If so, would 
> > parquet be the best format?
> > Would anyone have other suggestions of things I could do to improve 
> > performance?
> > Thanks,
> > — C
>
