Re: UNORDERED_RECEIVER taking 70% of query time

2017-06-01 Thread Abhishek Girish
The attachment hasn't come through. Can you upload the query profile to some cloud storage and share a link to it? Also, please share details on how large your dataset is, the number of Drillbits, memory, and other configuration.

UNORDERED_RECEIVER taking 70% of query time

2017-06-01 Thread jasbir.sing
Hi, I am running a simple query that performs a JOIN between two parquet files. It takes around 3-4 secs, and I noticed that 70% of the time is spent in UNORDERED_RECEIVER. Sample query is - select sum(sales),week from
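
A hypothetical sketch of the kind of query being described (a join between two parquet files with a weekly sales rollup); the file paths and column names below are placeholders, not the original poster's schema:

  -- Placeholder paths/columns; only the shape of the query matters here.
  SELECT w.week_label AS week,
         SUM(s.sales) AS total_sales
  FROM   dfs.`/data/sales.parquet` s
  JOIN   dfs.`/data/weeks.parquet` w
    ON   s.week_id = w.week_id
  GROUP BY w.week_label;

A join plus GROUP BY like this introduces exchanges between fragments, which is typically where UNORDERED_RECEIVER operators show up in the profile.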

Re: Pushing down Joins, Aggregates and filters, and data distribution questions

2017-06-01 Thread Paul Rogers
Hi Muhammad, > I have a couple of questions: > > 1. If I have multiple *SubScan*s to be executed, will each *SubScan* be > handled by a single *Scan* operator? So whenever I have *n* *SubScan*s, > I'll have *n* Scan operators distributed across Drill's cluster? As Rahul explained,
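
For anyone following along, one way to see how a query is broken into operators (including the Scan operators discussed here) is Drill's EXPLAIN support; the table path and filter below are placeholders:

  -- Shows the physical plan, including the scan and exchange operators.
  EXPLAIN PLAN FOR
  SELECT *
  FROM   dfs.`/data/events`
  WHERE  event_type = 'click';

The Web UI's query profile then shows how those operators are parallelized into minor fragments at run time.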

Re: Parquet on S3 - timeouts

2017-06-01 Thread Abhishek Girish
Cool, thanks for confirming.

Re: Parquet on S3 - timeouts

2017-06-01 Thread Raz Baluchi
Setting fs.s3a.connection.maximum to 100 does fix the problem. No more timeouts and very quick responses. No need to 'prime' the query...
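
For reference, a minimal sketch of where that setting can live, assuming the property is raised in Drill's conf/core-site.xml (depending on the Drill version it may also be possible to set it in the S3 storage-plugin config); the value 100 is the one reported above:

  <!-- Raise the S3A HTTP connection pool limit to avoid timeouts. -->
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
  </property>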

Re: Parquet on S3 - timeouts

2017-06-01 Thread Raz Baluchi
I noticed that if I precede the query with a select count(*) with the same filters, I no longer experience timeouts. By 'priming' the query in this way, the second query is also faster. This seems to be an acceptable workaround, as it seems to allow me to essentially include all partitions in
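
A sketch of the 'priming' pattern being described, with placeholder table, columns, and filter values:

  -- 1. Prime: cheap count with the same partition filters.
  SELECT COUNT(*)
  FROM   dfs.`/data/events`
  WHERE  event_date BETWEEN DATE '2017-01-01' AND DATE '2017-03-31';

  -- 2. Then run the real query with identical filters.
  SELECT event_type, COUNT(*) AS cnt
  FROM   dfs.`/data/events`
  WHERE  event_date BETWEEN DATE '2017-01-01' AND DATE '2017-03-31'
  GROUP BY event_type;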

Re: Pushing down Joins, Aggregates and filters, and data distribution questions

2017-06-01 Thread rahul challapalli
I would first recommend you spend some time reading about the execution flow inside Drill [1]. Try to understand specifically what major/minor fragments are, and that different major fragments can have different levels of parallelism. Let us take a simple query which runs on a 2-node cluster: select *
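
Rahul's worked example is truncated above, but a query of roughly this shape illustrates the point: the GROUP BY forces an exchange, so the plan typically has more than one major fragment, and each major fragment can be parallelized into a different number of minor fragments across the two nodes (path and columns are placeholders):

  -- Scan fragments read the parquet files in parallel; a separate major
  -- fragment completes the aggregation after a hash exchange.
  SELECT event_type, COUNT(*) AS cnt
  FROM   dfs.`/data/events`
  GROUP BY event_type;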

Re: Parquet on S3 - timeouts

2017-06-01 Thread Abhishek Girish
Can you take a look at [1] and let us know if that helps resolve your issue? [1] https://drill.apache.org/docs/s3-storage-plugin/#quering-parquet-format-files-on-s3

Parquet on S3 - timeouts

2017-06-01 Thread Raz Baluchi
Now that I have Drill working with parquet files on dfs, the next step was to move the parquet files to S3. I get pretty good performance - I can query for events by date range within 10 seconds (out of a total of ~800M events across 25 years). However, there seems to be some threshold beyond

Re: Partitioning for parquet

2017-06-01 Thread Andries Engelbrecht
Sorting the data by the partition column in the CTAS is normally a good plan: not only does it sort the output by the most likely filter column, it also limits the number of parquet files being written to a single stream per partition. Drill can write data per fragment by partition, unless you
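
A minimal sketch of that CTAS pattern, with placeholder table and column names, partitioning and sorting on the column most queries filter on:

  -- Partition by the likely filter column and sort by it so each writing
  -- fragment streams one partition at a time.
  CREATE TABLE dfs.tmp.`events_by_day`
  PARTITION BY (event_date)
  AS SELECT event_date, event_type, user_id, amount
     FROM dfs.`/data/events_raw`
     ORDER BY event_date;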

Re: Partitioning for parquet

2017-06-01 Thread Raz Baluchi
I guess there is such a thing as over-partitioning... The query on the table partitioned by date spends most of the elapsed time in the 'planning' phase, with the execution time being roughly equal to that of the table partitioned by year and month. Based on these results, I've added a third table
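
For comparison, a hypothetical coarser-grained layout along the lines being tested, deriving year and month columns and partitioning on those instead of the full date (all names are placeholders, not the poster's actual tables):

  -- Fewer, larger partitions generally mean fewer files and directories for
  -- the planner to enumerate during the planning phase.
  CREATE TABLE dfs.tmp.`events_by_month`
  PARTITION BY (event_year, event_month)
  AS SELECT event_year, event_month, event_date, event_type, amount
     FROM (
       SELECT EXTRACT(YEAR FROM event_date)  AS event_year,
              EXTRACT(MONTH FROM event_date) AS event_month,
              event_date, event_type, amount
       FROM dfs.`/data/events_raw`
     ) src
     ORDER BY event_year, event_month;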

Pushing down Joins, Aggregates and filters, and data distribution questions

2017-06-01 Thread Muhammad Gelbana
First of all, I was very happy to at last attend the hangouts meeting; I've been trying to do so for quite some time. I know I confused most of you during the meeting, but that's because my requirements aren't crystal clear at the moment and I'm still learning what Drill can do. Hopefully I learn

Re: Parquet filter pushdown and string fields that use dictionary encoding

2017-06-01 Thread Stefán Baxter
Hi Jinfeng, Netflix already has this working in Presto with the current Parquet version, so the fundamentals are all there. I wish we had the resources to do this ourselves, as this is massively important to us, and I would think that the performance gain is so substantial that this would be of high value