Attachment hasn't come through. Can you upload the query profile to some
cloud storage and share a link to it?
Also, please share details on how large your dataset is, the number of
Drillbits, memory, and other configuration settings.
On Thu, Jun 1, 2017 at 10:18 PM, wrote:
>
Hi,
I am running a simple query that performs a JOIN between two Parquet files.
It takes around 3-4 seconds, and I noticed that 70% of the time is spent in
UNORDERED_RECEIVER.
Sample query is -
select sum(sales),week from
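(For illustration only - the file paths and column names below are made up,
not the actual dataset - the query has roughly this shape:)

  -- Hypothetical sketch: aggregation over a join of two Parquet files
  SELECT SUM(s.sales) AS total_sales, s.week
  FROM dfs.`/data/sales.parquet` s
  JOIN dfs.`/data/stores.parquet` st ON s.store_id = st.store_id
  GROUP BY s.week;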
Hi Muhammad,
> I have a couple of questions:
>
> 1. If I have multiple *SubScan*s to be executed, will each *SubScan* be
> handled by a single *Scan* operator? So whenever I have *n* *SubScan*s,
> I'll have *n* Scan operators distributed across Drill's cluster?
As Rahul explained,
Cool, thanks for confirming.
_
From: Raz Baluchi
Sent: Thursday, June 1, 2017 2:14 PM
Subject: Re: Parquet on S3
Setting fs.s3a.connection.maximum to 100 does fix the problem. No more
timeouts and very quick response. No need to 'prime' the query...
On Thu, Jun 1, 2017 at 4:08 PM, Abhishek Girish wrote:
> Can you take a look at [1] and let us know if that helps resolve
I noticed that if I precede the query with a select count(*) with the same
filters, I no longer experience timeouts. By 'priming' the query in this
way, the second query is also faster. This seems to be an acceptable
workaround, as it seems to allow me to essentially include all partitions
in
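(Roughly, the workaround looks like the following - the table path and the
partition-directory filters are made up for illustration; dir0/dir1 are
Drill's implicit directory columns:)

  -- Hypothetical 'priming' pattern: a cheap COUNT(*) with the same partition
  -- filters first, then the real query against the same partitions
  SELECT COUNT(*)
  FROM s3.`events` WHERE dir0 = '2016' AND dir1 = '06';

  SELECT event_id, event_ts, event_type
  FROM s3.`events` WHERE dir0 = '2016' AND dir1 = '06';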
I would first recommend you spend some time reading about the execution flow
inside Drill [1]. Try to understand specifically what major and minor
fragments are, and that different major fragments can have different levels
of parallelism.
Let us take a simple query which runs on a 2-node cluster:
select *
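(As a rough sketch - the table path below is made up - you can see the major
fragments Drill plans for such a query with EXPLAIN PLAN; in the text plan
every operator id is prefixed with its major fragment number (00-xx, 01-xx,
...), and each major fragment is then parallelized into minor fragments
across the two Drillbits:)

  -- Hypothetical example: inspect the plan and its major fragments
  EXPLAIN PLAN FOR
  SELECT *
  FROM dfs.`/data/events.parquet`
  WHERE event_type = 'login';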
Can you take a look at [1] and let us know if that helps resolve your issue?
[1]
https://drill.apache.org/docs/s3-storage-plugin/#quering-parquet-format-files-on-s3
On Thu, Jun 1, 2017 at 12:55 PM, Raz Baluchi wrote:
> Now that I have Drill working with parquet files on
Now that I have Drill working with parquet files on dfs, the next step was
to move the parquet files to S3.
I get pretty good performance - I can query for events by date range
within 10 seconds (out of a total of ~800M events across 25 years).
However, there seems to be some threshold beyond
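(The date-range queries are roughly of this shape - the table path and
column names are made up for illustration:)

  -- Hypothetical date-range query against the S3-backed Parquet data
  SELECT event_id, event_ts
  FROM s3.`events`
  WHERE event_ts BETWEEN DATE '2016-01-01' AND DATE '2016-03-31';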
Sorting the data by the partition column in the CTAS is normally a good plan:
not only does it sort the output by the most likely filter column, it also
limits the number of parquet files being written to a single stream per
partition. Drill can write data per fragment by partition, unless you
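(A rough sketch of that kind of CTAS - the workspace, table and column names
are made up for illustration:)

  -- Hypothetical CTAS: partition by year/month and sort by the partition
  -- columns so each partition is written out in as few Parquet files as possible
  CREATE TABLE dfs.tmp.`events_by_month`
  PARTITION BY (yr, mo) AS
  SELECT EXTRACT(YEAR FROM event_ts) AS yr,
         EXTRACT(MONTH FROM event_ts) AS mo,
         event_id, event_type, event_ts
  FROM dfs.`/data/events.parquet`
  ORDER BY yr, mo;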
I guess there is such a thing as over partitioning...
The query on the table partitioned by date spends most of the elapsed time
in the 'planning' phase, with execution time roughly equal to that of the
table partitioned by year and month.
Based on these results, I've added a third table
First of all, I was very happy to at last attend the hangouts meeting; I've
been trying to do so for quite some time.
I know I confused most of you during the meeting, but that's because my
requirements aren't crystal clear at the moment and I'm still learning what
Drill can do. Hopefully I learn
Hi Jinfeng,
Netflix already has this working in Presto with the current Parquet version,
so the fundamentals are all there.
I wish we had the resources to do this ourselves, as this is massively
important to us, and I would think that the performance gain is so
substantial that this would be of high value