Re: Minimise query plan time for dfs plugin for local file system on tsv file

Kunal Khatua Tue, 07 Mar 2017 10:58:47 -0800

The first two lines of the log show that the bulk of the time was lost 
before the query even reached the real planning stage:



2017-03-07 06:27:28,074 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman] INFO  
o.a.drill.exec.work.foreman.Foreman - Query text for query id 
274166de-f543-3fa7-ef9e-8e9e87d5d6a0: select columns[0] from 
dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv` where columns[0] 
='41' and columns[3] ='568'
2017-03-07 06:28:00,775 [274166de-f543-3fa7-ef9e-8e9e87d5d6a0:foreman] INFO  
o.a.d.exec.store.dfs.FileSelection - FileSelection.getStatuses() took 0 ms, 
numFiles: 1


More than 30 secs is unaccounted for. Can you turn on the root logger to be at 
the debug level and retry the explain plan?
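
For reference, the gap can be checked directly from the two timestamps quoted above. This is only a minimal Python sketch: it assumes the log timestamps follow the `date time,millis` layout seen in those two lines, and the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamp layout as it appears in the two quoted drillbit.log lines
# (assumed; a different logback pattern would need a different format).
FMT = "%Y-%m-%d %H:%M:%S,%f"

# First line: Foreman logs the query text.
query_received = datetime.strptime("2017-03-07 06:27:28,074", FMT)
# Second line: FileSelection.getStatuses() is reported.
file_selection = datetime.strptime("2017-03-07 06:28:00,775", FMT)

# Wall-clock time that elapsed between the two log entries.
gap_seconds = (file_selection - query_received).total_seconds()
print(f"Gap between the two log lines: {gap_seconds:.3f} s")
```

This works out to about 32.7 seconds between the query arriving at the foreman and the first file-selection message, which is the window the debug-level log should help explain.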


Kunal Khatua


________________________________
From: rahul challapalli <challapallira...@gmail.com>
Sent: Tuesday, March 7, 2017 5:24:43 AM
To: user
Subject: Re: Minimise query plan time for dfs plugin for local file system on 
tsv file

I did not get a chance to review the log file.

However the next thing I would try is to make your cluster a single node
cluster first and then run the same explain plan query separately on each
individual file.



On Mar 7, 2017 5:09 AM, "PROJJWAL SAHA" <proj.s...@gmail.com> wrote:

> Hi Rahul,
>
> Thanks for your suggestions. However, I am still not able to see any
> reduction in query planning time after using explicit column names,
> removing extract headers and using columns[index].
>
> As I said, I ran explain plan and it's taking 30+ secs for me.
> My data is 1 GB tsv split into 20 files of 5 MB each.
> Each 5MB file has close to 50k records.
> It's a 5 node cluster, and width per node is 4.
> Therefore, the total number of minor fragments for one major fragment is 20.
> I have copied the source directory to all the drillbit nodes.
>
> Can you tell me a reasonable time estimate within which I can expect Drill
> to return the result of the query in the scenario described above?
> Query is - select columns[0] from 
> dfs.root.`/scratch/localdisk/drill/testdata/Cust_1G_20_tsv`
> where columns[0] ='41' and columns[3] ='568'
>
> attached is the json profile
> and the drillbit.log
>
> I also have tracing enabled for:
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
> org.apache.drill.exec.work.foreman.Foreman
> However, the durations of the various steps I see in the logs are on the
> order of ms. I am not sure where the ~30 secs of planning time is consumed.
>
> Please help
>
> Regards,
> Projjwal
>
> On Mon, Mar 6, 2017 at 11:23 PM, rahul challapalli <
> challapallira...@gmail.com> wrote:
>
>> You can try the things below. For each of them, check the planning time
>> individually.
>>
>> 1. Run explain plan for a simple "select * from `
>> /scratch/localdisk/drill/testdata/Cust_1G_tsv`"
>> 2. Replace the '*' in your query with explicit column names
>> 3. Remove the extract header from your storage plugin configuration and
>> from your data files. Rewrite your query to use columns[0_based_index]
>> instead of explicit column names.
>>
>> Also how many columns do you have in your text files and what is the size
>> of each file? Like gautam suggested, it would be good to take a look at
>> drillbit.log file (from the foreman node where planning occurred) and the
>> query profile as well.
>>
>> - Rahul
>>
>> On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai <gpa...@mapr.com> wrote:
>>
>> > Can you please provide the drillbit.log file?
>> >
>> >
>> > Gautam
>> >
>> > ________________________________
>> > From: PROJJWAL SAHA <proj.s...@gmail.com>
>> > Sent: Monday, March 6, 2017 1:45:38 AM
>> > To: user@drill.apache.org
>> > Subject: Fwd: Minimise query plan time for dfs plugin for local file
>> > system on tsv file
>> >
>> > All, please help me with suggestions on which areas I can look into to
>> > find out why query planning is taking so long for files which are local
>> > to the drill machines. I have the same directory structure copied on all
>> > 5 nodes of the cluster. I am accessing the source files using the
>> > out-of-the-box dfs storage plugin.
>> >
>> > Query planning time is approx 30 secs
>> > Query execution time is approx 1.5 secs
>> >
>> > Regards,
>> > Projjwal
>> >
>> > ---------- Forwarded message ----------
>> > From: PROJJWAL SAHA <proj.s...@gmail.com<mailto:proj.s...@gmail.com>>
>> > Date: Fri, Mar 3, 2017 at 5:06 PM
>> > Subject: Minimise query plan time for dfs plugin for local file system
>> > on tsv file
>> > To: user@drill.apache.org<mailto:user@drill.apache.org>
>> >
>> >
>> > Hello all,
>> >
>> > I am querying select * from dfs.xxx where yyy (filter condition)
>> >
>> > I am using the dfs storage plugin that comes out of the box with Drill on
>> > a 1GB file, local to the drill cluster.
>> > The 1GB file is split into 10 files of 100 MB each.
>> > As expected I see 11 minor and 2 major fragments.
>> > The drill cluster is a 5-node cluster with 4 cores and 32 GB each.
>> >
>> > One observation is that the query plan time is more than 30 seconds. I
>> > ran the explain plan query to validate this.
>> > The query execution time is 2 secs.
>> > Total time taken is 32 secs.
>> >
>> > I wanted to understand how I can minimise the query plan time.
>> > Suggestions?
>> > Is the time taken described above expected?
>> > Attached is result from explain plan query
>> >
>> > Regards,
>> > Projjwal
>> >
>> >
>> >
>>
>
>
