Configuring Drill Memory Usage under Windows

2017-03-06 Thread David F. Severski
Greetings!

I'm a new user of Drill 1.9.0 under Windows 10 w/Java 1.8.0_121 (x64). I am
trying to configure drill-embedded to have more direct memory available to
it than the default 7GB I see when starting on my 32GB equipped
workstation. Uncommenting the DRILL_HEAP and DRILL_MAX_DIRECT_MEMORY
settings from `conf/drill-env.sh` and setting them to 16G has no effect
(value of direct_max via "select * from sys.memory;" is unchanged [7Gig]
after a restart).

General web searches and specific searches on Stack Overflow haven't turned
up any similar issues. What is the correct way to increase memory available
to drill when launching under Windows?

David


Re: Minimise query plan time for dfs plugin for local file system on tsv file

2017-03-06 Thread rahul challapalli
You can try the steps below. For each one, check the planning time
individually.

1. Run explain plan for a simple "select * from
`/scratch/localdisk/drill/testdata/Cust_1G_tsv`"
2. Replace the '*' in your query with explicit column names
3. Remove the extract header option from your storage plugin configuration
and the header row from your data files, then rewrite your query to use
columns[0_based_index] instead of explicit column names
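
The three checks above could be scripted roughly as follows. This is only a
sketch under stated assumptions: it assumes the Drill REST API is reachable
at http://localhost:8047 and accepts statements at /query.json, the function
names planning_probes and time_statement are made up for illustration, the
column names are the two that appear in the plan attached to the original
message, and the column indexes passed for the third check are placeholders,
since the thread does not say where those columns sit in the files.

```python
import json
import time
import urllib.request


def planning_probes(table_path, column_names, column_indexes):
    """Build one EXPLAIN PLAN statement per suggestion in the list above."""
    table = "dfs.`{}`".format(table_path)
    named = ", ".join("`{}`".format(name) for name in column_names)
    indexed = ", ".join("columns[{}]".format(i) for i in column_indexes)
    return [
        # 1. Plain select * with no filter.
        "EXPLAIN PLAN FOR SELECT * FROM {}".format(table),
        # 2. Explicit column names instead of '*'.
        "EXPLAIN PLAN FOR SELECT {} FROM {}".format(named, table),
        # 3. Positional access; only meaningful once header extraction is off.
        "EXPLAIN PLAN FOR SELECT {} FROM {}".format(indexed, table),
    ]


def time_statement(sql, base_url="http://localhost:8047"):
    """POST one statement to Drill's REST API and return wall-clock seconds.

    The base URL is an assumption; point it at your own Drillbit.
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    started = time.monotonic()
    with urllib.request.urlopen(request) as response:
        response.read()  # drain the response so the timing covers the full call
    return time.monotonic() - started
```

Calling time_statement on each string returned by planning_probes gives three
timings to compare. The measured time includes HTTP overhead, so it is the
difference between the variants that matters, not the absolute number.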

Also, how many columns do you have in your text files, and what is the size
of each file? As Gautam suggested, it would be good to take a look at the
drillbit.log file (from the foreman node where planning occurred) and the
query profile as well.

- Rahul

On Mon, Mar 6, 2017 at 9:30 AM, Gautam Parai  wrote:

> Can you please provide the drillbit.log file?
>
>
> Gautam
>
> 
> From: PROJJWAL SAHA 
> Sent: Monday, March 6, 2017 1:45:38 AM
> To: user@drill.apache.org
> Subject: Fwd: Minimise query plan time for dfs plugin for local file
> system on tsv file
>
> all, please help me in giving suggestions on what areas i can look into
> why the query planning time is taking so long for files which are local to
> the drill machines. I have the same directory structure copied on all the 5
> nodes of the cluster. I am accessing the source files using out of the box
> dfs storage plugin.
>
> Query planning time is approx 30 secs
> Query execution time is approx 1.5 secs
>
> Regards,
> Projjwal
>
> -- Forwarded message --
> From: PROJJWAL SAHA >
> Date: Fri, Mar 3, 2017 at 5:06 PM
> Subject: Minimise query plan time for dfs plugin for local file system on
> tsv file
> To: user@drill.apache.org
>
>
> Hello all,
>
> I am querying select * from dfs.xxx where yyy (filter condition)
>
> I am using dfs storage plugin that comes out of the box from drill on a
> 1GB file, local to the drill cluster.
> The 1GB file is split into 10 files of 100 MB each.
> As expected I see 11 minor and 2 major fragments.
> The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.
>
> One observation is that the query plan time is more than 30 seconds. I ran
> the explain plan query to validate this.
> The query execution time is 2 secs.
> total time taken is 32secs
>
> I wanted to understand how can i minimise the query plan time. Suggestions
> ?
> Is the time taken described above expected ?
> Attached is result from explain plan query
>
> Regards,
> Projjwal
>
>
>


Re: Explain Plan for Parquet data is taking a lot of timre

2017-03-06 Thread rahul challapalli
For an explanation of why we are rebuilding the metadata cache, take a
look at Padma's previous email. Most likely, there is a data change in the
folder. If not, we should not be refreshing the metadata cache, and it's a bug.

Drill currently does not do incremental metadata refreshes. Now let's say
you have a table "transactions" (with 100 partitions) and you added a new
partition to the "transactions" folder. For the subsequent query, Drill
automatically refreshes the metadata cache for all the partitions (even
though we only added 1 partition and nothing has changed in the remaining
partitions). This might explain why it's taking a long time.
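
To make the cost of a non-incremental refresh concrete, here is a toy model.
It is not Drill's code; the class ToyMetadataCache and its methods are
invented purely for illustration, and it assumes the only thing that matters
is how many partitions get re-read before planning can start.

```python
class ToyMetadataCache:
    """Simplified model: one cache per table, rebuilt in full when stale."""

    def __init__(self, partitions):
        self.partitions = set(partitions)  # partitions currently in the folder
        self.cached = set()                # partitions covered by the cache

    def add_partition(self, name):
        self.partitions.add(name)          # a data change makes the cache stale

    def plan_query(self):
        """Return how many partitions are re-read before planning can start."""
        if self.cached == self.partitions:
            return 0                       # cache is current: nothing to rebuild
        # No incremental path: every partition is read again.
        self.cached = set(self.partitions)
        return len(self.cached)


table = ToyMetadataCache("p{}".format(i) for i in range(100))
first = table.plan_query()    # initial cache build
second = table.plan_query()   # data unchanged
table.add_partition("p100")
third = table.plan_query()    # one partition added
print(first, second, third)
```

In this model the second query re-reads nothing, while the query after adding
a single partition re-reads all 101.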

- Rahul

On Mon, Mar 6, 2017 at 9:24 AM, Chetan Kothari 
wrote:

> Hi All
>
>
>
> Any inputs on this?
>
>
>
> Why should creating metadata files recursively take 1457445 ms when
> refresh metadata on this path is already done?
>
>
>
> Regards
>
> Chetan
>
>
>
>   - -Original Message-
> From: Jeena Vinod
> Sent: Sunday, March 5, 2017 10:44 PM
> To: user@drill.apache.org
> Subject: RE: Explain Plan for Parquet data is taking a lot of timre
>
>
>
> Re- attaching the log file as zip file.
>
>
>
> Regards
>
> Jeena
>
>
>
> -Original Message-
>
> From: Jeena Vinod
>
> Sent: Sunday, March 05, 2017 9:23 PM
>
> To: user@drill.apache.org
>
> Subject: RE: Explain Plan for Parquet data is taking a lot of timre
>
>
>
> Hi Kunal,
>
>
>
> Thanks for the response.
>
> Attaching the log with DEBUG enabled for the mentioned loggers. I had to
> trim the log for the query, since this mailer allows max 1MB.
>
>
>
> From the log files, the below step seems to be taking the most time. Since
> refresh metadata on this path is already done, I am unsure what this means.
>
>   -Creating metadata files recursively took 1457445 ms
>
>
>
> Also I have 4 core nodes and the planner.width.max_per_node value is
> currently 3.
>
> I tried with values 6 and 8, but did not see significant improvement in
> response time. How do we get the optimal value for this property on a
> cluster?
>
>
>
> Regards
>
> Jeena
>
>
>
> -Original Message-
>
> From: Kunal Khatua [mailto:kkha...@mapr.com]
>
> Sent: Thursday, March 02, 2017 7:25 AM
>
> To: user@drill.apache.org
>
> Subject: Re: Explain Plan for Parquet data is taking a lot of timre
>
>
>
> Hi Jeena
>
>
>
>
>
> The JSON profile does not reveal much about why planning took so long; it
> only gives you information from the physical plan and when planning
> approximately completed (7+min for 2node; 15+min for 5node).
>
> Drillbit logs, however, will give you more information. For this, you'll
> need to look in the log for lines like
>
> 2017-02-23 14:00:54,609 [27513143-8718-7a47-a2d4-06850755568a:foreman]
> DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - VOLCANO:Physical Planning
> (49588ms):
>
> You might need to edit your logback.xml to emit this information (by
> enabling DEBUG level logging for these classes).
>
> These are the recommended loggers you can enable DEBUG for:
>
> org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler
>
> org.apache.drill.exec.work.foreman.Foreman
>
> You can then share the drillbit log file from a run with these loggers
> enabled.
>
>
>
>
>
> Also, your profile JSONs indicate that you have a fairly slow underlying
> filesystem.
>
>
>
>
>
> I'm seeing an average of
>
>
>
> 3m45s to read 945K rows (2 node setup)
>
>
>
> and
>
>
>
> 2m25s to read 1.5M rows (5node setup)
>
>
>
>
>
> Your 2node setup shows 6 fragments processing 29 batches; while 5node
> setup shows 15 fragments processing 46 batches. For the number of rows, the
> amount of time spent is very high, which makes me believe that your
> filesystem (Oracle Cloud Storage service) is itself quite slow.
>
>
>
> For speeding up execution, you can try changing the
> planner.width.max_per_node to a higher value (like the number of cores on
> the node). This should increase parallelization and the utilization of all
> the cores by Drill.
>
>
>
>
>
> Kunal Khatua
>
>
>
> Engineering
>
>
>
> [MapR]
>
>
>
> www.mapr.com
>
>
>
> 
>
> From: Jeena Vinod <jeena.vi...@oracle.com>
>
> Sent: Tuesday, February 28, 2017 12:24:52 PM
>
> To: user@drill.apache.org
>
> Subject: RE: Explain Plan for Parquet data is taking a lot of timre
>
>
>
> Kindly let me know if there are any pointers on how to improve response time
> for parquet data here.
>
>
>
> Regards
>
> Jeena
>
>
>
> -Original Message-
>
> From: Jeena Vinod
>
> Sent: Tuesday, February 28, 2017 4:25 AM
>
> To: user@drill.apache.org
>
> Subject: RE: Explain Plan for Parquet data is taking a lot of timre
>
>
>
> Hi,
>
>
>
> I 

Re: Minimise query plan time for dfs plugin for local file system on tsv file

2017-03-06 Thread Gautam Parai
Can you please provide the drillbit.log file?


Gautam


From: PROJJWAL SAHA 
Sent: Monday, March 6, 2017 1:45:38 AM
To: user@drill.apache.org
Subject: Fwd: Minimise query plan time for dfs plugin for local file system on 
tsv file

all, please help me in giving suggestions on what areas i can look into why the 
query planning time is taking so long for files which are local to the drill 
machines. I have the same directory structure copied on all the 5 nodes of the 
cluster. I am accessing the source files using out of the box dfs storage 
plugin.

Query planning time is approx 30 secs
Query execution time is approx 1.5 secs

Regards,
Projjwal

-- Forwarded message --
From: PROJJWAL SAHA >
Date: Fri, Mar 3, 2017 at 5:06 PM
Subject: Minimise query plan time for dfs plugin for local file system on tsv 
file
To: user@drill.apache.org


Hello all,

I am querying select * from dfs.xxx where yyy (filter condition)

I am using dfs storage plugin that comes out of the box from drill on a 1GB 
file, local to the drill cluster.
The 1GB file is split into 10 files of 100 MB each.
As expected I see 11 minor and 2 major fragments.
The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.

One observation is that the query plan time is more than 30 seconds. I ran the 
explain plan query to validate this.
The query execution time is 2 secs.
total time taken is 32secs

I wanted to understand how can i minimise the query plan time. Suggestions ?
Is the time taken described above expected ?
Attached is result from explain plan query

Regards,
Projjwal




RE: Explain Plan for Parquet data is taking a lot of timre

2017-03-06 Thread Chetan Kothari
Hi All

 

Any inputs on this?

 

Why should creating metadata files recursively take 1457445 ms when refresh
metadata on this path is already done?

 

Regards

Chetan

 

  - -Original Message-
From: Jeena Vinod 
Sent: Sunday, March 5, 2017 10:44 PM
To: user@drill.apache.org
Subject: RE: Explain Plan for Parquet data is taking a lot of timre

 

Re- attaching the log file as zip file.

 

Regards

Jeena

 

-Original Message-

From: Jeena Vinod

Sent: Sunday, March 05, 2017 9:23 PM

To: user@drill.apache.org

Subject: RE: Explain Plan for Parquet data is taking a lot of timre

 

Hi Kunal, 

 

Thanks for the response. 

Attaching the log with DEBUG enabled for the mentioned loggers. I had to trim 
the log for the query, since this mailer allows max 1MB.

 

From the log files, the step below seems to be taking the most time. Since
refresh metadata on this path is already done, I am unsure what this means.

  -Creating metadata files recursively took 1457445 ms

 

Also I have 4 core nodes and the planner.width.max_per_node value is currently 
3.

I tried with values 6 and 8, but did not see significant improvement in 
response time. How do we get the optimal value for this property on a cluster?

 

Regards

Jeena

 

-Original Message-

From: Kunal Khatua [mailto:kkha...@mapr.com]

Sent: Thursday, March 02, 2017 7:25 AM

To: user@drill.apache.org

Subject: Re: Explain Plan for Parquet data is taking a lot of timre

 

Hi Jeena

 

 

The JSON profile does not reveal much about why planning took so long; it
only gives you information from the physical plan and when planning
approximately completed (7+min for 2node; 15+min for 5node).

Drillbit logs, however, will give you more information. For this, you'll need
to look in the log for lines like

2017-02-23 14:00:54,609 [27513143-8718-7a47-a2d4-06850755568a:foreman] DEBUG
o.a.d.e.p.s.h.DefaultSqlHandler - VOLCANO:Physical Planning (49588ms):

You might need to edit your logback.xml to emit this information (by
enabling DEBUG level logging for these classes).

These are the recommended loggers you can enable DEBUG for:

org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler

org.apache.drill.exec.work.foreman.Foreman

You can then share the drillbit log file from a run with these loggers
enabled.

 

 

Also, your profile JSONs indicate that you have a fairly slow underlying
filesystem.

 

 

I'm seeing an average of

 

3m45s to read 945K rows (2 node setup)

 

and

 

2m25s to read 1.5M rows (5node setup)

 

 

Your 2node setup shows 6 fragments processing 29 batches; while 5node setup 
shows 15 fragments processing 46 batches. For the number of rows, the amount of 
time spent is very high, which makes me believe that your filesystem (Oracle 
Cloud Storage service) is itself quite slow.

 

For speeding up execution, you can try changing the planner.width.max_per_node 
to a higher value (like the number of cores on the node). This should increase 
parallelization and the utilization of all the cores by Drill.

 

 

Kunal Khatua

 

Engineering

 

[MapR]

 

www.mapr.com

 



From: Jeena Vinod <jeena.vi...@oracle.com>

Sent: Tuesday, February 28, 2017 12:24:52 PM

To: user@drill.apache.org

Subject: RE: Explain Plan for Parquet data is taking a lot of timre

 

Kindly let me know if there are any pointers on how to improve response time
for parquet data here.

 

Regards

Jeena

 

-Original Message-

From: Jeena Vinod

Sent: Tuesday, February 28, 2017 4:25 AM

To: user@drill.apache.org

Subject: RE: Explain Plan for Parquet data is taking a lot of timre

 

Hi,

 

I have 2 Drill 1.9 installations. One is a 5 node 32GB cluster and other is a 2 
node 16GB cluster. And I am running the same query in both the places.

select * from `testdata` where  limit 100; testdata is 1GB 
uncompressed parquet data.

 

The query response time is found as below:

2 node cluster - 13min

5 node cluster - 19min

 

I was expecting 5 node cluster to be faster, but the results say otherwise.

In the query profile, as expected, 5 node cluster has more minor fragments, but 
still the scan time is higher. Attached the json profile for both.

Is this in any way related to the max batches/max records for row group scan?

 

Any suggestions on how we can get better response time in the 5 node cluster is 
appreciated.

 

Regards

Jeena

 

-Original Message-

From: Jeena Vinod

Sent: Sunday, February 26, 2017 2:22 AM

To: user@drill.apache.org

Subject: RE: Explain Plan for Parquet data is taking a lot of timre

 


Re: Metadata Caching

2017-03-06 Thread rahul challapalli
There is no need to refresh the metadata for every query. You only need to
generate the metadata cache once for each folder. Now if your data gets
updated, then any subsequent query you submit will automatically refresh
the metadata cache. Again you need not run the "refresh table metadata
" command  explicitly. Refer to [1] and ignore the reference
to "session" on that page.

[1] https://drill.apache.org/docs/optimizing-parquet-metadata-reading/

- Rahul



On Mon, Mar 6, 2017 at 7:49 AM, Chetan Kothari 
wrote:

> Hi All
>
>
>
> As I understand,  we can trigger generation of the Parquet Metadata Cache
> File by using REFRESH TABLE METADATA .
>
> It seems we need to run this command on a directory, nested or flat, once
> during the session.
>
>
>
> Why do we need to run it for every session? That implies that if I use the
> REST API to fire a query, I have to generate the meta-data cache file as
> part of every REST API call.
>
> This seems to be an issue, as I have seen that generation of the meta-data
> cache file takes significant time.
>
>
>
> Can't we define/configure  cache expiry time so that we can keep meta-data
> in cache for longer duration?
>
>
>
> Any inputs on this will be helpful.
>
>
>
> Regards
>
> Chetan
>
>
>


Metadata Caching

2017-03-06 Thread Chetan Kothari
Hi All

 

As I understand,  we can trigger generation of the Parquet Metadata Cache File 
by using REFRESH TABLE METADATA .

It seems we need to run this command on a directory, nested or flat, once 
during the session. 

 

Why do we need to run it for every session? That implies that if I use the
REST API to fire a query, I have to generate the meta-data cache file as part
of every REST API call.

This seems to be an issue, as I have seen that generation of the meta-data
cache file takes significant time.

 

Can't we define/configure  cache expiry time so that we can keep meta-data in 
cache for longer duration?

 

Any inputs on this will be helpful.

 

Regards

Chetan

 


Re: Discussion: Comments in Drill Views

2017-03-06 Thread John Omernik
I can see both sides. But Ted is right, this won't hurt anything from a
performance perspective; even if they put War and Peace in there 30 times,
that's 100MB of information to serve. People may choose to use formatting
languages like Markup or something. I do think we should have a limit so we
know what happens if someone tries to break that limit (from a security
perspective), but we could set that quite high, and then just test putting
data that exceeds that as a unit test.



On Fri, Mar 3, 2017 at 8:28 PM, Ted Dunning  wrote:

> All of War and Peace is only 3MB.
>
> Let people document however they want. Don't over-optimize for problems
> that have never occurred.
>
>
>
> On Fri, Mar 3, 2017 at 3:19 PM, Kunal Khatua  wrote:
>
> > It might be, in case someone begins to dump a massive design doc into the
> > comment field for a view's JSON.
> >
> >
> > I'm also not sure about how this information can be consumed. If it is
> > through CLI, either we rely on the SQLLine shell to trim the output, or
> not
> > worry at all. I'm assuming we'd also probably want something like a
> >
> > DESCRIBE VIEW ...
> >
> > to be enhanced to something like
> >
> > DESCRIBE VIEW WITH COMMENTARY ...
> >
> >
> > A 1KB field is quite generous IMHO. That's more than 7 tweets to describe
> > something ! [?]
> >
> >
> > Kunal Khatua
> >
> > 
> > From: Ted Dunning 
> > Sent: Friday, March 3, 2017 12:56:44 PM
> > To: user
> > Subject: Re: Discussion: Comments in Drill Views
> >
> > Is it really necessary to put a technical limit in to prevent people from
> > OVER-documenting views?
> >
> >
> > When was the last time you saw code that had too many comments in it?
> >
> >
> >
> > On Thu, Mar 2, 2017 at 8:42 AM, John Omernik  wrote:
> >
> > > So I think on your worry that's an easily definable "abuse"
> condition...
> > > i.e. if we set a limit of say 1024 characters, that provides ample
> space
> > > for descriptions, but at 1kb per view, that's an allowable condition,
> > i.e.
> > > it would be hard to abuse it ... or am I missing something?
> > >
> > > On Wed, Mar 1, 2017 at 8:08 PM, Kunal Khatua  wrote:
> > >
> > > > +1
> > > >
> > > >
> > > > > I think this can be very useful. The only worry is of someone abusing
> > > > > it, so we probably should have a limit on the size of this? Not sure
> > > > > how else it could be exposed and consumed.
> > > >
> > > >
> > > > Kunal Khatua
> > > >
> > > > Engineering
> > > >
> > > > [MapR]
> > > >
> > > > www.mapr.com
> > > >
> > > > 
> > > > From: John Omernik 
> > > > Sent: Wednesday, March 1, 2017 9:55:27 AM
> > > > To: user
> > > > Subject: Re: Discussion: Comments in Drill Views
> > > >
> > > > Sorry, I let this idea drop (I didn't follow up and found when
> > searching
> > > > for something else...)  Any other thoughts on this idea?
> > > >
> > > > Should I open a JIRA if people think it would be handy?
> > > >
> > > > On Thu, Jun 23, 2016 at 4:02 PM, Ted Dunning 
> > > > wrote:
> > > >
> > > > > This is very interesting.  I love docstrings in Lisp and Python and
> > > > Javadoc
> > > > > in Java.
> > > > >
> > > > > Basically this is like that, but for SQL. Very helpful.
> > > > >
> > > > > On Thu, Jun 23, 2016 at 11:48 AM, John Omernik 
> > > wrote:
> > > > >
> > > > > > I am looking for discussion here. A colleague was asking me how
> to
> > > add
> > > > > > comments to the metadata of a view.  (He's new to Drill, thus the
> > > idea
> > > > of
> > > > > > not having metadata for a table is one he's warming up to).
> > > > > >
> > > > > > That got me thinking... why couldn't we use Drill Views to store
> > > > > > table/field comments?  This could be a great way to help add
> > > contextual
> > > > > > information for users. Here's some current observations when I
> > issue
> > > a
> > > > > > describe view_myview
> > > > > >
> > > > > >
> > > > > > 1. I get three columns ,COLUMN_NAME, DATA_TYPE, and IS_NULLABLE
> > > > > > > 2. Even though the underlying parquet table has types, the view
> > > > > > > does not pass the types for the underlying parquet files through.
> > > > > > > (The type is ANY)
> > > > > > 3. The data for the view is all just a json file that could be
> > easily
> > > > > > extended.
> > > > > >
> > > > > >
> > > > > > So, a few things would be a nice to have
> > > > > >
> > > > > > 1. Table comments.  when I issue a describe table, if the view
> has
> > a
> > > > > > "Description" field, then having that print out as a description
> > for
> > > > the
> > > > > > whole view would be nice.  This is harder, I think because it's
> not
> > > > just
> > > > > > extending the view information.
> > > > > >
> > > > > > 2. Column comments:  A text field that could be added to the
> view,
> > > 

Re: [Drill 1.9.0] : [CONNECTION ERROR] :- (user client) closed unexpectedly. Drillbit down?

2017-03-06 Thread John Omernik
Have you tried disabling hash joins or hash agg on the query or changing
the planning width? Here are some docs to check out:

https://drill.apache.org/docs/configuring-resources-for-a-shared-drillbit/

https://drill.apache.org/docs/guidelines-for-optimizing-aggregation/

https://drill.apache.org/docs/sort-based-and-hash-based-memory-constrained-operators/

Let us know if any of these have an effect on the queries...

Also, the three links I posted here are query based changes, so an ALTER
SESSION should address them. On the suggestion above with memory, that
WOULD have to be made on all Drill bits running, and would require a
restart of the Drillbit to take effect.
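
The arithmetic behind the memory suggestion (quoted further down in this
thread) can be sketched as follows. This is a minimal sketch, assuming that
heap plus direct memory is a fair first approximation of what a Drillbit can
claim on a node; it ignores other JVM and operating system overhead, and the
function name memory_budget is made up for illustration. The figures are the
ones quoted in the thread: 32GB of RAM per node, the reported 16G heap and
20G direct memory, and the two suggested alternatives.

```python
def memory_budget(ram_gb, heap_gb, direct_gb):
    """Return (committed_gb, headroom_gb) for a single Drillbit node."""
    committed = heap_gb + direct_gb
    return committed, ram_gb - committed


# Settings taken from the thread; every node has 32GB of RAM.
settings = [
    ("reported (DRILL_HEAP=16G, DRILL_MAX_DIRECT_MEMORY=20G)", 16, 20),
    ("suggested", 8, 22),
    ("suggested with a distributed filesystem", 8, 20),
]

for label, heap_gb, direct_gb in settings:
    committed, headroom = memory_budget(32, heap_gb, direct_gb)
    verdict = "over-committed" if headroom < 0 else "fits"
    print("{}: {}GB committed, {}GB headroom, {}".format(
        label, committed, headroom, verdict))
```

The reported settings come out 4GB over the node's RAM, while the two
suggested settings leave 2GB and 4GB free for everything else on the node.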



On Sat, Mar 4, 2017 at 1:01 PM, Anup Tiwari 
wrote:

> Hi John,
>
> I have tried above config as well but still getting this issue.
> And please note that we were using similar configuration params for Drill
> 1.6 where this issue was not coming.
> Anything else which i can try?
>
> Regards,
> *Anup Tiwari*
>
> On Fri, Mar 3, 2017 at 11:01 PM, Abhishek Girish 
> wrote:
>
> > +1 on John's suggestion.
> >
> > On Fri, Mar 3, 2017 at 6:24 AM, John Omernik  wrote:
> >
> > > So your node has 32G of ram yet you are allowing Drill to use 36G.  I
> > would
> > > change your settings to be 8GB of Heap, and 22GB of Direct Memory. See
> if
> > > this helps with your issues.  Also, are you using a distributed
> > filesystem?
> > > If so you may want to allow even more free ram...i.e. 8GB of Heap and
> > 20GB
> > > of Direct.
> > >
> > > On Fri, Mar 3, 2017 at 8:20 AM, Anup Tiwari  >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Please find our configuration details :-
> > > >
> > > > Number of Nodes : 4
> > > > RAM/Node : 32GB
> > > > Core/Node : 8
> > > > DRILL_MAX_DIRECT_MEMORY="20G"
> > > > DRILL_HEAP="16G"
> > > >
> > > > And all other variables are set to default.
> > > >
> > > > Since we have tried some of the settings suggested above but still
> > facing
> > > > this issue more frequently, kindly suggest us what is best
> > configuration
> > > > for our environment.
> > > >
> > > > Regards,
> > > > *Anup Tiwari*
> > > >
> > > > On Thu, Mar 2, 2017 at 1:26 AM, John Omernik 
> wrote:
> > > >
> > > > > Another thing to consider is ensure you have a Spill Location
> setup,
> > > and
> > > > > then disable hashagg/hashjoin for the query...
> > > > >
> > > > > On Wed, Mar 1, 2017 at 1:25 PM, Abhishek Girish <
> agir...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Hey Anup,
> > > > > >
> > > > > > This is indeed an issue, and I can understand that having an
> > unstable
> > > > > > environment is not something anyone wants. DRILL-4708 is still
> > > > > unresolved -
> > > > > > hopefully someone will get to it soon. I've bumped up the
> priority.
> > > > > >
> > > > > > Unfortunately we do not publish any sizing guidelines, so you'd
> > have
> > > to
> > > > > > experiment to settle on the right load for your cluster. Please
> > > > decrease
> > > > > > the concurrency (number of queries running in parallel). And try
> > > > bumping
> > > > > up
> > > > > > Drill DIRECT memory. Also, please set the system options
> > recommended
> > > by
> > > > > > > Sudheesh. While this may not solve the issue, it may help reduce
> > > > > > > its occurrence.
> > > > > >
> > > > > > Can you also update the JIRA with your configurations, type of
> > > queries
> > > > > and
> > > > > > the relevant logs?
> > > > > >
> > > > > > -Abhishek
> > > > > >
> > > > > > On Wed, Mar 1, 2017 at 10:17 AM, Anup Tiwari <
> > > > anup.tiw...@games24x7.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > Can someone look into it? As we are now getting this more
> > > frequently
> > > > in
> > > > > > > Adhoc queries as well.
> > > > > > > And for automation jobs, we are moving to Hive as in drill we
> are
> > > > > getting
> > > > > > > this more frequently.
> > > > > > >
> > > > > > > Regards,
> > > > > > > *Anup Tiwari*
> > > > > > >
> > > > > > > On Sat, Dec 31, 2016 at 12:11 PM, Anup Tiwari <
> > > > > anup.tiw...@games24x7.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > We are getting this issue bit more frequently. can someone
> > please
> > > > > look
> > > > > > > > into it and tell us that why it is happening since as mention
> > in
> > > > > > earlier
> > > > > > > > mail when this query gets executed no other query is running
> at
> > > > that
> > > > > > > time.
> > > > > > > >
> > > > > > > > Thanks in advance.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > *Anup Tiwari*
> > > > > > > >
> > > > > > > > On Sat, Dec 24, 2016 at 10:20 AM, Anup Tiwari <
> > > > > > anup.tiw...@games24x7.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi Sudheesh,
> > > > > > > >>
> > > > > > > >> Please find below ans :-
> > > > > > > >>
> > > > > > > >> 

Fwd: Minimise query plan time for dfs plugin for local file system on tsv file

2017-03-06 Thread PROJJWAL SAHA
all, please help me in giving suggestions on what areas i can look into why
the query planning time is taking so long for files which are local to the
drill machines. I have the same directory structure copied on all the 5
nodes of the cluster. I am accessing the source files using out of the box
dfs storage plugin.

Query planning time is approx 30 secs
Query execution time is approx 1.5 secs

Regards,
Projjwal

-- Forwarded message --
From: PROJJWAL SAHA 
Date: Fri, Mar 3, 2017 at 5:06 PM
Subject: Minimise query plan time for dfs plugin for local file system on
tsv file
To: user@drill.apache.org


Hello all,

I am querying select * from dfs.xxx where yyy (filter condition)

I am using dfs storage plugin that comes out of the box from drill on a 1GB
file, local to the drill cluster.
The 1GB file is split into 10 files of 100 MB each.
As expected I see 11 minor and 2 major fragments.
The drill cluster is 5 nodes cluster with 4 cores, 32 GB  each.

One observation is that the query plan time is more than 30 seconds. I ran
the explain plan query to validate this.
The query execution time is 2 secs.
total time taken is 32secs

I wanted to understand how can i minimise the query plan time. Suggestions ?
Is the time taken described above expected ?
Attached is result from explain plan query

Regards,
Projjwal
+--+--+
| text | json |
+--+--+
| 00-00    Screen
00-01      Project(*=[$0])
00-02        UnionExchange
01-01          Project(T2¦¦*=[$0])
01-02            SelectionVectorRemover
01-03              Filter(condition=[AND(=($1, '41'), =($2, '568'))])
01-04                Project(T2¦¦*=[$0], ORDER_ID=[$1], CUSTOMER_ID=[$2])
01-05                  Scan(groupscan=[EasyGroupScan 
[selectionRoot=file:/scratch/localdisk/drill/testdata/Cust_1G_tsv, numFiles=10, 
columns=[`*`], files=[file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/4.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/5.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/10.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/2.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/3.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/1.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/7.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/6.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/8.tsv, 
file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/9.tsv]]])
 | {
  "head" : {
"version" : 1,
"generator" : {
  "type" : "ExplainHandler",
  "info" : ""
},
"type" : "APACHE_DRILL_PHYSICAL",
"options" : [ ],
"queue" : 0,
"resultMode" : "EXEC"
  },
  "graph" : [ {
"pop" : "fs-scan",
"@id" : 65541,
"userName" : "optitest",
"files" : [ "file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/4.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/5.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/10.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/2.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/3.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/1.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/7.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/6.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/8.tsv", 
"file:/scratch/localdisk/drill/testdata/Cust_1G_tsv/9.tsv" ],
"storage" : {
  "type" : "file",
  "enabled" : true,
  "connection" : "file:///",
  "config" : null,
  "workspaces" : {
"root" : {
  "location" : "/",
  "writable" : true,
  "defaultInputFormat" : null
},
"tpch9m" : {
  "location" : "/user/hive/warehouse/tpch9m.db",
  "writable" : true,
  "defaultInputFormat" : null
},
"taxi1m" : {
  "location" : "/user/hive/warehouse/taxi.db/taxi_enriched_sukhdeep_1m",
  "writable" : true,
  "defaultInputFormat" : null
},
"tmp" : {
  "location" : "/tmp",
  "writable" : true,
  "defaultInputFormat" : null
}
  },
  "formats" : {
"psv" : {
  "type" : "text",
  "extensions" : [ "tbl" ],
  "delimiter" : "|"
},
"csv" : {
  "type" : "text",
  "extensions" : [ "csv" ],
  "delimiter" : ","
},
"tsv" : {
  "type" : "text",
  "extensions" : [ "tsv" ],
  "extractHeader" : true,
  "delimiter" : "\t"
},
"parquet" : {
  "type" : "parquet"
},
"json" : {
  "type" : "json",
  "extensions" : [ "json" ]
},
"avro" : {
  "type" : "avro"
},
"sequencefile" : {
  "type" : "sequencefile",
  "extensions" : [ "seq" ]
},
"csvh" : {
  "type" : "text",
  "extensions" : [ "csvh" ],
  "extractHeader" : true,