Re: Query Optimization

2017-08-16 Thread Divya Gehlot
Hi,
Another observation: my query had WHERE conditions based on the partition values.

Total number of Parquet files in the directory: 102290
Before metadata refresh: it was reading only 4 files
After metadata refresh: it is reading 102290 files

Is this how refresh metadata works? I mean, does it scan each and every file
to get the results?

I don't have access to the logs now.

Thanks,
Divya



Re: Query Optimization

2017-08-16 Thread Divya Gehlot
Hi,
Another observation: my query had WHERE conditions based on the partition values.
Before metadata refresh: it was reading only 4 files
After metadata refresh: it is reading 102290 files

Thanks,
Divya



Re: Query Optimization

2017-08-16 Thread Padma Penumarthy
Does your query have a partition filter?
Execution time most likely increased because partition pruning is not
happening.
Did you get a chance to look at the logs? That might give some clues.

Thanks,
Padma
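
To illustrate what partition pruning means here, the following is a minimal conceptual sketch in Python, not Drill's implementation. It assumes a directory-partitioned table, where each directory level under the table root acts as a partition value (Drill exposes such levels as dir0, dir1, and so on). The layout /data/sales/<year>/<month>/ and the helper names partition_values and prune are hypothetical. The point is only that a filter on partition values lets the planner skip files before any of them is opened.

```python
from pathlib import PurePosixPath


def partition_values(table_root: str, file_path: str) -> list[str]:
    """Return the directory levels between the table root and the file."""
    relative = PurePosixPath(file_path).relative_to(table_root)
    return list(relative.parts[:-1])  # drop the file name itself


def prune(table_root: str, files: list[str], wanted: dict[int, str]) -> list[str]:
    """Keep only files whose directory level i equals wanted[i]."""
    kept = []
    for path in files:
        levels = partition_values(table_root, path)
        if all(i < len(levels) and levels[i] == v for i, v in wanted.items()):
            kept.append(path)
    return kept


# Hypothetical layout: /data/sales/<year>/<month>/<file>.parquet
files = [
    "/data/sales/2016/12/part-0.parquet",
    "/data/sales/2017/07/part-0.parquet",
    "/data/sales/2017/08/part-0.parquet",
    "/data/sales/2017/08/part-1.parquet",
]

# Filter equivalent to: level 0 = '2017' AND level 1 = '08'
selected = prune("/data/sales", files, {0: "2017", 1: "08"})
print(len(selected), "of", len(files), "files would be read")
```

With the filter applied, only the two files under 2017/08 remain. If pruning does not happen, every file in the table is scanned, which would match a jump from a handful of files to the whole directory.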





Re: Query Optimization

2017-08-16 Thread Divya Gehlot
Hi,
I am surprised too.
I am running Drill version 1.10 on MapR enterprise version.
Query - selecting all the columns from a partitioned Parquet table

I observed a few things from the query statistics:

Value       Before Refresh Metadata   After Refresh Metadata
Fragments   1                         13
DURATION    01 min 0.233 sec          18 min 0.744 sec
PLANNING    59.818 sec                33.087 sec
QUEUED      Not Available             Not Available
EXECUTION   0.415 sec                 17 min 27.657 sec

The planning time was reduced by approx. 45% (from 59.818 sec to 33.087 sec),
but the execution time increased drastically.
I would like to understand why the execution time increases after the
metadata refresh.


Appreciate the help.

Thanks,
divya
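
As a quick check of the figures reported in this message, here is a small Python sketch that converts the duration strings to seconds and computes the relative change. The helper names to_seconds and percent_change are illustrative only. The input strings are copied from the statistics above.

```python
import re


def to_seconds(text: str) -> float:
    """Convert strings such as '18 min 0.744 sec' or '59.818 sec' to seconds."""
    match = re.fullmatch(r"(?:(\d+)\s*min\s*)?([\d.]+)\s*sec", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


planning_before = to_seconds("59.818 sec")
planning_after = to_seconds("33.087 sec")
execution_before = to_seconds("0.415 sec")
execution_after = to_seconds("17 min 27.657 sec")

print(f"planning change: {percent_change(planning_before, planning_after):.1f}%")
print(f"execution ratio: {execution_after / execution_before:.0f}x")
```

On these numbers the planning time drops by roughly 45%, while execution time grows by more than three orders of magnitude, so the execution phase dominates the regression.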




Re: Query Optimization

2017-08-16 Thread Padma Penumarthy
Refresh table metadata should help reduce query planning time.
It is odd that it went up after you did refresh table metadata.
Did you check the logs to see what is happening? You might have to
turn on debug logging if needed.
BTW, what version of Drill are you running?

Thanks,
Padma





Query Optimization

2017-08-16 Thread Divya Gehlot
Hi,
I have data in Parquet file format.
When I run the query and look at the execution plan, I see the following
statistics:

TOTAL FRAGMENTS: 1
DURATION: 01 min 0.233 sec
PLANNING: 59.818 sec
QUEUED: Not Available
EXECUTION: 0.415 sec

As it is a Parquet file format, I tried refreshing the metadata
and ran the command below:
REFRESH TABLE METADATA  ;
Then I ran the same query again on the same table and same data (no changes in
data) and got the statistics shown below:

TOTAL FRAGMENTS: 13
DURATION: 14 min 14.604 sec
PLANNING: 33.087 sec
QUEUED: Not Available
EXECUTION: Not Available

The query is still running.

Can somebody help me understand why the query takes so long once I issue
the refresh metadata command?

Appreciate the help!

Thanks,
Divya


RE: drill error connecting to Hbase

2017-08-16 Thread Shai Shapira
We re-installed Drill with a newer version (1.11), played a bit with the
configuration using the Web access, and made it work.

Thanks a lot for your help!!

Thanks,
Shai

-Original Message-
From: Dor Ben Dov 
Sent: Sunday, August 06, 2017 2:09 PM
To: user@drill.apache.org
Subject: RE: drill error connecting to Hbase

Hi Kunal,

I am assisting Shai with Drill. I followed your instructions, but when I
run Maven with the Cloudera profile ('cdh') I get this:
[dor@dor-fedora64 drill]$ mvn -U -DskipTests clean install -Pcdh
[INFO] Scanning for projects...
Downloading: https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/apache/14/apache-14.pom
Downloading: http://conjars.org/repo/org/apache/apache/14/apache-14.pom
Downloading: http://repository.mapr.com/maven/org/apache/apache/14/apache-14.pom
Downloading: http://repo.dremio.com/release/org/apache/apache/14/apache-14.pom
Downloading: http://repository.mapr.com/nexus/content/repositories/drill/org/apache/apache/14/apache-14.pom
Downloading: https://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
[ERROR] [ERROR] Some problems were encountered while processing the POMs:
[FATAL] Non-resolvable parent POM for org.apache.drill:drill-root:1.11.0: Could not transfer artifact org.apache:apache:pom:14 from/to cloudera (https://repository.cloudera.com/artifactory/cloudera-repos/): repository.cloudera.com: Name or service not known and 'parent.relativePath' points at wrong local POM @ line 15, column 11
 @
[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR]   The project org.apache.drill:drill-root:1.11.0 (/home/dor/Downloads/drill/pom.xml) has 1 error
[ERROR] Non-resolvable parent POM for org.apache.drill:drill-root:1.11.0: Could not transfer artifact org.apache:apache:pom:14 from/to cloudera (https://repository.cloudera.com/artifactory/cloudera-repos/): repository.cloudera.com: Name or service not known and 'parent.relativePath' points at wrong local POM @ line 15, column 11: Unknown host repository.cloudera.com: Name or service not known -> [Help 2]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
[ERROR] [Help 2] http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
[dor@dor-fedora64 drill]$


** I am using fedora 26 **

Regards,
Dor
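
The Maven output in this message ends with "Unknown host repository.cloudera.com: Name or service not known", which suggests the build machine could not resolve that hostname. That is an assumption drawn from the log text, not a confirmed diagnosis. A minimal Python sketch for checking name resolution from the same machine is below. The helper name can_resolve is illustrative, and the hostnames are the ones that appear in the log.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can turn hostname into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


# Hosts taken from the Maven output in this message.
for host in ("repository.cloudera.com", "repo.maven.apache.org"):
    status = "resolves" if can_resolve(host) else "does NOT resolve"
    print(f"{host}: {status}")
```

If neither host resolves, the problem is more likely DNS or proxy configuration on the build machine than the choice of Maven profile.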

-Original Message-
From: Kunal Khatua [mailto:kkha...@mapr.com]
Sent: Thursday, 03 August 2017 20:53
To: user@drill.apache.org
Subject: RE: drill error connecting to Hbase

The failure appears to be coming from this:

Caused by: java.lang.IllegalAccessError: Class org/apache/hadoop/hbase/zookeeper/MetaTableLocator illegally accessing "package private" member of class com/google/common/base/Stopwatch


Scrolling up a bit, I noticed that during the startup, there is an error here:
2017-08-03 11:04:58,957 [main] WARN  o.a.drill.exec.util.GuavaPatcher - Unable to patch Guava classes.
javassist.CannotCompileException: by java.lang.LinkageError: com.google.common.base.Stopwatch

When you build your Drill package, you can specify a profile. 

https://github.com/apache/drill/blob/master/pom.xml#L953

You can choose the platform you need Drill for and build with that. This tells 
Maven to apply specific versions of some dependencies that will work.

e.g.
mvn -U -DskipTests clean install -P

If there is an issue, let us know the specs of the platform you are building
against. It is possible that there have been upgrades to the dependencies
within these platforms that are not reflected in the pom.xml.



-Original Message-
From: Shai Shapira [mailto:shai.shap...@amdocs.com]
Sent: Thursday, August 03, 2017 1:13 AM
To: user@drill.apache.org; Kunal Khatua 
Subject: RE: drill error connecting to Hbase

Attached is the relevant part of sqlline.log. Hope it helps.


Thanks,
Shai


-Original Message-
From: Shai Shapira
Sent: Thursday, August 03, 2017 11:04 AM
To: kkha...@mapr.com
Cc: user@drill.apache.org
Subject: RE: drill error connecting to Hbase

Hi,

My versions are:
Hbase - 1.2.0   
Hive - 1.1.0 

I'll send the complete stack trace.

Is Drill so version sensitive?
Can I build a solution for production based on Drill? Or should I stick to what
comes with the Cloudera/Hortonworks distribution?

Thanks,
Shai


hbase shell
17/08/03 10:54:05 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 1.2.0-cdh5.8.2, rUnknown, Sun Sep 11 11:52:54 PDT 2016


hive shell

Logging initialized using