Could you be running into https://issues.apache.org/jira/browse/DRILL-3846 ?

- Rahul

On Thu, Aug 17, 2017 at 9:13 PM, Padma Penumarthy <ppenumar...@mapr.com>
wrote:

> It is supposed to work the way you expected. Maybe you are running into a
> bug.
> Why is it reading all the files after the metadata refresh? That is
> difficult to answer without looking at the logs and the query profile.
> If you look at the query profile, you can check what the
> usedMetadataFile flag says for the scan.
> Also, since you created so many files, your metadata cache file could
> be big. Maybe you can manually sanity-check that it looks OK and is not
> corrupted (look for the .drill.parquet_metadata file in the root
> directory).
>
> Thanks,
> Padma
>
>
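The manual sanity check suggested above could be scripted roughly as follows. This is only a sketch: it assumes the metadata cache file is plain JSON and that it carries a top-level "files" list, and the helper name check_metadata_cache is made up for illustration. Copy the cache file out of the table's root directory first and pass its local path.

```python
import json
import os


def check_metadata_cache(path):
    """Summarize a Drill Parquet metadata cache file (hypothetical helper).

    Assumption: the cache file is JSON with a top-level "files" list.
    If that key is absent, file_entries is reported as None, not an error.
    """
    summary = {"path": path, "size_bytes": os.path.getsize(path)}
    try:
        with open(path, "r", encoding="utf-8") as fh:
            doc = json.load(fh)
    except ValueError as exc:
        # A truncated or corrupted file fails to parse as JSON.
        summary["parses"] = False
        summary["error"] = str(exc)
        return summary
    summary["parses"] = True
    files = doc.get("files") if isinstance(doc, dict) else None
    summary["file_entries"] = len(files) if isinstance(files, list) else None
    return summary
```

If the file parses and the entry count is close to the number of parquet files in the directory, the cache is at least structurally intact; a parse failure points to a truncated or corrupted file.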
> On Aug 17, 2017, at 8:10 PM, Khurram Faraaz <kfar...@mapr.com> wrote:
>
> Please share your SQL query and the query plan.
>
> To get the query plan, execute EXPLAIN PLAN FOR <your-SQL-query>;
>
>
> Thanks,
>
> Khurram
>
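Once the plan text is available, the scan details discussed in this thread can be pulled out of it with a few lines of Python. This is a sketch under assumptions: usedMetadataFile is the flag named in the thread, while numFiles as a field in the scan line and the sample plan fragment below are illustrative and were not posted by anyone here.

```python
import re


def scan_summary(plan_text):
    """Extract numFiles and usedMetadataFile from EXPLAIN PLAN text."""
    num_files = re.search(r"numFiles=(\d+)", plan_text)
    used_cache = re.search(r"usedMetadataFile=(true|false)", plan_text)
    return {
        "numFiles": int(num_files.group(1)) if num_files else None,
        "usedMetadataFile": (used_cache.group(1) == "true") if used_cache else None,
    }


# Hypothetical plan fragment for illustration only, not taken from the thread.
sample_plan = (
    "00-03  Scan(groupscan=[ParquetGroupScan [entries=[...], "
    "selectionRoot=/data/t, numFiles=4, usedMetadataFile=true, "
    "columns=[`*`]]])"
)
print(scan_summary(sample_plan))
```

Comparing the two values for the plan taken before and after the metadata refresh shows directly whether partition pruning still happens.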
> ________________________________
> From: Divya Gehlot <divya.htco...@gmail.com>
> Sent: Friday, August 18, 2017 7:15:18 AM
> To: user@drill.apache.org
> Subject: Re: Query Optimization
>
> Hi,
> Yes, it's the same query; I just ran the metadata refresh command.
> My understanding is that the metadata refresh command saves reading the
> metadata. How about column values? Why is it reading all the files after
> the metadata refresh?
> Partitioning helps retrieve data faster.
> In Hive, when you mention the partition column in the WHERE condition,
> it reads only those partitions, which improves query performance.
> In my query the WHERE condition also has the partitioning column, so it
> should read only those partitioned files, right?
> Why is it taking more time?
> Does Drill work differently from Hive?
>
>
> Thanks,
> Divya
>
> On 18 August 2017 at 07:37, Padma Penumarthy <ppenumar...@mapr.com> wrote:
>
> It might read all those files if some new data gets added after running
> refresh metadata cache.
> If everything is the same before and after the metadata refresh, i.e. no
> new data was added and the query is exactly the same, then it should not
> do that.
> Also, check if you can partition in a way that will not create so many
> files in the first place.
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 10:54 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
> Hi,
> Another observation:
> My query had WHERE conditions based on the partition values.
>
> Total number of parquet files in the directory - 102290
> Before metadata refresh - it's reading only 4 files
> After metadata refresh - it's reading 102290 files
>
>
> Is this how refresh metadata works? I mean, does it scan each and every
> file to get the results?
>
> I don't have access to the logs now.
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:48, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
> Hi,
> Another observation:
> My query had WHERE conditions based on the partition values.
> Before metadata refresh - it's reading only 4 files
> After metadata refresh - it's reading 102290 files
>
> Thanks,
> Divya
>
> On 17 August 2017 at 13:03, Padma Penumarthy <ppenumar...@mapr.com>
> wrote:
>
> Does your query have a partition filter?
> Execution time has most likely increased because partition pruning is
> not happening.
> Did you get a chance to look at the logs? That might give some clues.
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 9:32 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
> Hi,
> Even I am surprised.
> I am running Drill version 1.10 on MapR enterprise version.
> *Query* - Selecting all the columns on a partitioned parquet table
>
> I observed a few things from the query statistics:
>
> Value      | Before Refresh Metadata | After Refresh Metadata
> -----------+-------------------------+-----------------------
> Fragments  | 1                       | 13
> DURATION   | 01 min 0.233 sec        | 18 min 0.744 sec
> PLANNING   | 59.818 sec              | 33.087 sec
> QUEUED     | Not Available           | Not Available
> EXECUTION  | 0.415 sec               | 17 min 27.657 sec
>
> The planning time is reduced by approx 45%, but the execution time
> increased drastically.
> I would like to understand why the execution time increases after the
> metadata refresh.
>
>
> Appreciate the help.
>
> Thanks,
> divya
>
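A quick arithmetic check of the figures in the statistics above, as a small Python sketch. The numbers are the ones reported in the message; the variable and function names are only illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min sec' reading from the query profile into seconds."""
    return minutes * 60 + seconds


# Figures as reported in the message above.
planning_before = to_seconds(0, 59.818)
planning_after = to_seconds(0, 33.087)
execution_before = to_seconds(0, 0.415)
execution_after = to_seconds(17, 27.657)

# Relative drop in planning time and growth factor of execution time.
planning_reduction_pct = (planning_before - planning_after) / planning_before * 100
execution_slowdown = execution_after / execution_before

print(f"planning reduced by {planning_reduction_pct:.1f}%")
print(f"execution slower by a factor of about {execution_slowdown:.0f}")
```

The saving in planning is tens of seconds, while the added execution time is many minutes, which is consistent with the scan no longer being pruned to a handful of files.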
>
> On 17 August 2017 at 11:54, Padma Penumarthy <ppenumar...@mapr.com>
> wrote:
>
> Refresh table metadata should help reduce query planning time.
> It is odd that it went up after you did refresh table metadata.
> Did you check the logs to see what is happening? You might have to
> turn on debug logging if needed.
> BTW, what version of Drill are you running?
>
> Thanks,
> Padma
>
>
> On Aug 16, 2017, at 8:15 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
> Hi,
> I have data in parquet file format.
> When I run the query and look at the execution plan, I see the
> following statistics:
>
> TOTAL FRAGMENTS: 1
> DURATION: 01 min 0.233 sec
> PLANNING: 59.818 sec
> QUEUED: Not Available
> EXECUTION: 0.415 sec
>
>
>
> As it's a parquet file format, I tried refreshing the metadata
> and ran the command below:
> REFRESH TABLE METADATA <path to table> ;
> Then I ran the same query again on the same table with the same data
> (no changes in data) and got the statistics shown below:
>
> TOTAL FRAGMENTS: 13
> DURATION: 14 min 14.604 sec
> PLANNING: 33.087 sec
> QUEUED: Not Available
> EXECUTION: Not Available
>
>
> The query is still running.
>
> Can somebody help me understand why the query is taking so long once I
> issue the refresh metadata command?
>
> Appreciate the help!
>
> Thanks,
> Divya
>
