
Re: Clustering and Large-scale analysis of Hive Queries

2018-08-03 Thread Gopal Vijayaraghavan



> I am interested in working on a project that takes a large number of Hive 
> queries (as well as their metadata, like the amount of resources used) and 
> finds common subqueries, expensive query groups, etc.

This was roughly the central research topic of one of the Hive CBO devs, except 
that it was implemented for Pig (not Hive).

https://hal.inria.fr/hal-01353891
+
https://github.com/jcamachor/pigreuse

I think there's a lot of interest in this topic for ETL workloads and the goal 
is to pick this up as ETL becomes the target problem.

There's a recent SIGMOD paper which talks about the same sort of reuse.

https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf

If you are interested in looking into this using existing infra in Hive, I 
recommend looking at Zoltan's recent work which tracks query plans + runtime 
statistics from the RUNTIME_STATS table in the metastore.

You can debug through what this does by doing

"explain reoptimization  ;"

Cheers,
Gopal

Re: Clustering and Large-scale analysis of Hive Queries

2018-07-26 Thread Thai Bui

I don’t know of any project built specifically for Hive that does what you
described. I ran into this problem recently as the number of users and
queries grew exponentially in my company, so I’m interested.

We are currently collecting Hive internal metrics so that we can analyze
them (we don’t know exactly how yet) and suggest better settings and/or
better query patterns to our users. The main concern is really large
queries that cause OOM errors.

Hive also has an existing cost-based optimizer (CBO) that can rewrite
queries (mostly joins) to speed them up based on table/column statistics.

Another feature that could be beneficial is identifying common patterns in
existing queries in order to suggest a materialized view to build
(materialized views are also a new feature of Hive 3.0). I think the Hive
team has this supporting feature on the roadmap as well.

On Wed, Jul 25, 2018 at 3:27 PM Johannes Alberti wrote:

> Did you guys already look at Dr Elephant?
>
>
> https://engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
>
> Not sure if there is anything you might find useful, but I would be
> interested in hearing about good and bad about Dr Elephant w/ Hive.
>
> Sent from my iPhone
>
> On Jul 25, 2018, at 12:13 PM, Zheng Shao  wrote:
>
> Hi,
>
> I am interested in working on a project that takes a large number of Hive
> queries (as well as their metadata, like the amount of resources used) and
> finds common subqueries, expensive query groups, etc.
>
> Is there any existing work in this domain?  Happy to collaborate as well
> if there are shared interests.
>
> Zheng
>
> --
Thai

Re: Clustering and Large-scale analysis of Hive Queries

2018-07-25 Thread Johannes Alberti

Did you guys already look at Dr Elephant?

https://engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark

Not sure if there is anything you might find useful, but I would be interested 
in hearing about good and bad about Dr Elephant w/ Hive.

Sent from my iPhone

> On Jul 25, 2018, at 12:13 PM, Zheng Shao  wrote:
> 
> Hi,
> 
> I am interested in working on a project that takes a large number of Hive 
> queries (as well as their metadata, like the amount of resources used) and 
> finds common subqueries, expensive query groups, etc.
> 
> Is there any existing work in this domain?  Happy to collaborate as well if 
> there are shared interests.
> 
> Zheng
>

Clustering and Large-scale analysis of Hive Queries

2018-07-25 Thread Zheng Shao

Hi,

I am interested in working on a project that takes a large number of Hive
queries (as well as their metadata, like the amount of resources used) and
finds common subqueries, expensive query groups, etc.

Is there any existing work in this domain?  Happy to collaborate as well
if there are shared interests.

Zheng
