Re: [GENERAL] Predicting query runtime

2016-09-14 Thread Oleg Ivanov

Hi Vinicius,

I recommend reading this 
(http://www.doc.ic.ac.uk/~nb605/IO%20performance%20modeling%20research/Learning-based%20Query%20Performance%20-%202011.pdf) 
paper. The authors give a nice classification of different query performance 
prediction methods and propose their own solution to the problem.


You can also read (http://www.vldb.org/pvldb/vol9/p204-leis.pdf) to be 
warned about possible pitfalls in the PostgreSQL query optimizer. In my 
opinion, the most unpleasant one is that you often cannot rely on the 
cardinality estimations PostgreSQL makes for path nodes. Typically, the 
more complicated the query is, the less reliable the cardinality 
estimations become. The good news is that the cost model allows query 
execution time to be predicted precisely enough, given good cardinality 
estimations.


The paper (http://pages.cs.wisc.edu/~wentaowu/papers/prediction-full.pdf) 
does not use machine learning. Nevertheless, you may find it interesting. 
It contains a good description of the PostgreSQL cost model and a method 
for automatic cost calibration (similar to the one proposed by Jeff in 
this thread).
The issue with the calibration is as follows: the multipliers for each 
*_cost factor are not provided, or even directly computed, in PostgreSQL 
for the majority of path nodes. The typical form of the computation is 
not, for example, total_cost = 10 * seq_page_cost + 25 * random_page_cost, 
but rather total_cost = 10 * (seq_page_cost + 2 * random_page_cost) + 10 * 
(random_page_cost / 2). Mathematically these formulas are equivalent, but 
in practice you will spend more time and write more code to extract the 
multipliers in the second case.
In that paper the authors decided to calibrate costs using only those 
nodes for which the computations are not very complicated and, therefore, 
the multipliers can be extracted relatively easily. In any case, the cost 
models live in src/backend/optimizer/path/costsize.c, and you have to get 
inside that file somehow to obtain the extra information.
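
(To make the arithmetic concrete: expanding the nested form symbolically
immediately yields the per-factor multipliers. The snippet below only
illustrates the equivalence -- in PostgreSQL the computation is buried in
C code, so there is nothing to expand, and the multipliers have to be
reconstructed by hand or by instrumenting the planner.)

# Toy illustration of the two equivalent cost formulas above.
from sympy import symbols, expand

seq_page_cost, random_page_cost = symbols("seq_page_cost random_page_cost")

nested = 10 * (seq_page_cost + 2 * random_page_cost) + 10 * (random_page_cost / 2)
flat = expand(nested)

print(flat)                          # 10*seq_page_cost + 25*random_page_cost
print(flat.coeff(seq_page_cost))     # 10
print(flat.coeff(random_page_cost))  # 25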


To me, the paper 
(http://2014.eswc-conferences.org/sites/default/files/eswc2014pd_submission_30.pdf) 
is interesting mostly for its introduction of graph edit distance as a 
distance on the space of paths. That is interesting because some machine 
learning methods do not require feature representations of objects, but 
only a valid distance function on each pair of them.
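
(As a toy example of what that enables -- not from the paper; the distance
matrix below is invented, whereas in the real setting it would hold the
pairwise graph edit distances between plans -- scikit-learn's
nearest-neighbours regressor can be trained directly on a precomputed
distance matrix, with no feature vectors at all:)

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Pairwise distances between 4 training plans (symmetric, zero diagonal;
# the values are invented for illustration).
train_dist = np.array([
    [0.0, 2.0, 5.0, 9.0],
    [2.0, 0.0, 4.0, 8.0],
    [5.0, 4.0, 0.0, 3.0],
    [9.0, 8.0, 3.0, 0.0],
])
train_runtimes = np.array([120.0, 150.0, 900.0, 1100.0])  # ms

knn = KNeighborsRegressor(n_neighbors=2, metric="precomputed")
knn.fit(train_dist, train_runtimes)

# To predict for a new plan, supply its distances to every training plan.
new_plan_dist = np.array([[1.5, 2.5, 4.5, 8.5]])
print(knn.predict(new_plan_dist))  # runtime estimate in ms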


The paper (http://www.vldb.org/pvldb/vol6/p925-wu.pdf) is about 
predicting query execution time for concurrent workloads and also uses 
machine learning.


I hope the papers listed above will be useful for your master's thesis.

The post related to (https://pgconf.ru/en/2016/89977) is available here 
(http://tigvarts.livejournal.com/691.html). Please note that this post 
was published in February 2016, so the information in it is partially 
outdated. Some of the main principles changed during my work, some 
issues for further research are now closed, and some new ones have 
appeared. I expect to have a paper on my current results completed in 
early October.


--
Oleg Ivanov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 09/12/2016 05:03 PM, Vinicius Segalin wrote:

Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be 
extremely precise). I've been reading some papers about it, and people 
are using machine learning to do so. For the feature vector, they use 
what the DBMS's query planner provide, such as operators and their 
cost. The thing is that I haven't found any work using PostgreSQL, so 
I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning 
and PostgreSQL to predict query runtime, or maybe some other method to 
perform this.


Thank you.

Best regards,

Vinicius Segalin




Re: [GENERAL] Predicting query runtime

2016-09-13 Thread Vinicius Segalin
2016-09-13 17:12 GMT-03:00 Oleg Bartunov :
>
>
> FYI, please check https://pgconf.ru/2016/89977
>
>
Interesting! Was this presentation filmed? Or would you have a post or
something else with more details?


Re: [GENERAL] Predicting query runtime

2016-09-13 Thread Oleg Bartunov
On Tue, Sep 13, 2016 at 2:54 PM, Istvan Soos  wrote:
> On Tue, Sep 13, 2016 at 2:06 AM, Vinicius Segalin  
> wrote:
>> 2016-09-12 18:22 GMT-03:00 Istvan Soos :
>>> At Heap we have non-trivial complexity in our analytical queries, and
>>> some of them can take a long time to complete. We did analyze features
>>> like the query planner's output, our query properties (type,
>>> parameters, complexity) and tried to automatically identify factors
>>> that contribute the most into the total query time. It turns out that
>>> you don't need to use machine learning for the basics, but at this
>>> point we were not aiming for predictions yet.
>>
>> And how did you do that? Manually analyzing some queries?
>
> In this case, it was automatic analysis and feature discovery. We were
> generating features out of our query parameters, out of the SQL
> string, and also out of the explain analyze output. For each of these
> features, we have examined the P(query is slow | feature is present),
> and measured its statistical properties (precision, recall,
> correlations...).
>
> With these we have built a decision tree-based partitioning, where our
> feature-predicates divided the queries into subsets. Such a tree could
> be used for predictions, or if we would like to be fancy, we could use
> the feature vectors to train a neural network.

FYI, please check https://pgconf.ru/2016/89977

>
> Hope this helps for now,
>   Istvan




Re: [GENERAL] Predicting query runtime

2016-09-13 Thread Istvan Soos
On Tue, Sep 13, 2016 at 2:06 AM, Vinicius Segalin  wrote:
> 2016-09-12 18:22 GMT-03:00 Istvan Soos :
>> At Heap we have non-trivial complexity in our analytical queries, and
>> some of them can take a long time to complete. We did analyze features
>> like the query planner's output, our query properties (type,
>> parameters, complexity) and tried to automatically identify factors
>> that contribute the most into the total query time. It turns out that
>> you don't need to use machine learning for the basics, but at this
>> point we were not aiming for predictions yet.
>
> And how did you do that? Manually analyzing some queries?

In this case, it was automatic analysis and feature discovery. We were
generating features out of our query parameters, out of the SQL
string, and also out of the explain analyze output. For each of these
features, we have examined the P(query is slow | feature is present),
and measured its statistical properties (precision, recall,
correlations...).

With these we have built a decision tree-based partitioning, where our
feature-predicates divided the queries into subsets. Such a tree could
be used for predictions, or if we would like to be fancy, we could use
the feature vectors to train a neural network.
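
A minimal sketch of that kind of per-feature analysis (the features and
the slowness threshold below are invented for illustration, not our
actual pipeline) might look like this:

import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Hypothetical boolean features derived from query parameters / EXPLAIN output.
features = {
    "has_seq_scan_on_big_table": rng.random(n) < 0.3,
    "reaches_back_over_90_days": rng.random(n) < 0.2,
    "uses_hash_aggregate": rng.random(n) < 0.5,
}
# Fake runtimes: queries reaching far back are made slower on purpose.
runtime_ms = rng.exponential(200, n) + 2000 * features["reaches_back_over_90_days"]
slow = runtime_ms > 1000  # arbitrary slowness threshold

for name, present in features.items():
    p_slow_given_feature = slow[present].mean() if present.any() else float("nan")
    precision = p_slow_given_feature                      # TP / (TP + FP)
    recall = (slow & present).sum() / max(slow.sum(), 1)  # TP / (TP + FN)
    print(f"{name:28s} P(slow|feature)={p_slow_given_feature:.2f} "
          f"precision={precision:.2f} recall={recall:.2f}")

The features whose predicates separate slow from fast queries best are
the ones worth splitting on in the decision tree.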

Hope this helps for now,
  Istvan




Re: [GENERAL] Predicting query runtime

2016-09-13 Thread Hu, Patricia
I've been looking for this in Postgres too.  Does Postgres have something 
similar to Oracle's v$session_longops? It gives info on the total units of work, 
units done so far, last update time, time remaining, etc. I found it valuable 
for estimating how long a certain query would keep running and whether or not 
to kill it, if applicable. This should be relatively easy to implement in 
Postgres too if it is not available yet?


Thanks,
Patricia

From: Oleksandr Shulgin [mailto:oleksandr.shul...@zalando.de]
Sent: Monday, September 12, 2016 11:08 AM
To: Vinicius Segalin
Cc: pgsql general
Subject: Re: Predicting query runtime

On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin wrote:
Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be extremely 
precise). I've been reading some papers about it, and people are using machine 
learning to do so. For the feature vector, they use what the DBMS's query 
planner provide, such as operators and their cost. The thing is that I haven't 
found any work using PostgreSQL, so I'm struggling to adapt it.
My question is if anyone is aware of a work that uses machine learning and 
PostgreSQL to predict query runtime, or maybe some other method to perform this.

Hi,

I'm not aware of machine-learning techniques to achieve that (and I don't 
actually believe it's feasible), but there you might find this extension 
particularly useful: 
https://www.postgresql.org/docs/9.5/static/pgstatstatements.html

Can you share some links to the papers you are referring to (assuming these are 
publicly available)?

Regards,
--
Alex




Re: [GENERAL] Predicting query runtime

2016-09-12 Thread Vinicius Segalin
2016-09-12 18:22 GMT-03:00 Istvan Soos :

> Hi Vinicius,
>
> At Heap we have non-trivial complexity in our analytical queries, and
> some of them can take a long time to complete. We did analyze features
> like the query planner's output, our query properties (type,
> parameters, complexity) and tried to automatically identify factors
> that contribute the most into the total query time. It turns out that
> you don't need to use machine learning for the basics, but at this
> point we were not aiming for predictions yet.
>

And how did you do that? Manually analyzing some queries?
But I think, as you said, it wouldn't apply to predictions, but rather
to making long queries run faster, right?


>
> As a spoiler: queries take long time because they do a lot of IO.
> Features like reachback depth and duration (e.g. what period is the
> analytical query about) can contribute a lot to the amount of IO,
> thus, the query time. I have a blog post in my queue about our
> analysis, would gladly bump its priority if there is interest in such
> details.
>

If it's not too much work, I would like to have some details on your
process. It looks like it's not exactly what I'm trying to do, but it would
certainly help me with my work.


> I'm also curious: if you had a great way to predict the time/cost of
> the queries, how would you use it?


I'm working on something for my master's degree (that's the idea, and I really
hope I can make it happen) where I'll help the user choose the
resources for the database that will give them the best performance (or at
least the performance they think is good enough). So the idea would be to
train a model for each machine (with different resources) and then be able to
predict, for a specific query, what the performance would be.

Thank you all for the answers so far. I hope this discussion will help me
clear things up.


Re: [GENERAL] Predicting query runtime

2016-09-12 Thread Vinicius Segalin
2016-09-12 17:01 GMT-03:00 Jeff Janes :

> On Mon, Sep 12, 2016 at 7:03 AM, Vinicius Segalin 
> wrote:
>
>> Hi everyone,
>>
>> I'm trying to find a way to predict query runtime (I don't need to be
>> extremely precise). I've been reading some papers about it, and people are
>> using machine learning to do so. For the feature vector, they use what the
>> DBMS's query planner provide, such as operators and their cost. The thing
>> is that I haven't found any work using PostgreSQL, so I'm struggling to
>> adapt it.
>> My question is if anyone is aware of a work that uses machine learning
>> and PostgreSQL to predict query runtime, or maybe some other method to
>> perform this.
>>
>
> I don't know about machine learning, but if there were some way to get the
> planner to tell you predicted cost in terms of a breakdown of how many
> multiples of each *_cost factor (rather than only a grand total which is
> what it does now), then it would be fairly easy to combine that with wall
> times from log_duration and do a simple linear regression.
>
> I suspect the result would be that seq_page_cost and random_page_cost
> would have huge uncertainties on them.  And since pretty much every query
> has non-zero predicted values for at least one of those, the huge
> uncertainties would then pollute all the rest of the fitted values as
> well.  Perhaps that is where the machine learning would come in?
>
> Another issue is the predicted costs are only meant to choose between
> different plans, not to predict overall wall time. Some parts of the
> planner only have one way to do something, and so doesn't bother to compute
> a cost for that as there is no choice to be made.  This would leave glaring
> holes in the estimates (particularly for updates)
>
> But to get that data out would require quite a bit of tedious altering of
> the planner code, and then you would have to find people willing to run
> that altered code on real world databases with a high level of logging to
> gather the data.  (I suspect that gathering data from only toy databases
> would not be very useful).
>
> Cheers,
>
> Jeff
>

Modifying the planner is way too complex for me at this time, so I really
can't go for that kind of solution, but I can try to use what the planner
already gives me, make the best of it, and hope it's enough to give me
satisfactory results.


Re: [GENERAL] Predicting query runtime

2016-09-12 Thread Vinicius Segalin
2016-09-12 15:16 GMT-03:00 Merlin Moncure :

> On Mon, Sep 12, 2016 at 9:03 AM, Vinicius Segalin 
> wrote:
> > Hi everyone,
> >
> > I'm trying to find a way to predict query runtime (I don't need to be
> > extremely precise). I've been reading some papers about it, and people
> are
> > using machine learning to do so. For the feature vector, they use what
> the
> > DBMS's query planner provide, such as operators and their cost. The
> thing is
> > that I haven't found any work using PostgreSQL, so I'm struggling to
> adapt
> > it.
> > My question is if anyone is aware of a work that uses machine learning
> and
> > PostgreSQL to predict query runtime, or maybe some other method to
> perform
> > this.
>
> Well, postgres estimates the query runtime in the form of an expected
> 'cost', where the cost is an arbitrary measure based on time
> complexity of query plan.   It shouldn't be too difficult to correlate
> estimated cost to runtime cost.


That's what I thought too. At least it makes sense, I guess. But sometimes
logic doesn't hold up in practice, so I think only trying it out will tell.


> A statistical analysis of that
> correlation would be incredibly useful work although generating sample
> datasets would be a major challenge.
>
> merlin
>

Indeed. I'm using TPC-B along with pgbench to have some data to test with
(while I don't have real data), but I'm having a hard time creating queries
that give me (very) different performance results so I can train my ML
algorithm.


Re: [GENERAL] Predicting query runtime

2016-09-12 Thread Istvan Soos
Hi Vinicius,

At Heap we have non-trivial complexity in our analytical queries, and
some of them can take a long time to complete. We did analyze features
like the query planner's output and our query properties (type,
parameters, complexity), and tried to automatically identify the factors
that contribute the most to the total query time. It turns out that
you don't need machine learning for the basics, but at this point we
were not yet aiming for predictions.

As a spoiler: queries take a long time because they do a lot of IO.
Features like reachback depth and duration (e.g. what period the
analytical query is about) can contribute a lot to the amount of IO and,
thus, to the query time. I have a blog post about our analysis in my
queue, and would gladly bump its priority if there is interest in such
details.

I'm also curious: if you had a great way to predict the time/cost of
the queries, how would you use it?

Best regards,
  Istvan

On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin  wrote:
> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provide, such as operators and their cost. The thing is
> that I haven't found any work using PostgreSQL, so I'm struggling to adapt
> it.
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.
>
> Thank you.
>
> Best regards,
>
> Vinicius Segalin




Re: [GENERAL] Predicting query runtime

2016-09-12 Thread Jeff Janes
On Mon, Sep 12, 2016 at 7:03 AM, Vinicius Segalin 
wrote:

> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provide, such as operators and their cost. The thing
> is that I haven't found any work using PostgreSQL, so I'm struggling to
> adapt it.
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.
>

I don't know about machine learning, but if there were some way to get the
planner to tell you predicted cost in terms of a breakdown of how many
multiples of each *_cost factor (rather than only a grand total which is
what it does now), then it would be fairly easy to combine that with wall
times from log_duration and do a simple linear regression.

I suspect the result would be that seq_page_cost and random_page_cost would
have huge uncertainties on them.  And since pretty much every query has
non-zero predicted values for at least one of those, the huge uncertainties
would then pollute all the rest of the fitted values as well.  Perhaps that
is where the machine learning would come in?

Another issue is that the predicted costs are only meant to choose between
different plans, not to predict overall wall time. Some parts of the
planner only have one way to do something, and so don't bother to compute
a cost for that, as there is no choice to be made.  This would leave glaring
holes in the estimates (particularly for updates).

But to get that data out would require quite a bit of tedious altering of
the planner code, and then you would have to find people willing to run
that altered code on real world databases with a high level of logging to
gather the data.  (I suspect that gathering data from only toy databases
would not be very useful).
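
(If that per-factor breakdown were somehow available, the regression step
itself would be tiny. A sketch with made-up numbers, assuming a
hypothetical breakdown pulled from an instrumented planner and wall times
from log_duration:)

import numpy as np

# Columns: multiples of seq_page_cost, random_page_cost, cpu_tuple_cost
# charged by the (hypothetically instrumented) planner for each query.
X = np.array([
    [1000.0,  950.0,  50000.0],
    [2000.0, 1900.0, 120000.0],
    [ 500.0,  480.0,  20000.0],
    [4000.0, 3800.0, 300000.0],
])
wall_ms = np.array([210.0, 430.0, 100.0, 900.0])  # from log_duration (made up)

coef, residuals, rank, sv = np.linalg.lstsq(X, wall_ms, rcond=None)
print("fitted ms per cost unit:", coef)
# Nearly collinear seq/random page columns make the fit ill-conditioned,
# which is exactly the "huge uncertainties" problem described above.
print("condition number:", np.linalg.cond(X))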

Cheers,

Jeff


Re: [GENERAL] Predicting query runtime

2016-09-12 Thread Merlin Moncure
On Mon, Sep 12, 2016 at 9:03 AM, Vinicius Segalin  wrote:
> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provide, such as operators and their cost. The thing is
> that I haven't found any work using PostgreSQL, so I'm struggling to adapt
> it.
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.

Well, postgres estimates the query runtime in the form of an expected
'cost', where the cost is an arbitrary measure based on the time
complexity of the query plan.  It shouldn't be too difficult to correlate
estimated cost with actual runtime.  A statistical analysis of that
correlation would be incredibly useful work, although generating sample
datasets would be a major challenge.
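
(A naive first pass at that correlation could be scripted directly; the
connection string is a placeholder and the pgbench queries below merely
stand in for a real workload:)

import json
import time

import numpy as np
import psycopg2

queries = [
    "SELECT count(*) FROM pgbench_accounts",
    "SELECT abalance FROM pgbench_accounts WHERE aid = 42",
    "SELECT bid, sum(abalance) FROM pgbench_accounts GROUP BY bid",
]

conn = psycopg2.connect("dbname=pgbench")  # placeholder DSN
costs, runtimes = [], []
with conn, conn.cursor() as cur:
    for sql in queries:
        # Estimated total cost from the planner.
        cur.execute("EXPLAIN (FORMAT JSON) " + sql)
        row = cur.fetchone()[0]
        plan = row if isinstance(row, list) else json.loads(row)
        costs.append(plan[0]["Plan"]["Total Cost"])

        # Measured wall-clock runtime.
        start = time.perf_counter()
        cur.execute(sql)
        cur.fetchall()
        runtimes.append((time.perf_counter() - start) * 1000.0)  # ms

print("correlation:", np.corrcoef(costs, runtimes)[0, 1])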

merlin




Re: [GENERAL] Predicting query runtime

2016-09-12 Thread Vinicius Segalin
2016-09-12 12:08 GMT-03:00 Oleksandr Shulgin :

> On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin 
> wrote:
>
>> Hi everyone,
>>
>> I'm trying to find a way to predict query runtime (I don't need to be
>> extremely precise). I've been reading some papers about it, and people are
>> using machine learning to do so. For the feature vector, they use what the
>> DBMS's query planner provide, such as operators and their cost. The thing
>> is that I haven't found any work using PostgreSQL, so I'm struggling to
>> adapt it.
>> My question is if anyone is aware of a work that uses machine learning
>> and PostgreSQL to predict query runtime, or maybe some other method to
>> perform this.
>>
>
> Hi,
>
> I'm not aware of machine-learning techniques to achieve that (and I don't
> actually believe it's feasible), but there you might find this extension
> particularly useful: https://www.postgresql.org/docs/9.5/static/
> pgstatstatements.html
>

I'll read about it and see how it can help me. Thank you


> Can you share some links to the papers you are referring to (assuming
> these are publicly available)?
>

I don't have them all now, but these are some of them:

http://dl.acm.org/citation.cfm?id=1845166
https://people.eecs.berkeley.edu/~jordan/papers/ganapathi-etal-icde09.pdf


Re: [GENERAL] Predicting query runtime

2016-09-12 Thread Oleksandr Shulgin
On Mon, Sep 12, 2016 at 4:03 PM, Vinicius Segalin 
wrote:

> Hi everyone,
>
> I'm trying to find a way to predict query runtime (I don't need to be
> extremely precise). I've been reading some papers about it, and people are
> using machine learning to do so. For the feature vector, they use what the
> DBMS's query planner provide, such as operators and their cost. The thing
> is that I haven't found any work using PostgreSQL, so I'm struggling to
> adapt it.
> My question is if anyone is aware of a work that uses machine learning and
> PostgreSQL to predict query runtime, or maybe some other method to perform
> this.
>

Hi,

I'm not aware of machine-learning techniques to achieve that (and I don't
actually believe it's feasible), but you might find this extension
particularly useful:
https://www.postgresql.org/docs/9.5/static/pgstatstatements.html
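
(For example, a quick way to pull per-statement averages out of it --
assuming the extension is installed and preloaded; the DSN is a
placeholder, and the column names below are the 9.x ones, which newer
releases renamed to total_exec_time / mean_exec_time:)

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT query, calls, total_time, mean_time
        FROM pg_stat_statements
        ORDER BY mean_time DESC
        LIMIT 20
    """)
    for query, calls, total_ms, mean_ms in cur.fetchall():
        print(f"{mean_ms:10.2f} ms avg  ({calls} calls)  {query[:60]}")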

Can you share some links to the papers you are referring to (assuming these
are publicly available)?

Regards,
--
Alex


[GENERAL] Predicting query runtime

2016-09-12 Thread Vinicius Segalin
Hi everyone,

I'm trying to find a way to predict query runtime (I don't need to be
extremely precise). I've been reading some papers about it, and people are
using machine learning to do so. For the feature vector, they use what the
DBMS's query planner provides, such as operators and their costs. The thing
is that I haven't found any work using PostgreSQL, so I'm struggling to
adapt it.
My question is whether anyone is aware of any work that uses machine
learning and PostgreSQL to predict query runtime, or maybe some other
method to do this.
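
(To make that concrete in PostgreSQL terms, the feature vector could be
built by walking the plan tree returned by EXPLAIN (FORMAT JSON) and
counting operators plus their estimated costs. A toy sketch, with a
hard-coded plan standing in for real planner output:)

from collections import Counter

# Shaped like EXPLAIN (FORMAT JSON) output; hard-coded for illustration.
plan = {
    "Plan": {
        "Node Type": "Hash Join",
        "Total Cost": 3500.0,
        "Plan Rows": 10000,
        "Plans": [
            {"Node Type": "Seq Scan", "Total Cost": 1500.0, "Plan Rows": 100000},
            {"Node Type": "Hash", "Total Cost": 1800.0, "Plan Rows": 10000,
             "Plans": [{"Node Type": "Seq Scan", "Total Cost": 1700.0,
                        "Plan Rows": 10000}]},
        ],
    }
}

def extract_features(plan):
    counts = Counter()

    def walk(node):
        counts[node["Node Type"]] += 1
        for child in node.get("Plans", []):
            walk(child)

    walk(plan["Plan"])
    features = dict(counts)
    features["total_cost"] = plan["Plan"]["Total Cost"]
    features["est_rows"] = plan["Plan"]["Plan Rows"]
    return features

print(extract_features(plan))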

Thank you.

Best regards,

Vinicius Segalin