[HACKERS] proposal: window function - change_number

2014-09-21 Thread Pavel Stehule
Hi
I tried to solve following task:

I have a table

start, reason, km
=
 2014-01-01 08:00:00, private, 10
 2014-01-01 09:00:00, commerc, 20
 2014-01-01 10:00:00, commerc, 20
 2014-01-01 11:00:00, private, 8

and I would reduce these rows to

 2014-01-01 08:00:00, private, 10
 2014-01-01 09:00:00, commerc, 20 + 20 = 40
 2014-01-01 11:00:00, private, 8

It is relative hard to it now with SQL only. But we can simplify this task
with window function that returns number of change in some column. Then
this task can be solved by

select min(start), min(reason), sum(km)
  from (select start, reason, km, change_number(reason) over (order by
start))
  group by change_number;

Do you think, so it has sense?

Regards

Pavel


Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread David Rowley
On Sun, Sep 21, 2014 at 9:27 PM, Pavel Stehule pavel.steh...@gmail.com
wrote:

 Hi
 I tried to solve following task:

 I have a table

 start, reason, km
 =
  2014-01-01 08:00:00, private, 10
  2014-01-01 09:00:00, commerc, 20
  2014-01-01 10:00:00, commerc, 20
  2014-01-01 11:00:00, private, 8

 and I would reduce these rows to

  2014-01-01 08:00:00, private, 10
  2014-01-01 09:00:00, commerc, 20 + 20 = 40
  2014-01-01 11:00:00, private, 8

 It is relative hard to it now with SQL only. But we can simplify this task
 with window function that returns number of change in some column. Then
 this task can be solved by

 select min(start), min(reason), sum(km)
   from (select start, reason, km, change_number(reason) over (order by
 start))
   group by change_number;


I guess that might be quite useful, otherwise the only way that comes to
mind to do this would be something along the lines of:

select *,sum(case when reason  lastreason then 1 else 0 end) over (order
by start) as chg_num from (select *,lag(reason) over (order by start) vnext
from sometable) sometable;

This way might not be too bad as I think the outer window will have no need
to perform another sort, since the inner window clause has sorted it the
right way already. Though something like change_number() would make this a
bit more pretty. It's almost like rank(), but with a parameter.

Regards

David Rowley


Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread Mart Kelder
Hi Pavel (and others),

Pavel Stehule wrote:
 Hi
 I tried to solve following task:
 
 I have a table
 
 start, reason, km
 =
  2014-01-01 08:00:00, private, 10
  2014-01-01 09:00:00, commerc, 20
  2014-01-01 10:00:00, commerc, 20
  2014-01-01 11:00:00, private, 8
 
 and I would reduce these rows to
 
  2014-01-01 08:00:00, private, 10
  2014-01-01 09:00:00, commerc, 20 + 20 = 40
  2014-01-01 11:00:00, private, 8
 
 It is relative hard to it now with SQL only. But we can simplify this task
 with window function that returns number of change in some column. Then
 this task can be solved by
 
 select min(start), min(reason), sum(km)
   from (select start, reason, km, change_number(reason) over (order by
 start))
   group by change_number;

What about

select srk.reason, min(srk.start), sum(srk.km)
from start_reason_km srk
group by srk.reason, (select max(start) from start_reason_km other WHERE 
other.start  srk.start and other.reason != srk.reason);

In general, I think window function are very specific in how the queryplan 
must look like, leaving not much room for the optimizer. On the other hand, 
if there happends to be an efficient way to get the results of the table 
ordered by start, then the window function will very likely much faster 
then a join. I would be nice if the optimizer is able to add such stream 
order operations.

 Do you think, so it has sense?
 
 Regards
 
 Pavel

Regards,

Mart

PS: This is my first post to the mailing list. I am a software developer 
interest is performance making webapplications with a different database 
server during working hours.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread Pavel Stehule
2014-09-21 14:30 GMT+02:00 Mart Kelder m...@kelder31.nl:

 Hi Pavel (and others),

 Pavel Stehule wrote:
  Hi
  I tried to solve following task:
 
  I have a table
 
  start, reason, km
  =
   2014-01-01 08:00:00, private, 10
   2014-01-01 09:00:00, commerc, 20
   2014-01-01 10:00:00, commerc, 20
   2014-01-01 11:00:00, private, 8
 
  and I would reduce these rows to
 
   2014-01-01 08:00:00, private, 10
   2014-01-01 09:00:00, commerc, 20 + 20 = 40
   2014-01-01 11:00:00, private, 8
 
  It is relative hard to it now with SQL only. But we can simplify this
 task
  with window function that returns number of change in some column. Then
  this task can be solved by
 
  select min(start), min(reason), sum(km)
from (select start, reason, km, change_number(reason) over (order by
  start))
group by change_number;

 What about

 select srk.reason, min(srk.start), sum(srk.km)
 from start_reason_km srk
 group by srk.reason, (select max(start) from start_reason_km other WHERE
 other.start  srk.start and other.reason != srk.reason);


This query is Cartesian product, so for some large data it is significantly
slower then window function (required only sorts without joins)

My motivation was a) to implement described task without Cartesian product.
b) introduce some tool for this kind of problems. I seen more times a
request .. reduce a time series, and a window function change_number (or
maybe consistent_series_number) can be good candidate.



 In general, I think window function are very specific in how the queryplan
 must look like, leaving not much room for the optimizer. On the other hand,
 if there happends to be an efficient way to get the results of the table
 ordered by start, then the window function will very likely much faster
 then a join. I would be nice if the optimizer is able to add such stream
 order operations.


  Do you think, so it has sense?
 
  Regards
 
  Pavel

 Regards,

 Mart

 PS: This is my first post to the mailing list. I am a software developer
 interest is performance making webapplications with a different database
 server during working hours.



 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread Mart Kelder
Hi Pavel (and others),

Op zondag 21 september 2014 15:35:46 schreef u:
 2014-09-21 14:30 GMT+02:00 Mart Kelder m...@kelder31.nl:
  Hi Pavel (and others),
  
  Pavel Stehule wrote:
   Hi
   I tried to solve following task:
   
   I have a table
   
   start, reason, km
   =
   
2014-01-01 08:00:00, private, 10
2014-01-01 09:00:00, commerc, 20
2014-01-01 10:00:00, commerc, 20
2014-01-01 11:00:00, private, 8
   
   and I would reduce these rows to
   
2014-01-01 08:00:00, private, 10
2014-01-01 09:00:00, commerc, 20 + 20 = 40
2014-01-01 11:00:00, private, 8
   
   It is relative hard to it now with SQL only. But we can simplify this
   task
  
   with window function that returns number of change in some column. Then
   this task can be solved by
   
   select min(start), min(reason), sum(km)
 from (select start, reason, km, change_number(reason) over (order by
   start))
   
 group by change_number;
  
  What about
  
  select srk.reason, min(srk.start), sum(srk.km)
  from start_reason_km srk
  group by srk.reason, (select max(start) from start_reason_km other WHERE
  other.start  srk.start and other.reason != srk.reason);
 
 This query is Cartesian product, so for some large data it is significantly
 slower then window function (required only sorts without joins)

I think we have the same queryplan in mind (with only one scan). As far as I 
know, SQL is a language where you define the result you want to get, and let 
the server find a way how to find the data. I think windowed function also say 
how the server needs to get the information.

The real challenge is of course of finding heuristics to remove the additional 
join. In this particular case, I can tell how to remove the inner join from 
the subquery:
* the where-clause of the self-join contains other.start  srk.start. From 
that we can conclude that if the table is (or can be) sorted on start, we 
have seen the data before the subquery is executed
* because we only need an aggregate, we need to store the intermediate max 
for each reason. And then add the result to the stream.

Recently, I had a simular problem with a table containing a timestamp, a state 
and a object where the state belongs to. A object remains in a state until 
there is a more recent tuple in the table. I needed basically to query all the 
previous state for each tuple, but preverably without the additional join.

 My motivation was a) to implement described task without Cartesian product.

Good reason (if you consider the queryplan and not the query).

 b) introduce some tool for this kind of problems. I seen more times a
 request .. reduce a time series, and a window function change_number (or
 maybe consistent_series_number) can be good candidate.

I also need to note that there is a lot of difference in complexity between the 
possible solutions of this problem. Where a new window function can probably 
be very easy implemented, the optimizer changes descripted above are complex 
and not easy to implement. 

I also want to note that I am not really against new window functions, I only 
want to point out that a more generic solution also might be possible.

Regards,

Mart


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread Pavel Stehule
2014-09-21 17:00 GMT+02:00 Mart Kelder m...@kelder31.nl:

 Hi Pavel (and others),

 Op zondag 21 september 2014 15:35:46 schreef u:
  2014-09-21 14:30 GMT+02:00 Mart Kelder m...@kelder31.nl:
   Hi Pavel (and others),
  
   Pavel Stehule wrote:
Hi
I tried to solve following task:
   
I have a table
   
start, reason, km
=
   
 2014-01-01 08:00:00, private, 10
 2014-01-01 09:00:00, commerc, 20
 2014-01-01 10:00:00, commerc, 20
 2014-01-01 11:00:00, private, 8
   
and I would reduce these rows to
   
 2014-01-01 08:00:00, private, 10
 2014-01-01 09:00:00, commerc, 20 + 20 = 40
 2014-01-01 11:00:00, private, 8
   
It is relative hard to it now with SQL only. But we can simplify this
task
   
with window function that returns number of change in some column.
 Then
this task can be solved by
   
select min(start), min(reason), sum(km)
  from (select start, reason, km, change_number(reason) over (order
 by
start))
   
  group by change_number;
  
   What about
  
   select srk.reason, min(srk.start), sum(srk.km)
   from start_reason_km srk
   group by srk.reason, (select max(start) from start_reason_km other
 WHERE
   other.start  srk.start and other.reason != srk.reason);
 
  This query is Cartesian product, so for some large data it is
 significantly
  slower then window function (required only sorts without joins)

 I think we have the same queryplan in mind (with only one scan). As far as
 I
 know, SQL is a language where you define the result you want to get, and
 let
 the server find a way how to find the data. I think windowed function also
 say
 how the server needs to get the information.


What I know It is not true now. Any subselect enforce individual scan of
source relation. Postgres has no any special support for selfjoin.




 The real challenge is of course of finding heuristics to remove the
 additional
 join. In this particular case, I can tell how to remove the inner join from
 the subquery:
 * the where-clause of the self-join contains other.start  srk.start. From
 that we can conclude that if the table is (or can be) sorted on start, we
 have seen the data before the subquery is executed
 * because we only need an aggregate, we need to store the intermediate
 max
 for each reason. And then add the result to the stream.

 Recently, I had a simular problem with a table containing a timestamp, a
 state
 and a object where the state belongs to. A object remains in a state until
 there is a more recent tuple in the table. I needed basically to query all
 the
 previous state for each tuple, but preverably without the additional join.

  My motivation was a) to implement described task without Cartesian
 product.

Good reason (if you consider the queryplan and not the query).


yes.

There is probably big space for improvements in more directions. For
example - we have application, where is often used pattern SELECT FROM A
JOIN (SELECT someagg() FROM A) .. ON

Sometimes these queries are slow due terrible low estimation. It is one
example of more

Pavel


  b) introduce some tool for this kind of problems. I seen more times a
  request .. reduce a time series, and a window function change_number
 (or
  maybe consistent_series_number) can be good candidate.

 I also need to note that there is a lot of difference in complexity
 between the
 possible solutions of this problem. Where a new window function can
 probably
 be very easy implemented, the optimizer changes descripted above are
 complex
 and not easy to implement.

 I also want to note that I am not really against new window functions, I
 only
 want to point out that a more generic solution also might be possible.

 Regards,

 Mart


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread Andrew Gierth
 Pavel == Pavel Stehule pavel.steh...@gmail.com writes:

 Pavel Hi
 Pavel I tried to solve following task:

 Pavel I have a table

 Pavel start, reason, km
 Pavel =
 Pavel  2014-01-01 08:00:00, private, 10
 Pavel  2014-01-01 09:00:00, commerc, 20
 Pavel  2014-01-01 10:00:00, commerc, 20
 Pavel  2014-01-01 11:00:00, private, 8

 Pavel and I would reduce these rows to

 Pavel  2014-01-01 08:00:00, private, 10
 Pavel  2014-01-01 09:00:00, commerc, 20 + 20 = 40
 Pavel  2014-01-01 11:00:00, private, 8

 Pavel It is relative hard to it now with SQL only.

Only relatively. My standard solution is something like this:

select start_time, reason, sum(km) as km
  from (select max(label_time) over (order by start) as start_time,
   reason, km
  from (select start, reason, km,
   case when reason
 is distinct from
 lag(reason) over (order by start)
then start
   end as label_time
  from yourtable
   ) s2
   ) s1
 group by start_time, reason
 order by start_time;

(Your change_number idea is essentially equivalent to doing
sum(case when x is distinct from lag(x) over w then 1 end) over w,
except that since window functions can't be nested, that expression
requires a subquery.)

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread Rémi Cura
Hey, sorry I what I say is obvious for you .

If I understood your problem correctly, it is strictly equivalent to this
one :
http://postgresql.1045698.n5.nabble.com/Count-of-records-in-a-row-td5775363.html

there is a postgres trick to solve this problem :
what you want is essentially generate a unique group_id,
but one that depends of an order of row not defined in the group.

The solution
is to generate a row number by the order you want , then a row number by
the group ,
then a subtraction of the 2 row number gives you an unique id per group.

The cost is that you have to use 2 windows function., hence 2 scans I guess.

Cheers,
Rémi-C

2014-09-21 17:51 GMT+02:00 Andrew Gierth and...@tao11.riddles.org.uk:

  Pavel == Pavel Stehule pavel.steh...@gmail.com writes:

  Pavel Hi
  Pavel I tried to solve following task:

  Pavel I have a table

  Pavel start, reason, km
  Pavel =
  Pavel  2014-01-01 08:00:00, private, 10
  Pavel  2014-01-01 09:00:00, commerc, 20
  Pavel  2014-01-01 10:00:00, commerc, 20
  Pavel  2014-01-01 11:00:00, private, 8

  Pavel and I would reduce these rows to

  Pavel  2014-01-01 08:00:00, private, 10
  Pavel  2014-01-01 09:00:00, commerc, 20 + 20 = 40
  Pavel  2014-01-01 11:00:00, private, 8

  Pavel It is relative hard to it now with SQL only.

 Only relatively. My standard solution is something like this:

 select start_time, reason, sum(km) as km
   from (select max(label_time) over (order by start) as start_time,
reason, km
   from (select start, reason, km,
case when reason
  is distinct from
  lag(reason) over (order by start)
 then start
end as label_time
   from yourtable
) s2
) s1
  group by start_time, reason
  order by start_time;

 (Your change_number idea is essentially equivalent to doing
 sum(case when x is distinct from lag(x) over w then 1 end) over w,
 except that since window functions can't be nested, that expression
 requires a subquery.)

 --
 Andrew (irc:RhodiumToad)


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread Pavel Stehule
2014-09-21 17:51 GMT+02:00 Andrew Gierth and...@tao11.riddles.org.uk:

  Pavel == Pavel Stehule pavel.steh...@gmail.com writes:

  Pavel Hi
  Pavel I tried to solve following task:

  Pavel I have a table

  Pavel start, reason, km
  Pavel =
  Pavel  2014-01-01 08:00:00, private, 10
  Pavel  2014-01-01 09:00:00, commerc, 20
  Pavel  2014-01-01 10:00:00, commerc, 20
  Pavel  2014-01-01 11:00:00, private, 8

  Pavel and I would reduce these rows to

  Pavel  2014-01-01 08:00:00, private, 10
  Pavel  2014-01-01 09:00:00, commerc, 20 + 20 = 40
  Pavel  2014-01-01 11:00:00, private, 8

  Pavel It is relative hard to it now with SQL only.

 Only relatively. My standard solution is something like this:

 select start_time, reason, sum(km) as km
   from (select max(label_time) over (order by start) as start_time,
reason, km
   from (select start, reason, km,
case when reason
  is distinct from
  lag(reason) over (order by start)
 then start
end as label_time
   from yourtable
) s2
) s1
  group by start_time, reason
  order by start_time;

 (Your change_number idea is essentially equivalent to doing
 sum(case when x is distinct from lag(x) over w then 1 end) over w,
 except that since window functions can't be nested, that expression
 requires a subquery.)


yes, I found this solution in third iteration too.

so this proposal lost a main benefit

Regards

Pavel


 --
 Andrew (irc:RhodiumToad)



Re: [HACKERS] proposal: window function - change_number

2014-09-21 Thread Pavel Stehule
2014-09-21 18:08 GMT+02:00 Rémi Cura remi.c...@gmail.com:

 Hey, sorry I what I say is obvious for you .

 If I understood your problem correctly, it is strictly equivalent to this
 one :

 http://postgresql.1045698.n5.nabble.com/Count-of-records-in-a-row-td5775363.html

 there is a postgres trick to solve this problem :
 what you want is essentially generate a unique group_id,
 but one that depends of an order of row not defined in the group.

 The solution
 is to generate a row number by the order you want , then a row number by
 the group ,
 then a subtraction of the 2 row number gives you an unique id per group.

 The cost is that you have to use 2 windows function., hence 2 scans I
 guess.


yes, it is little bit similar - I found a pattern described by Andrew is
well too.

regards

Pavel


 Cheers,
 Rémi-C

 2014-09-21 17:51 GMT+02:00 Andrew Gierth and...@tao11.riddles.org.uk:

  Pavel == Pavel Stehule pavel.steh...@gmail.com writes:

  Pavel Hi
  Pavel I tried to solve following task:

  Pavel I have a table

  Pavel start, reason, km
  Pavel =
  Pavel  2014-01-01 08:00:00, private, 10
  Pavel  2014-01-01 09:00:00, commerc, 20
  Pavel  2014-01-01 10:00:00, commerc, 20
  Pavel  2014-01-01 11:00:00, private, 8

  Pavel and I would reduce these rows to

  Pavel  2014-01-01 08:00:00, private, 10
  Pavel  2014-01-01 09:00:00, commerc, 20 + 20 = 40
  Pavel  2014-01-01 11:00:00, private, 8

  Pavel It is relative hard to it now with SQL only.

 Only relatively. My standard solution is something like this:

 select start_time, reason, sum(km) as km
   from (select max(label_time) over (order by start) as start_time,
reason, km
   from (select start, reason, km,
case when reason
  is distinct from
  lag(reason) over (order by start)
 then start
end as label_time
   from yourtable
) s2
) s1
  group by start_time, reason
  order by start_time;

 (Your change_number idea is essentially equivalent to doing
 sum(case when x is distinct from lag(x) over w then 1 end) over w,
 except that since window functions can't be nested, that expression
 requires a subquery.)

 --
 Andrew (irc:RhodiumToad)


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers