Re: [HACKERS] pivot aggregation with a patched intarray

2014-06-06 Thread Marc Mamin
 From: Ali Akbar [mailto:the.ap...@gmail.com]
 Sent: Freitag, 6. Juni 2014 03:44
 Subject: Re: [HACKERS] pivot aggregation with a patched intarray
 
 2014-06-05 17:18 GMT+07:00 Marc Mamin m.ma...@intershop.de:
  I'm thinking about adding a final function to my aggregate that would
  replace zero values will nulls, hence transforming the intarray into
 a standard int[], possibly with nullbitmap and a lowerbound that can be
  1.
  This will probably degrade the performance considerably, but may
 reduce the size of the end result for spare data and not too small
 integers...
  Performances should greatly depend on the data distribution and order
 as they influence the number of palloc.
  My first tests shown as well better and poorer results.
 
  My target is not to get better performances at the first place, but
 to get a pivot structure in an early aggregation stage.
 
 Usually for pivot, i use crosstab function from tablefunc
 (http://www.postgresql.org/docs/9.4/static/tablefunc.html#AEN158550).
 If your patch doesn't perform better, it's more easier to just use
 crosstab. For storing it efficiently, the result can be transformed
 into array manually.


Hello,

crosstab is too restrictive for my use case: 
it supports only one row_name column and you have to know the number of 
returned columns (categories).
What I'm doing is to combine a count aggregate with a pivot. This is not the 
same as what crosstab offers.

 
 PS: as Michael Paquier said above, its better if you could send the
 patch in the .patch file format (see:
 https://wiki.postgresql.org/wiki/Working_with_GIT).

I'm first waiting for some positive feedback on the idea itself before to 
eventually submit this officially.

best regards, 

Marc Mamin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pivot aggregation with a patched intarray

2014-06-05 Thread Marc Mamin

 -Original Message-
 From: Ali Akbar [mailto:the.ap...@gmail.com]
 Sent: Donnerstag, 5. Juni 2014 01:12
 To: Marc Mamin
 Cc: Michael Paquier; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] pivot aggregation with a patched intarray
 
 2014-06-01 20:48 GMT+07:00 Marc Mamin m.ma...@intershop.de:
 
  On Sat, May 31, 2014 at 12:31 AM, Marc Mamin m.ma...@intershop.de
 wrote:
   I have patched intarray with 3 additional functions in order to
   count[distinct] event IDs into arrays, whereas the array position
   correspond to the integer values. (mimic column oriented storage)
  
  I didn't look at the feature itself, but here are some comments
 about
  the format of the patch:
  - Be careful the newlines on the file you posted use ¥r¥n, which is
  purely Windows stuff... This will generate unnecessary diffs with
 the
  source code
  I don't mean to suggests this directly as a patch, I'm first
  interested to see if there are some interest for such an aggregation
  type.
 
 From what i see, the icount_to_array is complementary to standard
 count() aggregates, but it produces array. If the values are not
 sparse, i think the performance and memory/storage benefit you
 mentioned will be true. But if the values are sparse, there will be
 many 0's, how it will perform?

I'm thinking about adding a final function to my aggregate that would replace 
zero values will nulls, 
hence transforming the intarray into a standard int[], possibly with nullbitmap 
and a lowerbound that can be  1.
This will probably degrade the performance considerably, but may reduce the 
size of the end result for spare data and not too small integers...


 
 I'm interested to benchmark it with some use cases, to confirm the
 performance benefits of it.


Performances should greatly depend on the data distribution and order as they 
influence the number of palloc.
My first tests shown as well better and poorer results.

My target is not to get better performances at the first place, but to get a 
pivot structure in an early aggregation stage.


regards,

Marc Mamin



intarray_mod.tar.gz
Description: intarray_mod.tar.gz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pivot aggregation with a patched intarray

2014-06-05 Thread Ali Akbar
2014-06-05 17:18 GMT+07:00 Marc Mamin m.ma...@intershop.de:
 I'm thinking about adding a final function to my aggregate that would replace 
 zero values will nulls,
 hence transforming the intarray into a standard int[], possibly with 
 nullbitmap and a lowerbound that can be  1.
 This will probably degrade the performance considerably, but may reduce the 
 size of the end result for spare data and not too small integers...
 Performances should greatly depend on the data distribution and order as they 
 influence the number of palloc.
 My first tests shown as well better and poorer results.

 My target is not to get better performances at the first place, but to get a 
 pivot structure in an early aggregation stage.

Usually for pivot, i use crosstab function from tablefunc
(http://www.postgresql.org/docs/9.4/static/tablefunc.html#AEN158550).
If your patch doesn't perform better, it's more easier to just use
crosstab. For storing it efficiently, the result can be transformed
into array manually.

PS: as Michael Paquier said above, its better if you could send the
patch in the .patch file format (see:
https://wiki.postgresql.org/wiki/Working_with_GIT).

-- 
Ali Akbar


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pivot aggregation with a patched intarray

2014-06-04 Thread Ali Akbar
2014-06-01 20:48 GMT+07:00 Marc Mamin m.ma...@intershop.de:

 On Sat, May 31, 2014 at 12:31 AM, Marc Mamin m.ma...@intershop.de wrote:
  I have patched intarray with 3 additional functions in order to 
  count[distinct] event IDs
  into arrays, whereas the array position correspond to the integer values. 
  (mimic column oriented storage)
 
 I didn't look at the feature itself, but here are some comments about
 the format of the patch:
 - Be careful the newlines on the file you posted use ¥r¥n, which is
 purely Windows stuff... This will generate unnecessary diffs with the
 source code
 I don't mean to suggests this directly as a patch,
 I'm first interested to see if there are some interest
 for such an aggregation type.

From what i see, the icount_to_array is complementary to standard
count() aggregates, but it produces array. If the values are not
sparse, i think the performance and memory/storage benefit you
mentioned will be true. But if the values are sparse, there will be
many 0's, how it will perform?

I'm interested to benchmark it with some use cases, to confirm the
performance benefits of it.

--
Ali Akbar


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pivot aggregation with a patched intarray

2014-06-01 Thread Michael Paquier
On Sat, May 31, 2014 at 12:31 AM, Marc Mamin m.ma...@intershop.de wrote:
 I have patched intarray with 3 additional functions in order to 
 count[distinct] event IDs
 into arrays, whereas the array position correspond to the integer values. 
 (mimic column oriented storage)

I didn't look at the feature itself, but here are some comments about
the format of the patch:
- Be careful the newlines on the file you posted use ¥r¥n, which is
purely Windows stuff... This will generate unnecessary diffs with the
source code
- Here are some guidelines about the code convention:
http://www.postgresql.org/docs/devel/static/source.html
- And here are a couple of rules to respect when submitting a patch:
https://wiki.postgresql.org/wiki/Submitting_a_Patch
Following those rules will make patch review as well as the life of
reviewers easier.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pivot aggregation with a patched intarray

2014-06-01 Thread Marc Mamin

On Sat, May 31, 2014 at 12:31 AM, Marc Mamin m.ma...@intershop.de wrote:
 I have patched intarray with 3 additional functions in order to 
 count[distinct] event IDs
 into arrays, whereas the array position correspond to the integer values. 
 (mimic column oriented storage)

I didn't look at the feature itself, but here are some comments about
the format of the patch:
- Be careful the newlines on the file you posted use ¥r¥n, which is
purely Windows stuff... This will generate unnecessary diffs with the
source code

oops, will fix

- Here are some guidelines about the code convention:
http://www.postgresql.org/docs/devel/static/source.html
- And here are a couple of rules to respect when submitting a patch:
https://wiki.postgresql.org/wiki/Submitting_a_Patch
Following those rules will make patch review as well as the life of
reviewers easier.

thanks for your comments
I don't mean to suggests this directly as a patch,
I'm first interested to see if there are some interest
for such an aggregation type.

regards,

Marc Mamin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pivot aggregation with a patched intarray

2014-05-30 Thread Marc Mamin
(reposted, this time with attachment. sorry)


Hello,


(sorry for this long post)

I have patched intarray with 3 additional functions in order to count[distinct] 
event IDs
into arrays, whereas the array position correspond to the integer values. 
(mimic column oriented storage)

I need to use an array for the event IDs as I don't know how many of them 
exist, and the list may increase slowly upon the time.

The first results are quite promising. 

Although this approach is only usable for a narrow set of use cases (*), 
I wonder if I should look at other places in the source code to achieve a 
better implementation
and if there are discussions or plan to enhance Postgres with some support for 
such kind of storage (vectors of int).


(*) The array sizes should ideally not exceed a few tens 
and the number of events should be unknown, otherwise using one column 
per event would be faster)


use case


c: category
s: session
e: event int4 (small range, mostly less than 20)

c   s   e
-   -   -
1   1   1
1   1   1
1   1   3
1   2   1
2   2   3
3   1   5

WITH SES AS (
  SELECT s, c, 
 icount_to_array(e) ecount,
 iarray_flag(e) edistinct
  FROM foo
  GROUP BY s, c)
SELECT  c, 
iarray_sum(ecount) ecount, 
iarray_sum(edistinct)edistinct
FROM SES GROUP BY c


cecount  edistinct
---- -
1[3,0,1] [2,0,1]
2[0,0,1] [0,0,1]
3[0,0,0,0,1] [0,0,0,0,1]

e.g.: the event 1 was met 3 times in the category 1, in 2 distinct sessions.


source code
===
The 3 aggregates use following functions:

isumv:  isumv([1,1,1],[0,0,1,3]) = [1,1,2,3])

iarray_addat :  iarray_addat([3,3,3],2) = [3,4,3]

iarray_setat :  iarray_setat([0,0],2) = [0,1] 
iarray_setat([0,1],2) = [0,1] 


I've added them at the end of _int_op.c from intarray (attached)

missing in code: checks for integer overflow and that the array lower bound are 
all 1.

These are my first C lines ever and I've never learned it.
Hence I'm open for critics ;-)
I've started with this great posting by Craig Ringer:
 
http://stackoverflow.com/questions/16992339/why-is-postgresql-array-access-so-much-faster-in-c-than-in-pl-pgsql?answertab=votes#tab-top
The rest is mostly copy and paste from other parts of intarray.
 

iarray_setat is just to set a bitmap position to 1, possibly enlarging it 
when required.
It is a trivial modification from iarray_addat
It should be possible to implement this more efficiently with bit[] or varbit, 
but this
lies beyond my C capbilities.

performances  advantage of icount_to_array
===
As this aggregate allows to reduce the cardinality of GROUP BYs
it often performs better than a vertical approach. 

With the horizontal storage, the result can be stored in smaller tables with 
much less rows,
hence bringing more advantages when it comes to indices.



e.g.:

select icount_to_array(user_id) from foo

{256655,0,0,1075,0,1,91154,36,0 (...)

Aggregate  (cost=36930.44..36930.45 rows=1 width=4) (actual 
time=333.378..333.378 rows=1 loops=1)
  -  Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual 
time=0.010..131.117 rows=1860179 loops=1)
Total runtime: 333.420 ms


select user_id, count(*) from foo group by user_id order by 1

Sort  (cost=41582.75..41582.87 rows=48 width=4) (actual time=492.638..492.638 
rows=79 loops=1)
  Sort Key: user_id
  Sort Method: quicksort  Memory: 28kB
  -  HashAggregate  (cost=41580.93..41581.41 rows=48 width=4) (actual 
time=492.606..492.615 rows=79 loops=1)
-  Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual 
time=0.010..128.800 rows=1860179 loops=1)
Total runtime: 492.699 ms

1;256656
4;1075
7;91157
8;36
17;1455583
(...)

It will swap later on
-
... which result in a significant difference in some cases.

create temp table ev AS SELECT * FROM (
generate_series(1,1000)s cross join
generate_series(1,500)c cross join
generate_series(1,15)e cross join
generate_series(1,5) repeat
)

explain analyze select s, c,  icount_to_array(e)  from ev group by s,c

HashAggregate  (cost=830575.54..830975.54 rows=4 width=12) (actual 
time=19188.384..19379.154 rows=50 loops=1)
  -  Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) (actual 
time=0.046..4849.977 rows=3750 loops=1)
Total runtime: 19399.151 ms

explain analyze select s, c, e, count(*) from ev group by s,c,e

GroupAggregate  (cost=5589186.88..6073545.71 rows=3587844 width=12) (actual 
time=63290.168..89336.265 rows=750 loops=1)
  -  Sort  (cost=5589186.88..5678882.96 rows=35878431 width=12) (actual 
time=63290.156..83981.772 rows=3750 loops=1)
Sort Key: s, c, e
Sort Method: external merge  Disk: 806424kB
-  Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) 
(actual time=0.039..4957.481 rows=3750 loops=1)
Total runtime: 89680.844 ms