[PERFORM] Ordering on GIN Index

2016-12-13 Thread Rory
Hello mailing list!

We have a JSONB column in a table that has a key (string) => value (int)
structure. We want to be able to create a btree index that accepts an
arbitrary key, so that we can sort by that key's value.

Table Example:

> +----+-------------------------------------------+
> | ID |                   JSONB                   |
> +----+-------------------------------------------+
> |  1 | {"key_1": 20, "key_2": 30, "key_52": -1}  |
> |  2 | {"key_1": 10}                             |
> +----+-------------------------------------------+
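
For reference, the table looks roughly like this (a simplified sketch; the
column is literally named "jsonb", constraints are illustrative):

> create table my_table (
>     id    bigint primary key,
>     jsonb jsonb not null  -- key (string) => value (int) pairs
> );
>
> insert into my_table (id, jsonb) values
>     (1, '{"key_1": 20, "key_2": 30, "key_52": -1}'),
>     (2, '{"key_1": 10}');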


Here is the kind of query we want to run:

> select id, (jsonb ->> 'key_1')::int as sort_key
> from my_table
> where (jsonb ? 'key_1' and (jsonb ->> 'key_1')::int > 0)
>   and (jsonb ? 'key_2' and (jsonb ->> 'key_2')::int > 50)
> order by sort_key desc
> limit 100;


We know that we can create an index for each individual key (e.g. create
index my_table_key_1_idx on my_table using btree ((jsonb -> 'key_1')), or a
partial index that includes the ? operator), but the issue is that there are
around 5000 potential keys, which would mean around 5000 indexes.
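
For example, the two per-key variants would look something like this (a
sketch; whether to index the raw jsonb value or the int cast is a choice,
the cast here matches the sort key above):

> -- expression index on one key
> create index my_table_key_1_idx
>     on my_table using btree (((jsonb ->> 'key_1')::int));
>
> -- or a partial variant that only covers rows actually containing the key
> create index my_table_key_1_partial_idx
>     on my_table using btree (((jsonb ->> 'key_1')::int))
>     where jsonb ? 'key_1';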

We tried doing the relational thing and splitting the JSONB column out into
its own table, which is great because we can then use a simple btree
index, but unfortunately this forces us to use awkward queries such as:

> select id, max(value) filter (where key = 'key_1') as sort_key
> from my_table_split
> where (
>   (key = 'key_1' and value > 0) or
>   (key = 'key_2' and value > 50)
> )
> group by id
> having count(*) = 2
> order by sort_key desc
> limit 100;
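
For reference, the split table looks roughly like this (a simplified sketch;
exact types and constraints are illustrative):

> create table my_table_split (
>     id    bigint not null,   -- references my_table (id)
>     key   text   not null,
>     value int    not null,
>     primary key (id, key)
> );
>
> -- the simple btree index this layout allows
> create index my_table_split_key_value_idx
>     on my_table_split using btree (key, value);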


Such a query takes a disappointingly long time to aggregate. This also has
the disadvantage that if we wanted to pull other columns from my_table we'd
have to do an inner join, further decreasing performance.

I see that in 2013 there was a talk (
http://www.sai.msu.su/~megera/postgres/talks/Next%20generation%20of%20GIN.pdf)
about ordered GIN indexes, which seems perfect for our case, but I can't
find any progress or updates on it.

Does anyone have any ideas on how to approach this in a performant way
with the Postgres we have today?

Thank you,
Rory.


Re: [PERFORM] Querying with multicolumn index

2016-12-13 Thread Daniel Blanch Bataller
Hi all,

If anyone is still interested in the issue, I think I have a very plausible
explanation of Eric's bad index choice in PostgreSQL, which is: a bloated
updates_driver_id_time_idx index.

Though it is possible to fool the PostgreSQL planner, as I've shown in
previous tests, that happens with a very particular data distribution:
~ 100 evenly distributed keys over 15M records and ~ 6M records under one
single key. If you play a bit with the figures it doesn't happen anymore.

Eric's data distribution wasn't that extreme; as far as I know he had
~ 100K vs 500K distributions … Well, I've been able to reproduce the problem
with a data distribution close to Eric's. I did it by creating a 'bloated'
index. If the optimal index is too big, Postgres tries another, suboptimal
index, in this case the 'time' index. See this excerpt of my test results:

(..)

-- populate table with 99 homogeneously distributed values
INSERT INTO updates
SELECT q, q % 99, q, q, to_timestamp(q), q % 99
FROM generate_series(1, 1500) q;
INSERT 0 1500
Time: 65686,547 ms

-- populate table with 1 value with 500K rows, similar to the distribution you posted
INSERT INTO updates
SELECT q + 1500, 100, q, q, to_timestamp(q), -- timestamp will start at 1 and end at 6M
       100
FROM generate_series(1, 50) q;
INSERT 0 50
Time: 2463,073 ms

-- add constraints and indexes
(…)

-- create 'bloated' (driver_id, time) index
CREATE INDEX ON updates (driver_id, "time") WITH (fillfactor = 10);
CREATE INDEX
Time: 41234,091 ms

-- check index sizes; updates_driver_id_time_idx is huge
SELECT relname, relpages FROM pg_class WHERE relname LIKE 'updates%';

           relname           | relpages
-----------------------------+----------
 updates                     |   129167
 updates_driver_id_time_idx  |   576919
 updates_id_seq              |        1
 updates_pkey                |    42502
 updates_time_idx            |    42502
 updates_vehicle_id_time_idx |    59684
(6 rows)

Time: 16,810 ms

-- check behavior with the bloated index
ANALYZE updates;
ANALYZE
Time: 254,917 ms
(..)
Time: 4,635 ms

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM updates WHERE driver_id = 100 ORDER BY "time" DESC LIMIT 1;

                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..1.91 rows=1 width=36) (actual time=21486.015..21486.015 rows=1 loops=1)
   Buffers: shared hit=39618 read=160454 written=592
   ->  Index Scan Backward using updates_time_idx on updates  (cost=0.43..691283.45 rows=469134 width=36) (actual time=21486.014..21486.014 rows=1 loops=1)
         Filter: (driver_id = 100)
         Rows Removed by Filter: 1450
         Buffers: shared hit=39618 read=160454 written=592
 Planning time: 0.171 ms
 Execution time: 21486.068 ms
(8 rows)

Time: 21486,905 ms

-- rebuild the index with the default fillfactor
ALTER INDEX updates_driver_id_time_idx SET (fillfactor = 90);
ALTER INDEX
Time: 0,682 ms

REINDEX INDEX updates_driver_id_time_idx;
REINDEX
Time: 23559,530 ms

-- recheck index sizes; updates_driver_id_time_idx should now look pretty similar to the others
SELECT relname, relpages FROM pg_class WHERE relname LIKE 'updates%';

           relname           | relpages
-----------------------------+----------
 updates                     |   129167
 updates_driver_id_time_idx  |    59684
 updates_id_seq              |        1
 updates_pkey                |    42502
 updates_time_idx            |    42502
 updates_vehicle_id_time_idx |    59684
(6 rows)

Time: 0,452 ms

-- check behavior with a regular-sized index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM updates WHERE driver_id = 100 ORDER BY "time" DESC LIMIT 1;

                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..1.69 rows=1 width=36) (actual time=0.032..0.033 rows=1 loops=1)
   Buffers: shared hit=2 read=3
   ->  Index Scan Backward using updates_driver_id_time_idx on updates  (cost=0.56..529197.34 rows=469133 width=36) (actual time=0.032..0.032 rows=1 loops=1)
         Index Cond: (driver_id = 100)
         Buffers: shared hit=2 read=3
 Planning time: 0.074 ms
 Execution time: 0.046 ms
(7 rows)

Time: 0,312 ms

@Eric

How to solve the problem: first of all, check whether this is actually the
case. Check the index sizes; if you have this problem,
updates_driver_id_time_idx should be significantly bigger than the others:

SELECT relname, relpages FROM pg_class WHERE relname LIKE 'updates%';

Check the index configuration to see if you have a different fillfactor
configured:

\d+ updates_driver_id_time_idx

If you have set up a different fillfactor, turn it back to normal, that is
90%. I don't see why you should have a low fillfactor; your data doesn't
seem to

Re: [PERFORM] Isolation of tx logs on VMware

2016-12-13 Thread Joshua D. Drake

On 12/13/2016 12:16 PM, ProPAAS DBA wrote:

> Hi All;
>
> I'm not a VMware expert, however I thought VMware would allow the
> creation of multiple disk volumes and attach them via separate mount
> points. Is this not true? If it is an option, can someone point me to a
> how-to...


Yes, it is possible to do this, and then you will be able to use standard
OS tools to determine the IO utilization.

> Also, if we cannot do multiple VMDK volumes then what are everyone's
> thoughts about relocating pg_xlog to an NFS mount?

I personally wouldn't do it, but it would depend on the implementation.

JD

> Thanks in advance

--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.




[PERFORM] Isolation of tx logs on VMware

2016-12-13 Thread ProPAAS DBA

Hi All;


We have a client running on VMware; they have heavy write traffic and we
want to isolate the IO for the tx logs (pg_xlog). However, based on feedback
from the client, it seems the best plan is either


(a) simply leave the pg_xlog dir in the VMDK

or

(b) relocate pg_xlog to NAS/NFS


I'm not a VMware expert, however I thought VMware would allow the 
creation of multiple disk volumes and attach them via separate mount 
points. Is this not true? If it is an option, can someone point me to a 
how-to...



Also, if we cannot do multiple VMDK volumes then what are everyone's 
thoughts about relocating pg_xlog to an NFS mount?



Thanks in advance




