Re: [PERFORM] Querying with multicolumn index

2016-12-09 Thread Andreas Joseph Krogh
On Friday, December 9, 2016 at 18:00:16, Eric Jiang <e...@doublemap.com> wrote:
Hi all,
 I have a query that I *think* should use a multicolumn index, but
 sometimes isn't, resulting in slow queries.

 We have a DB that records GPS coordinates for vehicles:

                                Table "public.updates"
    Column   |           Type           |                      Modifiers
 ------------+--------------------------+------------------------------------------------------
  id         | integer                  | not null default nextval('updates_id_seq'::regclass)
  driver_id  | integer                  | not null
  latitude   | double precision         | not null
  longitude  | double precision         | not null
  time       | timestamp with time zone | not null default now()
  vehicle_id | integer                  |
 Indexes:
     "updates_pkey" PRIMARY KEY, btree (id)
     "ix_updates_time" btree ("time")
     "updates_driver_id_time_idx" btree (driver_id, "time")
     "updates_vehicle_id_time_idx" btree (vehicle_id, "time")

 Table has about 15M records across 100 distinct driver_id.

 I want to get the last record for a specific driver:

 SELECT * FROM updates WHERE driver_id=123 ORDER BY "time" DESC LIMIT 1;

 For some values of driver_id, it does what I expect and uses
 updates_driver_id_time_idx to fetch the records in 2 ms or less. For
 other values of driver_id, it does an index scan backwards on
 ix_updates_time, taking upwards of 2 minutes.

 Good plan:

  Limit  (cost=0.11..1.38 rows=1 width=56) (actual time=2.710..2.710 rows=1 loops=1)
    ->  Index Scan Backward using updates_driver_id_time_idx on updates  (cost=0.11..139278.28 rows=110051 width=56) (actual time=2.709..2.709 rows=1 loops=1)
          Index Cond: (driver_id = 17127)
  Total runtime: 2.732 ms
 (4 rows)

 Bad plan:

  Limit  (cost=0.09..0.69 rows=1 width=56) (actual time=216769.111..216769.112 rows=1 loops=1)
    ->  Index Scan Backward using ix_updates_time on updates  (cost=0.09..272339.04 rows=448679 width=56) (actual time=216769.110..216769.110 rows=1 loops=1)
          Filter: (driver_id = 30132)
          Rows Removed by Filter: 5132087
  Total runtime: 216769.174 ms


 From cursory testing, the difference seems to depend on how many
 total rows there are for a particular driver. The above query uses
 updates_driver_id_time_idx for drivers with fewer than about 300K rows,
 but ix_updates_time for drivers with more than about 300K rows.

 Anything we can do to make it do the "right" thing? We are also
 considering denormalizing the data and keeping a "cache" of the same
 data in another table.

 pgsql version: 9.3.14 and 9.5.3, already tried vacuum analyze.

 Thanks,
 Eric
 
You should have this index:
 
create index updates_driver_time_idx ON updates(driver_id, "time" DESC);
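A related planner-steering trick often suggested on this list (my illustration, not from this thread) is to make the ORDER BY match the composite index exactly. With the equality on driver_id the result order is unchanged, but ix_updates_time alone can no longer satisfy the ordering, so the planner is pushed toward the (driver_id, "time") index:

-- Sketch only: driver_id in the ORDER BY is redundant for the result
-- (it is fixed by the WHERE clause) but rules out the single-column
-- time index as a way to produce the requested order.
SELECT *
FROM updates
WHERE driver_id = 123
ORDER BY driver_id DESC, "time" DESC
LIMIT 1;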
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Tuning one Recursive CTE

2016-11-09 Thread Andreas Joseph Krogh
On Wednesday, November 9, 2016 at 15:30:20, Henrik Ekenberg <hen...@ekenberg.pw> wrote:
Hi,

 I will need to anonymize it before sending it.
 Do you know if there are any tuning documents related to CTE scans?

You might want to read this:
http://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/

https://robots.thoughtbot.com/advanced-postgres-performance-tips#common-table-expressions-and-subqueries

https://www.postgresql.org/message-id/CAPo4y_XUJR1sijvTySy9W%2BShpORwzbhSdEzE9pgtc1%3DcTkvpkw%40mail.gmail.com
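For illustration (my example, not from the linked posts): before PostgreSQL 12 a CTE is an optimization fence, so outer predicates are not pushed into it. The table and columns below are hypothetical.

-- The CTE is materialized in full; the outer filter cannot be pushed down:
WITH recent AS (
    SELECT * FROM orders WHERE created_at > now() - interval '30 days'
)
SELECT * FROM recent WHERE customer_id = 42;

-- Rewritten as a plain subquery, the planner can push customer_id = 42 down
-- and use an index such as (customer_id, created_at):
SELECT *
FROM (
    SELECT * FROM orders WHERE created_at > now() - interval '30 days'
) recent
WHERE customer_id = 42;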
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Tuning one Recursive CTE

2016-11-09 Thread Andreas Joseph Krogh
On Wednesday, November 9, 2016 at 14:05:55, Henrik Ekenberg <hen...@ekenberg.pw> wrote:
Hi,

 I am trying to tune a recursive CTE.

 The explain plan can be found here:
 https://explain.depesz.com/s/yLVd

 Can anyone give me a direction to check?

 //H.

 
Rule number one: always provide the query in question when asking for help tuning it.
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Very poor performance with Nested Loop Anti Join

2016-08-02 Thread Andreas Joseph Krogh
On Tuesday, August 2, 2016 at 01:15:05, Tom Lane <t...@sss.pgh.pa.us> wrote:
Andreas Joseph Krogh <andr...@visena.com> writes:
 > This query performs terribly slowly (~26 minutes, 1561346.597 ms):

 Seems like the key misestimation is on the inner antijoin:

 >                ->  Hash Anti Join  (cost=654.21..4008.72 rows=1 width=8) (actual time=9.016..40.672 rows=76174 loops=1)
 >                      Hash Cond: (il.invoice_id = creditnote.credit_against)
 >                      ->  Seq Scan on onp_crm_invoice_line il  (cost=0.00..3062.01 rows=78001 width=8) (actual time=0.005..11.259 rows=78614 loops=1)
 >                      ->  Hash  (cost=510.56..510.56 rows=11492 width=8) (actual time=8.940..8.940 rows=372 loops=1)
 >                            Buckets: 16384  Batches: 1  Memory Usage: 143kB
 >                            ->  Seq Scan on onp_crm_invoice creditnote  (cost=0.00..510.56 rows=11492 width=8) (actual time=0.014..7.882 rows=11507 loops=1)
 >                                  Filter: ((sent_date <= '2016-06-27'::date) AND ((status_key)::text = 'INVOICE_STATUS_INVOICED'::text))
 >                                  Rows Removed by Filter: 149

 If it realized that this produces 78k rows not 1, it'd likely do something
 smarter at the outer antijoin.

 I have no idea why that estimate's so far off though.  What PG version is
 this?  Stats all up to date on these two tables?
 
Sorry for not providing the PG version; this is on 9.5.3.
All stats are up to date, or should be, as I've analyzed all tables manually.

 
Are the rows excluded by the filter condition on "creditnote" significantly different from the rest of that table?
 
This also happens without the filter condition:
 
explain analyze
SELECT log.relation_id AS company_id, sum(log.duration) AS duration
FROM onp_crm_activity_log log
JOIN onp_crm_person logfor
  ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
WHERE 1 = 1
  -- Filter out already invoiced
  AND NOT EXISTS(
      SELECT *
      FROM onp_crm_calendarentry_invoice_membership cem
      JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
      JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id
      WHERE cem.calendar_entry_id = log.id
        AND NOT EXISTS(
            SELECT *
            FROM onp_crm_invoice creditnote
            WHERE il.invoice_id = creditnote.credit_against
        )
  )
GROUP BY log.relation_id;
 
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=12049.35..12067.11 rows=1421 width=12) (actual time=1386683.646..1386683.858 rows=720 loops=1)
   Group Key: log.relation_id
   ->  Nested Loop Anti Join  (cost=512.08..12039.32 rows=2006 width=12) (actual time=395.017..1386576.756 rows=93480 loops=1)
         Join Filter: (cem.calendar_entry_id = log.id)
         Rows Removed by Join Filter: 12185913244
         ->  Hash Join  (cost=86.56..9757.61 rows=2006 width=20) (actual time=0.165..366.778 rows=181872 loops=1)
               Hash Cond: (log.logged_for = logfor.onp_user_id)
               ->  Seq Scan on onp_crm_activity_log log  (cost=0.00..8957.45 rows=184945 width=24) (actual time=0.003..256.862 rows=182606 loops=1)
               ->  Hash  (cost=39.46..39.46 rows=3768 width=8) (actual time=0.132..0.132 rows=36 loops=1)
                     Buckets: 4096  Batches: 1  Memory Usage: 34kB
                     ->  Bitmap Heap Scan on onp_crm_person logfor  (cost=3.69..39.46 rows=3768 width=8) (actual time=0.033..0.125 rows=36 loops=1)
                           Recheck Cond: (onp_user_id IS NOT NULL)
                           Filter: (NOT is_resource)
                           Rows Removed by Filter: 5
                           Heap Blocks: exact=10
                           ->  Bitmap Index Scan on onp_crm_person_onp_id_idx  (cost=0.00..2.75 rows=41 width=0) (actual time=0.017..0.017 rows=41 loops=1)
         ->  Materialize  (cost=425.53..2251.62 rows=1 width=4) (actual time=0.000..2.544 rows=67003 loops=181872)
               ->  Nested Loop  (cost=425.53..2251.61 rows=1 width=4) (actual time=3.283..320.057 rows=88511 loops=1)
                     ->  Nested Loop  (cost=425.24..2251.30 rows=1 width=8) (actual time=3.241..154.783 rows=88511 loops=1)
                           ->  Hash Anti Join  (cost=424.95..2250.75 rows=1 width=8) (actual time=3.110..30.097 rows=76281 loops=1)
                                 Hash Cond: (il.invoice_id = creditnote.credit_against)
                                 ->  Index Only Scan using origo_invoice_line_id_invoice_idx on onp_crm_invoice_line il  (cost=0.29..1530.95 rows=78707 width=8) (actual time=0.030..13.719 rows=78740 loops=1)
                                       Heap Fetches: 2967
                                 ->  Hash  (cost=278.22..278.22 rows=11715 width=8) (actual time=3.003..3.003 rows=376 loops=1)
                                       Buckets: 16384  Batches: 1  Memory Usage: 143kB
                                       ->  Index Only Scan using origo_invoice_credit_against_idx on onp_crm_invoice creditnote  (cost=0.29..278.22 rows=11715 width=8) (actual time=0.042..2.082 rows=11692 loops=1)
                                             Heap Fetches: 1151
                           ->  Index Only Scan using onp_crm_calendarentry_invoice_invoice_line_id_calendar_entr_key on onp_crm_calendarentry_invoice_membership

Re: [PERFORM] Very poor performance with Nested Loop Anti Join

2016-08-01 Thread Andreas Joseph Krogh
On Monday, August 1, 2016 at 15:33:04, Andreas Joseph Krogh <andr...@visena.com> wrote:
I have this schema:
 
CREATE TABLE onp_crm_person(
    id serial PRIMARY KEY,
    onp_user_id bigint references onp_user(id) deferrable initially deferred,
    is_resource boolean not null default false,
    UNIQUE(onp_user_id)
);

CREATE TABLE onp_crm_activity_log(
    id bigserial PRIMARY KEY,
    relation_id integer REFERENCES onp_crm_relation(entity_id),
    logged_for int references onp_crm_person(onp_user_id),
    duration bigint
);

CREATE TABLE onp_crm_invoice(
    entity_id bigint PRIMARY KEY REFERENCES onp_crm_entity(entity_id),
    status_key VARCHAR NOT NULL,
    credit_against bigint REFERENCES onp_crm_invoice(entity_id),
    sent_date DATE,
    UNIQUE(credit_against) deferrable INITIALLY DEFERRED
    -- invoice_print_template_id is added after creation of origo_invoice_print_template
);

CREATE TABLE onp_crm_invoice_line (
    id SERIAL PRIMARY KEY,
    invoice_id INTEGER NOT NULL REFERENCES onp_crm_invoice (entity_id)
);

CREATE TABLE onp_crm_calendarentry_invoice_membership(
    invoice_line_id INTEGER NOT NULL REFERENCES onp_crm_invoice_line(id) ON DELETE CASCADE,
    calendar_entry_id INTEGER NOT NULL REFERENCES onp_crm_activity_log(id),
    unique(invoice_line_id, calendar_entry_id) DEFERRABLE INITIALLY DEFERRED
);
 
This query performs terribly slowly (~26 minutes, 1561346.597 ms):
 
explain analyze
SELECT log.relation_id AS company_id, sum(log.duration) AS duration
FROM onp_crm_activity_log log
JOIN onp_crm_person logfor
  ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
WHERE 1 = 1
  -- Filter out already invoiced before 2016-06-27
  AND NOT EXISTS(
      SELECT *
      FROM onp_crm_calendarentry_invoice_membership cem
      JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
      JOIN onp_crm_invoice inv ON il.invoice_id = inv.entity_id
      WHERE cem.calendar_entry_id = log.id
        AND inv.status_key = 'INVOICE_STATUS_INVOICED'
        AND inv.sent_date <= '2016-06-27'
        AND NOT EXISTS(
            SELECT *
            FROM onp_crm_invoice creditnote
            WHERE il.invoice_id = creditnote.credit_against
              AND creditnote.status_key = 'INVOICE_STATUS_INVOICED'
              AND creditnote.sent_date <= '2016-06-27'
        )
  )
GROUP BY log.relation_id;
 
Explain output:
                                                                              
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=13778.63..13796.39 rows=1421 width=12) (actual time=1561343.861..1561344.042 rows=724 loops=1)
   Group Key: log.relation_id
   ->  Nested Loop Anti Join  (cost=741.35..13768.63 rows=2000 width=12) (actual time=471.973..1561221.929 rows=96095 loops=1)
         Join Filter: (cem.calendar_entry_id = log.id)
         Rows Removed by Join Filter: 11895758618
         ->  Hash Join  (cost=86.56..9729.03 rows=2000 width=20) (actual time=0.170..668.911 rows=181644 loops=1)
               Hash Cond: (log.logged_for = logfor.onp_user_id)
               ->  Seq Scan on onp_crm_activity_log log  (cost=0.00..8930.98 rows=184398 width=24) (actual time=0.007..538.893 rows=182378 loops=1)
               ->  Hash  (cost=39.46..39.46 rows=3768 width=8) (actual time=0.126..0.126 rows=36 loops=1)
                     Buckets: 4096  Batches: 1  Memory Usage: 34kB
                     ->  Bitmap Heap Scan on onp_crm_person logfor  (cost=3.69..39.46 rows=3768 width=8) (actual time=0.040..0.106 rows=36 loops=1)
                           Recheck Cond: (onp_user_id IS NOT NULL)
                           Filter: (NOT is_resource)
                           Rows Removed by Filter: 5
                           Heap Blocks: exact=10
                           ->  Bitmap Index Scan on onp_crm_person_onp_id_idx  (cost=0.00..2.75 rows=41 width=0) (actual time=0.019..0.019 rows=41 loops=1)
         ->  Materialize  (cost=654.79..4009.60 rows=1 width=4) (actual time=0.000..2.829 rows=65490 loops=181644)
               ->  Nested Loop  (cost=654.79..4009.59 rows=1 width=4) (actual time=9.056..386.835 rows=85668 loops=1)
                     ->  Nested Loop  (cost=654.50..4009.27 rows=1 width=8) (actual time=9.046..165.280 rows=88151 loops=1)
                           ->  Hash Anti Join  (cost=654.21..4008.72 rows=1 width=8) (actual time=9.016..40.672 rows=76174 loops=1)
                                 Hash Cond: (il.invoice_id = creditnote.credit_against)
                                 ->  Seq Scan on onp_crm_invoice_line il  (cost=0.00..3062.01 rows=78001 width=8) (actual time=0.005..11.259 rows=78614 loops=1)
                                 ->  Hash  (cost=510.56..510.56 rows=1149

[PERFORM] Very poor performance with Nested Loop Anti Join

2016-08-01 Thread Andreas Joseph Krogh
 
onp_crm_invoice_line il  (cost=0.00..3062.01 rows=78001 width=8) (actual time=0.008..12.476 rows=78622 loops=1)
                                 ->  Hash  (cost=510.56..510.56 rows=11492 width=8) (actual time=12.477..12.477 rows=372 loops=1)
                                       Buckets: 16384  Batches: 1  Memory Usage: 143kB
                                       ->  Seq Scan on onp_crm_invoice creditnote  (cost=0.00..510.56 rows=11492 width=8) (actual time=0.008..10.963 rows=11507 loops=1)
                                             Filter: ((sent_date <= '2016-06-27'::date) AND ((status_key)::text = 'INVOICE_STATUS_INVOICED'::text))
                                             Rows Removed by Filter: 151
 Planning time: 3.510 ms
 Execution time: 338.349 ms
 (39 rows)

 So my question is: is there something I can do to make PG favor a Hash Anti Join instead of a Nested Loop Anti Join (which I assume is the problem)?
Can the nested NOT EXISTS be rewritten to be more performant?
 
Thanks.
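One possible rewrite to sketch (my illustration, not verified against this schema's data): flatten the correlated NOT EXISTS into explicit anti-joins, which gives the planner plain joins it can hash. The join to onp_crm_invoice is dropped here because invoice_id is declared NOT NULL REFERENCES onp_crm_invoice, so in the simplified query it only proved existence:

-- Hedged sketch of the simplified (unfiltered) query as a chain of anti-joins:
SELECT log.relation_id AS company_id, sum(log.duration) AS duration
FROM onp_crm_activity_log log
JOIN onp_crm_person logfor
  ON logfor.onp_user_id = log.logged_for AND logfor.is_resource = FALSE
LEFT JOIN (
    -- calendar entries already invoiced on an un-credited invoice
    SELECT cem.calendar_entry_id
    FROM onp_crm_calendarentry_invoice_membership cem
    JOIN onp_crm_invoice_line il ON cem.invoice_line_id = il.id
    LEFT JOIN onp_crm_invoice creditnote ON il.invoice_id = creditnote.credit_against
    WHERE creditnote.credit_against IS NULL
) invoiced ON invoiced.calendar_entry_id = log.id
WHERE invoiced.calendar_entry_id IS NULL
GROUP BY log.relation_id;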
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com




Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-21 Thread Andreas Joseph Krogh
On Monday, March 21, 2016 at 16:13:07, Oleg Bartunov <obartu...@gmail.com> wrote:
On Mon, Mar 21, 2016 at 5:41 PM, Andreas Joseph Krogh <andr...@visena.com> wrote:
On Saturday, March 19, 2016 at 03:44:55, Jeff Janes <jeff.ja...@gmail.com> wrote:
On Wed, Mar 16, 2016 at 6:53 AM, Andreas Joseph Krogh <andr...@visena.com> wrote:
On Wednesday, March 16, 2016 at 14:37:27, Tom Lane <t...@sss.pgh.pa.us> wrote:
Andreas Joseph Krogh <andr...@visena.com> writes:
 > 1. Why isn't folder_id part of the index-cond?

 Because a GIN index is useless for sorting.

 > 2. Is there a way to make it use the (same) index to sort by
 > received_timestamp?

 No.

 > 3. Using a GIN-index, is there a way to use the index at all for sorting?

 No.

 > 4. It doesn't seem like ts_rank uses the index for sorting either.

 Same reason.

 regards, tom lane
 
So it's basically impossible to use FTS/GIN with sorting on large datasets?
Are there any plans to improve this situation?
 
I don't see why it would not be possible to create a new execution node type 
that does an index scan to obtain order (or just to satisfy an equality or 
range expression), and takes a bitmap (as produced by the FTS/GIN) to apply as 
a filter.  But, I don't know of anyone planning on doing that.



 
Isn't this what Postgres Pro are planning? 
http://postgrespro.com/roadmap/mssearch 
 
"Unlike external special-purpose search engines, a full-text search engine 
built in a DBMS is capable of combining full-text and attributive search 
criteria in SQL query syntax. It is planned to improve the existing PostgreSQL 
full-text search engine by extending the functionality of Generalized Inverted 
Index (GIN) to make it capable of storing extra information required for 
ranging query results. This search acceleration will allow to go back from 
external full-text search engines, thus facilitating system administration and 
use, reducing technology risks, and improving information security."
 
This is a different feature! Actually, we already have a prototype of what Jeff 
suggested, we call it bitmap filtering, but we failed to find a use case where it 
provides benefits. Teodor will comment on this idea in more detail.



 
The feature I'm missing is the ability to do FTS (or use GIN in general) and 
then sort on some other column (also indexed by the same GIN index, using the 
btree_gin extension), often of type BIGINT or TIMESTAMP.
Are you planning to work on such a feature for GIN?
 
Thanks.
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-21 Thread Andreas Joseph Krogh
On Saturday, March 19, 2016 at 03:44:55, Jeff Janes <jeff.ja...@gmail.com> wrote:
On Wed, Mar 16, 2016 at 6:53 AM, Andreas Joseph Krogh <andr...@visena.com> wrote:
On Wednesday, March 16, 2016 at 14:37:27, Tom Lane <t...@sss.pgh.pa.us> wrote:
Andreas Joseph Krogh <andr...@visena.com> writes:
 > 1. Why isn't folder_id part of the index-cond?

 Because a GIN index is useless for sorting.

 > 2. Is there a way to make it use the (same) index to sort by
 > received_timestamp?

 No.

 > 3. Using a GIN-index, is there a way to use the index at all for sorting?

 No.

 > 4. It doesn't seem like ts_rank uses the index for sorting either.

 Same reason.

 regards, tom lane
 
So it's basically impossible to use FTS/GIN with sorting on large datasets?
Are there any plans to improve this situation?
 
I don't see why it would not be possible to create a new execution node type 
that does an index scan to obtain order (or just to satisfy an equality or 
range expression), and takes a bitmap (as produced by the FTS/GIN) to apply as 
a filter.  But, I don't know of anyone planning on doing that.



 
Isn't this what Postgres Pro are planning? 
http://postgrespro.com/roadmap/mssearch
 
"Unlike external special-purpose search engines, a full-text search engine 
built in a DBMS is capable of combining full-text and attributive search 
criteria in SQL query syntax. It is planned to improve the existing PostgreSQL 
full-text search engine by extending the functionality of Generalized Inverted 
Index (GIN) to make it capable of storing extra information required for 
ranging query results. This search acceleration will allow to go back from 
external full-text search engines, thus facilitating system administration and 
use, reducing technology risks, and improving information security."
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-19 Thread Andreas Joseph Krogh
On Wednesday, March 16, 2016 at 14:37:27, Tom Lane <t...@sss.pgh.pa.us> wrote:
Andreas Joseph Krogh <andr...@visena.com> writes:
 > 1. Why isn't folder_id part of the index-cond?

 Because a GIN index is useless for sorting.

 > 2. Is there a way to make it use the (same) index to sort by
 > received_timestamp?

 No.

 > 3. Using a GIN-index, is there a way to use the index at all for sorting?

 No.

 > 4. It doesn't seem like ts_rank uses the index for sorting either.

 Same reason.

 regards, tom lane
 
So it's basically impossible to use FTS/GIN with sorting on large datasets?
Are there any plans to improve this situation?
 
Thanks.
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-19 Thread Andreas Joseph Krogh
On Wednesday, March 16, 2016 at 16:04:08, Evgeniy Shishkin <itparan...@gmail.com> wrote:

 > On 16 Mar 2016, at 17:52, Evgeniy Shishkin <itparan...@gmail.com> wrote:
 >
 >
 >> On 16 Mar 2016, at 16:37, Tom Lane <t...@sss.pgh.pa.us> wrote:
 >>
 >> Andreas Joseph Krogh <andr...@visena.com> writes:
  >>> 1. Why isn't folder_id part of the index-cond?
 >>
 >> Because a GIN index is useless for sorting.
 >
 > I don't see how gin inability to return sorted data relates to index 
condition.
 > In fact i tried to reproduce the example,
 > and if i change folder_id to int from bigint, then index condition with 
folder_id is used
 >
 >         Index Cond: ((fts_all @@ '''hi'''::tsquery) AND (folder_id = 1))
 >

 Looks like documentation 
http://www.postgresql.org/docs/9.5/static/btree-gin.html
 is lying about supporting int8 type
 
Hm, interesting!
 
@Tom: Any idea why BIGINT doesn't work, but INTEGER does?
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-19 Thread Andreas Joseph Krogh
On Wednesday, March 16, 2016 at 16:07:56, Evgeniy Shishkin <itparan...@gmail.com> wrote:

 > On 16 Mar 2016, at 18:04, Evgeniy Shishkin <itparan...@gmail.com> wrote:
 >
 >>
 >> On 16 Mar 2016, at 17:52, Evgeniy Shishkin <itparan...@gmail.com> wrote:
 >>
 >>
 >>> On 16 Mar 2016, at 16:37, Tom Lane <t...@sss.pgh.pa.us> wrote:
 >>>
 >>> Andreas Joseph Krogh <andr...@visena.com> writes:
  >>>> 1. Why isn't folder_id part of the index-cond?
 >>>
 >>> Because a GIN index is useless for sorting.
 >>
 >> I don't see how gin inability to return sorted data relates to index 
condition.
 >> In fact i tried to reproduce the example,
 >> and if i change folder_id to int from bigint, then index condition with 
folder_id is used
 >>
 >>        Index Cond: ((fts_all @@ '''hi'''::tsquery) AND (folder_id = 1))
 >>
 >
 > Looks like documentation 
http://www.postgresql.org/docs/9.5/static/btree-gin.html
 > is lying about supporting int8 type
 >

 Uh, it works if I cast to bigint explicitly
       WHERE  del.fts_all @@ to_tsquery('simple', 'hi')
       AND del.folder_id = 1::bigint;
 results in
          Index Cond: ((folder_id = '1'::bigint) AND (fts_all @@ 
'''hi'''::tsquery))
 
Hm, this is quite cranky, but thanks for the heads-up!
 
Though it looks like it works when prepared, without an explicit cast:
 
prepare fish AS
SELECT del.id, del.received_timestamp
FROM delivery del
WHERE 1 = 1
  AND del.fts_all @@ to_tsquery('simple', $1)
  AND del.folder_id = $2
ORDER BY del.received_timestamp DESC
LIMIT 101 OFFSET 0;

explain analyze execute fish('hi:*', 1);

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=9.13..9.13 rows=1 width=16) (actual time=0.047..0.048 rows=2 loops=1)
   ->  Sort  (cost=9.13..9.13 rows=1 width=16) (actual time=0.045..0.045 rows=2 loops=1)
         Sort Key: received_timestamp DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on delivery del  (cost=7.80..9.12 rows=1 width=16) (actual time=0.034..0.034 rows=2 loops=1)
               Recheck Cond: ((fts_all @@ '''hi'':*'::tsquery) AND (folder_id = '1'::bigint))
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on fts_idx  (cost=0.00..7.80 rows=1 width=0) (actual time=0.023..0.023 rows=2 loops=1)
                     Index Cond: ((fts_all @@ '''hi'':*'::tsquery) AND (folder_id = '1'::bigint))
 Execution time: 0.103 ms
 (10 rows)
 

 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-18 Thread Andreas Joseph Krogh
On Wednesday, March 16, 2016 at 14:53:04, Andreas Joseph Krogh <andr...@visena.com> wrote:
On Wednesday, March 16, 2016 at 14:37:27, Tom Lane <t...@sss.pgh.pa.us> wrote:
Andreas Joseph Krogh <andr...@visena.com> writes:
 > 1. Why isn't folder_id part of the index-cond?

 Because a GIN index is useless for sorting.

 > 2. Is there a way to make it use the (same) index to sort by
 > received_timestamp?

 No.

 > 3. Using a GIN-index, is there a way to use the index at all for sorting?

 No.

 > 4. It doesn't seem like ts_rank uses the index for sorting either.

 Same reason.

 regards, tom lane
 
So it's basically impossible to use FTS/GIN with sorting on large datasets?
Are there any plans to improve this situation?
 
Thanks.
 
These slides discuss ORDER BY optimizations for FTS (starting at slides 6 and 7):
http://www.sai.msu.su/~megera/postgres/talks/Next%20generation%20of%20GIN.pdf
 
This indicates some work is being done in this area.
 
Oleg, if you're listening, do you guys have any exciting news regarding this?
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-18 Thread Andreas Joseph Krogh
On Thursday, March 17, 2016 at 18:20:23, Tom Lane <t...@sss.pgh.pa.us> wrote:
Evgeniy Shishkin <itparan...@gmail.com> writes:
 > Uh, it works if I cast to bigint explicitly

 FWIW, the reason for that is that the int8_ops operator class that
 btree_gin creates doesn't contain any cross-type operators.  Probably
 wouldn't be that hard to fix if somebody wanted to put in the work.

 regards, tom lane
 
Thanks for info.
 
Can you explain why it works when using a prepared statement, without casting? Does the machinery then know the type, so the "setParameter" call uses the correct type?
 
Thanks.
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


[PERFORM] Searching GIN-index (FTS) and sort by timestamp-column

2016-03-16 Thread Andreas Joseph Krogh
Hi all.
 
I'm doing full-text search and want to display the results in the order the 
articles were received (timestamp). I have millions of articles where the 
newest are the most interesting, and a search may match many articles, so doing 
the sort using some index would be great.
 
Take the following example-schema:
 
create extension if not exists btree_gin;

drop table if exists delivery;
create table delivery(
    id BIGSERIAL primary key,
    fts_all TSVECTOR not null,
    folder_id BIGINT NOT NULL,
    received_timestamp TIMESTAMP not null,
    message varchar not null
);

create index fts_idx ON delivery using gin(fts_all, folder_id);

CREATE OR REPLACE FUNCTION update_delivery_tsvector_tf() RETURNS TRIGGER AS
$$
BEGIN
    NEW.fts_all = to_tsvector('simple', NEW.message);
    return NEW;
END;
$$ LANGUAGE PLPGSQL;

CREATE TRIGGER update_delivery_tsvector_t BEFORE INSERT OR UPDATE ON delivery
FOR EACH ROW EXECUTE PROCEDURE update_delivery_tsvector_tf();

insert into delivery(folder_id, received_timestamp, message)
values (1, '2015-01-01', 'Yes hit four')
     , (1, '2014-01-01', 'Hi man')
     , (2, '2013-01-01', 'Hi man')
     , (2, '2013-01-01', 'fish')
;

analyze delivery;
set ENABLE_SEQSCAN to off;

explain analyze
SELECT del.id, del.received_timestamp
FROM delivery del
WHERE 1 = 1
  AND del.fts_all @@ to_tsquery('simple', 'hi:*')
  AND del.folder_id = 1
ORDER BY del.received_timestamp DESC
LIMIT 101 OFFSET 0;
 
I use the btree_gin extension to make folder_id part of the index.
 
I get the following plan (using 9.6 from master):
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=5.23..5.23 rows=1 width=16) (actual time=0.042..0.043 rows=2 loops=1)
   ->  Sort  (cost=5.23..5.23 rows=1 width=16) (actual time=0.040..0.040 rows=2 loops=1)
         Sort Key: received_timestamp DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on delivery del  (cost=3.90..5.22 rows=1 width=16) (actual time=0.029..0.030 rows=2 loops=1)
               Recheck Cond: (fts_all @@ '''hi'':*'::tsquery)
               Filter: (folder_id = 1)
               Rows Removed by Filter: 1
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on fts_idx  (cost=0.00..3.90 rows=1 width=0) (actual time=0.018..0.018 rows=3 loops=1)
                     Index Cond: (fts_all @@ '''hi'':*'::tsquery)
 Planning time: 0.207 ms
 Execution time: 0.085 ms
 (13 rows)
 
Here is the explain from a real-world query:
EXPLAIN ANALYZE
SELECT del.entity_id, del.received_timestamp
FROM origo_email_delivery del
WHERE 1 = 1
  AND del.fts_all @@ to_tsquery('simple', 'andre:*')
  AND del.folder_id = 44964
ORDER BY del.received_timestamp DESC
LIMIT 101 OFFSET 0;

                                                                  QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=92260.66..92260.91 rows=101 width=16) (actual time=347.891..347.907 rows=101 loops=1)
   ->  Sort  (cost=92260.66..92291.08 rows=12167 width=16) (actual time=347.888..347.899 rows=101 loops=1)
         Sort Key: received_timestamp DESC
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Bitmap Heap Scan on origo_email_delivery del  (cost=2480.95..91794.77 rows=12167 width=16) (actual time=152.568..346.132 rows=18257 loops=1)
               Recheck Cond: (fts_all @@ '''andre'':*'::tsquery)
               Filter: (folder_id = 44964)
               Rows Removed by Filter: 264256
               Heap Blocks: exact=80871
               ->  Bitmap Index Scan on temp_fts_idx  (cost=0.00..2477.91 rows=309588 width=0) (actual time=134.903..134.903 rows=282513 loops=1)
                     Index Cond: (fts_all @@ '''andre'':*'::tsquery)
 Planning time: 0.530 ms
 Execution time: 347.967 ms
 (13 rows)
 

 
1. Why isn't folder_id part of the index-cond?
2. Is there a way to make it use the (same) index to sort by 
received_timestamp?
3. Using a GIN-index, is there a way to use the index at all for sorting?
4. It doesn't seem like ts_rank uses the index for sorting either.
 
Thanks.

 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com




Re: [PERFORM] problem with select *

2015-08-24 Thread Andreas Joseph Krogh
On Monday, August 24, 2015 at 09:04:07, bhuvan Mitra <bhuvan...@gmail.com> wrote:
Hello,
  
 I have a table with 12 columns and 20 million rows. While writing to the table I do not find any problem, but when reading it I face some issues. When I perform a 'select * from table limit 14000000;' (selecting 14 million rows), it works fine. If the limit value is 15000000, it throws an 'out of memory' error.
 
 If the query is 'select * from table', the process gets killed, displaying the message 'killed'.
 
 Kindly tell me where it is going wrong. I have 6MB cache, 1.6GHz CPU, Linux 14.04 OS, 8GB RAM.

 
In what application are you performing these queries?
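For context (my addition, not from the thread): most clients buffer the entire result set in memory by default, which is usually what runs out first on a SELECT * over millions of rows. A cursor streams the rows instead:

-- Hedged sketch; "my_table" stands in for the real table name.
BEGIN;
DECLARE big_cur CURSOR FOR SELECT * FROM my_table;
FETCH 10000 FROM big_cur;   -- repeat until it returns no rows
CLOSE big_cur;
COMMIT;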
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Slow HashAggregate/cache access

2015-08-06 Thread Andreas Joseph Krogh
=1)
               Filter: (fr01codemp = 1::smallint)
               Buffers: shared hit=2
 Total runtime: 18.528 ms
(35 rows)

 
 
Tomorrow I will try to do the same with the other slow query and report back here.



 
It will be interesting to see how Oracle and SQL Server perform with the 
rewritten query too.
Thanks.
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] Slow HashAggregate/cache access

2015-08-05 Thread Andreas Joseph Krogh
On Wednesday, August 5, 2015 at 20:25:25, Maxim Boguk <maxim.bo...@gmail.com> wrote:
[snip]
I think I know where the issue is.
The PostgreSQL planner is unable to pass join conditions into a subquery with aggregate functions (it's a well-known limitation).
[snip]




 
I'm curious; will 9.5 help here, as it has WHERE-clause pushdown in subqueries with window functions?

http://michael.otacoo.com/postgresql-2/postgres-9-5-feature-highlight-where-pushdown-with-window-function/

 
Are you able to try 9.5 and post the results?
 
Thanks.
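For reference, the pushdown in question looks like this (my illustration with hypothetical names, following the linked post):

-- In 9.5, a predicate on the PARTITION BY column can be pushed below the
-- window function; "t" and its columns are hypothetical.
SELECT *
FROM (SELECT t.*, avg(duration) OVER (PARTITION BY relation_id) AS avg_duration
      FROM t) sub
WHERE relation_id = 42;   -- 9.5+ may push this into the scan of t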
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 



Re: [PERFORM] Slow HashAggregate/cache access

2015-08-05 Thread Andreas Joseph Krogh
On Wednesday, August 5, 2015 at 22:53:25, Alexandre de Arruda Paes <adald...@gmail.com> wrote:
Hi Andreas,

Same plan in 9.5, but the execution time was greater than in 9.3 (maybe it needs some tuning):

 
Thanks for sharing.
Maybe some @hackers will chime in and comment.
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


Re: [PERFORM] New server: SSD/RAID recommendations?

2015-07-01 Thread Andreas Joseph Krogh
On Thursday, July 2, 2015 at 01:06:57, Craig James <cja...@emolecules.com> wrote:
We're buying a new server in the near future to replace an aging system. I'd 
appreciate advice on the best SSD devices and RAID controller cards available 
today. 
The database is about 750 GB. This is a warehouse server. We load supplier 
catalogs throughout a typical work week, then on the weekend (after Q/A), 
integrate the new supplier catalogs into our customer-visible store, which is 
then copied to a production server where customers see it. So the load is 
mostly data loading, and essentially no OLTP. Typically there are fewer than a 
dozen connections to Postgres.  
Linux 2.6.32
Postgres 9.3
Hardware:
  2 x INTEL WESTMERE 4C XEON 2.40GHZ
  12GB DDR3 ECC 1333MHz
  3WARE 9650SE-12ML with BBU
  12 x 1TB Hitachi 7200RPM SATA disks
RAID 1 (2 disks)
   Linux partition
   Swap partition
   pg_xlog partition
RAID 10 (8 disks)
   Postgres database partition
 
We get 5000-7000 TPS from pgbench on this system.
 
The new system will have at least as many CPUs, and probably a lot more memory 
(196 GB). The database hasn't reached 1TB yet, but we'd like room to grow, so 
we'd like a 2TB file system for Postgres. We'll start with the latest versions 
of Linux and Postgres.
 
Intel's products have always received good reports in this forum. Is that 
still the best recommendation? Or are there good alternatives that are price 
competitive?
 
What about a RAID controller? Are RAID controllers even available for 
PCI-Express SSD drives, or do we have to stick with SATA if we need a 
battery-backed RAID controller? Or is software RAID sufficient for SSD drives?
 
Are spinning disks still a good choice for the pg_xlog partition and OS? Is 
there any reason to get spinning disks at all, or is it better/simpler to just 
put everything on SSD drives?
 
Thanks in advance for your advice!




 
 
It depends on your SSD drives, but today's enterprise-grade SSDs can handle 
pg_xlog just fine. So I'd go full SSD, unless you have many BLOBs in 
pg_largeobject; then move those to a separate tablespace on 
archive-grade disks (spinning disks).
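For reference, the tablespace split looks like this (a minimal sketch; the path and table name are hypothetical, and note that pg_largeobject itself is a system catalog, so this applies to ordinary tables holding BLOB data):

CREATE TABLESPACE archive_disks LOCATION '/mnt/spinning_raid';  -- hypothetical path
ALTER TABLE blob_archive SET TABLESPACE archive_disks;          -- hypothetical table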
 
-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com


 


[PERFORM] Performance of vacuumlo

2015-04-14 Thread Andreas Joseph Krogh
Hi all.

I have a pg_largeobject of ~300GB size, and when I run vacuumlo -n dbname, I get:

Would remove 82172 large objects from database "dbname".

So I'm running without -n to do the actual work, but it seems to take forever. The disks are 8 SAS 10K HDD drives in RAID5.

Any hints on how long this is supposed to take?

Thanks.

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

[PERFORM] Poor performance when deleting from entity-attribute-value type master-table

2015-02-09 Thread Andreas Joseph Krogh
Hi all.

Using PG-9.4.0 I'm seeing this trying to delete from an entity-master table:

*# explain analyze delete from onp_crm_entity where entity_id IN (select tmp.delivery_id from temp_delete_delivery_id tmp);
                                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on onp_crm_entity  (cost=0.43..5673.40 rows=1770 width=12) (actual time=7.370..7.370 rows=0 loops=1)
   ->  Nested Loop  (cost=0.43..5673.40 rows=1770 width=12) (actual time=0.050..1.374 rows=108 loops=1)
         ->  Seq Scan on temp_delete_delivery_id tmp  (cost=0.00..27.70 rows=1770 width=14) (actual time=0.014..0.080 rows=108 loops=1)
         ->  Index Scan using onp_crm_entity_pkey on onp_crm_entity  (cost=0.43..3.18 rows=1 width=14) (actual time=0.010..0.011 rows=1 loops=108)
               Index Cond: (entity_id = tmp.delivery_id)
 Planning time: 0.314 ms
 Trigger for constraint onp_crm_activity_entity_id_fkey: time=4.141 calls=108
 Trigger for constraint ...
 Trigger for constraint ...
 Trigger for constraint ...

I have lots of tables referencing onp_crm_entity(entity_id), so I expect the poor performance of deleting from it is caused by all the triggers firing to check FK constraints.

Are there any ways around this, or do people simply avoid having FKs in schemas like this?

Thanks.

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [PERFORM] Poor performance when deleting from entity-attribute-value type master-table

2015-02-09 Thread Andreas Joseph Krogh
On Monday, February 9, 2015 at 22:36:55, Igor Neyman <iney...@perceptron.com> wrote:

 

From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Andreas Joseph Krogh
Sent: Monday, February 09, 2015 4:13 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Poor performance when deleting from entity-attribute-value 
type master-table

 

Hi all.

 

Using PG-9.4.0 I'm seeing this trying to delete from an entity-master table:

 

*# explain analyze delete from onp_crm_entity where entity_id IN (select tmp.delivery_id from temp_delete_delivery_id tmp);
                                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on onp_crm_entity  (cost=0.43..5673.40 rows=1770 width=12) (actual time=7.370..7.370 rows=0 loops=1)
   ->  Nested Loop  (cost=0.43..5673.40 rows=1770 width=12) (actual time=0.050..1.374 rows=108 loops=1)
         ->  Seq Scan on temp_delete_delivery_id tmp  (cost=0.00..27.70 rows=1770 width=14) (actual time=0.014..0.080 rows=108 loops=1)
         ->  Index Scan using onp_crm_entity_pkey on onp_crm_entity  (cost=0.43..3.18 rows=1 width=14) (actual time=0.010..0.011 rows=1 loops=108)
               Index Cond: (entity_id = tmp.delivery_id)
 Planning time: 0.314 ms
 Trigger for constraint onp_crm_activity_entity_id_fkey: time=4.141 calls=108
 Trigger for constraint ...
 Trigger for constraint ...
 Trigger for constraint ...

 

 

I have lots of tables referencing onp_crm_entity(entity_id), so I expect the poor performance of deleting from it is caused by all the triggers firing to check FK constraints.

 

 

Andreas, do you have indexes on the FK columns in the child tables?

If not, there is your problem.

Yes, they have indexes, but deleting 1M rows still results in calling the triggers 1M times * the number of FKs...

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com
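A helper for Igor's check (my sketch, single-column FKs only): list FK constraints whose referencing column is not the leading column of any index on the child table:

SELECT c.conrelid::regclass AS child_table, a.attname AS fk_column
FROM pg_constraint c
JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = c.conkey[1]
WHERE c.contype = 'f'
  AND NOT EXISTS (
      SELECT 1
      FROM pg_index i
      WHERE i.indrelid = c.conrelid
        AND i.indkey[0] = c.conkey[1]   -- an index leads with the FK column
  );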

[PERFORM] Query performing very bad and sometimes good

2014-08-05 Thread Andreas Joseph Krogh
time=0.002..0.002 rows=0 loops=354300)
                                      Index Cond: ((owner_id = 3::bigint) AND (message_id = del.message_id))
                                ->  Index Only Scan using origo_email_folder_message_delivery_id_folder_id_key on origo_email_folder_message fm  (cost=0.42..0.50 rows=1 width=16) (actual time=0.003..0.003 rows=0 loops=12870)
                                      Index Cond: ((delivery_id = del.entity_id) AND (folder_id = 44961::bigint))
                                      Heap Fetches: 75
               ->  Index Scan using origo_email_message_pkey on origo_email_message em  (cost=0.42..2.50 rows=1 width=455) (actual time=0.007..0.007 rows=1 loops=75)
                     Index Cond: (entity_id = del.message_id)
         ->  Index Scan using origo_person_entity_id_idx on onp_crm_person pers  (cost=0.28..0.29 rows=1 width=35) (actual time=0.000..0.000 rows=0 loops=75)
               Index Cond: (em.from_entity_id = entity_id)
   ->  Index Scan using onp_crm_relation_pkey on onp_crm_relation comp  (cost=0.27..0.35 rows=1 width=19) (actual time=0.000..0.000 rows=0 loops=75)
         Index Cond: (entity_id = pers.relation_id)
   ->  Index Scan using origo_email_address_pkey on origo_email_address ef  (cost=0.43..0.68 rows=1 width=50) (actual time=0.007..0.008 rows=1 loops=75)
         Index Cond: (em.from_id = entity_id)
   SubPlan 1
     ->  Nested Loop  (cost=0.85..16.90 rows=1 width=24) (actual time=0.012..0.014 rows=1 loops=75)
           ->  Index Scan using origo_email_address_owner_message_id_recipient_type_address_key on origo_email_address_owner o  (cost=0.42..8.45 rows=1 width=12) (actual time=0.008..0.008 rows=1 loops=75)
                 Index Cond: ((message_id = em.entity_id) AND ((recipient_type)::text = 'TO'::text))
           ->  Index Scan using origo_email_address_pkey on origo_email_address a  (cost=0.43..8.45 rows=1 width=28) (actual time=0.002..0.003 rows=1 loops=93)
                 Index Cond: (entity_id = o.address_id)
   SubPlan 2
     ->  Nested Loop  (cost=0.85..16.90 rows=1 width=26) (actual time=0.007..0.008 rows=1 loops=75)
           ->  Index Scan using origo_email_address_owner_message_id_recipient_type_address_key on origo_email_address_owner o_1  (cost=0.42..8.45 rows=1 width=12) (actual time=0.004..0.004 rows=1 loops=75)
                 Index Cond: ((message_id = em.entity_id) AND ((recipient_type)::text = 'TO'::text))
           ->  Index Scan using origo_email_address_pkey on origo_email_address a_1  (cost=0.43..8.45 rows=1 width=30) (actual time=0.002..0.002 rows=1 loops=93)
                 Index Cond: (entity_id = o_1.address_id)
   SubPlan 3
     ->  Seq Scan on origo_favourite_owner person_fav  (cost=0.00..4.75 rows=1 width=4) (actual time=0.027..0.027 rows=0 loops=75)
           Filter: ((favourite_for = 3::bigint) AND (favourite_item = pers.entity_id))
           Rows Removed by Filter: 183
   SubPlan 4
     ->  Seq Scan on origo_favourite_owner company_fav  (cost=0.00..4.75 rows=1 width=4) (actual time=0.023..0.023 rows=0 loops=75)
           Filter: ((favourite_for = 3::bigint) AND (favourite_item = comp.entity_id))
           Rows Removed by Filter: 183
 Total runtime: 3481.136 ms
(42 rows)

Does anyone see anything obvious, or have any hints what to investigate further?

Thanks.

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [PERFORM] OFFSET/LIMIT - Disparate Performance w/ Go application

2014-06-12 Thread Andreas Joseph Krogh
On Thursday, June 12, 2014 at 16:58:06, Tom Lane <t...@sss.pgh.pa.us> wrote:
Matt Silverlock <m...@eatsleeprepeat.net> writes:
 > Hi all. This might be tricky in so much as there's a few moving parts (when isn't there?), but I've tried to test the postgres side as much as possible.
 > Trying to work out a potential database bottleneck with a HTTP application (written in Go):
 > Pages that render HTML templates but don't perform DB queries can hit ~36k+ req/s
 > Pages that perform a SELECT on a single row net about ~6.6k req/s: db.Get(l, "SELECT * FROM listings WHERE id = $1 AND expiry_date > current_date", l.Id)
 > Pages that SELECT multiple rows with OFFSET and LIMIT conditions struggle to top 1.3k req/s

 You don't show us exactly what you're doing with OFFSET/LIMIT, but I'm
 going to guess that you're using it to paginate large query results.
 That's basically always going to suck: Postgres has no way to implement
 OFFSET except to generate and then throw away that number of initial rows.
 If you do the same query over again N times with different OFFSETs, it's
 going to cost you N times as much as the base query would.

Are there any plans to make PG implement OFFSET more efficiently, so it doesn't have to read and throw away?

I used SQL Server back in 2011 in a project and seem to remember they implemented OFFSET pretty fast. Paging in a result set of millions was much faster than in PG.

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com
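The standard list answer to this OFFSET cost is keyset pagination; a hedged sketch against the listings table from the post (the column choice is mine):

-- Remember the last id seen and continue from there; with an index on (id)
-- each page is a cheap range scan instead of generate-and-discard.
SELECT *
FROM listings
WHERE expiry_date > current_date
  AND id > $1          -- last id from the previous page
ORDER BY id
LIMIT 50;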

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-02 Thread Andreas Joseph Krogh
On Friday, May 2, 2014 at 02:17:58, Craig James <cja...@emolecules.com> wrote:
On Thu, May 1, 2014 at 4:26 AM, Andreas Joseph Krogh <andr...@visena.com> wrote:
I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry in a table message_property holding the property (is_read) for that user.

The schema is as follows: [...]

create table person(
    id serial primary key,
    username varchar not null unique
);

create table message(
    id serial primary key,
    subject varchar
);

create table message_property(
    message_id integer not null references message(id),
    person_id integer not null references person(id),
    is_read boolean not null default false,
    unique(message_id, person_id)
);

[...]

So, for person 1 there are 10 unread messages, out of a total of 1 million. 5 of those unread have no entry in message_property and 5 have an entry with is_read set to FALSE.

Here's a possible enhancement: add two columns, an indexed timestamp to the message table, and a timestamp of the oldest message this user has NOT read on the person table. If most users read messages in a timely fashion, this would (in most cases) narrow down the portion of the messages table to a tiny fraction of the total -- just those messages newer than the oldest message this user has not read.

When you sign up a new user, you can set his timestamp to the time the account was created, since presumably messages before that time don't apply.

Whether this will help depends a lot on actual use patterns, i.e. do users typically read all messages or do they leave a bunch of unread messages sitting around forever?

Thanks for the suggestion. A user must be able to read arbitrarily old messages, and messages don't expire.

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com
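Craig's suggestion, sketched as DDL (my illustration; the column names are hypothetical):

-- An indexed timestamp on message, plus the oldest-unread watermark on person:
ALTER TABLE message ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();
CREATE INDEX message_created_at_idx ON message (created_at);
ALTER TABLE person ADD COLUMN oldest_unread timestamptz;

-- Candidate unread messages for person 1 are then bounded from below:
SELECT m.id
FROM message m
WHERE m.created_at >= (SELECT oldest_unread FROM person WHERE id = 1);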

[PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
Hi all,

I'm using PostgreSQL 9.3.2 on x86_64-unknown-linux-gnu.

I have a schema where I have lots of messages and some users who might have read some of them. When a message is read by a user I create an entry in a table message_property holding the property (is_read) for that user.

The schema is as follows:

drop table if exists message_property;
drop table if exists message;
drop table if exists person;

create table person(
    id serial primary key,
    username varchar not null unique
);

create table message(
    id serial primary key,
    subject varchar
);

create table message_property(
    message_id integer not null references message(id),
    person_id integer not null references person(id),
    is_read boolean not null default false,
    unique(message_id, person_id)
);

insert into person(username) values ('user_' || generate_series(0, 999));
insert into message(subject) values ('Subject ' || random() || generate_series(0, 999999));
insert into message_property(message_id, person_id, is_read) select id, 1, true from message order by id limit 999990;
insert into message_property(message_id, person_id, is_read) select id, 1, false from message order by id limit 5 offset 999990;
analyze;

So, for person 1 there are 10 unread messages, out of a total of 1 million. 5 of those unread have no entry in message_property and 5 have an entry with is_read set to FALSE.

I have the following query to list all un-read messages for person with id=1:

SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE 1 = 1
  AND NOT EXISTS(SELECT
                     *
                 FROM message_property pr
                 WHERE pr.message_id = m.id AND pr.person_id = prop.person_id AND prop.is_read = TRUE)
;
The problem is that it's not quite efficient and performs badly; explain analyze shows:

                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=1.27..148784.09 rows=5 width=40) (actual time=918.906..918.913 rows=10 loops=1)
   Merge Cond: (m.id = pr.message_id)
   Join Filter: (prop.is_read AND (pr.person_id = prop.person_id))
   Rows Removed by Join Filter: 5
   ->  Merge Left Join  (cost=0.85..90300.76 rows=1000000 width=40) (actual time=0.040..530.748 rows=1000000 loops=1)
         Merge Cond: (m.id = prop.message_id)
         ->  Index Scan using message_pkey on message m  (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..115.829 rows=1000000 loops=1)
         ->  Index Scan using message_property_message_id_person_id_key on message_property prop  (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.020..130.728 rows=999995 loops=1)
               Index Cond: (person_id = 1)
   ->  Index Only Scan using message_property_message_id_person_id_key on message_property pr  (cost=0.42..40983.40 rows=999995 width=8) (actual time=0.024..140.349 rows=999995 loops=1)
         Index Cond: (person_id = 1)
         Heap Fetches: 999995
 Total runtime: 918.975 ms
(13 rows)
Does anyone have suggestions on how to optimize the query or schema? It's important that any message not having an entry in message_property for a user is considered un-read.

Thanks!

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
On Thursday, May 1, 2014 at 20:35:07, Jochem Berndsen <joc...@functor.nl> wrote:
 Hi Andreas,

 [New to this list, forgive my ignorance.]
 [snip]
 I'm getting better performance with:

 SELECT
 m.id AS message_id,
 1 AS person_id,
 FALSE AS is_read,
 m.subject
 FROM message m
 WHERE 1 = 1
 AND NOT EXISTS(SELECT
      *
      FROM message_property pr
      WHERE pr.message_id = m.id AND pr.person_id = 1 AND pr.is_read);

 You then lose the distinction between message_property with is_read =
 FALSE, and nonexistent message_property for the message row.

 If that is essential, I'm getting a roughly 2x speedup on my non-tuned
 PostgreSQL with:
   SELECT
      m.id                          AS message_id,
      prop.person_id,
      coalesce(prop.is_read, FALSE) AS is_read,
      m.subject
 FROM message m
      LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND
 prop.person_id = 1
 WHERE not coalesce(prop.is_read, false);     Hi Jochem,   Thansk for looking 
at it. I'm still seing ~500ms being spent and I was hoping for a way to do this 
using index so one could achieve 1-10ms, but maybe that's impossible given the 
schema?   Is there a way to design an equivalent  schema to achieve 10ms 
execution-time?   -- Andreas Jospeh Krogh CTO / Partner - Visena AS Mobile: +47 
909 56 963 andr...@visena.com mailto:andr...@visena.com www.visena.com 
https://www.visena.com  https://www.visena.com  
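One schema direction for the ~10 ms target (my sketch, with the trade-off Andreas raises later in the thread: a property row must then exist for every unread (message, person) pair): keep a partial index over unread rows so the listing becomes a short index scan instead of an anti-join.

CREATE INDEX message_property_unread_idx
    ON message_property (person_id, message_id)
    WHERE NOT is_read;

-- Listing unread is then a plain indexed lookup:
SELECT m.id AS message_id, prop.person_id, prop.is_read, m.subject
FROM message_property prop
JOIN message m ON m.id = prop.message_id
WHERE prop.person_id = 1
  AND NOT prop.is_read;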

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
On Thursday, May 1, 2014 at 21:30:39, Pavel Stehule <pavel.steh...@gmail.com> wrote:
Hello
[snip]

I had perfect success on a similar use case with a descending-ordered partial index:

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html

I'm not getting good performance. Are you able to craft an example using my schema and a partial index?

Thanks.

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
On Thursday, May 1, 2014 at 21:53:32, Pavel Stehule <pavel.steh...@gmail.com> wrote:

2014-05-01 21:39 GMT+02:00 Andreas Joseph Krogh <andr...@visena.com>:
On Thursday, May 1, 2014 at 21:30:39, Pavel Stehule <pavel.steh...@gmail.com> wrote:
Hello
[snip]

I had perfect success on a similar use case with a descending-ordered partial index:

http://www.postgresql.org/docs/9.3/interactive/sql-createindex.html

I'm not getting good performance. Are you able to craft an example using my schema and a partial index?

Maybe something like:

CREATE INDEX ON message_property (person_id, message_id) WHERE is_read;

When I think about your schema, it is designed well, but it is not index-friendly, so for some fast access you should hold a cache (table) of unread messages.

Ah, that's what I was hoping not to have to do. In my system, messages arrive all the time, and having to update a cache for all new messages for all users seems messy... It seems I could just as well create a message_property row for all users when a new message arrives, so I can INNER JOIN it and get good performance. But that table will quickly grow *very* large...

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
On Thursday, May 1, 2014 at 23:19:55, David G Johnston <david.g.johns...@gmail.com> wrote:
How does something like:

 WITH unreads AS (
 SELECT messageid FROM message
 EXCEPT
 SELECT messageid FROM message_property WHERE personid=1 AND has_read
 )
 SELECT ...
 FROM unreads
 JOIN messages USING (messageid)
 ;

 perform?

It actually performs worse.

The best query so far is:

SELECT
    m.id                          AS message_id,
    prop.person_id,
    coalesce(prop.is_read, FALSE) AS is_read,
    m.subject
FROM message m
    LEFT OUTER JOIN message_property prop ON prop.message_id = m.id AND prop.person_id = 1
WHERE coalesce(prop.is_read, false) = false;

Giving the plan:

                                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=4.20..90300.76 rows=500000 width=40) (actual time=445.021..445.025 rows=10 loops=1)
   Merge Cond: (m.id = prop.message_id)
   Filter: (NOT COALESCE(prop.is_read, false))
   Rows Removed by Filter: 999990
   ->  Index Scan using message_pkey on message m  (cost=0.42..34317.43 rows=1000000 width=35) (actual time=0.014..113.314 rows=1000000 loops=1)
   ->  Index Scan using message_property_message_id_person_id_key on message_property prop  (cost=0.42..40983.40 rows=999995 width=9) (actual time=0.018..115.019 rows=999995 loops=1)
         Index Cond: (person_id = 1)
 Total runtime: 445.076 ms
(8 rows)

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
On Thursday, May 1, 2014 at 23:45:49, Tomas Vondra <t...@fuzzy.cz> wrote:
On 1.5.2014 23:19, Andreas Joseph Krogh wrote:
  On Thursday, May 1, 2014 at 23:02:13, Pavel Stehule wrote:

      [snip - the earlier exchange about the partial index and the
      unread-messages cache, quoted in full earlier in this thread]
 
      
      What you need is a JOIN index, that is not possible in Postgres.
      
      I afraid so some ugly solutions is necessary (when you require
      extra fast access). You need a index (small index) and it require
      some existing set - you cannot do index on the difference two sets.
      
      I expect so some flag on the relation message - like it should
      not be not read can helps little bit - and can be used in partial
      index as conditions. Other possibility is some variant of
      partitioning - you can divide a messages and users to distinct sets
      and then you decrease a number of possible combinations.
 
  
  Just curious:
  Is such a JOIN index possible in other DBs, if so - which?
  Can other DBs do an index on the difference between two sets?
  Will PG ever have this - is it even possible?

 I'm not aware of such a database, but maybe it's possible at least for
 some cases. I'd expect that to depend significantly on the schema.
 And I'm not aware of any such effort in the case of PostgreSQL, so don't
 hold your breath.

 IMHO the problem with your schema is that while each 'read' message has
 a matching row in message_property, 'unread' messages may or may not
 have a matching row. Is there a particular reason for that?

Yes. The point is that maintaining a message_property pair for all messages for all users in the system imposes quite a maintenance headache. As the schema is now, any new message is by definition un-read, and when a user reads it, it gets an entry with is_read=true in message_property. This table holds other properties too. This way I avoid so much book-keeping when a new message arrives and when a new user is created. A message in my system does not necessarily have only one recipient; it might have one, many or none, and might be visible to many.

 If you could get rid of this, i.e. require that each (message,
 recipient) pair has a row in message_property (irrespective of whether
 the message was read or not), you could do this:

 CREATE INDEX message_unread_idx
     ON message_property (message_id, person_id) WHERE (NOT is_read);

 and then simply write the query like this:

 SELECT
     m.id,
     prop.person_id,
     prop.is_read,
     m.subject
 FROM message m JOIN message_property prop ON (m.id = prop.message_id)
 WHERE (NOT is_read) AND person_id = :pid;
 and I'd expect this to use the partial index and be much faster.

I'm aware of the performance gain of using such a plain JOIN query.

Thanks for your feedback.

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com
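
A sketch, not from the thread, of the one-time backfill Tomas's requirement implies - one row per (message, recipient) pair. The "recipient" table is hypothetical, since the thread never shows how visibility is modelled, and it assumes message_property's other columns are nullable or have defaults:

-- Backfill message_property so every (message, recipient) pair has a row,
-- defaulting to unread; skip pairs that already have one.
INSERT INTO message_property (message_id, person_id, is_read)
SELECT r.message_id, r.person_id, false
FROM recipient r
WHERE NOT EXISTS (
    SELECT 1
    FROM message_property p
    WHERE p.message_id = r.message_id
      AND p.person_id  = r.person_id
);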

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
On Friday, May 2, 2014 at 00:34:34, Tomas Vondra <t...@fuzzy.cz> wrote:
On 1.5.2014 23:58, Andreas Joseph Krogh wrote:

      [snip - the exchange about JOIN indexes and why unread messages have
      no message_property row, quoted in full earlier in this thread]

 So how do you determine who the recipient of a message is? Or is it the
 case that everyone can read everything (which is why you're showing them
 the unread messages, right)?

A message might have a recipient and might be read by others.

 I understand you're trying to solve this without storing a row for each
 possible message-person combination, but won't this eventually happen
 anyway (with is_read=true for all rows)?

I will end up with that only if all users read all messages, which is not nearly the case.

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
On Friday, May 2, 2014 at 00:55:25, David G Johnston <david.g.johns...@gmail.com> wrote:
Andreas Joseph Krogh wrote:
  I will end up with that only if all users read all messages, which is
  not nearly the case.

 These observations probably won't help, but...

 You have what amounts to a mathematical sparse-matrix problem on your
 hands...

 Is there any way to expire messages so that dimension does not grow
 unbounded?

No, unfortunately...

 Per-user caching does seem to be something that is going to be needed...

 Depending on how many users are being tracked, would storing the
 reader_id in an indexed array improve matters?

 SELECT ... FROM message WHERE NOT (1 = ANY(reader_ids));
 UPDATE message SET reader_ids = reader_ids || 1 WHERE id = ...;

 I'm not that familiar with how well indexes over arrays work or which
 kind is needed (i.e. gin/gist).

is_read is one of many properties being tracked for a message...

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com
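
Not from the thread: for membership tests on an integer array, a GIN index (built-in array_ops) is the usual choice. A sketch of how David's idea could look, with the caveat that the negated "unread" test cannot use the index directly:

-- Track readers in an int[] column, indexed with GIN.
ALTER TABLE message ADD COLUMN reader_ids integer[] NOT NULL DEFAULT '{}';
CREATE INDEX message_reader_ids_idx ON message USING gin (reader_ids);

-- The GIN index supports containment, e.g. "person 1 has read it":
SELECT id FROM message WHERE reader_ids @> ARRAY[1];

-- But "unread" is a negation, which the index cannot serve; this scans:
SELECT id FROM message WHERE NOT (reader_ids @> ARRAY[1]);

-- Mark message 123 as read by person 1 ('||' appends without dedup):
UPDATE message SET reader_ids = reader_ids || 1 WHERE id = 123;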

Re: [PERFORM] Optimize query for listing un-read messages

2014-05-01 Thread Andreas Joseph Krogh
On Friday, May 2, 2014 at 01:58:04, David G Johnston <david.g.johns...@gmail.com> wrote:

  [snip - David's array suggestion and Andreas's reply ("is_read is one of
  many properties being tracked for a message..."), quoted in full above]

 But you don't have to have all of them on the same table. Once you've
 identified the messages in question, performing a standard join onto a
 supplemental detail table should be fairly straightforward.

 Do these other properties have values when is_read is false, or only
 when is_read is true? Since you already allow for the possibility of a
 missing record (giving it the meaning "not read"), these other
 properties cannot currently exist in that situation.

A message might hold a property (e.g. is_important) when is_read is FALSE (it might be set back to is_read=FALSE after being read the first time).

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andr...@visena.com
www.visena.com
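
Not from the thread - a sketch combining David's supplemental-table idea with Tomas's partial index, under the assumption that the read flag gets its own narrow table while other per-person properties (such as the is_important flag used as an example above) stay in message_property:

-- Narrow table holding only the read flag, so the partial index stays small.
CREATE TABLE message_read (
    message_id integer NOT NULL,
    person_id  integer NOT NULL,
    is_read    boolean NOT NULL DEFAULT false,
    PRIMARY KEY (message_id, person_id)
);

CREATE INDEX message_read_unread_idx
    ON message_read (person_id, message_id) WHERE (NOT is_read);

-- Find person 1's unread messages via the partial index, then join the
-- remaining per-person properties only for those rows:
SELECT m.id, m.subject, p.is_important
FROM message_read r
JOIN message m ON m.id = r.message_id
LEFT JOIN message_property p
       ON p.message_id = r.message_id AND p.person_id = r.person_id
WHERE r.person_id = 1 AND NOT r.is_read;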