Re: [PERFORM] PostgreSQL seems to create inefficient plans in simple conditional joins

2016-02-08 Thread Hedayat Vatankhah

Hi again,

/*Hedayat Vatankhah*/ wrote on Sun, 31 Jan 2016 01:20:53 +0330:

Hi,

/*David Rowley*/ wrote on Sun, 31 Jan 2016 04:57:04 +1300:

On 31 January 2016 at 01:30, Hedayat Vatankhah wrote:

Personally, I expect both queries below to perform exactly the same:

SELECT
 t1.id, *
FROM
 t1
INNER JOIN
 t2 ON t1.id = t2.id
 WHERE t1.id > -9223372036513411363;

And:

SELECT
 t1.id, *
FROM
 t1
INNER JOIN
 t2 ON t1.id = t2.id
 WHERE t1.id > -9223372036513411363 AND t2.id >
-9223372036513411363;

Unfortunately, they do not. PostgreSQL creates different plans for these
queries, which results in very poor performance for the first one compared
to the second (what I'm testing against is a DB with around 350 million
rows in t1, and slightly fewer in t2).

EXPLAIN output:
First query: http://explain.depesz.com/s/uauk
Second query: http://explain.depesz.com/s/uQd

Yes, unfortunately you've done about the only thing that you can do,
and that's just include both conditions in the query. Is there some
special reason why you can't just write the t2.id > ... condition in
the query too? Or is the query generated dynamically by some software
that you have no control over?


I just found another issue with a query like the second one (using
LEFT JOINs instead of INNER JOINs): referencing the id columns of the joined
tables explicitly disables PostgreSQL's join removal optimization, even when
you only select columns from t1! :(
So I should forget about creating views on top of JOIN queries, and instead
build each JOIN query manually with only the referenced tables and the
appropriate conditions, which means the whole data model must be exposed to
the application.


If I'm not wrong, PostgreSQL should understand that a condition on t2
doesn't change the LEFT JOIN output when no t2 columns are SELECTed.
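
(For reference, a minimal sketch of the join-removal rule in question, with
hypothetical tables: a LEFT JOIN against a table that has a unique index on
the join column can only be removed when no columns of that table are
referenced anywhere else in the query.)

    CREATE TABLE t1 (id bigint PRIMARY KEY);
    CREATE TABLE t2 (id bigint PRIMARY KEY);

    -- No t2 columns are referenced and t2.id is unique, so the planner
    -- removes the LEFT JOIN entirely.
    EXPLAIN SELECT t1.id
    FROM t1 LEFT JOIN t2 ON t1.id = t2.id
    WHERE t1.id > 1000;

    -- Mentioning t2.id in WHERE counts as a reference to t2 (and filters
    -- out rows where t2.id IS NULL), so the join is kept.
    EXPLAIN SELECT t1.id
    FROM t1 LEFT JOIN t2 ON t1.id = t2.id
    WHERE t1.id > 1000 AND t2.id > 1000;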


Regards,
Hedayat





[PERFORM] Primary key index suddenly became very slow

2016-02-08 Thread Gustav Karlsson
Hi,

Question:

What may cause a primary key index to suddenly become very slow? An index
scan for a single row was taking 2-3 seconds. A manual vacuum resolved the problem.


Background:

We have a simple table ‘KONTO’ with about 600k rows.


 Column |  Type  | Modifiers
--------+--------+-----------
 id     | bigint | not null
...

Indexes:
"konto_pk" PRIMARY KEY, btree (id)
...


Over the weekend we experienced that lookups using the primary key index
(‘konto_pk’) became very slow, in the region of 2-3 seconds for fetching a single record:

QUERY PLAN
Index Scan using konto_pk on konto  (cost=0.42..6.44 rows=1 width=164) (actual 
time=0.052..2094.549 rows=1 loops=1)
  Index Cond: (id = 2121172829)
Planning time: 0.376 ms
Execution time: 2094.585 ms


After a manual VACUUM, the execution time is OK:

QUERY PLAN
Index Scan using konto_pk on konto  (cost=0.42..6.44 rows=1 width=164) (actual 
time=0.037..2.876 rows=1 loops=1)
  Index Cond: (id = 2121172829)
Planning time: 0.793 ms
Execution time: 2.971 ms


So things are working OK again, but we would like to know what can cause such a
degradation of the index scan, so we can avoid it happening again. (We are
using PostgreSQL 9.4.4.)



Regards,
Gustav


Re: [PERFORM] Primary key index suddenly became very slow

2016-02-08 Thread Gustav Karlsson
Additional information:

The problematic row has likely received many HOT updates (100k+). Could this be
a likely explanation for the high execution time?
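
(As a quick check, the standard statistics view shows whether dead row
versions from such updates have piled up — a sketch, using the table name
from the original post:)

    -- Update and dead-tuple counters, plus when (auto)vacuum last ran.
    SELECT n_tup_upd, n_tup_hot_upd, n_dead_tup,
           last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'konto';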


Regards,
Gustav






Re: [PERFORM] gin performance issue.

2016-02-08 Thread Marc Mamin


> -----Original Message-----
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Friday, 5 February 2016 16:07
 

> > http://explain.depesz.com/s/wKv7
> > Postgres Version 9.3.10 (Linux)
> > 
> > Hello,
> > this is a large daily table that only gets bulk inserts (200-400 per day)
> > with no updates.
> > After rebuilding the whole table, the Bitmap Index Scan on
> > r_20160204_ix_toprid drops to under 1 second (from 800).
> >
> > Fastupdate is at the default, but autovacuum is disabled on that
> > table, which contains 30 million rows.


> Pre-9.5, it's a pretty bad idea to disable autovacuum on a GIN index,
> because then the "pending list" only gets flushed when it exceeds
> work_mem.  (Obviously, using a large work_mem setting makes this
> worse.)
> 
>   regards, tom lane


Hello,
knowing what the problem is doesn't really help here:

- autovacuum will not run as these are insert-only tables
- according to this post, auto-analyze would also do the job:
  http://postgresql.nabble.com/Performance-problem-with-gin-index-td5867870.html
  It seems this information is missing from the docs.

  But it sadly doesn't trigger in our case either, as we have manual
  analyzes called during the data processing just following the imports.
  Manual vacuum is just too expensive here.

  Hence disabling fastupdate seems to be our only option.
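
  (A sketch of that option; the index name is taken from the plan above and
  the table name behind it is an assumption. Turning fastupdate off only
  affects new entries; whatever already sits in the pending list is flushed
  by the next vacuum.)

      -- Stop queueing new GIN entries in the pending list.
      ALTER INDEX r_20160204_ix_toprid SET (fastupdate = off);
      -- Move entries that are already pending into the main index.
      VACUUM r_20160204;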
  
  I hope this problem will help push the 9.5 upgrade up our todo list :)

  Ideally, we would then like to flush the pending list unconditionally
  after the imports.
  I guess we could achieve something approximating that by modifying the
  analyze scale factor and gin_pending_list_limit before/after the (bulk)
  imports, but having the possibility to flush it via SQL would be better.
  Is this a reasonable feature wish?
  
  And a last question: how does the index update work with bulk (COPY)
  inserts?
  Without a pending list: is it like a per-row trigger, or will the index
  be taken care of afterwards?
  With small pending lists: is there a concurrency problem, or can both
  tasks cleanly work in parallel?
  
  best regards,
  
  Marc Mamin
  
  




Re: [PERFORM] bad COPY performance with NOTIFY in a trigger

2016-02-08 Thread Merlin Moncure
On Sat, Feb 6, 2016 at 6:03 AM, Filip Rembiałkowski wrote:
> Thanks for the feedback.
>
> This patch is my first and obvious approach.
>
> @Merlin, I'm not sure if I get your idea:
> - keep the previous behaviour as obligatory? (which is: automatic
> de-duplication of incoming messages by channel+payload),
> - instead of a trivial search (linear browsing) use some kind of
> faster lookup?
>
> I'm not sure if this statement in async.c is carved in stone:
>
> * Duplicate notifications from the same transaction are sent out as one
> * notification only. This is done to save work when for example a trigger
> * on a 2 million row table fires a notification for each row that has been
> * changed. If the application needs to receive every single notification
> * that has been sent, it can easily add some unique string into the extra
> * payload parameter.
>
> 1) "work-saving" is disputable in some cases
>
> 2) the idea to "add some unique string" is OK logic-wise but it's not
> OK performance-wise.
>
> Current search code is a sequential search:
> https://github.com/filiprem/postgres/blob/master/src/backend/commands/async.c#L2139
>
> I'm not that smart to devise an algorithm for faster lookups -
> probably you guys can give some advice.
>
> Again, my rationale is... This feature can burn a lot of CPU for
> nothing. I was hoping to use NOTIFY/LISTEN as a superfast notification
> mechanism. Superfast regardless of whether you insert 100, 10k or 1m
> rows.

Sure, I get it -- you want to have fast notification events -- this is
a good thing to want to have.  However, a GUC is probably not the best
way to do that in this particular case.  It's way too fringey, and the
bar for a behavior-controlling GUC is incredibly high (short version:
most modern introductions were to manage security issues).  I'm far
from the last word on this though, but it's better to get this all
sorted out now.

Anyways, it should be possible to micro-optimize that path.  Perhaps
using a hash table?  I'm not sure.

Another possible way to work things out here is to expose your switch
in the syntax of the command itself, or perhaps via the pg_notify
function to avoid syntax issues.
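
(For reference, one workaround not discussed above is to avoid the per-row
NOTIFY altogether and fire a single statement-level trigger per COPY or
INSERT — a sketch, with hypothetical table, channel and function names:)

    -- One NOTIFY per statement instead of one per row: an AFTER ... FOR
    -- EACH STATEMENT trigger sends a single notification no matter how
    -- many rows the COPY or INSERT touched.
    CREATE OR REPLACE FUNCTION notify_batch() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('table_changed', TG_TABLE_NAME);
        RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER my_table_notify
        AFTER INSERT ON my_table
        FOR EACH STATEMENT
        EXECUTE PROCEDURE notify_batch();

The trade-off is coarser information: listeners learn that the table
changed, not which rows.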

merlin




Re: [PERFORM] Bitmap and-ing between btree and gin?

2016-02-08 Thread Jeff Janes
On Thu, Feb 4, 2016 at 9:19 AM, Jordi wrote:

The custom here is to respond in line, not to top-post.  Thanks.

>
> So basically you're saying it's hard to do sorting in any way when a gin
> index is involved? Neither with a complete multi-column btree_gin index,
> because it doesn't support sorting by definition, nor with separate gin
> and btree indexes, because there would be an extra post-sorting step involved
> over the FULL resultset (because of the LIMIT).

In principle there is no reason (that I can think of) that a normal
btree index range scan couldn't accept a bitmap as an optional input,
and then use that as a filter which would allow it to walk the index
in order while throwing out tuples that can't match the other
conditions.  You are not the first person who would benefit from such
a feature.  But it would certainly not be trivial to implement.  It is
not on anyone's to-do list as far as I know.

From your earlier email:

> BUT: when I remove the ORDER BY statement, the query runs really fast. It 
> uses the 2 indexes separately and bitmap-ands them together, resulting in a 
> fast executing query.

When you removed the ORDER BY, did you also remove the LIMIT?  If you
removed the ORDER BY and kept the LIMIT, that is pretty much a
meaningless comparison.  You are asking a much easier question at that
point.

> Then would you have any hint on how to implement pagination when doing full
> text search?
> Because in theory, if I gave it id > 100 LIMIT 100, it might just as well
> return me results 150 to 250, instead of 100 to 200...

Can you use a method that maintains state (cursor with fetching, or
temporary storage) so that it doesn't have to recalculate the query
for each page?
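
(A sketch of the cursor variant — the query and names are hypothetical:
the expensive full-text query is computed once, and each page is just a
FETCH against the open cursor.)

    BEGIN;
    DECLARE search_cur CURSOR FOR
        SELECT id, title
        FROM docs
        WHERE tsv @@ to_tsquery('english', 'foo & bar')
        ORDER BY id;
    FETCH 100 FROM search_cur;   -- page 1
    FETCH 100 FROM search_cur;   -- page 2
    COMMIT;                      -- closes the cursor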

Cheers,

Jeff




Re: [PERFORM] gin performance issue.

2016-02-08 Thread Jeff Janes
On Mon, Feb 8, 2016 at 2:21 AM, Marc Mamin wrote:
>
> - autovacuum will not run as these are insert-only tables
> - according to this post, auto-analyze would also do the job:
>   http://postgresql.nabble.com/Performance-problem-with-gin-index-td5867870.html
>   It seems this information is missing from the docs.
>
>   But it sadly doesn't trigger in our case either, as we have manual
>   analyzes called during the data processing just following the imports.
>   Manual vacuum is just too expensive here.
>
>   Hence disabling fastupdate seems to be our only option.

Does disabling fastupdate cause problems?  I always start with
fastupdate disabled, and only turn it on if I have a demonstrable
problem with it being off.

I would think "off" is likely to be better for you.  You say each
distinct key only appears in 2.7 rows.  So you won't get much benefit
from aggregating together all the new rows for each key before
updating the index for that key, as there is very little to aggregate.

Also, you say the inserts come in bulk.  It is generally a good thing
to slow down bulk operations by making them clean up their own messes,
for the sake of everyone else.


>   I hope this problem will help push the 9.5 upgrade up our todo list :)
>
>   Ideally, we would then like to flush the pending list unconditionally
>   after the imports.
>   I guess we could achieve something approximating that by modifying the
>   analyze scale factor and gin_pending_list_limit before/after the (bulk)
>   imports, but having the possibility to flush it via SQL would be better.
>   Is this a reasonable feature wish?

That feature has already been committed for the 9.6 branch.
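
(On 9.6 that looks roughly like the sketch below — gin_clean_pending_list()
moves a GIN index's pending entries into the main structure on demand; the
index name is from the earlier post.)

    -- Flush the pending list; returns the number of pages cleaned.
    SELECT gin_clean_pending_list('r_20160204_ix_toprid'::regclass);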

>   And a last question: how does the index update work with bulk (COPY)
> inserts?
>   Without a pending list: is it like a per-row trigger, or will the index
> be taken care of afterwards?

Done for each row.

>   With small pending lists: is there a concurrency problem, or can both
> tasks cleanly work in parallel?

I don't understand the question.  What are the two tasks you are
referring to? Do you have multiple COPY running at the same time in
different processes?

Cheers,

Jeff




Re: [PERFORM] bad COPY performance with NOTIFY in a trigger

2016-02-08 Thread Merlin Moncure

Whoops, I just noticed this thread moved to -hackers -- so please respond there.

merlin

