Re: Setting effective_io_concurrency in VM?

2017-11-27 Thread Jeff Janes
On Mon, Nov 27, 2017 at 10:40 AM, Scott Marlowe 
wrote:

>
> Generally VMs are never going to be as fast as running on bare metal
> etc. You can adjust it and test it with something simple like pgbench
> with various settings for -c (concurrency) and see where it peaks etc
> with the setting. This will at least get you into the ball park.
>

None of the built-in workloads for pgbench cares a whit about
effective_io_concurrency.  He would have to come up with some custom
transactions to exercise that feature.  (Or use the tools people use to run
the TPC-H benchmark, rather than pgbench's built-in transactions.)

I think the best overall advice would be to configure it the same as you
would if it were not a VM.  There may be cases where you diverge from that,
but I think each one would require extensive investigation and
experimentation, so can't be turned into a rule of thumb.

Cheers,

Jeff


Re: insert and query performance on big string table with pg_trgm

2017-11-24 Thread Jeff Janes
On Nov 21, 2017 00:05, "Matthew Hall"  wrote:


> Are all indexes present at the time you insert?  It will probably be much
faster to insert without the gin index (at least) and build it after the
load.

There is some flexibility on the initial load, but the updates in the
future will require the de-duplication capability. I'm willing to accept
that might be somewhat slower on the load process, to get the accurate
updates, provided we could try meeting the read-side goal I wrote about, or
at least figure out why it's impossible, so I can understand what I need to
fix to make it possible.


As long as you don't let anyone use the table between the initial load and
when the index build finishes, you don't have to compromise on
correctness.  But yeah, makes sense to worry about query speed first.






> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds
as could be expected for such a torture test query, but it's still WAY
faster than the first such query. If you change it out to a different
expression, it's longer again of course. There does seem to be a
low-to-medium correlation between the number of rows found and the query
completion time.


To make this quick, you will need to get most of the table and most of the
index cached into RAM.  A good way to do that is with pg_prewarm.  Of
course that only works if you have enough RAM in the first place.

What is the size of the table and the gin index?
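For reference, something along these lines would report the sizes and pull
both into cache (a rough sketch; the table and index names here are
placeholders for your actual ones, and the pg_prewarm extension must be
installed):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_size_pretty(pg_table_size('my_big_table'));
SELECT pg_size_pretty(pg_relation_size('my_big_table_trgm_idx'));
SELECT pg_prewarm('my_big_table');
SELECT pg_prewarm('my_big_table_trgm_idx');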


Cheers,

Jeff


Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Jeff Janes
On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby <pry...@telsasoft.com> wrote:

> On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> > I think the non-extended stats code also has trouble with booleans.
> > pg_stats gives me a correlation  of 0.8 or higher for the flag column.
>
> It's not due to the boolean though; you see the same thing if you do:
> CREATE INDEX aaa_f ON aaa((flag::text));
> ANALYZE aaa;
> correlation | 0.81193
>
> or:
> ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
> correlation | 0.81193
>
> I think it's caused by having so few (2) values to correlate.
>
> most_common_vals   | {f,t}
> most_common_freqs  | {0.9014,0.0986}
> correlation| 0.822792
>
> It thinks there's somewhat-high correlation since it gets a list of x and y
> values (integer positions by logical and physical sort order) and 90% of
> the x list (logical value) are the same value ('t'), and the CTIDs are in
> order on the new index, so 90% of the values are 100% correlated.
>

But there is no index involved (except in the case of the functional
index).  The correlation of table columns to physical order of the table
doesn't depend on the existence of an index, or the physical order within
an index.

But I do see that ties within the logical order of the column values are
broken to agree with the physical order.  That is wrong, right?  Is there
any argument that this is desirable?

It looks like it could be fixed with a few extra double calcs per distinct
value.  Considering we already sorted the sample values using SQL-callable
collation dependent comparators, I doubt a few C-level double calcs is
going to be meaningful.

Cheers,

Jeff


Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Jeff Janes
On Dec 3, 2017 15:31, "Tom Lane" <t...@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.ja...@gmail.com> writes:
> On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby <pry...@telsasoft.com>
wrote:
>> It thinks there's somewhat-high correlation since it gets a list of x
>> and y values (integer positions by logical and physical sort order) and
>> 90% of the x list (logical value) are the same value ('t'), and the
>> CTIDs are in order on the new index, so 90% of the values are 100%
>> correlated.

> But there is no index involved (except in the case of the functional
> index).  The correlation of table columns to physical order of the table
> doesn't depend on the existence of an index, or the physical order within
> an index.

> But I do see that ties within the logical order of the column values are
> broken to agree with the physical order.  That is wrong, right?  Is there
> any argument that this is desirable?

Uh ... what do you propose doing instead?  We'd have to do something with
ties, and it's not so obvious this way is wrong.


Let them be tied.  If there are 10 distinct values, number the values 0 to
9, and all rows of a given distinct value get the same number on the
logical-order axis.

Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
me.  Although if we switched btree to store duplicate values with tid as a
tie breaker, then maybe it wouldn't be as obviously wrong.

Cheers,

Jeff


Re: Bitmap scan is undercosted?

2017-12-02 Thread Jeff Janes
On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:

>
>
> seq_page_cost = 0.0
> random_page_cost = 0.0
> explain analyze select * from aaa where num = 2 and flag = true;
>
> Bitmap Heap Scan on aaa  (cost=753.00..2003.00 rows=10257 width=5) (actual
> time=82.212..82.212 rows=0 loops=1)
>   ->  Bitmap Index Scan on i1  (cost=0.00..750.43 rows=10 width=0)
> (actual time=17.401..17.401 rows=10 loops=1)
>
> Index Scan using i1 on aaa  (cost=0.44..1750.43 rows=10257 width=5)
> (actual time=49.766..49.766 rows=0 loops=1)
>
> The bitmap plan was reduced to use only one bitmap scan, and finally it
> costs more than the index plan.
>

Right, so there is a CPU costing problem (which could only be fixed by
hacking PostgreSQL and recompiling it), but it is a much smaller problem
than the IO cost being inaccurate due to the high hit rate.  Fixing the
CPU costing problem is unlikely to make a difference to your real query.
If you set the page costs to zero, what happens to your real query?


> But I doubt that the settings seq_page_cost = random_page_cost = 0.0
> should actually be used.
>

Why not?  If your production server really has everything in memory during
normal operation, that is the correct course of action.  If you ever
restart the server, then you could have some unpleasant time getting it
back up to speed again, but pg_prewarm could help with that.


> Probably it should be instead something like 1.0/1.0 or 1.0/1.1, but other
> costs increased, to have more weight.
>

This doesn't make any sense to me.  Halving the page costs is
mathematically the same as doubling all the other cost constants.  But the
first way says plainly what you are doing, while the second way obfuscates
it.


>
> # x4 tuple/operator costs - bitmap scan still a bit cheaper
> set seq_page_cost = 1.0;
> set random_page_cost = 1.0;
> set cpu_tuple_cost = 0.04;
> set cpu_index_tuple_cost = 0.02;
> set cpu_operator_cost = 0.01;
>

If you really want to target the plan with the BitmapAnd, you should
increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase
cpu_tuple_cost.  That is because the unselective bitmap index scan does
not incur any cpu_tuple_cost, but does incur index_tuple and operator
costs.  Unfortunately all other index scans in the system will also be
skewed by such a change if you make the change system-wide.

Incidentally, the "actual rows" field of BitmapAnd is always zero.  That
field is not implemented for that node type.


Why do you have an index on flag in the first place?  What does the index
accomplish, other than enticing the planner into bad plans?  I don't know
how this translates back into your real query, but dropping that index
should be considered.  Or replace both indexes with one on (num,flag).
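For example, a minimal sketch of that combined index (the index name here is
just illustrative):

CREATE INDEX aaa_num_flag_idx ON aaa (num, flag);
-- then drop the separate indexes on num and on flag that it replaces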

Or you can re-write the part of the WHERE clause in a way that it can't use
an index, something like:

and flag::text ='t'

Cheers,

Jeff


Re: Sort is generating rows

2018-05-31 Thread Jeff Janes
On Thu, May 31, 2018 at 7:22 AM, Nicolas Seinlet 
wrote:

> Hi,
>
> I have a query with a strange query plan.
>
> This query is roughly searching for sales, and convert them with a
> currency rate. As currency rate changes from time to time, table contains
> the currency, the company, the rate, the start date of availability of this
> rate and the end date of availability.
>
> The join is done using :
> left join currency_rate cr on (cr.currency_id = pp.currency_id and
>   cr.company_id = s.company_id and
>   cr.date_start <= coalesce(s.date_order, now()) and
>  (cr.date_end is null or cr.date_end > coalesce(s.date_order, now())))
>
> The tricky part is the date range on the currency rate, which is not an
> equality.
>
> the query plan shows:
> ->  Sort  (cost=120.13..124.22 rows=1637 width=56) (actual
> time=14.300..72084.758 rows=308054684 loops=1)
>   Sort Key: cr.currency_id, cr.company_id
>   Sort Method: quicksort  Memory: 172kB
>   ->  CTE Scan on currency_rate cr
> (cost=0.00..32.74 rows=1637 width=56) (actual time=1.403..13.610 rows=1576
> loops=1)
>
> There's 2 challenging things :
> - planner estimates 1637 rows, and get 300 million lines
> - sorting is generating lines
>

These are both explained by the same thing.  The sort is feeding into a
merge join.  For every row in the other node which has the same value of
the scan keys, the entire section of this sort with those same keys gets
scanned again.  The repeated scanning gets counted in the actual row count,
but isn't counted in the expected row count, nor in the actual row count of
the thing feeding into the sort (the CTE).


>
> For now, the more currency rates there are, the slower the query. There are
> not that many currency rates (1k in this case), as you can only have one
> rate per day per currency.
>

If it is only per currency per day, then why is company_id present? In any
case, you might be better off listing the rates per day, rather than as a
range, and then doing an equality join.
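A rough sketch of that idea, assuming hypothetical names based on the columns
shown above (currency_id, company_id, rate, date_start, date_end), and
glossing over the exclusive upper bound of each range:

-- expand each rate's validity range into one row per day
CREATE TABLE currency_rate_daily AS
SELECT cr.currency_id, cr.company_id, d::date AS rate_date, cr.rate
FROM currency_rate cr,
     generate_series(cr.date_start, coalesce(cr.date_end, now()),
                     interval '1 day') AS d;

-- then the join becomes a plain equality join
left join currency_rate_daily crd
  on crd.currency_id = pp.currency_id
 and crd.company_id = s.company_id
 and crd.rate_date = coalesce(s.date_order, now())::date

Whether that pays off depends on how long the ranges are, since the daily
table is correspondingly larger.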

Cheers,

Jeff


Re: Batch insert heavily affecting query performance.

2017-12-27 Thread Jeff Janes
On Sun, Dec 24, 2017 at 11:51 AM, Jean Baro  wrote:

> Hi there,
>
> We are testing a new application to try to find performance issues.
>
> AWS RDS m4.large 500GB storage (SSD)
>

Is that general purpose SSD, or provisioned IOPS SSD?  If provisioned, what
is the level of provisioning?

Cheers,

Jeff


Re: primary key hash index

2018-01-04 Thread Jeff Janes
On Tue, Jan 2, 2018 at 6:02 AM, Rick Otten  wrote:

> After reading this article about keys in relational databases, highlighted
> on hacker news this morning:
> https://begriffs.com/posts/2018-01-01-sql-keys-in-depth.html
>
> I keep pondering the performance chart, regarding uuid insert, shown
> towards the bottom of the article.  I believe he was doing that test with
> PostgreSQL.
>
> My understanding is that the performance is degrading because he has a
> btree primary key index.  Is it possible to try a hash index or some other
> index type for a uuid primary key that would mitigate the performance issue
> he is recording?
>

Hash indexes do not yet support primary keys, but you could always test it
with just a plain index, since you already know the keys are unique via
the way they are constructed.  But I wouldn't expect any real improvement.
Hash indexes still trigger full-page writes (FPW) and still dirty massive
numbers of pages in a random fashion (even worse than btree as far as
randomness goes, though since the hash is more compact maybe more of the
pages will be re-dirtied and so save on FPW or separate writes).  I was
surprised that turning off FPW was so effective for him; that suggests that
maybe his checkpoints are too close together, which I guess means
max_wal_size is too low.
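A minimal sketch of that "plain hash index" test (the table and column names
here are made up for illustration):

CREATE TABLE t (id uuid NOT NULL, payload text);
CREATE INDEX t_id_hash ON t USING hash (id);  -- plain index, not a primary key
-- versus the conventional approach:
-- ALTER TABLE t ADD PRIMARY KEY (id);        -- builds a btree index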

Cheers,

Jeff


Re: Query is slow when run for first time; subsequent execution is fast

2018-01-10 Thread Jeff Janes
On Wed, Jan 10, 2018 at 3:59 AM, Nandakumar M  wrote:

>
> I am not using prepared statements. Postgres documentation and previous
> questions in the pgsql-performance mailing list mention that the query plan
> is cached only when prepared statements are used.
>
> https://www.postgresql.org/message-id/15600.1346885470%40sss.pgh.pa.us
>
> In the above thread Tom Lane mentions that the plan is never cached for
> raw queries. Yet, this is exactly what seems to be happening in my case. Am
> I missing something?
>

The query plan itself is not cached, but all the metadata about the (large
number of) tables used in the query is cached.  Apparently reading/parsing
that data is the slow step, not coming up with the actual plan.

> Please let me know how I can make sure the query execution for the first
> time is fast too.

Don't keep closing and reopening connections.  Use a connection pooler
(pgbouncer, pgpool, whatever pooler is built into your
language/library/driver, etc.) if necessary to accomplish this.

Cheers,

Jeff


Re: Query is slow when run for first time; subsequent execution is fast

2018-01-16 Thread Jeff Janes
On Fri, Jan 12, 2018 at 12:03 AM, Nandakumar M  wrote:

> Hello Jeff,
>
> Thanks for the insights.
>
> >Don't keep closing and reopening connections.
>
> Even if I close a connection and open a new one and execute the same
> query, the planning time is considerably less than the first time. Only
> when I restart the Postgres server then I face high planning time again.
>

Oh.  I've not seen that before.  But then again I don't often restart my
server and then immediately run very large queries with a stringent time
deadline.

You can try pg_prewarm on the pg_statistic table and its index.  But I'd
probably just put an entry in my db startup script to run this query
immediately after starting the server, and let the query warm the cache
itself.
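Something along these lines, for example (assuming the pg_prewarm extension
is installed; the index name is what recent versions call the index on
pg_statistic -- check \d pg_statistic if yours differs):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('pg_catalog.pg_statistic');
SELECT pg_prewarm('pg_catalog.pg_statistic_relid_att_inh_index');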

Why do you restart your database often enough for this to be an issue?

Cheers,

Jeff


Re: effective_io_concurrency on EBS/gp2

2018-01-31 Thread Jeff Janes
On Wed, Jan 31, 2018 at 4:03 AM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:

>
> The results look really confusing to me in two ways. The first one is that
> I've seen recommendations to set effective_io_concurrency=256 (or more) on
> EBS.


I would not expect this to make much of a difference on a table which is
perfectly correlated with the index.  You would have to create an accounts
table which is randomly ordered to have a meaningful benchmark of the eic
parameter.

I don't know why the default for eic is 1.  It seems like that just turns
on the eic mechanism, without any hope of benefiting from it.

Cheers,

Jeff


Re: Performance difference in accessing differrent columns in a Postgres Table

2018-08-01 Thread Jeff Janes
On Mon, Jul 30, 2018 at 3:02 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-07-30 13:31:33 -0400, Jeff Janes wrote:
> > I don't know where the time is going with the as-committed JIT.  None of
> > the JIT-specific timings reported by EXPLAIN (ANALYZE) add up to anything
> > close to the slow-down I'm seeing.  Shouldn't compiling and optimization
> > time show up there?
>
> As my timings showed, I don't see the slowdown you're reporting. Could
> you post a few EXPLAIN ANALYZEs?
>


I don't think you showed any timings where jit_above_cost < query cost <
jit_optimize_above_cost, which is where I saw the slowdown.  (That is also
where things naturally land for me using default settings.)

I've repeated my test case on a default build (./configure --with-llvm
--prefix=) and default postgresql.conf, using the post-11BETA2 commit
5a71d3e.


I've attached the full test case, and the full output.

Here are the last two executions, with jit=on and jit=off, respectively.
Doing it with TIMING OFF doesn't meaningfully change things, nor does
increasing shared_buffers beyond the default.



QUERY PLAN
--
 Seq Scan on i200c200  (cost=0.00..22.28 rows=828 width=16) (actual
time=29.317..11966.291 rows=1000 loops=1)
 Planning Time: 0.034 ms
 JIT:
   Functions: 2
   Generation Time: 1.589 ms
   Inlining: false
   Inlining Time: 0.000 ms
   Optimization: false
   Optimization Time: 9.002 ms
   Emission Time: 19.948 ms
 Execution Time: 12375.493 ms
(11 rows)

Time: 12376.281 ms (00:12.376)
SET
Time: 1.955 ms
   QUERY PLAN

 Seq Scan on i200c200  (cost=0.00..22.28 rows=828 width=16) (actual
time=0.063..3897.302 rows=1000 loops=1)
 Planning Time: 0.037 ms
 Execution Time: 4292.400 ms
(3 rows)

Time: 4293.196 ms (00:04.293)

Cheers,

Jeff


wide.sql
Description: Binary data


wide.out
Description: Binary data


Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-30 Thread Jeff Janes
On Mon, Jul 30, 2018 at 12:01 PM, Pavel Stehule 
wrote:

>
>
> 2018-07-30 13:19 GMT+02:00 Jeff Janes :
>
>> On Mon, Jul 30, 2018 at 12:11 AM, Pavel Stehule 
>> wrote:
>>
>>> 2018-07-30 1:00 GMT+02:00 Tom Lane :
>>>
>>>> David Rowley  writes:
>>>> > On 29 July 2018 at 17:38, Dinesh Kumar  wrote:
>>>> >> I found performance variance between accessing int1 and int200
>>>> column which
>>>> >> is quite large.
>>>>
>>>> > Have a look at slot_deform_tuple and heap_deform_tuple. You'll see
>>>> > that tuples are deformed starting at the first attribute. If you ask
>>>> > for attribute 200 then it must deform 1-199 first.
>>>>
>>>> Note that that can be optimized away in some cases, though evidently
>>>> not the one the OP is testing.  From memory, you need a tuple that
>>>> contains no nulls, and all the columns to the left of the target
>>>> column have to be fixed-width datatypes.  Otherwise, the offset to
>>>> the target column is uncertain, and we have to search for it.
>>>>
>>>
>>> JIT decrease a overhead of this.
>>>
>>
>> The bottleneck here is such a simple construct, I don't see how JIT could
>> improve it by much.
>>
>> And indeed, in my hands JIT makes it almost 3 times worse.
>>
>> Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
>> execution and 4594.994 ms for the JIT=off.
>>
>
> look on http://www.postgresql-archive.org/PATCH-LLVM-tuple-
> deforming-improvements-td6029385.html thread, please.
>
>
The opt1 patch did get performance back to "at least do no harm" territory,
but it didn't improve over JIT=off.  Adding the other two didn't get any
further improvement.

I don't know where the time is going with the as-committed JIT.  None of
the JIT-specific timings reported by EXPLAIN (ANALYZE) add up to anything
close to the slow-down I'm seeing.  Shouldn't compiling and optimization
time show up there?

Cheers,

Jeff


Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-30 Thread Jeff Janes
On Mon, Jul 30, 2018 at 12:11 AM, Pavel Stehule 
wrote:

> 2018-07-30 1:00 GMT+02:00 Tom Lane :
>
>> David Rowley  writes:
>> > On 29 July 2018 at 17:38, Dinesh Kumar  wrote:
>> >> I found performance variance between accessing int1 and int200 column
>> which
>> >> is quite large.
>>
>> > Have a look at slot_deform_tuple and heap_deform_tuple. You'll see
>> > that tuples are deformed starting at the first attribute. If you ask
>> > for attribute 200 then it must deform 1-199 first.
>>
>> Note that that can be optimized away in some cases, though evidently
>> not the one the OP is testing.  From memory, you need a tuple that
>> contains no nulls, and all the columns to the left of the target
>> column have to be fixed-width datatypes.  Otherwise, the offset to
>> the target column is uncertain, and we have to search for it.
>>
>
> JIT decrease a overhead of this.
>

The bottleneck here is such a simple construct, I don't see how JIT could
improve it by much.

And indeed, in my hands JIT makes it almost 3 times worse.

Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
execution and 4594.994 ms for the JIT=off.

Cheers,

Jeff
drop table if exists i200c200;
create table i200c200 ( pk bigint primary key, 
int1 bigint,
int2 bigint,
int3 bigint,
int4 bigint,
int5 bigint,
int6 bigint,
int7 bigint,
int8 bigint,
int9 bigint,
int10 bigint,
int11 bigint,
int12 bigint,
int13 bigint,
int14 bigint,
int15 bigint,
int16 bigint,
int17 bigint,
int18 bigint,
int19 bigint,
int20 bigint,
int21 bigint,
int22 bigint,
int23 bigint,
int24 bigint,
int25 bigint,
int26 bigint,
int27 bigint,
int28 bigint,
int29 bigint,
int30 bigint,
int31 bigint,
int32 bigint,
int33 bigint,
int34 bigint,
int35 bigint,
int36 bigint,
int37 bigint,
int38 bigint,
int39 bigint,
int40 bigint,
int41 bigint,
int42 bigint,
int43 bigint,
int44 bigint,
int45 bigint,
int46 bigint,
int47 bigint,
int48 bigint,
int49 bigint,
int50 bigint,
int51 bigint,
int52 bigint,
int53 bigint,
int54 bigint,
int55 bigint,
int56 bigint,
int57 bigint,
int58 bigint,
int59 bigint,
int60 bigint,
int61 bigint,
int62 bigint,
int63 bigint,
int64 bigint,
int65 bigint,
int66 bigint,
int67 bigint,
int68 bigint,
int69 bigint,
int70 bigint,
int71 bigint,
int72 bigint,
int73 bigint,
int74 bigint,
int75 bigint,
int76 bigint,
int77 bigint,
int78 bigint,
int79 bigint,
int80 bigint,
int81 bigint,
int82 bigint,
int83 bigint,
int84 bigint,
int85 bigint,
int86 bigint,
int87 bigint,
int88 bigint,
int89 bigint,
int90 bigint,
int91 bigint,
int92 bigint,
int93 bigint,
int94 bigint,
int95 bigint,
int96 bigint,
int97 bigint,
int98 bigint,
int99 bigint,
int100 bigint,
int101 bigint,
int102 bigint,
int103 bigint,
int104 bigint,
int105 bigint,
int106 bigint,
int107 bigint,
int108 bigint,
int109 bigint,
int110 bigint,
int111 bigint,
int112 bigint,
int113 bigint,
int114 bigint,
int115 bigint,
int116 bigint,
int117 bigint,
int118 bigint,
int119 bigint,
int120 bigint,
int121 bigint,
int122 bigint,
int123 bigint,
int124 bigint,
int125 bigint,
int126 bigint,
int127 bigint,
int128 bigint,
int129 bigint,
int130 bigint,
int131 bigint,
int132 bigint,
int133 bigint,
int134 bigint,
int135 bigint,
int136 bigint,
int137 bigint,
int138 bigint,
int139 bigint,
int140 bigint,
int141 bigint,
int142 bigint,
int143 bigint,
int144 bigint,
int145 bigint,
int146 bigint,
int147 bigint,
int148 bigint,
int149 bigint,
int150 bigint,
int151 bigint,
int152 bigint,
int153 bigint,
int154 bigint,
int155 bigint,
int156 bigint,
int157 bigint,
int158 bigint,
int159 bigint,
int160 bigint,
int161 bigint,
int162 bigint,
int163 bigint,
int164 bigint,
int165 bigint,
int166 bigint,
int167 bigint,
int168 bigint,
int169 bigint,
int170 bigint,
int171 bigint,
int172 bigint,
int173 bigint,
int174 bigint,
int175 bigint,
int176 bigint,
int177 bigint,
int178 bigint,
int179 bigint,
int180 bigint,
int181 bigint,
int182 bigint,
int183 bigint,
int184 bigint,
int185 bigint,
int186 bigint,
int187 bigint,
int188 bigint,
int189 bigint,
int190 bigint,
int191 bigint,
int192 bigint,
int193 bigint,
int194 bigint,
int195 bigint,
int196 bigint,
int197 bigint,
int198 bigint,
int199 bigint,
int200 bigint,
char1 varchar(255),
char2 varchar(255),
char3 varchar(255),
char4 varchar(255),
char5 varchar(255),
char6 varchar(255),
char7 varchar(255),
char8 varchar(255),
char9 varchar(255),
char10 varchar(255),
char11 varchar(255),
char12 varchar(255),
char13 varchar(255),
char14 varchar(255),
char15 varchar(255),
char16 varchar(255),
char17 varchar(255),
char18 varchar(255),
char19 varchar(255),
char20 varchar(255),
char21 varchar(255),
char22 varchar(255),
char23 varchar(255),
char24 varchar(255),
char25 varchar(255),
char26 varchar(255),
char27 varchar(255),
char28 varchar(255),
char29 varchar(255),
char30 varchar(255),
char31 varchar(255),
char32 varchar(255),
char33 varchar(255),
char34 varchar(255),
char35 varchar(255),
char36 varchar(255),
char37 varchar(255),
char38 varchar(255),
char39 varchar(255),
char40 varchar(255),
char41 varchar(255),
char42 

Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-31 Thread Jeff Janes
On Mon, Jul 30, 2018 at 1:23 PM, Andres Freund  wrote:

> On 2018-07-30 07:19:07 -0400, Jeff Janes wrote:
>
> > And indeed, in my hands JIT makes it almost 3 times worse.
>
> Not in my measurement. Your example won't use JIT at all, because it's
> below the cost threshold. So I think you might just be seeing cache +
> hint bit effects?
>

No, it is definitely JIT.  The explain plans show it, and the cost of the
query is 230,000 while the default setting of jit_above_cost is 100,000.
It is fully reproducible by repeatedly toggling the JIT setting.  It
doesn't seem to be the cost of compiling the code that slows it down (I'm
assuming the code is compiled once per tuple descriptor, not once per
tuple), but rather the efficiency of the compiled code.



>
> > Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
> > execution and 4594.994 ms for the JIT=off.
>
> Even with a debug LLVM build, which greatly increases compilation
> overhead, I actually see quite the benefit when I force JIT to be used:
>

I don't see a change when I compile without --enable-debug, and
jit_debugging_support is off (or, in 11beta2, nonexistent).  How can I
know if I have a debug LLVM build, and turn it off if I do?


>
>
> postgres[26832][1]=# ;SET jit_above_cost = -1; set jit_optimize_above_cost
> = 0; set jit_inline_above_cost = 0;
> postgres[26832][1]=# explain (analyze, buffers, timing off) select pk,
> int200 from i200c200;
>

Lowering jit_optimize_above_cost does redeem this for me.  It brings it
back to being a tie with JIT=OFF.  I don't see any further improvement by
lowering jit_inline_above_cost, and overall it is just a statistical tie
with JIT=off, not an improvement as you get, but at least it isn't a
substantial loss.

Under what conditions would I want to do jit without doing optimizations on
it?  Is there a rule of thumb that could be documented, or do we just use
the experimental method for each query?

I don't know how sensitive JIT is to hardware.  I'm using Ubuntu 16.04 on
VirtualBox (running on Windows 10) on an i5-7200U, which might be important.

I had previously done a poor-man's JIT where I created 4 versions of the
main 'for' loop in slot_deform_tuple.  I did a branch on "if(hasnulls)",
and then each branch had two loops, one for when 'slow' is false, and then
one for after 'slow' becomes true so we don't have to keep setting it true
again once it already is, in a tight loop.  I didn't see noticeable
improvement there (although perhaps I would have on different hardware), so
didn't see how JIT could help with this almost-entirely-null case.  I'm not
trying to address JIT in general, just as it applies to this particular
case.

Unrelated to JIT, and relevant to the 'select pk, int199' case but not the
'select pk, int200' case: it seems we have gone to some length to make slot
deforming efficient for incremental use, but then just deform in bulk
anyway up to the maximum attnum used in the query, at least in this case.
Is that because incremental deforming is not cache efficient?

Cheers,

Jeff


Re: Bi-modal streaming replication throughput

2018-08-14 Thread Jeff Janes
On Tue, Aug 14, 2018 at 9:18 AM, Alexis Lê-Quôc  wrote:

> each running PG 9.3 on linux


That is the oldest version which is still supported.  There have been a lot
of improvements since then, including to performance.  You should see if an
upgrade solves the problem.  If not, at least you will have access to
better tools (like pg_stat_activity.wait_event_type), and people will be
more enthusiastic about helping you figure it out knowing it is not an
already-solved problem.


>
> Here are some settings that may help and a perf profile of a recovery
> process that runs without any competing read traffic processing the INSERT
> backlog (I don't unfortunately have the same profile on a lagging read
> replica).
>

Unfortunately a perf profile from when the problem is not occurring won't be
very helpful.  You need it from when the problem is occurring.  Also, I find
strace and gdb to be more helpful than perf in this type of situation, where
you already know it is not CPU bound, although perhaps that is just my own
lack of skill with perf.  You need to know why it is not on the CPU, not
what it is doing when it is on the CPU.

Were the settings you showed all of the non-default settings?

I assume max_standby_streaming_delay is at the default value of 30s?  Are
you getting query cancellations due to conflicts with recovery, or anything
else suspicious in the log?  What is the maximum lag you see, measured in
seconds?

Cheers,

Jeff


Re: Improving Performance of Query ~ Filter by A, Sort by B

2018-07-14 Thread Jeff Janes
On Tue, Jul 10, 2018 at 11:07 AM, Lincoln Swaine-Moore <
lswainemo...@gmail.com> wrote:

>
>
>
> Something about the estimated row counts (this problem persisted after I
> tried ANALYZEing)
>

What is your default_statistics_target?  What can you tell us about the
distribution of parent_id?  (exponential, power law, etc?).  Can you show
the results for select * from pg_stats where tablename='a' and
attname='parent_id' \x\g\x  ?


> forces usage of the tmstmp index and Merge Append (which seems wise) but
> also a filter condition on parent_id over an index condition, which is
> apparently prohibitively slow.
>
> I tried creating a multicolumn index like:
>
> CREATE INDEX "a_partition1_parent_and_tmstmp_idx" on "a_partition2" USING
> btree ("parent_id", "tmstmp" DESC);
>
> But this didn't help (it wasn't used).
>

You could try reversing the order and adding a column to be (tmstmp,
parent_id, id) and keeping the table well vacuumed.  This would allow the
slow plan to still walk the indexes in tmstmp order but do it as an
index-only scan, so it could omit the extra trip to the table. That trip to
the table must be awfully slow to explain the numbers you show later in the
thread.

...


> This query plan (which is the same as when LIMIT is removed) has been a
> good short term solution when the number of "parent_id"s I'm using is still
> relatively small, but unfortunately queries grow untenably slow as the
> number of "parent_id"s involved increases:
>

What happens when you remove that extra order by phrase that you added?
The original slow plan should become much faster when the number of
parent_ids is large (it has to dig through fewer index entries before
accumulating 20 good ones), so you should try going back to that.

...


> I'd be very grateful for help with one or both of these questions:
> 1) Why is adding an unnecessary (from the perspective of result
> correctness) ORDER BY valuable for forcing the parent_id index usage, and
> does that indicate that there is something wrong with my
> table/indexes/statistics/etc.?
>

It finds the indexes on tmstmp to be falsely attractive, as it can walk in
tmstmp order and so avoid the sort. (Really not the sort itself, but the
fact that sort has to first read every row to be sorted, while walking an
index can abort once the LIMIT is satisfied).  Adding an extra phrase to
the ORDER BY means the index is no longer capable of delivering rows in the
needed order, so it no longer looks falsely attractive.  The same thing
could be obtained by doing a dummy operation, such as ORDER BY tmstmp + '0
seconds' DESC.  I prefer that method, as it is more obviously a tuning
trick.  Adding in "id" looks more like a legitimate desire to break any
ties that might occasionally occur in tmstmp.
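As a sketch, the dummy-operation version looks something like this (the
surrounding query is abbreviated, and LIMIT 20 is taken from the description
earlier in the thread):

SELECT ...
FROM a
WHERE parent_id IN (...)
ORDER BY tmstmp + '0 seconds' DESC
LIMIT 20;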

As Tom pointed out, there clearly is something wrong with your statistics,
although we don't know what is causing it to go wrong.  Fixing the
statistics isn't guaranteed to fix the problem, but it would be a good
start.




> 2) Is there any way I can improve my query time when there are many
> "parent_id"s involved? I seem to only be able to get the query plan to use
> at most one of the parent_id index and the tmstmp index at a time. Perhaps
> the correct multicolumn index would help?
>

A few things mentioned above might help.

But if they don't, is there any chance you could redesign your partitioning
so that all parent_ids queried together will always be in the same
partition?  And if not, could you just get rid of the partitioning
altogether?  1e7 rows are not all that many and don't generally need
partitioning.  Unless it is serving a specific purpose, it is probably
costing you more than you are getting.

Finally, could you rewrite it as a join to a VALUES list, rather than as an
in-list?
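For example, a sketch of that rewrite (the parent_id values here are just
illustrative):

SELECT a.*
FROM a
JOIN (VALUES (12345), (23456), (34567)) AS v(parent_id)
  ON a.parent_id = v.parent_id
ORDER BY a.tmstmp DESC
LIMIT 20;

instead of WHERE a.parent_id IN (12345, 23456, 34567).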

Cheers,

Jeff


Re: Why HDD performance is better than SSD in this case

2018-07-17 Thread Jeff Janes
On Tue, Jul 17, 2018 at 1:00 AM, Neto pr  wrote:

> Dear,
> Some of you can help me understand this.
>
> This query plan is executed in the query below (query 9 of TPC-H
> Benchmark, with scale 40, database with approximately 40 gb).
>
> The experiment consisted of running the query on a HDD (Raid zero).
> Then the same query is executed on an SSD (Raid Zero).
>
> Why did the HDD (7200 rpm)  perform better?
> HDD - TIME 9 MINUTES
> SSD - TIME 15 MINUTES
>
> As far as I know, the SSD has read speeds that are 300 times faster than the HDD.
>

Is the 300 times faster comparing random to random, or sequential to
sequential?  Maybe your SSD simply fails to perform as advertised.  This
would not surprise me at all.

To remove some confounding variables, can you turn off parallelism and
repeat the queries?  (Yes, they will probably get slower.  But are the
relative timings still the same?)  Also, turn on track_io_timing and
repeat the "EXPLAIN (ANALYZE, BUFFERS)", perhaps with TIMING OFF.

Also, see how long it takes to read the entire database, or just the
largest table, outside of postgres.

Something like:

time tar -cf - $PGDATA/base | wc -c

or

time cat $PGDATA/base//* | wc -c

Cheers,

Jeff


Re: Improving Performance of Query ~ Filter by A, Sort by B

2018-07-17 Thread Jeff Janes
On Mon, Jul 16, 2018 at 5:29 PM, Lincoln Swaine-Moore <
lswainemo...@gmail.com> wrote:

> Tom and Jeff,
>
> Thanks very much for the suggestions!
>
> Here's what I've found so far after playing around for a few more days:
>
> What is your default_statistics_target?  What can you tell us about the
>> distribution of parent_id?  (exponential, power law, etc?).  Can you show
>> the results for select * from pg_stats where tablename='a' and
>> attname='parent_id' \x\g\x  ?
>
>
> The default_statistics_target is 500, which I agree seems quite
> insufficient for these purposes. I bumped this up to 2000, and saw some
> improvement in the row count estimation, but pretty much the same query
> plans. Unfortunately the distribution of counts is not intended to be
> correlated to parent_id, which is one reason I imagine the histograms might
> not be particularly effective unless theres one bucket for every value.
> Here is the output you requested:
>
> select * from pg_stats where tablename='a' and attname='parent_id';
>
> schemaname | public
> tablename  | a
> attname| parent_id
> inherited  | t
> null_frac  | 0
> avg_width  | 4
> n_distinct | 18871
> most_common_vals   | {15503,49787,49786,24595,49784,17549, ...} (2000
> values)
> most_common_freqs  | {0.0252983,0.02435,0.0241317,
> 0.02329,0.019095,0.0103967,0.00758833,0.004245, ...} (2000 values)
>

You showed the 8 most common frequencies.  But could you also show the last
couple of them?  When your queried parent_id value is not on the MCV list,
it is the frequency of the least frequent one on the list which puts an
upper limit on how frequent the one you queried for can be.



> A few questions re: statistics:
>  1) would it be helpful to bump column statistics to, say, 20k (the number
> of distinct values of parent_id)?
>

Only one way to find out...
However you can only go up to 10k, not 20k.



>  2) is the discrepancy between the statistics on the parent and child
> table be expected? certainly I would think that the statistics would be
> different, but I would've imagined they would have histograms of the same
> size given the settings being the same.
>

Is the n_distinct estimate accurate for the partition?  There is an
algorithm (which will change in v11) to stop the MCV from filling the
entire statistics target size if it thinks adding more won't be useful.
But I don't know why the histogram boundary list would be short.  But, I
doubt that that is very important here.  The histogram is only used for
inequality/range, not for equality/set membership.



>  3) is there a way to manually specify the the distribution of rows to be
> even? that is, set the frequency of each value to be ~ n_rows/n_distinct.
> This isn't quite accurate, but is a reasonable assumption about the
> distribution, and might generate better query plans.
>


This would be going in the wrong direction.  Your queries seem to
preferentially use rare parent_ids, not typical parent_ids.  In fact, it
seems like many of your hard-coded parent_ids don't exist in the table at
all.  That certainly isn't going to help the planner any.  Could you
somehow remove those before constructing the query?

You might also take a step back, where is that list of parent_ids coming
from in the first place, and why couldn't you convert the list of literals
into a query that returns that list naturally?


> You could try reversing the order and adding a column to be (tmstmp,
>> parent_id, id) and keeping the table well vacuumed.  This would allow the
>> slow plan to still walk the indexes in tmstmp order but do it as an
>> index-only scan, so it could omit the extra trip to the table. That trip to
>> the table must be awfully slow to explain the numbers you show later in the
>> thread.
>
>
> Just to clarify, do you mean building indexes like:
> CREATE INDEX "a_tmstmp_parent_id_id_idx_[PART_KEY]" on
> "a_partition[PART_KEY]" USING btree("tmstmp", "parent_id", "id")
> That seems promising! Is the intuition here that we want the first key of
> the index to be the one we are ultimately ordering by? Sounds like I make
> have had that flipped initially. My understanding of this whole situation
> (and please do correct me if this doesn't make sense) is the big bottleneck
> here is reading pages from disk (when looking at stopped up queries, the
> wait_event is DataFileRead), and so anything that can be done to minimize
> the pages read will be valuable. Which is why I would love to get the query
> plan to use the tmstmp index without having to filter thereafter by
> parent_id.
>

Yes, that is the index.

You really want it to filter by parent_id in the index, rather than going
to the table to do the filter on parent_id.  The index pages with tmstmp as
the leading column are going to be more tightly packed with potentially
relevant rows, while the table pages are less likely to be densely packed.
So filtering in the index should touch far fewer pages.

Re: Performance difference in accessing differrent columns in a Postgres Table

2018-09-05 Thread Jeff Janes
On Wed, Sep 5, 2018 at 12:21 AM Dinesh Kumar  wrote:

> Hi All,
> I was wondering whether the case is solved or still continuing. As a
> Postgres newbie, I can't understand any of the terms (JIT, tuple
> deformation) as you mentioned above. Please anyone let me know , what is
> the current scenario.
>
>
JIT is just-in-time compilation, which will be new in v11.  Tuple
deforming is how you get the row from the on-disk format to the in-memory
format.

Some people see small improvements in tuple deforming using JIT in your
situation, some see large decreases, depending on settings and apparently
on hardware.  But regardless, JIT is not going to turn your particular
use case (many nullable and actually null columns, referencing a
high-numbered column) into a constant-time operation in the number
of preceding columns.  Maybe JIT will reduce the penalty for accessing a
high-numbered column by 30%, but it won't reduce the penalty 30-fold.  Put
your NOT NULL columns first, and the most frequently accessed NULLable
columns right after them, if you can.
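As a sketch of that column-ordering advice (all names here are invented for
illustration):

CREATE TABLE events (
    id         bigint NOT NULL,       -- fixed-width NOT NULL columns first
    created_at timestamptz NOT NULL,
    status     integer,               -- frequently accessed nullable columns next
    note       text                   -- rarely accessed, usually-null columns last
);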

Cheers,

Jeff

>


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Jeff Janes
On Fri, Sep 7, 2018 at 8:00 AM Patrick Molgaard  wrote:

> Hi folks,
>
> I've been seeing some curious behaviour on a postgres server I administer.
>
> Intermittently (one or two times a week), all queries on that host are
> simultaneously blocked for extended periods (10s of seconds).
>
> The blocked queries are trivial & not related to locking - I'm seeing
> slowlogs of the form:
>
> `LOG: duration: 22627.299 ms statement: SET client_encoding='''utf-8''';`
>
>
Do you have log_lock_waits set to on?  If not, you might want to turn it on.

Cheers,

Jeff


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Jeff Janes
On Fri, Sep 7, 2018 at 2:03 PM Patrick Molgaard  wrote:

>
> Hi Jeff,
>
> Thanks for your reply. Are locks relevant in this case, though?
>

I don't know, but why theorize when we can know for sure?  It at least
invokes VirtualXactLockTableInsert.  I don't see how that could block on a
heavyweight lock, though. But again, why theorize when logging it is simple?

Is it always the first statement in a connection which is blocking, or will
established connections also block at the same time the new ones start to
block?

Cheers,

Jeff

>


Re: Memory size

2018-03-11 Thread Jeff Janes
On Sun, Mar 11, 2018 at 10:33 AM, dangal  wrote:

> Jeff, thank you very much for your time. They are the same queries with
> the same parameters; the first run takes 3 minutes, for example, but when I
> execute it again it takes seconds, which is why I suspect the shared
> buffers. The server had 16 GB and we increased it to 24, but I really do
> not know if it should continue to increase, since they are not our own
> resources; we have to ask for them and justify them.
>

If that is the only query that you have trouble with, it might be easiest
just to set up a cron job that runs it periodically to keep that data set
in cache.  Not very elegant, but it can be effective.
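For example, a crontab entry along these lines (the schedule, database name,
and query file are all placeholders):

*/15 * * * *  psql -d mydb -f /path/to/warm_cache.sql > /dev/null 2>&1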

Cheers,

Jeff


Re: Gained %20 performance after disabling bitmapscan

2018-10-19 Thread Jeff Janes
On Fri, Oct 19, 2018 at 3:19 AM Yavuz Selim Sertoglu <
yavuzselim.serto...@medyasoft.com.tr> wrote:

> Hi all,
>
> I have a problem with my query. The query is always using a parallel bitmap
> heap scan. I've created an index with all the where conditions and id, but
> the query does not use this index and continues to use a bitmap scan. So I
> decided to disable bitmap scan for testing. And after that, things became
> strange. Cost is higher, execution time is lower.
>

A 20% difference in speed is unlikely to make or break you.  Is it even
worth worrying about?


> But I want to use index_only_scan because index have all column that query
> need. No need to access table.
>

Your table is not very well vacuumed, so there is a need to access it (9010
times to get 6115 rows, which seems like quite an anti-feat; but I don't
know which of those numbers are averaged over loops/parallel workers,
versus summed over them). Vacuuming your table will not only make the
index-only scan look faster to the planner, it will also actually be faster.

The difference in timing could easily be down to one query warming the
cache for the other.  Are these timings fully reproducible altering
execution orders back and forth?  And they have different degrees of
parallelism, what happens if you disable parallelism to simplify the
analysis?


> It is doing index_only_scan when disabling bitmap scan but I cannot
> disable bitmap scan for cluster wide. There are other queries...
> Can you help me to solve the issue?
>
>
Cranking up effective_cache_size can make index scans look better in
comparison to bitmap scans, without changing a lot of other stuff.  This
still holds even for index-only scans, in cases where the planner knows the
table to be poorly vacuumed.

But moving the column tested for inequality to the end of the index would
probably make much more of a difference, regardless of which plan it
chooses.
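As a sketch of that last point (the table, columns, and index name here are
invented, since the original query isn't quoted above): if the query filters
with equality on col_a and col_b and an inequality on col_c, then

CREATE INDEX my_table_a_b_c_idx ON my_table (col_a, col_b, col_c);

(equality columns first, the inequality column last) lets the scan descend
straight to the matching range instead of filtering many index entries.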

Cheers,

Jeff

>


Re: Gained %20 performance after disabling bitmapscan

2018-10-26 Thread Jeff Janes
On Mon, Oct 22, 2018 at 3:20 AM Yavuz Selim Sertoglu <
yavuzselim.serto...@medyasoft.com.tr> wrote:

> Thanks for the reply Jeff,
>
> I know 20ms is nothing but it shows me that there is a problem with my
> configuration. I want to find it.
>

This is a dangerous assumption.  There is no configuration you can come up
with which will cause the planner to be within 20% of perfection in all
cases.  Given the other plans you've shown and discussed, I think this is
just chasing our own tail.

Cheers,

Jeff

>


Re: High CPU Usage of "SET ROLE"

2018-10-30 Thread Jeff Janes
On Tue, Oct 30, 2018 at 3:50 PM Ulf Lohbrügge 
wrote:


> When I use the psql cli on the same database I can see via "\timing" that
> the first statement after "RESET ROLE;" is significantly slower. I was even
> able to strip it down to two statements ("SET ROLE ...;" and "RESET ROLE;"):
>
> ...
>
Maybe my observations here are already sufficient to find out what happens
> here? I guess that my setup with 1k rows in pg_roles and 1.5m rows in
> pg_class is probably the cause.
>

It would probably be enough if it were reproducible, but I can't reproduce
it.

-- set up
perl -le 'print "create user foo$_;" foreach 1..1000'|psql
perl -le 'foreach $r (1..1000) {print "create schema foo$r authorization
foo$r;"}'|psql
perl -le 'foreach $r (reverse 1..1000) {print "set role foo$r;"; print
"create table foo$r.foo$_ (x serial primary key);" foreach 1..1000;}' \
  | psql > out

-- test
perl -le 'print "set role foo$_;\nreset role;" foreach 1..1000'|psql

Does it help when I create a test setup with a docker image that contains a
> database with that many entries in pg_roles and pg_class and share it here?
>

If you have a script to create the database, I'd be more likely to play
around with that than with a docker image.  (Which I have to guess would be
quite large anyway, with 1.5 million rows in pg_class.)

Cheers,

Jeff

>


Re: trying to delete most of the table by range of date col

2018-09-03 Thread Jeff Janes
>
> 4)delete in chunks :
> do $$
> declare
> rec integer;
> begin
> select count(*) from my_table into rec where end_date <=
> to_date('12/12/2018','DD/MM/YYYY') and end_date >
> to_date('11/12/2018','DD/MM/YYYY');
> while rec > 0 loop
> DELETE FROM my_table WHERE id IN (select id from my_table where end_date <=
> to_date('12/12/2018','DD/MM/YYYY') and end_date >
> to_date('11/12/2018','DD/MM/YYYY') limit 5000);
> rec := rec - 5000;
> raise notice '5000 records were deleted, current rows :%',rec;
> end loop;
>
> end;
> $$
> ;
>
> Execution time : 6 minutes.
>
> So, it seems that the second solution is the fastest one. Is there a
> reason why the delete in chunks (solution 4) wasn't faster?
>

Why would it be faster?  The same amount of work needs to get done, no
matter how you slice it.  Unless there is a specific reason to think it
would be faster, I would expect it won't be.

If you aren't willing to drop the constraints, then I think you just need
to resign yourself to paying the price of checking those constraints. Maybe
some future version of PostgreSQL will be able to do them in parallel.

Cheers,

Jeff


Re: Optimizer choosing the wrong plan

2018-12-29 Thread Jeff Janes
On Sat, Dec 29, 2018 at 7:17 AM Jim Finnerty  wrote:


> Jeff, can you describe the changes that were made to ANALYZE in v11,
> please?
>
> I've found that running ANALYZE on v10 on the Join Order Benchmark, using
> the default statistics target of 100, produces quite unstable results, so
> I'd be interested to hear what has been improved in v11.
>

There are two paths the code can take.  One if all values which were
sampled at all were sampled at least twice, and another if the
least-sampled value was sampled exactly once.  For some distributions (like
exponential-ish or maybe power-law), it is basically a coin flip whether
the least-sampled value is seen once, or more than once.  If you are seeing
instability, it is probably for this reason.  That fundamental instability
was not addressed in v11.

Once you follow the "something seen exactly once" path, it has to decide
how many of the values get represented in the most-common-value list.  That
is where the change was.  The old method said a value had to have an
estimated prevalence at least 25% more than the average estimated
prevalence to get accepted into the list.  The problem is that if there
were a few dominant values, it wouldn't be possible for any others to be
"over-represented" because those few dominant values dragged the average
prevalence up so far nothing else could qualify.  What it was changed to
was to include a value in the most-common-value list if its
overrepresentation was statistically significant given the sample size.
The most significant change (from my perspective) is that
over-representation is measured not against all values, but only against
the values that are more rare in the sample than the one currently being
considered for inclusion into the MCV list.  The old method basically said
"all rare values are the same", while the new method realizes that a rare
value present 10,000 times in a billion-row table is much different from a
rare value present 10 times in a billion-row table.

It is possible that this change will fix the instability for you, because
it could cause the "seen exactly once" path to generate a MCV list which is
close enough in size to the "seen at least twice" path that you won't
notice the difference between them anymore.  But, it is also possible they
will still be different enough in size that it will still appear unstable.
It depends on your distribution of values.

Cheers,

Jeff


Re: Query Performance Issue

2018-12-29 Thread Jeff Janes
On Sat, Dec 29, 2018 at 1:58 AM David Rowley 
wrote:

> On Sat, 29 Dec 2018 at 04:32, Justin Pryzby  wrote:
> > I think the solution is to upgrade (at least) to PG10 and CREATE
> STATISTICS
> > (dependencies).
>
> Unfortunately, I don't think that'll help this situation. Extended
> statistics are currently only handled for base quals, not join quals.
> See dependency_is_compatible_clause().
>
>
But "recommended_content_id" and "version" are both in the same table,
doesn't that make them base quals?

The most obvious thing to me would be to vacuum
product_content_recommendation_main2 to get rid of the massive number of
heap fetches.  And to analyze everything to make sure the estimation errors
are not simply due to out-of-date stats.  And to increase work_mem.

It isn't clear that we want to get rid of the nested loop; from the info we
have to go on, the hash join might be even slower yet.  Seeing the plan with
enable_nestloop=off could help there.
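For example, just in one session for diagnosis (not as a permanent setting),
with the placeholder standing in for your actual query:

SET enable_nestloop = off;
EXPLAIN (ANALYZE, BUFFERS) <your query here>;
RESET enable_nestloop;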

Cheers,

Jeff


Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0

2018-11-27 Thread Jeff Janes
On Tue, Nov 27, 2018 at 9:17 AM Sanyo Moura  wrote:

> Hi,
>
> I'm running performance tests for my application at version 11.1 and
> encountered
> queries with high planning time compared to the same planning, running at
> versions 10.5 and 11.0.
>

Can you reproduce the regression if the tables are empty?  If so, can you
share the create script that creates the tables?

Cheers,

Jeff

>


Re: Optimizer choosing the wrong plan

2018-11-26 Thread Jeff Janes
On Mon, Nov 26, 2018 at 5:11 AM Viswanath  wrote:

> *Postgres server version -  9.5.10*
> *RAM - 128 GB*
> *WorkMem 64 MB*
>
> *Problematic query with explain :*
> *Query 1 (original):*
> explain analyse SELECT myTable1.ID FROM myTable1 LEFT JOIN myTable2 ON
> myTable1.ID=myTable2.ID WHERE  myTable1.bool_val = true) AND
> (myTable1.small_intval IN (1,2,3))) AND ((*myTable2.bigint_val = 1*) AND
> (myTable1.bool_val = true))) AND (((myTable1.ID >= 1) AND
> (myTable1.ID <= 1)) ))  ORDER BY 1 DESC , 1 NULLS FIRST  LIMIT
> 11;
>

There is no point doing a LEFT JOIN when the NULL-extended rows get
filtered out later.

Also, ordering by the same column twice is peculiar, to say the least.


> The table myTable2 contains *12701952* entries. Out of which only *86227*
> is
> not null and *146* entries are distinct.
>

I assume you mean the column myTable2.ID has that many not null and
distinct?


>
> The above query returns 0 rows since 'myTable2.bigint_val = 1' criteria
> satisfies nothing. It takes 6 seconds for execution as the planner chooses*
> myTable1.ID column's index*.


More importantly, it chooses the index on myTable2.ID.  It does also use
the index on myTable1.ID, but that is secondary.

The ideal index for this query would probably be a composite index on
myTable2 (bigint_val, id DESC);
The planner would probably choose to use that index, even if the statistics
are off.
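That is, something like:

CREATE INDEX ON myTable2 (bigint_val, id DESC);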

I tried running *vacuum analyse* table many times, tried changing the
> *statistics target of the column to 250 (since there are only 149 distinct
> values)*. But none worked out. The planner thinks that there are *1727*
> rows
> that matches the condition *myTable2.bigint_val = 1* but there are none.
>

It would be interesting if you could upgrade a copy of your server to v11
and try it there.  We made changes to ANALYZE in that version which were
intended to improve this situation, and it would be nice to know if it
actually did so for your case.

Also, increasing the statistics target even beyond 250 might help.  If every
one of the observed values is seen at least twice, that will trigger the
system to assume it has observed all distinct values that exist.  But
if any of the values are seen exactly once, that causes it to take a
different path (the one which got modified in v11).

Cheers,

Jeff


Re: PostgreSQL Read IOPS limit per connection

2018-12-27 Thread Jeff Janes
>
>
> *Performance issue:*
>
> I’m trying to figure out if PostgreSQL (PG) has some inherent limit on
> IOPS per connection.
>
> Running pgbench with multiple clients (-c 30) we are able to see 20K+ IOPS
> , which is what we expect. But, if we use just one client, we get 1200
> IOPS, avg disk queue size around 1:
>

The default transaction done by pgbench simply has no opportunity for
dispatching multiple IO requests per connection.  It is just a series of
single-row lookups and single-row updates or inserts.  You will have to use
a different benchmark if you want to exercise this area.  Probably
something analytics heavy.

Also, you would want to use the newest version of PostgreSQL, as 9.6
doesn't have parallel query, which is much more generally applicable than
effective_io_concurrency is.

> One of the issues I’m trying to solve is related to extracting data from a
> large table, which uses a full table scan. We see the same 1200 IOPS limit
> of pgbench when we SELECT on this table using just one connection. If there
> is a limitation per connection, I might set up the application to have
> several connections, and then issue SELECTs for different sections of the
> table, and later join the data, but it looks cumbersome, especially if the
> DB can do extract data using more IOPS.
>
The kernel should detect a sequential read in progress and invoke
readahead.  That should be able to keep the CPU quite busy with data for
any decent IO system.  Are you sure IO is even the bottleneck for your
query?

Perhaps your kernel readahead settings need to be tuned.  Also, you may
benefit from parallel query features implemented in newer versions of
PostgreSQL.  In any event, the default transactions of pgbench are not
going to be useful for benchmarking what you care about.

Cheers,

Jeff


Re: postgresql unix socket connections

2019-01-09 Thread Jeff Janes
On Wed, Jan 9, 2019 at 3:35 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

>
> Now, In machine 1 when I run psql I get the prompt password but in machine
> 2 I keep getting the next error :
>
> psql: could not connect to server: No such file or directory
> Is the server running locally and accepting
> connections on Unix domain socket
> "/var/run/postgresql/.s.PGSQL.5432"?
>
> One important thing that I didn't mention is that I installed on machine 2
> package postgresql-libs.x86_64 0:8.4.20-8.el6_9 from the postgres
> repository (in order to upgrade it to 9.6).
>

The front end and the backend have compiled-in defaults for the socket
directory.  If you installed them from different sources, they may have
different compiled-in defaults.  Which means they may not be able to
rendezvous using the default settings for both of them.

You can override the default using unix_socket_directory on the server (as
you discovered).  On the client you can override it by using -h (or PGHOST
or host= or whatever mechanism), with an argument that looks like a
directory, starting with a '/'.

Cheers,

Jeff


Re: postgresql unix socket connections

2019-01-09 Thread Jeff Janes
On Wed, Jan 9, 2019 at 10:09 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Hey Tom,
> I'm aware of how I can solve it. I wanted to understand why after
> installing the pg 9.6 packages suddenly psql tries to access the socket on
> /var/run/postgresql. Does the libpq default unix socket is changed between
> those two versions ? (9.6,9.2)
>

It is not a version issue, but a packaging issue.  Different systems have
different conventions on where sockets should go, and the packager imposes
their opinion on the things they package.

Cheers,

Jeff


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-27 Thread Jeff Janes
>
>
> You could also try pg_test_fsync to get low-level information, to
>> supplement the high level you get from pgbench.
>
>
> Thanks for pointing me to this tool. never knew pg_test_fsync existed!
> I've run `pg_test_fsync -s 60` two times and this is the output -
> https://gist.github.com/saurabhnanda/b60e8cf69032b570c5b554eb50df64f8 I'm
> not sure what to make of it?
>

I don't know what to make of that either.  I'd expect fdatasync using two
8kB writes to be about the same throughput as using one 8kB write, but
instead it is 4 times slower.  Also, I'd expect open_datasync to get slower
by a factor of 2, not a factor of 8, when going from one to two 8kB writes
(that is not directly relevant, as you aren't using open_datasync, but is
curious nonetheless).  Is this reproducible with different run lengths?  I
wonder if your write cache (or something) gets "tired" during the first
part of pg_test_fsync and thus degrades the subsequent parts of the test.
I would say something in your IO stack is not optimal, maybe some component
is "consumer grade" rather than "server grade".  Maybe you can ask Hetzner
about that.


> The effects of max_wal_size are going to depend on how you have IO
>> configured, for example does pg_wal shared the same devices and controllers
>> as the base data?  It is mostly about controlling disk usage and
>> crash-recovery performance, neither of which is of primary importance to
>> pgbench performance.
>
>
>  The WAL and the data-directory reside on the same SSD disk -- is this a
> bad idea?
>

If you are trying to squeeze out every last bit of performance, then I
think it is a bad idea.  Or at least, it is worth trying the alternative and
seeing.  The flushing that occurs during checkpoints and the flushing that
occurs for every commit can interfere with each other.


> I was under the impression that smaller values for max_wal_size cause
> pg-server to do "maintenance work" related to wal rotation, etc. more
> frequently and would lead to lower pgbench performance.
>

If you choose ridiculously small values it would.  But once the value is
sufficient, increasing it further wouldn't do much.  Given your low level
of throughput, I would think the default is already sufficient.

Thanks for including the storage info.  Nothing about it stands out to me
as either good or bad, but I'm not a hardware maven; hopefully one will be
reading along and speak up.


> PS: Cc-ing the list back again because I assume you didn't intend for your
> reply to be private, right?
>

Yes, I had intended to include the list but hit the wrong button, sorry.

Cheers,

Jeff

>


Re: Commit(?) overhead

2019-04-04 Thread Jeff Janes
On Thu, Apr 4, 2019 at 3:42 AM Duncan Kinnear 
wrote:

>
> the EXPLAIN (ANALYSE, TIMING TRUE) of this query gives:
>
>  Update on next_id  (cost=0.14..8.16 rows=1 width=36) (actual
> time=0.057..0.057 rows=0 loops=1)
>->  Index Scan using next_id_pk on next_id  (cost=0.14..8.16 rows=1
> width=36) (actual time=0.039..0.040 rows=1 loops=1)
>  Index Cond: ((id)::text = 'Session'::text)
>  Planning Time: 0.083 ms
>  Execution Time: 0.089 ms
>
> which is significantly less than 50ms.
>

The EXPLAIN ANALYZE doesn't include the time needed to fsync the
transaction logs.  It measures only the update itself, not the implicit
commit at the end.  DBeaver is seeing the fsync-inclusive time.  50ms is
pretty long, but some file systems and OSes seem to be pretty inefficient
at this and take several disk revolutions to get the data down.
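
One diagnostic sketch for confirming that the commit-time flush is where the
~50ms goes is to take the flush out of the commit path for a single session
(only for testing, or for data you can afford to lose on a crash; the updated
column name here is hypothetical):

SET synchronous_commit = off;  -- commit returns without waiting for the WAL flush
BEGIN;
UPDATE next_id SET next_val = next_val + 1 WHERE id = 'Session';
COMMIT;
-- If the round trip now drops to a few ms, the difference is the fsync cost.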


>
> Now, if I point DBeaver to a VM server on the same gigabit network switch,
> running version:
>9.5.3 on i386-pc-solaris2.11, compiled by cc: Sun C 5.10 SunOS_i386
> Patch 142363-07 2010/12/09, 64-bit
> then the same query executes in about 2-3ms
>

That machine probably has hardware to do a fast fsync, has fsync turned
off, or is lying about the safety of its data.

Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-25 Thread Jeff Janes
On Mon, Feb 25, 2019 at 11:13 AM Gunther Schadow 
wrote:

> Anyway, I think the partitioned table is the right and brilliant solution,
> because an index really isn't required. The actual pending partition will
> always remain quite small, and being a queue, it doesn't even matter how
> big it might grow, as long as new rows are inserted at the end and not in
> the middle of the data file and still there be some way of fast skip over
> the part of the dead rows at the beginning that have already been processed
> and moved away.
>
Why do you want to do that?  If you are trying to force the queue to be
handled in a "fair" order, then this isn't the way to do it, you would want
to add an "ORDER BY" to your dequeueing query (in which case you are
probably back to adding an index).

Once the space in the beginning of the table has been reclaimed as free,
then it will be reused for newly inserted tuples.  After the space is freed
up but before it is reused, the seq scan can't skip those blocks entirely,
but it can deal with the blocks quickly because they are empty.  If the
blocks are full of dead but not freed tuples (because the long-running
transactions are preventing them from being cleaned up) then it will have
to go through each dead tuple to satisfy itself that it actually is dead.
This might not be as bad as it is for indexes, but certainly won't be good
for performance.

 Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-25 Thread Jeff Janes
On Sat, Feb 23, 2019 at 4:06 PM Gunther  wrote:

> Hi,
>
> I am using an SQL queue for distributing work to massively parallel
> workers.
>
You should look into specialized queueing software.

...

> I figured I might just pause all workers briefly to schedule the REINDEX
> Queue command, but the problem with this is that while the transaction
> volume is large, some jobs may take minutes to process, and in that case we
> need to wait minutes to quiet the database with then 47 workers sitting as
> idle capacity waiting for the 48th to finish so that the index can be
> rebuilt!
>
The jobs that take minutes are themselves the problem.  They prevent tuples
from being cleaned up, meaning all the other jobs needs to grovel through
the detritus every time they need to claim a new row.  If you got those
long running jobs to end, you probably wouldn't even need to reindex--the
problem would go away on its own as the dead-to-all tuples get cleaned up.

Locking a tuple and leaving the transaction open for minutes is going to
cause no end of trouble on a highly active system.  You should look at a
three-state method where the tuple can be pending/claimed/finished, rather
than pending/locked/finished.  That way the process commits immediately
after claiming the tuple, and then records the outcome in another
transaction once it is done processing.  You will need a way to detect
processes that failed after claiming a row but before finishing, but
implementing that is going to be easier than all of this re-indexing stuff
you are trying to do now.  You would claim the row by updating a field in
it to have something distinctive about the process, like its hostname and
pid, so you can figure out if it is still running when it comes time to
clean up apparently forgotten entries.
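
A minimal sketch of that pending/claimed/finished approach, with hypothetical
table and column names:

-- Claim a job and commit immediately:
UPDATE queue
   SET state = 'claimed',
       claimed_by = 'host42:12345',   -- the worker's hostname and pid
       claimed_at = now()
 WHERE job_id = (SELECT job_id
                   FROM queue
                  WHERE state = 'pending'
                  ORDER BY job_id
                  LIMIT 1
                  FOR UPDATE SKIP LOCKED)
 RETURNING job_id, action;

-- ... process the job with no transaction held open ...

-- Record the outcome in a second, short transaction:
UPDATE queue SET state = 'finished' WHERE job_id = 12345;

-- Periodically reclaim jobs whose worker disappeared (interval is illustrative):
UPDATE queue
   SET state = 'pending', claimed_by = NULL
 WHERE state = 'claimed' AND claimed_at < now() - interval '15 minutes';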

Cheers,

Jeff


Re: Aggregate and many LEFT JOIN

2019-02-26 Thread Jeff Janes
On Mon, Feb 25, 2019 at 3:54 AM kimaidou  wrote:


> What strikes me is that if I try to simplify it a lot, removing all data
> but the main table (occtax.observation) primary key cd_nom and aggregate,
> the query plan should be able to use the cd_nom index for sorting and
> provide a better query plan (hash aggregate), but it does not seem so:
>

HashAggregate doesn't support aggregates with DISTINCT.  I don't think
there is any reason it can't, it is just that no one has gotten around to
it.

Aggregates with DISTINCT also kill your ability to get parallel queries.

Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Jeff Janes
On Sun, Feb 24, 2019 at 1:02 PM Gunther  wrote:

> Thank you all for responding so far.
>
> David Rowley  and Justin Pryzby suggested things about autovacuum. But I
> don't think autovacuum has any helpful role here. I am explicitly doing a
> vacuum on that table. And it doesn't help at all. Almost not at all.
>
If you do a vacuum verbose, what stats do you get back?  What is the size
of the index when the degradation starts to show, and immediately after a
successful reindex?

Also, how is JobID assigned?  Are they from a sequence, or some other
method where they are always added to the end of the index's keyspace?

When it starts to degrade, what is the EXPLAIN plan for the query?

Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Jeff Janes
On Sat, Feb 23, 2019 at 4:06 PM Gunther  wrote:

> the dequeue operation is essentially this:
>
> BEGIN
>
> SELECT jobId, action
>   FROM Queue
>   WHERE pending
>   FOR UPDATE SKIP LOCKED
>
>
There is no LIMIT shown.  Wouldn't the first thread to start up just lock
all the rows and everyone else would starve?
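
For instance, the dequeue from above with only a LIMIT added, so each worker
locks just the row it is about to process:

SELECT jobId, action
  FROM Queue
 WHERE pending
 LIMIT 1
 FOR UPDATE SKIP LOCKED;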

Cheers,

Jeff


Re: impact of auto explain on overall performance

2019-03-14 Thread Jeff Janes
On Thu, Mar 14, 2019 at 3:29 AM Stephan Schmidt  wrote:

> Hello,
>
>
>
> i’m currently working on a high Performance Database and want to make sure
> that whenever there are slow queries during regular operations i’ve got all
> Information about the query in my logs. So auto_explain come to mind, but
> the documentation explicitly states that it Comes at a cost. My Question
> is, how big is the latency added by auto_explain in percentage or ms ?
>

You will have to measure it yourself and see.  It depends on your hardware,
OS, and OS version, and PostgreSQL version.  And the nature of your
queries.  If you have auto_explain.log_timing=on, then I find that large
sorts are the worst impacted.  So if you have a lot of those, you should be
careful.

On older kernels, I would run with auto_explain.log_timing=off.  On newer
kernels where you can read the clock from user-space, I run with
auto_explain.log_timing=on.  I find the slowdown noticeable with careful
investigation (around 3%, last time I carefully investigated it), but
usually well worth paying to have actual data to work with when I find slow
queries in the log.  I made a special role with auto_explain disabled for
use with a few reporting queries with large sorts, both to circumvent the
overhead and to avoid spamming the log with slow queries I already know
about.
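
A sketch of that per-role opt-out (it assumes auto_explain is loaded via
session_preload_libraries; the role name and thresholds are illustrative):

-- In postgresql.conf, for everyone:
--   session_preload_libraries = 'auto_explain'
--   auto_explain.log_min_duration = '1s'
--   auto_explain.log_timing = on
-- A role for the known-slow reporting queries, with auto_explain disabled:
CREATE ROLE reporting LOGIN;
ALTER ROLE reporting SET auto_explain.log_min_duration = -1;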

Cheers,

Jeff

>


Re: autovacuum big table taking hours and sometimes seconds

2019-02-07 Thread Jeff Janes
On Thu, Feb 7, 2019 at 6:55 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

I have 3 questions :
> 1)To what value do you recommend to increase the vacuum cost_limit ? 2000
> seems reasonable ? Or maybe its better to leave it as default and assign a
> specific value for big tables ?
>

That depends on your IO hardware, and your workload.  You wouldn't want
background vacuum to use so much of your available IO that it starves your
other processes.



> 2)When the autovacuum reaches the cost_limit while trying to vacuum a
> specific table, it wait nap_time seconds and then it continue to work on
> the same table ?
>

No, it waits for autovacuum_vacuum_cost_delay before resuming within the
same table. During this delay, the table is still open and it still holds a
lock on it, and holds the transaction open, etc.  Naptime is entirely
different, it controls how often the vacuum scheduler checks to see which
tables need to be vacuumed again.



> 3)So in case I have a table that keeps growing (not fast because I set the
> vacuum_scale_factor to 0 and the autovacuum_vacuum_threshold to 1). If
> the table keep growing it means I should try to increase the cost right ?
> Do you see any other option ?
>

 You can use pg_freespacemap to see if the free space is spread evenly
throughout the table, or clustered together.  That might help figure out
what is going on.  And, is it the table itself that is growing, or the
index on it?
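
A sketch of those checks (the table name is hypothetical):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
-- Where in the table is the free space?
SELECT blkno, avail
  FROM pg_freespace('my_big_table')
 WHERE avail > 0
 ORDER BY blkno;
-- Is it the heap or the indexes that keep growing?
SELECT pg_size_pretty(pg_table_size('my_big_table'))   AS heap_size,
       pg_size_pretty(pg_indexes_size('my_big_table')) AS index_size;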

Cheers,

Jeff


Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread Jeff Janes
On Wed, Feb 6, 2019 at 5:29 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:


> Now the question is how to handle or tune it ? Is there any change that I
> need to increase the cost_limit / cost_delay ?
>

Sometimes vacuum has more work to do, so it takes more time to do it.

There is no indication of a problem. Or at least, you haven't described
one. So, there is nothing to handle or to tune.

If there is a problem, those log entries might help identify it.  But in
the absence of a problem, they are just log spam.

Cheers,

Jeff


Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread Jeff Janes
On Wed, Feb 6, 2019 at 9:42 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Well, basically I'm trying to tune it because the table still keep
> growing. I thought that by setting the scale and the threshold it will be
> enough but its seems that it wasnt. I attached some of the logs output to
> hear what you guys think about it ..
>

Are all four log entries from well after you made the change?  My first
inclination is to think that the first 2 are from either before the change,
or just after the change when it is still settling into the new regime.
Also, is the table still continuing to grow, or is at a new steady-state of
bloat which isn't growing but also isn't shrinking back to where you want
it to be?  More aggressive vacuuming alone should stop the bloat, but is
not likely to reverse it.

I habitually set vacuum_cost_page_hit and vacuum_cost_page_miss to zero.
Page reads are self-limiting (vacuum is single threaded, so you can't have
more than one read (times autovacuum_max_workers) going on at a time) so I
don't see a need to throttle them intentionally as well--unless your entire
db is sitting on one spindle.  Based on the high ratio of read rates to
write rates in the last two log entries, this change alone should be enough
greatly speed up the run time of the vacuum.

If you need to speed it up beyond that, I don't think it matters much
whether you decrease cost_delay or increase cost_limit, it is the ratio
that mostly matters.
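
A sketch of those knobs (the values are illustrative, not recommendations):

-- Globally: stop charging vacuum for page hits and reads, so only dirtying
-- pages counts against the cost limit.
ALTER SYSTEM SET vacuum_cost_page_hit = 0;
ALTER SYSTEM SET vacuum_cost_page_miss = 0;
SELECT pg_reload_conf();
-- Or adjust the ratio just for the problem table:
ALTER TABLE big_table SET (autovacuum_vacuum_cost_limit = 2000,
                           autovacuum_vacuum_cost_delay = 2);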

And if these latter measures do work, you should consider undoing changes
to autovacuum_vacuum_scale_factor.  Reading the entire index just to remove
10,000 rows from the table is a lot of extra work that might be
unnecessary. Although that extra work might not be on anyone's critical
path.

>


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 10:42 AM Tom Lane  wrote:

> Thomas Kellerer  writes:
> > The bloom index is only used if either Seq Scan is disabled or if the
> random_page_cost is set to 1 (anything above 1 triggers a Seq Scan on my
> Windows laptop).
>
> Hm.  blcostestimate is using the default cost calculation, except for
>
> /* We have to visit all index tuples anyway */
> costs.numIndexTuples = index->tuples;
>
> which essentially tells genericcostestimate to assume that every index
> tuple will be visited.  This obviously is going to increase the cost
> estimate; maybe there's something wrong with that?
>

I assumed (without investigating yet) that genericcostestimate is applying
a cpu_operator_cost (or a few of them) on each index tuple, while the
premise of a bloom index is that you do very fast bit-fiddling, not more
expense SQL operators, for each tuple and then do the recheck only on what
survives to the table tuple part.

Cheers,

Jeff


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 11:58 AM Jeff Janes  wrote:

>
> On Tue, Feb 12, 2019 at 10:42 AM Tom Lane  wrote:
>
>>
>> Hm.  blcostestimate is using the default cost calculation, except for
>>
>> /* We have to visit all index tuples anyway */
>> costs.numIndexTuples = index->tuples;
>>
>> which essentially tells genericcostestimate to assume that every index
>> tuple will be visited.  This obviously is going to increase the cost
>> estimate; maybe there's something wrong with that?
>>
>
> I assumed (without investigating yet) that genericcostestimate is applying
> a cpu_operator_cost (or a few of them) on each index tuple, while the
> premise of a bloom index is that you do very fast bit-fiddling, not more
> expensive SQL operators, for each tuple and then do the recheck only on what
> survives to the table tuple part.
>

In order for bloom (or any other users of CREATE ACCESS METHOD, if there
are any) to have a fighting chance to do better, I think many of selfuncs.c's
currently private functions would have to be declared in some header file,
perhaps utils/selfuncs.h.  But that then requires a cascade of other
inclusions.  Perhaps that is why it was not done.

Cheers,

Jeff

>


Re: Performance regressions found using sqlfuzz

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 4:23 AM Jung, Jinho  wrote:

>
> Hello,
>
> We are developing a tool called sqlfuzz for automatically finding
> performance regressions in PostgreSQL. sqlfuzz performs mutational fuzzing
> to generate SQL queries that take more time to execute on the latest
> version of PostgreSQL compared to prior versions. We hope that these
> queries would help further increase the utility of the regression test
> suite.
>
> We would greatly appreciate feedback from the community regarding the
> queries found by the tool so far. We have already incorporated prior
> feedback from the community in the latest version of sqlfuzz.
>

This approach doesn't seem very exciting to me as-is, because optimization
is a very pragmatic endeavor.  We make decisions all the time that might
make some queries better and others worse.  If the queries that get better
are natural/common ones, and the ones that get worse are weird/uncommon
ones (like generated by a fuzzer), then making that change is an
improvement even if there are some performance (as opposed to correctness)
regressions.

I would be more interested in investigating some of these if the report
would:

1) include the exact commit in which the regression was introduced (i.e.
automate "git bisect").
2) verify that the regression still exists in the dev HEAD and report which
commit it was verified in (since HEAD changes frequently).
3) report which queries (if any) in your corpus were made better by the
same commit which made the victim query worse.

Cheers,

Jeff

>


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 4:17 PM Tom Lane  wrote:

> Jeff Janes  writes:
> > In order for bloom (or any other users of CREATE ACCESS METHOD, if there
> > are any) to have a fighting chance to do better, I think many of
> selfuncs.c
> > currently private functions would have to be declared in some header
> file,
> > perhaps utils/selfuncs.h.  But that then requires a cascade of other
> > inclusions.  Perhaps that is why it was not done.
>
> I'm just in the midst of refactoring that stuff, so if you have
> suggestions, let's hear 'em.
>

The goal would be that I can copy the entire definition of
genericcostestimate into blcost.c, change the function's name, and get it
to compile.  I don't know the correct way accomplish that.  Maybe
utils/selfuncs.h can be expanded to work, or if there should be a new
header file like "utils/index_am_cost.h"

What I've done for now is:

#include "../../src/backend/utils/adt/selfuncs.c"

which I assume is not acceptable as a real solution.


It's possible that a good cost model for bloom is so far outside
> genericcostestimate's ideas that trying to use it is not a good
> idea anyway.
>

I think that might be the case.  I don't know what the right answer would
look like, but I think it will likely end up needing to access everything
that genericcostestimate currently needs to access.  Or if bloom doesn't,
some other extension implementing an ACCESS METHOD will.

Cheers,

Jeff


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-29 Thread Jeff Janes
On Mon, Jan 28, 2019 at 12:03 AM Saurabh Nanda 
wrote:

> All this benchmarking has led me to a philosophical question, why does PG
> need shared_buffers in the first place?
>

PostgreSQL cannot let the OS get its hands on a dirty shared buffer until
the WAL record "protecting" that buffer has been flushed to disk. If a
dirty shared buffer got written to disk, but then a crash happened before
the WAL record got flushed to disk, then the data could be corrupted when it
comes back up. So shared_buffers effectively serves as cooling pond where
dirty buffers wait for their WAL to be flushed naturally so they can be
written without instigating a performance-reducing flush just for them.

Also, concurrent clients needs to access the same disk pages at overlapping
times without corrupting each other.  Perhaps that could be implemented to
have just the buffer headers in shared memory to coordinate the locking,
and not having the buffers themselves in shared memory.  But that is not
how it is currently implemented.


> What's wrong with letting the OS do the caching/buffering?
>

Nothing, and that is what it does.  Which is why the advice for
shared_buffers is often to use a small fraction of RAM, leaving the rest
for the OS to do its thing.  But PostgreSQL still needs a way to lock those
pages, both against concurrent access by its own clients, and against
getting flushed out of order by the OS.  There is no performant way to
release the dirty pages immediately to the OS while still constraining the
order in which the OS flushes them to disk.

Finally, while reading a page from the OS cache into shared_buffers is much
faster than reading it from disk, it is still much slower than finding it
already located in shared_buffers.  So if your entire database fits in RAM,
you will get better performance if shared_buffers is large enough for the
entire thing to fit in there, as well.  This is an exception to the rule
that shared_buffers should be a small fraction of RAM.
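
If you want to see what is actually occupying shared_buffers, a sketch using
the contrib pg_buffercache extension:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid)
   AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                              WHERE datname = current_database()))
 GROUP BY c.relname
 ORDER BY buffers DESC
 LIMIT 10;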


> Isn't it optimised for this kind of stuff?
>

Maybe.  But you might be surprised at how poorly optimized it is.  It depends
on your OS and version of it, of course.  If you have a high usage_count
buffer which is re-dirtied constantly, it will only get written and flushed
to disk once per checkpoint if under PostgreSQL control. But I've seen
pages like that get written many times per second under kernel control.
Whatever optimization it tried to do, it wasn't very good at.  Also, if
many contiguous pages are dirtied in a close time-frame, but not dirtied in
their physical order, the kernel should be able to re-order them into long
sequential writes, correct?  But empirically, it doesn't, at least back in
the late 2.* series kernels when I did the experiments.  I don't know if it
didn't even try, or tried but failed.  (Of course back then, PostgreSQL
didn't do a good job of it either)

Cheers,

Jeff


Re: Will higher shared_buffers improve tpcb-like benchmarks?

2019-01-29 Thread Jeff Janes
On Tue, Jan 29, 2019 at 6:39 AM Saurabh Nanda 
wrote:

> Hi,
>
> I'm going crazy trying to optimise my Postgres config for a production
> setting [1] Once I realised random changes weren't getting my anywhere, I
> finally purchased PostgreSQL 10 - Higher Performance [2] and understood the
> impact of shared_buffers.
>
> IIUC, shared_buffers won't have any significant impact in the following
> scenario, right?
>
> -- DB size = 30GB
> -- shared_buffers = 2GB
> -- workload = tpcb-like
>
> This is because the tpcb-like workload selects & updates random rows from
> the DB [3]. Therefore, with a 2GB shared buffer, there is only a 6-7%
> chance (did I get my probability correct?) that the required data will be
> in the shared_buffer. Did I understand this correctly?
>

That is likely correct, but the data will likely be stored in the OS file
cache, so reading it from there will still be pretty fast.


>
> If nothing else becomes the bottleneck (eg. periodically writing dirty
> pages to disk), increasing the shared_buffers to 15GB+ should have a
> significant impact, for this DB-size and workload, right? (The system has
> 64 GB RAM)
>

About the only way to know for sure that writing dirty data is not the
bottleneck is to use a read only benchmark, such as the -S flag for
pgbench.  And at that point, the IPC overhead between pgbench and the
backend, even when both are running on the same machine, is likely to be
the bottleneck.  And after that, the bottleneck might shift to opening and
closing transactions and taking and releasing locks[1].

If you overcome that, then you might reliably see a difference between 2GB
and 15GB of shared buffers, because at 2GB each query to pgbench_accounts
is likely to fetch 2 pages into shared_buffers from the OS cache: the index
leaf page for pgbench_accounts_pkey, and the table page for
pgbench_accounts.  At 15GB, the entire index should be reliably in
shared_buffers (after enough warm-up time), so you would only need to fetch
1 page, and often not even that.
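
A sketch of that size check, using the standard pgbench relation names:

SELECT pg_size_pretty(pg_relation_size('pgbench_accounts'))      AS table_size,
       pg_size_pretty(pg_relation_size('pgbench_accounts_pkey')) AS index_size;
SHOW shared_buffers;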

Cheers,

Jeff

[1]   I have a very old patch to pgbench that introduces a new query to
overcome this,
https://www.postgresql.org/message-id/BANLkTi%3DQBYOM%2Bzj%3DReQeiEKDyVpKUtHm6Q%40mail.gmail.com
.  I don't know how much work it would be to get it to compile against
newer versions--I stopped maintaining it because it became too much work to
rebase it past conflicting work, and because I lost interest in this line
of research.


Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Jeff Janes
On Sun, Apr 14, 2019 at 9:06 PM Gunther  wrote:

> Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not
> having given enough detail.
>
> The version is 10.2 latest. The database was originally built with 10.1
> and then just started with 10.2.
>
Do you mean 11.2?  The latest in the 10 series is 10.7.  If you do mean
10.2, there is a fix for a memory leak bug since then that might plausibly be
relevant (bdc7f686d1b8f423cb)

>
> I said "crash" and that is wrong. Not a signal nor core dump. It is the
> ERROR:  out of memory. Only the query crashes. Although I don't know if may
> be the backend server might have left a core dump?
>
I don't think there would be a core dump on only an ERROR, and probably not
worthwhile to trick it into generating one.


> The short version is:
>
> Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 
> 1425134928 used
> 2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
> 2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 
> in memory context "ExecutorState".
>
I don't know why an 8GB system with a lot of cache that could be evicted
would get an OOM when something using 1.5GB asks for 8272 bytes more.  But
that is a question of how the kernel works, rather than how PostgreSQL
works.  But I also think the log you quote above belongs to a different
event than the vmstat trace in your first email.


>   ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 
> chunks); 1408127768 used
> HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 
> used
>   HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 
> 41360 used
>
>
This does not seem to match your query plan.  Why would a plan with no Hash
Joins have a HashBatchContext?  I think this log must be from a different
query than the one that generated the plan you posted.  Perhaps one was
before you lowered work_mem and one was after?

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Jeff Janes
On Sun, Apr 14, 2019 at 4:51 PM Gunther  wrote:

> For weeks now, I am banging my head at an "out of memory" situation. There
> is only one query I am running on an 8 GB system, whatever I try, I get
> knocked out on this out of memory.
>
Is PostgreSQL throwing an error with OOM, or is getting killed -9 by the
OOM killer?  Do you get a core file you can inspect with gdb?

You might want to see the query, but it is a huge plan, and I can't really
> break this down. It shouldn't matter though. But just so you can get a
> glimpse here is the plan:
>
> Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 
> width=1197)
>   ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
>
>
>
Maybe it is memory for trigger or constraint checking, although I don't
know why that would appear instantly.  What triggers or constraints do you
have on businessoperation?

What if you just run the SELECT without the INSERT?  Or insert into a temp
table rather than into businessoperation?  And if that doesn't crash, what
if you then insert to businessoperation from the temp table?

Also, what version?

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Sun, Apr 14, 2019 at 11:04 PM Gunther  wrote:

> Could you rerun the query with \set VERBOSITY verbose to show the file/line
> that's failing ?
>
> Here goes:
>
> integrator=# \set VERBOSITY verbose
> integrator=# SET ENABLE_NESTLOOP TO OFF;
> SET
> integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM 
> reports.v_BusinessOperation;
> ERROR:  53200: out of memory
> DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
> LOCATION:  MemoryContextAlloc, mcxt.c:798
>
> you notice that I set ENABLE_NESTLOOP to off, that is because the planner
> goes off thinking the NL plan is marginally more efficient, but in fact it
> will take 5 hours to get to the same out of memory crash, while the no NL
> plan gets there in half an hour. That verbose setting didn't help much I
> guess.
>
I think the backtrace of the enable_nestloop=on plan would be more useful.
Here someone has filled up memory, and then we see HashBatchContext trip
over it.  But it isn't the one that caused the problem, so the
backtrace doesn't help.  With the previous plan, it was an allocation into
ExecutorState which tripped over the problem, and it is likely that it is
coming from the same series of allocations that caused the problem.

To get it to happen faster, maybe you could run the server with a small
setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
Unfortunately I don't know how to set a breakpoint for allocations into a
specific context, and setting a breakpoint for any memory allocation is
probably going to fire too often to be useful.

Yes, the verbose option didn't help (but the gdb backtrace made up for
it--kind of--we really need the backtrace of the allocations into
ExecutorState).  It isn't helpful to know that a memory allocation failed
in the mcxt.c code.  Too bad it doesn't report the location of the caller of
that code.  I know in Perl you can use Carp::croak to do that, but I don't
know to do it in C.

But really the first thing I want to know now is what if you just do the
select, without the insert?

explain analyze SELECT * FROM reports.v_BusinessOperation

If that works, what about "create temporary table foo as SELECT * FROM
reports.v_BusinessOperation" ?

And if that works, what about "INSERT INTO reports.BusinessOperation SELECT
* FROM foo"?

If the ERROR happens in the first or last of these, it might be much easier
to analyze in that simplified context. If it happens in the middle one,
then we probably haven't achieved much. (And if there is no ERROR at all,
then you have workaround, but we still haven't found the fundamental bug).

Are you not showing the view definition for proprietary reasons, or just
because you don't think it will be useful? If the latter, please send it as
an attachment, I can't guarantee it will be useful, but there is only one
way to find out.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Sun, Apr 14, 2019 at 11:59 PM Gunther  wrote:


> Is there any doubt that this might be a problem with Linux? Because if
> you want, I can whip out a FreeBSD machine, compile pgsql, and attach
> the same disk, and try it there. I am longing to have a reason to move
> back to FreeBSD anyway. But I have tons of stuff to do, so if you do not
> have reason to suspect Linux to do wrong here, I prefer skipping that
> futile attempt
>

I think the PostgreSQL leaking in the first place would be independent of
Linux being ungraceful about it.  So repeating it on BSD probably wouldn't
help us here.  If you want to take up the 2nd issue with the kernel folks,
having some evidence from BSD might (but not very likely) be helpful for
that, but that would be for a different mailing list.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Mon, Apr 15, 2019 at 12:34 PM Gunther  wrote:

> Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
> 715 {
> (gdb) p context->name
> $8 = 0x96ce5b "ExecutorState"
>
>
I think that the above one might have been the one you wanted.


> I guess I should run this for a little longer. So I disable my breakpoints
>
>
it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing
> fast, so now back to gdb and break:
>
Unfortunately, I think this means you missed your opportunity and are now
getting backtraces of the innocent bystanders.

Particularly since you report that the version using nested loops rather
than hash joins also leaked, so it is probably not the hash-join specific
code that is doing it.

What I've done before is compile with the comments removed from
src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */

and then look for allocation sizes which are getting allocated but not
freed, and then you can go back to gdb to look for allocations of those
specific sizes.  This generates a massive amount of output, and it bypasses
the logging configuration and goes directly to stderr--so it might not end
up where you expect.


Thanks for the view definition.  Nothing in it stood out to me as risky.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Mon, Apr 15, 2019 at 11:28 AM Tom Lane  wrote:

> Jeff Janes  writes:
> > To get it to happen faster, maybe you could run the server with a small
> > setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
> > Unfortunately I don't know how to set a breakpoint for allocations into a
> > specific context, and setting a breakpoint for any memory allocation is
> > probably going to fire too often to be useful.
>
> If you can use gdb at all, it's not that hard to break on allocations
> into a specific context; I've done it many times.  The strategy is
> basically
>

> 1. Let query run long enough for memory usage to start increasing,
> then attach to backend with gdb.
>
> 2. Set breakpoint at, probably, AllocSetAlloc.  (In some cases,
> reallocs could be the problem, but I doubt it here.)  Then "c".
>
> 3. When it stops, "p *context" and see if this is the context
> you're looking for.  In this case, since we want to know about
> allocations into ExecutorState and we know there's only one
> active one, you just have to look at the context name.  In general
> you might have to look at the backtrace.  Anyway, if it isn't the
> one you want, just "c" until you get to an allocation into the
> one you do want.
>
> 4. Once you have found out the address of the context you care
> about, make the breakpoint conditional on the context argument
> being that one.  It might look like this:
>
> Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
> 715 {
> (gdb) p *context
> $1 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
> false,
>   methods = 0xa33f40, parent = 0x0, firstchild = 0x1537f30, prevchild =
> 0x0,
>   nextchild = 0x0, name = 0xa3483f "TopMemoryContext", ident = 0x0,
>   reset_cbs = 0x0}
> (gdb) cond 1  context == 0x1483be0
>
> 5. Now repeatedly "c", and check the stack trace each time, for a
> dozen or two times to get a feeling for where the allocations are
> being requested.
>
> In some cases you might be able to find the context address in a
> more efficient way than what I suggest in #3 --- for instance,
> you could instead set a breakpoint where the context is created
> and snag its address immediately, or you could dig around in
> backend data structures to find it.  But these ways generally
> require more familiarity with the code than just watching the
> requests go by.
>


Thanks for the recipe.  I can use gdb at all, just not very skillfully :)

With that as a starting point, experimentally, this seems to work to short
circuit the loop described in your step 3 (which I fear could be thousands
of iterations in some situations):

cond 1 strcmp(context.name,"ExecutorState")==0

Also, I've found that in the last few versions of PostgreSQL, processes
might get unreasonable numbers of SIGUSR1 (maybe due to parallelization?)
and so to avoid having to stand on the 'c' button, you might need this:

handle SIGUSR1 noprint nostop

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Jeff Janes
On Mon, Apr 15, 2019 at 9:49 PM Gunther  wrote:

> Jeff Janes had more
>
> Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
>> 715 {
>> (gdb) p context->name
>> $8 = 0x96ce5b "ExecutorState"
>>
>>
> I think that the above one might have been the one you wanted.
>
> Not sure how you could tell that? It's the same place as everything else.
> If we can find out what you're looking for, may be we can set a break point
> earlier up the call chain?
>

It is just a hunch.  That size is similar to one you reported for the
last-straw allocation on the case with the nested loops rather than hash
joins. So it is reasonable (but surely not guaranteed) that they are coming
from the same place in the code.  And since that other one occurred as the
last straw, then that one must not have been part of the start-up
allocations, but rather the part where we should be at steady state, but
are not.  So maybe this one is too.


>
> I guess I should run this for a little longer. So I disable my breakpoints
>>
>>
> it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing
>> fast, so now back to gdb and break:
>>
> Unfortunately, I think this means you missed your opportunity and are now
> getting backtraces of the innocent bystanders.
>
> But why? If I see the memory still go up insanely fast, isn't that a sign
> for the leak?
>

You said it was no longer going up insanely fast by the time you got your
breakpoints re-activated.

> Particularly since you report that the version using nested loops rather
> than hash joins also leaked, so it is probably not the hash-join specific
> code that is doing it.
>
> How about it's in the DISTINCT? I noticed while peeking up the call chain,
> that it was already in the UNIQUE sort thing also.  I guess it's streaming
> the results from the hash join right into the unique sort step.
>

Isn't it crashing before any rows are emitted from the hash join?  I thought
it was, but I am not very confident on that.


> What I've done before is compile with the comments removed from
> src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */
>
> I have just done that and it creates an insane amount of output from all
> the processes, I'm afraid there will be no way to keep that stuff
> separated. If there was a way of turning that one and off for one process
> only, then we could probably get more info...
>

Are you doing all this stuff on a production server in use by other people?

> Everything is also extremely slow that way. Like in a half hour the memory
> didn't even reach 100 MB.
>
> and then look for allocation sizes which are getting allocated but not
> freed, and then you can go back to gdb to look for allocations of those
> specific sizes.
>
> I guess I should look for both, address and size to match it better.
>

You can analyze the log for specific addresses which are allocated but not
freed, but once you find them you can't do much with them.  Those specific
addresses probably won't repeat on the next run, so you can't do
anything with that knowledge.  If you find a request size that is
systematically allocated but not freed, you can condition logging (or gdb)
on that size.


> This generates a massive amount of output, and it bypasses the logging
> configuration and goes directly to stderr--so it might not end up where you
> expect.
>
> Yes, massive, like I said. Impossible to use. File system fills up
> rapidly. I made it so that it can be turned on and off, with the debugger.
>
> int _alloc_info = 0;
> #ifdef HAVE_ALLOCINFO
> #define AllocFreeInfo(_cxt, _chunk) \
> if(_alloc_info) \
> fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
> (_cxt)->header.name, (_chunk), (_chunk)->size)
> #define AllocAllocInfo(_cxt, _chunk) \
> if(_alloc_info) \
> fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
> (_cxt)->header.name, (_chunk), (_chunk)->size)
> #else
> #define AllocFreeInfo(_cxt, _chunk)
> #define AllocAllocInfo(_cxt, _chunk)
> #endif
>
> so with this I do
>
> (gdb) b AllocSetAlloc
> (gdb) cont
> (gdb) set _alloc_info=1
> (gdb) disable
> (gdb) cont
>
>
Thanks for this trick, I'll save this so I can refer back to it if I need
to do this again some time.


> then I wait, ... until it crashes again ... no, it's too much. It fills up
> my filesystem in no time with the logs.  It produced 3 GB in just a minute
> of run time.
>

You don't need to run it until it crashes, only until the preliminaries are
done and the leak has started to happen.  I would think a minute would be
more than enough, unless the leak doesn't start until some other join 

Re: Shortest offline window on database migration

2019-05-30 Thread Jeff Janes
On Thu, May 30, 2019 at 11:08 AM Haroldo Kerry  wrote:

> Hello,
>
> We are migrating our PostgreSQL 9.6.10 database (with streaming
> replication active) to a faster disk array.
> We are using this opportunity to enable checksums, so we will have to do a
> full backup-restore.
> The database size is about 500GB, it takes about 2h:30min for a full
> backup, and then about 1h to fully restore it with checksum enabled on the
> new array, plus 2h to recreate the replica on the old array.
>

As others have noticed, your "trick" won't work.  So back to basics.  Are
you using the best degree of parallelization on each one of these tasks?
What is the bottleneck of each one (CPU, disk, network)?  How are you
creating the replica?  Can you share the actual command lines for each
one?  It seems odd that the dump (which only needs to dump the index and
constraint definitions) is so much slower than the restore (which actually
needs to build those indexes and validate the constraints). Is that because
the dump is happening from the old slow disk and restore a new fast ones?
Same with creating the replica, why is that slower than actually doing the
restore?

It sounds like you are planning on blowing away the old master server on
the old array as soon as the upgrade is complete, so you can re-use that
space to build the new replica?  That doesn't seem very safe to me--what if
during the rebuilding of the replica you run into a major problem and have
to roll the whole thing back?  What will the old array which is holding the
current replica server be doing in all of this?

Cheers,

Jeff

>


Re: PostgreSQL optimizer use seq scan instead of pkey index only scan (in queries with postgres_fdw)

2019-05-06 Thread Jeff Janes
On Mon, May 6, 2019 at 11:38 AM Vitaly Baranovsky 
wrote:

> Thank you, Jeff!
>
> We'll be looking forward to the next version of Postgres in this case.
>
> As far as I understand, you've answered about sending filtering condition
> to a foreign server... Could you, please, clarify about another (the first)
> part of my question? Why the server choose seq scan instead of pk key index
> only scan for the local table?
>
> Thank you
>
>
Aren't those the same thing?  The foreign server can't use the where
clause, if it doesn't get sent.

Cheers,

Jeff


Re: PostgreSQL optimizer use seq scan instead of pkey index only scan (in queries with postgres_fdw)

2019-05-06 Thread Jeff Janes
On Mon, May 6, 2019 at 10:44 AM Vitaly Baranovsky 
wrote:

> Hello all,
>
> I faced strange behavior of PostgreSQL during the query execution.
>

 ...


> Also, please, note, that SQL without WHERE clause has been set to the
> foreign server:
> "  Remote SQL: SELECT primary_uuid FROM public.foreign_table"
>
> So, the optimizer doesn't select optimal plans for such executions :(
>

It works the way you want in version 12, which is currently under
development and should be released in 5 months or so.

Cheers,

Jeff

>


Re: PostgreSQL optimizer use seq scan instead of pkey index only scan (in queries with postgres_fdw)

2019-05-06 Thread Jeff Janes
On Mon, May 6, 2019 at 11:53 AM Jeff Janes  wrote:

> On Mon, May 6, 2019 at 11:38 AM Vitaly Baranovsky <
> barvetalfor...@gmail.com> wrote:
>
>> Thank you, Jeff!
>>
>> We'll be looking forward to the next version of Postgres in this case.
>>
>> As far as I understand, you've answered about sending filtering condition
>> to a foreign server... Could you, please, clarify about another (the first)
>> part of my question? Why the server choose seq scan instead of pk key index
>> only scan for the local table?
>>
>> Thank you
>>
>>
> Aren't those the same thing?  The foreign server can't use the where
> clause, if it doesn't get sent.
>

Nevermind. When you said local table, I had some tunnel vision and was
thinking of the foreign table as viewed from the perspective of the foreign
server (to which it is local), not the actual local table.   That too is
"fixed" in the same commit to the 12dev branch as the other issue is:

commit 4be058fe9ec5e630239b656af21fc083371f30ed
Date:   Mon Jan 28 17:54:10 2019 -0500

In the planner, replace an empty FROM clause with a dummy RTE.


My tests are all done with empty, unanalyzed tables, as I just took your DDL
without inventing my own DML, so the results may be different from what you
were seeing with your populated tables.

Cheers,

Jeff

>


Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Jeff Janes
On Sat, Apr 20, 2019 at 4:26 PM Tom Lane  wrote:

> Tomas Vondra  writes:
> > Considering how rare this issue likely is, we need to be looking for a
> > solution that does not break the common case.
>
> Agreed.  What I think we need to focus on next is why the code keeps
> increasing the number of batches.  It seems like there must be an undue
> amount of data all falling into the same bucket ... but if it were simply
> a matter of a lot of duplicate hash keys, the growEnabled shutoff
> heuristic ought to trigger.
>

The growEnabled stuff only prevents infinite loops.  It doesn't prevent
extreme silliness.

If a single 32 bit hash value has enough tuples by itself to not fit in
work_mem, then it will keep splitting until that value is in a batch by
itself before shutting off (or at least until the split-point bit of
whatever else is in that bucket happens to be the same value as the
split-point bit of the degenerate one, so by luck nothing or everything
gets moved).

Probabilistically we keep splitting until every batch, other than the one
containing the degenerate value, has about one tuple in it.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-22 Thread Jeff Janes
On Sat, Apr 20, 2019 at 4:48 PM Tom Lane  wrote:

> Gunther  writes:
> > and checked my log file and there was nothing before the call
> > MemoryContextStats(TopPortalContext) so I don't understand where this
> > printf stuff is ending up.
>
> It's going to stdout, which is likely block-buffered whereas stderr
> is line-buffered, so data from the latter will show up in your log
> file much sooner.  You might consider adding something to startup
> to switch stdout to line buffering.
>

Is there a reason to not just elog the HJDEBUG stuff?  With some of the
other DEBUG defines, we will probably be using them before the logging
system is set up, but surely we won't be doing Hash Joins that early?

I think we could just get rid of the conditional compilation and elog this
at DEBUG1 or DEBUG2.  Or keep the conditional compilation and elog it at
LOG.

Cheers,

Jeff


Re: Extremely slow HashAggregate in simple UNION query

2019-08-21 Thread Jeff Janes
On Tue, Aug 20, 2019 at 11:12 AM Felix Geisendörfer 
wrote:
 ...


> [1] My actual query had bad estimates for other reasons (GIN Index), but
> that's another story. The query above was of course deliberately designed
> to have bad estimates.
>

As noted elsewhere, v12 thwarts your attempts to deliberately design the
bad estimates.  You can still get them, you just have to work a bit harder
at it:

CREATE FUNCTION j (bigint, bigint) returns setof bigint as $$ select
generate_series($1,$2) $$ rows 1000 language sql;

EXPLAIN ANALYZE
SELECT * FROM j(1, 1) a, j(1, 1) b
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b;
QUERY PLAN

--
 HashAggregate  (cost=80021.00..100021.00 rows=200 width=16) (actual
time=11.332..13.241 rows=1 loops=1)
   Group Key: a.a, b.b
   ->  Append  (cost=0.50..70021.00 rows=200 width=16) (actual
time=0.118..0.163 rows=2 loops=1)
 ->  Nested Loop  (cost=0.50..20010.50 rows=100 width=16)
(actual time=0.117..0.118 rows=1 loops=1)
   ->  Function Scan on j a  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.087..0.088 rows=1 loops=1)
   ->  Function Scan on j b  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.027..0.027 rows=1 loops=1)
 ->  Nested Loop  (cost=0.50..20010.50 rows=100 width=16)
(actual time=0.044..0.044 rows=1 loops=1)
   ->  Function Scan on j a_1  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.022..0.022 rows=1 loops=1)
   ->  Function Scan on j b_1  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.020..0.021 rows=1 loops=1)
 Planning Time: 0.085 ms
 Execution Time: 69.277 ms
(11 rows)

But the same advance in v12 which makes it harder to fool with your test
case also opens the possibility of fixing your real case.

I've made an extension which has a function which always returns true, but
lies about how often it is expected to return true. See the attachment.
With that, you can fine-tune the planner.

CREATE EXTENSION pg_selectivities ;

EXPLAIN ANALYZE
SELECT * FROM j(1, 1) a, j(1, 1) b where pg_always(0.1)
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b where pg_always(0.1);
QUERY PLAN

--
 HashAggregate  (cost=45021.40..45021.60 rows=20 width=16) (actual
time=0.226..0.227 rows=1 loops=1)
   Group Key: a.a, b.b
   ->  Append  (cost=0.50..45021.30 rows=20 width=16) (actual
time=0.105..0.220 rows=2 loops=1)
 ->  Nested Loop  (cost=0.50..22510.50 rows=10 width=16) (actual
time=0.104..0.105 rows=1 loops=1)
   Join Filter: pg_always('1e-05'::double precision)
   ->  Function Scan on j a  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.066..0.066 rows=1 loops=1)
   ->  Function Scan on j b  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.035..0.035 rows=1 loops=1)
 ->  Nested Loop  (cost=0.50..22510.50 rows=10 width=16) (actual
time=0.112..0.113 rows=1 loops=1)
   Join Filter: pg_always('1e-05'::double precision)
   ->  Function Scan on j a_1  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.077..0.077 rows=1 loops=1)
   ->  Function Scan on j b_1  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.034..0.034 rows=1 loops=1)
 Planning Time: 0.139 ms
 Execution Time: 0.281 ms

Cheers,

Jeff


pg_selectivities.patch
Description: Binary data


Re: Extremely slow HashAggregate in simple UNION query

2019-08-24 Thread Jeff Janes
On Thu, Aug 22, 2019 at 1:09 AM Pavel Stehule 
wrote:

> čt 22. 8. 2019 v 3:11 odesílatel Jeff Janes  napsal:
>
>> ...


> But the same advance in v12 which makes it harder to fool with your test
>> case also opens the possibility of fixing your real case.
>>
>
> I think so much more interesting should be long time after query
> processing - last row was processed in 13ms, but Execution Time was 69ms ..
> so some cleaning is 56ms - that is pretty long.
>

Most of the time is not after the clock stops, but before the stepwise
ANALYZE clock starts.  If you just do an EXPLAIN rather than EXPLAIN
ANALYZE, that is also slow.  The giant hash table is created during the
planning step (or somewhere around there--I notice that EXPLAIN ANALYZE
output doesn't count it in what it labels as the planning step--but it is
some step that EXPLAIN without ANALYZE does execute, which to me makes it a
planning step).

For me, "perf top" shows kernel's __do_page_fault as the top
function.  tuplehash_iterate does show up at 20% (which I think is
overattributed, considering how little the speedup is when dropping
ANALYZE), but everything else just looks like kernel memory management code.

Cheers,

Jeff


Re: Erratically behaving query needs optimization

2019-09-02 Thread Jeff Janes
On Mon, Aug 26, 2019 at 4:26 AM Barbu Paul - Gheorghe <
barbu.paul.gheor...@gmail.com> wrote:

> On Sun, Aug 25, 2019 at 5:51 PM Jeff Janes  wrote:
> >
> > Yes, it certainly looks like it is due to cold caches.  But you say it
> is slow at first, and then say it varies greatly during a run.  Is being
> slow at first the only way it varies greatly, or is there large variation
> even beyond that?
>
> There is a great variation in run times (hundreds of ms to several
> seconds) even beyond the start of the server.
> The query runs several times with a different device_id, object_id and
> another list of attribute_ids and it varies from one another.
>

If you run the exact same query (with the same parameters) once the cache
is hot, is the performance then pretty consistent within a given
parameterization?  Or is it still variable even within one parameterization?

If they are consistent, could you capture a fast parameterization and a slow
parameterization and show them and the plans for them?


> > You can use pg_rewarm to overcome the cold cache issue when you first
> start up the server.
>
> I cannot find anything related to pg_rewarm other than some dead ends
> from 2013 from which I gather it only works on Linux.
> Anyway, I have problems even beyond the start of the database, it's
> just easier to reproduce the problem at the start, otherwise I have to
> leave the application running for a while (to invalidate part of the
> cache I think).
>

Sorry, should have been pg_prewarm, not pg_rewarm. Unfortunately, you
probably have two different problems.  Reproducing it one way is unlikely
to help you solve the other one.
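
A sketch of what the warm-up looks like (pg_prewarm ships in contrib, so it is
not Linux-specific; the index name here is hypothetical):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('results');                   -- load the heap into shared_buffers
SELECT pg_prewarm('results_attribute_id_idx');  -- and an index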


> > If you query only on "results" with only the conditions that apply to
> "results", what is the expected number of rows, and what is the actual
> number of rows?
>
> Explain for the query on results only: https://explain.depesz.com/s/Csau


>
> EXPLAIN (ANALYZE,BUFFERS)
>  SELECT DISTINCT ON (results.attribute_id) results.timestamp,
> results.data FROM results
>  WHERE
>  results.data <> ''
>  AND results.data IS NOT NULL
>  AND results.object_id = 1955
>  AND results.attribute_id IN (4, 5) -- possibly a longer list here
>  AND results.data_access_result = 'SUCCESS'
>  ORDER BY results.attribute_id, results.timestamp DESC
>  LIMIT 2 -- limit by the length of the attributes list
>
> Limit  (cost=166793.28..167335.52 rows=2 width=54) (actual
> time=134783.510..134816.941 rows=2 loops=1)
>   Buffers: shared hit=19086 read=46836, temp read=1522 written=3311
>   ->  Unique  (cost=166793.28..168420.01 rows=6 width=54) (actual
> time=134783.507..134816.850 rows=2 loops=1)
> Buffers: shared hit=19086 read=46836, temp read=1522 written=3311
> ->  Sort  (cost=166793.28..167606.64 rows=325346 width=54)
> (actual time=134783.505..134802.602 rows=205380 loops=1)
>   Sort Key: attribute_id, "timestamp" DESC
>

Do you have an index on (attribute_id, "timestamp" DESC)?  That might
really help if it can step through the rows already sorted, filter out the
ones that need filtering out (building the partial index might help here),
hit the other two tables for each of those rows using a nested loop, and
stop after 2 rows which meet those conditions.  The problem is if you have
to step through an enormous number of rows before finding 2 of them with
device_id=97.
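
Something along these lines is what I have in mind (a sketch only; the index
name is made up, and you may want to tailor the partial-index predicate to
whichever conditions you always apply):

CREATE INDEX results_attr_ts_partial_idx
    ON results (attribute_id, "timestamp" DESC)
    WHERE data IS NOT NULL AND data <> ''
      AND data_access_result = 'SUCCESS';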


> So maybe I should de-normalize and place the device_id column into the
> "results" table and add it to the index in your suggestion above?
>

Yes, if nothing else works, that should.  How hard would it be to maintain
that column in the correct state?
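
For what it is worth, the usual way to keep such a denormalized column correct
is a trigger on "results".  A rough sketch, assuming (and this is a guess at
your schema) that device_id currently lives in a table called "objects" which
results.object_id points at:

ALTER TABLE results ADD COLUMN device_id integer;

UPDATE results r SET device_id = o.device_id
FROM objects o WHERE o.id = r.object_id;

CREATE FUNCTION results_fill_device_id() RETURNS trigger AS $$
BEGIN
    SELECT o.device_id INTO NEW.device_id
    FROM objects o WHERE o.id = NEW.object_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_results_fill_device_id
    BEFORE INSERT OR UPDATE OF object_id ON results
    FOR EACH ROW EXECUTE PROCEDURE results_fill_device_id();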

Cheers,

Jeff


Re: Upsert performance considerations (~1 mil/hour)

2019-09-04 Thread Jeff Janes
On Wed, Sep 4, 2019 at 1:30 PM Fredrik Blomqvist <
fredrik.blomqvist...@gmail.com> wrote:

> Hi,
>
> I have tried doing some research for quite a while on the performance
> implications of the built-in upsert (INSERT ... ON CONFLICT UPDATE...) when
> a lot of upserts are made. The scale is something like 1 million
> records/hour, that is split up in groups of around 300 records each.
>

How is that done?  300 single-valued insert statements, grouped into one
transaction?  One 300-valued insert statement?


> So far, one hypothesis is that this project seems to be suffering from the
> large amount of writes that happen constantly since even if the upsert
> results in no inserts/updates, the "failed" inserts from the upsert will
> still get written somewhere (according to our knowledge).
>

You can suppress redundant updates with a trigger, as described at
https://www.postgresql.org/docs/current/functions-trigger.html.  This works
even for updates that are the result of insert..on conflict..update.  There
is still some writing, as each tuple does get locked, but it is much less
(at least from a WAL perspective).  You can also put a WHERE clause on
the DO UPDATE so it only updates if a field has changed, but you have to
list each field connected with OR.
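
Roughly, with made-up table and column names, the two approaches look like
this (suppress_redundant_updates_trigger() is the built-in function from that
doc page):

CREATE TRIGGER skip_redundant_updates
    BEFORE UPDATE ON mytable
    FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();

INSERT INTO mytable (id, col1, col2)
VALUES (1, 'a', 'b')
ON CONFLICT (id) DO UPDATE
    SET col1 = EXCLUDED.col1,
        col2 = EXCLUDED.col2
    WHERE mytable.col1 IS DISTINCT FROM EXCLUDED.col1
       OR mytable.col2 IS DISTINCT FROM EXCLUDED.col2;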


> Therefore, the idea is to utilize old-fashioned upserts (writeable CTEs)
> and do more granular operations that can make sure to only insert data that
> doesn't already exist, and only update data that has actually changed.
> Naturally, however, this will put more read-load on the DB and increase
> query complexity.
>

It shouldn't put a meaningful additional read load on the database, as the
ON CONFLICT code still needs to do the read as well.  Yes, it makes the
code slightly more complex.

Cheers,

Jeff

>


Re: Bitmap heap scan performance

2019-08-09 Thread Jeff Janes
On Fri, Aug 9, 2019 at 4:42 AM Rob Emery  wrote:


>
> It
> seems to me like the Bitmap Heap Scan on proposal is the issue because
> the recheck is throwing away enormous amounts of data.


Have you tried increasing work_mem?  The probable reason for the recheck is
that your bitmap overflows the allowed memory, and then switches
from storing every tid to storing just the block numbers.  As indicated by
the lossy part of "Heap Blocks: exact=3983 lossy=27989"
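
For example, just for the session you are testing in (the value is only
something to experiment with, not a recommendation):

SET work_mem = '256MB';
-- then re-run the EXPLAIN (ANALYZE, BUFFERS) and check whether
-- "Heap Blocks" still shows a "lossy=" component

If the lossy blocks disappear, the recheck should go away with them.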

> The
> has_been_anonymised flag on the proposal is effectively a soft-delete;
> so I’ve tried adding something like :
>
> CREATE INDEX ON proposal.proposal (system_id, legacy_organisation_id,
> reference)
> WHERE has_been_anonymised = false;
>
> Which I was hoping would shrink the size of the index significantly
>

The partial index should be smaller, but when comparing to the index with
"has_been_anonymised" as the leading column, it won't make a lot of
difference.  You only have to scan a smaller part of the larger index, and
the sizes of the parts of the index you have to scan in each case will be
roughly comparable.


> and encourage an index scan rather than bitmap, however it didn’t have
> that effect.


To encourage index scans over bitmap scans, you can increase
effective_cache_size.  Or to really force the issue, you can "set
enable_bitmapscan=off", but that is something you would usually do locally
for experimental purposes, not something to put in production's config settings.

Cheers,

Jeff


Re: Huge shared hit for small table

2019-11-04 Thread Jeff Janes
On Mon, Nov 4, 2019 at 3:38 PM Scott Rankin  wrote:

> Definitely no long-running transactions on this table;
>

Any long running transactions at all?  The lock on the table is only
necessary to explain why the problem would have gone away at the same time
as the reindex finished.  If there is a long running transaction which
doesn't touch this table, it would still cause the problem. It is just that
the reindex would not solve the problem (because the
not-entirely-dead-yet tuples would have to be copied into the new index),
and with no lock there is no reason for them to be correlated in time,
other than sheer dumb luck.

Does another reindex solve the problem again?

>  in fact, this table is pretty infrequently updated – on the order of a
few tens of rows updated per day.

That would seem to argue against this explanation, but against all the other
ones too, I think.  But a few tens of rows per day and a transaction left open
for a few tens of days, and you could get enough zombie tuples to add up to
trouble.  Particularly if there is one row (as defined by prog.id) which is
seeing both most of those updates and most of the index-scan activity.

But now I am curious, if it is a small table and the index scan is going to
be invoked 21,956 times in one query, it seems like it should hash it
instead.  Does it misestimate how often that index scan is going to get
invoked? (assuming the index scan is the 2nd child of a nested loop, what
is the expected and actual row count of the 1st child of that loop?)

Cheers,

Jeff

>


Re: Huge shared hit for small table

2019-11-04 Thread Jeff Janes
On Mon, Nov 4, 2019 at 2:38 PM Scott Rankin  wrote:

> Hello all,
>
>
>
> We are trying to debug some slow performance in our production environment
> (Amazon RDS, Postgresql 9.6.11), and we’re looking at a particular EXPLAIN
> node that seems… weird.  This is a very large query involving a number of
> joins, but it performs pretty well in our staging environment (which has
> roughly the same data set as production, with a few tweaks).  However,
> there is one node in the EXPLAIN plan that is wildly different:
>

Could there be a long-open transaction, which is preventing hint-bits from
getting set on the table rows, as well as on the index rows?

...


> The tables in both environments are about the same size (18MB) and the
> indexes are about the same size (360kb/410kb) – and the shared hits are
> pretty much the same on the other nodes of the query between the two
> environments.
>

If this table has more turn-over than those other tables (as measured in
rows, not in percentage of the table), this would not be inconsistent with
my theory.


> This has happened one time before, and we did a “REINDEX” on the program
> table – and that made the problem mostly go away.  Now it seems to be back,
> and I’m not sure what to make of it.
>


A reindex would not by itself fix the problem if it were the long open
transaction.  But  if the long open transaction held a sufficient lock on
the table, then the reindex would block until the transaction went away on
its own, at which point the problem would go away on its own, so it might
**appear** to have fixed the problem.

Cheers,

Jeff

>


Re: GIN index on JSONB not used due to lack of nested statistics

2019-10-30 Thread Jeff Janes
On Wed, Oct 30, 2019 at 12:25 PM Alessandro Baretta 
wrote:


>  ->  Bitmap Index Scan on idx_object
> (cost=0.00..75580.00 rows=1 width=0) (actual time=24.094..24.094
> rows=77 loops=1)
>Index Cond: ((... @> ...::jsonb) AND (...
> @> ...::jsonb))
>  Planning time: 0.301 ms
>  Execution time: 24.723 ms
> (14 rows)
>
> It would seem that this miscalculation of the cost of the index scan
> is due to the query planner lacking detailed statistics about the
> relevant JSONB column.


Since it expected 1 row but actually found 77, I think that if it had
accurate statistics it would have overestimated the costs by even more.

Can you repeat the executions with "EXPLAIN (ANALYZE, BUFFERS)"?

How does the cost estimate change if you make effective_cache_size much
larger or much smaller? (No need for ANALYZE, just the cost estimate)

What kind of performance do you get if you turn enable_seqscan off and then
repeat the query from a cold start (restart PostgreSQL, then run sudo sh -c
"echo 3 > /proc/sys/vm/drop_caches")?  If the performance is very fast
after a cold start, then something is wrong with the planner estimate.  If
it is slow from a cold start, then the planner has at least a plausible
basis for charging as much as it does.

If you run the query with just one branch of your AND at a time, what is
the expected and actual number of rows?


> The workaround I found so far is to set a low value of
> random_page_cost, but this could result in the query planner using
> index scans for other tables and other queries, where a seq scan would
> actually be more appropriate.
>

Based on what you know about your IO system, and the cacheability of your
data, what is the appropriate setting of random_page_cost from first
principles?  Maybe it is those other queries which have the problem, not
this one.

If you can come up with a random object generator which creates data structured
similarly to yours, and which shows the same issue when run with disclosable
queries, that would help us look into it.

Also, what version are you running?

Cheers,

Jeff


Re: Slow "not in array" operation

2019-11-13 Thread Jeff Janes
On Wed, Nov 13, 2019 at 6:56 AM Marco Colli  wrote:

> > the answer is that is because it is a GIN index. Make the same index
> only as btree, and you should get good performance as it can filter the
> tags within a given project without visiting the table.
>
> Currently I have this GIN index:
> "index_subscriptions_on_project_id_and_tags" gin (project_id, tags)
> WHERE trashed_at IS NULL
>
>
Multicolumn GIN indexes are nearly worthless IMO when one column is a
scalar.  You can use this index, but it won't be better than one just on
"GIN (tags) WHERE trashed_at IS NULL".  An N-column GIN index is mostly the same
thing as N single-column GIN indexes.


> It uses the btree_gin extension and works perfectly for tag search, except
> for the "NOT" operator. I don't understand why it doesn't use the GIN index
> also for the "NOT" operator.
>

Because it can't.  Tom already did a good job of describing that. Can you
describe what steps you think an index should take to jump to the specific
rows which fail to exist in an inverted index?


> The problem is that I cannot create the same index with BTree, because PG
> doesn't support BTree on array :(
>

Sure it does.  It can't jump to specific parts of the index based on the
array containment operators, but it can use them for in-index filtering
(but only if you can do an index-only scan).  And really, that is probably
all you need to get > 100x improvement.
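
A sketch of what I mean (the index name is invented; plain btree, no btree_gin
needed):

CREATE INDEX index_subscriptions_on_project_id_and_tags_btree
    ON subscriptions USING btree (project_id, tags)
    WHERE trashed_at IS NULL;

The project_id = 123 part is used to descend the btree, and the NOT (tags @>
'{en}') part can then be checked from the index tuple itself during an
index-only scan, without visiting the heap (provided the table is well
vacuumed).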

Are you getting an error when you try to build it?  If so, what is the
error?

Cheers,

Jeff

>


Re: Slow "not in array" operation

2019-11-13 Thread Jeff Janes
On Wed, Nov 13, 2019 at 4:20 AM Marco Colli  wrote:

> Replying to the previous questions:
> - work_mem = 64MB (there are hundreds of connections)
> - the project 123 has more than 7M records, and those that don't have the
> tag 'en' are 4.8M
>
>
>> What was the plan for the one that took 500ms?
>
>
> This is the query / plan without the filter on tags:
>
> SELECT COUNT(*) FROM "subscriptions" WHERE "subscriptions"."project_id" =
> 123 AND "subscriptions"."trashed_at" IS NULL;
>
>  QUERY PLAN
>
>
> 
>  Finalize Aggregate  (cost=291342.67..291342.68 rows=1 width=8) (actual
> time=354.556..354.556 rows=1 loops=1)
>->  Gather  (cost=291342.05..291342.66 rows=6 width=8) (actual
> time=354.495..374.305 rows=7 loops=1)
>  Workers Planned: 6
>  Workers Launched: 6
>  ->  Partial Aggregate  (cost=290342.05..290342.06 rows=1 width=8)
> (actual time=349.799..349.799 rows=1 loops=7)
>->  Parallel Index Only Scan using
> index_subscriptions_on_project_id_and_uid on subscriptions
>  (cost=0.56..287610.27 rows=1092713 width=0) (actual time=0.083..273.018
> rows=1030593 loops=7)
>  Index Cond: (project_id = 123)
>  Heap Fetches: 280849
>  Planning Time: 0.753 ms
>  Execution Time: 374.483 ms
> (10 rows)
>

My previous comment about the bitmap index scan taking half the time was a
slip of the eye: I was comparing the *cost* of the bitmap index scan to the
*time* of the overall plan.  But then the question is, why isn't it doing
an index-only scan on "index_subscriptions_on_project_id_and_tags"?  And
the answer is that it is because it is a GIN index.  Build the same index
as a btree instead, and you should get good performance, as it can filter the
tags within a given project without visiting the table.

Cheers,

Jeff

>


Re: Parallel Query

2019-11-13 Thread Jeff Janes
On Wed, Nov 13, 2019 at 3:59 PM Luís Roberto Weck <
luisrobe...@siscobra.com.br> wrote:

>
>
> Indeed, reducing the costs made the query run in parallel, but the
> improvement in speed was not worth the cost (CPU).
>

Could you show the plan for that?


Re: Parallel Query

2019-11-13 Thread Jeff Janes
On Wed, Nov 13, 2019 at 3:11 PM Luís Roberto Weck <
luisrobe...@siscobra.com.br> wrote:

> Hi!
>
> Is there a reason query 3 can't use parallel workers? Using q1 and q2
> they seem very similar but can use up to 4 workers to run faster:
>
> q1: https://pastebin.com/ufkbSmfB
> q2: https://pastebin.com/Yt32zRNX
> q3: https://pastebin.com/dqh7yKPb
>
> The sort node on q3 takes almost 12 seconds, making the query run on 68
> if I had set enough work_mem to make it all in memory.
>

The third one thinks it is going to find 3454539 output rows.  If it ran in
parallel, it thinks it would be passing lots of rows up from the parallel
workers, and charges a high price (parallel_tuple_cost = 0.1) for doing
so.  So you can try lowering parallel_tuple_cost, or figuring out why the
estimate is so bad.

Cheers,

Jeff


Re: Parallel Query

2019-11-13 Thread Jeff Janes
On Wed, Nov 13, 2019 at 4:01 PM Luís Roberto Weck <
luisrobe...@siscobra.com.br> wrote:

>
> Maybe PostgreSQL can't find a way to calculate having estimates?
>

I wasn't even thinking of the HAVING estimates; I was thinking of just the
raw aggregates.  It can't implement the HAVING until it has the raw aggregate
in hand. But, what is the actual row count without the HAVING?  Well, I
notice now this line:

Rows Removed by Filter: 6787359

So the row count of rows=86 is mostly due to the HAVING, not due to the raw
aggregation, a point I overlooked initially.  So the planner is not
mistaken in thinking that a huge number of rows need to be passed up--it is
correct in thinking that.

Cheers,

Jeff


Re: Slow "not in array" operation

2019-11-12 Thread Jeff Janes
>
>
> 3) Here's the query plan that I get after disabling the seq scan:
>
>
>   QUERY PLAN
>
>
>
> ---
>
>  Finalize Aggregate  (cost=2183938.89..2183938.90 rows=1 width=8) (actual
> time=94972.253..94972.254 rows=1 loops=1)
>
>->  Gather  (cost=2183938.16..2183938.87 rows=7 width=8) (actual
> time=94952.895..95132.626 rows=8 loops=1)
>
>  Workers Planned: 7
>
>  Workers Launched: 7
>
>  ->  Partial Aggregate  (cost=2182938.16..2182938.17 rows=1
> width=8) (actual time=94950.958..94950.958 rows=1 loops=8)
>
>->  Parallel Bitmap Heap Scan on subscriptions
> (cost=50294.50..2180801.47 rows=854677 width=0) (actual
> time=1831.342..94895.208 rows=611828 loops=8)
>
>  Recheck Cond: ((project_id = 123) AND (trashed_at IS
> NULL))
>
>  Rows Removed by Index Recheck: 2217924
>
>  Filter: (NOT (tags @> '{en}'::character varying[]))
>
>  Rows Removed by Filter: 288545
>
>  Heap Blocks: exact=120301 lossy=134269
>
>  ->  Bitmap Index Scan on
> index_subscriptions_on_project_id_and_tags  (cost=0.00..48798.81
> rows=6518094 width=0) (actual time=1493.823..1493.823 rows=7203173 loops=1)
>
>Index Cond: (project_id = 123)
>
>  Planning Time: 1.273 ms
>
>  Execution Time: 95132.766 ms
>
> (15 rows)
>

What was the plan for the one that took 500ms?  I don't see how it is
possible that this one is 180 times slower than that one.  Maybe a hot
cache versus cold cache?  Also, it seems weird to me that "trashed_at IS
NULL" shows up in the recheck but not in the original  Index Cond.
Increasing work_mem can also help, but since the  Bitmap Index Scan itself
took half the time there is only so much it can do.

Cheers,

Jeff


Re: Simple DELETE on modest-size table runs 100% CPU forever

2019-11-15 Thread Jeff Janes
On Thu, Nov 14, 2019 at 5:20 PM Craig James  wrote:

> I'm completely baffled by this problem: I'm doing a delete that joins
> three modest-sized tables, and it gets completely stuck: 100% CPU use
> forever.  Here's the query:
>
>
> Aggregate  (cost=193.54..193.55 rows=1 width=8)
>   ->  Nested Loop Semi Join  (cost=0.84..193.54 rows=1 width=0)
> Join Filter: (categories.id = c.id)
> ->  Index Scan using i_categories_category_id on categories
>  (cost=0.42..2.44 rows=1 width=4)
>   Index Cond: (category_id = 23)
> ->  Nested Loop Anti Join  (cost=0.42..191.09 rows=1 width=4)
>   Join Filter: (c.id = st.id)
>   ->  Index Scan using i_categories_category_id on categories
> c  (cost=0.42..2.44 rows=1 width=4)
> Index Cond: (category_id = 23)
>   ->  Seq Scan on category_staging_23 st  (cost=0.00..99.40
> rows=7140 width=4)
>


If the estimates were correct, this shouldn't be slow.  But how can it
screw up the estimate for this by so much, when the conditions are so simple?
How many rows are there actually in categories where category_id=23?

What do you see in `select * from pg_stats where tablename='categories' and
attname='category_id' \x\g\x`?

Since it thinks the seq scan of  category_staging_23 is only going to
happen once (at the bottom of two nested loops, but each executing just
once) it sees no benefit in hashing that table.  Of course it is actually
happening a lot more than once.

Cheers,

Jeff


Re: Simple DELETE on modest-size table runs 100% CPU forever

2019-11-16 Thread Jeff Janes
On Fri, Nov 15, 2019 at 7:27 PM Craig James  wrote:

> On Fri, Nov 15, 2019 at 2:45 PM Jeff Janes  wrote:
> BTW, I'll note at this point that "analyze category_staging_8" prior to
> this query made no difference.
>

Isn't that the wrong table to have analyzed? The offender here is
"categories", not  "category_staging_8".  Is this some sort of inheritance
situation?


>
>> What do you see in `select * from pg_stats where tablename='categories'
>> and attname='category_id' \x\g\x`?
>>
>
> db=> select * from pg_stats where tablename='categories' and
> attname='category_id' \x\g\x;
> Expanded display is on.
>


> ...
>
n_distinct | 21
> most_common_vals   |
> {4,3,2,10,11,13,12,16,9,6,7,5,15,23,14,25,24,1,26,28,27}
> most_common_freqs  |
> {0.2397,0.159933,0.0926667,0.0556,0.0555667,0.0546333,0.0525333,0.0439,0.0426667,0.0346333,0.0331,0.0302333,0.0288333,0.0240667,0.0224,0.0122333,0.011,0.0035,0.0023,0.00037,0.0001}
>

There is a path in the analyze code where if the least-seen value in the
sample was seen more than once (i.e. no value was seen exactly once) then
it assumes that the seen values are all the values that exist.  I think the
logic behind that is dubious.  I think it is pretty clear that that is
kicking in here.  But why?  I think the simple answer is that you analyzed
the wrong table, and the statistics shown here might be accurate for some
time in the past but are no longer accurate.  It is hard to see how a value
present 5000 times in a table of 274602 rows could have evaded sampling if
it were present at the time the sample was done.

Cheers,

Jeff


Re: [External] Join queries slow with predicate, limit, and ordering

2019-12-02 Thread Jeff Janes
On Mon, Dec 2, 2019 at 8:29 AM Aufar Gilbran  wrote:

> Hello,
>
> I'm trying to figure out how to optimise 3-table (many-to-many relation)
> joins
> with predicate, limit, and ordering, where one of the tables returns at
> most one
> row.
>
> This is the query that I have right now:
>
> SELECT entity.id
> FROM (
> SELECT entity_tag.entity_id
> FROM tag
> JOIN entity_tag ON tag.id = entity_tag.tag_id
> WHERE tag.key = 'status'
>   AND tag.value = 'SUCCEEDED'
> ) matched
> JOIN entity ON matched.entity_id = entity.id
> WHERE entity.type = 'execution'
> ORDER BY entity.id DESC
> LIMIT 10;
>

What happens if you set enable_sort to off before running it?


> ->  Nested Loop  (cost=1.28..723.38 rows=1 width=4) (actual
> time=0.153..5590.717 rows=89222 loops=1)
>

It thinks it will find 1 row, and actually finds 89,222.  I don't know
exactly why that would be, I suppose tag_id has an extremely skewed
distribution.  But yeah, that is going to cause some problems.  For one
thing, if there was actually just one qualifying row, then it wouldn't get
to stop early, as the LIMIT would never be satisfied.  So it thinks that if
it chose to walk the index backwards, it would have to walk the **entire**
index.


->  Index Only Scan using
> entity_tag_tag_id_entity_id_idx on public.entity_tag (cost=0.43..711.53
> rows=201 width=16) (actual time=0.035..756.829 rows=89222 loops=1)
>   Heap Fetches: 89222
>

You should vacuum this table.  Doing that (and only that) probably won't
make a great deal of difference to this particular query, but still, it
will help some.  And might help other ones you haven't noticed yet as well.


>
> Both tag_key_value_key and entity_tag_tag_id_entity_id_idx is a UNIQUE
> constraint on tag(key,value) and entity_tag(tag_id, entity_id)
> respectively.
>
> It seems to me that PostgreSQL runs the nested loop against all of the 90K
> records because it wants to sort the result before limiting the result.


It doesn't **know** there are going to be 90K records.  It cannot plan
queries based on knowledge it doesn't possess.


> It
> doesn't take into account of the UNIQUE constraint imposed on the table and
> thinks that the join being done inside the subquery will change the
> ordering of
> entity_id returned by the subquery, thus prompting the sort.
>

This seems like rather adventurous speculation.  It does the sort because
the horrible estimation makes it think it will be faster that way, not
because it thinks it is the only possible way.  If you set enable_sort =
off and it still does a sort, then you know it thinks there is no other way.



>
> I believe with how the index sorted, it should be able to just scan the
> index
> backwards because at most only one tag_id will be returned. When I tried
> changing the predicate here to filter by ID with the following query:
>
> -- This runs very fast
> SELECT entity.id
> FROM (
> SELECT entity_tag.entity_id
> FROM tag
> JOIN entity_tag ON tag.id = entity_tag.tag_id
> WHERE tag.id = 24
> ) matched
> JOIN entity ON matched.entity_id = entity.id
> WHERE entity.type = 'execution'
> ORDER BY entity.id DESC
> LIMIT 10;
>

With this query, it can use the join condition to transfer the knowledge of
tag.id=24 to become entity_tag.tag_id=24, and then look up stats on
entity_tag.tag_id for the value 24.  When you specify the single row of tag
indirectly, it can't do that as it doesn't know what specific value of
tag.id is going to be the one it finds (until after the query is done being
planned and starts executing, at which point it is too late).  But the row
with id=24 doesn't seem to be the same one with "tag.key = 'status' AND
tag.value = 'SUCCEEDED'", so you have basically changed the query entirely
on us.

If you replanned this query with ORDER BY entity.id+0 DESC (and with the
true value of tag_id), that might give you some more insight into the hidden
"thought process" behind the planner.

Cheers,

Jeff


Re: autovacuum locking question

2019-12-06 Thread Jeff Janes
On Fri, Dec 6, 2019 at 12:50 PM MichaelDBA  wrote:

> And Just to reiterate my own understanding of this...
>
> autovacuum priority is less than a user-initiated request, so issuing a
> manual vacuum (user-initiated request) will not result in being cancelled.
>

Some things happen in some situations and not in others.  I don't know that
it is useful to categorize them into a monotonic priority scale.

Autovacs "to prevent wraparound" don't get cancelled the way ordinary
autovacs do, but they still use autovac IO throttling settings, not the
unthrottled (by default settings) manual vacuum settings, which can be a
major problem sometimes.

Note that no kind of vacuum should normally get cancelled via the
signalling mechanism during the truncation phase; that seems to be due to some
rather extreme situation with IO congestion.

Cheers,

Jeff


Re: Logical replication performance

2019-12-09 Thread Jeff Janes
On Fri, Nov 29, 2019 at 11:06 AM Florian Philippon <
florian.philip...@doctolib.com> wrote:

>
> We tried another solution: we loaded a minimal schema (without indexes and
> constraints) on the subscriber and created the subscription. The initial
> copy phase was way faster (a few hours). Then we created indexes and
> constraints. Is this a suitable solution for production?
>

This is probably not suitable for production.  Once the COPY is finished,
it still has to replicate row-by-row changes to the table rows which
occurred since the starting COPY snapshot.  UPDATEs and DELETEs will
probably fail due to the lack of indexes on the “replica identity”
columns.  This failure will make the entire transaction, including the
COPY, roll back to the beginning.  So there will be no point at which you
can build the missing indexes without first losing all the work that was
done.  If the master was quiescent (at least in regards to UPDATEs and
DELETEs) then there will be no row-by-row changes to apply between the
start of the COPY and the start of transactional replication.  In that
case, the COPY will have committed before the system discovers the problem
with the “replica identity”, giving you an opportunity to go build the
index without losing all of the work.
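
If you do end up in that quiescent-master scenario, the follow-up on the
subscriber is just to build the unique index and point the replica identity
at it (the names here are made up, and the indexed column has to be NOT NULL
for REPLICA IDENTITY USING INDEX):

CREATE UNIQUE INDEX CONCURRENTLY mytable_id_idx ON mytable (id);
ALTER TABLE mytable REPLICA IDENTITY USING INDEX mytable_id_idx;
-- or, more simply, if a primary key is what was left out:
ALTER TABLE mytable ADD PRIMARY KEY (id);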



> Will the logical replication flow be buffered by the replication slots
> during index creation and get in sync afterwards or will it conflict due to
> locking issues?
>

It can't buffer in the middle of the transaction which includes the initial
COPY.

Cheers,

Jeff


Re: Consecutive Query Executions with Increasing Execution Time

2019-12-17 Thread Jeff Janes
On Tue, Dec 17, 2019 at 8:08 AM Laurenz Albe 
wrote:

> On Mon, 2019-12-16 at 15:50 -0500, Tom Lane wrote:
> > Peter Geoghegan  writes:
> > > Why do the first and the twentieth executions of the query have almost
> > > identical "buffers shared/read" numbers? That seems odd.
> >
> > It's repeat execution of the same query, so that doesn't seem odd to me.
>
> Really?  Shouldn't the blocks be in shared buffers after a couple
> of executions?
>

If it is doing a seq scan (I don't know if it is), seq scans intentionally use a
small ring buffer, so they evict their own recently used blocks rather
than evicting other people's blocks.  So these blocks won't build up in
shared_buffers very rapidly just on the basis of repeated seq scans.

Cheers,

Jeff


Re: Specific query taking time to process

2019-12-11 Thread Jeff Janes
On Tue, Dec 10, 2019 at 3:40 AM Fahiz Mohamed  wrote:

> Thank you very much for your prompt responses.
>
> I have analysed more regarding this and found the long running query.
>
> I ran "explain analyse" on this query and I got following result. (We have
> 2 identical DB instances and they consist of same data. Instane 1 took 20+
> second to process and instance 2 took less than a second)
>

They do not consist of the same data.  One returns 17 rows, the other 22.

One finds 5635 rows (scattered over 40765 blocks!) where qname_id = 251,
the other finds 85 rows for the same condition.  It seems the first one is
not very well vacuumed.

I don't know if these differences are enough to be driving the different
plans (the estimation differences appear smaller than the actual
differences), but clearly the data is not the same.

Your first query is using the index idx_alf_node_mdq in a way which seems
to be counter-productive.  Perhaps you could inhibit it to see what plan it
chooses then.  For example, specify in your query "type_qname_id+0 = 240"
to prevent the use of that index.  Or you could drop the index, if it is
not vital.

But if the data has not been ANALYZEd recently, you should do that before
anything else.  Might as well make it a VACUUM ANALYZE.

Cheers,

Jeff

>


Re: autovacuum locking question

2019-12-05 Thread Jeff Janes
On Thu, Dec 5, 2019 at 5:26 PM Mike Schanne  wrote:

> Hi,
>
> I am investigating a performance problem in our application and am seeing
> something unexpected in the postgres logs regarding the autovacuum.
>
>
>
> 2019-12-01 13:05:39.029 
> UTC,"wb","postgres",6966,"127.0.0.1:53976",5ddbd990.1b36,17099,"INSERT
> waiting",2019-11-25 13:39:28 UTC,12/1884256,12615023,LOG,0,"process
> 6966 still waiting for RowExclusiveLock on relation 32938 of database 32768
> after 1000.085 ms","Process holding the lock: 6045. Wait queue:
> 6966.","INSERT INTO myschema.mytable (...) VALUES (...) RETURNING
> process.mytable.mytable_id",13,,""
>
> 2019-12-01 13:05:39.458 UTC,,,6045,,5de3b800.179d,1,,2019-12-01 12:54:24
> UTC,10/417900,0,ERROR,57014,"canceling autovacuum task","automatic
> vacuum of table ""postgres.myschema.mytable"""""
>
>
>
> My understanding from reading the documentation was that a vacuum can run
> concurrently with table inserts/updates, but from reading the logs it
> appears they are conflicting over a row lock.  This particular table gets
> very frequent inserts/updates (10-100 inserts / sec) so I am concerned that
> if the autovacuum is constantly canceled, then the table never gets cleaned
> and its performance will continue to degrade over time.  Is it expected for
> the vacuum to be canceled by an insert in this way?
>
>
>
> We are using postgres 9.6.10.
>

If the vacuum finds a lot of empty pages at the end of the table, it will
try to truncate them, and it takes a strong lock to do so.  It is supposed to
check every 20ms to see if anyone else is blocked on that lock, at which
point it stops doing the truncation and releases the lock.  So it should
never get "caught" holding the lock in order to be cancelled.  Is your
setting for deadlock_timeout much lower than usual?  Also, if the
truncation is bogged down in very slow IO, perhaps it doesn't actually get
around to checking every 20ms despite its intentions.

How often have you seen it in the logs?

Cheers,

Jeff

>


Re: autovacuum locking question

2019-12-06 Thread Jeff Janes
On Fri, Dec 6, 2019 at 10:55 AM Mike Schanne  wrote:

> The error is not actually showing up very often (I have 8 occurrences from
> 11/29 and none since then).  So maybe I should not be concerned about it.
> I suspect we have an I/O bottleneck from other logs (i.e. long checkpoint
> sync times), so this error may be a symptom rather than the cause.
>

I think that at the point it is getting cancelled, it has done all the work
except the truncation of the empty pages, and reporting the results (for
example, updating n_live_tup  and n_dead_tup).  If this happens
every single time (neither last_autovacuum nor last_vacuum ever advances)
it will eventually cause problems.  So this is mostly a symptom, but not
entirely.  Simply running a manual vacuum should fix the reporting
problem.  It is not subject to cancelling, so it will detect it is blocking
someone and gracefully bow out, meaning it will suspend the truncation but
will still report its results as normal.

Reading the table backwards in order to truncate it might be contributing
to the IO problems as well as being a victim of those problems.  Upgrading
to v10 might help with this, as it implemented a prefetch where it reads
the table forward in 128kB chunks and then jumps backwards one chunk at a
time, rather than just reading backwards 8kB at a time.

Cheers,

Jeff

>


Re: Specific query taking time to process

2019-12-11 Thread Jeff Janes
On Wed, Dec 11, 2019 at 5:21 PM Fahiz Mohamed  wrote:

> There is a slight different in both instance’s data. Inastanbce 1 contains
> latest data and instance 2 consists of data which is 3 weeks older than
> instance 1.
>

In knowing where to look for differences in performance, there is a big
difference between them being identical, and being generally similar, but
not identical.


> I hope the above data difference can make a drastic difference. Please
> correct me if I am wrong.
>

They are similar in scale, but we know there is a big difference in
distribution of some values.  For example, we still know the slow plan has
4697 rows in aspect_1 where qname_id = 251, while the other plan has 85
rows in aspect_1 meeting that same criterion. That is a big difference, and
it is a real difference in the data, not just a difference in planning or
estimation.  Is this difference driving the difference in plan choice?
Probably not (plan choice is driven by estimated rows, not actual, and
estimates are quite similar), but it does demonstrate the data is quite
different between the two systems when you look under the hood.  It is
likely that there are other, similar differences in the distribution of
particular values which are driving the difference in plans.  It is just
that we can't see those differences, because the EXPLAIN ANALYZE only
reports on the plan it ran, not other plans it could have run but didn't.

Your query is now using the index named  idx_alf_node_tqn in a way which is
equally unproductive as the previous use of  idx_alf_node_mdq was.  It
looks like they have the same columns, just in a different order.  My
previous advice to try  "type_qname_id+0 = 240" should still apply.

If you can't get that to work, then another avenue is to run "explain
(analyze, buffers) select count(*) from alf_node where (type_qname_id =
240) AND (store_id = 6)" on both instances.




> I did execute vacuum manually and I noticed the below in the output
>
> "INFO: vacuuming "public.alf_node_aspects"
> INFO: "alf_node_aspects": found 0 removable, 150264654 nonremovable row
> versions in 812242 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> CPU 13.53s/33.35u sec elapsed 77.88 sec.
>

I'm not really sure what that means.  I certainly would not have expected 0
removable. There should have been some prior output, something like:

INFO:  scanned index "fk_alf_nasp_qn" to remove 50 row versions

It could be that autovacuum had already gotten around to vacuuming after
your initial email but before you did the above, meaning there was not much
for your manual vacuum to do.

But you can see that the vacuum did have an effect, by comparing these
lines (despite them finding about the same number of rows):

Heap Blocks: exact=40765

Heap Blocks: exact=1774

It wasn't all that large of an effect in this case, but it is still
something worth fixing.

Cheers,

Jeff


Re: unexpected result for wastedbytes query after vacuum full

2019-12-10 Thread Jeff Janes
On Tue, Dec 10, 2019 at 11:43 AM Guillaume Lelarge 
wrote:

> This query uses the column statistics to estimate bloat. AFAIK, json
> columns don't have statistics, so the estimation can't be relied on (for
> this specific table at least).
>

This was true prior to 9.5 (for xml at least, I don't know about json), but
should not be true from that release onward.  But still, given the difference
between 74440704 and 74506240, this does seem to me to be straining at a
gnat to swallow a camel.

Cheers,

Jeff


Re: pg_stat_bgwriter

2019-10-17 Thread Jeff Janes
On Mon, Oct 14, 2019 at 1:25 PM dangal  wrote:

> Do you think it should increase bgwriter_lru_maxpages due to the value of
> maxwritten_clean?
>

I find the background writer to be pretty unimportant these days.  If the
kernel is freely accepting writes without blocking, the backends can
probably write their own buffers without it being a meaningful bottleneck.
On the other hand, if the kernel is constipated, no tweaking of the
background writer parameters is going to insulate the backends from that
fact.  That said, I would increase bgwriter_lru_maxpages (or decrease
bgwriter_delay) anyway.  It probably won't make much difference, but if it
does it is more likely to help than to hurt.
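
In postgresql.conf that would look something like this (the values are just
for illustration, not tuned to your system):

bgwriter_lru_maxpages = 400     # default is 100
bgwriter_delay = 100ms          # default is 200ms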


> Do you think it should increase bgwriter_lru_maxpages,
> bgwriter_lru_multiplier, and decrease bgwriter_delay due to the value of
> buffers_backend compared to buffers_alloc?
>

I don't think that that comparison is meaningful, so wouldn't make changes
based on it.


> Do you think a modification is necessary?
> What values would you recommend?
> thank you
>

If you are experiencing a problem, this is probably not the right way to
investigate it.  If a particular query is slow, try EXPLAIN (ANALYZE,
BUFFERS).  If lots of user-facing things are slow, try sampling "select
wait_event, wait_event_type  from pg_stat_activity where
backend_type='client backend';"


Re: performance degredation after upgrade from 9.6 to 12

2019-11-24 Thread Jeff Janes
On Sun, Nov 24, 2019 at 1:05 PM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Hey Jeff,
> This example was only used to show that pg96  had better perfomance than
> pg12 in a very simple case.
>

OK, but do you agree that a 15% slowdown is more realistic than a 3-fold
one?  Or are you still getting a 3-fold slowdown with more careful testing
and over a wide variety of queries?

I find that the main regression (about 15%) in your example occurs in major
version 10, at the following commit:

commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755
Author: Andres Freund 
Date:   Tue Mar 14 15:45:36 2017 -0700

Faster expression evaluation and targetlist projection.

It is disappointing that this made this case slower rather than faster, and
that the "future work" alluded to either hasn't happened, or wasn't
effective for this example.  I also tested the same example, only with 100 times
more rows, and still see the regression at about 16%.  This is a major
infrastructure change patch which has been extensively built on since then,
the chances of reverting it are very small.  It is making an omelette, and
your example is one of the eggs that got broken.

Performance changes in a large body of queries are usually not all due to
the same thing.  Are you in a position to custom compile your own PostgreSQL?
It would be nice to test this commit against the one before it, and see how
much of the change in your real queries is explained by this one thing (or
whether any of it is)


>  In all the tests that I run most of the queries took less time on 9.6`s
> version.  I dont know why, but as you can see after disabling the parameter
> the simple test that I did showed different results.
>

I can't see--You didn't post results for that.  And running your test on my
own system doesn't show that at all.  In your test case,
max_parallel_workers_per_gather makes no difference.  With 100 times more
rows, setting it to 0 actually slows things down, as at that size
parallelization is useful and disabling it hurts.

Of course parallel query might be hurting some of the other queries, but
for the one example you show you will have to show something more
convincing for me to believe that that is what caused it.

It is easy to benchmark with something like:

PGOPTIONS="-c max_parallel_workers_per_gather=0" pgbench -T30 -f <(echo
"select count(*) from test1") -p 9912 -n -M prepared

If it is other queries where mpwpg is making a difference, then one issue
could be that your settings of parallel_setup_cost and/or
parallel_tuple_cost are too low (although I usually find the default
settings too high, not too low); or you are running your test concurrently
already and so don't need parallel query to fully load the CPUs and trying
to use parallel query just increases the overhead; or your machine doesn't
have the number of truly effective CPUs you think it does.

Cheers,

Jeff

>


Re: performance degredation after upgrade from 9.6 to 12

2019-11-24 Thread Jeff Janes
On Sun, Nov 24, 2019 at 8:52 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Hey Andrew,
> It seems that changing this parameter worked for me.
> Setting it to zero means that there wont be any parallel workers for one
> query right ?
> Is it something familiar this problem with the gatherers ?
>

Your example would not be using parallel workers anyway, regardless of the
setting of max_parallel_workers_per_gather, so I don't see how changing
this could have worked for you.  Unless you mean it worked in your full
test, rather than in your test case. I doubt your test case benchmarking
was very reliable to start with; you only show a single execution and
didn't indicate you had more unshown ones.

If I do more credible benchmarking, I do get a performance regression, but
it is closer to 16% than to 3-fold.  And it doesn't depend on the setting
of max_parallel_workers_per_gather.  I doubt a regression of this size is
even worth investigating.

pgbench -T300 -P5 -f <(echo "select count(*) from test1") -p 9912 -n -M
prepared

Cheers,

Jeff


Re: Modification of data in base folder and very large tables

2019-10-10 Thread Jeff Janes
On Thu, Oct 10, 2019 at 3:40 AM Ogden Brash  wrote:

> If each of the tables has about 3+ billion rows, the index is still going
> to be pretty large and spread over many files. In the source database that
> was backed up, the primary key sequence was sequentially assigned and
> written, but as various posprocessing operations were applied and the rows
> modified,  the data, is probably in a relatively random evenly distributed
> order. So I now believe that all the files that are being constantly
> touched are not actually the files for the data rows, but the files for the
> index, and as the system is reading the data it is jumping around
> recreating the index for the primary key based on the random order of the
> dta rows it reads.
>
> Sound plausible? I'm still a postgres noob.
>

Yes, perfectly plausible.


>
> As an experiment, I am in the process of clustering the source database
> tables by the primary key constraint. I am hoping that if I redo the
> pg_dump after that, it will contain the records in more-or-less primary key
> order and on the subsequent pg_restore it should not have to spend the vast
> majority of the time on reading and seeking.
>
> It is surprising to me that the cluster operations (which also have to
> churn through the entire index and all the records) are going *much* faster
> than pg_restore.
>

The cluster gets to lock the table against any concurrent changes, and then
rebuild the indexes from scratch in bulk. You could have done more-or-less
the same thing just by dropping the primary key while doing the load. Alas,
there is no way to do that without losing the work already done. When you
do a data-only pg_restore, you are dis-inviting it from doing such
optimizations.  Isn't the whole point of a data-only restore that you
leave the table open for other business while it happens?  (Or maybe there
is some other point to it that I am missing--if you want some halfway
measure between creating the table from scratch, and leaving it completely
open for business as usual, then you have to evaluate each of those steps
and implement them yourself, there is no way that pg_restore can reasonably
guess which constraints and indexes it is allowed to drop and which it is
not).

Perhaps
https://www.postgresql.org/docs/current/populate.html#POPULATE-RM-INDEXES
should
mention the index associated with primary keys, since dropping them does
require a different syntax and they might be overlooked.
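
For anyone finding this later, the syntax in question is roughly this
(constraint, table, and column names are of course yours to substitute):

-- before the data load: drop the constraint, which drops its index
ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;

-- after the load: put it back, rebuilding the index in one bulk pass
ALTER TABLE mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);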

Cheers,

Jeff


Re: Query slows when used with view

2019-10-09 Thread Jeff Janes
On Wed, Oct 9, 2019 at 10:56 AM Yavuz Selim Sertoğlu (ETIYA) <
yavuz.serto...@etiya.com> wrote:

> Thanks for the reply Tom,
>
> Sorry, I couldn't understand. I just copied inside of view and add
> conditions from query that runs with view.
> The comma parts are the same in two queries, one is inside of view the
> other is in the query.
>

When you join to a view, the view sticks together, as if its joins were all in
parentheses.  But when you substitute the text of a view into another
query, then they are all on the same level and can be parsed differently.

Consider the difference between "1+1 * 3", and "(1+1) * 3"

Cheers,

Jeff


Re: Would SSD improve Index Only Scan performance by a lot?

2019-10-09 Thread Jeff Janes
On Tue, Oct 8, 2019 at 7:37 PM Arya F  wrote:

> As my table has gotten bigger, it takes longer to get a single row back
> when querying a row by its btree index.
>
>
Is this in a probabilistic sense (they take longer on average), or has every
single access gotten slower?  If the increase in size has caused the index
leaf pages to go from being almost entirely in cache to almost entirely not
in cache, the slowdown would probably be a lot more than 3 to 4 fold.
But maybe you went from 100% in cache to 90% in cache and 10% out of cache
(with a 40-fold slowdown for those ones), coming out to a 4-fold slowdown on
average.  If that is the case, maybe you can get the performance back up by
tweaking some settings, rather than changing hardware.


> Right now the database is running on a traditional HDD. SSDs have a much
> faster seek time than traditional HDDs.
>
> Would switching to an SSD improve "Index Only Scan" time greatly? by at
> least 3-4 times?
>

If drive access is truly the bottleneck on every single execution, then
yes, probably far more than 3-4 times.

Cheers,

Jeff


Re: Modification of data in base folder and very large tables

2019-10-09 Thread Jeff Janes
On Wed, Oct 9, 2019 at 4:33 AM Ogden Brash  wrote:

> # lsof -p 6600 | wc -l;
> 840
>
> # lsof -p 6601 | wc -l;
> 906
>
> Is that normal? That there be so many open file pointers? ~900 open file
> pointers for each of the processes?
>

I don't think PostgreSQL makes any effort to conserve file handles, until
it starts reaching the max.  So any file that has ever been opened will
remain open, unless it was somehow invalidated (e.g. the file needs to be
deleted).  If those processes were previously loading smaller tables before
they got bogged down in the huge ones, a large number of handles would not
be unexpected.



> The next I did was go to see the actual data files, to see how many there
> are. In my case they are in postgresql/11/main/base/24576 and there are
> 2076 files there. That made sense. However, I found that when I list them
> by modification date I see something interesting:
>
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.7
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.8
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.9
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.10
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.11
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.12
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.13
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.14
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.16
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.15
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.17
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.18
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.19
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.21
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.22
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.23
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.24
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.25
> -rw--- 1 postgres postgres 1073741824 Oct  8 13:05 27083.26
> -rw--- 1 postgres postgres   19062784 Oct  8 13:05 27082_fsm
> -rw--- 1 postgres postgres  544489472 Oct  8 13:05 27077.34
> -rw--- 1 postgres postgres  169705472 Oct  8 13:05 27082.72
> -rw--- 1 postgres postgres  978321408 Oct  8 13:05 27083.27
> -rw--- 1 postgres postgres  342925312 Oct  8 13:05 27076.88
>
> If you notice, the file size is capped at 1 GB and as the giant table has
> grown it has added more files in this directory. However, the mysterious
> thing to me is that it keeps modifying those files constantly - even the
> ones that are completely full. So for the two large tables it has been
> restoring all week, the modification time for the ever growing list of
> files is being updating constantly.
>

The bgwriter, the checkpointer, and autovac, plus any backends that decide
they need a clean page from the buffer cache can all touch those files.
They might touch them in ways that are not IO intensive, but still cause
the modification time to get updated. In my hands, once all dirty buffers of a
given file have been flushed and all contents in the file have been
vacuumed, its mtime stops changing just due to a COPY IN which is directed to
later files.

It is also squishy what it even means to modify a file.  I think
filesystems have heuristics to avoid "gratuitous" updates to mtime, which
make it hard to reckon with.


>
> Could it be that thats why I am seeing a slowdown over the course of the
> week - that for some reason as the number of files for the table has grown,
> the system is spending more and more time seeking around the disk to touch
> all those files for some reason?
>

I don't think lsof or mtime are effective ways to research this.  How about
running strace -ttt -T -y on those processes?

Cheers,

Jeff


Re: Optimising a two column OR check

2019-10-12 Thread Jeff Janes
On Sat, Oct 12, 2019 at 10:43 AM Justin Pryzby  wrote:

> On Sat, Oct 12, 2019 at 04:39:56PM +0200, Ivan Voras wrote:
> > With seqscan disabled, I get this plan on 9.6:
> >  Bitmap Heap Scan on friend  (cost=8.42..19.01 rows=14 width=8)
> ...
> > I expected to get an index-only scan in this situation, as that would be
> a
> > very common query. Is there a way to actually make this sort of query
> > resolvable with an index-only scan? Maybe a different table structure
> would
> > help?
>

It would have to scan the entire index to find the cases where user2_id=42
but user1_id is not constrained.  Technically user1_id would be constrained
to be less than 42, but I don't think the planner will take that into
account.


> The v11 release notes have this relevant item:
>
> https://www.postgresql.org/docs/11/release-11.html
> |Allow bitmap scans to perform index-only scans when possible (Alexander
> Kuzmenkov)
>
>
But this is not one of those cases.  It is only possible when the only data
needed is whether the row exists or not.

Cheers,

Jeff


Re: How to set parallel_tuple_cost

2019-12-20 Thread Jeff Janes
On Fri, Dec 20, 2019 at 1:58 PM Tom Lane  wrote:

> Jeff Janes  writes:
> > The docs for parallel_tuple_cost are quite terse, as the reference
> section
> > of the docs usually are:
> > "Sets the planner's estimate of the cost of transferring one tuple from a
> > parallel worker process to another process. The default is 0.1."
>
> > If you were take the doc description literally, then the default value
> > seems much too high, as it doesn't take 10x the (default) cpu_tuple_cost
> to
> > transfer a tuple up from a parallel worker.
>
> Really?  If anything, I'd have thought it might be worse than 10x.
> Cross-process communication isn't cheap, at least not according to
> my instincts.
>

I was a bit surprised.  I set it up so that there was a fine-tunable filter
which can be applied in the parallel workers, and then only the surviving
tuples get passed up to the leader.  The use of a parallel seq scan didn't
become slower than the non-parallel version until over 95% of the tuples
were surviving the filter.  If I wanted to make the estimated cost
cross-over point match the measured time cross-over point, I had to mark
the parallel_tuple_cost down to about 0.011.  This was an 8 CPU machine, an
AWS m5.4xlarge, with max_parallel_workers_per_gather=7.  (On my crummy
2-CPU Windows 10 laptop running ubuntu via VirtualBox, the cross-over point
was closer to 40% of the tuples surviving, and the parallel_tuple_cost to
match the cross-over point would be about 0.016, but I don't have enough RAM to
make a large enough all-in-shared-buffer table to really get a good
assessment).

My method was to make shared_buffers a large fraction of RAM (55GB, out
of 64GB), then make a table slightly smaller than that and force it into
shared_buffers with pg_prewarm.  I set seq_page_cost = random_page_cost = 0,
to accurately reflect the fact that no IO is occurring.

create table para_seq as select floor(random()*10000)::int as id, random()
as x, md5(random()::text)||md5(random()::text) t from
generate_series(1,800*55);
vacuum ANALYZE para_seq ;
select pg_prewarm('para_seq');

explain (analyze, buffers, settings, timing off) select * from para_seq
where id<9500;

Where you can change the 9500 to tune the selectivity of the filter.  Is
this the correct way to try to isolate just the overhead of transferring
a tuple away from other considerations, so it can be measured?

I don't think the fact that EXPLAIN ANALYZE throws away the result set
without reading it should change anything.  Reading it should add the same
fixed overhead to both parallel and non-parallel, so it would dilute the
percentage difference without changing the absolute differences.

I tried it with wider tuples as well, but not so wide they would activate
TOAST, and didn't really see a difference in the conclusion.


> > On the other hand, you
> > probably don't want a query which consumes 8x the CPU resources just to
> > finish only 5% faster (on an otherwise idle server with 8 CPUs).  Maybe
> > this Amdahl factor is what inspired the high default value?
>
> I think the large value of parallel_setup_cost is what's meant to
> discourage that scenario.
>

I think that can only account for overhead like forking and setting up
memory segments.  The overhead of moving around tuples (more than
single-threaded execution already moves them around) would need to scale
with the number of tuples moved around.

Cheers,

Jeff


Re: How to prevent POSTGRES killing linux system from accepting too much inserts?

2019-12-18 Thread Jeff Janes
On Wed, Dec 18, 2019 at 4:53 AM James(王旭)  wrote:

> Hello,
>>
>> I encountered into this kernel message, and I cannot login into the Linux
>> system anymore:
>
>
>>
>> Dec 17 23:01:50 hq-pg kernel: sh (6563): drop_caches: 1
>>
>> Dec 17 23:02:30 hq-pg kernel: INFO: task sync:6573 blocked for more than
>>> 120 seconds.
>>
>> Dec 17 23:02:30 hq-pg kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
>> Dec 17 23:02:30 hq-pg kernel: syncD 965ebabd1040 0
>>> 6573   6572 0x0080
>>
>> Dec 17 23:02:30 hq-pg kernel: Call Trace:
>>
>> Dec 17 23:02:30 hq-pg kernel: [] ?
>>> generic_write_sync+0x70/0x70
>>
>>
>> After some google I guess it's the problem that IO speed is low, while
>> the insert requests are coming too much quickly.So PG put these into cache
>> first then kernel called sync
>
>
Could you expand on what you found in the googling, with links?  I've never
seen these in my kernel log, and I don't know what they mean other than the
obvious that it is something to do with IO.  Also, what kernel and file
system are you using?


> .
>
> I know I can queue the requests, so that POSTGRES will not accept these
>> requests which will result in an increase in system cache.
>
> But is there any way I can tell POSTGRES, that you can only handle 2
>> records per second, or 4M per second, please don't accept inserts more than
>> that speed.
>
> For me, POSTGRES just waiting is much better than current behavior.
>
>
I don't believe there is a setting from within PostgreSQL to do this.

There was a proposal for a throttle on WAL generation back in February, but
with no recent discussion or (visible) progress:

https://www.postgresql.org/message-id/flat/2B42AB02-03FC-406B-B92B-18DED2D8D491%40anarazel.de#b63131617e84d3a0ac29da956e6b8c5f


I think the real answer here is to get a better IO system, or maybe a better
kernel.  Otherwise, once you find a painful workaround for one symptom you
will just smack into another one.

Cheers,

Jeff

>


How to set parallel_tuple_cost

2019-12-20 Thread Jeff Janes
The docs for parallel_tuple_cost are quite terse, as the reference sections
of the docs usually are:

"Sets the planner's estimate of the cost of transferring one tuple from a
parallel worker process to another process. The default is 0.1."

Usually you can find more extensive discussion of such settings in
informal resources like blog posts or mailing lists, but Googling the name
I don't find much for this setting.  Is there good information out there
somewhere?

If you were to take the doc description literally, then the default value
seems much too high, as it doesn't take 10x the (default) cpu_tuple_cost to
transfer a tuple up from a parallel worker.  On the other hand, you
probably don't want a query which consumes 8x the CPU resources just to
finish only 5% faster (on an otherwise idle server with 8 CPUs).  Maybe
this Amdahl factor is what inspired the high default value?

Cheers,

Jeff

