Re: Extremely slow HashAggregate in simple UNION query

2019-08-21 Thread Jeff Janes
On Tue, Aug 20, 2019 at 11:12 AM Felix Geisendörfer 
wrote:
 ...


> [1] My actual query had bad estimates for other reasons (GIN Index), but
> that's another story. The query above was of course deliberately designed
> to have bad estimates.
>

As noted elsewhere, v12 thwarts your attempts to deliberately design the
bad estimates.  You can still get them; you just have to work a bit harder
at it:

CREATE FUNCTION j (bigint, bigint) returns setof bigint as $$ select
generate_series($1,$2) $$ rows 1000 language sql;

EXPLAIN ANALYZE
SELECT * FROM j(1, 1) a, j(1, 1) b
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b;
QUERY PLAN

--
 HashAggregate  (cost=80021.00..100021.00 rows=200 width=16) (actual
time=11.332..13.241 rows=1 loops=1)
   Group Key: a.a, b.b
   ->  Append  (cost=0.50..70021.00 rows=200 width=16) (actual
time=0.118..0.163 rows=2 loops=1)
 ->  Nested Loop  (cost=0.50..20010.50 rows=100 width=16)
(actual time=0.117..0.118 rows=1 loops=1)
   ->  Function Scan on j a  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.087..0.088 rows=1 loops=1)
   ->  Function Scan on j b  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.027..0.027 rows=1 loops=1)
 ->  Nested Loop  (cost=0.50..20010.50 rows=100 width=16)
(actual time=0.044..0.044 rows=1 loops=1)
   ->  Function Scan on j a_1  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.022..0.022 rows=1 loops=1)
   ->  Function Scan on j b_1  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.020..0.021 rows=1 loops=1)
 Planning Time: 0.085 ms
 Execution Time: 69.277 ms
(11 rows)

But the same advance in v12 which makes it harder to fool with your test
case also opens the possibility of fixing your real case.

I've made an extension with a function that always returns true, but
lies about how often it is expected to do so. See the attachment.
With that, you can fine-tune the planner.

CREATE EXTENSION pg_selectivities ;

EXPLAIN ANALYZE
SELECT * FROM j(1, 1) a, j(1, 1) b where pg_always(0.1)
UNION
SELECT * FROM j(1, 1) a, j(1, 1) b where pg_always(0.1);
QUERY PLAN

--
 HashAggregate  (cost=45021.40..45021.60 rows=20 width=16) (actual
time=0.226..0.227 rows=1 loops=1)
   Group Key: a.a, b.b
   ->  Append  (cost=0.50..45021.30 rows=20 width=16) (actual
time=0.105..0.220 rows=2 loops=1)
 ->  Nested Loop  (cost=0.50..22510.50 rows=10 width=16) (actual
time=0.104..0.105 rows=1 loops=1)
   Join Filter: pg_always('1e-05'::double precision)
   ->  Function Scan on j a  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.066..0.066 rows=1 loops=1)
   ->  Function Scan on j b  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.035..0.035 rows=1 loops=1)
 ->  Nested Loop  (cost=0.50..22510.50 rows=10 width=16) (actual
time=0.112..0.113 rows=1 loops=1)
   Join Filter: pg_always('1e-05'::double precision)
   ->  Function Scan on j a_1  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.077..0.077 rows=1 loops=1)
   ->  Function Scan on j b_1  (cost=0.25..10.25 rows=1000
width=8) (actual time=0.034..0.034 rows=1 loops=1)
 Planning Time: 0.139 ms
 Execution Time: 0.281 ms

Cheers,

Jeff


pg_selectivities.patch
Description: Binary data


Re: Bitmap heap scan performance

2019-08-09 Thread Jeff Janes
On Fri, Aug 9, 2019 at 4:42 AM Rob Emery  wrote:


>
> It
> seems to me like the Bitmap Heap Scan on proposal is the issue because
> the recheck is throwing away enormous amounts of data.


Have you tried increasing work_mem?  The probable reason for the recheck is
that your bitmap overflows the allowed memory and then switches from
storing every tid to storing just the block numbers, as indicated by the
lossy part of "Heap Blocks: exact=3983 lossy=27989".
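
If work_mem turns out to be the issue, something like this is a minimal
sketch of the test (the value is only an illustration; pick one your RAM
and concurrency can afford):

SET work_mem = '256MB';   -- session-level, just for the experiment
-- then re-run EXPLAIN (ANALYZE, BUFFERS) and see if the lossy count drops to zero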

The
> has_been_anonymised flag on the proposal is effectively a soft-delete;
> so I’ve tried adding something like :
>
> CREATE INDEX ON proposal.proposal (system_id, legacy_organisation_id,
> reference)
> WHERE has_been_anonymised = false;
>
> Which I was hoping would shrink the size of the index significantly
>

The partial index should be smaller, but when comparing to the index with
"has_been_anonymised" as the leading column, it won't make a lot of
difference.  You only have to scan a smaller part of the larger index, and
the sizes of the parts of each index you have to scan will be roughly
comparable.


> and encourage an index scan rather than bitmap, however it didn’t have
> that effect.


To encourage index scans over bitmap scans, you can increase
effective_cache_size.  Or to really force the issue, you can "set
enable_bitmapscan=off", but that is something you would usually do locally
for experimental purposes, not in production's config settings.
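
For example, as a session-level sketch (the value is only illustrative, and
the enable_* override is meant for experimenting, not for postgresql.conf):

SET effective_cache_size = '48GB';   -- reflect how much data the OS can realistically cache
SET enable_bitmapscan = off;         -- experiment only
-- re-run EXPLAIN ANALYZE and compare the index scan plan against the bitmap plan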

Cheers,

Jeff


Re: Shortest offline window on database migration

2019-05-30 Thread Jeff Janes
On Thu, May 30, 2019 at 11:08 AM Haroldo Kerry  wrote:

> Hello,
>
> We are migrating our PostgreSQL 9.6.10 database (with streaming
> replication active) to a faster disk array.
> We are using this opportunity to enable checksums, so we will have to do a
> full backup-restore.
> The database size is about 500GB, it takes about 2h:30min for a full
> backup, and then about 1h to fully restore it with checksum enabled on the
> new array, plus 2h to recreate the replica on the old array.
>

As others have noticed, your "trick" won't work.  So back to basics.  Are
you using the best degree of parallelization on each one of these tasks?
What is the bottleneck of each one (CPU, disk, network)? how are you
creating the replica?  Can you share the actual command lines for each
one?  It seems odd that the dump (which only needs to dump the index and
constraint definitions) is so much slower than the restore (which actually
needs to build those indexes and validate the constraints). Is that because
the dump is happening from the old slow disk and restore a new fast ones?
Same with creating the replica, why is that slower than actually doing the
restore?
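
If you aren't already parallelizing, a hedged sketch of what that could
look like (directory format is required for a parallel dump; the job counts
and paths are only illustrative):

pg_dump -Fd -j 8 -f /new_array/dump_dir yourdb
pg_restore -j 8 -d yourdb /new_array/dump_dir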

It sounds like you are planning on blowing away the old master server on
the old array as soon as the upgrade is complete, so you can re-use that
space to build the new replica?  That doesn't seem very safe to me--what if
during the rebuilding of the replica you run into a major problem and have
to roll the whole thing back?  What will the old array which is holding the
current replica server be doing in all of this?

Cheers,

Jeff

>


Re: PostgreSQL optimizer use seq scan instead of pkey index only scan (in queries with postgres_fdw)

2019-05-06 Thread Jeff Janes
On Mon, May 6, 2019 at 11:53 AM Jeff Janes  wrote:

> On Mon, May 6, 2019 at 11:38 AM Vitaly Baranovsky <
> barvetalfor...@gmail.com> wrote:
>
>> Thank you, Jeff!
>>
>> We'll be looking forward to the next version of Postgres in this case.
>>
>> As far as I understand, you've answered about sending filtering condition
>> to a foreign server... Could you, please, clarify about another (the first)
>> part of my question? Why the server choose seq scan instead of pk key index
>> only scan for the local table?
>>
>> Thank you
>>
>>
> Aren't those the same thing?  The foreign server can't use the where
> clause, if it doesn't get sent.
>

Nevermind.  When you said local table, I had some tunnel vision and was
thinking of the foreign table as viewed from the perspective of the foreign
server (to which it is local), not the actual local table.  That too is
"fixed" in the same commit to the 12dev branch as the other issue:

commit 4be058fe9ec5e630239b656af21fc083371f30ed
Date:   Mon Jan 28 17:54:10 2019 -0500

In the planner, replace an empty FROM clause with a dummy RTE.


My tests are all done with empty, unanalyzed tables, as I just took your DDL
without inventing my own DML, so the results may be different from what you
were seeing with your populated tables.

Cheers,

Jeff

>


Re: PostgreSQL optimizer use seq scan instead of pkey index only scan (in queries with postgres_fdw)

2019-05-06 Thread Jeff Janes
On Mon, May 6, 2019 at 11:38 AM Vitaly Baranovsky 
wrote:

> Thank you, Jeff!
>
> We'll be looking forward to the next version of Postgres in this case.
>
> As far as I understand, you've answered about sending filtering condition
> to a foreign server... Could you, please, clarify about another (the first)
> part of my question? Why the server choose seq scan instead of pk key index
> only scan for the local table?
>
> Thank you
>
>
Aren't those the same thing?  The foreign server can't use the WHERE
clause if it doesn't get sent.

Cheers,

Jeff


Re: PostgreSQL optimizer use seq scan instead of pkey index only scan (in queries with postgres_fdw)

2019-05-06 Thread Jeff Janes
On Mon, May 6, 2019 at 10:44 AM Vitaly Baranovsky 
wrote:

> Hello all,
>
> I faced strange behavior of PostgreSQL during the query execution.
>

 ...


> Also, please, note, that SQL without WHERE clause has been set to the
> foreign server:
> "  Remote SQL: SELECT primary_uuid FROM public.foreign_table"
>
> So, the optimizer doesn't select optimal plans for such executions :(
>

It works the way you want in version 12, which is currently under
development and should be released in 5 months or so.

Cheers,

Jeff

>


Re: Out of Memory errors are frustrating as heck!

2019-04-22 Thread Jeff Janes
On Sat, Apr 20, 2019 at 4:48 PM Tom Lane  wrote:

> Gunther  writes:
> > and checked my log file and there was nothing before the call
> > MemoryContextStats(TopPortalContext) so I don't understand where this
> > printf stuff is ending up.
>
> It's going to stdout, which is likely block-buffered whereas stderr
> is line-buffered, so data from the latter will show up in your log
> file much sooner.  You might consider adding something to startup
> to switch stdout to line buffering.
>

Is there a reason to not just elog the HJDEBUG stuff?  With some of the
other DEBUG defines, we will probably be using them before the logging
system is set up, but surely we won't be doing Hash Joins that early?

I think we could just get rid of the conditional compilation and elog this
at DEBUG1 or DEBUG2.  Or keep the conditional compilation and elog it at
LOG.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-21 Thread Jeff Janes
On Sat, Apr 20, 2019 at 4:26 PM Tom Lane  wrote:

> Tomas Vondra  writes:
> > Considering how rare this issue likely is, we need to be looking for a
> > solution that does not break the common case.
>
> Agreed.  What I think we need to focus on next is why the code keeps
> increasing the number of batches.  It seems like there must be an undue
> amount of data all falling into the same bucket ... but if it were simply
> a matter of a lot of duplicate hash keys, the growEnabled shutoff
> heuristic ought to trigger.
>

The growEnabled stuff only prevents infinite loops.  It doesn't prevent
extreme silliness.

If a single 32 bit hash value has enough tuples by itself to not fit in
work_mem, then it will keep splitting until that value is in a batch by
itself before shutting off (or at least until the split-point bit of
whatever else is in that bucket happens to be the same value as the
split-point bit of the degenerate one, so by luck nothing or everything
gets moved).

Probabilistically we keep splitting until every batch, other than the one
containing the degenerate value, has about one tuple in it.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-16 Thread Jeff Janes
On Mon, Apr 15, 2019 at 9:49 PM Gunther  wrote:

> Jeff Janes had more
>
> Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
>> 715 {
>> (gdb) p context->name
>> $8 = 0x96ce5b "ExecutorState"
>>
>>
> I think that the above one might have been the one you wanted.
>
> Not sure how you could tell that? It's the same place as everything else.
> If we can find out what you're looking for, may be we can set a break point
> earlier up the call chain?
>

It is just a hunch.  That size is similar to one you reported for the
last-straw allocation on the case with the nested loops rather than hash
joins.  So it is reasonable (but surely not guaranteed) that they are coming
from the same place in the code.  And since that other one occurred as the
last straw, it must not have been part of the start-up allocations, but
rather the part where we should be at steady state, but are not.  So maybe
this one is too.


>
> I guess I should run this for a little longer. So I disable my breakpoints
>>
>>
> it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing
>> fast, so now back to gdb and break:
>>
> Unfortunately, I think this means you missed your opportunity and are now
> getting backtraces of the innocent bystanders.
>
> But why? If I see the memory still go up insanely fast, isn't that a sign
> for the leak?
>

You said it was no longer going up insanely fast by the time you got your
breakpoints re-activated.

> Particularly since you report that the version using nested loops rather
> than hash joins also leaked, so it is probably not the hash-join specific
> code that is doing it.
>
> How about it's in the DISTINCT? I noticed while peeking up the call chain,
> that it was already in the UNIQUE sort thing also.  I guess it's streaming
> the results from the hash join right into the unique sort step.
>

Isn't it crashing before any rows are emitted from the hash join?  I thought
it was, but I am not very confident on that.


> What I've done before is compile with the comments removed from
> src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */
>
> I have just done that and it creates an insane amount of output from all
> the processes, I'm afraid there will be no way to keep that stuff
> separated. If there was a way of turning that one and off for one process
> only, then we could probably get more info...
>

Are you doing all this stuff on a production server in use by other people?

> Everything is also extremely slow that way. Like in a half hour the memory
> didn't even reach 100 MB.
>
> and then look for allocations sizes which are getting allocated but not
> freed, and then you can go back to gdb to look for allocations of those
> specific sizes.
>
> I guess I should look for both, address and size to match it better.
>

You can analyze the log for specific addresses which are allocated but not
freed, but once you find them you can't do much with them.  Those specific
addresses probably won't repeat on the next run, so you can't do anything
with that knowledge.  If you find a request size that is systematically
allocated but not freed, you can condition logging (or gdb) on that size.
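
For example, assuming the AllocSetAlloc breakpoint is breakpoint 2 as in
your session, and using the 8272-byte size from your earlier report purely
as an illustration:

(gdb) cond 2 size == 8272
(gdb) cont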


> This generates a massive amount of output, and it bypasses the logging
> configuration and goes directly to stderr--so it might not end up where you
> expect.
>
> Yes, massive, like I said. Impossible to use. File system fills up
> rapidly. I made it so that it can be turned on and off, with the debugger.
>
> int _alloc_info = 0;
> #ifdef HAVE_ALLOCINFO
> #define AllocFreeInfo(_cxt, _chunk) \
> if(_alloc_info) \
> fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
> (_cxt)->header.name, (_chunk), (_chunk)->size)
> #define AllocAllocInfo(_cxt, _chunk) \
> if(_alloc_info) \
> fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
> (_cxt)->header.name, (_chunk), (_chunk)->size)
> #else
> #define AllocFreeInfo(_cxt, _chunk)
> #define AllocAllocInfo(_cxt, _chunk)
> #endif
>
> so with this I do
>
> (gdb) b AllocSetAlloc
> (gdb) cont
> (gdb) set _alloc_info=1
> (gdb) disable
> (gdb) cont
>
>
Thanks for this trick, I'll save this so I can refer back to it if I need
to do this again some time.


> then I wait, ... until it crashes again ... no, it's too much. It fills up
> my filesystem in no time with the logs.  It produced 3 GB in just a minute
> of run time.
>

You don't need to run it until it crashes, only until the preliminaries are
done and the leak has started to happen.  I would think a minute would be
more than enough, unless the leak doesn't start until some other join 

Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Mon, Apr 15, 2019 at 12:34 PM Gunther  wrote:

> Breakpoint 2, AllocSetAlloc (context=0x1168230, size=8272) at aset.c:715
> 715 {
> (gdb) p context->name
> $8 = 0x96ce5b "ExecutorState"
>
>
I think that the above one might have been the one you wanted.


> I guess I should run this for a little longer. So I disable my breakpoints
>
>
it went up pretty quick from 1.2 GB to 1.5 GB, but then it stopped growing
> fast, so now back to gdb and break:
>
Unfortunately, I think this means you missed your opportunity and are now
getting backtraces of the innocent bystanders.

Particularly since you report that the version using nested loops rather
than hash joins also leaked, so it is probably not the hash-join specific
code that is doing it.

What I've done before is compile with the comments removed from
src/backend/utils/mmgr/aset.c:/* #define HAVE_ALLOCINFO */

and then look for allocations sizes which are getting allocated but not
freed, and then you can go back to gdb to look for allocations of those
specific sizes.  This generates a massive amount of output, and it bypasses
the logging configuration and goes directly to stderr--so it might not end
up where you expect.


Thanks for the view definition.  Nothing in it stood out to me as risky.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Mon, Apr 15, 2019 at 11:28 AM Tom Lane  wrote:

> Jeff Janes  writes:
> > To get it to happen faster, maybe you could run the server with a small
> > setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
> > Unfortunately I don't know how to set a breakpoint for allocations into a
> > specific context, and setting a breakpoint for any memory allocation is
> > probably going to fire too often to be useful.
>
> If you can use gdb at all, it's not that hard to break on allocations
> into a specific context; I've done it many times.  The strategy is
> basically
>

> 1. Let query run long enough for memory usage to start increasing,
> then attach to backend with gdb.
>
> 2. Set breakpoint at, probably, AllocSetAlloc.  (In some cases,
> reallocs could be the problem, but I doubt it here.)  Then "c".
>
> 3. When it stops, "p *context" and see if this is the context
> you're looking for.  In this case, since we want to know about
> allocations into ExecutorState and we know there's only one
> active one, you just have to look at the context name.  In general
> you might have to look at the backtrace.  Anyway, if it isn't the
> one you want, just "c" until you get to an allocation into the
> one you do want.
>
> 4. Once you have found out the address of the context you care
> about, make the breakpoint conditional on the context argument
> being that one.  It might look like this:
>
> Breakpoint 1, AllocSetAlloc (context=0x1483be0, size=480) at aset.c:715
> 715 {
> (gdb) p *context
> $1 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
> false,
>   methods = 0xa33f40, parent = 0x0, firstchild = 0x1537f30, prevchild =
> 0x0,
>   nextchild = 0x0, name = 0xa3483f "TopMemoryContext", ident = 0x0,
>   reset_cbs = 0x0}
> (gdb) cond 1  context == 0x1483be0
>
> 5. Now repeatedly "c", and check the stack trace each time, for a
> dozen or two times to get a feeling for where the allocations are
> being requested.
>
> In some cases you might be able to find the context address in a
> more efficient way than what I suggest in #3 --- for instance,
> you could instead set a breakpoint where the context is created
> and snag its address immediately, or you could dig around in
> backend data structures to find it.  But these ways generally
> require more familiarity with the code than just watching the
> requests go by.
>


Thanks for the recipe.  I can use gdb at all, just not very skillfully :)

With that as a starting point, experimentally, this seems to work to short
circuit the loop described in your step 3 (which I fear could be thousands
of iterations in some situations):

cond 1 strcmp(context.name,"ExecutorState")==0

Also, I've found that in the last few versions of PostgreSQL, processes
might get unreasonable numbers of SIGUSR1 (maybe due to parallelization?)
and so to avoid having to stand on the 'c' button, you might need this:

handle SIGUSR1 noprint nostop
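
Putting that together with Tom's recipe, a sketch of the whole session
(the backend pid is whatever your process is, and the repeated cont/bt is
the "dozen or two" backtraces described above):

gdb -p <backend pid>
(gdb) handle SIGUSR1 noprint nostop
(gdb) break AllocSetAlloc
(gdb) cond 1 strcmp(context.name,"ExecutorState")==0
(gdb) cont
(gdb) bt
(gdb) cont
... repeat cont/bt a dozen or two times ...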

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Sun, Apr 14, 2019 at 11:59 PM Gunther  wrote:


> Is there any doubt that this might be a problem with Linux? Because if
> you want, I can whip out a FreeBSD machine, compile pgsql, and attach
> the same disk, and try it there. I am longing to have a reason to move
> back to FreeBSD anyway. But I have tons of stuff to do, so if you do not
> have reason to suspect Linux to do wrong here, I prefer skipping that
> futile attempt
>

I think the PostgreSQL leaking in the first place would be independent of
Linux being ungraceful about it.  So repeating it on BSD probably wouldn't
help us here.  If you want to take up the 2nd issue with the kernel folks,
having some evidence from BSD might (but not very likely) be helpful for
that, but that would be for a different mailing list.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-15 Thread Jeff Janes
On Sun, Apr 14, 2019 at 11:04 PM Gunther  wrote:

> Could you rerun the query with \set VERBOSITY verbose to show the file/line
> that's failing ?
>
> Here goes:
>
> integrator=# \set VERBOSITY verbose
> integrator=# SET ENABLE_NESTLOOP TO OFF;
> SET
> integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM 
> reports.v_BusinessOperation;
> ERROR:  53200: out of memory
> DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".
> LOCATION:  MemoryContextAlloc, mcxt.c:798
>
> you notice that I set ENABLE_NESTLOOP to off, that is because the planner
> goes off thinking the NL plan is marginally more efficient, but in fact it
> will take 5 hours to get to the same out of memory crash, while the no NL
> plan gets there in half an hour. That verbose setting didn't help much I
> guess.
>
I think the backtrace of the enable_nestloop=on plan would be more useful.
Here something has filled up memory, and then we see HashBatchContext trip
over it.  But it isn't the allocation that caused the problem, so the
backtrace doesn't help.  With the previous plan, it was an allocation into
ExecutorState which tripped over the problem, and it is likely that it is
coming from the same series of allocations that caused the problem.

To get it to happen faster, maybe you could run the server with a small
setting of "ulimit -v"?  Or, you could try to capture it live in gdb.
Unfortunately I don't know how to set a breakpoint for allocations into a
specific context, and setting a breakpoint for any memory allocation is
probably going to fire too often to be useful.

Yes, the verbose option didn't help (but the gdb backtrace made up for
it--kind of--we really need the backtrace of the allocations into
ExecutorState).  It isn't helpful to know that a memory allocation failed
in the mcxt.c code.  Too bad it doesn't report the location of the caller of
that code.  I know in Perl you can use Carp::croak to do that, but I don't
know how to do it in C.

But really the first thing I want to know now is what if you just do the
select, without the insert?

explain analyze SELECT * FROM reports.v_BusinessOperation

If that works, what about "create temporary table foo as SELECT * FROM
reports.v_BusinessOperation" ?

And if that works, what about "INSERT INTO reports.BusinessOperation SELECT
* FROM foo"?

If the ERROR happens in the first or last of these, it might be much easier
to analyze in that simplified context. If it happens in the middle one,
then we probably haven't achieved much. (And if there is no ERROR at all,
then you have workaround, but we still haven't found the fundamental bug).

Are you not showing the view definition for proprietary reasons, or just
because you don't think it will be useful?  If the latter, please send it as
an attachment.  I can't guarantee it will be useful, but there is only one
way to find out.

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Jeff Janes
On Sun, Apr 14, 2019 at 9:06 PM Gunther  wrote:

> Thanks for looking at my problem Tom Lane and Jeff Janes. Sorry for not
> having given enough detail.
>
> The version is 10.2 latest. The database was originally built with 10.1
> and then just started with 10.2.
>
Do you mean 11.2?  The latest in the 10 series is 10.7.  If you do mean
10.2, there is a fix for a memory leak bug since then that might plausibly
be relevant (bdc7f686d1b8f423cb)

>
> I said "crash" and that is wrong. Not a signal nor core dump. It is the
> ERROR:  out of memory. Only the query crashes. Although I don't know if may
> be the backend server might have left a core dump?
>
I don't think there would be a core dump on only an ERROR, and it is
probably not worthwhile to trick it into generating one.


> The short version is:
>
> Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 
> 1425134928 used
> 2019-04-14 16:38:26.355 UTC [11061] ERROR:  out of memory
> 2019-04-14 16:38:26.355 UTC [11061] DETAIL:  Failed on request of size 8272 
> in memory context "ExecutorState".
>
I don't know why an 8GB system with a lot of cache that could be evicted
would get an OOM when something using 1.5GB asks for 8272 bytes more.  But
that is a question of how the kernel works, rather than how PostgreSQL
works.  And I also think the log you quote above belongs to a different
event than the vmstat trace in your first email.


>   ExecutorState: 1416621920 total in 168098 blocks; 8494152 free (3102 
> chunks); 1408127768 used
> HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 
> used
>   HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 
> 41360 used
>
>
This does not seem to match your query plan.  Why would a plan with no Hash
Joins have a HashBatchContext?  I think this log must be from a different
query than the one that generated the plan you posted.  Perhaps one was
before you lowered work_mem and one was after?

Cheers,

Jeff


Re: Out of Memory errors are frustrating as heck!

2019-04-14 Thread Jeff Janes
On Sun, Apr 14, 2019 at 4:51 PM Gunther  wrote:

> For weeks now, I am banging my head at an "out of memory" situation. There
> is only one query I am running on an 8 GB system, whatever I try, I get
> knocked out on this out of memory.
>
Is PostgreSQL throwing an error with OOM, or is it getting killed -9 by the
OOM killer?  Do you get a core file you can inspect with gdb?

You might want to see the query, but it is a huge plan, and I can't really
> break this down. It shouldn't matter though. But just so you can get a
> glimpse here is the plan:
>
> Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 
> width=1197)
>   ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
>
>
>
Maybe it is memory for trigger or constraint checking, although I don't
know why that would appear instantly.  What triggers or constraints do you
have on businessoperation?

What if you just run the SELECT without the INSERT?  Or insert into a temp
table rather than into businessoperation?  And if that doesn't crash, what
if you then insert to businessoperation from the temp table?

Also, what version?

Cheers,

Jeff


Re: Commit(?) overhead

2019-04-04 Thread Jeff Janes
On Thu, Apr 4, 2019 at 3:42 AM Duncan Kinnear 
wrote:

>
> the EXPLAIN (ANALYSE, TIMING TRUE) of this query gives:
>
>  Update on next_id  (cost=0.14..8.16 rows=1 width=36) (actual
> time=0.057..0.057 rows=0 loops=1)
>->  Index Scan using next_id_pk on next_id  (cost=0.14..8.16 rows=1
> width=36) (actual time=0.039..0.040 rows=1 loops=1)
>  Index Cond: ((id)::text = 'Session'::text)
>  Planning Time: 0.083 ms
>  Execution Time: 0.089 ms
>
> which is significantly less than 50ms.
>

The EXPLAIN ANALYZE doesn't include the time needed to fsync the
transaction logs.  It measures only the update itself, not the implicit
commit at the end.  DBeaver is seeing the fsync-inclusive time.  50ms is
pretty long, but some file systems and OSes seem to be pretty inefficient
at this and take several disk revolutions to get the data down.
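
One quick way to confirm that it is the commit-time flush you are paying
for (a diagnostic sketch only; this setting trades durability of the most
recent commits for speed, so don't leave it on casually):

SET synchronous_commit = off;
-- re-run the update from DBeaver; if the ~50ms drops to roughly the
-- EXPLAIN ANALYZE time, the difference was the WAL fsync at commit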


>
> Now, if I point DBeaver to a VM server on the same gigabit network switch,
> running version:
>9.5.3 on i386-pc-solaris2.11, compiled by cc: Sun C 5.10 SunOS_i386
> Patch 142363-07 2010/12/09, 64-bit
> then the same query executes in about 2-3ms
>

That machine probably has hardware to do a fast fsync, has fsync turned
off, or is lying about the safety of its data.

Cheers,

Jeff


Re: impact of auto explain on overall performance

2019-03-14 Thread Jeff Janes
On Thu, Mar 14, 2019 at 3:29 AM Stephan Schmidt  wrote:

> Hello,
>
>
>
> i’m currently working on a high Performance Database and want to make sure
> that whenever there are slow queries during regular operations i’ve got all
> Information about the query in my logs. So auto_explain come to mind, but
> the documentation explicitly states that it Comes at a cost. My Question
> is, how big is the latency added by auto_explain in percentage or ms ?
>

You will have to measure it yourself and see.  It depends on your hardware,
OS, and OS version, and PostgreSQL version.  And the nature of your
queries.  If you have auto_explain.log_timing=on, then I find that large
sorts are the worst impacted.  So if you have a lot of those, you should be
careful.

On older kernels, I would run with auto_explain.log_timing=off.  On newer
kernels where you can read the clock from user-space, I run with
auto_explain.log_timing=on.  I find the slowdown noticeable with careful
investigation (around 3%, last time I carefully investigated it), but
usually well worth paying to have actual data to work with when I find slow
queries in the log.  I made a special role with auto_explain disabled for
use with a few reporting queries with large sorts, both to circumvent the
overhead and to avoid spamming the log with slow queries I already know
about.
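
For example, one way to do that per-role piece (the role name is made up,
and this assumes auto_explain is loaded so its parameters are recognized):

ALTER ROLE reporting_user SET auto_explain.log_min_duration = -1;  -- -1 disables plan logging for this role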

Cheers,

Jeff

>


Re: Aggregate and many LEFT JOIN

2019-02-26 Thread Jeff Janes
On Mon, Feb 25, 2019 at 3:54 AM kimaidou  wrote:


> Wich strikes me is that if I try to simplify it a lot, removing all data
> but the main table (occtax.observation) primary key cd_nom and aggregate,
> the query plan should be able tu use the cd_nom index for sorting and
> provide better query plan (hash aggregate), but it does not seems so :
>

HashAggregate doesn't support aggregates with DISTINCT.  I don't think
there is any reason it can't; it is just that no one has gotten around to
it.

Aggregates with DISTINCT also kill your ability to get parallel queries.

Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-25 Thread Jeff Janes
On Mon, Feb 25, 2019 at 11:13 AM Gunther Schadow 
wrote:

> Anyway, I think the partitioned table is the right and brilliant solution,
> because an index really isn't required. The actual pending partition will
> always remain quite small, and being a queue, it doesn't even matter how
> big it might grow, as long as new rows are inserted at the end and not in
> the middle of the data file and still there be some way of fast skip over
> the part of the dead rows at the beginning that have already been processed
> and moved away.
>
Why do you want to do that?  If you are trying to force the queue to be
handled in a "fair" order, then this isn't the way to do it; you would want
to add an "ORDER BY" to your dequeueing query (in which case you are
probably back to adding an index).

Once the space in the beginning of the table has been reclaimed as free,
then it will be reused for newly inserted tuples.  After the space is freed
up but before it is reused, the seq scan can't skip those blocks entirely,
but it can deal with the blocks quickly because they are empty.  If the
blocks are full of dead but not freed tuples (because the long-running
transactions are preventing them from being cleaned up) then it will have
to go through each dead tuple to satisfy itself that it actually is dead.
This might not be as bad as it is for indexes, but certainly won't be good
for performance.

 Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-25 Thread Jeff Janes
On Sat, Feb 23, 2019 at 4:06 PM Gunther  wrote:

> Hi,
>
> I am using an SQL queue for distributing work to massively parallel
> workers.
>
You should look into specialized queueing software.

...

> I figured I might just pause all workers briefly to schedule the REINDEX
> Queue command, but the problem with this is that while the transaction
> volume is large, some jobs may take minutes to process, and in that case we
> need to wait minutes to quiet the database with then 47 workers sitting as
> idle capacity waiting for the 48th to finish so that the index can be
> rebuilt!
>
The jobs that take minutes are themselves the problem.  They prevent tuples
from being cleaned up, meaning all the other jobs need to grovel through
the detritus every time they need to claim a new row.  If you got those
long running jobs to end, you probably wouldn't even need to reindex--the
problem would go away on its own as the dead-to-all tuples get cleaned up.

Locking a tuple and leaving the transaction open for minutes is going to
cause no end of trouble on a highly active system.  You should look at a
three-state method where the tuple can be pending/claimed/finished, rather
than pending/locked/finished.  That way the process commits immediately
after claiming the tuple, and then records the outcome in another
transaction once it is done processing.  You will need a way to detect
processes that failed after claiming a row but before finishing, but
implementing that is going to be easier than all of this re-indexing stuff
you are trying to do now.  You would claim the row by updating a field in
it to have something distinctive about the process, like its hostname and
pid, so you can figure out if it is still running when it comes time to
clean up apparently forgotten entries.
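
A minimal sketch of that claim step, with made-up table and column names
(not your schema):

BEGIN;
UPDATE queue SET status = 'claimed', claimed_by = 'host1:4711', claimed_at = now()
 WHERE jobid = (SELECT jobid FROM queue
                 WHERE status = 'pending'
                 LIMIT 1
                 FOR UPDATE SKIP LOCKED)
RETURNING jobid, action;
COMMIT;   -- commit right away, so the lock is released while the work runs
-- ... do the actual processing outside of any open transaction ...
UPDATE queue SET status = 'finished' WHERE jobid = ...;  -- second transaction records the outcome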

Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Jeff Janes
On Sun, Feb 24, 2019 at 1:02 PM Gunther  wrote:

> Thank you all for responding so far.
>
> David Rowley  and Justin Pryzby suggested things about autovacuum. But I
> don't think autovacuum has any helpful role here. I am explicitly doing a
> vacuum on that table. And it doesn't help at all. Almost not at all.
>
If you do a vacuum verbose, what stats do you get back?  What is the size
of the index when the degradation starts to show, and immediately after a
successful reindex?

Also, how is JobID assigned?  Are they from a sequence, or some other
method where they are always added to the end of the index's keyspace?

When it starts to degrade, what is the EXPLAIN plan for the query?

Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Jeff Janes
On Sat, Feb 23, 2019 at 4:06 PM Gunther  wrote:

> the dequeue operation is essentially this:
>
> BEGIN
>
> SELECT jobId, action
>   FROM Queue
>   WHERE pending
>   FOR UPDATE SKIP LOCKED
>
>
There is no LIMIT shown.  Wouldn't the first thread to start up just lock
all the rows and everyone else would starve?

Cheers,

Jeff


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 4:17 PM Tom Lane  wrote:

> Jeff Janes  writes:
> > In order for bloom (or any other users of CREATE ACCESS METHOD, if there
> > are any) to have a fighting chance to do better, I think many of
> selfuncs.c
> > currently private functions would have to be declared in some header
> file,
> > perhaps utils/selfuncs.h.  But that then requires a cascade of other
> > inclusions.  Perhaps that is why it was not done.
>
> I'm just in the midst of refactoring that stuff, so if you have
> suggestions, let's hear 'em.
>

The goal would be that I can copy the entire definition of
genericcostestimate into blcost.c, change the function's name, and get it
to compile.  I don't know the correct way to accomplish that.  Maybe
utils/selfuncs.h can be expanded to work, or perhaps there should be a new
header file like "utils/index_am_cost.h"

What I've done for now is:

#include "../../src/backend/utils/adt/selfuncs.c"

which I assume is not acceptable as a real solution.


It's possible that a good cost model for bloom is so far outside
> genericcostestimate's ideas that trying to use it is not a good
> idea anyway.
>

I think that might be the case.  I don't know what the right answer would
look like, but I think it will likely end up needing to access everything
that genericcostestimate currently needs to access.  Or if bloom doesn't,
some other extension implementing an ACCESS METHOD will.

Cheers,

Jeff


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 11:58 AM Jeff Janes  wrote:

>
> On Tue, Feb 12, 2019 at 10:42 AM Tom Lane  wrote:
>
>>
>> Hm.  blcostestimate is using the default cost calculation, except for
>>
>> /* We have to visit all index tuples anyway */
>> costs.numIndexTuples = index->tuples;
>>
>> which essentially tells genericcostestimate to assume that every index
>> tuple will be visited.  This obviously is going to increase the cost
>> estimate; maybe there's something wrong with that?
>>
>
> I assumed (without investigating yet) that genericcostestimate is applying
> a cpu_operator_cost (or a few of them) on each index tuple, while the
> premise of a bloom index is that you do very fast bit-fiddling, not more
> expensive SQL operators, for each tuple and then do the recheck only on what
> survives to the table tuple part.
>

In order for bloom (or any other users of CREATE ACCESS METHOD, if there
are any) to have a fighting chance to do better, I think many of selfuncs.c's
currently private functions would have to be declared in some header file,
perhaps utils/selfuncs.h.  But that then requires a cascade of other
inclusions.  Perhaps that is why it was not done.

Cheers,

Jeff

>


Re: Performance regressions found using sqlfuzz

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 4:23 AM Jung, Jinho  wrote:

>
> Hello,
>
> We are developing a tool called sqlfuzz for automatically finding
> performance regressions in PostgreSQL. sqlfuzz performs mutational fuzzing
> to generate SQL queries that take more time to execute on the latest
> version of PostgreSQL compared to prior versions. We hope that these
> queries would help further increase the utility of the regression test
> suite.
>
> We would greatly appreciate feedback from the community regarding the
> queries found by the tool so far. We have already incorporated prior
> feedback from the community in the latest version of sqlfuzz.
>

This approach doesn't seem very exciting to me as-is, because optimization
is a very pragmatic endeavor.  We make decisions all the time that might
make some queries better and others worse.  If the queries that get better
are natural/common ones, and the ones that get worse are weird/uncommon
ones (like generated by a fuzzer), then making that change is an
improvement even if there are some performance (as opposed to correctness)
regressions.

I would be more interested in investigating some of these if the report
would:

1) include the exact commit in which the regression was introduced (i.e.
automate "git bisect").
2) verify that the regression still exists in the dev HEAD and report which
commit it was verified in (since HEAD changes frequently).
3) report which queries (if any) in your corpus were made better by the
same commit which made the victim query worse.

Cheers,

Jeff

>


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 10:42 AM Tom Lane  wrote:

> Thomas Kellerer  writes:
> > The bloom index is only used if either Seq Scan is disabled or if the
> random_page_cost is set to 1 (anything above 1 triggers a Seq Scan on my
> Windows laptop).
>
> Hm.  blcostestimate is using the default cost calculation, except for
>
> /* We have to visit all index tuples anyway */
> costs.numIndexTuples = index->tuples;
>
> which essentially tells genericcostestimate to assume that every index
> tuple will be visited.  This obviously is going to increase the cost
> estimate; maybe there's something wrong with that?
>

I assumed (without investigating yet) that genericcostestimate is applying
a cpu_operator_cost (or a few of them) on each index tuple, while the
premise of a bloom index is that you do very fast bit-fiddling, not more
expensive SQL operators, for each tuple and then do the recheck only on what
survives to the table tuple part.

Cheers,

Jeff


Re: autovacuum big table taking hours and sometimes seconds

2019-02-07 Thread Jeff Janes
On Thu, Feb 7, 2019 at 6:55 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

I have 3 questions :
> 1)To what value do you recommend to increase the vacuum cost_limit ? 2000
> seems reasonable ? Or maybe its better to leave it as default and assign a
> specific value for big tables ?
>

That depends on your IO hardware, and your workload.  You wouldn't want
background vacuum to use so much of your available IO that it starves your
other processes.



> 2)When the autovacuum reaches the cost_limit while trying to vacuum a
> specific table, it wait nap_time seconds and then it continue to work on
> the same table ?
>

No, it waits for autovacuum_vacuum_cost_delay before resuming within the
same table.  During this delay, the table is still open and it still holds a
lock on it, and holds the transaction open, etc.  Naptime is entirely
different; it controls how often the vacuum scheduler checks to see which
tables need to be vacuumed again.



> 3)So in case I have a table that keeps growing (not fast because I set the
> vacuum_scale_factor to 0 and the autovacuum_vacuum_threshold to 1). If
> the table keep growing it means I should try to increase the cost right ?
> Do you see any other option ?
>

 You can use pg_freespacemap to see if the free space is spread evenly
throughout the table, or clustered together.  That might help figure out
what is going on.  And, is it the table itself that is growing, or the
index on it?
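
Something along these lines gives a rough picture of where the free space
sits (the table name is a placeholder):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
SELECT (blkno / 10000) * 10000 AS block_bucket, sum(avail) AS free_bytes
  FROM pg_freespace('your_big_table')
 GROUP BY 1 ORDER BY 1;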

Cheers,

Jeff


Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread Jeff Janes
On Wed, Feb 6, 2019 at 9:42 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Well, basically I'm trying to tune it because the table still keep
> growing. I thought that by setting the scale and the threshold it will be
> enough but its seems that it wasnt. I attached some of the logs output to
> hear what you guys think about it ..
>

Are all four log entries from well after you made the change?  My first
inclination is to think that the first two are from either before the change,
or just after the change when it is still settling into the new regime.
Also, is the table still continuing to grow, or is it at a new steady-state
of bloat which isn't growing but also isn't shrinking back to where you want
it to be?  More aggressive vacuuming alone should stop the bloat, but is
not likely to reverse it.

I habitually set vacuum_cost_page_hit and vacuum_cost_page_miss to zero.
Page reads are self-limiting (vacuum is single threaded, so you can't have
more than one read (times autovacuum_max_workers) going on at a time), so I
don't see a need to throttle them intentionally as well--unless your entire
db is sitting on one spindle.  Based on the high ratio of read rates to
write rates in the last two log entries, this change alone should be enough
to greatly speed up the run time of the vacuum.
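
Concretely, that is just (ALTER SYSTEM shown as one way to apply it;
editing postgresql.conf works as well):

ALTER SYSTEM SET vacuum_cost_page_hit = 0;
ALTER SYSTEM SET vacuum_cost_page_miss = 0;
SELECT pg_reload_conf();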

If you need to speed it up beyond that, I don't think it matters much
whether you decrease cost_delay or increase cost_limit; it is the ratio
that mostly matters.

And if these latter measures do work, you should consider undoing changes
to autovacuum_vacuum_scale_factor.  Reading the entire index just to remove
10,000 rows from the table is a lot of extra work that might be
unnecessary. Although that extra work might not be on anyone's critical
path.

>


Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread Jeff Janes
On Wed, Feb 6, 2019 at 5:29 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:


> Now the question is how to handle or tune it ? Is there any change that I
> need to increase the cost_limit / cost_delay ?
>

Sometimes vacuum has more work to do, so it takes more time to do it.

There is no indication of a problem. Or at least, you haven't described
one. So, there is nothing to handle or to tune.

If there is a problem, those log entries might help identify it.  But in
the absence of a problem, they are just log spam.

Cheers,

Jeff


Re: Will higher shared_buffers improve tpcb-like benchmarks?

2019-01-29 Thread Jeff Janes
On Tue, Jan 29, 2019 at 6:39 AM Saurabh Nanda 
wrote:

> Hi,
>
> I'm going crazy trying to optimise my Postgres config for a production
> setting [1] Once I realised random changes weren't getting my anywhere, I
> finally purchased PostgreSQL 10 - Higher Performance [2] and understood the
> impact of shared_buffers.
>
> IIUC, shared_buffers won't have any significant impact in the following
> scenario, right?
>
> -- DB size = 30GB
> -- shared_buffers = 2GB
> -- workload = tpcb-like
>
> This is because the tpcb-like workload selects & updates random rows from
> the DB [3]. Therefore, with a 2GB shared buffer, there is only a 6-7%
> chance (did I get my probability correct?) that the required data will be
> in the shared_buffer. Did I understand this correctly?
>

That is likely correct, but the data will likely be stored in the OS file
cache, so reading it from there will still be pretty fast.


>
> If nothing else becomes the bottleneck (eg. periodically writing dirty
> pages to disk), increasing the shared_buffers to 15GB+ should have a
> significant impact, for this DB-size and workload, right? (The system has
> 64 GB RAM)
>

About the only way to know for sure that writing dirty data is not the
bottleneck is to use a read only benchmark, such as the -S flag for
pgbench.  And at that point, the IPC overhead between pgbench and the
backend, even when both are running on the same machine, is likely to be
the bottleneck.  And after that, the bottleneck might shift to opening and
closing transactions and taking and releasing locks[1].

If you overcome that, then you might reliably see a difference between 2GB
and 15GB of shared buffers, because at 2GB each query to pgbench_accounts
is likely to fetch 2 pages into shared_buffers from the OS cache: the index
leaf page for pgbench_accounts_pkey, and the table page for
pgbench_accounts.  At 15GB, the entire index should be reliably in
shared_buffers (after enough warm-up time), so you would only need to fetch
1 page, and often not even that.
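
A select-only run looks something like this (client, thread, and duration
numbers are only illustrative):

pgbench -S -c 30 -j 8 -T 300 yourdb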

Cheers,

Jeff

[1]   I have a very old patch to pgbench that introduces a new query to
overcome this,
https://www.postgresql.org/message-id/BANLkTi%3DQBYOM%2Bzj%3DReQeiEKDyVpKUtHm6Q%40mail.gmail.com
.  I don't know how much work it would be to get it to compile against
newer versions--I stopped maintaining it because it became too much work to
rebase it past conflicting work, and because I lost interest in this line
of research.


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-29 Thread Jeff Janes
On Mon, Jan 28, 2019 at 12:03 AM Saurabh Nanda 
wrote:

> All this benchmarking has led me to a philosophical question, why does PG
> need shared_buffers in the first place?
>

PostgreSQL cannot let the OS get its hands on a dirty shared buffer until
the WAL record "protecting" that buffer has been flushed to disk.  If a
dirty shared buffer got written to disk, but then a crash happened before
the WAL record got flushed to disk, the data could be corrupted when the
server comes back up.  So shared_buffers effectively serves as a cooling
pond where dirty buffers wait for their WAL to be flushed naturally, so they
can be written without instigating a performance-reducing flush just for
them.

Also, concurrent clients need to access the same disk pages at overlapping
times without corrupting each other.  Perhaps that could be implemented with
just the buffer headers in shared memory to coordinate the locking, without
having the buffers themselves in shared memory.  But that is not how it is
currently implemented.


> What's wrong with letting the OS do the caching/buffering?
>

Nothing, and that is what it does.  Which is why the advice for
shared_buffers is often to use a small fraction of RAM, leaving the rest
for the OS to do its thing.  But PostgreSQL still needs a way to lock those
pages, both against concurrent access by its own clients, and against
getting flushed out of order by the OS.  There is no performant way to
release the dirty pages immediately to the OS while still constraining the
order in which the OS flushes them to disk.

Finally, while reading a page from the OS cache into shared_buffers is much
faster than reading it from disk, it is still much slower than finding it
already located in shared_buffers.  So if your entire database fits in RAM,
you will get better performance if shared_buffers is large enough for the
entire thing to fit in there, as well.  This is an exception to the rule
that shared_buffers should be a small fraction of RAM.


> Isn't it optimised for this kind of stuff?
>

Maybe.  But you might be surprised at how poorly optimized it is.  It depends
on your OS and version of it, of course.  If you have a high usage_count
buffer which is re-dirtied constantly, it will only get written and flushed
to disk once per checkpoint if under PostgreSQL control.  But I've seen
pages like that get written many times per second under kernel control.
Whatever optimization it tried to do, it wasn't very good at it.  Also, if
many contiguous pages are dirtied in a close time-frame, but not dirtied in
their physical order, the kernel should be able to re-order them into long
sequential writes, correct?  But empirically, it doesn't, at least back in
the late 2.* series kernels when I did the experiments.  I don't know if it
didn't even try, or tried but failed.  (Of course, back then, PostgreSQL
didn't do a good job of it either.)

Cheers,

Jeff


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-27 Thread Jeff Janes
>
>
> You could also try pg_test_fsync to get low-level information, to
>> supplement the high level you get from pgbench.
>
>
> Thanks for pointing me to this tool. never knew pg_test_fsync existed!
> I've run `pg_test_fsync -s 60` two times and this is the output -
> https://gist.github.com/saurabhnanda/b60e8cf69032b570c5b554eb50df64f8 I'm
> not sure what to make of it?
>

I don't know what to make of that either.  I'd expect fdatasync using two
8kB writes to be about the same throughput as using one 8kB write, but
instead it is 4 times slower.  Also, I'd expect open_datasync to get slower
by a factor of 2, not a factor of 8, when going from one to two 8kB writes
(that is not directly relevant, as you aren't using open_datasync, but is
curious nonetheless).  Is this reproducible with different run lengths?  I
wonder if your write cache (or something) gets "tired" during the first
part of pg_test_fsync and thus degrades the subsequent parts of the test.
I would say something in your IO stack is not optimal, maybe some component
is "consumer grade" rather than "server grade".  Maybe you can ask Hetzner
about that.


> The effects of max_wal_size are going to depend on how you have IO
>> configured, for example does pg_wal shared the same devices and controllers
>> as the base data?  It is mostly about controlling disk usage and
>> crash-recovery performance, neither of which is of primary importance to
>> pgbench performance.
>
>
>  The WAL and the data-directory reside on the same SSD disk -- is this a
> bad idea?
>

If you are trying to squeeze out every last bit of performance, then I
think it is a bad idea.  Or at least, it is something to try both ways and
see.  The flushing that occurs during checkpoints and the flushing that
occurs for every commit can interfere with each other.


> I was under the impression that smaller values for max_wal_size cause
> pg-server to do "maintenance work" related to wal rotation, etc. more
> frequently and would lead to lower pgbench performance.
>

If you choose ridiculously small values it would.  But once the value is
sufficient, increasing it further wouldn't do much.  Given your low level
of throughput, I would think the default is already sufficient.

Thanks for including the storage info.  Nothing about it stands out to me
as either good or bad, but I'm not a hardware maven; hopefully one will be
reading along and speak up.


> PS: Cc-ing the list back again because I assume you didn't intend for your
> reply to be private, right?
>

Yes, I had intended to include the list but hit the wrong button, sorry.

Cheers,

Jeff

>


Re: postgresql unix socket connections

2019-01-09 Thread Jeff Janes
On Wed, Jan 9, 2019 at 10:09 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Hey Tom,
> I'm aware of how I can solve it. I wanted to understand why after
> installing the pg 9.6 packages suddenly psql tries to access the socket on
> /var/run/postgresql. Does the libpq default unix socket is changed between
> those two versions ? (9.6,9.2)
>

It is not a version issue, but a packaging issue.  Different systems have
different conventions on where sockets should go, and the packager imposes
their opinion on the things they package.

Cheers,

Jeff


Re: postgresql unix socket connections

2019-01-09 Thread Jeff Janes
On Wed, Jan 9, 2019 at 3:35 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

>
> Now, In machine 1 when I run psql I get the prompt password but in machine
> 2 I keep getting the next error :
>
> psql: could not connect to server: No such file or directory
> Is the server running locally and accepting
> connections on Unix domain socket
> "/var/run/postgresql/.s.PGSQL.5432"?
>
> One important thing that I didnt mention, is that I installed in machine 2
> package postgresql-libs.x86_64 0:8.4.20-8.el6_9 from the postgres
> repository (in order to upgrade it to 9.6).
>

The frontend and the backend have compiled-in defaults for the socket
directory.  If you installed them from different sources, they may have
different compiled-in defaults, which means they may not be able to
rendezvous using the default settings on both sides.

You can override the default using unix_socket_directory on the server (as
you discovered).  On the client you can override it by using -h (or PGHOST
or host= or whatever mechanism), with an argument that looks like a
directory, starting with a '/'.
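
For example (the directory here is just an illustration; use whichever one
the server's unix_socket_directory actually points at):

psql -h /tmp -p 5432 yourdb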

Cheers,

Jeff


Re: Query Performance Issue

2018-12-29 Thread Jeff Janes
On Sat, Dec 29, 2018 at 1:58 AM David Rowley 
wrote:

> On Sat, 29 Dec 2018 at 04:32, Justin Pryzby  wrote:
> > I think the solution is to upgrade (at least) to PG10 and CREATE
> STATISTICS
> > (dependencies).
>
> Unfortunately, I don't think that'll help this situation. Extended
> statistics are currently only handled for base quals, not join quals.
> See dependency_is_compatible_clause().
>
>
But "recommended_content_id" and "version" are both in the same table,
doesn't that make them base quals?

The most obvious thing to me would be to vacuum
product_content_recommendation_main2 to get rid of the massive number of
heap fetches.  And to analyze everything to make sure the estimation errors
are not simply due to out-of-date stats.  And to increase work_mem.

It isn't clear we want to get rid of the nested loop; from the info we have
to go on, the hash join might be even slower yet.  Seeing the plan with
enable_nestloop=off could help there.

Cheers,

Jeff


Re: Optimizer choosing the wrong plan

2018-12-29 Thread Jeff Janes
On Sat, Dec 29, 2018 at 7:17 AM Jim Finnerty  wrote:


> Jeff, can you describe the changes that were made to ANALYZE in v11,
> please?
>
> I've found that running ANALYZE on v10 on the Join Order Benchmark, using
> the default statistics target of 100, produces quite unstable results, so
> I'd be interested to hear what has been improved in v11.
>

There are two paths the code can take.  One if all values which were
sampled at all were sampled at least twice, and another if the
least-sampled value was sampled exactly once.  For some distributions (like
exponential-ish or maybe power-law), it is basically a coin flip whether
the least-sampled value is seen once, or more than once.  If you are seeing
instability, it is probably for this reason.  That fundamental instability
was not addressed in v11.

Once you follow the "something seen exactly once" path, it has to decide
how many of the values get represented in the most-common-value list.  That
is where the change was.  The old method said a value had to have an
estimated prevalence at least 25% more than the average estimated
prevalence to get accepted into the list.  The problem is that if there
were a few dominant values, it wouldn't be possible for any others to be
"over-represented" because those few dominant values dragged the average
prevalence up so far nothing else could qualify.  What it was changed to
was to include a value in the most-common-value list if its
overrepresentation was statistically significant given the sample size.
The most significant change (from my perspective) is that
over-representation is measured not against all values, but only against
all values more rare in the sample then the one currently being considered
for inclusion into the MCV.  The old method basically said "all rare values
are the same", while the new method realizes that a rare value present
10,000 times in a billion row table is much different than a rare value
present 10 times in a billion row table.


It is possible that this change will fix the instability for you, because
it could cause the "seen exactly once" path to generate a MCV list which is
close enough in size to the "seen at least twice" path that you won't
notice the difference between them anymore.  But, it is also possible they
will still be different enough in size that it will still appear unstable.
It depends on your distribution of values.

Cheers,

Jeff


Re: PostgreSQL Read IOPS limit per connection

2018-12-27 Thread Jeff Janes
>
>
> *Performance issue:*
>
> I’m trying to figure out if PostgreSQL (PG) has some inherent limit on
> IOPS per connection.
>
> Running pgbench with multiple clients (-c 30) we are able to see 20K+ IOPS
> , which is what we expect. But, if we use just one client, we get 1200
> IOPS, avg disk queue size around 1:
>

The default transaction done by pgbench simply has no opportunity for
dispatching multiple io requests per connection.  It just a series of
single-row lookups and single-row updates or inserts.  You will have to use
a different benchmark if you want to exercise this area.  Probably
something analytics heavy.
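
As a rough sketch (the file name is made up), a custom script that does one
big sequential read per "transaction" would look like:

-- bigscan.sql
SELECT count(*) FROM pgbench_accounts;

and then, from the shell:

pgbench -n -f bigscan.sql -c 1 -T 60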

Also, you would want to use the newest version of PostgreSQL, as 9.6
doesn't have parallel query, which is much more generally applicable than
effective_io_concurrency is.

One of the issues I’m trying to solve is related to extracting data from a
> large table, which users a full table scan. We see the same 1200 IOPS limit
> of pgbench when we SELECT on this table using just one connection. If there
> is a limitation per connection, I might set up the application to have
> several connections, and then issue SELECTs for different sections of the
> table, and later join the data, but it looks cumbersome, especially if the
> DB can do extract data using more IOPS.
>
The kernel should detect a sequential read in progress and invoke
readahead.  That should be able to keep the CPU quite busy with data for
any decent IO system.  Are you sure IO is even the bottleneck for your
query?

Perhaps your kernel readahead settings need to be tuned.  Also, you may
benefit from parallel query features implemented in newer versions of
PostgreSQL.  In any event, the default transactions of pgbench are not
going to be useful for benchmarking what you care about.
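
On Linux, readahead can be checked and raised per block device, for example
(the device name is just an example):

blockdev --getra /dev/sdb
blockdev --setra 4096 /dev/sdb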

Cheers,

Jeff


Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0

2018-11-27 Thread Jeff Janes
On Tue, Nov 27, 2018 at 9:17 AM Sanyo Moura  wrote:

> Hi,
>
> I'm running performance tests for my application at version 11.1 and
> encountered
> queries with high planning time compared to the same planning, running at
> versions 10.5 and 11.0.
>

Can you reproduce the regression if the tables are empty?  If so, can you
share the create script that creates the tables?

Cheers,

Jeff

>


Re: Optimizer choosing the wrong plan

2018-11-26 Thread Jeff Janes
On Mon, Nov 26, 2018 at 5:11 AM Viswanath  wrote:

> *Postgres server version -  9.5.10*
> *RAM - 128 GB*
> *WorkMem 64 MB*
>
> *Problematic query with explain :*
> *Query 1 (original):*
> explain analyse SELECT myTable1.ID FROM myTable1 LEFT JOIN myTable2 ON
> myTable1.ID=myTable2.ID WHERE  myTable1.bool_val = true) AND
> (myTable1.small_intval IN (1,2,3))) AND ((*myTable2.bigint_val = 1*) AND
> (myTable1.bool_val = true))) AND (((myTable1.ID >= 1) AND
> (myTable1.ID <= 1)) ))  ORDER BY 1 DESC , 1 NULLS FIRST  LIMIT
> 11;
>

There is no point doing a LEFT JOIN when the NULL-extended rows get
filtered out later.

Also, ordering by the same column twice is peculiar, to say the least.


> The table myTable2 contains *12701952* entries. Out of which only *86227*
> is
> not null and *146* entries are distinct.
>

I assume you mean the column myTable2.ID has that many not null and
distinct?


>
> The above query returns 0 rows since 'myTable2.bigint_val = 1' criteria
> satisfies nothing. It takes 6 seconds for execution as the planner chooses*
> myTable1.ID column's index*.


More importantly, it chooses the index on myTable2.ID.  It does also use
the index on myTable1.ID, but that is secondary.

The ideal index for this query would probably be a composite index on
myTable2 (bigint_val, id DESC);
The planner would probably choose to use that index, even if the statistics
are off.
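
Something along these lines (a sketch, not tested against your schema):

CREATE INDEX ON myTable2 (bigint_val, id DESC);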

I tried running *vacuum analyse* table many times, tried changing the
> *statistics target of the column to 250 (since there are only 149 distinct
> values)*. But none worked out. The planner thinks that there are *1727*
> rows
> that matches the condition *myTable2.bigint_val = 1* but there are none.
>

It would be interesting if you can upgrade a copy of your server to v11 and
try it there.  We made changes to ANALYZE in that version which were
intended to improve this situation, and it would be nice to know if it
actually did so for your case.

Also, increasing statistics target even beyond 250 might help.  If every
one of the observed value is seen at least twice, it will trigger the
system to assume that it has observed all distinct values that exist.  But
if any of the values are seen exactly once, that causes it to take a
different path (the one which got modified in v11).

Cheers,

Jeff


Re: High CPU Usage of "SET ROLE"

2018-10-30 Thread Jeff Janes
On Tue, Oct 30, 2018 at 3:50 PM Ulf Lohbrügge 
wrote:


> When I use the psql cli on the same database I can see via "\timing" that
> the first statement after "RESET ROLE;" is significantly slower. I was even
> able to strip it down to two statements ("SET ROLE ...;" and "RESET ROLE;"):
>
> ...
>
Maybe my observations here are already sufficient to find out what happens
> here? I guess that my setup with 1k rows in pg_roles and 1.5m rows in
> pg_class is probably the cause.
>

It would probably be enough if it were reproducible, but I can't reproduce
it.

-- set up
perl -le 'print "create user foo$_;" foreach 1..1000'|psql
perl -le 'foreach $r (1..1000) {print "create schema foo$r authorization
foo$r;"}'|psql
perl -le 'foreach $r (reverse 1..1000) {print "set role foo$r;"; print
"create table foo$r.foo$_ (x serial primary key);" foreach 1..1000;}'|psql
> out

-- test
perl -le 'print "set role foo$_;\nreset role;" foreach 1..1000'|psql

Does it help when I create a test setup with a docker image that contains a
> database with that many entries in pg_roles and pg_class and share it here?
>

If you have a script to create the database, I'd be more likely to play
around with that than with a docker image.  (Which I have to guess would be
quite large anyway, with 1.5 million rows in pg_class)

Cheers,

Jeff

>


Re: Gained %20 performance after disabling bitmapscan

2018-10-26 Thread Jeff Janes
On Mon, Oct 22, 2018 at 3:20 AM Yavuz Selim Sertoglu <
yavuzselim.serto...@medyasoft.com.tr> wrote:

> Thanks for the reply Jeff,
>
> I know 20ms is nothing but it shows me that there is a problem with my
> configuration. I want to find it.
>

This is a dangerous assumption.  This is no configuration you can come up
with which will cause the planner to be within 20% of perfection in all
cases.  Given the other plans you've shown and discussed, I think this is
just chasing our own tail.

Cheers,

Jeff

>


Re: Gained %20 performance after disabling bitmapscan

2018-10-19 Thread Jeff Janes
On Fri, Oct 19, 2018 at 3:19 AM Yavuz Selim Sertoglu <
yavuzselim.serto...@medyasoft.com.tr> wrote:

> Hi all,
>
> I have a problem with my query. Query always using parallel bitmap heap
> scan. I've created an index with all where conditions and id but query does
> not this index and continue to use bitmapscan. So I decided disable bitmap
> scan for testing. And after that, things became strange. Cost is higher,
> execution time is lower.
>

A 20% difference in speed is unlikely to make or break you.  Is it even
worth worrying about?


> But I want to use index_only_scan because index have all column that query
> need. No need to access table.
>

Your table is not very well vacuumed, so there is need to access it (9010
times to get 6115 rows, which seems like quite an anti-feat; but I don't
know which of those numbers are averaged over loops/parallel workers,
versus summed over them). Vacuuming your table will not only make the
index-only scan look faster to the planner, but also actually be faster.

The difference in timing could easily be down to one query warming the
cache for the other.  Are these timings fully reproducible altering
execution orders back and forth?  And they have different degrees of
parallelism, what happens if you disable parallelism to simplify the
analysis?


> It is doing index_only_scan when disabling bitmap scan but I cannot
> disable bitmap scan for cluster wide. There are other queries...
> Can you help me to solve the issue?
>
>
Cranking up effective_cache_size can make index scans look better in
comparison to bitmap scans, without changing a lot of other stuff.  This
still holds even for index-only-scan, in cases where the planner knows the
table to be poorly vacuumed.

But moving the column tested for inequality to the end of the index would
probably make much more of a difference, regardless of which plan it
chooses.
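
In other words, keep the equality-tested columns first and the
inequality-tested column last, roughly (column names are placeholders):

CREATE INDEX ON my_table (eq_col1, eq_col2, range_col);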

Cheers,

Jeff

>


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Jeff Janes
On Fri, Sep 7, 2018 at 2:03 PM Patrick Molgaard  wrote:

>
> Hi Jeff,
>
> Thanks for your reply. Are locks relevant in this case, though?
>

I don't know, but why theorize when we can know for sure?  It at least
invokes VirtualXactLockTableInsert.  I don't see how that could block on a
heavyweight lock, though. But again, why theorize when logging it is simple?

Is it always the first statement in a connection which is blocking, or will
established connections also block at the same time the new ones start to
block?

Cheers,

Jeff

>


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Jeff Janes
On Fri, Sep 7, 2018 at 8:00 AM Patrick Molgaard  wrote:

> Hi folks,
>
> I've been seeing some curious behaviour on a postgres server I administer.
>
> Intermittently (one or two times a week), all queries on that host are
> simultaneously blocked for extended periods (10s of seconds).
>
> The blocked queries are trivial & not related to locking - I'm seeing
> slowlogs of the form:
>
> `LOG: duration: 22627.299 ms statement: SET client_encoding='''utf-8''';`
>
>
Do you have log_lock_waits set to on?  If not, you might want to turn it on.
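
It can be turned on without a restart, for example from a superuser session:

ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();
-- lock waits longer than deadlock_timeout (1s by default) will then be logged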

Cheers,

Jeff


Re: Performance difference in accessing differrent columns in a Postgres Table

2018-09-05 Thread Jeff Janes
On Wed, Sep 5, 2018 at 12:21 AM Dinesh Kumar  wrote:

> Hi All,
> I was wondering whether the case is solved or still continuing. As a
> Postgres newbie, I can't understand any of the terms (JIT, tuple
> deformation) as you mentioned above. Please anyone let me know , what is
> the current scenario.
>
>
JIT is just-in-time compilation, which will be new in v11.  Tuple
deforming is how you get the row from the on-disk format to the in-memory
format.

Some people see small improvements in tuple deforming using JIT in your
situation, some see large decreases, depending on settings and apparently
on hardware.  But regardless, JIT is not going to reduce your particular
use case (many nullable and actually null columns, referencing a
high-numbered column) down to being constant-time operation in the number
of preceding columns.  Maybe JIT will reduce the penalty for accessing a
high-numbered column by 30%, but won't reduce the penalty by 30 fold.  Put
your NOT NULL columns first and then most frequently accessed NULLable
columns right after them, if you can.

Cheers,

Jeff

>


Re: trying to delete most of the table by range of date col

2018-09-03 Thread Jeff Janes
>
> 4)delete in chunks :
> do $$
> declare
> rec integer;
> begin
> select count(*) from my_table into rec where end_date <=
> to_date('12/12/2018','DD/MM/') and end_date >
> to_date('11/12/2018','DD/MM/');
> while rec > 0 loop
DELETE FROM my_Table WHERE id IN (select id from my_table where end_date <=
> to_date('12/12/2018','DD/MM/') and end_date >
> to_date('11/12/2018','DD/MM/') limit 5000);
> rec := rec - 5000;
> raise notice '5000 records were deleted, current rows :%',rec;
> end loop;
>
> end;
> $$
> ;
>
> Execution time : 6 minutes.
>
> So, it seems that the second solution is the fastest one. It there a
> reason why the delete chunks (solution 4) wasnt faster?
>

Why would it be faster?  The same amount of work needs to get done, no
matter how you slice it.  Unless there is a specific reason to think it
would be faster, I would expect it won't be.

If you aren't willing to drop the constraints, then I think you just need
to resign yourself to paying the price of checking those constraints. Maybe
some future version of PostgreSQL will be able to do them in parallel.
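
If dropping them ever does become an option, a sketch of that route (all
names here are hypothetical) would be:

ALTER TABLE child_table DROP CONSTRAINT child_table_my_table_id_fkey;
DELETE FROM my_table
  WHERE end_date <= DATE '2018-12-12' AND end_date > DATE '2018-12-11';
ALTER TABLE child_table ADD CONSTRAINT child_table_my_table_id_fkey
  FOREIGN KEY (my_table_id) REFERENCES my_table (id);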

Cheers,

Jeff


Re: Bi-modal streaming replication throughput

2018-08-14 Thread Jeff Janes
On Tue, Aug 14, 2018 at 9:18 AM, Alexis Lê-Quôc  wrote:

> each running PG 9.3 on linux


That is the oldest version which is still supported.  There have been a lot
of improvements since then, including to performance.  You should see if an
upgrade solves the problem.  If not, at least you will have access to
better tools (like pg_stat_activity.wait_event_type), and people will be
more enthusiastic about helping you figure it out knowing it is not an
already-solved problem.


>
> Here are some settings that may help and a perf profile of a recovery
> process that runs without any competing read traffic processing the INSERT
> backlog (I don't unfortunately have the same profile on a lagging read
> replica).
>

Unfortunately a perf profile from when the problem is not occurring won't be
very helpful.  You need it from when the problem is occurring.  Also, I find
strace and gdb to be more helpful than perf in this type of situation where
you already know it is not CPU bound, although perhaps that is just my own
lack of skill with perf.  You need to know why it is not on the CPU, not
what it is doing when it is on the CPU.

Were the settings you showed all of the non-default settings?

I assume max_standby_streaming_delay is at the default value of 30s?  Are
you getting query cancellations due conflicts with recovery, or anything
else suspicious in the log?  What is the maximum lag you see measured in
seconds?

Cheers,

Jeff


Re: Performance difference in accessing differrent columns in a Postgres Table

2018-08-01 Thread Jeff Janes
On Mon, Jul 30, 2018 at 3:02 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-07-30 13:31:33 -0400, Jeff Janes wrote:
> > I don't know where the time is going with the as-committed JIT.  None of
> > the JIT-specific timings reported by EXPLAIN (ANALYZE) add up to anything
> > close to the slow-down I'm seeing.  Shouldn't compiling and optimization
> > time show up there?
>
> As my timings showed, I don't see the slowdown you're reporting. Could
> you post a few EXPLAIN ANALYZEs?
>


I don't think you showed any timings where jit_above_cost < query cost <
jit_optimize_above_cost, which is where I saw the slow down.  (That is also
where things naturally land for me using default settings)

I've repeated my test case on a default build (./configure --with-llvm
--prefix=) and default postgresql.conf, using the post-11BETA2 commit
5a71d3e.


I've attached the full test case, and the full output.

Here are the last two executions, with jit=on and jit=off, respectively.
Doing it with TIMING OFF doesn't meaningfully change things, nor does
increasing shared_buffers beyond the default.



QUERY PLAN
--
 Seq Scan on i200c200  (cost=0.00..22.28 rows=828 width=16) (actual
time=29.317..11966.291 rows=1000 loops=1)
 Planning Time: 0.034 ms
 JIT:
   Functions: 2
   Generation Time: 1.589 ms
   Inlining: false
   Inlining Time: 0.000 ms
   Optimization: false
   Optimization Time: 9.002 ms
   Emission Time: 19.948 ms
 Execution Time: 12375.493 ms
(11 rows)

Time: 12376.281 ms (00:12.376)
SET
Time: 1.955 ms
   QUERY PLAN

 Seq Scan on i200c200  (cost=0.00..22.28 rows=828 width=16) (actual
time=0.063..3897.302 rows=1000 loops=1)
 Planning Time: 0.037 ms
 Execution Time: 4292.400 ms
(3 rows)

Time: 4293.196 ms (00:04.293)

Cheers,

Jeff


wide.sql
Description: Binary data


wide.out
Description: Binary data


Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-31 Thread Jeff Janes
On Mon, Jul 30, 2018 at 1:23 PM, Andres Freund  wrote:

> On 2018-07-30 07:19:07 -0400, Jeff Janes wrote:
>
> > And indeed, in my hands JIT makes it almost 3 times worse.
>
> Not in my measurement. Your example won't use JIT at all, because it's
> below the cost threshold. So I think you might just be seeing cache +
> hint bit effects?
>

No, it is definitely JIT.  The explain plans show it, and the cost of the
query is 230,000 while the default setting of jit_above_cost is 100,000.
It is fully reproducible by repeatedly toggling the JIT setting.  It
doesn't seem to be the cost of compiling the code that slows it down (I'm
assuming the code is compiled once per tuple descriptor, not once per
tuple), but rather the efficiency of the compiled code.



>
> > Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
> > execution and 4594.994 ms for the JIT=off.
>
> Even with a debug LLVM build, which greatly increases compilation
> overhead, I actually see quite the benefit when I force JIT to be used:
>

I don't see a change when I compile without --enable-debug and with
jit_debugging_support off (in 11beta2 that setting doesn't exist).  How can I
know if I have a debug LLVM build, and turn it off if I do?


>
>
> postgres[26832][1]=# ;SET jit_above_cost = -1; set jit_optimize_above_cost
> = 0; set jit_inline_above_cost = 0;
> postgres[26832][1]=# explain (analyze, buffers, timing off) select pk,
> int200 from i200c200;
>

Lowering jit_optimize_above_cost does redeem this for me.  It brings it
back to being a tie with JIT=OFF.  I don't see any further improvement by
lowering jit_inline_above_cost, and overall it is just a statistical tie
with JIT=off, not an improvement as you get, but at least it isn't a
substantial loss.

Under what conditions would I want to do jit without doing optimizations on
it?  Is there a rule of thumb that could be documented, or do we just use
the experimental method for each query?

I don't know how sensitive JIT is to hardware.  I'm using Ubuntu 16.04 on
VirtualBox (running on Windows 10) on an i5-7200U, which might be important.

I had previously done a poor-man's JIT where I created 4 versions of the
main 'for' loop in slot_deform_tuple.  I did a branch on "if(hasnulls)",
and then each branch had two loops, one for when 'slow' is false, and then
one for after 'slow' becomes true so we don't have to keep setting it true
again once it already is, in a tight loop.  I didn't see noticeable
improvement there (although perhaps I would have on different hardware), so
didn't see how JIT could help with this almost-entirely-null case.  I'm not
trying to address JIT in general, just as it applies to this particular
case.

Unrelated to JIT and relevant to the 'select pk, int199' case but not the
'select pk, int200' case, it seems we have gone to some length to make slot
deforming be efficient for incremental use, but then just deform in bulk
anyway up to maximum attnum used in the query, at least in this case.  Is
that because incremental deforming is not cache efficient?

Cheers,

Jeff


Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-30 Thread Jeff Janes
On Mon, Jul 30, 2018 at 12:01 PM, Pavel Stehule 
wrote:

>
>
> 2018-07-30 13:19 GMT+02:00 Jeff Janes :
>
>> On Mon, Jul 30, 2018 at 12:11 AM, Pavel Stehule 
>> wrote:
>>
>>> 2018-07-30 1:00 GMT+02:00 Tom Lane :
>>>
>>>> David Rowley  writes:
>>>> > On 29 July 2018 at 17:38, Dinesh Kumar  wrote:
>>>> >> I found performance variance between accessing int1 and int200
>>>> column which
>>>> >> is quite large.
>>>>
>>>> > Have a look at slot_deform_tuple and heap_deform_tuple. You'll see
>>>> > that tuples are deformed starting at the first attribute. If you ask
>>>> > for attribute 200 then it must deform 1-199 first.
>>>>
>>>> Note that that can be optimized away in some cases, though evidently
>>>> not the one the OP is testing.  From memory, you need a tuple that
>>>> contains no nulls, and all the columns to the left of the target
>>>> column have to be fixed-width datatypes.  Otherwise, the offset to
>>>> the target column is uncertain, and we have to search for it.
>>>>
>>>
>>> JIT decrease a overhead of this.
>>>
>>
>> The bottleneck here is such a simple construct, I don't see how JIT could
>> improve it by much.
>>
>> And indeed, in my hands JIT makes it almost 3 times worse.
>>
>> Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
>> execution and 4594.994 ms for the JIT=off.
>>
>
> look on http://www.postgresql-archive.org/PATCH-LLVM-tuple-
> deforming-improvements-td6029385.html thread, please.
>
>
The opt1 patch did get performance back to "at least do no harm" territory,
but it didn't improve over JIT=off.  Adding the other two didn't get any
further improvement.

I don't know where the time is going with the as-committed JIT.  None of
the JIT-specific timings reported by EXPLAIN (ANALYZE) add up to anything
close to the slow-down I'm seeing.  Shouldn't compiling and optimization
time show up there?

Cheers,

Jeff


Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-30 Thread Jeff Janes
On Mon, Jul 30, 2018 at 12:11 AM, Pavel Stehule 
wrote:

> 2018-07-30 1:00 GMT+02:00 Tom Lane :
>
>> David Rowley  writes:
>> > On 29 July 2018 at 17:38, Dinesh Kumar  wrote:
>> >> I found performance variance between accessing int1 and int200 column
>> which
>> >> is quite large.
>>
>> > Have a look at slot_deform_tuple and heap_deform_tuple. You'll see
>> > that tuples are deformed starting at the first attribute. If you ask
>> > for attribute 200 then it must deform 1-199 first.
>>
>> Note that that can be optimized away in some cases, though evidently
>> not the one the OP is testing.  From memory, you need a tuple that
>> contains no nulls, and all the columns to the left of the target
>> column have to be fixed-width datatypes.  Otherwise, the offset to
>> the target column is uncertain, and we have to search for it.
>>
>
> JIT decrease a overhead of this.
>

The bottleneck here is such a simple construct, I don't see how JIT could
improve it by much.

And indeed, in my hands JIT makes it almost 3 times worse.

Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
execution and 4594.994 ms for the JIT=off.

Cheers,

Jeff
drop table if exists i200c200;
create table i200c200 ( pk bigint primary key, 
int1 bigint,
int2 bigint,
int3 bigint,
int4 bigint,
int5 bigint,
int6 bigint,
int7 bigint,
int8 bigint,
int9 bigint,
int10 bigint,
int11 bigint,
int12 bigint,
int13 bigint,
int14 bigint,
int15 bigint,
int16 bigint,
int17 bigint,
int18 bigint,
int19 bigint,
int20 bigint,
int21 bigint,
int22 bigint,
int23 bigint,
int24 bigint,
int25 bigint,
int26 bigint,
int27 bigint,
int28 bigint,
int29 bigint,
int30 bigint,
int31 bigint,
int32 bigint,
int33 bigint,
int34 bigint,
int35 bigint,
int36 bigint,
int37 bigint,
int38 bigint,
int39 bigint,
int40 bigint,
int41 bigint,
int42 bigint,
int43 bigint,
int44 bigint,
int45 bigint,
int46 bigint,
int47 bigint,
int48 bigint,
int49 bigint,
int50 bigint,
int51 bigint,
int52 bigint,
int53 bigint,
int54 bigint,
int55 bigint,
int56 bigint,
int57 bigint,
int58 bigint,
int59 bigint,
int60 bigint,
int61 bigint,
int62 bigint,
int63 bigint,
int64 bigint,
int65 bigint,
int66 bigint,
int67 bigint,
int68 bigint,
int69 bigint,
int70 bigint,
int71 bigint,
int72 bigint,
int73 bigint,
int74 bigint,
int75 bigint,
int76 bigint,
int77 bigint,
int78 bigint,
int79 bigint,
int80 bigint,
int81 bigint,
int82 bigint,
int83 bigint,
int84 bigint,
int85 bigint,
int86 bigint,
int87 bigint,
int88 bigint,
int89 bigint,
int90 bigint,
int91 bigint,
int92 bigint,
int93 bigint,
int94 bigint,
int95 bigint,
int96 bigint,
int97 bigint,
int98 bigint,
int99 bigint,
int100 bigint,
int101 bigint,
int102 bigint,
int103 bigint,
int104 bigint,
int105 bigint,
int106 bigint,
int107 bigint,
int108 bigint,
int109 bigint,
int110 bigint,
int111 bigint,
int112 bigint,
int113 bigint,
int114 bigint,
int115 bigint,
int116 bigint,
int117 bigint,
int118 bigint,
int119 bigint,
int120 bigint,
int121 bigint,
int122 bigint,
int123 bigint,
int124 bigint,
int125 bigint,
int126 bigint,
int127 bigint,
int128 bigint,
int129 bigint,
int130 bigint,
int131 bigint,
int132 bigint,
int133 bigint,
int134 bigint,
int135 bigint,
int136 bigint,
int137 bigint,
int138 bigint,
int139 bigint,
int140 bigint,
int141 bigint,
int142 bigint,
int143 bigint,
int144 bigint,
int145 bigint,
int146 bigint,
int147 bigint,
int148 bigint,
int149 bigint,
int150 bigint,
int151 bigint,
int152 bigint,
int153 bigint,
int154 bigint,
int155 bigint,
int156 bigint,
int157 bigint,
int158 bigint,
int159 bigint,
int160 bigint,
int161 bigint,
int162 bigint,
int163 bigint,
int164 bigint,
int165 bigint,
int166 bigint,
int167 bigint,
int168 bigint,
int169 bigint,
int170 bigint,
int171 bigint,
int172 bigint,
int173 bigint,
int174 bigint,
int175 bigint,
int176 bigint,
int177 bigint,
int178 bigint,
int179 bigint,
int180 bigint,
int181 bigint,
int182 bigint,
int183 bigint,
int184 bigint,
int185 bigint,
int186 bigint,
int187 bigint,
int188 bigint,
int189 bigint,
int190 bigint,
int191 bigint,
int192 bigint,
int193 bigint,
int194 bigint,
int195 bigint,
int196 bigint,
int197 bigint,
int198 bigint,
int199 bigint,
int200 bigint,
char1 varchar(255),
char2 varchar(255),
char3 varchar(255),
char4 varchar(255),
char5 varchar(255),
char6 varchar(255),
char7 varchar(255),
char8 varchar(255),
char9 varchar(255),
char10 varchar(255),
char11 varchar(255),
char12 varchar(255),
char13 varchar(255),
char14 varchar(255),
char15 varchar(255),
char16 varchar(255),
char17 varchar(255),
char18 varchar(255),
char19 varchar(255),
char20 varchar(255),
char21 varchar(255),
char22 varchar(255),
char23 varchar(255),
char24 varchar(255),
char25 varchar(255),
char26 varchar(255),
char27 varchar(255),
char28 varchar(255),
char29 varchar(255),
char30 varchar(255),
char31 varchar(255),
char32 varchar(255),
char33 varchar(255),
char34 varchar(255),
char35 varchar(255),
char36 varchar(255),
char37 varchar(255),
char38 varchar(255),
char39 varchar(255),
char40 varchar(255),
char41 varchar(255),
char42 

Re: Improving Performance of Query ~ Filter by A, Sort by B

2018-07-17 Thread Jeff Janes
On Mon, Jul 16, 2018 at 5:29 PM, Lincoln Swaine-Moore <
lswainemo...@gmail.com> wrote:

> Tom and Jeff,
>
> Thanks very much for the suggestions!
>
> Here's what I've found so far after playing around for a few more days:
>
> What is your default_statistics_target?  What can you tell us about the
>> distribution of parent_id?  (exponential, power law, etc?).  Can you show
>> the results for select * from pg_stats where tablename='a' and
>> attname='parent_id' \x\g\x  ?
>
>
> The default_statistics_target is 500, which I agree seems quite
> insufficient for these purposes. I bumped this up to 2000, and saw some
> improvement in the row count estimation, but pretty much the same query
> plans. Unfortunately the distribution of counts is not intended to be
> correlated to parent_id, which is one reason I imagine the histograms might
> not be particularly effective unless theres one bucket for every value.
> Here is the output you requested:
>
> select * from pg_stats where tablename='a' and attname='parent_id';
>
> schemaname | public
> tablename  | a
> attname| parent_id
> inherited  | t
> null_frac  | 0
> avg_width  | 4
> n_distinct | 18871
> most_common_vals   | {15503,49787,49786,24595,49784,17549, ...} (2000
> values)
> most_common_freqs  | {0.0252983,0.02435,0.0241317,
> 0.02329,0.019095,0.0103967,0.00758833,0.004245, ...} (2000 values)
>

You showed the 8 most common frequencies.  But could you also show the last
couple of them?  When your queried parent_id value is not on the MCV list,
it is the frequency of the least frequent one on the list which puts an
upper limit on how frequent the one you queried for can be.
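
A quick way to look at the end of that list (just a sketch):

SELECT most_common_freqs[array_length(most_common_freqs, 1)]
FROM pg_stats
WHERE tablename = 'a' AND attname = 'parent_id';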



> A few questions re: statistics:
>  1) would it be helpful to bump column statistics to, say, 20k (the number
> of distinct values of parent_id)?
>

Only one way to find out...
However you can only go up to 10k, not 20k.



>  2) is the discrepancy between the statistics on the parent and child
> table be expected? certainly I would think that the statistics would be
> different, but I would've imagined they would have histograms of the same
> size given the settings being the same.
>

Is the n_distinct estimate accurate for the partition?  There is an
algorithm (which will change in v11) to stop the MCV from filling the
entire statistics target size if it thinks adding more won't be useful.
But I don't know why the histogram boundary list would be short.  But, I
doubt that that is very important here.  The histogram is only used for
inequality/range, not for equality/set membership.



>  3) is there a way to manually specify the the distribution of rows to be
> even? that is, set the frequency of each value to be ~ n_rows/n_distinct.
> This isn't quite accurate, but is a reasonable assumption about the
> distribution, and might generate better query plans.
>


This would be going in the wrong direction.  Your queries seem to
preferentially use rare parent_ids, not typical parent_ids.  In fact, it
seems like many of your hard-coded parent_ids don't exist in the table at
all.  That certainly isn't going to help the planner any.  Could you
somehow remove those before constructing the query?

You might also take a step back, where is that list of parent_ids coming
from in the first place, and why couldn't you convert the list of literals
into a query that returns that list naturally?


> You could try reversing the order and adding a column to be (tmstmp,
>> parent_id, id) and keeping the table well vacuumed.  This would allow the
>> slow plan to still walk the indexes in tmstmp order but do it as an
>> index-only scan, so it could omit the extra trip to the table. That trip to
>> the table must be awfully slow to explain the numbers you show later in the
>> thread.
>
>
> Just to clarify, do you mean building indexes like:
> CREATE INDEX "a_tmstmp_parent_id_id_idx_[PART_KEY]" on
> "a_partition[PART_KEY]" USING btree("tmstmp", "parent_id", "id")
> That seems promising! Is the intuition here that we want the first key of
> the index to be the one we are ultimately ordering by? Sounds like I make
> have had that flipped initially. My understanding of this whole situation
> (and please do correct me if this doesn't make sense) is the big bottleneck
> here is reading pages from disk (when looking at stopped up queries, the
> wait_event is DataFileRead), and so anything that can be done to minimize
> the pages read will be valuable. Which is why I would love to get the query
> plan to use the tmstmp index without having to filter thereafter by
> parent_id.
>

Yes, that is the index.

You really want it to filter by parent_id in the index, rather than going
to the table to do the filter on parent_id.  The index pages with tmstmp as
the leading column are going to be more tightly packed with potentially
relevant rows, while the table pages are less likely to be densely packed.
So filtering in the index touches far fewer pages overall.

Re: Why HDD performance is better than SSD in this case

2018-07-17 Thread Jeff Janes
On Tue, Jul 17, 2018 at 1:00 AM, Neto pr  wrote:

> Dear,
> Some of you can help me understand this.
>
> This query plan is executed in the query below (query 9 of TPC-H
> Benchmark, with scale 40, database with approximately 40 gb).
>
> The experiment consisted of running the query on a HDD (Raid zero).
> Then the same query is executed on an SSD (Raid Zero).
>
> Why did the HDD (7200 rpm)  perform better?
> HDD - TIME 9 MINUTES
> SSD - TIME 15 MINUTES
>
> As far as I know, the SSD has read speeds 300 times faster than the HDD.
>

Is the 300 times faster comparing random to random, or sequential to
sequential?  Maybe your SSD simply fails to perform as advertised.  This
would not surprise me at all.

To remove some confounding variables, can you turn off parallelism and
repeat the queries?  (Yes, they will probably get slower.  But are the
relative timings still the same?)  Also, turn on track_io_timing and
repeat the "EXPLAIN (ANALYZE, BUFFERS)", perhaps with TIMING OFF.

Also, see how long it takes to read the entire database, or just the
largest table, outside of postgres.

Something like:

time tar -cf - $PGDATA/base | wc -c

or

time cat $PGDATA/base//* | wc -c

Cheers,

Jeff


Re: Improving Performance of Query ~ Filter by A, Sort by B

2018-07-14 Thread Jeff Janes
On Tue, Jul 10, 2018 at 11:07 AM, Lincoln Swaine-Moore <
lswainemo...@gmail.com> wrote:

>
>
>
> Something about the estimated row counts (this problem persisted after I
> tried ANALYZEing)
>

What is your default_statistics_target?  What can you tell us about the
distribution of parent_id?  (exponential, power law, etc?).  Can you show
the results for select * from pg_stats where tablename='a' and
attname='parent_id' \x\g\x  ?


> forces usage of the tmstmp index and Merge Append (which seems wise) but
> also a filter condition on parent_id over an index condition, which is
> apparently prohibitively slow.
>
> I tried creating a multicolumn index like:
>
> CREATE INDEX "a_partition1_parent_and_tmstmp_idx" on "a_partition2" USING
> btree ("parent_id", "tmstmp" DESC);
>
> But this didn't help (it wasn't used).
>

You could try reversing the order and adding a column to be (tmstmp,
parent_id, id) and keeping the table well vacuumed.  This would allow the
slow plan to still walk the indexes in tmstmp order but do it as an
index-only scan, so it could omit the extra trip to the table. That trip to
the table must be awfully slow to explain the numbers you show later in the
thread.

...


> This query plan (which is the same as when LIMIT is removed) has been a
> good short term solution when the number of "parent_id"s I'm using is still
> relatively small, but unfortunately queries grow untenably slow as the
> number of "parent_id"s involved increases:
>

What happens when you remove that extra order by phrase that you added?
The original slow plan should become much faster when the number of
parent_ids is large (it has to dig through fewer index entries before
accumulating 20 good ones), so you should try going back to that.

...


> I'd be very grateful for help with one or both of these questions:
> 1) Why is adding an unnecessary (from the perspective of result
> correctness) ORDER BY valuable for forcing the parent_id index usage, and
> does that indicate that there is something wrong with my
> table/indexes/statistics/etc.?
>

It finds the indexes on tmstmp to be falsely attractive, as it can walk in
tmstmp order and so avoid the sort. (Really not the sort itself, but the
fact that sort has to first read every row to be sorted, while walking an
index can abort once the LIMIT is satisfied).  Adding an extra phrase to
the ORDER BY means the index is no longer capable of delivering rows in the
needed order, so it no longer looks falsely attractive.  The same thing
could be obtained by doing a dummy operation, such as ORDER BY tmstmp + '0
seconds' DESC.  I prefer that method, as it is more obviously a tuning
trick.  Adding in "id" looks more like a legitimate desire to break any
ties that might occasionally occur in tmstmp.

As Tom pointed out, there clearly is something wrong with your statistics,
although we don't know what is causing it to go wrong.  Fixing the
statistics isn't guaranteed to fix the problem, but it would be a good
start.




> 2) Is there any way I can improve my query time when there are many
> "parent_id"s involved? I seem to only be able to get the query plan to use
> at most one of the parent_id index and the tmstmp index at a time. Perhaps
> the correct multicolumn index would help?
>

A few things mentioned above might help.

But if they don't, is there any chance you could redesign your partitioning
so that all of the parent_ids queried together will always be in the same
partition?  And if not, could you just get rid of the partitioning
altogether?  1e7 rows is not all that many and doesn't generally need
partitioning.  Unless it is serving a specific purpose, it is probably
costing you more than you are getting.

Finally, could you rewrite it as a join to a VALUES list, rather than as an
in-list?
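
Roughly along these lines (a sketch; the ids are made up):

SELECT a.*
FROM a
JOIN (VALUES (49787), (49786), (24595)) AS v(parent_id)
  ON a.parent_id = v.parent_id
ORDER BY a.tmstmp DESC
LIMIT 20;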

Cheers,

Jeff


Re: Sort is generating rows

2018-05-31 Thread Jeff Janes
On Thu, May 31, 2018 at 7:22 AM, Nicolas Seinlet 
wrote:

> Hi,
>
> I have a query with a strange query plan.
>
> This query is roughly searching for sales, and convert them with a
> currency rate. As currency rate changes from time to time, table contains
> the currency, the company, the rate, the start date of availability of this
> rate and the end date of availability.
>
> The join is done using :
> left join currency_rate cr on (cr.currency_id = pp.currency_id and
>   cr.company_id = s.company_id and
>   cr.date_start <= coalesce(s.date_order, now()) and
>  (cr.date_end is null or cr.date_end > coalesce(s.date_order,
> now(
>
> The tricky part is the date range on the currency rate, which is not an
> equality.
>
> the query plan shows:
> ->  Sort  (cost=120.13..124.22 rows=1637 width=56) (actual
> time=14.300..72084.758 rows=308054684 loops=1)
>   Sort Key: cr.currency_id, cr.company_id
>   Sort Method: quicksort  Memory: 172kB
>   ->  CTE Scan on currency_rate cr
> (cost=0.00..32.74 rows=1637 width=56) (actual time=1.403..13.610 rows=1576
> loops=1)
>
> There's 2 challenging things :
> - planner estimates 1637 rows, and get 300 million lines
> - sorting is generating lines
>

These are both explained by the same thing.  The sort is feeding into a
merge join.  For every row in the other node which has the same value of
the scan keys, the entire section of this sort with those same keys gets
scanned again.  The repeated scanning gets counted in the actual row count,
but isn't counted in the expected row count, or in the actual row count of
the thing feeding into the sort (the CTE).


>
>
> For now, the more currency rates, the slower the query. There's not that
> many currency rates (1k in this case), as you can only have one rate per
> day per currency.
>

If it is only per currency per day, then why is company_id present? In any
case, you might be better off listing the rates per day, rather than as a
range, and then doing an equality join.
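
Roughly (a sketch, assuming a hypothetical currency_rate_per_day table
materialized with one row per currency per day):

left join currency_rate_per_day cr
  on cr.currency_id = pp.currency_id
 and cr.rate_date = coalesce(s.date_order, now())::date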

Cheers,

Jeff


Re: Memory size

2018-03-11 Thread Jeff Janes
On Sun, Mar 11, 2018 at 10:33 AM, dangal  wrote:

> jeff thank you very much for your time, I tell you, they are the same
> queries
> with the same parameters, I take 3 minutes for example, but I execute it
> and
> it takes me seconds, that's why I suspect it is the shared buffer
> The server had 16 GB and we increased it to 24, but I really do not know if
> it should continue to increase since they are not our own resources, we
> have
> to ask for them and justify them
>

If that is the only query that you have trouble with, it might be easiest
just to set up a cron job to run it periodically just to keep that data set
in cache.  Not very elegant, but it can be effective.
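
For example, a crontab entry along these lines (the path, schedule, and
database name are made up):

*/15 * * * *  psql -d mydb -f /path/to/warming_query.sql >/dev/null 2>&1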

Cheers,

Jeff


Re: effective_io_concurrency on EBS/gp2

2018-01-31 Thread Jeff Janes
On Wed, Jan 31, 2018 at 4:03 AM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:

>
> The results look really confusing to me in two ways. The first one is that
> I've seen recommendations to set effective_io_concurrency=256 (or more) on
> EBS.


I would not expect this to make much of a difference on a table which is
perfectly correlated with the index.  You would have to create an accounts
table which is randomly ordered to have a meaningful benchmark of the eic
parameter.

I don't know why the default for eic is 1.  It seems like that just turns
on the eic mechanism, without any hope of benefiting from it.

Cheers,

Jeff


Re: Query is slow when run for first time; subsequent execution is fast

2018-01-16 Thread Jeff Janes
On Fri, Jan 12, 2018 at 12:03 AM, Nandakumar M  wrote:

> Hello Jeff,
>
> Thanks for the insights.
>
> >Don't keep closing and reopening connections.
>
> Even if I close a connection and open a new one and execute the same
> query, the planning time is considerably less than the first time. Only
> when I restart the Postgres server then I face high planning time again.
>

Oh.  I've not seen that before.  But then again I don't often restart my
server and then immediately run very large queries with a stringent time
deadline.

You can try pg_prewarm on the pg_statistic table and its index.  But I'd
probably just put an entry in my db startup script to run this query
immediately after starting the server, and let the query warm the cache
itself.
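
A sketch of that using pg_prewarm (the index name below is the standard one,
but check yours with \d pg_statistic):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('pg_statistic');
SELECT pg_prewarm('pg_statistic_relid_att_inh_index');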

Why do you restart your database often enough for this to be an issue?

Cheers,

Jeff


Re: Query is slow when run for first time; subsequent execution is fast

2018-01-10 Thread Jeff Janes
On Wed, Jan 10, 2018 at 3:59 AM, Nandakumar M  wrote:

>
> I am not using prepared statements. Postgres documentation and previous
> questions in the pgsql-performance mailing list mention that the query plan
> is cached only when prepared statements are used.
>
> https://www.postgresql.org/message-id/15600.1346885470%40sss.pgh.pa.us
>
> In the above thread Tom Lane mentions that the plan is never cached for
> raw queries. Yet, this is exactly what seems to be happening in my case. Am
> I missing something?
>

The query plan itself is not cached, but all the metadata about the (large
number of) tables used in the query is cached.  Apparently reading/parsing
that data is the slow step, not coming up with the actual plan.

> Please let me know how I can make sure the query execution for the first
time is fast too.

Don't keep closing and reopening connections.  Use a connection pooler
(pgbouncer, pgpool, whatever pooler is built into your
language/library/driver, etc.) if necessary to accomplish this.

Cheers,

Jeff


Re: primary key hash index

2018-01-04 Thread Jeff Janes
On Tue, Jan 2, 2018 at 6:02 AM, Rick Otten  wrote:

> After reading this article about keys in relational databases, highlighted
> on hacker news this morning:
> https://begriffs.com/posts/2018-01-01-sql-keys-in-depth.html
>
> I keep pondering the performance chart, regarding uuid insert, shown
> towards the bottom of the article.  I believe he was doing that test with
> PostgreSQL.
>
> My understanding is that the performance is degrading because he has a
> btree primary key index.  Is it possible to try a hash index or some other
> index type for a uuid primary key that would mitigate the performance issue
> he is recording?
>

Hash indexes do not yet support primary keys, but you could always test it
with just a plain index, since you already know the keys are unique via
the way they are constructed.  But I wouldn't expect any real improvement.
Hash indexes still trigger FPW and still dirty massive numbers of pages in
a random fashion (even worse than btree does as far as randomness goes but
since the hash is more compact maybe more of the pages will be re-dirtied
and so save on FPW or separate writes).  I was surprised that turning off
FPW was so effective for him; that suggests that maybe his checkpoints are
too close together, which I guess means max_wal_size is too low.
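
If you want to repeat his insert test with a hash index instead, a minimal
sketch would be (names made up; gen_random_uuid() comes from pgcrypto):

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE uuid_test (id uuid NOT NULL);
CREATE INDEX uuid_test_id_hash ON uuid_test USING hash (id);
INSERT INTO uuid_test
  SELECT gen_random_uuid() FROM generate_series(1, 1000000);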

Cheers,

Jeff


Re: Batch insert heavily affecting query performance.

2017-12-27 Thread Jeff Janes
On Sun, Dec 24, 2017 at 11:51 AM, Jean Baro  wrote:

> Hi there,
>
> We are testing a new application to try to find performance issues.
>
> AWS RDS m4.large 500GB storage (SSD)
>

Is that general purpose SSD, or provisioned IOPS SSD?  If provisioned, what
is the level of provisioning?

Cheers,

Jeff


Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Jeff Janes
On Dec 3, 2017 15:31, "Tom Lane" <t...@sss.pgh.pa.us> wrote:

Jeff Janes <jeff.ja...@gmail.com> writes:
> On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby <pry...@telsasoft.com>
wrote:
>> It thinks there's somewhat-high correlation since it gets a list of x
>> and y values (integer positions by logical and physical sort order) and
>> 90% of the x list (logical value) are the same value ('t'), and the
>> CTIDs are in order on the new index, so 90% of the values are 100%
>> correlated.

> But there is no index involved (except in the case of the functional
> index).  The correlation of table columns to physical order of the table
> doesn't depend on the existence of an index, or the physical order within
> an index.

> But I do see that ties within the logical order of the column values are
> broken to agree with the physical order.  That is wrong, right?  Is there
> any argument that this is desirable?

Uh ... what do you propose doing instead?  We'd have to do something with
ties, and it's not so obvious this way is wrong.


Let them be tied.  If there are 10 distinct values, number the values 0 to
9, and all rows of a given distinct value get the same number for the
logical order axis.

Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
me.  Although if we switched btree to store duplicate values with tid as a
tie breaker, then maybe it wouldn't be as obviously wrong.

Cheers,

Jeff


Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Jeff Janes
On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby <pry...@telsasoft.com> wrote:

> On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> > I think the non-extended stats code also has trouble with booleans.
> > pg_stats gives me a correlation  of 0.8 or higher for the flag column.
>
> It's not due to the boolean though; you see the same thing if you do:
> CREATE INDEX aaa_f ON aaa((flag::text));
> ANALYZE aaa;
> correlation | 0.81193
>
> or:
> ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
> correlation | 0.81193
>
> I think it's caused by having so few (2) values to correlate.
>
> most_common_vals   | {f,t}
> most_common_freqs  | {0.9014,0.0986}
> correlation| 0.822792
>
> It thinks there's somewhat-high correlation since it gets a list of x and y
> values (integer positions by logical and physical sort order) and 90% of
> the x
> list (logical value) are the same value ('t'), and the CTIDs are in order
> on
> the new index, so 90% of the values are 100% correlated.
>

But there is no index involved (except in the case of the functional
index).  The correlation of table columns to physical order of the table
doesn't depend on the existence of an index, or the physical order within
an index.

But I do see that ties within the logical order of the column values are
broken to agree with the physical order.  That is wrong, right?  Is there
any argument that this is desirable?

It looks like it could be fixed with a few extra double calcs per distinct
value.  Considering we already sorted the sample values using SQL-callable
collation dependent comparators, I doubt a few C-level double calcs is
going to be meaningful.

Cheers,

Jeff


Re: Bitmap scan is undercosted?

2017-12-02 Thread Jeff Janes
On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:

>
>
> seq_page_cost = 0.0
> random_page_cost = 0.0
> explain analyze select * from aaa where num = 2 and flag = true;
>
> Bitmap Heap Scan on aaa  (cost=753.00..2003.00 rows=10257 width=5) (actual
> time=82.212..82.212 rows=0 loops=1)
>   ->  Bitmap Index Scan on i1  (cost=0.00..750.43 rows=10 width=0)
> (actual time=17.401..17.401 rows=10 loops=1)
>
> Index Scan using i1 on aaa  (cost=0.44..1750.43 rows=10257 width=5)
> (actual time=49.766..49.766 rows=0 loops=1)
>
> The bitmap plan was reduced to use only one bitmap scan, and finally it
> costs more than the index plan.
>

Right, so there is a cpu costing problem (which could only be fixed by
hacking postgresql and recompiling it), but it is much smaller of a problem
than the IO cost not being accurate due to the high hit rate.  Fixing the
CPU costing problem is unlikely to make a difference to your real query.
If you set the page costs to zero, what happens to your real query?


> But I doubt that the settings seq_page_cost = random_page_cost = 0.0
> should actually be used.
>

Why not?  If your production server really has everything in memory during
normal operation, that is the correct course of action.  If you ever
restart the server, then you could have some unpleasant time getting it
back up to speed again, but pg_prewarm could help with that.


> Probably it should be instead something like 1.0/1.0 or 1.0/1.1, but other
> costs increased, to have more weight.
>

This doesn't make any  sense to me.  Halving the page costs is
mathematically the same as doubling all the other constants.  But the first
way of doing things says what you are doing, and the second way is an
obfuscation of what you are doing.


>
> # x4 tuple/operator costs - bitmap scan still a bit cheaper
> set seq_page_cost = 1.0;
> set random_page_cost = 1.0;
> set cpu_tuple_cost = 0.04;
> set cpu_index_tuple_cost = 0.02;
> set cpu_operator_cost = 0.01;
>

If you really want to target the plan with the BitmapAnd, you should
increase  cpu_index_tuple_cost and/or cpu_operator_cost but not increase
cpu_tuple_cost.  That is because the  unselective bitmap index scan does
not incur any cpu_tuple_cost, but does incur index_tuple and operator
costs.  Unfortunately all other index scans in the system will also be
skewed by such a change if you make the change system-wide.

Incidentally, the "actual rows" field of BitmapAnd is always zero.  That
field is not implemented for that node type.


Why do you have an index on flag in the first place?  What does the index
accomplish, other than enticing the planner into bad plans?  I don't know
how this translates back into your real query, but dropping that index
should be considered.  Or replace both indexes with one on (num,flag).

Or you can re-write the part of the WHERE clause in a way that it can't use
an index, something like:

and flag::text ='t'

Cheers,

Jeff


Re: Setting effective_io_concurrency in VM?

2017-11-27 Thread Jeff Janes
On Mon, Nov 27, 2017 at 10:40 AM, Scott Marlowe 
wrote:

>
> Generally VMs are never going to be as fast as running on bare metal
> etc. You can adjust it and test it with something simple like pgbench
> with various settings for -c (concurrency) and see where it peaks etc
> with the setting. This will at least get you into the ball park.
>

None of the built-in workloads for pgbench cares a whit about
effective_io_concurrency.  He would have to come up with some custom
transactions to exercise that feature.  (Or use the tool people use to run
the TPCH benchmark, rather than using pgbench's built in transactions)

I think the best overall advice would be to configure it the same as you
would if it were not a VM.  There may be cases where you diverge from that,
but I think each one would require extensive investigation and
experimentation, so can't be turned into a rule of thumb.

Cheers,

Jeff


Re: insert and query performance on big string table with pg_trgm

2017-11-24 Thread Jeff Janes
On Nov 21, 2017 00:05, "Matthew Hall"  wrote:


> Are all indexes present at the time you insert?  It will probably be much
faster to insert without the gin index (at least) and build it after the
load.

There is some flexibility on the initial load, but the updates in the
future will require the de-duplication capability. I'm willing to accept
that might be somewhat slower on the load process, to get the accurate
updates, provided we could try meeting the read-side goal I wrote about, or
at least figure out why it's impossible, so I can understand what I need to
fix to make it possible.


As long as you don't let anyone use the table between the initial load and
when the index build finishes, you don't have to compromise on
correctness.  But yeah, makes sense to worry about query speed first.






> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds
as could be expected for such a torture test query, but it's still WAY
faster than the first such query. If you change it out to a different
expression, it's longer again of course. There does seem to be a
low-to-medium correlation between the number of rows found and the query
completion time.


To make this quick, you will need to get most of the table and most of the
index cached into RAM.  A good way to do that is with pg_prewarm.  Of
course that only works if you have enough RAM in the first place.

What is the size of the table and the gin index?


Cheers,

Jeff

