Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-09 Thread Andres Freund
On 2017-11-09 16:45:07 -0800, Peter Geoghegan wrote:
> On Thu, Nov 9, 2017 at 4:17 PM, Andres Freund <and...@anarazel.de> wrote:
> >> Actually, on second thought, I take that back -- I don't think that
> >> REINDEXing will even finish once a HOT chain is broken by the bug.
> >> IndexBuildHeapScan() actually does quite a good job of making sure
> >> that HOT chains are sane, which is how the enhanced amcheck notices
> >> the bug here in practice.
> >
> > I think that's too optimistic.
> 
> Why? Because the "find the TID of the root" logic in
> IndexBuildHeapScan()/heap_get_root_tuples() won't reliably find the
> actual root (it might be some other HOT chain root following TID
> recycling by VACUUM)?

Primarily because it's not an anti-corruption tool. I'd be surprised if
there weren't ways to corrupt the page using these corruptions that
aren't detected by it.  But even if it were, I don't think there's
enough information to do so in the general case.  You very well can end
up with pages where subsequent hot pruning has removed a good bit of the
direct evidence of this bug.

But I'm not really sure why the error detection capabilities matter
much for the principal point I raised, which is how much work we need
to do to not worsen the corruption further.

Greetings,

Andres Freund




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-09 Thread Andres Freund
On 2017-11-09 16:02:17 -0800, Peter Geoghegan wrote:
> > What I'm currently wondering about is how much we need to harden
> > postgres against such existing corruption. If e.g. the hot chains are
> > broken somebody might have reindexed thinking the problem is fixed - but
> > if they then later vacuum everything goes to shit again, with dead rows
> > reappearing.
> 
> I don't follow you here. Why would REINDEXing make the rows that
> should be dead disappear again, even for a short period of time?

It's not the REINDEX that makes them reappear. It's the second
vacuum. The reindex part was about $user trying to fix the problem...
As you need two vacuums with appropriate cutoffs to hit the "rows
revive" problem, in practice that often won't happen immediately.


> Actually, on second thought, I take that back -- I don't think that
> REINDEXing will even finish once a HOT chain is broken by the bug.
> IndexBuildHeapScan() actually does quite a good job of making sure
> that HOT chains are sane, which is how the enhanced amcheck notices
> the bug here in practice.

I think that's too optimistic.

Greetings,

Andres Freund




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-09 Thread Andres Freund
On 2017-11-04 06:15:00 -0700, Andres Freund wrote:
> The reason for that is that I hadn't yet quite figured out how the bug I
> described in the commit message (and the previously committed testcase)
> would cause that. I was planning to diagnose / experiment more with this
> and write an email if I couldn't come up with an explanation.   The
> committed test does *not* actually trigger that.
> 
> The reason I couldn't quite figure out how the problem triggers is that
> [ long explanation ]

Attached is a version of the already existing regression test that both
reproduces the broken hot chain (and thus failing index lookups) and
then also the tuple reviving.  I don't see any need for letting this run
with arbitrary permutations.

Thanks to whoever allowed isolationtester permutations to go over
multiple lines and allow comments. I was wondering about adding that as
a feature, only to discover it's already there ;)


What I'm currently wondering about is how much we need to harden
postgres against such existing corruption. If e.g. the hot chains are
broken somebody might have reindexed thinking the problem is fixed - but
if they then later vacuum everything goes to shit again, with dead rows
reappearing.  There's no way we can fix hot chains after the fact, but
preventing dead rows from reappearing seems important.  A minimal version
of that is fairly easy - we slap a bunch of if (!TransactionIdDidCommit())
elog(ERROR) checks into various code paths. But those will often trigger
clog access errors when the problem has occurred - if we want to do
better we need to pass down relfrozenxid/relminmxid to a few
functions.  I'm inclined to do so, but it'll make the patch larger...
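
To make the minimal version concrete, here's a sketch of the shape such
a check could take, assuming relfrozenxid gets passed down as described
(function name and placement are illustrative, not the actual patch):

#include "postgres.h"
#include "access/transam.h"

/*
 * Illustrative hardening check: refuse to freeze when the tuple state
 * contradicts the relation's horizon, rather than silently making the
 * corruption worse.
 */
static void
sanity_check_xid_against_horizon(TransactionId xid,
								 TransactionId relfrozenxid)
{
	/*
	 * A normal xid older than relfrozenxid shouldn't be reachable here;
	 * if we see one, the page is already corrupted and erroring out is
	 * safer than freezing.
	 */
	if (TransactionIdIsNormal(xid) &&
		TransactionIdPrecedes(xid, relfrozenxid))
		elog(ERROR, "found xid %u from before relfrozenxid %u",
			 xid, relfrozenxid);
}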

Comments?

Greetings,

Andres Freund
# Test for interactions of tuple freezing with dead, as well as recently-dead
# tuples using multixacts via FOR KEY SHARE.
setup
{
  DROP TABLE IF EXISTS tab_freeze;
  CREATE TABLE tab_freeze (
id int PRIMARY KEY,
name char(3),
x int);
  INSERT INTO tab_freeze VALUES (1, '111', 0);
  INSERT INTO tab_freeze VALUES (3, '333', 0);
}

teardown
{
  DROP TABLE tab_freeze;
}

session "s1"
step "s1_begin" { BEGIN; }
step "s1_update"{ UPDATE tab_freeze SET x = x + 1 WHERE id = 3; }
step "s1_commit"{ COMMIT; }
step "s1_vacuum"{ VACUUM FREEZE tab_freeze; }
step "s1_selectone" {
BEGIN;
SET LOCAL enable_seqscan = false;
SET LOCAL enable_bitmapscan = false;
SELECT * FROM tab_freeze WHERE id = 3;
COMMIT;
}
step "s1_selectall" { SELECT * FROM tab_freeze ORDER BY name, id; }

session "s2"
step "s2_begin" { BEGIN; }
step "s2_key_share" { SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; 
}
step "s2_commit"{ COMMIT; }
step "s2_vacuum"{ VACUUM FREEZE tab_freeze; }

session "s3"
step "s3_begin" { BEGIN; }
step "s3_key_share" { SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; 
}
step "s3_commit"{ COMMIT; }
step "s3_vacuum"{ VACUUM FREEZE tab_freeze; }

# This permutation verifies that a previous bug
# https://postgr.es/m/e5711e62-8fdf-4dca-a888-c200bf6b5...@amazon.com
# https://postgr.es/m/20171102112019.33wb7g5wp4zpj...@alap3.anarazel.de
# is not reintroduced. We used to make wrong pruning / freezing
# decisions for multixacts, which could lead to a) broken hot chains b)
# dead rows being revived.
permutation "s1_begin" "s2_begin" "s3_begin" # start transactions
   "s1_update" "s2_key_share" "s3_key_share" # have xmax be a multi with an 
updater, updater being oldest xid
   "s1_update" # create additional row version that has multis
   "s1_commit" "s2_commit" # commit both updater and share locker
   "s2_vacuum" # due to bug in freezing logic, we used to *not* prune updated 
row, and then froze it
   "s1_selectone" # if hot chain is broken, the row can't be found via index 
scan
   "s3_commit" # commit remaining open xact
   "s2_vacuum" # pruning / freezing in broken hot chains would unset xmax, 
reviving rows
   "s1_selectall" # show borkedness



Re: [HACKERS] pageinspect option to forgo buffer locking?

2017-11-09 Thread Andres Freund
On 2017-11-09 17:14:11 -0500, Tom Lane wrote:
> If we do this, I'd suggest exposing it as a separate SQL function
> get_raw_page_unlocked() rather than as an option to get_raw_page().
> 
> The reasoning is that if we ever allow these functions to be controlled
> via GRANT instead of hardwired superuser checks (cf nearby debate about
> lo_import/lo_export), one might reasonably consider the unlocked form as
> more risky than the locked form, and hence not wish to have to give out
> privileges to both at once.

Good idea!
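
A sketch of what that split might look like on the C side of
pageinspect; get_raw_page_internal() exists today, but the extra
boolean "take the buffer content lock?" parameter is the hypothetical
part:

#include "postgres.h"
#include "fmgr.h"
#include "common/relpath.h"

PG_FUNCTION_INFO_V1(get_raw_page_unlocked);

Datum
get_raw_page_unlocked(PG_FUNCTION_ARGS)
{
	text	   *relname = PG_GETARG_TEXT_PP(0);
	uint32		blkno = PG_GETARG_UINT32(1);

	/*
	 * Same guts as get_raw_page(); the trailing false stands for "don't
	 * take the buffer content lock" (that parameter is the new,
	 * hypothetical bit).
	 */
	PG_RETURN_BYTEA_P(get_raw_page_internal(relname, MAIN_FORKNUM, blkno,
											false));
}

Keeping it a separate pg_proc entry is what makes the GRANT story work:
the two functions could then be granted independently.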




Re: [HACKERS] pageinspect option to forgo buffer locking?

2017-11-09 Thread Andres Freund
On 2017-11-09 12:55:30 -0500, Robert Haas wrote:
> On Thu, Nov 9, 2017 at 12:49 PM, Andres Freund <and...@anarazel.de> wrote:
> > Occasionally, when debugging issues, I find it quite useful to be able
> > to do a heap_page_items() on a page that's actually locked exclusively
> > concurrently. E.g. investigating the recent multixact vacuuming issues,
> > it was very useful to attach a debugger to one backend to step through
> > freezing, and display the page in another session.
> >
> > Currently the locking in get_raw_page_internal() prevents that.  If it's
> > an option defaulting to off, I don't see why we couldn't allow that to
> > skip locking the page's contents. Obviously you can get corrupted
> > contents that way, but we already allow to pass arbitrary stuff to
> > heap_page_items().  Since pinning wouldn't be changed, there's no danger
> > of the page being moved out from under us.
> 
> heap_page_items() is, if I remember correctly, not necessarily going
> to tolerate malformed input very well - I think that's why we restrict
> all of these functions to superusers.  So using it in this way would
> seem to increase the risk of a server crash or other horrible
> misbehavior.  Of course if we're just dev-testing that doesn't really
> matter.

You can already pass arbitrary byteas to heap_page_items(), so I don't
see how this'd change anything exposure-wise? Or are you thinking that
users would continually do this with actual page contents and would be
more likely to hit edge cases than if using pg_read_binary_file() or
such (which obviously sees an out of date page)?

Greetings,

Andres Freund




[HACKERS] pageinspect option to forgo buffer locking?

2017-11-09 Thread Andres Freund
Hi,

Occasionally, when debugging issues, I find it quite useful to be able
to do a heap_page_items() on a page that's actually locked exclusively
concurrently. E.g. investigating the recent multixact vacuuming issues,
it was very useful to attach a debugger to one backend to step through
freezing, and display the page in another session.

Currently the locking in get_raw_page_internal() prevents that.  If it's
an option defaulting to off, I don't see why we couldn't allow that to
skip locking the page's contents. Obviously you can get corrupted
contents that way, but we already allow to pass arbitrary stuff to
heap_page_items().  Since pinning wouldn't be changed, there's no danger
of the page being moved out from under us.
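
A minimal sketch of the change inside get_raw_page_internal(), assuming
a new lock_page flag (the flag is hypothetical; the buffer calls are
the existing ones):

	buf = ReadBufferExtended(rel, forknum, blkno, RBM_NORMAL, NULL);

	/*
	 * With lock_page = false the copied page may be torn / inconsistent,
	 * since somebody may be modifying it concurrently, but the pin alone
	 * keeps the buffer from being evicted or relocated under us.
	 */
	if (lock_page)
		LockBuffer(buf, BUFFER_LOCK_SHARE);

	memcpy(raw_page_data, BufferGetPage(buf), BLCKSZ);

	if (lock_page)
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);

	ReleaseBuffer(buf);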

Greetings,

Andres Freund




Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0

2017-11-08 Thread Andres Freund
On 2016-12-16 23:04:13 -0500, Robert Haas wrote:
> >>  BTW, I suggest this rewritten comment:
> >>
> >> /*--
> >>  * FD_READ events are edge-triggered on Windows per
> >>  * https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576(v=vs.85).aspx
> >>
> >
> > Isn't the statement in the above doc, "For FD_READ, FD_OOB, and FD_ACCEPT
> > network events, network event recording and event object signaling are
> > level-triggered.", saying that FD_READ events are level-triggered? That
> > seems to contradict the above comment.
> 
> Argh.  I see your point.  Maybe we'd better rephrase that.  The
> document does say that, but the behavior they described is actually a
> weird hybrid of level-triggered and edge-triggered.  We should
> probably just avoid that terminology altogether and explain that
> read-ready won't necessarily be re-signalled unless there is an
> intervening recv().

(just looked at some latch code, reminding me of this)

FWIW, I wonder if the issue here isn't whether WaitForMultipleObjects /
WSAEventSelect is level triggered, but that WSAEnumNetworkEvents()
pretty much explicitly is not. And it seems to have effects over
multiple handles for the same socket.

The relevant line from the docs is: "The WSAEnumNetworkEvents function
discovers occurrences of network events for the indicated socket, clear
internal network event records, and reset event objects (optional)."

Note that it clears the state from the socket, *not* just the handle.
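
To make that concrete, a minimal Windows-only sketch (error handling
omitted; this just illustrates the API semantics described above):

#include <winsock2.h>

static void
drain_socket_events(SOCKET sock)
{
	WSANETWORKEVENTS occurred;
	WSAEVENT	ev = WSACreateEvent();

	WSAEventSelect(sock, ev, FD_READ | FD_WRITE);

	/*
	 * Per the quoted documentation, this doesn't merely read the state
	 * associated with 'ev': it clears the socket's internal network
	 * event records (and optionally resets the event object).  A handle
	 * later associated with the same socket therefore starts from a
	 * blank slate; an already-recorded FD_READ is gone even though no
	 * recv() has consumed the data.
	 */
	WSAEnumNetworkEvents(sock, ev, &occurred);

	WSACloseEvent(ev);
}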


That behaviour of WSAEnumNetworkEvents() also seems to explain why
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f7819baa618c528f60e266874051563ecfe08207
was necessary, and why all the weird workarounds in win32/socket.c exist.

Greetings,

Andres Freund




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-08 Thread Andres Freund
On 2017-11-08 12:02:40 -0500, Tom Lane wrote:
> BTW, it strikes me that in considering the rebuild-the-page approach,
> we should not have blinders on and just measure the speed of
> PageRepairFragmentation.  Rather, we should take a look at what happens
> subsequently given a physically-ordered set of tuples.  I can recall
> Andres or someone moaning awhile ago about lack of locality of access in
> index page searches --- maybe applying that approach while vacuuming
> indexes will help?

I complained about multiple related things; I'm not sure exactly what
you're referring to here:
- The fact that HeapTupleHeaderData's are commonly iterated over in
  reverse order is bad for performance. For shared buffers resident
  workloads involving seqscans that yields 15-25% slowdowns for me. It's
  trivial to fix that by just changing iteration order, but that
  obviously changes results. But we could reorder the page during heap
  pruning.

  But that's fairly independent of indexes, so I'm not sure whether
  that's what you're referring to.

- The layout of items in index pages is suboptimal. We regularly do
  binary searches over the linearly ordered items, which is cache
  inefficient. So instead we should sort items as [1/2, 1/4, 3/4, ...]
  elements (sketched below), which will access items in a close-ish to
  linear manner.

  But that's fairly independent of pruning, so I'm not sure whether
  that's what you're referring to, either.
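
For concreteness, here's the sketch referenced above: the
[1/2, 1/4, 3/4, ...] ordering (an Eytzinger / BFS layout) can be
produced with an in-order traversal of the implicit tree. Nothing here
is tied to nbtree; it's purely illustrative.

/*
 * Fill out[] with the Eytzinger ordering of sorted[]: out[0] is the
 * median, its "children" out[1] and out[2] are the 1/4 and 3/4
 * elements, and so on.  A binary search over out[] then steps through
 * positions 2*i+1 / 2*i+2, which touches memory close-ish to linearly.
 */
static void
eytzinger_fill(const int *sorted, int *out, int n, int *next, int pos)
{
	if (pos >= n)
		return;
	eytzinger_fill(sorted, out, n, next, 2 * pos + 1);	/* left subtree */
	out[pos] = sorted[(*next)++];						/* this node */
	eytzinger_fill(sorted, out, n, next, 2 * pos + 2);	/* right subtree */
}

(Called as: int next = 0; eytzinger_fill(sorted, out, n, &next, 0);)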

Greetings,

Andres Freund




Re: [HACKERS] Parallel Hash take II

2017-11-08 Thread Andres Freund
Hi,

@@ -747,7 +747,7 @@ try_hashjoin_path(PlannerInfo *root,
 	 * never have any output pathkeys, per comments in create_hashjoin_path.
 	 */
 	initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
-						  outer_path, inner_path, extra);
+						  outer_path, inner_path, extra, false);
 
 	if (add_path_precheck(joinrel,
 						  workspace.startup_cost, workspace.total_cost,
@@ -761,6 +761,7 @@ try_hashjoin_path(PlannerInfo *root,
 							  extra,
 							  outer_path,
 							  inner_path,
+							  false,	/* parallel_hash */
 							  extra->restrictlist,
 							  required_outer,
 							  hashclauses));
@@ -776,6 +777,10 @@ try_hashjoin_path(PlannerInfo *root,
  * try_partial_hashjoin_path
  *	  Consider a partial hashjoin join path; if it appears useful, push it into
  *	  the joinrel's partial_pathlist via add_partial_path().
+ *	  The outer side is partial.  If parallel_hash is true, then the inner path
+ *	  must be partial and will be run in parallel to create one or more shared
+ *	  hash tables; otherwise the inner path must be complete and a copy of it
+ *	  is run in every process to create separate identical private hash tables.
  */

When do we have "or more shared hash tables" rather than one? Are you
thinking about subordinate nodes?


@@ -1839,6 +1846,10 @@ hash_inner_and_outer(PlannerInfo *root,
 	 * able to properly guarantee uniqueness.  Similarly, we can't handle
 	 * JOIN_FULL and JOIN_RIGHT, because they can produce false null
 	 * extended rows.  Also, the resulting path must not be parameterized.
+	 * We should be able to support JOIN_FULL and JOIN_RIGHT for Parallel
+	 * Hash, since in that case we're back to a single hash table with a
+	 * single set of match bits for each batch, but that will require
+	 * figuring out a deadlock-free way to wait for the probe to finish.
 	 */

s/should be able/would be able/?



index 6a45b68e5df..2d38a5efae8 100644
--- a/src/backend/storage/ipc/barrier.c
+++ b/src/backend/storage/ipc/barrier.c
@@ -451,7 +451,6 @@ BarrierDetachImpl(Barrier *barrier, bool arrive)
release = true;
barrier->arrived = 0;
++barrier->phase;
-   Assert(barrier->selected);
barrier->selected = false;
}

Uh, what?




diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 35523bd8065..40a076d976f 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5821,6 +5821,9 @@ analyze extremely_skewed;
 insert into extremely_skewed
   select 42 as id, 'aa'
   from generate_series(1, 19000);
+-- Make a relation with a couple of enormous tuples.
+create table wide as select generate_series(1, 2) as id, rpad('', 320000, 'x') as t;
+alter table wide set (parallel_workers = 2);

I'm doubtful this is actually going to be a wide tuple - this'll get
compressed down quite a bit, no?

postgres[26465][1]=# SELECT octet_length(t), pg_column_size(t) FROM wide ;
┌──────────────┬────────────────┐
│ octet_length │ pg_column_size │
├──────────────┼────────────────┤
│       320000 │           3671 │
│       320000 │           3671 │
└──────────────┴────────────────┘
(2 rows)


(and yes, it's ridiculous that a compressed datum of that size still
takes up 3kb)



+-- parallel with parallel-aware hash join
+set max_parallel_workers_per_gather = 2;
+set work_mem = '128kB';
+set enable_parallel_hash = on;

I think it'd be better if we structured the file so we just set GUCs
with SET LOCAL inside a transaction.


+-- parallel with parallel-aware hash join
+set max_parallel_workers_per_gather = 2;
+set work_mem = '64kB';
+set enable_parallel_hash = on;
+explain (costs off)
+  select count(*) from simple r join extremely_skewed s using (id);
+                        QUERY PLAN
+----------------------------------------------------------
+ Finalize Aggregate
+   ->  Gather
+         Workers Planned: 2
+         ->  Partial Aggregate
+               ->  Parallel Hash Join
+                     Hash Cond: (r.id = s.id)
+                     ->  Parallel Seq Scan on simple r
+                     ->  Parallel Hash

Re: [HACKERS] Horrible CREATE DATABASE Performance in High Sierra

2017-11-08 Thread Andres Freund


On November 8, 2017 7:31:17 AM PST, Peter Eisentraut wrote:
>On 10/7/17 16:46, Tom Lane wrote:
>> I wrote:
>>> Current status is that I've filed a bug report with Apple and am
>>> waiting to see their response before deciding what to do next.  If
>>> they fix the issue promptly then there's little need for us to do
>>> anything.
>
>> Accordingly I propose the attached patch.  If anyone's interested in
>> experimenting on other platforms, we might be able to refine/complicate
>> the FLUSH_DISTANCE selection further, but I think this is probably good
>> enough for a first cut.
>
>What is the status of this?  Is performance on High Sierra still bad?

Didn't Tom commit a workaround?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Andres Freund
Hi,

 * avoids wasting memory on duplicated hash tables
 * avoids wasting disk space on duplicated batch files
 * avoids wasting CPU executing duplicate subplans

What's the last one referring to?





+static void
+MultiExecParallelHash(HashState *node)
+{

+	switch (BarrierPhase(build_barrier))
+	{
+		case PHJ_BUILD_ALLOCATING:
+
+			/*
+			 * Either I just allocated the initial hash table in
+			 * ExecHashTableCreate(), or someone else is doing that.  Either
+			 * way, wait for everyone to arrive here so we can proceed, and
+			 * then fall through.
+			 */
+			BarrierArriveAndWait(build_barrier, WAIT_EVENT_HASH_BUILD_ALLOCATING);

Can you add a /* fallthrough */ comment here? Gcc is warning if you
don't. While we currently have lotsa other places not having the
annotation, it seems reasonable to have it in new code.
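
E.g., assuming the fall-through here is indeed intended:

			BarrierArriveAndWait(build_barrier,
								 WAIT_EVENT_HASH_BUILD_ALLOCATING);
			/* FALLTHROUGH */
		case PHJ_BUILD_HASHING_INNER: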


+		case PHJ_BUILD_HASHING_INNER:
+
+			/*
+			 * It's time to begin hashing, or if we just arrived here then
+			 * hashing is already underway, so join in that effort.  While
+			 * hashing we have to be prepared to help increase the number of
+			 * batches or buckets at any time, and if we arrived here when
+			 * that was already underway we'll have to help complete that
+			 * work immediately so that it's safe to access batches and
+			 * buckets below.
+			 */
+			if (PHJ_GROW_BATCHES_PHASE(BarrierAttach(&hashtable->grow_batches_barrier)) !=
+				PHJ_GROW_BATCHES_ELECTING)
+				ExecParallelHashIncreaseNumBatches(hashtable);
+			if (PHJ_GROW_BUCKETS_PHASE(BarrierAttach(&hashtable->grow_buckets_barrier)) !=
+				PHJ_GROW_BUCKETS_ELECTING)
+				ExecParallelHashIncreaseNumBuckets(hashtable);
+			ExecParallelHashEnsureBatchAccessors(hashtable);

"accessors" sounds a bit weird for a bunch of pointers, but maybe that's
just my ESL senses tingling wrongly.



/* ----------------------------------------------------------------
@@ -240,12 +427,15 @@ ExecEndHash(HashState *node)
  * ----------------------------------------------------------------
  */
 HashJoinTable
-ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
+ExecHashTableCreate(HashState *state, List *hashOperators, bool keepNulls)

+	/*
+	 * Parallel Hash tries to use the combined work_mem of all workers to
+	 * avoid the need to batch.  If that won't work, it falls back to
+	 * work_mem per worker and tries to process batches in parallel.
+	 */

One day we're going to need a better approach to this. I have no idea
how, but this per-node, and now per_node * max_parallelism, approach has
only implementation simplicity as its benefit.





+static HashJoinTuple
+ExecParallelHashLoadTuple(HashJoinTable hashtable, MinimalTuple tuple,
+ dsa_pointer *shared)
+{

+static void
+ExecParallelHashJoinSetUpBatches(HashJoinTable hashtable, int nbatch)
+{



+/*
+ * Get the first tuple in a given bucket identified by number.
+ */
+static HashJoinTuple
+ExecHashFirstTupleInBucket(HashJoinTable hashtable, int bucketno)
+{
+   if (hashtable->parallel_state)
+   {
+   dsa_pointer p =
+			dsa_pointer_atomic_read(&hashtable->buckets.shared[bucketno]);

Can you make this, and possibly a few other places, more readable by
introducing a temporary variable?
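
For instance, something along these lines (variable names invented;
this assumes the hashtable's DSA area is at hand, as it is elsewhere in
the patch):

	if (hashtable->parallel_state)
	{
		dsa_pointer_atomic *bucket = &hashtable->buckets.shared[bucketno];
		dsa_pointer first = dsa_pointer_atomic_read(bucket);

		return (HashJoinTuple) dsa_get_address(hashtable->area, first);
	}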


+/*
+ * Insert a tuple at the front of a chain of tuples in DSA memory atomically.
+ */
+static void
+ExecParallelHashPushTuple(dsa_pointer_atomic *head,
+						  HashJoinTuple tuple,
+						  dsa_pointer tuple_shared)
+{
+	do
+	{
+		tuple->next.shared = dsa_pointer_atomic_read(head);
+	} while (!dsa_pointer_atomic_compare_exchange(head,
+												  &tuple->next.shared,
+												  tuple_shared));
+}

This is hard to read.
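
One way to restate the same compare-and-swap loop with an explicitly
named expected value (behaviour unchanged, purely illustrative):

	dsa_pointer old_head = dsa_pointer_atomic_read(head);

	for (;;)
	{
		/* link the new tuple in front of the current head ... */
		tuple->next.shared = old_head;

		/* ... and publish it, unless somebody else pushed first */
		if (dsa_pointer_atomic_compare_exchange(head, &old_head,
												tuple_shared))
			break;
		/* on failure old_head now holds the newly observed head; retry */
	}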


+ * While in the phase PHJ_BUILD_HASHING_INNER a separate pair of barriers may
+ * be used repeatedly as required to coordinate expansions in the number of
+ * batches or buckets.  Their phases are as follows:
+ *
+ *   PHJ_GROW_BATCHES_ELECTING   -- initial state
+ *   PHJ_GROW_BATCHES_ALLOCATING -- one allocates new batches
+ *   PHJ_GROW_BATCHES_REPARTITIONING -- all rep
s/rep/repartition/?

 #include "access/htup_details.h"
+#include "access/parallel.h"
 #include 

Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Andres Freund
Hi,

Here's a review of v24

+set min_parallel_table_scan_size = 0;
+set parallel_setup_cost = 0;
+-- Make a simple relation with well distributed keys and correctly
+-- estimated size.
+create table simple as
+  select generate_series(1, 20000) AS id, 'aa';
+alter table simple set (parallel_workers = 2);
+analyze simple;
+-- Make a relation whose size we will under-estimate.  We want stats
+-- to say 1000 rows, but actually there are 20,000 rows.
+create table bigger_than_it_looks as
+  select generate_series(1, 20000) as id, 'aa';
+alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
+alter table bigger_than_it_looks set (parallel_workers = 2);
+delete from bigger_than_it_looks where id <= 19000;
+vacuum bigger_than_it_looks;
+analyze bigger_than_it_looks;
+insert into bigger_than_it_looks
+  select generate_series(1, 19000) as id, 'aa';

It seems kinda easier to just manipulate ndistinct and reltuples...


+set max_parallel_workers_per_gather = 0;
+set work_mem = '4MB';

I hope there's a fair amount of slop here - with different archs you're
going to see quite some size differences.

+-- The "good" case: batches required, but we plan the right number; we
+-- plan for 16 batches, and we stick to that number, and peak memory
+-- usage stays within our work_mem budget
+-- non-parallel
+set max_parallel_workers_per_gather = 0;
+set work_mem = '128kB';

So how do we know that's actually the case we're testing rather than
something arbitrarily different? There's IIRC tests somewhere that just
filter the json explain output to the right parts...





+/*
+ * Build the name for a given segment of a given BufFile.
+ */
+static void
+MakeSharedSegmentName(char *name, const char *buffile_name, int segment)
+{
+   snprintf(name, MAXPGPATH, "%s.%d", buffile_name, segment);
+}

Not a fan of this name - you're not "making" a filename here (as in
allocating or such). I think I'd just remove the Make prefix.



+/*
+ * Open a file that was previously created in another backend with
+ * BufFileCreateShared in the same SharedFileSet using the same name.  The
+ * backend that created the file must have called BufFileClose() or
+ * BufFileExport() to make sure that it is ready to be opened by other
+ * backends and render it read-only.
+ */

Is it actually guaranteed that it's another backend / do we rely on
that?

+BufFile *
+BufFileOpenShared(SharedFileSet *fileset, const char *name)
+{

+   /*
+* If we didn't find any files at all, then no BufFile exists with this
+* tag.
+*/
+   if (nfiles == 0)
+   return NULL;

s/tag/name/?


+/*
+ * Delete a BufFile that was created by BufFileCreateShared in the given
+ * SharedFileSet using the given name.
+ *
+ * It is not necessary to delete files explicitly with this function.  It is
+ * provided only as a way to delete files proactively, rather than waiting for
+ * the SharedFileSet to be cleaned up.
+ *
+ * Only one backend should attempt to delete a given name, and should know
+ * that it exists and has been exported or closed.
+ */
+void
+BufFileDeleteShared(SharedFileSet *fileset, const char *name)
+{
+	char		segment_name[MAXPGPATH];
+	int			segment = 0;
+	bool		found = false;
+
+	/*
+	 * We don't know how many segments the file has.  We'll keep deleting
+	 * until we run out.  If we don't manage to find even an initial
+	 * segment, raise an error.
+	 */
+	for (;;)
+	{
+		MakeSharedSegmentName(segment_name, name, segment);
+		if (!SharedFileSetDelete(fileset, segment_name, true))
+			break;
+		found = true;
+		++segment;
+	}

Hm. Do we properly delete all the files via the resowner mechanism if
this fails midway? I.e. if there are no leading segments? Also wonder if
this doesn't need a CFI check.
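
If a CHECK_FOR_INTERRUPTS() turns out to be wanted, the loop could
simply become (sketch):

	for (;;)
	{
		CHECK_FOR_INTERRUPTS();
		MakeSharedSegmentName(segment_name, name, segment);
		if (!SharedFileSetDelete(fileset, segment_name, true))
			break;
		found = true;
		++segment;
	}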

+void
+PathNameCreateTemporaryDir(const char *basedir, const char *directory)
+{
+	if (mkdir(directory, S_IRWXU) < 0)
+	{
+		if (errno == EEXIST)
+			return;
+
+		/*
+		 * Failed.  Try to create basedir first in case it's missing.
+		 * Tolerate ENOENT to close a race against another process
+		 * following the same algorithm.
+		 */
+		if (mkdir(basedir, S_IRWXU) < 0 && errno != ENOENT)
+			elog(ERROR, "cannot create temporary directory \"%s\": %m",
+				 basedir);

ENOENT or EEXIST?
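
Presumably losing the mkdir() race is the benign case the comment means
to tolerate, i.e. something like (a sketch of the corrected check, not
the patch's code):

	/* Tolerate EEXIST: another process created basedir concurrently. */
	if (mkdir(basedir, S_IRWXU) < 0 && errno != EEXIST)
		elog(ERROR, "cannot create temporary directory \"%s\": %m",
			 basedir);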



+File
+PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
+{
+	File		file;
+
+	/*
+	 * Open the file.  Note: we don't use O_EXCL, in case there is an
+	 * orphaned temp file that can be reused.
+	 */
+	file = PathNameOpenFile(path, 

Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-07 Thread Andres Freund
Hi,

On 2017-11-06 10:56:43 +0530, Amit Kapila wrote:
> On Sun, Nov 5, 2017 at 6:54 AM, Andres Freund <and...@anarazel.de> wrote:
> > On 2017-11-05 01:05:59 +0100, Robert Haas wrote:
> >> skip-gather-project-v1.patch does what it says on the tin.  I still
> >> don't have a test case for this, and I didn't find that it helped very
> >> much,
> 
> I am also wondering in which case it can help and I can't think of the
> case.

I'm confused?  Isn't it fairly obvious that unnecessarily projecting
at the gather node is wasteful? Obviously depending on the query you'll
see smaller / bigger gains, but that there's benefits should be fairly
obvious?


> Basically, as part of projection in the gather, I think we are just
> deforming the tuple which we anyway need to perform before sending the
> tuple to the client (printtup) or probably at the upper level of the
> node.

But in most cases you're not going to print millions of tuples; instead
you're going to apply some further operators on top (e.g. the
OFFSET/LIMIT in my example).

> >> and you said this looked like a big bottleneck in your
> >> testing, so here you go.

> Is it possible that it shows the bottleneck only for 'explain analyze'
> statement as we don't deform the tuple for that at a later stage?

Doesn't matter, there's an OFFSET/LIMIT on top of the query. Could just
as well be a sort node or something.


> > The query where that showed a big benefit was
> >
> > SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;
> >
> > (i.e a not very selective filter, and then just throwing the results away)
> >
> > still shows quite massive benefits:
> 
> Do you see the benefit if the query is executed without using Explain Analyze?

Yes.

Before:
tpch_5[11878][1]=# SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;
...
Time: 7590.196 ms (00:07.590)

After:
Time: 3862.955 ms (00:03.863)


Greetings,

Andres Freund




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Andres Freund
On 2017-11-07 12:12:02 -0300, Claudio Freire wrote:
> If you need it. I'm not particularly fond of writing code before it's needed.

+1

> Otherwise, if it's a rarely-encountered corner case, I'd recommend
> simply calling the stdlib's qsort.

FWIW, we always map qsort onto our own implementation:

#define qsort(a,b,c,d) pg_qsort(a,b,c,d)

Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Andres Freund
Hi,

Please don't top-quote on postgresql lists.

On 2017-11-06 09:44:24 -0600, Jim Van Fleet wrote:
> > >hammerdb, in this configuration, runs a variant of tpcc
> > 
> > Hard to believe that any of the changes here are relevant in that 
> > case - this is parallelism specific stuff. Whereas tpcc is oltp, right?

> correct

In that case, could you provide before/after profiles of the performance
changing runs?

Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-06 Thread Andres Freund


On November 6, 2017 7:30:49 AM PST, Jim Van Fleet <vanfl...@us.ibm.com> wrote:
>Andres Freund <and...@anarazel.de> wrote on 11/05/2017 03:40:15 PM:
>
>hammerdb, in this configuration, runs a variant of tpcc

Hard to believe that any of the changes here are relevant in that case - this 
is parallelism specific stuff. Whereas tpcc is oltp, right?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] Early locking option to parallel backup

2017-11-06 Thread Andres Freund
On 2017-11-05 22:43:34 -0500, Tom Lane wrote:
> > IIUC the problem here is that even though a lock is already
> > held by the main backend an independent locker's request will prevent
> > the on-demand lock by the dump worker from being granted.  It seems to
> > me the correct fix here would be to somehow avoid the fairness logic in
> > the parallel dump case - although I don't quite know how to best do so.
> 
> I wonder if we couldn't somehow repurpose the work that was done for
> parallel workers' locks.  Lots of security-type issues to be handled
> if we're to open that up to clients, but maybe it's solvable.  For
> instance, maybe only allowing it to clients sharing the same snapshot
> would help.

Yea, I'd been wondering the same.

I'm slightly worried that somehow tying multiple clients into parallel
mode would cause a bunch of problems - that's not really the purpose of
the code and a number of its assumptions aren't quite right for that.

I'm not sure it really buys us much in contrast to just allowing a
locker to specify that it's allowed to jump the lock queue for an ASL if
it has 'backup' rights or such.  Or actually, just allow it as a general
option to LOCK, there's plenty other operational cases where the current
"fair" behaviour is really annoying, e.g. when executing operational
DDL/DML and such.

Greetings,

Andres Freund




Re: [HACKERS] Restricting maximum keep segments by repslots

2017-11-06 Thread Andres Freund
Hi,

On 2017-10-31 18:43:10 +0900, Kyotaro HORIGUCHI wrote:
>   - distance:
> how many bytes LSN can advance before the margin defined by
> max_slot_wal_keep_size (and wal_keep_segments) is exhausted,
> or how many bytes of xlog this slot has lost from restart_lsn.

I don't think 'distance' is a good metric - that's going to continually
change. Why not store the LSN that's available and provide a function
that computes this? Or just rely on the lsn - lsn operator?

- Andres




Re: [HACKERS] Restricting maximum keep segments by repslots

2017-11-06 Thread Andres Freund
On 2017-11-06 11:07:04 +0800, Craig Ringer wrote:
> Would it make sense to teach xlogreader how to fetch from WAL archive,
> too? That way if there's an archive, slots could continue to be used
> even after we purge from local pg_xlog, albeit at a performance cost.
> 
> I'm thinking of this mainly for logical slots.

That seems more like a page read callback's job than xlogreader's.

Greetings,

Andres Freund




Re: [HACKERS] Early locking option to parallel backup

2017-11-05 Thread Andres Freund
On 2017-11-05 17:38:39 -0500, Robert Haas wrote:
> On Sun, Nov 5, 2017 at 5:17 AM, Lucas <luca...@gmail.com> wrote:
> > The patch creates a "--lock-early" option which will make pg_dump issue
> > shared locks on all tables on the backup TOC on each parallel worker start.
> > That way, the backup has a very small chance of failing. When it does, it
> > happens in the first few seconds of the backup job. My backup scripts (not
> > included here) are aware of that and retry the backup in case of failure.
> 
> I wonder why we don't do this already ... and by default.

Well, the current approach afaics requires #relations * 2 locks, whereas
acquiring them in every worker would scale that with the number of
workers.  IIUC the problem here is that even though a lock is already
held by the main backend an independent locker's request will prevent
the on-demand lock by the dump worker from being granted.  It seems to
me the correct fix here would be to somehow avoid the fairness logic in
the parallel dump case - although I don't quite know how to best do so.

Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-05 Thread Andres Freund
Hi,

On November 5, 2017 1:33:24 PM PST, Jim Van Fleet  wrote:
>Ran this change with hammerdb  on a power 8 firestone
>
>with 2 socket, 20 core
>9.6 base--  451991 NOPM
>0926_master -- 464385 NOPM
>11_04master -- 449177 NOPM
>11_04_patch -- 431423 NOPM
>-- two socket patch is a little down from previous base runs
>
>With one socket
>9.6 base  -- 393727 NOPM
>v10rc1_base -- 350958 NOPM
>11_04master -- 306506 NOPM
>11_04_patch -- 313179 NOPM
>-- one socket 11_04 master is quite a bit down from 9.6 and v10rc1_base
>-- the patch is up a bit over the base
>
>Net -- the patch is about the same as current base on two socket, and on
>one socket -- consistent with your pgbench (?) findings
>
>As an aside, it is perhaps a worry that one socket is down over 20% from
>9.6 and over 10% from v10rc1

What query(s) did you measure?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-04 Thread Andres Freund
On 2017-11-05 01:05:59 +0100, Robert Haas wrote:
> skip-gather-project-v1.patch does what it says on the tin.  I still
> don't have a test case for this, and I didn't find that it helped very
> much, but it would probably help more in a test case with more
> columns, and you said this looked like a big bottleneck in your
> testing, so here you go.

The query where that showed a big benefit was

SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;

(i.e a not very selective filter, and then just throwing the results away)

still shows quite massive benefits:

before:
set parallel_setup_cost=0;set parallel_tuple_cost=0;set min_parallel_table_scan_size=0;set max_parallel_workers_per_gather=8;
tpch_5[17938][1]=# explain analyze SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;
                                                          QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=635802.67..635802.69 rows=1 width=127) (actual time=8675.097..8675.097 rows=0 loops=1)
   ->  Gather  (cost=0.00..635802.67 rows=27003243 width=127) (actual time=0.289..7904.849 rows=26989780 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on lineitem  (cost=0.00..635802.67 rows=3375405 width=127) (actual time=0.124..528.667 rows=3373722 loops=8)
               Filter: (l_suppkey > 5012)
               Rows Removed by Filter: 376252
 Planning time: 0.098 ms
 Execution time: 8676.125 ms
(9 rows)
after:
tpch_5[19754][1]=# EXPLAIN ANALYZE SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 10 LIMIT 1;
                                                          QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Limit  (cost=635802.67..635802.69 rows=1 width=127) (actual time=5984.916..5984.916 rows=0 loops=1)
   ->  Gather  (cost=0.00..635802.67 rows=27003243 width=127) (actual time=0.214..5123.238 rows=26989780 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         ->  Parallel Seq Scan on lineitem  (cost=0.00..635802.67 rows=3375405 width=127) (actual time=0.025..649.887 rows=3373722 loops=8)
               Filter: (l_suppkey > 5012)
               Rows Removed by Filter: 376252
 Planning time: 0.076 ms
 Execution time: 5986.171 ms
(9 rows)

so there clearly is still benefit (this is scale 100, but that shouldn't
make much of a difference).

Did not review the code.

> shm-mq-reduce-receiver-latch-set-v1.patch causes the receiver to only
> consume input from the shared queue when the amount of unconsumed
> input exceeds 1/4 of the queue size.  This caused a large performance
> improvement in my testing because it causes the number of times the
> latch gets set to drop dramatically. I experimented a bit with
> thresholds of 1/8 and 1/2 before settling on 1/4; 1/4 seems to be
> enough to capture most of the benefit.

Hm. Is consuming the relevant part, or notifying the sender about it?  I
suspect most of the benefit can be captured by updating bytes read (and
similarly on the other side w/ bytes written), but not setting the latch
unless thresholds are reached.  The advantage of updating the value,
even without notifying the other side, is that in the common case that
the other side gets around to checking the queue without having blocked,
it'll see the updated value.  If that works, that'd address the issue
that we might wait unnecessarily in a number of common cases.
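
A sketch of that idea on the receiver side, assuming mq_bytes_read has
been converted to an atomic as the patch does (the names and the
ring_size/4 threshold are illustrative, not the actual shm_mq code):

static void
shm_mq_note_bytes_read(shm_mq *mq, uint64 new_total, uint64 ring_size)
{
	uint64		old_total = pg_atomic_read_u64(&mq->mq_bytes_read);

	/* Always publish progress; a sender that polls sees it for free. */
	pg_atomic_write_u64(&mq->mq_bytes_read, new_total);

	/* Only pay for waking the sender once enough space has been freed. */
	if (new_total - old_total >= ring_size / 4)
		SetLatch(&mq->mq_sender->procLatch);
}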

Did not review the code.

> remove-memory-leak-protection-v1.patch removes the memory leak
> protection that Tom installed upon discovering that the original
> version of tqueue.c leaked memory like crazy.  I think that it
> shouldn't do that any more, courtesy of
> 6b65a7fe62e129d5c2b85cd74d6a91d8f7564608.  Assuming that's correct, we
> can avoid a whole lot of tuple copying in Gather Merge and a much more
> modest amount of overhead in Gather.

Yup, that conceptually makes sense.

Did not review the code.


> Even with all of these patches applied, there's clearly still room for
> more optimization, but MacOS's "sample" profiler seems to show that
> the bottlenecks are largely shifting elsewhere:
>
> Sort by top of stack, same 

[HACKERS] Display number of heap accesses for index scans

2017-11-04 Thread Andres Freund
Hi,

right now it's hard to figure out whether a plain indexscan returns
matches that then get eliminated due to visibility checks in the
heap. For both index only scans (via "Heap Fetches") and bitmapscans
(via row mismatches between bitmap heap and index scans) one can gather
that to some degree from explain analyze.

The number of index lookups that failed to return anything can be a
critical performance factor in OLTP workloads.  Therefore it seems like
it'd be a good idea to extend the explain analyze output to include that
information.

Greetings,

Andres Freund




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-04 Thread Andres Freund
On 2017-11-04 06:15:00 -0700, Andres Freund wrote:
> The current testcase, and I think the descriptions in the relevant
> threads, all actually fail to point out the precise way the bug is
> triggered.  As e.g. evidenced by the fact that the freeze-the-dead case
> appears to not cause any failures in !assertion builds even if the bug
> is present.

Trying to write up tests reproducing more of the issues in the area, I
think I might have found a third issue - although I'm not sure how
practically relevant it is:

When FreezeMultiXactId() decides it needs to create a new multi because
the old one is below the cutoff, that attempt can be defeated by the
multixact id caching. If the new multi has exactly the same members the
multixact id cache will just return the existing multi with the same
members. The cache will routinely be primed from the lookup of its
members.

I'm not yet sure how easily this can be hit in practice, because
commonly the multixact horizon should prevent a multi with all its
members living from being below the horizon. I found a situation where
that's not the case with the current bug, but I'm not sure if that can
happen otherwise.

Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-11-04 Thread Andres Freund
> -static uint64
> -shm_mq_get_bytes_written(volatile shm_mq *mq, bool *detached)
> -{
> -	uint64		v;
> -
> -	SpinLockAcquire(&mq->mq_mutex);
> -	v = mq->mq_bytes_written;
> -	*detached = mq->mq_detached;
> -	SpinLockRelease(&mq->mq_mutex);
> -
> -	return v;
> -}

You reference this function in a comment elsewhere:

> + /*
> +  * Separate prior reads of mq_ring from the write of mq_bytes_written
> +  * which we're about to do.  Pairs with shm_mq_get_bytes_written's read
> +  * barrier.
> +  */
> + pg_write_barrier();


Greetings,

Andres Freund




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-04 Thread Andres Freund
On 2017-11-03 12:36:59 -0700, Peter Geoghegan wrote:
> Andres Freund <and...@anarazel.de> wrote:
> > Here's that patch.  I've stared at this some, and Robert did too. Robert
> > mentioned that the commit message might need some polish and I'm not
> > 100% sure about the error message texts yet.
> 
> The commit message should probably say that the bug involves the
> resurrection of previously dead tuples, which is different to there
> being duplicates because a constraint is not enforced because HOT chains
> are broken (that's a separate, arguably less serious problem).

The reason for that is that I hadn't yet quite figured out how the bug I
described in the commit message (and the previously committed testcase)
would cause that. I was planning to diagnose / experiment more with this
and write an email if I couldn't come up with an explanation.   The
committed test does *not* actually trigger that.

The reason I couldn't quite figure out how the problem triggers is that
the xmax removing branch in FreezeMultiXactId() can only be reached if
the multi is from before the cutoff - which it can't have been for a
single vacuum execution to trigger the bug, because that'd require the
multi to be running to falsely return RECENTLY_DEAD rather than DEAD (by
definition a multi can't be below the cutoff if running).

For the problem to occur I think vacuum has to be executed *twice*: The
first time through HTSV mistakenly returns RECENTLY_DEAD preventing the
tuple from being pruned. That triggers FreezeMultiXactId() to create a
*new* multi with dead members. At this point the database already is in
a bad state. Then in a second vacuum HTSV returns DEAD, but
				 * Ordinarily, DEAD tuples would have been removed by
				 * heap_page_prune(), but it's possible that the tuple
				 * state changed since heap_page_prune() looked.  In
				 * particular an INSERT_IN_PROGRESS tuple could have
				 * changed to DEAD if the inserter aborted.  So this
				 * cannot be considered an error condition.
				 *
..
				if (HeapTupleIsHotUpdated(&tuple) ||
					HeapTupleIsHeapOnly(&tuple))
				{
					nkeep += 1;
prevents the tuple from being removed. If now the multi xmax is below
the xmin horizon it triggers

			/*
			 * If the xid is older than the cutoff, it has to have aborted,
			 * otherwise the tuple would have gotten pruned away.
			 */
			if (TransactionIdPrecedes(xid, cutoff_xid))
			{
				if (TransactionIdDidCommit(xid))
					elog(ERROR, "can't freeze committed xmax");
				*flags |= FRM_INVALIDATE_XMAX;

in FreezeMultiXact. Without the new elog, this then causes xmax to be
removed, reviving the tuple.


The current testcase, and I think the descriptions in the relevant
threads, all actually fail to point out the precise way the bug is
triggered.  As e.g. evidenced by the fact that the freeze-the-dead case
appears to not cause any failures in !assertion builds even if the bug
is present.


The good news is that the error checks I added in the patch upthread
prevent all of this from happening, even though I'd not yet understood
the mechanics fully - it's imnsho pretty clear that we need to be more
paranoid in production builds around this.   A bunch of users that
triggered largely "invisible" corruption (the first vacuum described
above) will possibly run into one of these elog()s, but that seems far
preferable to making the corruption a lot worse.


I think unfortunately the check + elog() in the

					HeapTupleIsHeapOnly(&tuple))
				{
					nkeep += 1;

					/*
					 * If this were to happen for a tuple that actually
					 * needed to be frozen, we'd be in trouble, because
					 * it'd leave a tuple below the relation's xmin
					 * horizon alive.
					 */
					if (heap_tuple_needs_freeze(tuple.t_data, FreezeLimit,
Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-03 Thread Andres Freund


On November 4, 2017 1:22:04 AM GMT+05:30, Alvaro Herrera 
<alvhe...@alvh.no-ip.org> wrote:
>Peter Geoghegan wrote:
>> Andres Freund <and...@anarazel.de> wrote:
>
>> > Staring at the vacuumlazy hunk I think I might have found a related bug:
>> > heap_update_tuple() just copies the old xmax to the new tuple's xmax if
>> > a multixact and still running.  It does so without verifying liveliness
>> > of members.  Isn't that buggy? Consider what happens if we have three
>> > blocks: 1 has free space, two is being vacuumed and is locked, three is
>> > full and has a tuple that's key share locked by a live tuple and is
>> > updated by a dead xmax from before the xmin horizon. In that case afaict
>> > the multi will be copied from the third page to the first one.  Which is
>> > quite bad, because vacuum already processed it, and we'll set
>> > relfrozenxid accordingly.  I hope I'm missing something here?
>> 
>> Can you be more specific about what you mean here? I think that I
>> understand where you're going with this, but I'm not sure.
>
>He means that the tuple that heap_update moves to page 1 (which will no
>longer be processed by vacuum) will contain a multixact that's older
>than relminmxid -- because it is copied unchanged by heap_update instead
>of properly checking against age limit.

Right. That, or a member xid below relminxid. I think both scenarios are possible.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-03 Thread Andres Freund
On 2017-11-02 06:05:51 -0700, Andres Freund wrote:
> Hi,
> 
> On 2017-11-02 13:49:47 +0100, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > > I think the problem is on the pruning, rather than the freezing side. We
> > > can't freeze a tuple if it has an alive predecessor - rather than
> > > weakening this, we should be fixing the pruning to not have the alive
> > > predecessor.
> > 
> > I gave a look at HTSV back then, but I didn't find what the right tweak
> > was, but then I only tried changing the return value to DEAD and
> > DELETE_IN_PROGRESS; the thought of selecting DEAD or RECENTLY_DEAD based
> > on OldestXmin didn't occur to me ... I was thinking that the fact that
> > there were live lockers meant that the tuple could not be removed,
> > obviously failing to notice that the subsequent versions of the tuple
> > would be good enough.
> 
> I'll try to write up a commit based on that idea. I think there's some
> comment work needed too, Robert and I were both confused by a few
> things.
> I'm unfortunately travelling atm - it's evening here, and I'll be flying
> back to the US all Saturday. I'm fairly sure I'll be able to come up
> with a decent patch tomorrow, but I'll need review and testing by
> others.

Here's that patch.  I've stared at this some, and Robert did too. Robert
mentioned that the commit message might need some polish and I'm not
100% sure about the error message texts yet.

I'm not yet convinced that the new elog in vacuumlazy can never trigger
- but I also don't think we want to actually freeze the tuple in that
case.

Staring at the vacuumlazy hunk I think I might have found a related bug:
heap_update_tuple() just copies the old xmax to the new tuple's xmax if
a multixact and still running.  It does so without verifying liveliness
of members.  Isn't that buggy? Consider what happens if we have three
blocks: 1 has free space, two is being vacuumed and is locked, three is
full and has a tuple that's key share locked by a live tuple and is
updated by a dead xmax from before the xmin horizon. In that case afaict
the multi will be copied from the third page to the first one.  Which is
quite bad, because vacuum already processed it, and we'll set
relfrozenxid accordingly.  I hope I'm missing something here?

Greetings,

Andres Freund
>From 774376dc8f7ed76657660e6beb0f5191d1bdaae4 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 3 Nov 2017 07:52:29 -0700
Subject: [PATCH] Fix pruning of locked and updated tuples.

Previously it was possible that a tuple was not pruned during vacuum,
even though its update xmax (i.e. the updating xid in a multixact
with both key share lockers and an updater) was below the cutoff
horizon.

As the freezing code assumed, rightly so, that that's not supposed to
happen, xmax would be preserved (as a member of a new multixact or
xmax directly). That causes two problems: For one the tuple is below
the xmin horizon, which can cause problems if the clog is truncated or
once there's an xid wraparound. The bigger problem is that that will
break HOT chains, leading to index lookups failing, which in turn can
lead to constraints being violated.

Fix the problem by recognizing that tuples can be DEAD instead of
RECENTLY_DEAD, even if the multixactid has alive members, if the
update_xid is below the xmin horizon. That's safe because newer
versions of the tuple will contain the locking xids.

As the wrong freezing of xmax can cause data corruption, extend sanity
checks and promote them to elogs rather than assertions.

Per-Discussion: Robert Haas and Freund
Reported-By: Daniel Wood
Discussion:
https://postgr.es/m/e5711e62-8fdf-4dca-a888-c200bf6b5...@amazon.com
https://postgr.es/m/20171102112019.33wb7g5wp4zpj...@alap3.anarazel.de
---
 src/backend/access/heap/heapam.c      | 34 --
 src/backend/commands/vacuumlazy.c     | 13 +
 src/backend/utils/time/tqual.c        | 53 +++
 src/test/isolation/isolation_schedule |  1 +
 4 files changed, 62 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 765750b8743..9b963e0cd0d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6412,7 +6412,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 			 */
 			if (TransactionIdPrecedes(xid, cutoff_xid))
 			{
-Assert(!TransactionIdDidCommit(xid));
+if (TransactionIdDidCommit(xid))
+	elog(ERROR, "can't freeze committed xmax");
 *flags |= FRM_INVALIDATE_XMAX;
 xid = InvalidTransactionId; /* not strictly necessary */
 			}
@@ -6484,6 +6485,8 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		{
 			TransactionId xid = members[i].xid;
 
+			Assert(TransactionIdIsValid(xid));
+
 			/*
 			 * It's an upda

Re: [HACKERS] ArrayLists instead of List (for some things)

2017-11-02 Thread Andres Freund
On 2017-11-02 10:38:34 -0400, Tom Lane wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
> > On 3 November 2017 at 03:17, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> We've jacked up the List API and driven a new implementation underneath
> >> once before.  Maybe it's time to do that again.
> 
> > Maybe, but the new implementation is not going to do well with places
> > where we perform lcons(). Probably many of those places could be
> > changed to lappend(), but I bet there's plenty that need prepend.
> 
> [ shrug... ] To me, that means this implementation isn't necessarily
> the right solution.
> 
> It seems to me that a whole lot of the complaints about this could be
> resolved simply by improving the List infrastructure to allocate ListCells
> in batches.  That would address the question of "too much palloc traffic"
> and greatly improve the it-accesses-all-over-memory situation too.

It'll however not solve the problem that pointer chained datastructures
have very poor access patterns.
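
To illustrate: even if ListCells are palloc'd in batches, traversal is
still a chain of data-dependent loads (illustrative struct, not the
actual pg_list.h definitions):

typedef struct MyListCell
{
	void	   *ptr_value;
	struct MyListCell *next;	/* one pointer chase per element */
} MyListCell;

/*
 * The address of cell N+1 isn't known until cell N has been loaded, so
 * the CPU can't prefetch ahead the way it can with a plain array, no
 * matter how contiguously the cells happen to have been allocated.
 */
static void *
nth_element(MyListCell *head, int n)
{
	while (n-- > 0)
		head = head->next;
	return head->ptr_value;
}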


> Possibly there are more aggressive changes that could be implemented
> without breaking too many places, but I think it would be useful to
> start there and see what it buys us.

I'm suspicious that that will result in causing pain twice, once doing
the compromise thing, and once doing it right.


FWIW, I'd started pursuing replacing a lot of executor uses of lists
with arrays. But everywhere I did, particularly the uses inside the
expression machinery, I noticed that larger changes were needed. For
the expression stuff we needed a much larger architectural change (hence
the stuff in 10 I did with Tom's help).  The other big user of lists at
the execution phase are targetlists - I think there the architectural
change is also bigger: Rebuilding/analyzing all the targetlists,
tupledescs etc at execution time is the architectural problem, not the
use of lists itself (although that's also bad).  The amount of cycles
spent in ExecTypeFromTL() and friends is partially caused by the use
of lists, but the real problem is that we're doing all that work in the
first place.

So I personally think most uses of lists at execution time are
architecturally wrong, and therefore optimizing their implementation is
just a bandaid.  Where I think the arguments for a more optimized
implementation are better is the parser, where we really don't know
ahead of time the size of the data we're dealing with.

Greetings,

Andres Freund




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-02 Thread Andres Freund
Hi,

On 2017-11-02 13:49:47 +0100, Alvaro Herrera wrote:
> Andres Freund wrote:
> > I think the problem is on the pruning, rather than the freezing side. We
> > can't freeze a tuple if it has an alive predecessor - rather than
> > weakening this, we should be fixing the pruning to not have the alive
> > predecessor.
> 
> I gave a look at HTSV back then, but I didn't find what the right tweak
> was, but then I only tried changing the return value to DEAD and
> DELETE_IN_PROGRESS; the thought of selecting DEAD or RECENTLY_DEAD based
> on OldestXmin didn't occur to me ... I was thinking that the fact that
> there were live lockers meant that the tuple could not be removed,
> obviously failing to notice that the subsequent versions of the tuple
> would be good enough.

I'll try to write up a commit based on that idea. I think there's some
comment work needed too, Robert and I were both confused by a few
things.
I'm unfortunately travelling atm - it's evening here, and I'll be flying
back to the US all Saturday. I'm fairly sure I'll be able to come up
with a decent patch tomorrow, but I'll need review and testing by
others.


> > If the update xmin is actually below the cutoff we can remove the tuple
> > even if there's live lockers - the lockers will also be present in the
> > newer version of the tuple.  I verified that for me that fixes the
> > problem. Obviously that'd require some comment work and more careful
> > diagnosis.
> 
> Sounds good.
> 
> I'll revert those commits then, keeping the test, and then you can
> commit your change.  OK?

Generally that sounds good - but you can't keep the testcase in without
the new fix - the buildfarm would immediately turn red. I guess the best
thing would be to temporarily remove it from the schedule?


Do we care about people upgrading to unreleased versions? We could do
nothing, document it in the release notes, or ???


Greetings,

Andres Freund




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-02 Thread Andres Freund
Hi,

On 2017-09-28 14:47:53 +, Alvaro Herrera wrote:
> Fix freezing of a dead HOT-updated tuple
>
> Vacuum calls page-level HOT prune to remove dead HOT tuples before doing
> liveness checks (HeapTupleSatisfiesVacuum) on the remaining tuples.  But
> concurrent transaction commit/abort may turn DEAD some of the HOT tuples
> that survived the prune, before HeapTupleSatisfiesVacuum tests them.
> This happens to activate the code that decides to freeze the tuple ...
> which resuscitates it, duplicating data.
>
> (This is especially bad if there's any unique constraints, because those
> are now internally violated due to the duplicate entries, though you
> won't know until you try to REINDEX or dump/restore the table.)
>
> One possible fix would be to simply skip doing anything to the tuple,
> and hope that the next HOT prune would remove it.  But there is a
> problem: if the tuple is older than freeze horizon, this would leave an
> unfrozen XID behind, and if no HOT prune happens to clean it up before
> the containing pg_clog segment is truncated away, it'd later cause an
> error when the XID is looked up.
>
> Fix the problem by having the tuple freezing routines cope with the
> situation: don't freeze the tuple (and keep it dead).  In the cases that
> the XID is older than the freeze age, set the HEAP_XMAX_COMMITTED flag
> so that there is no need to look up the XID in pg_clog later on.

I think this is the wrong fix - the assumption that ctid chains can be
validated based on the prev-xmax = cur-xmin is fairly ingrained into the
system, and we shouldn't just be breaking it. The need to later
lobotomize the checks, in a5736bf754, is some evidence of that.

I spent some time discussing this with Robert today (with both of us
alternating between feeling the other and ourselves as stupid), and the
conclusion I think is that the problem is on the pruning, rather than
the freezing side.

FWIW, I don't think the explanation in the commit message of how the
problem triggers is actually correct - the testcase you added doesn't
have any transactions concurrently committing / aborting when
crashing. Rather, the problem is that the liveness checks for freezing
are different from the ones for pruning. HTSV considers a tuple
RECENTLY_DEAD when there's a multi xmax with at least one alive locker,
whereas pruning thinks it has to do something because there's the member
xid below the cutoff. No concurrent activity is needed to trigger that.

I think the problem is on the pruning, rather than the freezing side. We
can't freeze a tuple if it has an alive predecessor - rather than
weakening this, we should be fixing the pruning to not have the alive
predecessor.

The relevant logic in HTSV is:

    if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
    {
        TransactionId xmax;

        if (MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple), false))
        {
            /* already checked above */
            Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));

            xmax = HeapTupleGetUpdateXid(tuple);

            /* not LOCKED_ONLY, so it has to have an xmax */
            Assert(TransactionIdIsValid(xmax));

            if (TransactionIdIsInProgress(xmax))
                return HEAPTUPLE_DELETE_IN_PROGRESS;
            else if (TransactionIdDidCommit(xmax))
                /* there are still lockers around -- can't return DEAD here */
                return HEAPTUPLE_RECENTLY_DEAD;
            /* updating transaction aborted */
            return HEAPTUPLE_LIVE;

with the problematic branch being the TransactionIdDidCommit()
case. Rather than unconditionally returning HEAPTUPLE_RECENTLY_DEAD, we
should test the update xid against OldestXmin and return DEAD /
RECENTLY_DEAD according to that.
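
Concretely, a sketch of what I mean (HTSV already receives OldestXmin as
a parameter, so no new plumbing is needed; this mirrors the change I
verified locally, modulo comment polish):

            else if (TransactionIdDidCommit(xmax))
            {
                /*
                 * The multixact might still have running lockers, but the
                 * update itself committed.  Whether the tuple is DEAD or
                 * merely RECENTLY_DEAD has to depend on the update xid,
                 * not on the lockers - newer versions of the tuple carry
                 * the same locking xids.
                 */
                if (!TransactionIdPrecedes(xmax, OldestXmin))
                    return HEAPTUPLE_RECENTLY_DEAD;
                return HEAPTUPLE_DEAD;
            }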

If the update xmin is actually below the cutoff we can remove the tuple
even if there's live lockers - the lockers will also be present in the
newer version of the tuple.  I verified that for me that fixes the
problem. Obviously that'd require some comment work and more careful
diagnosis.


I think a5736bf754c82d8b86674e199e232096c679201d might be dangerous in
the face of previously corrupted tuple chains and pg_upgraded clusters -
it can lead to tuples being considered related, even though they're
from entirely independent hot chains. Especially when upgrading 9.3 post
your fix, to current releases.


In short, I think the two commits should be reverted, and replaced with
a fix along what I'm outlining above.

There'll be some trouble for people that upgraded to an unreleased
version, but I don't really see what we could do about that.

I could be entirely wrong - I've been travelling for the last two weeks
and my brain is somewhat more fried than usual.

Regards,

Andres



Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?

2017-10-31 Thread Andres Freund
Hi,

On 2017-10-31 18:47:06 +0100, Tomas Vondra wrote:
> On 10/31/2017 04:48 PM, Greg Stark wrote:
> > On 31 October 2017 at 07:05, Chris Travers <chris.trav...@adjust.com> wrote:
> >> Hi;
> >>
> >> After Andres's excellent talk at PGConf we tried benchmarking
> >> effective_io_concurrency on some of our servers and found that those
> >> which have a number of NVME storage volumes could not fill the I/O
> >> queue even at the maximum setting (1000).
> >
> > And was the system still i/o bound? If the cpu was 100% busy then
> > perhaps Postgres just can't keep up with the I/O system. It would
> > depend on workload though, if you start many very large sequential
> > scans you may be able to push the i/o system harder.
> >
> > Keep in mind effective_io_concurrency only really affects bitmap
> > index scans (and to a small degree index scans). It works by issuing
> > posix_fadvise() calls for upcoming buffers one by one. That gets
> > multiple spindles active but it's not really going to scale to many
> > thousands of prefetches (and effective_io_concurrency of 1000
> > actually means 7485 prefetches). At some point those i/o are going
> > to start completing before Postgres even has a chance to start
> > processing the data.

Note that even if they finish well before postgres gets around to
looking at the block, you might still be seeing benefits. SSDs benefit
from larger reads, and a deeper queue gives more chances for reordering
/ coalescing of requests. It won't benefit the individual reader, but
might help the overall capacity of the system.


> Yeah, initiating the prefetches is not expensive, but it's not free
> either. So there's a trade-off between time spent on prefetching and
> processing the data.

Right. It'd probably be good to be a bit more adaptive here. But it's
hard to do with posix_fadvise - we'd need an operation that actually
notifies us of IO completion.  If we were using, say, asynchronous
direct IO, we could initiate the request and regularly check how many
blocks ahead of the current window are already completed and adjust the
queue based on that, rather than just firing off fadvises and hoping for
the best.


> I believe this may be actually illustrated using Amdahl's law - the I/O
> is the parallel part, and processing the data is the serial part. And no
> matter what you do, the device only has so much bandwidth, which defines
> the maximum possible speedup (compared to "no prefetch" case).

Right.


> Furthermore, the device does not wait for all the I/O requests to be
> submitted - it won't wait for 1000 requests and then go "OMG! There's a
> lot of work to do!" It starts processing the requests as they arrive,
> and some of them will complete before you're done with submitting the
> rest, so you'll never see all the requests in the queue at once.

It'd be interesting to see how much it helps to scale the size of
readahead requests with the distance from the current read
iterator. E.g. if you're less than 16 blocks away from the current head,
issue a 1-block request; if up to 32 away, issue a 2-block request for
consecutive blocks.
I suspect it won't help because at least linux's block layer / io
elevator seems quite successful at merging. E.g. for the query:
EXPLAIN ANALYZE SELECT sum(l_quantity) FROM lineitem where l_receiptdate 
between '1993-05-03' and '1993-08-03';
on a tpc-h scale dataset on my laptop, I see:
Device  r/s       w/s   rMB/s   wMB/s  rrqm/s    wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda     25702.00  0.00  495.27  0.00   37687.00  0.00    59.45  0.00   5.13     0.00     132.09  19.73     0.00      0.04   100.00

but it'd be worthwhile to see whether doing the merging ourselves allows
for deeper queues.


I think we really should start incorporating explicit prefetching in
more places. Ordered indexscans might actually be one case that's not
too hard to do in a simple manner - whenever at an inner node, prefetch
the leaf nodes below it. We obviously could do better, but that might be
a decent starting point to get some numbers.
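
For btree that could e.g. mean something along these lines (a rough
sketch only - the traversal details are made up, but PrefetchBuffer()
exists today and boils down to a posix_fadvise(POSIX_FADV_WILLNEED)
call):

static void
prefetch_downlinks(Relation rel, Page innerpage)
{
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(innerpage);

    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        ItemId      itemid = PageGetItemId(innerpage, off);
        IndexTuple  itup = (IndexTuple) PageGetItem(innerpage, itemid);

        /* the downlink to the child page is stored in t_tid's block number */
        PrefetchBuffer(rel, MAIN_FORKNUM,
                       ItemPointerGetBlockNumber(&itup->t_tid));
    }
}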

Greetings,

Andres Freund




Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-30 Thread Andres Freund
Hi,

On 2017-10-30 22:29:42 +0530, Robert Haas wrote:
> On Mon, Oct 30, 2017 at 4:57 PM, Andres Freund <and...@anarazel.de> wrote:
> > 0001) Introduces pg_{add,sub,mul}{16,32,64}_overflow(a, b, *result)
> >   These use compiler intrinsics on gcc/clang. If that's not
> >   available, they cast to a wider type and do overflow checks. For
> >   64bit there's a fallback for the case 128bit math is not
> >   available (here I stole from an old patch of Greg's).
> >
> >   These fallbacks are, as far as I can tell, free of overflow
> >   related undefined behaviour.
> 
> Looks nice.

Thanks.


> >   Perhaps it should rather be pg_add_s32_overflow, or a similar
> >   naming scheme?
> 
> Not sure what the s is supposed to be?  Signed?

Yes, signed. So we could add a u32 or something complementing the
functions already in the patch. Even though overflow checks are a heck
of a lot easier to write for unsigned ints, the intrinsics are still
faster.  I don't have any sort of strong feelings on the naming.


> > 0002) Converts int.c, int8.c and a smattering of other functions to use
> >   the new facilities. This removes a fair amount of code.
> >
> >   It might make sense to split this up further, but right now that's
> >   the set of functions that either are affected performancewise by
> >   previous overflow checks, and/or relied on wraparound
> >   overflow. There's probably more places, but this is what I found
> >   by visual inspection and compiler warnings.
> 
> I lack the patience to review this tonight.

Understandable ;)


> > 0003) Removes -fwrapv. I'm *NOT* suggesting we apply this right now, but
> >   it seems like an important test for the new facilities. Without
> >   0002, tests would fail after this; with it all tests run
> >   successfully.
> 
> I suggest that if we think we don't need -fwrapv any more, we ought to
> remove it.  Otherwise, we won't find out if we're wrong.

I agree that we should do so at some point not too far away in the
future. Not least because we don't specify this kind of C dialect for
a lot of other compilers. Additionally the flag causes some slowdown
(because e.g. for-loop variables are optimized less). But I'm fairly
certain it needs a bit more care than I've invested as of now - we should
probably at least compile with -Wstrict-overflow=some-higher-level, and
with ubsan. I'm fairly certain there are more bogus overflow checks
around...

Greetings,

Andres Freund




Re: [HACKERS] Remove secondary checkpoint

2017-10-30 Thread Andres Freund
On 2017-10-30 10:10:19 -0400, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
> > I was mostly just thinking out loud, listing another option rather
> > than advocating for it.
> 
> FWIW, I just wanted the question to be debated and resolved properly.
> 
> After rereading the thread Andres pointed to, I thought of a hazard
> that I think Andres alluded to, but didn't spell out explicitly:
> if we can't read the primary checkpoint, and then back up to a
> secondary one and replay as much of WAL as we can read, we may well
> be left with an inconsistent database.

Exactly.


> I'm content now that removing the secondary checkpoint is an OK
> decision.  (This isn't a review of Simon's patch, though.)

I wonder if we shouldn't add a pg_resetxlog option that sets the
checkpoint to start from to a specific LSN. For the few cases where
actual data recovery is needed, that's a lot more useful than randomly
using checkpoint - 1. And it's an explicit expert-only thing, without
costing everyone.

Greetings,

Andres Freund




Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-30 Thread Andres Freund
Hi,

On 2017-10-24 03:39:54 -0700, Andres Freund wrote:
> Largely that's due to the overflow checks.
>
> For integers we currently do:
>
> #define SAMESIGN(a,b) (((a) < 0) == ((b) < 0))
>
>   /*
>* Overflow check.  If the inputs are of different signs then their sum
>* cannot overflow.  If the inputs are of the same sign, their sum had
>* better be that sign too.
>*/
>   if (SAMESIGN(arg1, arg2) && !SAMESIGN(result, arg1))
>   ereport(ERROR,
>   (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
>errmsg("integer out of range")));
>
> which means that we turn a single integer instruction into ~10,
> including a bunch of branches.  All that despite the fact that most
> architectures have flag registers signalling integer overflow. It's just
> that C doesn't easily make that available.
>
> gcc exposes more efficient overflow detection via intrinsics:
> https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Integer-Overflow-Builtins.html
>
> Using that turns the non-error path from int4pl from:
>
>0x00826ec0 <+0>: mov0x20(%rdi),%rcx # arg1
>0x00826ec4 <+4>: mov0x28(%rdi),%rdx # arg2
>0x00826ec8 <+8>: mov%ecx,%esi
>0x00826eca <+10>:lea(%rdx,%rcx,1),%eax # add
># overflow check
>0x00826ecd <+13>:shr$0x1f,%edx
>0x00826ed0 <+16>:not%esi
>0x00826ed2 <+18>:shr$0x1f,%esi
>0x00826ed5 <+21>:cmp%dl,%sil
>0x00826ed8 <+24>:je 0x826f30 <int4pl+112>
>0x00826eda <+26>:mov%eax,%edx
>0x00826edc <+28>:shr$0x1f,%ecx
>0x00826edf <+31>:shr$0x1f,%edx
>0x00826ee2 <+34>:cmp%cl,%dl
>0x00826ee4 <+36>:je 0x826f30 <int4pl+112>
>/* overflow error code */
>0x00826f30 <+112>:   retq
>
> into
>
>0x00826ec0 <+0>: mov0x28(%rdi),%rax # arg2
>0x00826ec4 <+4>: add0x20(%rdi),%eax # arg1 + arg2
>0x00826ec7 <+7>: jo 0x826ecc <int4pl+12> # jump if 
> overflowed
>0x00826ec9 <+9>: mov%eax,%eax # clear high bits
>0x00826ecb <+11>:retq
>
> which, not that surprisingly, is faster. Not to speak of easier to read
> ;)
>
> Besides the fact that the code is faster, there's also the issue that
> the current way to do overflow checks is not actually correct C, and
> requires compiler flags like -fwrapv.

Attached is a series of patches that:

0001) Introduces pg_{add,sub,mul}{16,32,64}_overflow(a, b, *result)
  These use compiler intrinsics on gcc/clang. If that's not
  available, they cast to a wider type and do overflow checks. For
  64bit there's a fallback for the case 128bit math is not
  available (here I stole from an old patch of Greg's).

  These fallbacks are, as far as I can tell, free of overflow related
  undefined behaviour. (A minimal sketch of one of these helpers follows
  after this list.)

  Perhaps it should rather be pg_add_s32_overflow, or a similar
  naming scheme?

0002) Converts int.c, int8.c and a smattering of other functions to use
  the new facilities. This removes a fair amount of code.

  It might make sense to split this up further, but right now that's
  the set of functions that either are affected performancewise by
  previous overflow checks, and/or relied on wraparound
  overflow. There's probably more places, but this is what I found
  by visual inspection and compiler warnings.

0003) Removes -fwrapv. I'm *NOT* suggesting we apply this right now, but
  it seems like an important test for the new facilities. Without
  0002, tests would fail after this; with it all tests run
  successfully.
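
To make 0001 a bit more concrete, here's a minimal sketch of one of the
helpers (a sketch only - the real patch is more extensive, and
HAVE__BUILTIN_OP_OVERFLOW is the configure symbol 0001 introduces):

#if defined(HAVE__BUILTIN_OP_OVERFLOW)
static inline bool
pg_add_s32_overflow(int32 a, int32 b, int32 *result)
{
    return __builtin_add_overflow(a, b, result);
}
#else
static inline bool
pg_add_s32_overflow(int32 a, int32 b, int32 *result)
{
    /* fallback: compute in 64 bit, then range-check */
    int64       res = (int64) a + (int64) b;

    if (res > PG_INT32_MAX || res < PG_INT32_MIN)
        return true;            /* overflowed; *result is not set */
    *result = (int32) res;
    return false;
}
#endif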

Greetings,

Andres Freund
>From 98fbe53be0a3046f8ace687f846f91a0043deee8 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Sun, 29 Oct 2017 22:13:54 -0700
Subject: [PATCH 1/3] Provide overflow safe integer math inline functions.

Author: Andres Freund, with some code stolen from Greg Stark
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
 config/c-compiler.m4  |  22 
 configure |  33 ++
 configure.in  |   4 +
 src/include/common/int.h  | 229 ++
 src/include/pg_config.h.in|   3 +
 src/include/pg_config.h.win32 |   3 +
 6 files changed, 294 insertions(+)
 create mode 100644 src/include/common/int.h

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 6dcc7906491..0d91

Re: [HACKERS] An unlikely() experiment

2017-10-30 Thread Andres Freund
On 2017-10-30 22:39:01 +1300, David Rowley wrote:
> On 30 October 2017 at 22:34, Andres Freund <and...@anarazel.de> wrote:
> > Hi,
> >
> > On 2015-12-20 02:49:13 +1300, David Rowley wrote:
> >> Alternatively, if there was some way to mark the path as cold from within
> >> the path, rather than from the if() condition before the path, then we
> >> could perhaps do something in the elog() macro instead. I just couldn't
> >> figure out a way to do this.
> >
> > I just was thinking about this, and it turns out that
> > __attribute__((cold)) does what we need here. We could just mark
> > elog_start() and errstart() as cold, and afaict that should do the
> > trick.
> 
> I had actually been thinking about this again today. I had previously
> not thought  __attribute__((cold)) would be a good idea since if we
> mark elog_start() with that, then code paths with elog(NOTICE) and
> elog(LOG) not to mention DEBUG would be marked as cold.

That's an issue, true. Although I'm not convinced it's particularly
fatal - if you're ok with emitting debugging output in places it's
probably not the hottest codepath in the first place. And profile guided
optimization would overwrite hot/cold.

But:

> Today I was thinking, to get around that issue, we might be able to
> generate another thin wrapper around elog_start() and mark that as
> __attribute__((cold)) and fudge the macro a bit to call that function
> instead if it can detect a compile time const level >= ERROR. I've not
> looked at the code again to remind myself if that would be possible.

Yes, that's what I was thinking too. Add an elog_fatal() wrapping
elog_finish(), and move the if (__builtin_constant_p(elevel)) branch a
bit earlier, and that should work.  Similar with errstart_fatal() for
ereport().
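
A rough sketch of that (hypothetical names; assumes a pg_attribute_cold
macro wrapping __attribute__((cold)) on capable compilers):

/* cold-marked twin of elog_start(), so gcc lays error paths out of line */
extern void elog_start_fatal(const char *filename, int lineno,
                             const char *funcname) pg_attribute_cold;

#define elog(elevel, ...) \
    do { \
        if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
            elog_start_fatal(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
        else \
            elog_start(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
        elog_finish(elevel, __VA_ARGS__); \
    } while (0)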

Greetings,

Andres Freund




Re: [HACKERS] An unlikely() experiment

2017-10-30 Thread Andres Freund
Hi,

On 2015-12-20 02:49:13 +1300, David Rowley wrote:
> Alternatively, if there was some way to mark the path as cold from within
> the path, rather than from the if() condition before the path, then we
> could perhaps do something in the elog() macro instead. I just couldn't
> figure out a way to do this.

I just was thinking about this, and it turns out that
__attribute__((cold)) does what we need here. We could just mark
elog_start() and errstart() as cold, and afaict that should do the
trick.

Thanks to Jakub from #gcc for pointing that out.

Greetings,

Andres Freund




Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-28 Thread Andres Freund
On 2017-10-24 15:28:17 -0400, Tom Lane wrote:
> Also, if I recall the old discussion properly, one concern was getting
> uniform behavior across different platforms.  I'm worried that if we do
> what Andres suggests, we'll get behavior that is not only different but
> platform-specific.  Now, to the extent that you believe that every modern
> platform implements edge-case IEEE float behavior the same way, that worry
> may be obsolete.  But I don't think I believe that.

Hm. Is the current code actually meaningfully less dependent on IEEE
float behaviour? Both with the current behaviour and with the
alternative of not ereporting we rely on infinity op something to result
in infinity.  Given that we're not preventing underflows, imprecise
results, or the continued use of denormals, I don't think we're
avoiding edge cases effectively at the moment.

I just spent the last hours digging through intel's architecture
manual. And discovered way too much weird stuff :/.

There indeed doesn't really seem to be any sort of decent way to
implement the overflow checks in an efficient manner. Clearing & testing
the SSE floating point control register, which contains the overflow
bit, is ~10 cycles each. The way gcc implements the isinf check - as a
bunch of compares and bit-wizardry with constants - I don't see how to
beat that.


Btw, looking at this code I noticed that the current error messages
aren't meaningful:

=# SELECT '-1e38'::float4  + '-3e38'::float4;
ERROR:  22003: value out of range: overflow


The current code gets slightly better if I put an unlikely() around just
the isinf(val) in CHECKFLOATVAL.

Greetings,

Andres Freund




Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Andres Freund
On 2017-10-25 07:33:46 +0200, Robert Haas wrote:
> On Tue, Oct 24, 2017 at 9:28 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > I don't like changing well-defined, user-visible query behavior for
> > no other reason than a performance gain (of a size that hasn't even
> > been shown to be interesting, btw).  Will we change it back in another
> > ten years if the performance tradeoff changes?

That part of the argument seems unconvincing. It's not like the overflow
check is likely to ever have been beneficial performancewise, nor is it
remotely likely for that to ever be the case.


> > Also, if I recall the old discussion properly, one concern was getting
> > uniform behavior across different platforms.  I'm worried that if we do
> > what Andres suggests, we'll get behavior that is not only different but
> > platform-specific.  Now, to the extent that you believe that every modern
> > platform implements edge-case IEEE float behavior the same way, that worry
> > may be obsolete.  But I don't think I believe that.
> 
> Yeah, those are reasonable concerns.

I agree. I'm not really sure what the right way is here. I do however
think it's worth discussing ways to address the performance penalty
due to the overflow checks, and one obvious way to do so is not to play.

It'd be interesting to write the overflow checking addition in x86
inline asm, and see how much better that gets - just so we know the
maximum we can reach with that. The problem with the C99 stuff seems to
be the external function calls.  With either, one problem would be that
we'd have to reset the overflow register before doing math, which isn't
free either - otherwise some external function could have left it set to
on.
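
For reference, the kind of quick hack I mean (x86-64 gcc inline asm; bit
0x8 in MXCSR is the overflow flag - measurement fodder, not something to
commit):

static inline float
float4pl_asm(float a, float b, bool *overflow)
{
    unsigned int mxcsr;

    __asm__ __volatile__ ("stmxcsr %0" : "=m" (mxcsr));
    mxcsr &= ~0x8u;             /* clear the OE (overflow) flag */
    __asm__ __volatile__ ("ldmxcsr %0" : : "m" (mxcsr));

    __asm__ __volatile__ ("addss %1, %0" : "+x" (a) : "x" (b));

    __asm__ __volatile__ ("stmxcsr %0" : "=m" (mxcsr));
    *overflow = (mxcsr & 0x8u) != 0;
    return a;
}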

Greetings,

Andres Freund




Re: [HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Andres Freund
On 2017-10-24 10:09:09 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > There's no comparable overflow handling to the above integer
> > intrinsics. But I think we can still do a lot better. Two very different
> > ways:
> 
> > 1) Just give up on detecting overflows for floats. Generating inf in
> >these cases actually seems entirely reasonable. We already don't
> >detect them in a bunch of cases anyway.  I can't quite parse the
> >standard's language around this.
> 
> There's an ancient saying that code can be arbitrarily fast if it
> doesn't have to get the right answer.  I think this proposal falls
> in that category.

Does it? In plenty of cases getting infinity rather than an error is
just about as useful.

This was argued by a certain Tom Lane a few years back ;)
http://archives.postgresql.org/message-id/19208.1167246902%40sss.pgh.pa.us


> > 2) Use platform specific float exception handling where available. We
> >could, at backend start and in FloatExceptionHandler(), use
> >feenableexcept() (windows has similar) to trigger SIGFPE on float
> >overflow.
> 
> SIGFPE isn't going to be easy to recover from, nor portable.

Hm? A trivial hack implementing the above survives the regression test,
with the exception of one output change because some functions currently
do *not* check for overflow.  What's the issue you're concerned about?

The portability indeed is a problem.


> I think what you actually want to do is *disable* SIGFPE (see
> feholdexcept), and then have individual functions use feclearexcept
> and fetestexcept.  These functions were standardized by C99 so
> they should be pretty widely available ... of course, whether they
> actually are widely portable remains to be seen.  Whether they're
> faster than what we're doing now also remains to be seen.

I tested it, and they're fairly slow on at least gcc-7 + glibc 2.24.
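
For the archives, the shape of what I benchmarked is roughly this
(a sketch; in glibc both fe*() calls end up as external function calls,
which is where the cost goes):

#include <fenv.h>

static float
float4pl_c99(float a, float b, bool *overflow)
{
    float       result;

    feclearexcept(FE_OVERFLOW);
    result = a + b;
    *overflow = (fetestexcept(FE_OVERFLOW) != 0);
    return result;
}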

Greetings,

Andres Freund




Re: [HACKERS] Remove secondary checkpoint

2017-10-24 Thread Andres Freund
On 2017-10-24 09:50:12 -0400, Tom Lane wrote:
> Simon Riggs <si...@2ndquadrant.com> writes:
> > Remove the code that maintained two checkpoint's WAL files and all
> > associated stuff.
> 
> > Try to avoid breaking anything else
> 
> > This
> > * reduces disk space requirements on master
> > * removes a minor bug in fast failover
> > * simplifies code
> 
> Doesn't it also make crash recovery less robust?

I think it does the contrary. The current mechanism is, in my opinion,
downright dangerous:
https://www.postgresql.org/message-id/20160201235854.go8...@awork2.anarazel.de

Greetings,

Andres Freund




Re: [HACKERS] [POC] hash partitioning

2017-10-24 Thread Andres Freund
On 2017-10-24 12:43:12 +0530, amul sul wrote:
> I tried to get the suggested SMHasher[1] test results for the 32-bit
> and 64-bit versions of hash_combine.
> 
> SMHasher works on hash keys of the form {0}, {0,1}, {0,1,2}... up to
> N=255, using 256-N as the seed. For the hash_combine testing we needed
> two hash values to be combined; for that, I've generated 64 and 128-bit
> hashes using the cityhash functions[2] for the given smhasher key, then
> split them in two parts to test the 32-bit and 64-bit hash_combine
> functions respectively.  Attached is a patch with the SMHasher code
> changes & the output of the 32-bit and 64-bit hash_combine testing.
> Note that I have skipped the speed test, which is irrelevant here.
> 
> By referring to other hash functions' results [3], we can see that the
> hash_combine test results are not bad either.
> 
> Do let me know if current testing is not good enough or if you want me to do
> more testing, thanks.

This looks very good! Both the tests you did, and the results for
hash_combineXX. I therefore think we can go ahead with that formulation
of hash_combine64?
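
For readers following along, the combiners under discussion look like
this (quoting from memory, so treat it as a sketch rather than the final
form; the 64-bit constant is just 64 bits of random data):

static inline uint32
hash_combine(uint32 a, uint32 b)
{
    a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
    return a;
}

static inline uint64
hash_combine64(uint64 a, uint64 b)
{
    a ^= b + UINT64CONST(0x49a0f4dd15e5a8e3) + (a << 54) + (a >> 7);
    return a;
}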

Greetings,

Andres Freund




[HACKERS] Current int & float overflow checking is slow.

2017-10-24 Thread Andres Freund
3,%xmm4,%xmm4
   0x0043ceae <+30>:vucomiss 0x2b6a4a(%rip),%xmm4# 0x6f3900
   0x0043ceb6 <+38>:jbe0x43ced4 <float4pl+68>
   0x0043ceb8 <+40>:vandps %xmm3,%xmm1,%xmm1
   0x0043cebc <+44>:vucomiss 0x2b6a3c(%rip),%xmm1# 0x6f3900
   0x0043cec4 <+52>:ja 0x43ced4 <float4pl+68>
   0x0043cec6 <+54>:vandps %xmm3,%xmm2,%xmm2
   0x0043ceca <+58>:vucomiss 0x2b6a2e(%rip),%xmm2# 0x6f3900
   0x0043ced2 <+66>:jbe0x43ced9 <float4pl+73>
   0x0043ced4 <+68>:vmovd  %xmm0,%eax
   0x0043ced8 <+72>:retq
   0x0043ced9 <+73>:push   %rbx
   # call to ereport

clang's code is much worse: it generates *external* function calls for
isinf (can be fixed by redefining isinf to __builtin_isinf).

Entirely removing the overflow checks results in:
   0x00801850 <+0>: vmovss 0x28(%rdi),%xmm1 # arg2
   0x00801855 <+5>: vaddss 0x20(%rdi),%xmm1,%xmm0 # arg1 + arg2
   0x0080185a <+10>:vmovd  %xmm0,%eax # convert to int
   0x0080185e <+14>:mov%eax,%eax # clear upper bits
   0x00801860 <+16>:retq

which unsurprisingly is a good bit faster.  float4mul etc generate even
worse code.

There's no comparable overflow handling to the above integer
intrinsics. But I think we can still do a lot better. Two very different
ways:

1) Just give up on detecting overflows for floats. Generating inf in
   these cases actually seems entirely reasonable. We already don't
   detect them in a bunch of cases anyway.  I can't quite parse the
   standard's language around this.
2) Use platform specific float exception handling where available. We
   could, at backend start and in FloatExceptionHandler(), use
   feenableexcept() (windows has similar) to trigger SIGFPE on float
   overflow.
3) Magic?

Greetings,

Andres Freund




Re: [HACKERS] legitimacy of using PG_TRY , PG_CATCH , PG_END_TRY in C function

2017-10-23 Thread Andres Freund
On 2017-10-22 23:04:50 -0400, Tom Lane wrote:
> John Lumby <johnlu...@hotmail.com> writes:
> > I have a C function (a trigger function) which uses the PG_TRY() 
> > construct to handle certain ERROR conditions.
> > One example is where invoked as INSTEAD OF INSERT into a view.  It 
> > PG_TRYs INSERT into the real base table,
> > but this table may not yet exist  (it is a partitioned child of an 
> > inheritance parent).
> > If the error is  ERRCODE_UNDEFINED_TABLE,  then the CATCH issues 
> > FlushErrorState() and returns to caller who CREATes the table and 
> > re-issues the insert.
> > All works perfectly (on every release of 9.x).
> 
> If it works, it's only because you didn't try very hard to break it.
> In general you can't catch and ignore errors without a full-fledged
> subtransaction --- BeginInternalSubTransaction, then either
> ReleaseCurrentSubTransaction or RollbackAndReleaseCurrentSubTransaction,
> not just FlushErrorState.  See e.g. plpgpsql's exec_stmt_block.
> 
> There may well be specific scenarios where an error gets thrown without
> having done anything that requires transaction cleanup.  But when you
> hit a scenario where that's not true, or when a scenario that used to
> not require cleanup now does, nobody is going to consider that a PG bug.

It'd probably be a good idea to expand on this in the sgml docs. This
has confused quite a number of people...
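
For the archives, the pattern Tom is referring to looks roughly like
this (a condensed sketch modeled on plpgsql's exec_stmt_block; real code
also needs to think about which memory context the error data lives in):

MemoryContext oldcontext = CurrentMemoryContext;
ResourceOwner oldowner = CurrentResourceOwner;

BeginInternalSubTransaction(NULL);
MemoryContextSwitchTo(oldcontext);

PG_TRY();
{
    /* ... run the statement that may fail, e.g. the INSERT ... */

    ReleaseCurrentSubTransaction();
    MemoryContextSwitchTo(oldcontext);
    CurrentResourceOwner = oldowner;
}
PG_CATCH();
{
    ErrorData  *edata;

    MemoryContextSwitchTo(oldcontext);
    edata = CopyErrorData();
    FlushErrorState();

    RollbackAndReleaseCurrentSubTransaction();
    MemoryContextSwitchTo(oldcontext);
    CurrentResourceOwner = oldowner;

    if (edata->sqlerrcode != ERRCODE_UNDEFINED_TABLE)
        ReThrowError(edata);
    FreeErrorData(edata);

    /* only now is it safe to CREATE the table and retry */
}
PG_END_TRY();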

Greetings,

Andres Freund




Re: [HACKERS] legitimacy of using PG_TRY , PG_CATCH , PG_END_TRY in C function

2017-10-23 Thread Andres Freund
On 2017-10-23 16:16:10 +0800, Craig Ringer wrote:
> On 23 October 2017 at 08:30, John Lumby <johnlu...@hotmail.com> wrote:
> 
> > All works but not perfectly --  at COMMIT,  resource_owner issues
> > relcache reference leak messages about relation scans not closed
> > and also about  snapshot still active. I guess that the CREATE has
> > switched resource_owner and pushed a snapshot,  but I did not
> > debug in detail.
> 
> A lot more work is required than what's done in PG_CATCH to return to
> a queryable state. I've been down this path myself, and it's not fun.
> 
> Take a look at all the extra work done on the error handling path in
> PostgresMain.

That seems quite misleading - that's *not* what needs to be done
to catch an error inside a function. See Tom's response.


> At some point I'd really like to expose that in a more general way so
> it can be used from background workers. Right now AFAICS most
> background workers have to cope with errors with a proc_exit(1) and
> getting restarted to try again. Not ideal.

I agree that generalizing wouldn't be bad, but there's absolutely
nothing preventing you from handling errors in bgworkers without
restarting today.

Greetings,

Andres Freund




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-19 Thread Andres Freund
On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
> > > + init_local_spin_delay();
> > 
> > The way you moved this around has the disadvantage that we now do this -
> > a number of writes - even in the very common case where the lwlock can
> > be acquired directly.
> 
> Excuse me, I don't quite understand.
> Are you complaining about init_local_spin_delay being placed here?

Yes.


> Placing it anywhere else will complicate the code.


> Or are you complaining about the setting of `mask` and `add`?

That seems right.


> In both cases, I think the simpler version should be accepted first. It
> acts as the algorithm definition. And it already gives a measurable
> improvement.

Well, in scalability. I'm less sure about uncontended performance.



> > > +  * We intentionally do not call finish_spin_delay here, because the
> > > +  * loop above usually finishes by queuing into the wait list on
> > > +  * contention, and doesn't reach spins_per_delay, thereby doesn't
> > > +  * sleep inside of perform_spin_delay. Also, different LWLocks have
> > > +  * very different contention patterns, and it is wrong to update
> > > +  * spin-lock statistics based on LWLock contention.
> > > +  */
> > 
> > Huh? This seems entirely unconvincing. Without adjusting this here we'll
> > just spin the same way every iteration. Especially for the case where
> > somebody else holds LW_FLAG_LOCKED that's quite bad.
> 
> LWLocks are very different. Some of them are always short-term
> (BufferLock), others are always locked for a long time.

That seems not particularly relevant. The same is true for
spinlocks. The relevant question isn't how long the lwlock is held, it's
how long LW_FLAG_LOCKED is held - and that should only depend on
contention (i.e. bus speed, number of times put to sleep while holding
lock, etc), not on how long the lock is held.

> I've tried to place this delay into the lock itself (it has 2 free
> bytes), but this attempt performed worse.

That seems unsurprising - there's a *lot* of locks, and we'd have to
tune all of them. Additionally there's a bunch of platforms where we do
*not* have free bytes (consider the embedding in BufferTag).


> Now I understand that delays should be stored in an array indexed by
> tranche. But I have no time to test this idea. And I doubt it will give
> radically better results (i.e. > 5%), so I think it is better to accept
> the patch in this form, and then experiment with per-tranche delays.

I don't think tranches have any decent predictive power here.


Greetings,

Andres Freund




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-18 Thread Andres Freund
>_state, desired_state))
> + {
> + LWLockQueueSelfLocked(lock, waitmode);
> + break;
>   }
> - else
> - return true;/* somebody else has the lock */
> + }
> + else
> + {
> + perform_spin_delay();
> + old_state = pg_atomic_read_u32(&lock->state);
>   }
>   }
> - pg_unreachable();
> +#ifdef LWLOCK_STATS
> + add_lwlock_stats_spin_stat(lock, );
> +#endif
> +
> + /*
> +  * We intentionally do not call finish_spin_delay here, because the loop
> +  * above usually finished by queuing into the wait list on contention, 
> and
> +  * doesn't reach spins_per_delay thereby doesn't sleep inside of
> +  * perform_spin_delay. Also, different LWLocks has very different
> +  * contention pattern, and it is wrong to update spin-lock statistic 
> based
> +  * on LWLock contention.
> +  */

Huh? This seems entirely unconvincing. Without adjusting this here we'll
just spin the same way every iteration. Especially for the case where
somebody else holds LW_FLAG_LOCKED that's quite bad.


> From e5b13550fc48d62b0b855bedd7fcd5848b806b09 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny.fal...@postgrespro.ru>
> Date: Tue, 30 May 2017 18:54:25 +0300
> Subject: [PATCH 5/6] Fix implementation description in a lwlock.c .
> 
> ---
>  src/backend/storage/lmgr/lwlock.c | 17 -
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/src/backend/storage/lmgr/lwlock.c 
> b/src/backend/storage/lmgr/lwlock.c
> index 334c2a2d96..0a41c2c4e2 100644
> --- a/src/backend/storage/lmgr/lwlock.c
> +++ b/src/backend/storage/lmgr/lwlock.c
> @@ -62,16 +62,15 @@
>   * notice that we have to wait. Unfortunately by the time we have finished
>   * queuing, the former locker very well might have already finished its
>   * work. That's problematic because we're now stuck waiting inside the OS.
> -
> - * To mitigate those races we use a two phased attempt at locking:
> - *Phase 1: Try to do it atomically, if we succeed, nice
> - *Phase 2: Add ourselves to the waitqueue of the lock
> - *Phase 3: Try to grab the lock again, if we succeed, remove ourselves 
> from
> - * the queue
> - *Phase 4: Sleep till wake-up, goto Phase 1
>   *
> - * This protects us against the problem from above as nobody can release too
> - * quick, before we're queued, since after Phase 2 we're already queued.
> + * This race is avoided by taking a lock for the wait list using CAS
> + * with the old state value, so it would only succeed if the lock is
> + * still held. The necessary flag is set using the same CAS.
> + *
> + * Inside LWLockConflictsWithVar the wait list lock is reused to protect
> + * the variable value. First the lock is acquired to check the variable
> + * value, then flags are set with a second CAS before queuing to the
> + * wait list, in order to ensure that the lock was not released yet.
>   * -
>   */

I think this needs more extensive surgery.



> From cc74a849a64e331930a2285e15445d7f08b54169 Mon Sep 17 00:00:00 2001
> From: Sokolov Yura <funny.fal...@postgrespro.ru>
> Date: Fri, 2 Jun 2017 11:34:23 +
> Subject: [PATCH 6/6] Make SpinDelayStatus a bit lighter.
> 
> It saves a couple of instructions in the fast path of the spin-loop,
> and so makes the fast path of LWLock a bit faster (and other places
> where spinlocks are used).

> Also it makes the call to perform_spin_delay a bit slower, which could
> positively affect spin behavior, especially if no `pause` instruction
> is present.

Whaa? That seems pretty absurd reasoning.




One big advantage of this is that we should be able to make LW_SHARED
acquisition an xadd. I'd done that previously, when the separate
spinlock still existed, and it was quite beneficial for performance on
larger systems. But it was some fiddly code - should be easier now.
Requires some careful management when noticing that an xadd acquired a
shared lock even though there's a conflicting exclusive lock, but it
speeds up the fastpath.
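
A sketch of the optimistic variant, using the existing LW_VAL_* values
(the back-out protocol is exactly the fiddly part alluded to above):

/* optimistically add a share reference in a single atomic op */
old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
if (old_state & LW_VAL_EXCLUSIVE)
{
    /*
     * Conflict: somebody holds the lock exclusively.  Back out our
     * reference and fall back to queueing on the wait list.
     */
    pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
    /* ... slow path ... */
}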

Greetings,

Andres Freund




Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-10-18 Thread Andres Freund
Hi,

On 2017-06-05 16:22:58 +0300, Sokolov Yura wrote:
> The patch changes the way LWLocks are acquired.
> 
> Before the patch, the lock was acquired using the following procedure:
> 1. First LWLock->state is checked for the ability to take the lock, and
>   a CAS is performed (the lock may or may not be acquired). This is
>   retried if the status changed.
> 2. If the lock was not acquired at step 1, the Proc is put into the wait
>   list, using the LW_FLAG_LOCKED bit of LWLock->state as a spin-lock for
>   the wait list.
>   In fact, profiles show that LWLockWaitListLock spends a lot of CPU on
>   contended LWLocks (up to 20%).
>   An additional CAS loop is inside pg_atomic_fetch_or_u32 for setting
>   LW_FLAG_HAS_WAITERS. And releasing the wait list lock is another CAS
>   loop on the same LWLock->state.
> 3. After that the state is checked again, because the lock could have
>   been released before the Proc was queued into the wait list.
> 4. If the lock was acquired at step 3, the Proc should be dequeued from
>   the wait list (another spin lock and CAS loop). And this operation
>   could be quite heavy, because the whole list must be traversed.
> 
> The patch makes lock acquisition a single CAS loop:
> 1. LWLock->state is read, and the ability to acquire the lock is
>   detected. If there is a possibility to take the lock, a CAS is tried.
>   If the CAS succeeds, the lock is acquired (same as the original
>   version).
> 2. But if the lock is currently held by another backend, we check the
>   ability to take the wait list lock. If the wait list lock is not held
>   by anyone, a CAS is performed to take the wait list lock and set
>   LW_FLAG_HAS_WAITERS at once. If the CAS succeeds, then the LWLock was
>   still held at the moment the wait list lock was taken - this proves
>   the correctness of the new algorithm. The Proc is then queued to the
>   wait list.
> 3. Otherwise a spin_delay is performed, and the loop returns to step 1.

Right, something like this didn't use to be possible because we had both
LWLock->state and LWLock->mutex. But after 008608b9d5106 that's not
a concern anymore.

> The algorithm for LWLockWaitForVar is also refactored. The new version
> is:
> 1. If the lock is not held by anyone, exit immediately.
> 2. Otherwise check the ability to take the wait list lock, because the
>   variable is protected by it. If so, a CAS is performed, and if it is
>   successful, the loop breaks to step 4.
> 3. Otherwise a spin_delay is performed, and the loop returns to step 1.
> 4. The var is checked for change.
>   If it has changed, we release the wait list lock and exit.
>   Note: it cannot change in the following steps because we are holding
>   the wait list lock.
> 5. Otherwise a CAS setting the necessary flags is performed. If it
>   succeeds, the Proc is queued to the wait list and we exit.
> 6. If the CAS failed, then there is a possibility that the LWLock has
>   already been released - if so we should release the wait list lock
>   and exit.
> 7. Otherwise the loop returns to step 5.
> 
> So the invariant is preserved:
> - either we found the LWLock free,
> - or we found the variable changed,
> - or we set the flags and queued ourselves while the LWLock was held.
> 
> spin_delay is not performed at step 7, because we should release the
> wait list lock as soon as possible.

That seems unconvincing - by not delaying you're more likely to
*increase* the time till the current locker that holds the lock can
release the lock.

> Additionally, taking the wait list lock is skipped in both algorithms if
> SpinDelayStatus.spins is less than a fraction of the spins_per_delay loop
> iterations (i.e. the initial iterations, and some iterations after
> pg_usleep, because SpinDelayStatus.spins is reset after sleeping).

I quite strongly think it's a good idea to change this at the same time
as the other changes you're proposing here.  I think it's a worthwhile
thing to pursue, but that change also has quite the potential for
regressing in some cases.

- Andres




Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-18 Thread Andres Freund
locks - so there's actually no external barrier
providing these guarantees that could be combined with SetLatch()'s
barrier.

Presumably part of the cost here comes from the fact that the barriers
actually do have an influence over the ordering.


> All that having been said, a batch variant for reading tuples in bulk
> might make sense.  I think when I originally wrote this code I was
> thinking that one process might be filling the queue while another
> process was draining it, and that it might therefore be important to
> free up space as early as possible.  But maybe that's not a very good
> intuition.

I think that's a sensible goal, but I don't think that has to mean that
the draining has to happen after every entry. If you'd e.g. have a
shm_mq_receivev() with 16 iovecs, that'd commonly be only part of a
single tqueue mq (unless your tuples are > 4k).  I'll note that afaict
shm_mq_sendv() already batches its SetLatch() calls.

I think one important thing a batched drain can avoid is that a worker
awakes to just put one new tuple into the queue and then sleep
again. That's kinda expensive.


> >b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
> >   queue whenever it's not empty when a tuple is ready.
> 
> Batching them with what?  I hate to postpone sending tuples we've got;
> that sounds nice in the case where we're sending tons of tuples at
> high speed, but there might be other cases where it makes the leader
> wait.

Yea, that'd need some smarts. How about doing something like batching up
locally as long as the queue contains less than one average-sized batch?


> > 6) I've observed, using strace, debug outputs with timings, and top with
> >a short interval, that quite often only one backend has sufficient
> >work, while other backends are largely idle.
> 
> Doesn't that just mean we're bad at choosing how man workers to use?
> If one worker can't outrun the leader, it's better to have the other
> workers sleep and keep one the one lucky worker on CPU than to keep
> context switching.  Or so I would assume.

No, I don't think that's necessarily true. If that worker's queue is full
every time, then yes. But I think a common scenario is that the
"current" worker only has a small portion of the queue filled. Draining
that immediately just leads to increased cacheline bouncing.

I'd not previously thought about this, but won't staying sticky to the
current worker potentially cause pins on individual tuples to be held
for a long time by workers not making any progress?


> >It might also be a good idea to use a more directed form of wakeups,
> >e.g. using a per-worker latch + a wait event set, to avoid iterating
> >over all workers.
> 
> I don't follow.

Well, right now we're busily checking each worker's queue. That's fine
with a handful of workers, but starts to become not that cheap pretty
soon afterwards. In the surely common case where the workers are the
bottleneck (because that's when parallelism is worthwhile), we'll check
each worker's queue whenever one of them has returned a single tuple.
It'd not be a stupid idea to have a per-worker latch and wait for the
latches of all workers at once, then specifically drain the queues of the workers
that WaitEventSetWait(nevents = nworkers) signalled as ready.
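
Roughly like this, using the WaitEventSet API (a sketch -
worker_latches[] and MAX_WORKERS are hypothetical names, and
shutdown/error handling is omitted):

WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, nworkers);
WaitEvent   occurred[MAX_WORKERS];
int         i;

for (i = 0; i < nworkers; i++)
    AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
                      worker_latches[i], (void *) (intptr_t) i);

for (;;)
{
    int         nready = WaitEventSetWait(set, -1L /* block */,
                                          occurred, nworkers, 0);
    int         j;

    for (j = 0; j < nready; j++)
    {
        int         worker = (int) (intptr_t) occurred[j].user_data;

        ResetLatch(worker_latches[worker]);
        /* drain only this worker's queue, ideally a full batch at once */
    }
}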

Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-17 Thread Andres Freund
Hi,

On 2017-10-17 14:39:57 -0700, Andres Freund wrote:
> I've spent some time looking into this, and I'm not quite convinced this
> is the right approach.  Let me start by describing where I see the
> current performance problems around gather stemming from.

One further approach to several of these issues could also be to change
things a bit more radically:

Instead of the current shm_mq + tqueue.c, have a drastically simpler
queue that just stores fixed-width dsa_pointers. Dealing with that
queue will be quite a bit faster. In that queue one would store dsa.c
managed pointers to tuples.

One thing that makes that attractive is that that'd move a bunch of
copying in the leader process solely to the worker processes, because
the leader could just convert the dsa_pointer into a local pointer and
hand that upwards the execution tree.

We'd possibly need some halfway clever way to reuse dsa allocations, but
that doesn't seem impossible.
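
Such a queue could be as simple as a single-producer single-consumer
ring of dsa_pointers (a sketch with hypothetical names; fullness
signalling, position wraparound, and waiting are omitted):

typedef struct DsaPointerQueue
{
    uint64      size;           /* number of slots, a power of two */
    pg_atomic_uint64 read_pos;  /* next slot the consumer reads */
    pg_atomic_uint64 write_pos; /* next slot the producer writes */
    dsa_pointer slots[FLEXIBLE_ARRAY_MEMBER];
} DsaPointerQueue;

static bool
dpq_push(DsaPointerQueue *q, dsa_pointer tuple)
{
    uint64      wp = pg_atomic_read_u64(&q->write_pos);
    uint64      rp = pg_atomic_read_u64(&q->read_pos);

    if (wp - rp >= q->size)
        return false;           /* queue full */

    q->slots[wp & (q->size - 1)] = tuple;
    /* publish the slot contents before the new write position */
    pg_write_barrier();
    pg_atomic_write_u64(&q->write_pos, wp + 1);
    return true;
}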

Greetings,

Andres Freund




Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-17 Thread Andres Freund
 * variables possibly changed by this process have been 
flushed to main
   │   * memory, before we check/set is_set.
   │   */
   │  pg_memory_barrier();
 77.43 │lock   addl   $0x0,(%rsp)
   │
   │  /* Quick exit if already set */
   │  if (latch->is_set)
  0.12 │mov(%rdi),%eax


   Commenting out the memory barrier - which is NOT CORRECT - improves
   timing:
   before: 4675.626
   after: 4125.587

   The correct fix obviously is not to break latch correctness. I think
   the big problem here is that we perform a SetLatch() for every read
   from the latch.

   I think we should
   a) introduce a batch variant for receiving like:

extern shm_mq_result shm_mq_receivev(shm_mq_handle *mqh,
 shm_mq_iovec *iov, int *iovcnt,
 bool nowait)

  which then only does the SetLatch() at the end. And maybe if it
  wrapped around.

   b) Use shm_mq_sendv in tqueue.c by batching up insertions into the
  queue whenever it's not empty when a tuple is ready.

   I've not benchmarked this, but I'm pretty certain that the benefit
   isn't just going to be the reduced cost of SetLatch() calls, but also
   increased performance due to fewer context switches.

6) I've observed, using strace, debug outputs with timings, and top with
   a short interval, that quite often only one backend has sufficient
   work, while other backends are largely idle.

   I think the problem here is that the "anti round robin" provisions from
   bc7fcab5e36b9597857, while much better than the previous state, have
   swung a bit too far into the other direction. Especially if we were
   to introduce batching as I suggest in 5), but even without, this
   leads to back-fort on half-empty queues between the gatherstate->nextreader
   backend, and the leader.

   I'm not 100% certain what the right fix here is.

   One fairly drastic solution would be to move away from a
   single-producer-single-consumer style per-worker queue, to a global
   queue. That'd avoid fairness issues between the individual workers,
   at the price of potential added contention. One disadvantage is that
   such a combined queue approach is not easily applicable for gather
   merge.

   One less drastic approach would be to move to try to drain the queue
   fully in one batch, and then move to the next queue. That'd avoid
   triggering small wakeups for each individual tuple, as one
   currently would get without the 'stickyness'.

   It might also be a good idea to use a more directed form of wakeups,
   e.g. using a per-worker latch + a wait event set, to avoid iterating
   over all workers.


Unfortunately the patch's "local worker queue" concept seems, to me,
like it's not quite addressing the structural issues, instead opting to
address them by adding another layer of queuing. I suspect that if we'd
go for the above solutions there'd be only very small benefit in
implementing such per-worker local queues.

Greetings,

Andres Freund




Re: [HACKERS] Windows warnings from VS 2017

2017-10-17 Thread Andres Freund
On 2017-10-11 18:20:24 -0400, Tom Lane wrote:
> Well, it's not that much work to try it and see.  I compared results
> of this simplistic test case:
>   pgbench -S -c 1 -T 60 bench
> (using a scale-factor-10 pgbench database) on current HEAD and HEAD
> with the attached patch, which just lobotomizes all the MemSet
> macros into memset().  Median of 3 runs is 9698 tps for HEAD and
> 9675 tps with the patch; the range is ~100 tps though, making
> this difference well within the noise level.
> 
> I did this using RHEL6's gcc (Red Hat 4.4.7-18), which is pretty
> far from bleeding-edge so I think it's okay as a baseline for
> what optimizations we can expect to find used in the field.
> 
> So I can't sustain Andres' assertion that memset is actually faster
> for the cases we care about, but it certainly doesn't seem any
> slower either.  It would be interesting to check compromise
> patches, such as keeping only the MemSetLoop case.

Well, that'd be because that workload doesn't exercise either version of
memset to a meaningful degree, right?

> +#define MemSetAligned(start, val, len) memset(start, val, len)

If we actually care about this optimization, which seems to solely serve
palloc0fast, we could tell the compiler about the guaranteed alignment
etc.  I'm doubtful it's worth making this work for palloc0fast alone,
but I think it might be worthwhile to make palloc0 an inline function
like

#ifdef YES_I_AM_A_GOOD_COMPILER
/* not needed here, but probably good for plenty of other places */
#define pg_attribute_assume_aligned(align) __attribute__((assume_aligned(align)))
#define pg_assume_aligned(ptr, align) __builtin_assume_aligned(ptr, align)
#else
...
#endif

extern void *palloc(Size size) pg_attribute_assume_aligned(MAXIMUM_ALIGNOF);

static inline void *
palloc0(Size size)
{
    void *mem = palloc(size);

    memset(mem, 0, size);
    return mem;
}

that should allow the compiler to always optimize the memset based on
the alignment - at least x86 gcc does so - and on the size if it's a
build-time constant (nearly every compiler does so).

I assume we probably should do this dance not just for palloc, but for
the other allocation functions too.
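
Concretely, that dance might look like the following for the other
allocators (just a sketch - note it uses MAXIMUM_ALIGNOF, since the
attribute wants a plain integer constant rather than the MAXALIGN()
macro):

extern void *repalloc(void *pointer, Size size)
    pg_attribute_assume_aligned(MAXIMUM_ALIGNOF);
extern void *MemoryContextAlloc(MemoryContext context, Size size)
    pg_attribute_assume_aligned(MAXIMUM_ALIGNOF);
extern void *MemoryContextAllocZero(MemoryContext context, Size size)
    pg_attribute_assume_aligned(MAXIMUM_ALIGNOF);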

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] Faster processing at Gather node

2017-10-16 Thread Andres Freund
Hi Rafia,

On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> head:
> explain  analyse select * from t where i < 3000;
>  QUERY PLAN

Could you share how exactly you generated the data? Just so others can
compare a bit better with your results?

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is it time to kill support for very old servers?

2017-10-16 Thread Andres Freund
On 2017-10-16 16:59:59 -0400, Peter Eisentraut wrote:
> On 9/20/17 04:32, Andres Freund wrote:
> > Here's what I roughly was thinking of.  I don't quite like the name, and
> > the way the version is specified for libpq (basically just the "raw"
> > integer).
> 
> "forced_protocol_version" reads wrong to me.  I think
> "force_protocol_version" might be better.  Other than that, no issues
> with this concept.

Yea, I agree. I've read through the patch since, and it struck me as
odd. Not sure how I came up with it...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Aggregate FILTER option is broken in v10

2017-10-16 Thread Andres Freund
On 2017-10-16 11:12:09 -0400, Tom Lane wrote:
> I wrote:
> > I think possibly the best answer is to revert 8ed3f11bb.  We could
> > think about some compromise solution like merging the projections
> > only for aggregates without FILTER.  But that would require two
> > code paths through the relevant functions in nodeAgg.c, which would
> > be a substantial maintenance burden; and the extra branches required
> > means that this would be a net negative for performance in the
> > simplest case with only one aggregate.
> 
> Hmm ... on closer inspection, the only performance-critical place
> where this matters is advance_aggregates, and that already has a check
> for whether the particular aggregate has a filter.  So we could do
> something like
> 
> /* Skip anything FILTERed out */
> if (filter)
> {
> // existing code to eval/check filter expr
> +
> +   /* Now it's safe to evaluate this agg's arguments */
> +   slot = ExecProject(pertrans->argproj);
> }
> +   else
> +   slot = aggstate->evalslot;
> 
> which seems like a pretty minimal extra cost for the normal case
> with no filter.

Thanks, that looks like a reasonable fix.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_control_recovery() return value when not in recovery

2017-10-13 Thread Andres Freund
On 2017-10-13 16:31:37 -0700, Joe Conway wrote:
> On 09/17/2017 11:29 PM, Andres Freund wrote:
> > On 2017-09-18 07:24:43 +0100, Simon Riggs wrote:
> >> On 18 September 2017 at 05:50, Andres Freund <and...@anarazel.de> wrote:
> >> > Hi,
> >> >
> >> > Just noticed that we're returning the underlying values for
> >> > pg_control_recovery() without any checks:
> >> > postgres[14388][1]=# SELECT * FROM pg_control_recovery();
> >> > ┌──────────────────────┬───────────────────────────┬──────────────────┬────────────────┬───────────────────────────────┐
> >> > │ min_recovery_end_lsn │ min_recovery_end_timeline │ backup_start_lsn │ backup_end_lsn │ end_of_backup_record_required │
> >> > ├──────────────────────┼───────────────────────────┼──────────────────┼────────────────┼───────────────────────────────┤
> >> > │ 0/0                  │                         0 │ 0/0              │ 0/0            │ f                             │
> >> > └──────────────────────┴───────────────────────────┴──────────────────┴────────────────┴───────────────────────────────┘
> >> > (1 row)
> >> 
> >> Yes, that would have made sense for these to be NULL
> > 
> > Yea, that's what I think was well.  Joe, IIRC that's your code, do you
> > agree as well?
> 
> Sorry for the slow response, but thinking back on this now, the idea of
> these functions, in my mind at least, was to provide as close to the
> same output as possible to what pg_controldata outputs. So:
> 
> # pg_controldata
> ...
> Minimum recovery ending location: 0/0
> Min recovery ending loc's timeline:   0
> Backup start location:0/0
> Backup end location:  0/0
> End-of-backup record required:no
> ...
> 
> So if we make a change here, do we also change pg_controldata?

I'm unconvinced that they need to be kept that close. For one, text
output doesn't have the concept of NULLs. Secondly, pg_controldata
also outputs the cluster state at the same time.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Disable cross products in postgres

2017-10-13 Thread Andres Freund
On 2017-10-14 03:49:57 +0530, Gourav Kumar wrote:
> But then is there some way to tell the optimizer not to consider transitive
> joins?

What are you actually trying to achieve here?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-13 Thread Andres Freund
Hi,

On 2017-10-14 10:38:13 +1300, David Rowley wrote:
> On 12 October 2017 at 04:50, Robert Haas <robertmh...@gmail.com> wrote:
> > We haven't really done a very good job figuring out what to do about
> > optimizations that some people (mostly you) think are marginal.  It's
> > obviously true that we can't just burn infinite planner cycles on
> > things that don't usually help, but at the same time, we can't just
> > keep ignoring requests by users for the same features and saying
> > "you'll be sad if we give this to you".  Those people don't give up on
> > wanting the optimization; they just don't use PostgreSQL.  I think we
> > need to stop just saying "no" and start thinking about what we could
> > do that would let us say "yes".

+1 to the general sentiment.


> I'm with Robert on this.  Planning time *is* important, but if we were
> to do a quick tally on the number of patches that we've seen in the
> last few years to improve execution time by either improving the
> executor code or adding some smarts to the planner to reduce executor
> work vs patches aimed to reduce planning time, I think we'd find the
> general focus is on the executor.

That's true. But I think that's partially because people benchmarking
with the goal of identifying bottlenecks just habitually use prepared
statements to remove planning overhead.

That works well enough in benchmarking scenarios, but in practice I
think that's less clearly a good tradeoff - e.g. in OLTP workloads you'll
often see custom plans defeating the intent of using prepared
statements.


> Personally, I don't recall a single patch aimed to just speed up the
> planner.

There were some a couple years back.


> It looks like there's plenty we could do in there, just nobody seems
> interested enough to go and do it, everyone who cares about
> performance is too busy trying to make execution run faster.

I think that's largely because our executor speed is quite bad. To some
degree because we paid much less attention to speed degradations there
for a couple years, absurdly enough.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve catcache/syscache performance.

2017-10-13 Thread Andres Freund
Hi,

On 2017-10-13 10:38:47 -0700, Andres Freund wrote:
> On 2017-10-13 13:06:41 -0400, Robert Haas wrote:
> > On Thu, Sep 14, 2017 at 2:12 AM, Andres Freund <and...@anarazel.de> wrote:
> > > This patch gives me roughly 8% speedup in a workload that consists out
> > > of a fast query that returns a lot of columns.  If I apply a few
> > > other performance patches, this patch itself starts to make a bigger
> > > difference, of around 11%.
> > 
> > I did a read-through of this patch today.
> 
> Thanks!

Pushed after making the adaptions you suggested, pgindenting, and fixing
one bug (cstring lookups + datumCopy() = not good).

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve catcache/syscache performance.

2017-10-13 Thread Andres Freund
On 2017-10-13 14:07:54 -0400, Tom Lane wrote:
> One idea might be to see if we can precalculate all the control data
> needed for the caches and set it up as compile-time constants,
> a la Gen_fmgrtab.pl, rather than reading it from the catalogs during
> startup.  That would make the code less dependent on initialization
> order rather than more so.

Hm. That sounds somewhat enticing. You're thinking of doing so for
catcaches alone, or something grander, including the relcaches? I'd
assume the former?

For catcaches the hardest part probably is that we need a TupleDesc. Per-type
function lookups, oids, should be fairly easy in contrast.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve catcache/syscache performance.

2017-10-13 Thread Andres Freund
Hi,

On 2017-10-13 13:06:41 -0400, Robert Haas wrote:
> On Thu, Sep 14, 2017 at 2:12 AM, Andres Freund <and...@anarazel.de> wrote:
> > This patch gives me roughly 8% speedup in a workload that consists out
> > of a fast query that returns a lot of columns.  If I apply a few
> > other performance patches, this patch itself starts to make a bigger
> > difference, of around 11%.
> 
> I did a read-through of this patch today.

Thanks!


> +/* not as performance critical & "complicated" */
> 
> This comment kinda sucks.  I don't think it will be clear to somebody
> in 3 years what this means.  It's clear enough in context but later I
> think it won't be.  I suggest dumping this altogether and expanding
> the comment up above to encompass this:
> 
> Hash and equality functions for system types that are used as cache
> key fields.  In some cases, we just call the regular SQL-callable
> functions for the appropriate data type, but that tends to be a little
> slow, and the speed of these functions is performance-critical.
> Therefore, for data types that frequently occur as catcache keys, we
> hard-code the logic here.  Avoiding the overhead of
> DirectFunctionCallN(...) is a substantial win, and in certain cases
> (like int4) we can adopt a faster hash algorithm as well.

K.


> -if (cache->cc_tupdesc == NULL)
> +if (unlikely(cache->cc_tupdesc == NULL))
>  CatalogCacheInitializeCache(cache);
> 
> I don't think it's this patch's job to do it, but it seems like we
> ought to just invent some early-initialization step where things like
> this can happen, so that we don't have to branch here at all.

Yea, that'd be nice - it does show up in profiles.  I'd tried to do
that, but it's not exactly trivial, so I decided to delay it. The
ordering when syscache stuff gets initialized is, uh, fragile.  During
InitCatalogCache() the catalog is not ready for access. After that, we
access catcaches while not all catalogs are quite ready to be accessed.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Continuous integration on Windows?

2017-10-12 Thread Andres Freund
On 2017-10-12 17:57:11 -0400, Andrew Dunstan wrote:
> No testing yet, but this runs a build successfully:
> <https://gist.github.com/adunstan/7f18e5db33bb2d73f69ff8c9337a4e6c>
> 
> See results at <https://ci.appveyor.com/project/AndrewDunstan/pgdevel>

"Time Elapsed 00:04:36.37"

I'd expected building would take a lot longer than that... Neat.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-10-12 Thread Andres Freund
On 2017-10-12 17:27:52 -0400, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 4:20 PM, Andres Freund <and...@anarazel.de> wrote:
> >> In other words, it's not utterly fixed in stone --- we invented
> >> --load-via-partition-root primarily to cope with circumstances that
> >> could change hash values --- but we sure don't want to be changing it
> >> with any regularity, or for a less-than-excellent reason.
> >
> > Yea, that's what I expected. It'd probably be good for somebody to run
> > smhasher or such on the output of the combine function (or even better,
> > on both the 32 and 64 bit variants) in that case.
> 
> Not sure how that test suite works exactly, but presumably the
> characteristics in practice will depend the behavior of the hash
> functions used as input the combine function - so the behavior could
> be good for an (int, int) key but bad for a (text, date) key, or
> whatever.

I don't think that's true, unless you have really bad hash functions on
the component hashes. A hash combine function can't really do
anything about badly hashed input; what you want is that it doesn't
*reduce* the quality of the hash by combining.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] show precise repos version for dev builds?

2017-10-12 Thread Andres Freund
Hi,

On 2017-10-12 22:50:47 +0200, Fabien COELHO wrote:
> The make dependencies ensure that the header file is regenerated on each
> build with a phony target, and the C file is thus recompiled and linked into
> the executables on each build. It means that all executables are linked on
> each rebuild, even if not necessary, though.

That'd be quite painful, consider e.g. people using LTO.  If done right
however, that should be avoidable to some degree. When creating the
version file, only replace its contents if the contents differ - that's
just a few lines of scripting.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-10-12 Thread Andres Freund
On 2017-10-12 16:06:11 -0400, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 3:43 PM, Andres Freund <and...@anarazel.de> wrote:
> > Are we going to rely on the the combine function to stay the same
> > forever after?
> 
> If we change them, it will be a pg_upgrade compatibility break for
> anyone using hash-partitioned tables with more than one partitioning
> column.  Dump and reload will also break unless
> --load-via-partition-root is used.
> 
> In other words, it's not utterly fixed in stone --- we invented
> --load-via-partition-root primarily to cope with circumstances that
> could change hash values --- but we sure don't want to be changing it
> with any regularity, or for a less-than-excellent reason.

Yea, that's what I expected. It'd probably be good for somebody to run
smhasher or such on the output of the combine function (or even better,
on both the 32 and 64 bit variants) in that case.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] hash partitioning

2017-10-12 Thread Andres Freund
On 2017-10-12 10:05:26 -0400, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 9:08 AM, amul sul <sula...@gmail.com> wrote:
> > How about combining high 32 bits and the low 32 bits separately as shown 
> > below?
> >
> > static inline uint64
> > hash_combine64(uint64 a, uint64 b)
> > {
> > return (((uint64) hash_combine((uint32) (a >> 32), (uint32) (b >> 32)) << 32)
> > | hash_combine((uint32) a, (uint32) b));
> > }
> 
> I doubt that's the best approach, but I don't have something specific
> to recommend.

Yea, that doesn't look great. There's basically no intermixing between
low and high 32 bits going on.  We probably should just expand the
concept of the 32 bit function:

static inline uint32
hash_combine32(uint32 a, uint32 b)
{
/* 0x9e3779b9 is the golden ratio reciprocal */
a ^= b + 0x9e3779b9 + (a << 6) + (a >> 2);
return a;
}

to something roughly like:

static inline uint64
hash_combine64(uint64 a, uint64 b)
{
/* 0x49A0F4DD15E5A8E3 is 64bit random data */
a ^= b + 0x49A0F4DD15E5A8E3 + (a << 54) + (a >> 7);
return a;
}

In contrast to the 32 bit version's fancy use of the golden ratio
reciprocal as a constant I went brute force, and just used 64 bits of
/dev/random. From my understanding the important property is that the bits
are independent from each other, nothing else.

The shift widths are fairly random, but they should bring in enough bit
perturbation when mixing in only 32 bits of hash value (i.e.
0x0000).

Are we going to rely on the combine function to stay the same
forever after?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Add configure infrastructure to detect support for C99's restric

2017-10-11 Thread Andres Freund
On 2017-10-11 21:59:53 -0700, Andres Freund wrote:
> That fixed that problem I think. But unfortunately since then another
> problem has been reported by some other animals, all with older msvc
> versions afaict (thrips - vs 2012, bowerbird - vs 2012).

Correction, thrips is vs 2010, not 2012.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Add configure infrastructure to detect support for C99's restric

2017-10-11 Thread Andres Freund
On 2017-10-11 17:13:20 -0700, Andres Freund wrote:
> Hi,
> 
> On 2017-10-11 23:11:15 +0000, Andres Freund wrote:
> > Add configure infrastructure to detect support for C99's restrict.
> > 
> > Will be used in later commits improving performance for a few key
> > routines where information about aliasing allows for significantly
> > better code generation.
> > 
> > This allows to use the C99 'restrict' keyword without breaking C89, or
> > for that matter C++, compilers. If not supported it's defined to be
> > empty.
> 
> Woodlouse doesn't like this, erroring out with:
> C:\buildfarm\buildenv\HEAD\pgsql.build\src\include\libpq/pqformat.h(47): 
> error C2219: syntax error : type qualifier must be after '*' 
> (src/backend/access/common/printsimple.c) 
> [C:\buildfarm\buildenv\HEAD\pgsql.build\postgres.vcxproj]
> 
> It's MSVC being super peculiar about error checks. I think msvc is just
> confused by the pointer hiding typedef. Using some online MSVC (and
> other compilers) frontend:
> https://godbolt.org/g/TD3nmA
> 
> I confirmed that removing the pointer hiding typedef indeed resolves the
> issue.
> 
> I can't quite decide whether msvc just has taste and dislikes pointer
> hiding typedefs as much as I do, or whether it's incredibly stupid ;)
> 
> I'll add a comment and use StringInfoData *.

That fixed that problem I think. But unfortunately since then another
problem has been reported by some other animals, all with older msvc
versions afaict (thrips - vs 2012, bowerbird - vs 2012). Those report
that the defining of restrict to __restrict interfers with some system
headers:
C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\include\stdlib.h(619): 
error C2485: '__restrict' : unrecognized extended attribute 
(src/backend/access/brin/brin.c) 
[c:\prog\bf\root\HEAD\pgsql.build\postgres.vcxproj]

Presumably that is because the headers contain __declspec(restrict) on
some function declarations.

I've temporarily silenced that error by moving the stdlib.h include
before the definition of restrict, but that seems fairly fragile. I
primarily wanted to see whether there's other problems.  At least thrips
is now happy.

I see a number of options to continue:
- only define restrict on msvc 2013+ - for some reason woodlouse didn't
  complain about this problem, but I'm very doubtful that that's
  actually reliable.
- rename restrict to pg_restrict. That's annoying because we'd have to
  copy the autoconf test.
- live with the hack of including stdlib.h early, in pg_config.h.win32.
- $better_idea

Comments?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-10-11 Thread Andres Freund
On 2017-10-11 08:54:10 -0700, Andres Freund wrote:
> On 2017-10-11 10:53:56 +0200, Alvaro Herrera wrote:
> > Maybe it'd be a good idea to push 0001 with some user of restrict ahead
> > of the rest, just to see how older msvc reacts.
> 
> Can do. Not quite sure which older user yet, but I'm sure I can find
> something.

I looked around and didn't immediately see a point where it'd be useful. I
don't really want to put it in some place where it's not useful. I think
we can just as well wait for the first patch using it to exercise
restrict support.

There's references to restrict support back to VS 2008:
https://msdn.microsoft.com/en-us/library/5ft82fed(v=vs.90).aspx


So I'll change the pg_config.h.win32 hunk to:
/* Define to the equivalent of the C99 'restrict' keyword, or to
   nothing if this is not supported.  Do not define if restrict is
   supported directly.  */
/* Visual Studio 2008 and upwards */
#if (_MSC_VER >= 1500)
/* works for C and C++ in msvc */
#define restrict __restrict
#else
#define restrict
#endif

there's several patterns like that (except for the version mapping) in
the file already.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Windows warnings from VS 2017

2017-10-11 Thread Andres Freund
> On 09/21/2017 09:41 PM, Andres Freund wrote:
> > On 2017-09-21 09:30:31 -0400, Tom Lane wrote:
> >> Andrew Dunstan <andrew.duns...@2ndquadrant.com> writes:
> >>> The speed of memset is hardly going to be the dominating factor in a
> >>> 'CREATE DATABASE' command, so we could certainly afford to change to
> >>> plain memset calls here.
> >> Another thought is that it may be time for our decennial debate about
> >> whether MemSet is worth the electrons it's printed on.  I continue to
> >> think that any modern compiler+libc ought to do an equivalent or better
> >> optimization given just a plain memset().  If that seems to be true
> >> for recent MSVC, we could consider putting an #if into c.h to define
> >> MemSet as just memset for MSVC.  Maybe later that could be extended
> >> to other compilers.
> > +many. glibc's memset is nearly an order of magnitude faster than MemSet
> > on my machine for medium+ length copies, and still 3-4 times faster at ~100
> > bytes. And both gcc and clang optimize away the memset entirely when the
> > length is a fixed short length.

> Perhaps we need some sort of small benchmark program that we can run on
> a representative set of platforms?

I personally don't think that's needed here, given how incredibly widely
used memset is in our codebase.


> I presume if you can make that assertion you already have something
> along those lines?

Not really. I just replaced memset with MemSetAligned in a bunch of
places in the code and looked at profiles.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [BUGS] BUG #14821: idle_in_transaction_session_timeout sometimes gets ignored when statement timeout is pending

2017-10-11 Thread Andres Freund
Hi,

On 2017-09-20 20:27:05 -0700, Lukas Fittl wrote:
> As per the bug report at
> https://www.postgresql.org/message-id/20170921010956.17345.61461%40wrigleys.postgresql.org
> it seems that the query cancellation holdoff logic in ProcessInterrupts is
> a bit overly aggressive in keeping other interrupts from running.
> 
> In particular I've seen an issue in the wild where
> idle_in_transaction_session_timeout did not get triggered because
> the HOLD_CANCEL_INTERRUPTS() in SocketBackend wraps around a pq_getbyte()
> call, and so ProcessInterrupts doesn't do anything when it gets called
> because the query cancel holdoff counter is positive.
> 
> Andres suggested the following re-ordering of the logic on -bugs:

I've pushed this. Thanks for the report & fix!

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-10-11 Thread Andres Freund
Hi,

On 2017-10-11 18:05:32 +0200, Alvaro Herrera wrote:
> Well, my concern is to ensure that extension authors take advantage of
> the optimized implementation.  If we never let them know that we've
> rewritten things, most are not going to realize they can make their
> extensions faster with a very simple code change.

The inline pq_sendint() already results in faster code in a good
number of cases, as a decent compiler will often be able to evaluate
the width switch at compile time.


> On the other hand, I suppose that for the vast majority of extensions,
> doing the change is unlikely to have a noticeable in performance, so
> perhaps we should just keep the shim and move along.

Yea, I think it's unlikely to be noticeable unless you do a lot of them
in a row. Unfortunately all send functions essentially allocate a new
StringInfo - which is going to dominate execution cost.  We literally
allocate 1kb to send a single four byte integer.
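
For illustration, this is roughly what a per-datum send function looks
like today (simplified from int4send() in src/backend/utils/adt/int.c):

Datum
int4send(PG_FUNCTION_ARGS)
{
    int32           arg1 = PG_GETARG_INT32(0);
    StringInfoData  buf;

    pq_begintypsend(&buf);      /* initStringInfo() -> palloc of 1024 bytes */
    pq_sendint32(&buf, arg1);   /* appends the actual four bytes */
    PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
}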

Fixing the output function performance requires a fairly different type
of patch imo.


> If do nothing, it's unlikely we'd ever get rid of the compat function.

I think that's ok.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-11 Thread Andres Freund
Hi,

On 2017-10-09 10:37:01 +0300, Konstantin Knizhnik wrote:
> So we have implement sharding - splitting data between several remote tables
> using pg_pathman and postgres_fdw.
> It means that insert or update of the parent table causes insert or update of
> some derived partitions, which is forwarded by postgres_fdw to the
> corresponding node.
> Number of shards is significantly larger than number of nodes, i.e. for 5
> nodes we have 50 shards. Which means that at each node we have 10 shards.
> To provide fault tolerance each shard is replicated using logical
> replication to one or more nodes. Right now we considered only redundancy
> level 1 - each shard has only one replica.
> So from each node we establish 10 logical replication channels.

Isn't that part of the pretty fundamental problem? There shouldn't be 10
different replication channels per node. There should be one.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-10-11 Thread Andres Freund
On 2017-10-11 10:53:56 +0200, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Hi,
> > 
> > On 2017-10-03 13:58:37 -0400, Robert Haas wrote:
> > > On Tue, Oct 3, 2017 at 12:23 PM, Andres Freund <and...@anarazel.de> wrote:
> > > > Makes sense?
> > > 
> > > Yes.
> > 
> > Here's an updated version of this patchset.
> 
> Maybe it'd be a good idea to push 0001 with some user of restrict ahead
> of the rest, just to see how older msvc reacts.

Can do. Not quite sure which older user yet, but I'm sure I can find
something.


> I wonder if it'd be a good idea to nag external users about pq_sendint
> usage (is a #warning possible?).

Think we'd need some separate infrastructure, i.e. for gcc ending up
with __attribute__((deprecated)). I don't quite see this being worth
adding, but ...
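
If we ever wanted it, that infrastructure would presumably look
something like this (pg_attribute_deprecated is hypothetical, nothing
like it exists in the tree):

#if defined(__GNUC__) || defined(__clang__)
#define pg_attribute_deprecated() __attribute__((deprecated))
#else
#define pg_attribute_deprecated()
#endif

/* declaration with the attribute, ahead of the inline definition */
static inline void pq_sendint(StringInfo buf, uint32 i, int b)
    pg_attribute_deprecated();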


> OTOH, do we really need to keep it
> around?  Maybe we should ditch it, since obviously the compat shim can
> be installed locally in each extension that really needs it (thinking
> that most external code can simply be adjusted to the new functions).

That seems like causing unnecessary pain - we're talking about a few
lines in a header here, right? It's not like they'll be trivially
converting to pq_sendint$width anytime soon, unless we backpatch this.


> I'm scared about the not-null-terminated stringinfo stuff.  Is it
> possible to create bugs by polluting a stringinfo with it, then having
> the stringinfo be used by unsuspecting code?  Admittedly, you can break
> things already with the binary appends, so probably not an issue.

All of the converted sites already add integers into the StringInfo -
and most of those integers consist of a couple bytes of 0,
because they're lengths. So I don't think there's a huge danger here.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-10-10 Thread Andres Freund
Hi,

On 2017-10-03 13:58:37 -0400, Robert Haas wrote:
> On Tue, Oct 3, 2017 at 12:23 PM, Andres Freund <and...@anarazel.de> wrote:
> > Makes sense?
> 
> Yes.

Here's an updated version of this patchset. Changes:

- renamed pq_send$type_pre to pq_write$type
- renamed pq_beginmessage_pre/pq_beginmessage_keep to the _reuse suffix
- removed unaligned memory access issues by again using memcpy - gcc and
  clang both successfully optimize it away
- moved permanent buffer for SendRowDescriptionMessage to postgres.c,
  and have other callers use already pre-existing buffers.
- replace all pq_sendint with pq_sendint$width in core
- converted applicable pq_begin/endmessage in printtup.c users to use
  DR_printtup->buf.
- added comments explaining restrict usage
- expanded commit messages considerably
- Small stuff.

The naming I'd discussed a bit back and forth with Robert over IM,
thanks!

- Andres
>From 89e301384c9fbc071de19c1517ffe29371fc6f36 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Wed, 20 Sep 2017 13:01:22 -0700
Subject: [PATCH 1/6] Add configure infrastructure to detect support for C99's
 restrict.

Will be used in later commits improving performance for a few key
routines where information about aliasing allows for significantly
better code generation.

This allows to use the C99 'restrict' keyword without breaking C89, or
for that matter C++, compilers. If not supported it's defined to be
empty.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914063418.sckdzgjfrsbek...@alap3.anarazel.de
---
 configure | 46 +++
 configure.in  |  1 +
 src/include/pg_config.h.in| 14 +
 src/include/pg_config.h.win32 |  6 ++
 4 files changed, 67 insertions(+)

diff --git a/configure b/configure
index b0582657bf4..ca54242d5d7 100755
--- a/configure
+++ b/configure
@@ -11545,6 +11545,52 @@ _ACEOF
 ;;
 esac
 
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for C/C++ restrict keyword" >&5
+$as_echo_n "checking for C/C++ restrict keyword... " >&6; }
+if ${ac_cv_c_restrict+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_cv_c_restrict=no
+   # The order here caters to the fact that C++ does not require restrict.
+   for ac_kw in __restrict __restrict__ _Restrict restrict; do
+ cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+typedef int * int_ptr;
+	int foo (int_ptr $ac_kw ip) {
+	return ip[0];
+   }
+int
+main ()
+{
+int s[1];
+	int * $ac_kw t = s;
+	t[0] = 0;
+	return foo(t)
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  ac_cv_c_restrict=$ac_kw
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ test "$ac_cv_c_restrict" != no && break
+   done
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_c_restrict" >&5
+$as_echo "$ac_cv_c_restrict" >&6; }
+
+ case $ac_cv_c_restrict in
+   restrict) ;;
+   no) $as_echo "#define restrict /**/" >>confdefs.h
+ ;;
+   *)  cat >>confdefs.h <<_ACEOF
+#define restrict $ac_cv_c_restrict
+_ACEOF
+ ;;
+ esac
+
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for printf format archetype" >&5
 $as_echo_n "checking for printf format archetype... " >&6; }
 if ${pgac_cv_printf_archetype+:} false; then :
diff --git a/configure.in b/configure.in
index 4548db0dd3c..ab990d69f4c 100644
--- a/configure.in
+++ b/configure.in
@@ -1299,6 +1299,7 @@ fi
 m4_defun([AC_PROG_CC_STDC], []) dnl We don't want that.
 AC_C_BIGENDIAN
 AC_C_INLINE
+AC_C_RESTRICT
 PGAC_PRINTF_ARCHETYPE
 AC_C_FLEXIBLE_ARRAY_MEMBER
 PGAC_C_SIGNED
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index b0298cca19c..80ee37dd622 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -923,6 +923,20 @@
if such a type exists, and if the system does not define it. */
 #undef intptr_t
 
+/* Define to the equivalent of the C99 'restrict' keyword, or to
+   nothing if this is not supported.  Do not define if restrict is
+   supported directly.  */
+#undef restrict
+/* Work around a bug in Sun C++: it does not support _Restrict or
+   __restrict__, even though the corresponding Sun C compiler ends up with
+   "#define restrict _Restrict" or "#define restrict __restrict__" in the
+   previous line.  Perhaps some future version of Sun C++ will work with
+   restrict; if so, hopefully it defines __RESTRICT like Sun C does.  */
+#if defined __SUNPRO_CC && !defined __RESTRICT
+# define _Restrict
+# define __restrict__
+#endif
+
 /* Define to empty if the C compiler does not understand signed types. */
 #undef signed
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index b76aad02676..b96e93328fe 100644
-

Re: [HACKERS] Is it time to kill support for very old servers?

2017-10-10 Thread Andres Freund
Hi,

On 2017-10-11 10:40:11 +0900, Michael Paquier wrote:
> >> +   if (conn->forced_protocol_version != NULL)
> >> +   {
> >> +   conn->pversion = atoi(conn->forced_protocol_version);
> >> +   }
> >> This should check for strlen > 0 as well.
> >
> > Why? Note that we don't do that elsewhere in fe-connect.c.
> 
> Because it seems to me that the default value of the parameter should
> be an empty string instead of D. Feels more consistent with the
> others.

I'm not following. The "D" is in the 'dispchar' field, not the value
field, no? The default value is NULL?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is it time to kill support for very old servers?

2017-10-10 Thread Andres Freund
Hi,

On 2017-10-11 10:09:34 +0900, Michael Paquier wrote:
> On Wed, Oct 11, 2017 at 9:39 AM, Andres Freund <and...@anarazel.de> wrote:
> > On 2017-09-20 01:32:36 -0700, Andres Freund wrote:
> >> Coverage of the relevant files is a good bit higher afterwards. Although
> >> our libpq coverage is generally pretty damn awful.
> >
> > Any opinions on this? Obviously this needs some cleanup, but I'd like to
> > know whether we've consensus on adding a connection option for this goal
> > before investing more time into this.
> >
> A nearby thread [1] whacks around some of the v2 code, which triggered me
> to look into this. I obviously can just use these patches to test those
> patches during development, but it seems better to keep coverage.
> 
> FWIW, I think that moving forward with such a possibility is a good
> thing, including having a connection parameter. This would pay in the
> long term if a new protocol version is added.

> 0001 should document the new parameter.

I'm actually inclined not to, and to keep this as an undocumented debugging
option. Limiting the use of this option to people willing to read the
code seems like a good idea to me.

> 
> +   if (conn->forced_protocol_version != NULL)
> +   {
> +   conn->pversion = atoi(conn->forced_protocol_version);
> +   }
> This should check for strlen > 0 as well.

Why? Note that we don't do that elsewhere in fe-connect.c.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Is it time to kill support for very old servers?

2017-10-10 Thread Andres Freund
On 2017-09-20 01:32:36 -0700, Andres Freund wrote:
> On 2017-09-18 02:53:03 -0700, Andres Freund wrote:
> > On 2017-09-13 23:39:21 -0400, Tom Lane wrote:
> > > The real problem in this area, to my mind, is that we're not testing that
> > > code --- either end of it --- in any systematic way.  If it's broken it
> > > could take us quite a while to notice.
> >
> > Independent of the thrust of my question - why aren't we adding a
> > 'force-v2' option to libpq?  A test that basically does something like
> > postgres[22923][1]=# \setenv PGFORCEV2 1
> > postgres[22923][1]=# \c
> > You are now connected to database "postgres" as user "andres".
> > postgres[22924][1]=>
> > seems easy enough to add, in fact I tested the above.
> >
> > And the protocol coverage of the v2 protocol seems small enough that a
> > single not too large file ought to cover most of it quite easily.
> 
> Here's what I roughly was thinking of.  I don't quite like the name, and
> the way the version is specified for libpq (basically just the "raw"
> integer).   Not sure if others have an opinion on that.  I personally
> would lean towards not documenting this option...
> 
> There's a few things that I couldn't find easy ways to test:
> - the v2 specific binary protocol - I don't quite see how we could test
>   that without writing C
> - version error checks - pg_regress/psql errors out in non-interactive
>   mode if a connection fails to be established. This we could verify
>   with a simple tap test.
> 
> Coverage of the relevant files is a good bit higher afterwards. Although
> our libpq coverage is generally pretty damn awful.

Any opinions on this? Obviously this needs some cleanup, but I'd like to
know whether we've consensus on adding a connection option for this goal
before investing more time into this.

A nearby thread [1] whacks around some of the v2 code, which triggered me
to look into this. I obviously can just use these patches to test those
patches during development, but it seems better to keep coverage.

Thanks,

Andres

[1] https://postgr.es/m/20170914063418.sckdzgjfrsbek...@alap3.anarazel.de


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latches API on windows

2017-10-09 Thread Andres Freund


On October 9, 2017 6:56:10 AM PDT, Tom Lane  wrote:
>Craig Ringer  writes:
>> On 9 October 2017 at 21:26, Abbas Butt  wrote:
>>> In my case this is not true, I am calling InitSharedLatch in _PG_init
>>> which gets called at CREATE EXTENSION time.
>>> My question : Is there a way to get the latches API to work on windows
>>> the way it is working on Linux?
>
>> I suspect you'd need to do it by having your extension load via
>> shared_preload_libraries, registering its latch in shmem_startup_hook
>
>Yeah.  That would also let you request your shared memory area honestly,
>instead of relying on there being some slop in the initial allocation.

Might be dsm style memory.


I think the right approach here, regardless of the source of the memory, is to 
not actually create a new latch, but instead to store a pointer to the owning 
process's preexisting latch. Besides solving this issue, it also avoids 
problems with various routines already waiting on the proc latch.
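
A minimal sketch of that approach (struct and variable names are made up):

typedef struct MyExtSharedState
{
    Latch  *owner_latch;        /* points at a preexisting latch */
} MyExtSharedState;

/* owning backend, once it's attached to the shared area: */
state->owner_latch = MyLatch;   /* i.e. &MyProc->procLatch */

/* any other backend that wants to wake the owner: */
SetLatch(state->owner_latch);

/* the owner then just waits on / resets its own latch, as backends already do */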

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread Andres Freund
On 2017-10-08 17:11:44 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
> >> https://commitfest.postgresql.org/15/1001/
> >> The reason that's not in v10 is we haven't been able to convince
> >> ourselves whether it's 100% correct.
> 
> > Unfortunately it won't help in this specific case (no support for UNION,
> > just UNION ALL), but I thought it might be interesting to reference
> > https://medium.com/@uwdb/introducing-cosette-527898504bd6
> > here.
> 
> Huh, that is an interesting project indeed.  Although I'm not sure that
> it quite addresses the question of whether an optimization transform
> is valid.  IIUC, it could prove that a particular query having been fed
> through the transform didn't change semantics, but that offers only
> limited insight into whether some other query fed through the transform
> might change.

According to the guide it offers some support for more general
transformations:
http://cosette.cs.washington.edu/guide#24-symbolic-predicates That's
still only going to be sufficient for some of the interesting cases, but
still...

Wonder about pinging them about the OR -> UNION case; they've been
responsive to problems in some threads I found online.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] is possible cache tupledesc templates in execution plan? significant performance issue, maybe bug?

2017-10-08 Thread Andres Freund
On 2017-10-08 18:57:28 +0200, Pavel Stehule wrote:
> 2017-10-08 18:44 GMT+02:00 Andres Freund <and...@anarazel.de>:
> 
> > Hi,
> >
> > On 2017-10-08 18:36:23 +0200, Pavel Stehule wrote:
> > > 2. Lot of used tables are pretty wide - 60, 120, .. columns
> > >
> > > Now, I am doing profiling, and I see so most time is related to
> > >
> > > ExecTypeFromTLInternal(List *targetList, bool hasoid, bool skipjunk)
> >
> > Yea, that's known - I've complained about this a couple times. You could
> > try whether the following master branch helps:
> > https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/simple_statement_perf
> >
> > That's just micro-optimization though, not a more fundamental
> > solution. But for me it yields pretty nice speedups for cases with long
> > tlists.
> >
> >
> it is just this patch
> 
> HeapTuple   tup;
> Form_pg_type typTup;
> 
> +   if (typid < FirstBootstrapObjectId)
> +   break;
> +
> tup = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
> if (!HeapTupleIsValid(tup))
> elog(ERROR, "cache lookup failed for type %u", typid);

No.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] is possible cache tupledesc templates in execution plan? significant performance issue, maybe bug?

2017-10-08 Thread Andres Freund
Hi,

On 2017-10-08 18:36:23 +0200, Pavel Stehule wrote:
> 2. Lot of used tables are pretty wide - 60, 120, .. columns
> 
> Now, I am doing profiling, and I see so most time is related to
> 
> ExecTypeFromTLInternal(List *targetList, bool hasoid, bool skipjunk)

Yea, that's known - I've complained about this a couple times. You could
try whether the following master branch helps:
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/simple_statement_perf

That's just micro-optimization though, not a more fundamental
solution. But for me it yields pretty nice speedups for cases with long
tlists.


> This function is executed at exec init time - in this case pretty often.
> Although only a few columns from the table are used, the target list is
> built for all columns (maybe it is a bug)

It's probably just the physical tlist "optimization".


> 2. If it is not possible to reduce the number of fields of the target list,
> is it possible to store a tupledesc template in the plan?

We should do that, but it's not a small change.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread Andres Freund
On 2017-10-08 11:28:09 -0400, Tom Lane wrote:
> Adam Brusselback <adambrusselb...@gmail.com> writes:
> > On another note:
> >> turning ORs into UNIONs
> 
> > This is another one which would be incredibly useful for me.  I've had
> > to do this manually for performance reasons far too often.
> 
> Well, maybe you could sign up to help review the open patch for that then:
> https://commitfest.postgresql.org/15/1001/
> 
> The reason that's not in v10 is we haven't been able to convince
> ourselves whether it's 100% correct.

Unfortunately it won't help in this specific case (no support for UNION,
just UNION ALL), but I thought it might be interesting to reference
https://medium.com/@uwdb/introducing-cosette-527898504bd6
here.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Slow synchronous logical replication

2017-10-07 Thread Andres Freund
Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:
> In our sharded cluster project we are trying to use logical relication for 
> providing HA (maintaining redundant shard copies).
> Using asynchronous logical replication does not make much sense in the
> context of HA. This is why we try to use synchronous logical replication.
> Unfortunately it shows very bad performance. With 50 shards and level of 
> redundancy=1 (just one copy) cluster is 20 times slower then without logical 
> replication.
> With asynchronous replication it is "only" two times slower.
> 
> As far as I understand, the reason of such bad performance is that 
> synchronous replication mechanism was originally developed for streaming 
> replication, when all replicas have the same content and LSNs. When it is 
> used for logical replication, it behaves very inefficiently. Commit has to 
> wait for confirmations from all receivers mentioned in 
> "synchronous_standby_names" list. So we are waiting not only for our own 
> single logical replication standby, but all other standbys as well. Number of 
> synchronous standbys is equal to the number of shards divided by the number of 
> nodes. To provide uniform distribution the number of shards should be >> the 
> number of nodes, for example for 10 nodes we usually create 100 shards. As a result 
> we get awful performance and blocking of any replication channel blocks all 
> backends.
> 
> So my question is whether my understanding is correct and synchronous logical 
> replication can not be efficiently used in such manner.
> If so, the next question is how difficult it will be to make synchronous 
> replication mechanism for logical replication more efficient and are there 
> some plans to  work in this direction?

This seems to be a question that is a) about a commercial project we
don't know much about and b) hasn't received a lot of investigation.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-06 Thread Andres Freund
On 2017-10-06 22:19:54 -0400, Tom Lane wrote:
> The impression I have in a quick scan is that probably hardly any of these
> are cases that any of the DB designers think are important in
> themselves.

> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

Partially agreed. A comment to the article also mentions that some other
database performs more optimizations depending on the cost of the
plan. That's not easy to do in our current plan structure, but I think
it's quite a worthwhile concept.


> Rather, they fall out of more general optimization attempts, or not,
> depending on the optimization mechanisms in use in a particular DB.
> For example, reducing "WHERE 1=1" to "WHERE TRUE" and then to nothing
> comes out of a constant-subexpression-precalculation mechanism for us,
> whereas "WHERE column=column" doesn't fall to that approach.  ISTM it
> would be really dumb to expend planner cycles looking specifically for
> that case, so I guess that DB2 et al are finding it as a side-effect of
> some more general optimization ... I wonder what that is?
> 
> (edit: a few minutes later, I seem to remember that equivclass.c has
> to do something special with the X=X case, so maybe it could do
> something else special instead, with little new overhead.)

Yea, I think this should be inferrable from information we essentially
already compute for equivclasses.


> > (i.e. combining a IN (a,b,c) and a IN (c,d,f) into a IN (c), similar
> > with OR)
> 
> > I can't remember proposals about adding this, but it seems worthwhile to
> > consider. I think we should be able to check for this without a lot of
> > planner overhead.
> 
> It would be enormously expensive to check that in the general case with
> a bunch of entries in each IN list.  Perhaps it would be OK to add on
> the presumption that few queries would contain multiple IN's on the same
> column in the first place, though.  Not sure.

Merging of ORs should be near free if you leave duplicates in there, and
the duplicates aren't a huge concern if your alternative is to have
them, but also have two clauses to evaluate.

I think the IN AND IN case should usually end up a clear win as
well. Both lists are going to be evaluated for each row anyway - you
don't need a lot of rows where clauses are evaluated to make it worth
the O(n*log(n)) of sorting the lists.  Sorting them would be beneficial
for other reasons as well, e.g. it improves access patterns for SAO
index scans.
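
To make the IN AND IN case concrete, here's a minimal standalone sketch
(plain ints standing in for the datum lists, purely illustrative):

/* intersect two sorted IN-lists in O(n + m), after O(n log n) sorting */
static int
intersect_sorted(const int *a, int na, const int *b, int nb, int *out)
{
    int     i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else
        {
            out[n++] = a[i];    /* kept: present in both lists */
            i++;
            j++;
        }
    }
    return n;   /* a IN (1,2,3) AND a IN (3,4) collapses to a IN (3) */
}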


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-06 Thread Andres Freund
Hi,

On 2017-10-06 21:33:16 -0400, Adam Brusselback wrote:
> Hopefully it's alright for me to post this here, please let me know if not.

> I ran across an article on blog.jooq.org comparing all the major RDBMS'
> with their ability to optimize away unnecessary work with queries which are
> less than optimal, and saw some discussion on hackernews and reddit, but I
> hadn't seen any discussion here.
> 
> The article in question is here:
> https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/

That's interesting.


> Commercial databases seem to have a serious leg up in this area, and there
> are even some that MySQL has that Postgres doesn't.
> 
> I was wondering which of these are planned, which have had discussion
> before and decided not to support, and which just haven't been thought of?

Hm, going through the ones we don't [fully] support:

3. JOIN Elimination

There's been a lot of discussion and several patches. There's a bunch of
problems here, one being that there's cases (during trigger firing,
before the constraint checks) where foreign keys don't hold true, so we
can't quite generally make these optimizations.  Another concern is
whether the added plan time is actually worthwhile.


4. Removing “Silly” Predicates

(i.e. stuff like column = column)

This has been deemed not to be a valuable use of plan time. I'm doubtful
about that argument, but that IIRC was the consensus the last time this
came up.


6. Predicate Merging

(i.e. combining a IN (a,b,c) and a IN (c,d,f) into a IN (c), similar
with OR)

I can't remember proposals about adding this, but it seems worthwhile to
consider. I think we should be able to check for this without a lot of
planner overhead.


7. Provably Empty Sets
8. CHECK Constraints

I think some of this should actually work with constraint exclusion
turned on.


9. Unneeded Self JOIN

Can't remember discussions of this.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] valgrind complains about WaitEventSetWaitBlock on HEAD (fe9ba28e)

2017-10-05 Thread Andres Freund
On 2017-10-06 11:20:05 +0900, Michael Paquier wrote:
> Hi all,
> 
> While running valgrind on latest HEAD (suppression list included), I
> am seeing complaints with epoll_pwait() on Linux:
> ==12692== Syscall param epoll_pwait(sigmask) points to unaddressable byte(s)
> ==12692==at 0x62F72D0: epoll_pwait (in /usr/lib/libc-2.26.so)
> ==12692==by 0x5D819C: WaitEventSetWaitBlock (latch.c:1048)
> ==12692==by 0x5D8060: WaitEventSetWait (latch.c:1000)
> ==12692==by 0x5D774D: WaitLatchOrSocket (latch.c:385)
> ==12692==by 0x5D7619: WaitLatch (latch.c:339)
> ==12692==by 0x582E57: ApplyLauncherMain (launcher.c:976)
> ==12692==by 0x550BA7: StartBackgroundWorker (bgworker.c:841)
> ==12692==by 0x5658BC: do_start_bgworker (postmaster.c:5693)
> ==12692==by 0x565C37: maybe_start_bgworkers (postmaster.c:5897)
> ==12692==by 0x5620FC: reaper (postmaster.c:2887)
> ==12692==by 0x4E4AD9F: ??? (in /usr/lib/libpthread-2.26.so)
> ==12692==by 0x62EEAA6: select (in /usr/lib/libc-2.26.so)
> ==12692==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> 
> I have not looked at that very closely, but this does not look normal,
> hence this post.

I think this might be more a valgrind bug than anything. Note the
"sigmask points to to unaddressable byte" and "Address 0x0 is not
stack'd, malloc'd or (recently) free'd" bits. It's valid to pass a NULL
sigmask argument.
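
For reference, per the epoll_pwait(2) man page these two calls are
equivalent:

ready = epoll_pwait(epfd, events, maxevents, timeout, NULL);
ready = epoll_wait(epfd, events, maxevents, timeout);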

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fork()-safety, thread-safety

2017-10-05 Thread Andres Freund


On October 5, 2017 5:15:41 PM PDT, Tom Lane <t...@sss.pgh.pa.us> wrote:
>Andres Freund <and...@anarazel.de> writes:
>> On 2017-10-06 07:59:40 +0800, Craig Ringer wrote:
>>> The only thing that gets me excited about a threaded postgres is the
>>> ability to have a PL/Java, PL/Mono etc that don't suck. We could do
>>> some really cool things that just aren't practical right now.
>
>> Faster parallelism with a lot less reinventing the wheel. Easier backend
>> / session separation. Shared caches.
>
>What you guys are talking about here is a threaded backend, which is a
>whole different matter from replacing the client-side threading that Nico
>was looking at.  That would surely offer far higher rewards, but the costs
>to get there are likewise orders of magnitude greater.

No disagreement there. Don't really see much need for it client side though.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fork()-safety, thread-safety

2017-10-05 Thread Andres Freund
On 2017-10-06 07:59:40 +0800, Craig Ringer wrote:
> The only thing that gets me excited about a threaded postgres is the
> ability to have a PL/Java, PL/Mono etc that don't suck. We could do
> some really cool things that just aren't practical right now.

Faster parallelism with a lot less reinventing the wheel. Easier backend
/ session separation. Shared caches.




Re: [HACKERS] fork()-safety, thread-safety

2017-10-05 Thread Andres Freund
On 2017-10-05 18:49:22 -0400, Tom Lane wrote:
> (There's certainly an argument to be made that no-one cares about
> platforms without thread support anymore.  But I'm unconvinced that
> rewriting existing code that works fine is the most productive
> way to exploit such a choice if we were to make it.)

Yea, that's pretty much what I'm thinking too.

- Andres




Re: [HACKERS] fork()-safety, thread-safety

2017-10-05 Thread Andres Freund
Hi,

On 2017-10-05 17:31:07 -0500, Nico Williams wrote:
> > >vfork() is widely demonized, but it's actually quite superior
> > >(performance-wise) to fork() when all you want to do is exec-or-exit
> > >since no page copying (COW or otherwise) needs be done when using
> > >vfork().
> > 
> > Not on linux, at least not as of a year or two back.
> 
> glibc has it.  Other Linux C libraries might also; I've not checked them
> all.

It has it, but it's not more efficient.


> > I do think it'd be good to move more towards threads, but not at all for
> > the reasons mentioned here.
> 
> You don't think eliminating a large difference between handling of WIN32
> vs. POSIX is a good reason?

It seems like you'd not really get a much reduced set of differences,
just a *different* set of differences, after investing the time.

Greetings,

Andres Freund




Re: [HACKERS] fork()-safety, thread-safety

2017-10-05 Thread Andres Freund
Hi,

On 2017-10-05 17:02:22 -0500, Nico Williams wrote:
>A quick look at the functions called on the child side of fork()
>makes me think that it's unlikely that the children here use
>async-signal-safe functions only.

That's not a requirement unless you're using fork *and* threads. At
least by my last reading of posix and common practice.



>  - fork() is used in a number of places where execl() or execv() are
>called immediately after (and exit() if the exec fails).
> 
>It would be better to use vfork() where available and _exit() instead
>of exit().

vfork is less portable, and doesn't really win us anything on common
platforms. On most it's pretty much the same implementation.
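
For context, a hedged sketch of the exec-or-exit pattern being discussed
(not a quote of any actual postgres call site):

    #include <sys/types.h>
    #include <unistd.h>

    static pid_t
    spawn(const char *path, char *const argv[])
    {
        pid_t pid = fork();     /* or vfork(), where that's a real win */

        if (pid == 0)
        {
            execv(path, argv);
            /* _exit(), not exit(): don't flush the stdio buffers the
             * child inherited from the parent a second time */
            _exit(127);
        }
        return pid;             /* -1 on failure, child's pid otherwise */
    }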


>vfork() is widely demonized, but it's actually quite superior
>(performance-wise) to fork() when all you want to do is exec-or-exit
>since no page copying (COW or otherwise) needs be done when using
>vfork().

Not on linux, at least not as of a year or two back.


I do think it'd be good to move more towards threads, but not at all for
the reasons mentioned here.

Greetings,

Andres Freund




Re: [HACKERS] Binary search in fmgr_isbuiltin() is a bottleneck.

2017-10-05 Thread Andres Freund
Hi,

On 2017-10-05 17:08:39 +0300, Heikki Linnakangas wrote:
> > I pushed a further cleaned up version of these two patches.  If you see
> > a way to avoid initializing the "trailing" part of the
> > fmgr_builtin_oid_index in a different manner, I'm all ears ;)
> 
> You could put a dummy entry at fmgr_builtins[0].

Right, I'd considered that somewhere upthread. Didn't really seem
better.


> BTW, there's some alignment padding in FmgrBuiltin, when MAXIMUM_ALIGNOF==8.
> You could easily shrink the struct from 32 to 24 bytes by moving funcName to
> the end of the struct:
> 
> --- a/src/include/utils/fmgrtab.h
> +++ b/src/include/utils/fmgrtab.h
> @@ -25,11 +25,11 @@
>  typedef struct
>  {
>      Oid         foid;           /* OID of the function */
> -    const char *funcName;       /* C name of the function */
>      short       nargs;          /* 0..FUNC_MAX_ARGS, or -1 if variable count */
>      bool        strict;         /* T if function is "strict" */
>      bool        retset;         /* T if function returns a set */
>      PGFunction  func;           /* pointer to compiled function */
> +    const char *funcName;       /* C name of the function */
>  } FmgrBuiltin;
> 
>  extern const FmgrBuiltin fmgr_builtins[];

Yea, that's probably worthwhile, although I suspect it's not a huge save
overall. Do you just want to commit that?
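
To see the effect Heikki describes, a standalone sketch (with stand-in
typedefs, not the real fmgr definitions):

    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned int Oid;               /* stand-in */
    typedef void (*PGFunction) (void);      /* stand-in */

    typedef struct
    {                   /* 4 + pad(4) + 8 + 2+1+1 + pad(4) + 8 = 32 */
        Oid         foid;
        const char *funcName;
        short       nargs;
        bool        strict;
        bool        retset;
        PGFunction  func;
    } FmgrBuiltinBefore;

    typedef struct
    {                   /* 4 + 2+1+1 + 8 + 8 = 24, no padding needed */
        Oid         foid;
        short       nargs;
        bool        strict;
        bool        retset;
        PGFunction  func;
        const char *funcName;
    } FmgrBuiltinAfter;

    int
    main(void)
    {
        /* prints "32 24" on typical LP64 systems, MAXIMUM_ALIGNOF == 8 */
        printf("%zu %zu\n", sizeof(FmgrBuiltinBefore),
               sizeof(FmgrBuiltinAfter));
        return 0;
    }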


> If we care about cache efficiency here, we could move funcName out of the
> fmgr_builtins array, to a separate array of the same size. I believe
> funcName is only used when you create an internal-language function with
> CREATE FUNCTION, and having it in a separate array shouldn't hurt those
> lookups.

When'd that be beneficial? fmgr_builtins' funcName is pretty much only
used for internal-language CREATE FUNCTIONs? In other cases oid bounds +
mapping array should filter out the access before fmgr_builtins is
accessed.

Greetings,

Andres Freund




Re: [HACKERS] JIT compiling - v4.0

2017-10-05 Thread Andres Freund
On 2017-10-05 23:43:37 +1300, David Rowley wrote:
> On 5 October 2017 at 19:57, Andres Freund <and...@anarazel.de> wrote:
> > Here's some numbers for a TPC-H scale 5 run. Obviously the Q01 numbers
> > are pretty nice in particular. But it's also visible that the shorter
> > queries can lose, which is largely due to the JIT overhead - that can be
> > ameliorated to some degree, but JITing obviously isn't always going to
> > be a win.
> 
> It's pretty exciting to see thing being worked on.
> 
> I've not looked at the code, but I'm thinking, could you not just JIT
> if the total cost of the plan is estimated to be > X ? Where X is some
> JIT threshold GUC.

Right, that's the plan. But it seems fairly important to make the
envelope in which it is beneficial as broad as possible. Also, test
coverage is more interesting for me right now ;)

Greetings,

Andres Freund




Re: [HACKERS] JIT compiling - v4.0

2017-10-05 Thread Andres Freund
On 2017-10-04 11:56:47 +0300, Ants Aasma wrote:
> On Wed, Oct 4, 2017 at 9:48 AM, Andres Freund <and...@anarazel.de> wrote:
> > Here's an updated version of the patchset.  There's some substantial
> > changes here, but it's still very obviously very far from committable as
> > a whole. There's some helper commits that are simple and independent
> > enough to be committable earlier on.
>
> Looks pretty impressive already.

Thanks!


> I wanted to take it for a spin, but got errors about the following
> symbols being missing:
>
> LLVMOrcUnregisterPerf
> LLVMOrcRegisterGDB
> LLVMOrcRegisterPerf
> LLVMOrcGetSymbolAddressIn
> LLVMLinkModules2Needed
>
> As far as I can tell these are not in mainline LLVM. Is there a branch
> or patchset of LLVM available somewhere that I need to use this?

Oops, I'd forgotten about the modifications. Sorry. I've attached them
here.  The GDB and Perf stuff should now be an optional dependency,
too.  The required changes are fairly small, so they hopefully shouldn't
be too hard to upstream.

Please check the git tree for a rebased version of the pg patches, with
a bunch of bugfixes (oops, some last-minute "cleanups") and performance
fixes.

Here's some numbers for a TPC-H scale 5 run. Obviously the Q01 numbers
are pretty nice in particular. But it's also visible that the shorter
queries can lose, which is largely due to the JIT overhead - that can be
ameliorated to some degree, but JITing obviously isn't always going to
be a win.

It's pretty impressive that in q01, even after all of this, expression
evaluation *still* is 35% of the total time (25% in the aggregate
transition function). That's partially just because the query does
primarily aggregation, but also because the generated code can stand a
good chunk of improvements.

master q01 min: 14146.498 dev min: 11479.05 [diff -23.24] dev-jit min: 8659.961 [diff -63.36] dev-jit-deform min: 7279.395 [diff -94.34] dev-jit-deform-inline min: 6997.956 [diff -102.15]
master q02 min: 1234.229 dev min: 1208.102 [diff -2.16] dev-jit min: 1292.983 [diff +4.54] dev-jit-deform min: 1580.505 [diff +21.91] dev-jit-deform-inline min: 1809.046 [diff +31.77]
master q03 min: 6220.814 dev min: 5424.107 [diff -14.69] dev-jit min: 5175.125 [diff -20.21] dev-jit-deform min: 4257.368 [diff -46.12] dev-jit-deform-inline min: 4218.115 [diff -47.48]
master q04 min: 947.476 dev min: 970.608 [diff +2.38] dev-jit min: 969.944 [diff +2.32] dev-jit-deform min: 999.006 [diff +5.16] dev-jit-deform-inline min: 1033.78 [diff +8.35]
master q05 min: 4729.9 dev min: 4059.665 [diff -16.51] dev-jit min: 4182.941 [diff -13.08] dev-jit-deform min: 4147.493 [diff -14.04] dev-jit-deform-inline min: 4284.473 [diff -10.40]
master q06 min: 1603.708 dev min: 1592.107 [diff -0.73] dev-jit min: 1556.216 [diff -3.05] dev-jit-deform min: 1516.078 [diff -5.78] dev-jit-deform-inline min: 1579.839 [diff -1.51]
master q07 min: 4549.738 dev min: 4331.565 [diff -5.04] dev-jit min: 4475.654 [diff -1.66] dev-jit-deform min: 4645.773 [diff +2.07] dev-jit-deform-inline min: 4885.781 [diff +6.88]
master q08 min: 1394.428 dev min: 1350.363 [diff -3.26] dev-jit min: 1434.366 [diff +2.78] dev-jit-deform min: 1716.65 [diff +18.77] dev-jit-deform-inline min: 1938.152 [diff +28.05]
master q09 min: 5958.198 dev min: 5700.329 [diff -4.52] dev-jit min: 5491.683 [diff -8.49] dev-jit-deform min: 5582.431 [diff -6.73] dev-jit-deform-inline min: 5797.475 [diff -2.77]
master q10 min: 5228.69 dev min: 4475.154 [diff -16.84] dev-jit min: 4269.365 [diff -22.47] dev-jit-deform min: 3767.888 [diff -38.77] dev-jit-deform-inline min: 3962.084 [diff -31.97]
master q11 min: 281.201 dev min: 280.132 [diff -0.38] dev-jit min: 351.85 [diff +20.08] dev-jit-deform min: 455.885 [diff +38.32] dev-jit-deform-inline min: 532.093 [diff +47.15]
master q12 min: 4289.268 dev min: 4082.359 [diff -5.07] dev-jit min: 4007.199 [diff -7.04] dev-jit-deform min: 3752.396 [diff -14.31] dev-jit-deform-inline min: 3916.653 [diff -9.51]
master q13 min: 7110.545 dev min: 6898.576 [diff -3.07] dev-jit min: 6579.554 [diff -8.07] dev-jit-deform min: 6304.15 [diff -12.79] dev-jit-deform-inline min: 6135.952 [diff -15.88]
master q14 min: 678.024 dev min: 650.943 [diff -4.16] dev-jit min: 682.387 [diff +0.64] dev-jit-deform min: 746.354 [diff +9.16] dev-jit-deform-inline min: 878.437 [diff +22.81]
master q15 min: 1641.897 dev min: 1650.57 [diff +0.53] dev-jit min: 1661.591 [diff +1.19] dev-jit-deform min: 1821.02 [diff +9.84] dev-jit-deform-inline min: 1863.304 [diff +11.88]
master q16 min: 1890.246 dev min: 1819.423 [diff -3.89] dev-jit min: 1838.079 [diff -2.84] dev-jit-deform min:

[HACKERS] Re: [BUGS] Re: [PATCH] BUG #13416: Postgres >= 9.3 doesn't use optimized shared memory on Solaris anymore

2017-10-04 Thread Andres Freund
Hi,

On 2017-10-04 10:47:06 -0700, Sean Chittenden wrote:
> Hello.  We identified the same problem.  Sam Gwydir and Josh Clulow were able 
> to put together the appropriate fix after.
> 
> The breakage in src/backend/port/sysv_shmem.c and 
> src/include/storage/dsm_impl.h should be back ported to all supported 
> versions (the regression was introduced between the 9.2 and 9.3 branches).

Personally I don't think "breakage" is quite the right word.

I also don't like that we're unconditionally not using
USE_ANONYMOUS_SHMEM - doesn't that run into similar config limits on
Solaris-based stuff as it does on Linux etc?

I think if we want to do this, we should rather go with a patch like
https://www.postgresql.org/message-id/20140422121921.gd4...@awork2.anarazel.de

Greetings,

Andres Freund




Re: [HACKERS] Binary search in fmgr_isbuiltin() is a bottleneck.

2017-10-04 Thread Andres Freund
On 2017-10-02 15:01:36 -0700, Andres Freund wrote:
> On 2017-10-02 17:57:51 -0400, Tom Lane wrote:
> > Andres Freund <and...@anarazel.de> writes:
> > > Done that way. It's a bit annoying, because we've to take care to
> > > initialize the "unused" part of the array with a value signalling it's
> > > an unused mapping. Can't use 0 for that because fmgr_builtins[0] is a
> > > valid entry.
> > 
> > The prototype code I posted further upthread just used -1 as the "unused"
> > marker. There's no reason the array can't be int16 rather than uint16,
> > and "if (index < 0)" is probably a faster test anyway.
> 
> Right, but whether we use -1 or UINT16_MAX or such doesn't matter. The
> relevant bit is that we can't use 0, so we can't rely on the rest of the
> array being zero initialized, but instead of to initialize all of it
> explicitly.  I've no real feelings about using -1 or UINT16_MAX - I'd be
> very surprised if there's any sort of meaningful performance difference.

I pushed a further cleaned up version of these two patches.  If you see
a way to avoid initializing the "trailing" part of the
fmgr_builtin_oid_index in a different manner, I'm all ears ;)
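
For readers following along, a rough sketch of the mapping scheme under
discussion (names and bounds invented for illustration; the generated
fmgrtab.c is the authoritative version):

    #include <stddef.h>
    #include <stdint.h>

    typedef unsigned int Oid;

    typedef struct          /* stand-in for the real FmgrBuiltin */
    {
        Oid         foid;
        const char *funcName;
    } FmgrBuiltin;

    extern const FmgrBuiltin fmgr_builtins[];

    #define MAX_BUILTIN_OID 10000   /* invented bound */

    /* dense oid -> fmgr_builtins index; slots without a builtin hold -1.
     * 0 can't serve as the "unused" marker because fmgr_builtins[0] is a
     * valid entry, which is why the whole array has to be explicitly
     * initialized to -1 (done elsewhere, at build/startup time). */
    static int16_t fmgr_builtin_oid_index[MAX_BUILTIN_OID + 1];

    static const FmgrBuiltin *
    fmgr_lookup_builtin(Oid foid)
    {
        int16_t     idx;

        if (foid > MAX_BUILTIN_OID)
            return NULL;            /* bounds check filters most misses */
        idx = fmgr_builtin_oid_index[foid];
        if (idx < 0)
            return NULL;            /* sentinel: no builtin with this oid */
        return &fmgr_builtins[idx];
    }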

Greetings,

Andres Freund




Re: [HACKERS] JIT compiling - v4.0

2017-10-04 Thread Andres Freund
Hi,

Here's an updated version of the patchset.  There's some substantial
changes here, but it's still very obviously very far from committable as
a whole. There's some helper commits that are simple and independent
enough to be committable earlier on.

The git tree of this work, which is *frequently* rebased, is at:
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/jit

The biggest changes are:

- The JIT "infrastructure" is less bad than before, and starting to
  shape up.
- The tuple deforming logic is considerably faster than before due to
  various optimizations. The optimizations are:
  - build deforming exactly to the required natts for the specific caller
  - avoid checking the tuple's natts for attributes that have
    "following" NOT NULL columns (see the sketch after this list).
  - a bunch of minor codegen improvements.
- The tuple deforming codegen also got simpler by relying on LLVM to
  promote a stack variable to a register, instead of working with a
  register manually - the need to keep IR in SSA form makes doing so
  manually rather painful.
- WIP patch to do execGrouping.c TupleHashTableMatch() via JIT. That
  makes the column comparison faster, but more importantly it JITs the
  deforming (one side at least always is a MinimalTuple).
- All tests pass with JITed expression, tuple deforming, agg transition
  value computation and execGrouping logic. There were a number of bugs,
  who would have imagined that.
- some more experimental changes later in the series to address some
  bottlenecks.
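
On the second deforming bullet above: since an absent trailing attribute
reads back as NULL, a column with a NOT NULL constraint can never be
absent, so any tuple must physically contain it and all columns before
it. A schematic sketch of my reading of that (stand-ins, not the actual
codegen):

    static void deform_one_column(int attno);   /* hypothetical helper */

    static void
    deform_columns(int target_natts, int tup_natts, int last_notnull_attno)
    {
        /* guards against tup_natts are only needed for columns past the
         * last NOT NULL one: anything before it must be present */
        int     guard_from = last_notnull_attno + 1;
        int     attno;

        for (attno = 0; attno < target_natts; attno++)
        {
            if (attno >= guard_from && attno >= tup_natts)
                break;          /* remaining attributes are absent -> NULL */
            deform_one_column(attno);
        }
    }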

Functionally this covers all of what I think is a sensible goal for
v11. There's a lot of details to figure out, and the inlining
*implementation* isn't what I think we should do.  I'll follow up, not
tonight though, with an email outlining the first few design decisions
we're going to have to finalize, which'll be around the memory/lifetime
management of functions, and other infrastructure pieces (currently
patch 0006).

As the patchset is pretty large already, and not going to get any
smaller, I'll make smaller adjustments solely via the git tree, rather
than full reposts.

Greetings,

Andres Freund


0001-Rely-on-executor-utils-to-build-targetlist-for-DM.v4.patch.gz
Description: application/patch-gzip


0002-WIP-Allow-tupleslots-to-have-a-fixed-tupledesc-us.v4.patch.gz
Description: application/patch-gzip


0003-Perform-slot-validity-checks-in-a-separate-pass-o.v4.patch.gz
Description: application/patch-gzip


0004-Pass-through-PlanState-parent-to-expression-insta.v4.patch.gz
Description: application/patch-gzip


0005-Add-configure-infrastructure-to-enable-LLVM.v4.patch.gz
Description: application/patch-gzip


0006-Beginning-of-a-LLVM-JIT-infrastructure.v4.patch.gz
Description: application/patch-gzip


0007-JIT-compile-expressions.v4.patch.gz
Description: application/patch-gzip


0008-Centralize-slot-deforming-logic-a-bit.v4.patch.gz
Description: application/patch-gzip


0009-WIP-Make-scan-desc-available-for-all-PlanStates.v4.patch.gz
Description: application/patch-gzip


0010-JITed-tuple-deforming.v4.patch.gz
Description: application/patch-gzip


0011-Simplify-aggregate-code-a-bit.v4.patch.gz
Description: application/patch-gzip


0012-More-efficient-AggState-pertrans-iteration.v4.patch.gz
Description: application/patch-gzip


0013-Avoid-dereferencing-tts_values-nulls-repeatedly-i.v4.patch.gz
Description: application/patch-gzip


0014-WIP-Expression-based-agg-transition.v4.patch.gz
Description: application/patch-gzip


0015-Hacky-Preliminary-inlining-implementation.v4.patch.gz
Description: application/patch-gzip


0016-WIP-Inline-ExecScan-mostly-to-make-profiles-easie.v4.patch.gz
Description: application/patch-gzip


0017-WIP-Do-execGrouping.c-via-expression-eval-machine.v4.patch.gz
Description: application/patch-gzip


0018-WIP-deduplicate-int-float-overflow-handling-code.v4.patch.gz
Description: application/patch-gzip


0019-Make-timestamp_cmp_internal-an-inline-function.v4.patch.gz
Description: application/patch-gzip


0020-Make-hot-path-of-pg_detoast_datum-an-inline-funct.v4.patch.gz
Description: application/patch-gzip


0021-WIP-Inline-additional-function.v4.patch.gz
Description: application/patch-gzip


0022-WIP-Faster-order.v4.patch.gz
Description: application/patch-gzip



Re: [HACKERS] [BUGS] BUG #14825: enum type: unsafe use?

2017-10-03 Thread Andres Freund
On 2017-10-03 19:53:41 -0400, Andrew Dunstan wrote:
> On 09/27/2017 02:52 PM, Tom Lane wrote:
> > Andrew Dunstan <andrew.duns...@2ndquadrant.com> writes:
> >> At this stage on reflection I agree it should be pulled :-(
> > That seems to be the consensus, so I'll go make it happen.
> >
> >> I'm not happy about the idea of marking an input function as not
> >> parallel safe, certainly not without a good deal of thought and
> >> discussion that we don't have time for this cycle.
> > I think the way forward is to do what we had as of HEAD (984c92074),
> > but add the ability to transmit the blacklist table to parallel
> > workers.  Since we expect the blacklist table would be empty most of
> > the time, this should be close to no overhead in practice.  I concur
> > that the idea of marking the relevant functions parallel-restricted is
> > probably not as safe a fix as I originally thought, and it's not a
> > very desirable restriction even if it did fix the problem.

> Do you have any suggestion as to how we should transmit the blacklist to
> parallel workers?

How about storing them in a dshash table instead of dynahash, similar to
how we're now dealing with the shared typmod registry stuff? It should
be fairly simple to add a new struct Session member like
shared_enum_whatevs_table.
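
A rough sketch of what that could look like (names invented; the shared
record typmod registry code is the better reference for the details):

    #include "postgres.h"
    #include "lib/dshash.h"

    /* hypothetical shared blacklist entry, keyed by the enum value's OID */
    typedef struct SharedEnumBlacklistEntry
    {
        Oid         enum_value_oid;
    } SharedEnumBlacklistEntry;

    static const dshash_parameters blacklist_params = {
        sizeof(Oid),
        sizeof(SharedEnumBlacklistEntry),
        dshash_memcmp,
        dshash_memhash,
        LWTRANCHE_SESSION_TYPMOD_TABLE  /* placeholder tranche id */
    };

    /* record an enum value as blacklisted in the shared table */
    static void
    blacklist_enum_value(dshash_table *table, Oid enum_oid)
    {
        bool        found;
        SharedEnumBlacklistEntry *entry;

        entry = dshash_find_or_insert(table, &enum_oid, &found);
        if (!found)
            entry->enum_value_oid = enum_oid;
        dshash_release_lock(table, entry);
    }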

Greetings,

Andres Freund




Re: [HACKERS] datetime.h defines like PM conflict with external libraries

2017-10-03 Thread Andres Freund
Hi,

On 2017-10-03 16:34:38 -0400, Andrew Dunstan wrote:
> On 10/03/2017 03:00 PM, Andres Freund wrote:
> > Hi,
> >
> > In my llvm jit work I had to
> >
> > #undef PM
> > /* include some llvm headers */
> > #define PM 1
> >
> > because llvm has a number of functions which have an argument named PM.
> > Now that works, but it's fairly ugly. Perhaps it would be a good idea to
> > name these defines in a manner that's slightly less likely to conflict?
> >
> > Alternatively we could use #pragma push_macro() around the includes, but
> > that'd be a new dependency.
> >
> > Better ideas?

> AFAICT at a quick glance these are only used in a couple of files. Maybe
> the defs need to be floated off to a different header with more limited
> inclusion?

Why not just rename them to PG_PM etc? If we force potential external
users to do some changes, we can use more unique names just as well -
the effort to adapt won't be meaningfully higher... IMNSHO there's not
much excuse for defining macros like PM globally.
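
For concreteness, the two workarounds quoted above, side by side (llvm
header name picked arbitrarily):

    /* variant 1: manual save/restore of the conflicting macro */
    #undef PM
    #include <llvm-c/Core.h>
    #define PM 1                /* datetime.h's value */

    /* variant 2: push_macro/pop_macro, supported by gcc, clang and
     * MSVC, but a new compiler-support dependency */
    #pragma push_macro("PM")
    #undef PM
    #include <llvm-c/Core.h>
    #pragma pop_macro("PM")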

- Andres




  1   2   3   4   5   6   7   8   9   10   >