Re: [HACKERS] Pluggable storage

2017-08-11 Thread Haribabu Kommi
On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila  wrote:

> On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi
>  wrote:
> >
> >
> > On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> >>
> >> I have sent the partial patch I have to Hari Babu Kommi.  We expect that
> >> he will be able to further this goal some more.
> >
> >
> > Thanks Alvaro for sharing your development patch.
> >
> > Most of the patch design is the same as described by Alvaro in the
> > first mail [1]. I will detail the modifications, pending items and open
> > items (needing discussion) required to implement proper pluggable
> > storage.
> >
> > Here I attached WIP patches to support pluggable storage. The patches
> > in the series may not work individually; many things are still under
> > development. These patches are just to share the approach of the
> > current development.
> >
>
> +typedef struct StorageAmRoutine
> +{
>
> In this structure, you have already covered most of the APIs that a
> new storage module needs to provide, but I think there could be more.
> One such API could be heap_hot_search.  This seems specific to the
> current heap where we have the provision of HOT.  I think we can provide
> a new API tuple_search or something like that.


Thanks for the review.

Yes, the StorageAmRoutine needs more function pointers. Currently I am
adding all the functions that are present in heapam.h and some
slot-related functions from tuptable.h. Once I stabilize the code and the
APIs that are currently added, I will further enhance it with the
remaining functions that are necessary to support the pluggable storage
API.
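
For concreteness, a minimal sketch of the kind of function-pointer table
under discussion might look like the following. This is illustrative only
(member names and signatures are guesses, not the actual patch's), and it
assumes the usual PostgreSQL headers for Relation, Snapshot,
TupleTableSlot and friends:

/* Illustrative sketch only -- not the actual patch. */
typedef struct StorageAmRoutine
{
    /* scan support: return the next matching tuple in a slot */
    bool        (*scan_getnext_slot) (void *scandesc, ScanDirection dir,
                                      TupleTableSlot *slot);

    /* generalized replacement for heap_hot_search: fetch by TID */
    bool        (*tuple_search) (Relation relation, ItemPointer tid,
                                 Snapshot snapshot, TupleTableSlot *slot);

    /* DML entry points taking slots rather than HeapTuples */
    void        (*tuple_insert) (Relation relation, TupleTableSlot *slot,
                                 CommandId cid);
    void        (*tuple_delete) (Relation relation, ItemPointer tid,
                                 CommandId cid);
} StorageAmRoutine;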


Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Pluggable storage

2017-08-11 Thread Haribabu Kommi
On Mon, Aug 7, 2017 at 11:12 PM, Amit Kapila 
wrote:

> On Tue, Aug 1, 2017 at 1:56 PM, Haribabu Kommi 
> wrote:
> >
> >
> > On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila 
> > wrote:
> >>
> >
> >>
> >> > 1. Design an API that returns values/nulls array and convert that
> >> > into a HeapTuple whenever it is required in the upper layers. All
> >> > the existing heap form/deform tuples are used for every tuple with
> >> > some adjustments.
> >> >
> >>
> >> So, this would have the additional cost of form/deform.  Also, how
> >> would it have fewer changes compared to what you have described
> >> earlier?
> >
> >
> > Yes, it has the additional cost of form/deform. It is the same approach
> > that was described earlier, but I had intended to modify every place
> > where the HeapTuple is accessed. While prototyping the removal of
> > HeapTuple usage from triggers, I realized that this needs a clear design
> > of its own, instead of being combined with the pluggable storage API
> > changes.
> >
> > - The heap_getnext function is kept as-is and is used only for system
> >   table access.
> > - A heap_getnext_slot function is introduced to return a slot whenever
> >   data is found, otherwise NULL. This function is used in all the places
> >   in the executor and so on.
> >
> > - The TupleTableSlot structure is modified to contain a void * tuple
> >   instead of a HeapTuple, and it also contains the storage handler
> >   functions.
> > - heap_insert and friends can take a slot as an argument and perform the
> >   insert operation.
> >
> > In the cases where a TupleTableSlot cannot be sent, form a HeapTuple
> > from the data and send that, also noting that it is HeapTuple data
> > rather than a tuple from the storage.
> >
> ..
> >
> >
> >>
> >> > 3. Design an API that returns a StorageTuple (void *) where the
> >> > necessary format information of that tuple can be obtained from the
> >> > tupledesc. Wherever a tuple is present, there exists a tupledesc in
> >> > most of the cases. How about adding some kind of information to the
> >> > tupledesc to find out the tuple format and call the necessary
> >> > functions?
> >> >
> >
> >
> > heap_getnext returns a StorageTuple instead of a HeapTuple. The tuple
> > type information is available in the TupleDesc structure.
> >
> > heap_insert and friends accept a TupleTableSlot as input and perform the
> > insert operation. This approach is almost the same as the first
> > approach, except that the storage handler functions are stored in the
> > TupleDesc.
> >
> >
>
> Why do we need to store the handler function in TupleDesc?  As of now,
> the above patch series has it available in RelationData and
> TupleTableSlot; I am not sure that keeping it in TupleDesc instead is a
> good idea.  Which places require the TupleDesc to contain the handler?
> If those are few, can we think of passing it as a parameter?


So far I have been able to proceed without adding any storage handler
functions to the TupleDesc structure. Sure, I will try passing it as a
parameter when there is a need for it.

While working on the patch, I am facing problems in designing the storage
API with respect to the Buffer. For example, to replace
HeapTupleSatisfiesMVCC and related functions with function pointers: in
the HeapTuple format, the tuple belongs to a buffer, so the buffer is
passed to the HeapTupleSatisfies* functions along with the tuple. But in
other storage formats, a single buffer may not contain the actual data.
The buffer is used to set the hint bits and to mark the buffer dirty; if
the buffer is not available and the hint bits are not set, performance of
the following queries may suffer.

A Buffer is also returned by heap_fetch, heap_lock_tuple and related
functions so that tuple visibility can be checked, but returning a buffer
from those heap_* functions is not possible for other formats. Also, for
HeapTuple data, the tuple data is copied into a palloc'ed buffer instead
of pointing directly into the page. So, is returning a Buffer even valid
here?

Currently I am proceeding to remove the Buffer parameter from the API; if
that turns out to affect performance, we will need to find a different
approach to handling the hint bits.
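
To make the trade-off concrete, here is a sketch of a visibility callback
in which the buffer becomes an optional, storage-private detail. All of
the names here are illustrative assumptions, not code from the patch:

/* Illustrative sketch only; none of these names come from the patch. */
typedef struct ExampleVisibilityContext
{
    Buffer      buf;        /* InvalidBuffer if the AM has no backing page */
} ExampleVisibilityContext;

typedef bool (*SnapshotSatisfiesFunc) (void *tuple, Snapshot snapshot,
                                       void *context);

static bool
example_satisfies_mvcc(void *tuple, Snapshot snapshot, void *context)
{
    ExampleVisibilityContext *cxt = (ExampleVisibilityContext *) context;

    /*
     * A heap-like AM would test xmin/xmax against the snapshot here and,
     * when cxt->buf is valid, set hint bits and mark the buffer dirty.
     * An AM without a single backing buffer passes InvalidBuffer and
     * simply skips the hint-bit step.
     */
    (void) tuple;
    (void) snapshot;
    return cxt->buf != InvalidBuffer;   /* placeholder logic */
}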

comments?

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-11 Thread Noah Misch
On Thu, Aug 10, 2017 at 09:59:40PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
> >> I don't think I can usefully contribute to this.  Could someone else
> >> take it?

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

> > If nobody volunteers, you could always resolve this by reverting 1e8a850 and
> > successors.
> 
> I think you're blaming the victim.  Our current theory about the cause
> of this is that on Windows, WaitLatchOrSocket cannot be used to wait for
> completion of a nonblocking connect() call.  That seems pretty broken
> independently of whether libpqwalreceiver needs the capability.

Yes, the theorized defect lies in APIs commit 1e8a850 used, not in the commit
itself.  Nonetheless, commit 1e8a850 promoted the defect from one reachable
only by writing C code to one reachable by merely configuring replication on
Windows according to the documentation.  For that, its committer owns this
open item.  Besides the one approach I mentioned, there exist several other
fine ways to implement said ownership.

> In any case, we have a draft patch, so what we should be pressing for
> is for somebody to test it.

Now done.  (Thanks, Jobin.)




Re: [HACKERS] POC: Sharing record typmods between backends

2017-08-11 Thread Andres Freund
Hi,

On 2017-08-11 20:39:13 +1200, Thomas Munro wrote:
> Please find attached a new patch series.  I apologise in advance for
> 0001 and note that the patchset now weighs in at ~75kB compressed.
> Here are my in-line replies to your two reviews:

Replying to a few points here, then I'll do a pass through your
submission...


> On Tue, Jul 25, 2017 at 10:09 PM, Andres Freund  wrote:
> > It does concern me that we're growing yet another somewhat different
> > hashtable implementation. Yet I don't quite see how we could avoid
> > it. dynahash relies on proper pointers, simplehash doesn't do locking
> > (and shouldn't) and also relies on pointers, although to a much lesser
> > degree.  All the open coded tables aren't a good match either.  So I
> > don't quite see an alternative, but I'd love one.
> 
> Yeah, I agree.  To deal with data structures with different pointer
> types, locking policy, inlined hash/eq functions etc, perhaps there is
> a way we could eventually do 'policy based design' using the kind of
> macro trickery you started where we generate N different hash table
> variations but only have to maintain source code for one chaining hash
> table implementation?  Or perl scripts that effectively behave as a
> cfront^H^H^H nevermind.  I'm not planning to investigate that for this
> cycle.

Whaaa, what have I done...  But more seriously, I'm doubtful it's worth
going there.


> > + * level.  However, when a resize operation begins, all partition locks
> > + * must be acquired simultaneously for a brief period.  This is only
> > + * expected to happen a small number of times until a stable size is
> > + * found, since growth is geometric.
> >
> > I'm a bit doubtful that we need partitioning at this point, and that it
> > doesn't actually *degrade* performance for your typmod case.
> 
> Yeah, partitioning not needed for this case, but this is supposed to
> be more generally useful.  I thought about making the number of
> partitions a construction parameter, but it doesn't really hurt does
> it?

Well, using multiple locks and such certainly isn't free. An exclusively
owned cacheline mutex is nearly an order of magnitude faster than one
that's currently shared, not to speak of modified. It also increases the
size overhead, which might end up mattering for a few other cases.


> > + * Resizing is done incrementally so that no individual insert
> > + * operation pays for the potentially large cost of splitting all
> > + * buckets.
> >
> > I'm not sure this is a reasonable tradeoff for the use-case suggested so
> > far, it doesn't exactly make things simpler. We're not going to grow
> > much.
> 
> Yeah, designed to be more generally useful.  Are you saying you would
> prefer to see the DHT patch split into an initial submission that does
> the simplest thing possible, so that the unlucky guy who causes the
> hash table to grow has to do all the work of moving buckets to a
> bigger hash table?  Then we could move the more complicated
> incremental growth stuff to a later patch.

Well, most of the potential usecases for dsmhash I've heard about so
far, don't actually benefit much from incremental growth. In nearly all
the implementations I've seen incremental move ends up requiring more
total cycles than doing it at once, and for parallelism type usecases
the stall isn't really an issue.  So yes, I think this is something
worth considering.   If we were to actually use DHT for shared caches or
such, this'd be different, but that seems darned far off.


> This is complicated, and in the category that I would normally want a
> stack of heavy unit tests for.  If you don't feel like making
> decisions about this now, perhaps iteration (and incremental resize?)
> could be removed, leaving only the most primitive get/put hash table
> facilities -- just enough for this purpose?  Then a later patch could
> add them back, with a set of really convincing unit tests...

I'm inclined to go for that, yes.
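
A stripped-down interface along those lines might be no more than this.
This is a sketch based on the discussion; the names and signatures are
guesses for illustration, not the patch's actual API:

typedef struct dht_hash_table dht_hash_table;   /* opaque, lives in DSA */

/* Look up an entry; returns NULL if not present.  The returned entry
 * stays locked (shared or exclusive) until dht_release() is called. */
extern void *dht_find(dht_hash_table *hash_table, const void *key,
                      bool exclusive);

/* Look up or create an entry; *found says whether it already existed. */
extern void *dht_find_or_insert(dht_hash_table *hash_table,
                                const void *key, bool *found);

/* Release the lock taken by dht_find/dht_find_or_insert. */
extern void dht_release(dht_hash_table *hash_table, void *entry);

/* Delete by key; returns true if an entry was found and removed. */
extern bool dht_delete_key(dht_hash_table *hash_table, const void *key);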


> > +/*
> > + * Detach from a hash table.  This frees backend-local resources
> > + * associated with the hash table, but the hash table will continue to
> > + * exist until it is either explicitly destroyed (by a backend that is
> > + * still attached to it), or the area that backs it is returned to the
> > + * operating system.
> > + */
> > +void
> > +dht_detach(dht_hash_table *hash_table)
> > +{
> > +   /* The hash table may have been destroyed.  Just free local memory. */
> > +   pfree(hash_table);
> > +}
> >
> > Somewhat inclined to add debugging refcount - seems like bugs around
> > that might be annoying to find. Maybe also add an assert ensuring that
> > no locks are held?
> 
> Added an assert that no locks are held.
> 
> In an earlier version I had reference counts.  Then I realised that it
> wasn't really helping anything.  The state of being 'attached' to a
> dht_hash_table isn't really the same as holding a heavyweight resource
> like a DSM segment or a 

[HACKERS] pg_stat_statements query normalization, and the 'in' operator

2017-08-11 Thread unixway . drive

Hello there,

Given the following list of queries:

  create table foo (id serial, bar integer);
  select * from foo where id in (1);
  select * from foo where id in (2,3);
  select * from foo where id in (1,3,5);
  select * from foo where id in (select id from foo);

would it be possible to have the first three select queries normalized
into a single one, so that 'select query from pg_stat_statements' returns
something like:


  select * from foo where id in (...);
  select * from foo where id in (select id from foo);
  (2 rows)

instead of:

  select * from foo where id in (?,?);
  select * from foo where id in (?,?,?);
  select * from foo where id in (?);
  select * from foo where id in (select id from foo);
  (4 rows)

?




[HACKERS] additional contrib test suites

2017-08-11 Thread Peter Eisentraut
Here are some small test suites for several contrib modules and for
pg_archivecleanup, none of which had one previously, as well as one patch
to improve code coverage in a module.

Will add to commit fest.  Testing on different platforms and with
different build configurations would be useful.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 65b09d5b15b43c7eab24aae2d2e7e7a7979d57f3 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 11 Aug 2017 21:04:04 -0400
Subject: [PATCH 1/7] adminpack: Add test suite

---
 contrib/adminpack/.gitignore |   4 +
 contrib/adminpack/Makefile   |   2 +
 contrib/adminpack/expected/adminpack.out | 144 +++
 contrib/adminpack/sql/adminpack.sql  |  56 
 4 files changed, 206 insertions(+)
 create mode 100644 contrib/adminpack/.gitignore
 create mode 100644 contrib/adminpack/expected/adminpack.out
 create mode 100644 contrib/adminpack/sql/adminpack.sql

diff --git a/contrib/adminpack/.gitignore b/contrib/adminpack/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/contrib/adminpack/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/adminpack/Makefile b/contrib/adminpack/Makefile
index f065f84bfb..89c249bc0d 100644
--- a/contrib/adminpack/Makefile
+++ b/contrib/adminpack/Makefile
@@ -8,6 +8,8 @@ EXTENSION = adminpack
 DATA = adminpack--1.0.sql
 PGFILEDESC = "adminpack - support functions for pgAdmin"
 
+REGRESS = adminpack
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/adminpack/expected/adminpack.out b/contrib/adminpack/expected/adminpack.out
new file mode 100644
index 00..83cbb741da
--- /dev/null
+++ b/contrib/adminpack/expected/adminpack.out
@@ -0,0 +1,144 @@
+CREATE EXTENSION adminpack;
+-- create new file
+SELECT pg_file_write('test_file1', 'test1', false);
+ pg_file_write 
+---
+ 5
+(1 row)
+
+SELECT pg_read_file('test_file1');
+ pg_read_file 
+--
+ test1
+(1 row)
+
+-- append
+SELECT pg_file_write('test_file1', 'test1', true);
+ pg_file_write 
+---
+ 5
+(1 row)
+
+SELECT pg_read_file('test_file1');
+ pg_read_file 
+--
+ test1test1
+(1 row)
+
+-- error, already exists
+SELECT pg_file_write('test_file1', 'test1', false);
+ERROR:  file "test_file1" exists
+SELECT pg_read_file('test_file1');
+ pg_read_file 
+--
+ test1test1
+(1 row)
+
+-- disallowed file paths
+SELECT pg_file_write('../test_file0', 'test0', false);
+ERROR:  path must be in or below the current directory
+SELECT pg_file_write('/tmp/test_file0', 'test0', false);
+ERROR:  absolute path not allowed
+SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
+ pg_file_write 
+---
+ 5
+(1 row)
+
+SELECT pg_file_write(current_setting('data_directory') || '/../test_file4', 'test4', false);
+ERROR:  reference to parent directory ("..") not allowed
+-- rename file
+SELECT pg_file_rename('test_file1', 'test_file2');
+ pg_file_rename 
+
+ t
+(1 row)
+
+SELECT pg_read_file('test_file1');  -- not there
+ERROR:  could not stat file "test_file1": No such file or directory
+SELECT pg_read_file('test_file2');
+ pg_read_file 
+--
+ test1test1
+(1 row)
+
+-- error
+SELECT pg_file_rename('test_file1', 'test_file2');
+WARNING:  file "test_file1" is not accessible: No such file or directory
+ pg_file_rename 
+
+ f
+(1 row)
+
+-- rename file and archive
+SELECT pg_file_write('test_file3', 'test3', false);
+ pg_file_write 
+---
+ 5
+(1 row)
+
+SELECT pg_file_rename('test_file2', 'test_file3', 'test_file3_archive');
+ pg_file_rename 
+
+ t
+(1 row)
+
+SELECT pg_read_file('test_file2');  -- not there
+ERROR:  could not stat file "test_file2": No such file or directory
+SELECT pg_read_file('test_file3');
+ pg_read_file 
+--
+ test1test1
+(1 row)
+
+SELECT pg_read_file('test_file3_archive');
+ pg_read_file 
+--
+ test3
+(1 row)
+
+-- unlink
+SELECT pg_file_unlink('test_file1');  -- does not exist
+ pg_file_unlink 
+
+ f
+(1 row)
+
+SELECT pg_file_unlink('test_file2');  -- does not exist
+ pg_file_unlink 
+
+ f
+(1 row)
+
+SELECT pg_file_unlink('test_file3');
+ pg_file_unlink 
+
+ t
+(1 row)
+
+SELECT pg_file_unlink('test_file3_archive');
+ pg_file_unlink 
+
+ t
+(1 row)
+
+SELECT pg_file_unlink('test_file4');
+ pg_file_unlink 
+
+ t
+(1 row)
+
+-- superuser checks
+CREATE USER regress_user1;
+SET ROLE regress_user1;
+SELECT pg_file_write('test_file0', 'test0', false);
+ERROR:  only superuser may access generic file functions
+SELECT pg_file_rename('test_file0', 'test_file0');
+ERROR:  only 

Re: [HACKERS] POC: Sharing record typmods between backends

2017-08-11 Thread Andres Freund
On 2017-08-11 11:14:44 -0400, Robert Haas wrote:
> On Fri, Aug 11, 2017 at 4:39 AM, Thomas Munro
>  wrote:
> > OK.  Now it's ds_hash_table.{c,h}, where "ds" stands for "dynamic
> > shared".  Better?  If we were to do other data structures in DSA
> > memory they could follow that style: ds_red_black_tree.c, ds_vector.c,
> > ds_deque.c etc and their identifier prefix would be drbt_, dv_, dd_
> > etc.
> >
> > Do you want to see a separate patch to rename dsa.c?  Got a better
> > name?  You could have spoken up earlier :-)  It does sound like a bit
> > like the thing from crypto or perhaps a scary secret government
> > department.

I, and I bet a lot of other people, kind of missed dsa being merged for
a while...


> I doubt that we really want to have accessor functions with names like
> dynamic_shared_hash_table_insert or ds_hash_table_insert. Long names
> are fine, even desirable, for APIs that aren't too widely used,
> because they're relatively self-documenting, but a 30-character
> function name gets annoying in a hurry if you have to call it very
> often, and this is intended to be reusable for other things that want
> a dynamic shared memory hash table.  I think we should (a) pick some
> reasonably short prefix for all the function names, like dht or dsht
> or ds_hash, but not ds_hash_table or dynamic_shared_hash_table and (b)
> also use that prefix as the name for the .c and .h files.

Yea, I agree with this. Something dsmhash_{insert,...}... seems like
it'd kinda work without being too ambiguous like dht imo is, while still
being reasonably short.


> Right now, we've got a situation where the most widely-used hash table
> implementation uses dynahash.c for the code, hsearch.h for the
> interface, and "hash" as the prefix for the names, and that's really
> hard to remember.  I think having a consistent naming scheme
> throughout would be a lot better.

Yea, that situation still occasionally confuses me, a good 10 years
after starting to look at pg...  There's even a dynahash.h, except
it's useless. And dynahash.c doesn't even include hsearch.h directly
(it's included via shmem.h)!  Personally I'd actually be in favor of
moving the hsearch.h stuff into dynahash.h and leaving hsearch as a
wrapper.

- Andres




Re: [HACKERS] WIP: Failover Slots

2017-08-11 Thread Andres Freund
On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core.  In a review of this
> thread I don't see anyone saying otherwise.  The debate has really
> been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that.  Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to be
submitted/merged to core.


> - When a standby connects to a master, it can optionally supply a list
> of slot names that it cares about.
> - The master responds by periodically notifying the standby of changes
> to the slot contents using some new replication sub-protocol message.
> - The standby applies those updates to its local copies of the slots.

> So, you could create a slot on a standby with an "uplink this" flag of
> some kind, and it would then try to keep it up to date using the
> method described above.  It's not quite clear to me how to handle the
> case where the corresponding slot doesn't exist on the master, or
> initially does but then it's later dropped, or it initially doesn't
> but it's later created.


I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axis I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
  available everywhere, or not?
- What kind of PITR integration do we want? Note that simple WAL based
  slots do *NOT* provide proper PITR support, there's not enough
  interlock easily available (you'd have to save slots at the end, then
  increment minRecoveryLSN to a point later than the slot saving)
- How much divergence are we going to accept between logical decoding on
  standbys, and failover slots. I'm probably a lot closer to zero than
  Craig is.
- How much divergence are we going to accept between infrastructure for
  logical failover, and logical failover via failover slots (or however
  we're naming this)? Again, I'm probably a lot closer to zero than
  Craig is.


Greetings,

Andres Freund




Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

2017-08-11 Thread Andres Freund
Hi,

On 2017-07-21 19:32:02 +0300, Marina Polyakova wrote:
> Here is the third version of the patch for pgbench, thanks to Fabien
> Coelho's comments. As in the previous one, transactions with
> serialization and deadlock failures are rolled back and retried until
> they end successfully or their number of tries reaches the maximum.

Just had a need for this feature, and took this to a short test
drive. So some comments:
- it'd be useful to display a retry percentage of all transactions,
  similar to what's displayed for failed transactions.
- it appears that we now unconditionally do not disregard a connection
  after a serialization / deadlock failure. Good. But that's useful far
  beyond just deadlocks / serialization errors, and should probably be exposed.
- it'd be useful to also conveniently display the number of retried
  transactions, rather than the total number of retries.

Nice feature!

- Andres




Re: [HACKERS] Parallel COPY FROM execution

2017-08-11 Thread Robert Haas
On Fri, Aug 11, 2017 at 9:55 AM, Alex K  wrote:
> - I have used both Latch and ConditionVariable for the same
> purpose (wait until some signal occurs) and for me as an end user they
> behave quite similarly. I looked into the condition_variable.c
> code and it uses a Latch and a SpinLock under the hood. So what are
> the differences and the dis-/advantages
> between Latch and ConditionVariable?

A ConditionVariable lets you signal the processes that are waiting
without needing to know in advance exactly which processes those are.
If you use latches directly, you'll have to somehow keep track of
which processes need to be signaled.
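
As a rough illustration, the canonical condition-variable pattern lets
any number of waiters sleep on a shared state change without the
signaller tracking them. This is a sketch under assumptions: the shared
struct and the wait-event constant are illustrative, not prescribed:

/* Sketch only: shared state assumed protected elsewhere by a lock. */
typedef struct SharedJobState
{
    ConditionVariable cv;
    int         jobs_done;
    int         jobs_total;
} SharedJobState;

/* Waiter: sleep until all jobs are done, re-checking the condition. */
static void
wait_for_all_jobs(SharedJobState *st)
{
    ConditionVariablePrepareToSleep(&st->cv);
    while (st->jobs_done < st->jobs_total)
        ConditionVariableSleep(&st->cv, WAIT_EVENT_MQ_INTERNAL);
    ConditionVariableCancelSleep();
}

/* Worker: report completion and wake every waiter, whoever they are. */
static void
job_done(SharedJobState *st)
{
    st->jobs_done++;            /* in real code, under a lock */
    ConditionVariableBroadcast(&st->cv);
}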

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2017-08-11 Thread Robert Haas
On Thu, Aug 10, 2017 at 11:12 AM, Alexander Korotkov
 wrote:
> These results look very cool!
> I think CSN is eventually inevitable, but it's a long distance feature.
> Thus, this optimization could make us a good serve before we would have CSN.
> Do you expect there are cases when this patch causes slowdown?  What about
> case when we scan each xip array only once (not sure how to simulate that
> using pgbench)?

Just a random thought here:

The statements pgbench executes are pretty simple: they touch one row
in one table.  You wouldn't expect them to scan the xip array very
many times.  If even those statements touch the array enough for this
to win, it seems like it might be hard to construct an even worse
case.  I might be missing something, though.

We're not going to accept code like this, though:

+d = xip[i] >> 6;
+j = k = xip[i] & mask;
+while (xiphash[j] != InvalidTransactionId)
+{
+j = (k + d) & mask;
+d = d*5 + 1;
+}

Single-character variable names, hard-coded constants, and no comments!
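
For contrast, a sketch of the same probe loop written with descriptive
names and comments. This is illustrative only, not the patch's code; it
assumes a power-of-two table whose size minus one is "mask":

/* Sketch: open-addressing insert into a lazily built xid hash table. */
static void
xiphash_insert(TransactionId *xiphash, uint32 mask, TransactionId xid)
{
    /*
     * Derive an odd probe stride from the high bits of the xid so that,
     * with a power-of-two table size, every slot is eventually visited.
     */
    uint32      stride = ((uint32) xid >> 6) | 1;
    uint32      slot = (uint32) xid & mask;

    /* Advance by the stride until a free slot is found. */
    while (xiphash[slot] != InvalidTransactionId)
        slot = (slot + stride) & mask;

    xiphash[slot] = xid;
}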

I kind of doubt as a general point that we really want another
open-coded hash table -- I wonder if this could be made to use
simplehash.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] reload-through-the-top-parent switch the partition table

2017-08-11 Thread Robert Haas
On Fri, Aug 11, 2017 at 5:36 AM, Rushabh Lathia
 wrote:
> Please find attach patch with the changes.

I found the way that you had the logic structured in flagInhTables()
to be quite hard to follow, so I rewrote it in the attached version.
This version also has a bunch of minor cosmetic changes.  Barring
objections, I'll commit it once the tree opens for v11 development.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


load-via-partition-root-rmh.patch



Re: [HACKERS] Arrays of domains

2017-08-11 Thread Tom Lane
I wrote:
> Probably a better answer is to start supporting arrays over domain
> types.  That was left unimplemented in the original domains patch,
> but AFAICS not for any better reason than lack of round tuits.

Attached is a patch series that allows us to create arrays of domain
types.  I haven't tested this in great detail, so there might be some
additional corners of the system that need work, but it passes basic
sanity checks.  I believe it's independent of the other patch I have
in the commitfest for domains over composites, but I haven't tested
for interactions there either.

01-rationalize-coercion-APIs.patch cleans up the APIs of
coerce_to_domain() and some internal functions in parse_coerce.c so that
we consistently pass around a CoercionContext along with CoercionForm.
Previously, we sometimes passed an "isExplicit" boolean flag instead,
which is strictly less information; and coerce_to_domain() didn't even get
that, but instead had to reverse-engineer isExplicit from CoercionForm.
That's contrary to the documentation in primnodes.h that says that
CoercionForm only affects display and not semantics.  I don't think this
change fixes any live bugs, but it makes things more consistent.  The
main reason for doing it though is that now build_coercion_expression()
receives ccontext, which it needs in order to be able to recursively
invoke coerce_to_target_type(), as required by the next patch.

02-reimplement-ArrayCoerceExpr.patch is the core of the patch.  It changes
ArrayCoerceExpr so that the node does not directly know any details of
what has to be done to the individual array elements while performing the
array coercion.  Instead, the per-element processing is represented by
a sub-expression whose input is a source array element and whose output
is a target array element.  This simplifies life in parse_coerce.c,
because it can build that sub-expression by a recursive invocation of
coerce_to_target_type(), and it allows the executor to handle the
per-element processing as a compiled expression instead of hard-wired
code.  This is probably about a wash or a small loss performance-wise
for the simplest case where we just need to invoke one coercion function
for each element.  However, there are many cases where the existing code
ends up generating two nested ArrayCoerceExprs, one to do the type
conversion and one to apply a typmod (length) coercion function.  In the
new code there will be just one ArrayCoerceExpr, saving one deconstruction
and reconstruction of the array.  If I hadn't done it like this, adding
domains into the mix could have produced as many as three
ArrayCoerceExprs, where the top one would have only checked domain
constraints; that did not sound nice performance-wise, and it would have
required a lot of code duplication as well.
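
In node terms, the reworked expression can be pictured roughly like this
(a sketch with illustrative field names; see the patch itself for the
real definition):

/* Rough sketch of the reworked node described above. */
typedef struct ArrayCoerceExprSketch
{
    Expr        xpr;
    Expr       *arg;            /* the input array expression */
    Expr       *elemexpr;       /* per-element work: takes one source
                                 * element, yields one target element,
                                 * covering type conversion, typmod
                                 * coercion and domain checks in one pass */
    Oid         resulttype;     /* target array type */
    int32       resulttypmod;   /* typmod of the target array type */
} ArrayCoerceExprSketch;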

Finally, 03-support-arrays-of-domains.patch simply turns on the spigot
by creating an array type in DefineDomain(), and adds some test cases.
Given the new method of handling ArrayCoerceExpr, everything Just Works.

I'll add this to the next commitfest.

regards, tom lane

diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index afc733f..277ca8b 100644
*** a/src/backend/optimizer/prep/preptlist.c
--- b/src/backend/optimizer/prep/preptlist.c
*** expand_targetlist(List *tlist, int comma
*** 306,314 
  		new_expr = coerce_to_domain(new_expr,
  	InvalidOid, -1,
  	atttype,
  	COERCE_IMPLICIT_CAST,
  	-1,
- 	false,
  	false);
  	}
  	else
--- 306,314 
  		new_expr = coerce_to_domain(new_expr,
  	InvalidOid, -1,
  	atttype,
+ 	COERCION_IMPLICIT,
  	COERCE_IMPLICIT_CAST,
  	-1,
  	false);
  	}
  	else
diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 0bc7dba..c406cea 100644
*** a/src/backend/parser/parse_coerce.c
--- b/src/backend/parser/parse_coerce.c
***
*** 34,48 
  
  static Node *coerce_type_typmod(Node *node,
     Oid targetTypeId, int32 targetTypMod,
!    CoercionForm cformat, int location,
!    bool isExplicit, bool hideInputCoercion);
  static void hide_coercion_node(Node *node);
  static Node *build_coercion_expression(Node *node,
  		  CoercionPathType pathtype,
  		  Oid funcId,
  		  Oid targetTypeId, int32 targetTypMod,
! 		  CoercionForm cformat, int location,
! 		  bool isExplicit);
  static Node *coerce_record_to_complex(ParseState *pstate, Node *node,
  		 Oid targetTypeId,
  		 CoercionContext ccontext,
--- 34,49 
  
  static Node *coerce_type_typmod(Node *node,
     Oid targetTypeId, int32 targetTypMod,
!    CoercionContext ccontext, CoercionForm cformat,
!    int location,
!    bool hideInputCoercion);
  static void hide_coercion_node(Node *node);
  static Node 

Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2017-08-11 Thread Sokolov Yura
On Thu, 10 Aug 2017 18:12:34 +0300
Alexander Korotkov  wrote:

> On Thu, Aug 10, 2017 at 3:30 PM, Sokolov Yura
>  wrote:
> 
> > On 2017-07-31 18:56, Sokolov Yura wrote:
> >  
> >> Good day, every one.
> >>
> >> In attempt to improve performance of YCSB on zipfian distribution,
> >> it were found that significant time is spent in XidInMVCCSnapshot
> >> in scanning snapshot->xip array. While overall CPU time is not too
> >> noticable, it has measurable impact on scaleability.
> >>
> >> First I tried to sort snapshot->xip in GetSnapshotData, and search
> >> in a sorted array. But since snapshot->xip is not touched if no
> >> transaction contention occurs, sorting xip always is not best
> >> option.
> >>
> >> Then I sorted xip array on demand in XidInMVCCSnapshot only if
> >> search in snapshot->xip occurs (ie lazy sorting). It performs much
> >> better, but since it is O(NlogN), sort's execution time is
> >> noticable for large number of clients.
> >>
> >> Third approach (present in attached patch) is making hash table
> >> lazily on first search in xip array.
> >>
> >> Note: hash table is not built if number of "in-progress" xids is
> >> less than 60. Tests shows, there is no big benefit from doing so
> >> (at least on Intel Xeon).
> >>
> >> With regards,
> >>  
> >
> > Here is new, more correct, patch version, and results for pgbench
> > with zipfian distribution (50% read 50% write) (same to Alik's
> > tests at
> > https://www.postgresql.org/message-id/BF3B6F54-68C3-417A-BFA
> > b-fb4d66f2b...@postgrespro.ru )
> >
> >
> > clients | master | hashsnap2 | hashsnap2_lwlock
> > ++---+--
> >  10 | 203384 |203813 |   204852
> >  20 | 334344 |334268 |   363510
> >  40 | 228496 |231777 |   383820
> >  70 | 146892 |148173 |   221326
> > 110 |  99741 |111580 |   157327
> > 160 |  65257 | 81230 |   112028
> > 230 |  38344 | 56790 |77514
> > 310 |  22355 | 39249 |55907
> > 400 |  13402 | 26899 |39742
> > 500 |   8382 | 17855 |28362
> > 650 |   5313 | 11450 |17497
> > 800 |   3352 |  7816 |11030
> >
> > (Note: I'm not quite sure, why my numbers for master are lower than
> > Alik's one. Probably, our test methodology differs a bit. Perhaps,
> > it is because I don't recreate tables between client number change,
> > but between branch switch).
> >  
> 
> These results look very cool!
> I think CSN is eventually inevitable, but it's a long distance
> feature. Thus, this optimization could make us a good serve before we
> would have CSN. Do you expect there are cases when this patch causes
> slowdown?  What about case when we scan each xip array only once (not
> sure how to simulate that using pgbench)?
> 

Patched version allocates a bit more memory per process (if number of
backends is greater than 60), and for copied snapshot (if number of
concurrent transactions is greater than 60).

I have no strong evidence that the patch could cause a slowdown. When the
xip array is scanned only once, XidInMVCCSnapshot usually consumes
0.2-0.5% of total CPU, and scanning the xip array even less. So even if
building the hash table slows the first scan down three to four times, it
would increase overall CPU usage by just 0.5-1%, and that is not
necessarily reflected directly in tps.

-- 
With regards,
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company




Re: [HACKERS] Request more documentation for incompatibility of parallelism and plpgsql exec_run_select

2017-08-11 Thread Robert Haas
On Thu, Aug 10, 2017 at 6:21 PM, Mark Dilger  wrote:
> That misses the point I was making.  I was suggesting a syntax where
> the caller promises to use all rows without stopping short, and the
> database performance problems of having a bunch of parallel workers
> suspended in mid query is simply the caller's problem if the caller does
> not honor the contract.

I understand.  My point is: that isn't sufficient to solve the
problem.  It's not a question of whether the caller uses all the
tuples, but whether the caller gets to do anything else while the
tuples are being generated, so to make it work, you'd have to first
run the parallel query to completion and buffer all the tuples in
memory and *only then* begin iterating over them and running the
user-supplied code.  You can't run the parallel query until it
produces a tuple, then do the code inside the loop, then run it until
the next tuple shows up, etc.
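
As a sketch of the two shapes, using the SPI API purely for illustration
(the query and function names here are made up), only the first could
safely use a parallel plan:

/* Shape 1: run the query to completion; SPI buffers all the tuples. */
static void
materialize_then_iterate(void)
{
    uint64      i;

    SPI_execute("SELECT a FROM t", true, 0);
    for (i = 0; i < SPI_processed; i++)
    {
        /* user-supplied loop body runs only after the query finished */
    }
}

/* Shape 2: fetch a row, run user code, fetch the next row, and so on.
 * Parallel workers would sit suspended while the loop body runs. */
static void
interleave_fetch_and_user_code(void)
{
    Portal      portal = SPI_cursor_open_with_args("c", "SELECT a FROM t",
                                                   0, NULL, NULL, NULL,
                                                   true, 0);

    for (;;)
    {
        SPI_cursor_fetch(portal, true, 1);
        if (SPI_processed == 0)
            break;
        /* user-supplied loop body runs between fetches */
    }
    SPI_cursor_close(portal);
}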

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Patches I'm thinking of pushing shortly

2017-08-11 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 11, 2017 at 11:24 AM, Tom Lane  wrote:
>> 3. remove-pgbench-option-ordering-constraint.patch from
>> https://www.postgresql.org/message-id/20559.1501703...@sss.pgh.pa.us
>> 
>> That one was already informally reviewed by two people, so I don't
>> think it needs another look.

> I'd vote for putting this fix into v10, but maybe that's just me.

... OK by me, any objections?

regards, tom lane




Re: [HACKERS] Patches I'm thinking of pushing shortly

2017-08-11 Thread Robert Haas
On Fri, Aug 11, 2017 at 11:24 AM, Tom Lane  wrote:
> 3. remove-pgbench-option-ordering-constraint.patch from
> https://www.postgresql.org/message-id/20559.1501703...@sss.pgh.pa.us
>
> That one was already informally reviewed by two people, so I don't
> think it needs another look.

I'd vote for putting this fix into v10, but maybe that's just me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] scan on inheritance parent with no children in current session

2017-08-11 Thread Robert Haas
On Sun, Aug 6, 2017 at 10:56 PM, Amit Langote
 wrote:
> Good catch.  I agree that getting an Append node after that commit is
> unintentional and we should fix so that we don't get an Append.  So, +1 to
> your patch.  I looked at the patch and the code fix seems to do what we want.

So, I also agree that this is a good fix, but I don't think it fixes
the whole problem.  Consider:

rhaas=# create table parent (a int) partition by list (a);
CREATE TABLE
rhaas=# create temp table child partition of parent for values in (1);
CREATE TABLE
rhaas=# explain verbose select * from parent;
                               QUERY PLAN
--------------------------------------------------------------------------
 Append  (cost=0.00..35.50 rows=2550 width=4)
   ->  Seq Scan on pg_temp_3.child  (cost=0.00..35.50 rows=2550 width=4)
 Output: child.a
(3 rows)

But the comments say:

 * A childless table is never considered to be an inheritance set; therefore
 * a parent RTE must always have at least two associated AppendRelInfos.

Yet, not.  So at least the comments need to be updated; not sure if we
want to try to eliminate the Append node in this case also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Patches I'm thinking of pushing shortly

2017-08-11 Thread Tom Lane
I have some patches sitting around in my workspace that I think are
non-controversial, and so I was considering just pushing them once
the tree opens for v11 development.  If anyone thinks they need
further review, I'll put them into the September commitfest, but
otherwise we might as well skip the overhead.  These are:

1. check-hash-bucket-size-against-work_mem-2.patch from
https://www.postgresql.org/message-id/13698.1487283...@sss.pgh.pa.us

That discussion sort of trailed off, but there wasn't really anyone
saying not to commit it, and no new ideas have surfaced.

2. simplify-simple-expresssion-checking.patch from
https://www.postgresql.org/message-id/2257.1498412...@sss.pgh.pa.us

This is the patch to get rid of plpgsql's exec_simple_check_node()
as Robert suggested.  It's never been anything but a maintenance
nuisance.

3. remove-pgbench-option-ordering-constraint.patch from
https://www.postgresql.org/message-id/20559.1501703...@sss.pgh.pa.us

That one was already informally reviewed by two people, so I don't
think it needs another look.

regards, tom lane




Re: [HACKERS] POC: Sharing record typmods between backends

2017-08-11 Thread Robert Haas
On Fri, Aug 11, 2017 at 4:39 AM, Thomas Munro
 wrote:
> OK.  Now it's ds_hash_table.{c,h}, where "ds" stands for "dynamic
> shared".  Better?  If we were to do other data structures in DSA
> memory they could follow that style: ds_red_black_tree.c, ds_vector.c,
> ds_deque.c etc and their identifier prefix would be drbt_, dv_, dd_
> etc.
>
> Do you want to see a separate patch to rename dsa.c?  Got a better
> name?  You could have spoken up earlier :-)  It does sound like a bit
> like the thing from crypto or perhaps a scary secret government
> department.

I doubt that we really want to have accessor functions with names like
dynamic_shared_hash_table_insert or ds_hash_table_insert. Long names
are fine, even desirable, for APIs that aren't too widely used,
because they're relatively self-documenting, but a 30-character
function name gets annoying in a hurry if you have to call it very
often, and this is intended to be reusable for other things that want
a dynamic shared memory hash table.  I think we should (a) pick some
reasonably short prefix for all the function names, like dht or dsht
or ds_hash, but not ds_hash_table or dynamic_shared_hash_table and (b)
also use that prefix as the name for the .c and .h files.

Right now, we've got a situation where the most widely-used hash table
implementation uses dynahash.c for the code, hsearch.h for the
interface, and "hash" as the prefix for the names, and that's really
hard to remember.  I think having a consistent naming scheme
throughout would be a lot better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel COPY FROM execution

2017-08-11 Thread Alex K
Greetings pgsql-hackers,

I have completed the general infrastructure for parallel COPY FROM
execution, consisting of a main (master) process and multiple BGWorkers,
each connected to the master with a personal message queue (shm_mq).

Master process does:
- Dynamic shared memory allocation with parallel state shared across
  the BGWorkers and the master
- Attaching every worker to its personal message queue (shm_mq)
- Wait for worker initialization using a Latch
- Read raw text lines using CopyReadLine and put them into the shm_mqs
  round-robin to balance the load
- When EOF is reached, send a zero-length message, on receipt of which
  the workers safely shut down
- Wait for the workers to complete their jobs using a ConditionVariable

Each BGWorker does:
- Signal the master on initialization via a Latch
- Receive raw text lines over its personal shm_mq and put them into
  the log (for now)
- Reinitialize the db connection using the same db_id and user_id as the
  main process
- Signal the master via a ConditionVariable when the job is done

All parallel state modifications are done under LWLocks.
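
A sketch of the master's round-robin distribution step described above,
under the assumptions that worker_mqh[] holds the attached queue handles
and read_line wraps CopyReadLine; these names are illustrative, not the
branch's actual code:

/* Sketch only: distribute lines round-robin, then signal shutdown. */
static void
distribute_lines(shm_mq_handle **worker_mqh, int nworkers,
                 StringInfo line, bool (*read_line) (StringInfo))
{
    int         next = 0;

    while (read_line(line))
    {
        /* Block until this worker's queue has room (nowait = false). */
        (void) shm_mq_send(worker_mqh[next], line->len, line->data, false);
        next = (next + 1) % nworkers;
    }

    /* A zero-length message tells each worker to shut down cleanly. */
    for (next = 0; next < nworkers; next++)
        (void) shm_mq_send(worker_mqh[next], 0, NULL, false);
}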

You can find actual code here https://github.com/ololobus/postgres/pull/2/files
(it is still in progress, so has a lot of duplications and comments,
which are to-be-deleted)

To go forward I have to overcome some obstacles:

- Currently all copy.c methods are designed to work with one giant
  structure, CopyState. It includes buffers, many initial parameters that
  stay unchanged, and a few variables that vary during COPY FROM
  execution. Since I need all these parameters, I have to obtain them
  somehow inside each BGWorker process. I see two possible solutions here:

  1) Perform the BeginCopyFrom initialization inside the master and put
  the required parameters into shared memory. However, many of them are
  arrays of variable size (e.g. partition_tupconv_maps,
  force_notnull_flags), so I cannot put them into shmem inside one single
  struct. The best idea I have is to put each parameter under a personal
  shmem TOC key, which seems quite ugly.

  2) Perform the BeginCopyFrom initialization inside each BGWorker.
  However, it also opens the input file/pipe for reading, which is not
  necessary for the workers and may cause some interference with the
  master, but I can modify BeginCopyFrom.

- I have used both Latch and ConditionVariable for the same purpose
  (wait until some signal occurs) and for me as an end user they behave
  quite similarly. I looked into the condition_variable.c code and it
  uses a Latch and a SpinLock under the hood. So what are the differences
  and the dis-/advantages between Latch and ConditionVariable?

I would be glad if someone could help me find an answer to this question;
any comments and remarks on the overall COPY FROM processing architecture
are also very welcome.


Alexey




Re: [HACKERS] Thoughts on unit testing?

2017-08-11 Thread Tom Lane
Peter Eisentraut  writes:
> On 8/10/17 17:53, Thomas Munro wrote:
>> The current regression tests, isolation tests and TAP tests are very
>> good (though I admit my experience with TAP is limited), but IMHO we
>> are lacking support for C-level unit testing.  Complicated, fiddly
>> things with many states, interactions, edge cases etc can be hard to
>> get full test coverage on from the outside.  Consider
>> src/backend/utils/mmgr/freepage.c as a case in point.

> I don't have a good idea how to address this, but I agree that something
> in this area would be widely useful.

I don't think anyone doubts it would be useful.  The question is can we
get to something useful with an acceptable level of effort and overhead.
So far I've not seen anything promising in that sense.

regards, tom lane




Re: [HACKERS] Funny WAL corruption issue

2017-08-11 Thread Chris Travers
On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark  wrote:

> On 10 August 2017 at 15:26, Chris Travers  wrote:
> >
> >
> > The bitwise comparison is interesting.  Remember the error was:
> >
> > pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> > pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
> > 1146880
> ...
> > Since this didn't throw a checksum error (we have data checksums
> disabled but wal records ISTR have a separate CRC check), would this
> perhaps indicate that the checksum operated over incorrect data?
>
> No checksum error and this "unexpected pageaddr" doesn't necessarily
> mean data corruption. It could mean that when the database stopped logging
> it was reusing a wal file and the old wal stream had a record boundary
> on the same byte position. So the previous record checksum passed and
> the following record checksum passes but the record header is for a
> different wal stream position.
>
> I think you could actually hack xlogdump to ignore this condition and
> keep outputting and you'll see whether the records that follow appear
> to be old wal log data.  I haven't actually tried this though.
>

For better or worse, I get a different error at the same spot if I try
this (doing so involved disabling the check in the backend WAL reader):

pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: invalid
contrecord length 4509 at 1E39C/E1117FF8

If I hack it to ignore all errors on that record, no further records come
out though it does run over the same records.

This leads me to conclude there are no further valid records.


>
> --
> greg
>



-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] Thoughts on unit testing?

2017-08-11 Thread Peter Eisentraut
On 8/10/17 17:53, Thomas Munro wrote:
> The current regression tests, isolation tests and TAP tests are very
> good (though I admit my experience with TAP is limited), but IMHO we
> are lacking support for C-level unit testing.  Complicated, fiddly
> things with many states, interactions, edge cases etc can be hard to
> get full test coverage on from the outside.  Consider
> src/backend/utils/mmgr/freepage.c as a case in point.

I don't have a good idea how to address this, but I agree that something
in this area would be widely useful.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SCRAM protocol documentation

2017-08-11 Thread Peter Eisentraut
On 8/11/17 09:06, Álvaro Hernández Tortosa wrote:
>  Strictly speaking the RFC assumes that the username is at least 1 
> character. I understand this was precisely Peter's original comment.

Well, my main point was that the documentation, the code, and the code
comments all say slightly different things.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SCRAM protocol documentation

2017-08-11 Thread Peter Eisentraut
On 8/11/17 07:18, Michael Paquier wrote:
> The problem is where a username includes characters as a comma or '=',
> which can be avoided if the string is in UTF-8 as the username is
> prepared with SASLprep before being used in the SASL exchange, but we
> have no way now to be sure that the string is actually in UTF-8.
> If at some point we decide that only things using UTF-8 are good to be
> used during authentication, using the username in the exchange
> messages instead of the one in the startup packet would be fine and
> actually better IMO in the long term. Please note that the
> specification says that both the username and the password must be
> encoded in UTF-8, so we are not completely compliant here. If there is
> something to address, that would be this part.

So we already handle passwords.  Can't we handle user names the same way?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Page Scan Mode in Hash Index

2017-08-11 Thread Ashutosh Sharma
>
> Comments on the latest patch:

Thank you once again for reviewing my patches.

>
> 1.
> /*
> + * We remember prev and next block number along with current block
> + * number so that if fetching the tuples using cursor we know the
> + * page from where to begin. This is for the case where we have
> + * reached the end of bucket chain without finding any matching
> + * tuples.
> + */
> + if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> + prev_blkno = opaque->hasho_prevblkno;
>
> This doesn't seem to be the right place for this comment as you are
> not saving next or current block number here.  I am not sure whether
> you really need this comment at all.  Will it be better if you move
> this to else part of the loop and reword it as:
>
> "Remember next and previous block numbers for scrollable cursors to
> know the start position."

Shifted the comments to else part of the loop.

>
> 2. The code in _hash_readpage looks quite bloated.  I think we can
> make it better if we do something like below.
> a.
> +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
> {
> ..
> + if (so->currPos.buf == so->hashso_bucket_buf ||
> + so->currPos.buf == so->hashso_split_bucket_buf)
> + {
> + so->currPos.prevPage = InvalidBlockNumber;
> + so->currPos.nextPage = opaque->hasho_nextblkno;
> + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
> + }
> + else
> + {
> + so->currPos.prevPage = opaque->hasho_prevblkno;
> + so->currPos.nextPage = opaque->hasho_nextblkno;
> + _hash_relbuf(rel, so->currPos.buf);
> + so->currPos.buf = InvalidBuffer;
> + }
> ..
> }
>
> This code chunk is the same for both forward and backward scans.  I think
> you can have a single copy of this code by moving it out of the if-else
> loop.

Done.
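
The hoisted common chunk might then look roughly like this (a sketch of
the suggested refactor, not the actual patch):

/* Common to forward and backward scans: remember where we are. */
so->currPos.nextPage = opaque->hasho_nextblkno;

if (so->currPos.buf == so->hashso_bucket_buf ||
    so->currPos.buf == so->hashso_split_bucket_buf)
{
    /* keep the pin on the (split) bucket page, just drop the lock */
    so->currPos.prevPage = InvalidBlockNumber;
    LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
else
{
    so->currPos.prevPage = opaque->hasho_prevblkno;
    _hash_relbuf(rel, so->currPos.buf);
    so->currPos.buf = InvalidBuffer;
}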

>
> b.
> +_hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
> {
> ..
> + /* new page, locate starting position by binary search */
> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
> +
> + itemIndex = _hash_load_qualified_items(scan, page, offnum, dir);
> +
> + while (itemIndex == 0)
> + {
> + /*
> + * Could not find any matching tuples in the current page, move to
> + * the next page. Before leaving the current page, also deal with
> + * any killed items.
> + */
> + if (so->numKilled > 0)
> + _hash_kill_items(scan);
> +
> + /*
> + * We remember prev and next block number along with current block
> + * number so that if fetching the tuples using cursor we know the
> + * page from where to begin. This is for the case where we have
> + * reached the end of bucket chain without finding any matching
> + * tuples.
> + */
> + if (!BlockNumberIsValid(opaque->hasho_nextblkno))
> + prev_blkno = opaque->hasho_prevblkno;
> +
> + _hash_readnext(scan, &buf, &page, &opaque);
> + if (BufferIsValid(buf))
> + {
> + so->currPos.buf = buf;
> + so->currPos.currPage = BufferGetBlockNumber(buf);
> + so->currPos.lsn = PageGetLSN(page);
> + offnum = _hash_binsearch(page, so->hashso_sk_hash);
> + itemIndex = _hash_load_qualified_items(scan, page,
> +   offnum, dir);
> ..
> }
>
> Have just one copy of the code that searches the offset and loads the
> qualified items. Change the above while loop to a do..while loop and add
> a check in between that breaks the loop when the item index is not zero.

Done that way.

>
> 3.
> - * Find the first item in the index that
> - * satisfies the qualification associated with the scan descriptor. On
> - * success, the page containing the current index tuple is read locked
> - * and pinned, and the scan's opaque data entry is updated to
> - * include the buffer.
> + * We find the first item (or, if backward scan, the last item) in
> + * the index that satisfies the qualification associated with the
> + * scan descriptor. On success, the page containing the current
> + * index tuple is read locked and pinned, and data about the
> + * matching tuple(s) on the page has been loaded into so->currPos,
> + * scan->xs_ctup.t_self is set to the heap TID of the current tuple.
> + *
> + * If there are no matching items in the index, we return FALSE,
> + * with no pins or locks held.
>   */
>  bool
>  _hash_first(IndexScanDesc scan, ScanDirection dir)
> @@ -229,15 +297,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
>
> I don't think the lock or pin is held on success anymore; after this
> patch only a pin will be retained, and that only for the bucket page.
> Also, there is no need to modify the part of the comment which is not
> related to the change in this patch.

Corrected.

>
> I don't see scan->xs_ctup.t_self getting set anywhere in this
> function.  I think you are setting it in hashgettuple.  It is better
> to move that assignment from hashgettuple to _hash_first so as to be
> consistent with _hash_next.

Done that way.

>
> 4.
> + * On successful exit, scan->xs_ctup.t_self is set to the TID
> + * of the next heap tuple, and if requested, scan->xs_itup
> + * points to a copy of the index tuple.  so->currPos is updated
> + * as needed.
> + *
> + * On failure exit (no more tuples), we return FALSE with no
> + * 

Re: [HACKERS] SCRAM protocol documentation

2017-08-11 Thread Álvaro Hernández Tortosa



On 11/08/17 15:00, Michael Paquier wrote:
> On Fri, Aug 11, 2017 at 9:31 PM, Álvaro Hernández Tortosa
>  wrote:
>> On 11/08/17 13:18, Michael Paquier wrote:
>>> On Fri, Aug 11, 2017 at 3:50 PM, Álvaro Hernández Tortosa
>>>  wrote:
>>>>> Relatedly, the SCRAM specification doesn't appear to allow omitting the
>>>>> user name in this manner.  Why don't we just send the actual user name,
>>>>> even though it's redundant with the startup message?
>>> The problem is when a username includes characters such as a comma or '=',
>>> which can be avoided if the string is in UTF-8, as the username is
>>> prepared with SASLprep before being used in the SASL exchange, but we
>>> have no way now to be sure that the string is actually in UTF-8.
>>> If at some point we decide that only things using UTF-8 are good to be
>>> used during authentication, using the username in the exchange
>>> messages instead of the one in the startup packet would be fine and
>>> actually better IMO in the long term. Please note that the
>>> specification says that both the username and the password must be
>>> encoded in UTF-8, so we are not completely compliant here. If there is
>>> something to address, that would be this part.
>> The reason why the username is ignored, unless I'm wrong, is not exactly
>> that it is already sent. It is that Postgres does not restrict usernames to
>> be UTF-8 only, while SCRAM does. As such, if a username is not UTF-8,
>> it cannot be sent reliably over SCRAM.
> That's basically the point I was making. Note that I would not be
> against Postgres forcing strings to be in UTF-8. Now things are fuzzy
> because of the lack of restrictions.

    I'm +1 for that. But I guess that involves a protocol change,
and that's a completely different can of worms.

>>>> If there's a clear meaning about ignoring the user here, why not settle
>>>> on something like the "*"? It's not going to change the world sending a
>>>> few bytes less on initialization, but I guess it doesn't hurt either...
>>> I am not sure either that '*' would be that much helpful. Requiring
>>> that things are in UTF-8 would be more compliant with the original
>>> RFC.
>> But we really don't need to send the username, since Postgres already
>> knows it (and that accommodates for non-UTF-8 usernames). So why bother?
>> Just sending something like "*" (which is UTF-8 and produces the same value
>> under SASLprep) should be enough. I think the idea of ignoring the username
>> is pretty neat, but maybe a "standard" like "send me an asterisk here" could
>> be even better than leaving it empty.
> Personally I don't see much difference between both, so I'd rather
> leave things as they are now.

    Strictly speaking, the RFC assumes that the username is at least one
character. I understand this was precisely Peter's original comment.
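
For reference, RFC 5802 handles the delimiter problem by defining the
username as a "saslname" in which ',' and '=' must be escaped as "=2C"
and "=3D" respectively before being placed in the n= field of the
client-first-message. A minimal sketch of that escaping, using
PostgreSQL's StringInfo API (illustrative only; this is not the actual
libpq code):

#include "lib/stringinfo.h"

/* Append username to buf, escaped as an RFC 5802 saslname. */
static void
append_saslname(StringInfo buf, const char *username)
{
    const char *p;

    for (p = username; *p != '\0'; p++)
    {
        if (*p == ',')
            appendStringInfoString(buf, "=2C");
        else if (*p == '=')
            appendStringInfoString(buf, "=3D");
        else
            appendStringInfoChar(buf, *p);
    }
}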



Álvaro

--

Álvaro Hernández Tortosa




Re: [HACKERS] SCRAM protocol documentation

2017-08-11 Thread Michael Paquier
On Fri, Aug 11, 2017 at 9:31 PM, Álvaro Hernández Tortosa
 wrote:
> On 11/08/17 13:18, Michael Paquier wrote:
>> On Fri, Aug 11, 2017 at 3:50 PM, Álvaro Hernández Tortosa
>>  wrote:
>>>> Relatedly, the SCRAM specification doesn't appear to allow omitting the
>>>> user name in this manner.  Why don't we just send the actual user name,
>>>> even though it's redundant with the startup message?
>>
>> The problem is when a username includes characters such as a comma or '=',
>> which can be avoided if the string is in UTF-8, as the username is
>> prepared with SASLprep before being used in the SASL exchange, but we
>> have no way now to be sure that the string is actually in UTF-8.
>> If at some point we decide that only things using UTF-8 are good to be
>> used during authentication, using the username in the exchange
>> messages instead of the one in the startup packet would be fine and
>> actually better IMO in the long term. Please note that the
>> specification says that both the username and the password must be
>> encoded in UTF-8, so we are not completely compliant here. If there is
>> something to address, that would be this part.
>
> The reason why the username is ignored, unless I'm wrong, is not exactly
> that it is already sent. It is that Postgres does not restrict usernames to
> be UTF-8 only, while SCRAM does. As such, if a username is not UTF-8,
> it cannot be sent reliably over SCRAM.

That's basically the point I was making. Note that I would not be
against Postgres forcing strings to be in UTF-8. Now things are fuzzy
because of the lack of restrictions.

>>>  If there's a clear meaning about ignoring the user here, why not
>>> settle
>>> on something like the "*"? It's not going to change the world sending a
>>> few
>>> bytes less on initialization, but I guess it doesn't hurt either...
>>
>> I am not sure either that '*' would be that much helpful. Requiring
>> that things are in UTF-8 would be more compliant with the original
>> RFC.
>
> But we really don't need to send the username, since Postgres already
> knows it (and that accommodates for non-UTF-8 usernames). So why bother?
> Just sending something like "*" (which is UTF-8 and produces the same value
> under SASLprep) should be enough. I think the idea of ignoring the username
> is pretty neat, but maybe a "standard" like "send me an asterisk here" could
> be even better than leaving it empty.

Personally I don't see much difference between both, so I'd rather
leave things as they are now.
-- 
Michael




Re: [HACKERS] Funny WAL corruption issue

2017-08-11 Thread Chris Travers
On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark  wrote:

> On 10 August 2017 at 15:26, Chris Travers  wrote:
> >
> >
> > The bitwise comparison is interesting.  Remember the error was:
> >
> > pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> > pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
> > 1146880
> ...
> > Since this didn't throw a checksum error (we have data checksums
> > disabled but WAL records ISTR have a separate CRC check), would this
> > perhaps indicate that the checksum operated over incorrect data?
>
> No checksum error and this "unexpected pageaddr" doesn't necessarily
> mean data corruption. It could mean that when the database stopped logging,
> it was reusing a WAL file and the old WAL stream had a record boundary
> at the same byte position. So the previous record's checksum passed and
> the following record's checksum passes, but the record header is for a
> different WAL stream position.
>

I expect to test this theory shortly.

Assuming it is correct, what can we do to prevent restarts of slaves from
running into it?


> I think you could actually hack pg_xlogdump to ignore this condition and
> keep outputting, and you'll see whether the records that follow appear
> to be old WAL data.  I haven't actually tried this, though.
>
> --
> greg
>



-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: [HACKERS] [BUGS] Replication to Postgres 10 on Windows is broken

2017-08-11 Thread Augustine, Jobin
It appears that the patch is not helping.
Errors like the ones below still appear in the log:
===
2017-08-11 12:22:35 UTC [2840]: [1-1] user=,db=,app=,client= FATAL:  could
not connect to the primary server: could not send data to server: Socket is
not connected (0x2749/10057)
could not send startup packet: Socket is not connected
(0x2749/10057)
2017-08-11 12:22:40 UTC [1964]: [1-1] user=,db=,app=,client= FATAL:  could
not connect to the primary server: could not send data to server: Socket is
not connected (0x2749/10057)
could not send startup packet: Socket is not connected
(0x2749/10057)
2017-08-11 12:22:45 UTC [248]: [1-1] user=,db=,app=,client= FATAL:  could
not connect to the primary server: could not send data to server: Socket is
not connected (0x2749/10057)
could not send startup packet: Socket is not connected
(0x2749/10057)
===


On Fri, Aug 11, 2017 at 7:28 AM, Augustine, Jobin <
jobin.august...@openscg.com> wrote:

> I am working on creating an independent build environment on Windows to
> test the patch from Andres.
> I shall come back with results, possibly within another 24 hours.
>
> -Jobin
>
> On Fri, Aug 11, 2017 at 7:25 AM, Andres Freund  wrote:
>
>> On 2017-08-10 18:51:07 -0700, Noah Misch wrote:
>> > On Tue, Aug 08, 2017 at 07:25:37PM -0400, Peter Eisentraut wrote:
>> > > On 8/7/17 21:06, Noah Misch wrote:
>> > > >> That would fit.  Until v10 (commit 1e8a850), PQconnectStart() had
>> no in-tree
>> > > >> callers outside of libpq itself.
>> > > > [Action required within three days.  This is a generic
>> notification.]
>> > > >
>> > > > The above-described topic is currently a PostgreSQL 10 open item.
>> Peter,
>> > > > since you committed the patch believed to have created it, you own
>> this open
>> > > > item.
>> > >
>> > > I don't think I can usefully contribute to this.  Could someone else
>> > > take it?
>> >
>> > If nobody volunteers, you could always resolve this by reverting
>> 1e8a850 and
>> > successors.
>>
>> I've volunteered a fix nearby, just can't test it. I don't think we can
>> require committers to work on Windows.
>>
>> - Andres
>>
>
>
>
> --
>
> *Jobin Augustine*
> Architect : Production Database Operations
>
>
> *OpenSCG*
>
>
> *phone : +91 9989932600*
>



-- 

*Jobin Augustine*
Architect : Production Database Operations


*OpenSCG*


*phone : +91 9989932600*


Re: [HACKERS] SCRAM protocol documentation

2017-08-11 Thread Álvaro Hernández Tortosa



On 11/08/17 13:18, Michael Paquier wrote:
> On Fri, Aug 11, 2017 at 3:50 PM, Álvaro Hernández Tortosa
>  wrote:
>> On 11/08/17 03:57, Peter Eisentraut wrote:
>>> The SCRAM protocol documentation
>>> (https://www.postgresql.org/docs/devel/static/sasl-authentication.html)
>>> states
>>>
>>> "To avoid confusion, the client should use pg_same_as_startup_message as
>>> the username in the client-first-message."
>>>
>>> However, the client implementation in libpq doesn't actually do that, it
>>> sends an empty string for the user name.  I find no other reference to
>>> "pg_same_as_startup_message" in the sources.  Should the documentation
>>> be updated?
> Yes, definitely. I think that we should mention that the server uses
> the username of the startup packet and ignores the username potentially
> provided by the frontend in the client-first-message.

    But the documentation already says so:

"The username that was already sent in the startup message is used instead."

>>> Relatedly, the SCRAM specification doesn't appear to allow omitting the
>>> user name in this manner.  Why don't we just send the actual user name,
>>> even though it's redundant with the startup message?
> The problem is when a username includes characters such as a comma or '=',
> which can be avoided if the string is in UTF-8, as the username is
> prepared with SASLprep before being used in the SASL exchange, but we
> have no way now to be sure that the string is actually in UTF-8.
> If at some point we decide that only things using UTF-8 are good to be
> used during authentication, using the username in the exchange
> messages instead of the one in the startup packet would be fine and
> actually better IMO in the long term. Please note that the
> specification says that both the username and the password must be
> encoded in UTF-8, so we are not completely compliant here. If there is
> something to address, that would be this part.

    The reason why the username is ignored, unless I'm wrong, is not
exactly that it is already sent. It is that Postgres does not restrict
usernames to be UTF-8 only, while SCRAM does. As such, if a username
is not UTF-8, it cannot be sent reliably over SCRAM.

>>  If there's a clear meaning about ignoring the user here, why not settle
>> on something like the "*"? It's not going to change the world sending a few
>> bytes less on initialization, but I guess it doesn't hurt either...
> I am not sure either that '*' would be that much helpful. Requiring
> that things are in UTF-8 would be more compliant with the original
> RFC.

    But we really don't need to send the username, since Postgres
already knows it (and that accommodates for non-UTF-8 usernames). So why
bother? Just sending something like "*" (which is UTF-8 and produces the
same value under SASLprep) should be enough. I think the idea of
ignoring the username is pretty neat, but maybe a "standard" like "send
me an asterisk here" could be even better than leaving it empty.



Álvaro

--

Álvaro Hernández Tortosa




Re: [HACKERS] Funny WAL corruption issue

2017-08-11 Thread Greg Stark
On 10 August 2017 at 15:26, Chris Travers  wrote:
>
>
> The bitwise comparison is interesting.  Remember the error was:
>
> pg_xlogdump: FATAL:  error in WAL record at 1E39C/E1117FB8: unexpected
> pageaddr 1E375/61118000 in log segment 0001E39C00E1, offset
> 1146880
...
> Since this didn't throw a checksum error (we have data checksums disabled but
> WAL records ISTR have a separate CRC check), would this perhaps indicate that
> the checksum operated over incorrect data?

No checksum error and this "unexpected pageaddr" doesn't necessarily
mean data corruption. It could mean that when the database stopped logging,
it was reusing a WAL file and the old WAL stream had a record boundary
at the same byte position. So the previous record's checksum passed and
the following record's checksum passes, but the record header is for a
different WAL stream position.

I think you could actually hack pg_xlogdump to ignore this condition and
keep outputting, and you'll see whether the records that follow appear
to be old WAL data.  I haven't actually tried this, though.
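
For context, the error quoted above comes from WAL page-header
validation, which compares the address stamped in each page header with
the address the page was read from. A simplified sketch of that check
(modelled on ValidXLogPageHeader in xlogreader.c; error reporting and
the other header checks are elided):

static bool
page_addr_is_expected(XLogPageHeader hdr, XLogRecPtr expected_pageaddr)
{
    /*
     * A page recycled from an old segment still carries the pageaddr of
     * its previous life, so a mismatch here can simply mean "old WAL",
     * not corruption.
     */
    return hdr->xlp_pageaddr == expected_pageaddr;
}

Relaxing this check in a hacked copy of pg_xlogdump, as suggested above,
would let it keep decoding and show whether the following records belong
to an older WAL stream.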

-- 
greg




Re: [HACKERS] SCRAM protocol documentation

2017-08-11 Thread Michael Paquier
On Fri, Aug 11, 2017 at 3:50 PM, Álvaro Hernández Tortosa
 wrote:
> On 11/08/17 03:57, Peter Eisentraut wrote:
>> The SCRAM protocol documentation
>> (https://www.postgresql.org/docs/devel/static/sasl-authentication.html)
>> states
>>
>> "To avoid confusion, the client should use pg_same_as_startup_message as
>> the username in the client-first-message."
>>
>> However, the client implementation in libpq doesn't actually do that, it
>> sends an empty string for the user name.  I find no other reference to
>> "pg_same_as_startup_message" in the sources.  Should the documentation
>> be updated?

Yes, definitely. I think that we should mention that the server uses
the username of the startup packet and ignores the username potentially
provided by the frontend in the client-first-message.

>> Relatedly, the SCRAM specification doesn't appear to allow omitting the
>> user name in this manner.  Why don't we just send the actual user name,
>> even though it's redundant with the startup message?

The problem is when a username includes characters such as a comma or '=',
which can be avoided if the string is in UTF-8, as the username is
prepared with SASLprep before being used in the SASL exchange, but we
have no way now to be sure that the string is actually in UTF-8.
If at some point we decide that only things using UTF-8 are good to be
used during authentication, using the username in the exchange
messages instead of the one in the startup packet would be fine and
actually better IMO in the long term. Please note that the
specification says that both the username and the password must be
encoded in UTF-8, so we are not completely compliant here. If there is
something to address, that would be this part.

> If there's a clear meaning about ignoring the user here, why not settle
> on something like the "*"? It's not going to change the world sending a few
> bytes less on initialization, but I guess it doesn't hurt either...

I am not sure either that '*' would be that much helpful. Requiring
that things are in UTF-8 would be more compliant with the original
RFC.
-- 
Michael




Re: [HACKERS] reload-through-the-top-parent switch the partition table

2017-08-11 Thread Rushabh Lathia
On Thu, Aug 10, 2017 at 8:26 PM, Robert Haas  wrote:

> On Thu, Aug 10, 2017 at 3:47 AM, Rushabh Lathia
>  wrote:
> >> (1) seems like a pretty arbitrary restriction, so I don't like that
> >> option.  (2) would hurt performance in some use cases.  Do we have an
> >> option (3)?
> >
> > How about protecting option 2) with the load-via-partition-root
> > protection?  Meaning: load the parents information even when dump is
> > not set, but only when load-via-partition-root & ispartition.
> >
> > This won't hurt performance for the normal cases.
>
> Yes, that seems like the right approach.
>
> +Dump data via the top-most partitioned table (rather than partition
> +table) when dumping data for the partition table.
>
> I think we should phrase this a bit more clearly, something like this:
> When dumping a COPY or INSERT statement for a partitioned table,
> target the root of the partitioning hierarchy which contains it rather
> than the partition itself.  This may be useful when reloading data on
> a server where rows do not always fall into the same partitions as
> they did on the original server.  This could happen, for example, if
> the partitioning column is of type text and the two systems have
> different definitions of the collation used to partition the data.
>
>
Done.


> +printf(_("  --load-via-partition-root    load partition table via
> the root relation\n"));
>
> "relation" seems odd to me here.  root table?
>
>
Done.


>  /* Don't bother computing anything for non-target tables, either
> */
>  if (!tblinfo[i].dobj.dump)
> +{
> +/*
> + * Load the parents information for the partition table when
> + * the load-via-partition-root option is set. As we need the
> + * parents information to get the partition root.
> + */
> +if (dopt->load_via_partition_root &&
> +tblinfo[i].ispartition)
> +findParentsByOid(&tblinfo[i], inhinfo, numInherits);
>  continue;
> +}
>
> Duplicating the call to findParentsByOid seems less then ideal.  How
> about doing something like this:
>
> if (tblinfo[i].dobj.dump)
> {
> find_parents = true;
> mark_parents = true;
> }
> else if (dopt->load_via_partition_root && tblinfo[i].ispartition)
> find_parents = true;
>
> if (find_parents)
> findParentsByOid(&tblinfo[i], inhinfo, numInherits);
>
> etc.
>
>
Done changes to avoid duplicate call to findParentsByOid().


> The comments for this function also need some work - e.g. the function
> header comment deserves some kind of update for these changes.
>
>
Done.


> +static TableInfo *
> +getRootTableInfo(TableInfo *tbinfo)
> +{
> +Assert(tbinfo->ispartition);
> +Assert(tbinfo->numParents == 1);
> +if (tbinfo->parents[0]->ispartition)
> +return getRootTableInfo(tbinfo->parents[0]);
> +
> +return tbinfo->parents[0];
> +}
>
> This code should iterate, not recurse, to avoid any risk of blowing
> out the stack.
>
>
Done.
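
For reference, the iterative version might look roughly like this (a
sketch only; the authoritative code is in the attached patch):

static TableInfo *
getRootTableInfo(TableInfo *tbinfo)
{
    TableInfo  *parentTbinfo;

    Assert(tbinfo->ispartition);
    Assert(tbinfo->numParents == 1);

    parentTbinfo = tbinfo->parents[0];
    while (parentTbinfo->ispartition)
    {
        Assert(parentTbinfo->numParents == 1);
        parentTbinfo = parentTbinfo->parents[0];
    }

    return parentTbinfo;
}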

Please find attached the patch with the changes.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index bafa031..de8297a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -889,6 +889,21 @@ PostgreSQL documentation
  
 
  
+  --load-via-partition-root
+  
+   
+When dumping a COPY or INSERT statement for a partitioned table,
+target the root of the partitioning hierarchy which contains it rather
+than the partition itself.  This will be useful when reloading data on
+a server where rows do not always fall into the same partitions as
+they did on the original server.  This could happen, for example, if
+the partitioning column is of type text and the two systems have
+different definitions of the collation used to partition the data.
+   
+  
+ 
+
+ 
--section=sectionname

  
diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
index aa944a2..dc1d3cc 100644
--- a/doc/src/sgml/ref/pg_dumpall.sgml
+++ b/doc/src/sgml/ref/pg_dumpall.sgml
@@ -431,6 +431,21 @@ PostgreSQL documentation
  
 
  
+  --load-via-partition-root
+  
+   
+When dumping a COPY or INSERT statement for a partitioned table,
+target the root of the partitioning hierarchy which contains it rather
+than the partition itself.  This will be useful when reloading data on
+a server where rows do not always fall into the same partitions as
+they did on the original server.  This could happen, for example, if
+the partitioning column is of type text and the two systems have
+different definitions of the collation used to partition the data.
+   
+  
+ 
+
+ 
   --use-set-session-authorization
   
  

Re: [HACKERS] Foreign tables privileges not shown in information_schema.table_privileges

2017-08-11 Thread Ashutosh Bapat
 On Thu, Aug 10, 2017 at 6:30 PM, Nicolas Thauvin
 wrote:
> Hello,
>
> The information_schema.table_privileges view filters on regular tables
> and views. Foreign tables are not shown in this view but they are in
> other views of the information_schema like tables or column_privileges.
>
> Is it intentional? A patch is attached if not.

The line was first added by 596652d6 and updated by 262e821d to
include partitioned tables. It looks like we forgot to add the relation
kinds introduced in between, i.e. foreign tables and materialized views.
column_privileges doesn't have materialized views either. The attached
patch adds materialized views to the column_privileges view along with
your changes.
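
To illustrate the behavior being fixed, a hypothetical session (server,
table, and role names are made up):

CREATE EXTENSION postgres_fdw;
CREATE SERVER s1 FOREIGN DATA WRAPPER postgres_fdw;
CREATE ROLE alice;
CREATE FOREIGN TABLE ft1 (a int) SERVER s1;
GRANT SELECT ON ft1 TO alice;

-- Before the patch this returns no rows for ft1, even though the grant
-- is already visible in information_schema.column_privileges.
SELECT grantee, privilege_type
  FROM information_schema.table_privileges
 WHERE table_name = 'ft1';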

Please add this to the next commitfest so that it doesn't get forgotten.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index 98bcfa0..fbb5460 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -573,7 +573,7 @@ CREATE VIEW column_privileges AS
   pr_c.relowner
FROM (SELECT oid, relname, relnamespace, relowner, (aclexplode(coalesce(relacl, acldefault('r', relowner)))).*
  FROM pg_class
- WHERE relkind IN ('r', 'v', 'f', 'p')
+ WHERE relkind IN ('r', 'v', 'f', 'p', 'm')
 ) pr_c (oid, relname, relnamespace, relowner, grantor, grantee, prtype, grantable),
 pg_attribute a
WHERE a.attrelid = pr_c.oid
@@ -595,7 +595,7 @@ CREATE VIEW column_privileges AS
 ) pr_a (attrelid, attname, grantor, grantee, prtype, grantable),
 pg_class c
WHERE pr_a.attrelid = c.oid
- AND relkind IN ('r', 'v', 'f', 'p')
+ AND relkind IN ('r', 'v', 'f', 'p', 'm')
  ) x,
  pg_namespace nc,
  pg_authid u_grantor,
@@ -1868,7 +1868,7 @@ CREATE VIEW table_privileges AS
  ) AS grantee (oid, rolname)
 
 WHERE c.relnamespace = nc.oid
-  AND c.relkind IN ('r', 'v', 'p')
+  AND c.relkind IN ('r', 'v', 'f', 'p', 'm')
   AND c.grantee = grantee.oid
   AND c.grantor = u_grantor.oid
   AND c.prtype IN ('INSERT', 'SELECT', 'UPDATE', 'DELETE', 'TRUNCATE', 'REFERENCES', 'TRIGGER')



Re: [HACKERS] POC: Sharing record typmods between backends

2017-08-11 Thread Thomas Munro
Hi,

Please find attached a new patch series.  I apologise in advance for
0001 and note that the patchset now weighs in at ~75kB compressed.
Here are my in-line replies to your two reviews:

On Tue, Jul 25, 2017 at 10:09 PM, Andres Freund  wrote:
> It does concern me that we're growing yet another somewhat different
> hashtable implementation. Yet I don't quite see how we could avoid
> it. dynahash relies on proper pointers, simplehash doesn't do locking
> (and shouldn't) and also relies on pointers, although to a much lesser
> degree.  All the open coded tables aren't a good match either.  So I
> don't quite see an alternative, but I'd love one.

Yeah, I agree.  To deal with data structures with different pointer
types, locking policy, inlined hash/eq functions etc, perhaps there is
a way we could eventually do 'policy based design' using the kind of
macro trickery you started where we generate N different hash table
variations but only have to maintain source code for one chaining hash
table implementation?  Or perl scripts that effectively behave as a
cfront^H^H^H nevermind.  I'm not planning to investigate that for this
cycle.

>
> diff --git a/src/backend/lib/dht.c b/src/backend/lib/dht.c
> new file mode 100644
> index 000..2fec70f7742
> --- /dev/null
> +++ b/src/backend/lib/dht.c
>
> FWIW, not a big fan of dht as a filename (nor of dsa.c). For one DHT
> usually refers to distributed hash tables, which this is not, and for
> another the abbreviation is so short it's not immediately
> understandable, and likely to conflict further.  I think it'd possibly be
> ok to have dht as symbol prefixes, but rename the file to be longer.

OK.  Now it's ds_hash_table.{c,h}, where "ds" stands for "dynamic
shared".  Better?  If we were to do other data structures in DSA
memory they could follow that style: ds_red_black_tree.c, ds_vector.c,
ds_deque.c etc and their identifier prefix would be drbt_, dv_, dd_
etc.

Do you want to see a separate patch to rename dsa.c?  Got a better
name?  You could have spoken up earlier :-)  It does sound a bit
like the thing from crypto or perhaps a scary secret government
department.

> + * To deal with currency, it has a fixed size set of partitions, each of 
> which
> + * is independently locked.
>
> s/currency/concurrency/ I presume.

Fixed.

> + * Each bucket maps to a partition; so insert, find
> + * and iterate operations normally only acquire one lock.  Therefore, good
> + * concurrency is achieved whenever they don't collide at the lock partition
>
> s/they/operations/?

Fixed.

> + * level.  However, when a resize operation begins, all partition locks must
> + * be acquired simultaneously for a brief period.  This is only expected to
> + * happen a small number of times until a stable size is found, since growth 
> is
> + * geometric.
>
> I'm a bit doubtful that we need partitioning at this point, and that it
> doesn't actually *degrade* performance for your typmod case.

Yeah, partitioning not needed for this case, but this is supposed to
be more generally useful.  I thought about making the number of
partitions a construction parameter, but it doesn't really hurt does
it?

> + * Resizing is done incrementally so that no individual insert operation pays
> + * for the potentially large cost of splitting all buckets.
>
> I'm not sure this is a reasonable tradeoff for the use-case suggested so
> far, it doesn't exactly make things simpler. We're not going to grow
> much.

Yeah, designed to be more generally useful.  Are you saying you would
prefer to see the DHT patch split into an initial submission that does
the simplest thing possible, so that the unlucky guy who causes the
hash table to grow has to do all the work of moving buckets to a
bigger hash table?  Then we could move the more complicated
incremental growth stuff to a later patch.
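
For illustration, incremental resizing usually keeps the old bucket
array around and lets each insert migrate a small, bounded number of
buckets into the new array (a generic sketch with made-up names; not the
actual dht.c code):

#define MIGRATE_BUCKETS_PER_INSERT 2

static void
migrate_some_buckets(dht_table *t)
{
    int     n = MIGRATE_BUCKETS_PER_INSERT;

    while (t->old_buckets != NULL && n-- > 0)
    {
        /* rehash every entry of one old bucket into the new array */
        move_bucket(t, t->next_bucket_to_move++);
        if (t->next_bucket_to_move == t->old_nbuckets)
        {
            free_old_bucket_array(t);   /* migration complete */
            t->old_buckets = NULL;
        }
    }
}

Each insert thus pays a small constant amount of migration work, and no
single caller absorbs the full cost of splitting all buckets.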

> +/* The opaque type used for tracking iterator state. */
> +struct dht_iterator;
> +typedef struct dht_iterator dht_iterator;
>
> Isn't it actually the iterator state? Rather than tracking it? Also, why
> is it opaque given you're actually defining it below? Guess you'd moved
> it at some point.

Improved comment.  The iterator state is defined below in the .h, but
with a warning that client code mustn't access it; it exists in the
header only because it's very useful to be able to put dht_iterator on
the stack, which requires the client code to have its definition, but I
want to reserve the right to change it arbitrarily in the future.

> +/*
> + * The set of parameters needed to create or attach to a hash table.  The
> + * members tranche_id and tranche_name do not need to be initialized when
> + * attaching to an existing hash table.
> + */
> +typedef struct
> +{
> +   Size key_size;      /* Size of the key (initial bytes of entry) */
> +   Size entry_size;    /* Total size of entry */
>
> Let's use size_t, like we kind of concluded in the thread you 

Re: [HACKERS] SCRAM protocol documentation

2017-08-11 Thread Álvaro Hernández Tortosa



On 11/08/17 03:57, Peter Eisentraut wrote:

> The SCRAM protocol documentation
> (https://www.postgresql.org/docs/devel/static/sasl-authentication.html)
> states
>
> "To avoid confusion, the client should use pg_same_as_startup_message as
> the username in the client-first-message."
>
> However, the client implementation in libpq doesn't actually do that, it
> sends an empty string for the user name.  I find no other reference to
> "pg_same_as_startup_message" in the sources.  Should the documentation
> be updated?
>
> Relatedly, the SCRAM specification doesn't appear to allow omitting the
> user name in this manner.  Why don't we just send the actual user name,
> even though it's redundant with the startup message?



Hi Peter.

You are absolutely right, I was also surprised by this when I was 
doing the JDBC implementation. Actually I chose to send an asterisk 
("*"), see 
https://github.com/pgjdbc/pgjdbc/pull/842/files#diff-c52128420a3882543ffa20a48964abe4R88, 
as it is shorter than the username (likely).


I don't like the empty string either, and actually the library
built for JDBC and used in pgjdbc explicitly disallows the use
of an empty username.


If there's a clear meaning about ignoring the user here, why not 
settle on something like the "*"? It's not going to change the world 
sending a few bytes less on initialization, but I guess it doesn't hurt 
either...



Álvaro

--

Álvaro Hernández Tortosa

