Re: [HACKERS] make async slave to wait for lsn to be replayed

2017-10-31 Thread Ants Aasma
On Mon, Oct 30, 2017 at 7:25 PM, Ivan Kartyshov
<i.kartys...@postgrespro.ru> wrote:
> It sounds reasonable. I can offer the following version.
>
> WAIT  LSN  lsn_number;
> WAIT  LSN  lsn_number  TIMEOUT delay;
> WAIT  LSN  lsn_number  INFINITE;
> WAIT  LSN  lsn_number  NOWAIT;
>
>
> WAIT  [token]  wait_value [option];
>
> token - [LSN, TIME | TIMESTAMP | CSN | XID]
> option - [TIMEOUT | INFINITE | NOWAIT]
>
> Ready to listen to your suggestions.

Making the interface more specific about the mechanism is not what I
had in mind; quite the opposite. I would like to see the interface
describe the desired effect of the wait.

Stepping back for a moment: as I understand it, the reason we want to
wait is to prevent observing database state going backwards. To limit
the amount of waiting needed, we tell the database what we have
observed, for example "I just observed my transaction commit" or "the
last time I observed state was X", and then have the database provide
us with a snapshot that is causally dependent on those states. This
does not give us linearizability; for that we would still need at the
very least serializable transactions on the standby. But it seems to
provide a form of sequential consistency, which (if it can be proved
to hold) makes reasoning about concurrency a lot nicer.

For lack of a better proposal I would like something along the lines of:

WAIT FOR state_id[, state_id] [ OPTIONS (..)]

And to get the tokens maybe a function pg_last_commit_state().

Optionally, to provide read-to-read causality, pg_snapshot_state()
could return for example replay_lsn at the start of the current
transaction. This makes sure that things don't just appear and
disappear when load balancing across many standby servers.

The WAIT FOR semantics would be to ensure that the next snapshot is
causally dependent on (happens after) each of the specified observed
states.
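
For the plain LSN case, the standby-side wait could be as simple as
the following sketch. The function name and the fixed 100ms poll are
mine, not from any patch; GetXLogReplayRecPtr() and the latch calls
are existing backend APIs:

/* Hypothetical helper: block until the given LSN has been replayed. */
static void
WaitForReplayLSN(XLogRecPtr target)
{
	while (GetXLogReplayRecPtr(NULL) < target)
	{
		CHECK_FOR_INTERRUPTS();
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT,
						 100L /* ms */, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
	}
}

A real implementation would presumably have the startup process wake
waiters as replay advances instead of polling like this.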

The state_id could simply be an LSN, or, to allow for future
expansion, something like 'lsn:/1234'. An example extension would be
to allow waiting on xids. On the master that would just be a ShareLock
on the transaction id; on a standby it would wait for the commit or
rollback record for that transaction to be replayed.

Robert made a good point that people will still rely on the token
being an LSN, but perhaps they will be slightly less angry when we
explicitly tell them that this might change in the future.

Regards,
Ants Aasma





Re: [HACKERS] make async slave to wait for lsn to be replayed

2017-10-26 Thread Ants Aasma
On Mon, Oct 23, 2017 at 12:29 PM, Ivan Kartyshov
<i.kartys...@postgrespro.ru> wrote:
> Ants Aasma wrote on 2017-09-26 13:00:
>>
>> Exposing this interface as WAITLSN will encode that visibility order
>> matches LSN order. This removes any chance of fixing, for example,
>> the visibility order of async vs. sync transactions. Just renaming it
>> so the token is an opaque commit visibility token that just happens
>> to be an LSN would still allow for progress in transaction
>> management. For example, making PostgreSQL distributed will likely
>> want timestamp and/or vector clock based visibility rules.
>
>
> I'm sorry I did not understand exactly what you meant.
> Please explain this in more detail.

Currently, transactions on the master become visible when the xid is
removed from the proc array. This depends on the order of acquiring
ProcArrayLock, which can happen in a different order from inserting
the commit record into the WAL. On the standby, by contrast,
transactions become visible in the same order in which their commit
records appear in the WAL. The difference can be quite large when
transactions use different values for synchronous_commit. Fixing this
is not easy, but we may want to do it someday; IIRC the CSN threads
contained more discussion on this topic. If we do fix it, it seems
likely that what we need to wait on is not an LSN but some other kind
of value. If the UI were something like "WAITVISIBILITY token", then
we could later change the token to something other than an LSN.

Regards,
Ants Aasma




Re: [HACKERS] Discussion on missing optimizations

2017-10-09 Thread Ants Aasma
On Sun, Oct 8, 2017 at 7:24 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Petr Jelinek <petr.jeli...@2ndquadrant.com> writes:
>> Okay, that makes sense, thanks for explanation. Your patch is the way to
>> go then.
>
> Hearing no further comment, pushed.  Thanks for reviewing it.

The tautological col = col comparison is occasionally used as a
planner "hint" to correct for row count overestimates. Not a great
solution, but PostgreSQL doesn't really have a better way to guide the
planner. Those queries will now have to do something else, like
col = col + 0, which still works.

Regards,
Ants Aasma




Re: [HACKERS] JIT compiling - v4.0

2017-10-04 Thread Ants Aasma
On Wed, Oct 4, 2017 at 9:48 AM, Andres Freund <and...@anarazel.de> wrote:
> Here's an updated version of the patchset.  There's some substantial
> changes here, but it's still very obviously very far from committable as
> a whole. There's some helper commits that are simple and independent
> enough to be committable earlier on.

Looks pretty impressive already.

I wanted to take it for a spin, but got errors about the following
symbols being missing:

LLVMOrcUnregisterPerf
LLVMOrcRegisterGDB
LLVMOrcRegisterPerf
LLVMOrcGetSymbolAddressIn
LLVMLinkModules2Needed

As far as I can tell these are not in mainline LLVM. Is there a branch
or patchset of LLVM available somewhere that I need in order to use
this?

Regards,
Ants Aasma




Re: [HACKERS] make async slave to wait for lsn to be replayed

2017-09-26 Thread Ants Aasma
On Tue, Aug 15, 2017 at 5:00 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> On 22 March 2017 at 01:17, Robert Haas <robertmh...@gmail.com> wrote:
>>
>> On Sun, Mar 12, 2017 at 10:20 PM, Thomas Munro
>> <thomas.mu...@enterprisedb.com> wrote:
>> > Maybe someone can think of a clever way for an extension to insert a
>> > wait for a user-supplied LSN *before* acquiring a snapshot so it can
>> > work for the higher levels, or maybe the hooks should go into core
>> > PostgreSQL so that the extension can exist as an external project not
>> > requiring a patched PostgreSQL installation, or maybe this should be
>> > done with new core syntax that extends transaction commands.  Do other
>> > people have views on this?
>>
>> IMHO, trying to do this using a function-based interface is a really
>> bad idea for exactly the reasons you mention.  I don't see why we'd
>> resist the idea of core syntax here; transactions are a core part of
>> PostgreSQL.
>>
>> There is, of course, the question of whether making LSNs such a
>> user-visible thing is a good idea in the first place, but that's a
>> separate question from issue of what syntax for such a thing is best.
>
>
> (I know this is old, but):
>
> That ship sailed a long time ago unfortunately, they're all over
> pg_stat_replication and pg_replication_slots and so on. They're already
> routinely used for monitoring replication lag in bytes, waiting for a peer
> to catch up, etc.

(continuing the trend of resurrecting old topics)

Exposing this interface as WAITLSN will encode that visibility order
matches LSN order. This removes any chance of fixing, for example,
the visibility order of async vs. sync transactions. Just renaming it
so the token is an opaque commit visibility token that just happens
to be an LSN would still allow for progress in transaction
management. For example, making PostgreSQL distributed will likely
want timestamp and/or vector clock based visibility rules.

Regards,
Ants Aasma




Re: [HACKERS] Hooks to track changed pages for backup purposes

2017-09-13 Thread Ants Aasma
On Thu, Aug 31, 2017 at 9:02 AM, Andrey Borodin <x4...@yandex-team.ru> wrote:
> When we have accumulated diff blocknumbers for most of segments we can 
> significantly speed up method of WAL scanning. If we have blocknumbers for 
> all segments we can skip WAL scanning at all.

Have you measured whether the WAL scanning is actually a significant
issue? As a quick experiment I hacked up pg_waldump to just dump block
references to stdout in binary format. It scanned 2.8GB of WAL in 3.17
seconds, outputting 9.3M block refs per second. The WAL was generated
with pgbench, synchronous commit off, using 4 cores for 10 minutes,
making the ratio of work between generating WAL and parsing it about
750:1.
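
For reference, the guts of that hack are tiny. A sketch of the
extraction loop against the xlogreader API (as of roughly PostgreSQL
10; the reader allocation and the page-read callback, which pg_waldump
already provides, are elided):

XLogRecord *record;
char	   *errormsg;
XLogRecPtr	first_record = ...;	/* starting position, elided */

while ((record = XLogReadRecord(xlogreader, first_record, &errormsg)) != NULL)
{
	int			block_id;

	for (block_id = 0; block_id <= xlogreader->max_block_id; block_id++)
	{
		RelFileNode rnode;
		ForkNumber	forknum;
		BlockNumber blkno;

		/* emit each block reference in raw binary form */
		if (XLogRecGetBlockTag(xlogreader, block_id, &rnode, &forknum, &blkno))
			fwrite(&blkno, sizeof(blkno), 1, stdout);
	}
	first_record = InvalidXLogRecPtr;	/* continue with the next record */
}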

Regards,
Ants Aasma




Re: [HACKERS] emergency outage requiring database restart

2017-08-10 Thread Ants Aasma
On Wed, Jan 18, 2017 at 4:33 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> On Wed, Jan 18, 2017 at 4:11 AM, Ants Aasma <ants.aa...@eesti.ee> wrote:
>> On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
>>> Still getting checksum failures.   Over the last 30 days, I see the
>>> following.  Since enabling checksums FWICT none of the damage is
>>> permanent and rolls back with the transaction.   So creepy!
>>
>> The checksums still only differ in least significant digits which
>> pretty much means that there is a block number mismatch. So if you
>> rule out filesystem not doing its job correctly and transposing
>> blocks, it could be something else that is resulting in blocks getting
>> read from a location that happens to differ by a small multiple of
>> page size. Maybe somebody is racily mucking with table fd's between
>> seeking and reading. That would explain the issue disappearing after a
>> retry.
>>
>> Maybe you can arrange for the RelFileNode and block number to be
>> logged for the checksum failures and check what the actual checksums
>> are in data files surrounding the failed page. If the requested block
>> number contains something completely else, but the page that follows
>> contains the expected checksum value, then it would support this
>> theory.
>
> will do.   Main challenge is getting hand compiled server to swap in
> so that libdir continues to work.  Getting access to the server is
> difficult as is getting a maintenance window.  I'll post back ASAP.

As a new data point, we just had a customer with an issue that I
think might be related. The issue was reasonably repeatable by running
a report on the standby system. It manifested itself first as "could
not open relation" and/or "column is not in index" errors, followed a
few minutes later by a PANIC from the startup process due to
"specified item offset is too large", "invalid max offset number" or
"page X of relation base/16384/1259 is uninitialized". I took a look
at the xlog dump and it was completely fine. For instance, in the
"specified item offset is too large" case there was an INSERT_LEAF
redo record inserting the preceding offset just a couple hundred
kilobytes back. Restarting the server sometimes successfully applied
the offending WAL, and sometimes it failed with other corruption
errors. The offending relations were always pg_class or
pg_class_oid_index. Replacing plsh functions with dummy plpgsql
functions made the problem go away; reintroducing the plsh functions
made it reappear.

The only explanation I came up with that is consistent with the
symptoms is that a page got thrown out of shared_buffers between the
two xlog records referencing it (shared_buffers was the default 128MB)
and was then read back by a backend process, where in the time between
the FileSeek and FileRead calls in mdread a subprocess mucked with the
fd's offset so that a different page than intended got read in. Or
basically the same race condition, but on the write side. Maybe
somebody else has a better imagination than me...
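
For what it's worth, the race being hypothesized is inherent to the
seek-then-read pattern; in plain POSIX terms (a sketch, not the actual
md.c code):

/* racy: a child process sharing the fd can move the file offset
 * between these two calls, so the wrong block gets read */
lseek(fd, (off_t) blocknum * BLCKSZ, SEEK_SET);
read(fd, buffer, BLCKSZ);

/* atomic alternative: the offset is a parameter of the call itself,
 * so a concurrent seek on the shared fd cannot redirect the read */
pread(fd, buffer, BLCKSZ, (off_t) blocknum * BLCKSZ);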

Regards,
Ants Aasma




Re: [HACKERS] Reducing pg_ctl's reaction time

2017-06-28 Thread Ants Aasma
On Wed, Jun 28, 2017 at 8:31 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andres Freund <and...@anarazel.de> writes:
>> On 2017-06-27 14:59:18 -0400, Tom Lane wrote:
>>> However, it's certainly arguable that this is too much change for an
>>> optional post-beta patch.
>
>> Yea, I think there's a valid case to be made for that. I'm still
>> inclined to go along with this, it seems we're otherwise just adding
>> complexity we're going to remove in a bit again.
>
> I'm not hearing anyone speaking against doing this now, so I'm going
> to go ahead with it.

+1 for going for it. Besides pg_ctl, it would also help cluster
management software. In Patroni we basically reimplement pg_ctl to get
at the started postmaster's pid, to detect a crash during postmaster
startup earlier, and to be able to reliably cancel a start. Getting
the current state from the pid file sounds better than having a loop
poke the server with pg_isready.

To introduce feature creep, I would like to see more detailed states
than proposed here. Specifically I'm interested in knowing when
PM_WAIT_BACKENDS ends.

When we lose quorum we restart PostgreSQL as a standby. We use a
regular fast shutdown, but that can take a long time due to the
shutdown checkpoint. The leader lease may run out during this time, so
we would have to escalate to an immediate shutdown or have a watchdog
fence the node. If we knew that no user backends were left, we could
let the shutdown checkpoint complete at leisure without risk of split
brain.

Regards,
Ants Aasma




Re: [HACKERS] WIP: Data at rest encryption

2017-06-14 Thread Ants Aasma
On Wed, Jun 14, 2017 at 6:26 PM, Bruce Momjian <br...@momjian.us> wrote:
> Are you checking the CPU type or if AES instructions are enabled on the
> CPU? I ask this because I just realized in researching my new TLS talk
> that my BIOS defaults to AES instructions disabled, and I had to
> manually enable it.

There is zero code for that now, but the plan was to check via the
CPUID instruction. My understanding is that it should report what is
currently enabled on the CPU. I will double-check when actually
writing the code for the check.
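
For the x86 case, the check could look something like this sketch
(using GCC's cpuid.h; AES-NI support is reported in bit 25 of ECX for
CPUID leaf 1, and my understanding is that firmware-disabled AES
clears this flag as well):

#include <cpuid.h>
#include <stdbool.h>

/* sketch: true if the CPU currently reports AES-NI as available */
static bool
aes_ni_available(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return false;
	return (ecx & bit_AES) != 0;
}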

>> > I anticipate that one of the trickier problems here will be handling
>> > encryption of the write-ahead log.  Suppose you encrypt WAL a block at
>> > a time.  In the current system, once you've written and flushed a
>> > block, you can consider it durably committed, but if that block is
>> > encrypted, this is no longer true.  A crash might tear the block,
>> > making it impossible to decrypt.  Replay will therefore stop at the
>> > end of the previous block, not at the last record actually flushed as
>> > would happen today.
>>
>> My patch is currently doing a block at a time for WAL. The XTS mode
>
> Uh, how are you writing partial writes to the WAL.  I assume you are
> doing a streaming cipher so you can write in increments, right?

We were doing 8kB page-aligned writes to the WAL anyway. I just
encrypt the block before it gets written out.

>> I think we need to require wal_log_hints=on when encryption is
>> enabled. Currently I have not considered tearing on CLOG bits. Other
>> SLRUs probably have similar issues. I need to think a bit about how to
>> solve that.
>
> I am not sure if clog even needs to be encrypted.

Me neither, but it currently is, and it looks like that's broken in a
"silently corrupts your data" way in the face of torn writes. Using
OFB mode (XOR the plaintext with a pseudorandom stream to get the
ciphertext) looks like it might help here, if other approaches fail.
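
The attraction of OFB here is that encryption degenerates to an XOR
with a keystream that does not depend on the data, so corruption stays
bit-aligned. As a sketch (keystream generation elided; in OFB it comes
from repeatedly encrypting the IV):

/* OFB: ciphertext[i] depends only on plaintext[i] and keystream[i].
 * A torn write corrupts exactly the bytes that were torn, and a
 * single status bit set on disk stays a single-bit difference after
 * decryption, matching the unencrypted torn-write behavior that
 * SLRU bit-setting already tolerates. */
for (i = 0; i < BLCKSZ; i++)
	ciphertext[i] = plaintext[i] ^ keystream[i];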

Regards,
Ants Aasma




Re: [HACKERS] WIP: Data at rest encryption

2017-06-14 Thread Ants Aasma
On Tue, Jun 13, 2017 at 6:35 PM, Robert Haas <robertmh...@gmail.com> wrote:
> Of course, what would be even more useful is fine-grained encryption -
> encrypt these tables (and the corresponding indexes, toast tables, and
> WAL records related to any of that) with this key, encrypt these other
> tables (and the same list of associated stuff) with this other key,
> and leave the rest unencrypted.  The problem with that is that you
> probably can't run recovery without all of the keys, and even on a
> clean startup there would be a good deal of engineering work involved
> in refusing access to tables whose key hadn't been provided yet.

That's pretty much the reason why I decided to skip anything more
complicated for now. Anything that would be able to run recovery
without knowing the encryption key looks like an order of magnitude
more complicated to implement.

> Performance is likely to be poor on large databases,
> because every time a page transits between shared_buffers and the
> buffer cache we've got to en/decrypt, but as long as it's only poor
> for the people who opt into the feature I don't see a big problem with
> that.

It would make sense to tune the database with large shared buffers if
encryption is enabled. That should ensure that most traffic in and out
of shared buffers is going to disk anyway. As for performance, I have
a prototype assembly implementation of AES that does 3GB/s/core on my
laptop. If we add that behind a CPUID check, the overhead should be
quite reasonable.

> I anticipate that one of the trickier problems here will be handling
> encryption of the write-ahead log.  Suppose you encrypt WAL a block at
> a time.  In the current system, once you've written and flushed a
> block, you can consider it durably committed, but if that block is
> encrypted, this is no longer true.  A crash might tear the block,
> making it impossible to decrypt.  Replay will therefore stop at the
> end of the previous block, not at the last record actually flushed as
> would happen today.

My patch is currently doing a block at a time for WAL. The XTS mode
used for encryption has the useful property that blocks sharing an
identical unencrypted prefix also share an identical prefix when
encrypted. It requires that the tearing is 16-byte aligned, but I
think that is true for pretty much all storage systems. That property
of course has security downsides, but for table/index storage we have
a nonce in the form of the LSN in the page header, eliminating the
issue.
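
The prefix property is easy to see with any XTS implementation; for
example, this standalone sketch against OpenSSL's EVP interface (dummy
key and tweak values, error checking elided, build with -lcrypto):

#include <assert.h>
#include <string.h>
#include <openssl/evp.h>

int main(void)
{
	unsigned char key[64] = {1};	/* AES-256-XTS takes two 256-bit keys */
	unsigned char tweak[16] = {2};	/* per-block tweak, e.g. block number */
	unsigned char a[8192] = {0}, b[8192] = {0}, ca[8192], cb[8192];
	int			len;
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

	memset(b + 4096, 0xff, 4096);	/* same first half, different tail */

	EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
	EVP_EncryptUpdate(ctx, ca, &len, a, sizeof(a));
	EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak);
	EVP_EncryptUpdate(ctx, cb, &len, b, sizeof(b));
	EVP_CIPHER_CTX_free(ctx);

	/* identical 16B-aligned plaintext prefix => identical ciphertext prefix */
	assert(memcmp(ca, cb, 4096) == 0);
	assert(memcmp(ca + 4096, cb + 4096, 4096) != 0);
	return 0;
}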

> So, your synchronous_commit suddenly isn't.  A
> similar problem will occur any other page where we choose not to
> protect against torn pages using full page writes.  For instance,
> unless checksums are enabled or wal_log_hints=on, we'll write a data
> page where a single bit has been flipped and assume that the bit will
> either make it to disk or not; the page can't really be torn in any
> way that hurts us.  But with encryption that's no longer true, because
> the hint bit will turn into much more than a single bit flip, and
> rereading that page with half old and half new contents will be the
> end of the world (TM).  I don't know off-hand whether we're
> protecting, say, CLOG page writes with FPWs: because setting a couple
> of bits is idempotent and doesn't depend on the existing page
> contents, we might not need it currently, but with encryption, every
> bit in the page depends on every other bit in the page, so we
> certainly would.  I don't know how many places we've got assumptions
> like this baked into the system, but I'm guessing there are a bunch.

I think we need to require wal_log_hints=on when encryption is
enabled. Currently I have not considered tearing on CLOG bits. Other
SLRUs probably have similar issues. I need to think a bit about how to
solve that.

Regards,
Ants Aasma




Re: [HACKERS] WIP: Data at rest encryption

2017-06-12 Thread Ants Aasma
On Mon, Jun 12, 2017 at 10:38 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Jun 13, 2016 at 11:07 AM, Peter Eisentraut
> <peter.eisentr...@2ndquadrant.com> wrote:
>> On 6/7/16 9:56 AM, Ants Aasma wrote:
>>>
>>> Similar things can be achieved with filesystem level encryption.
>>> However this is not always optimal for various reasons. One of the
>>> better reasons is the desire for HSM based encryption in a storage
>>> area network based setup.
>>
>> Could you explain this in more detail?
>
> I don't think Ants ever responded to this point.
>
> I'm curious whether this is something that is likely to be pursued for
> PostgreSQL 11.

Yes, the plan is to pick it up again, Real Soon Now(tm). There are a
couple of loose ends for stuff that should be encrypted but in the
current state of the patch isn't yet (off the top of my head, logical
decoding and pg_stat_statements write some files). The code handling
keys could really take better precautions, as Peter pointed out in
another e-mail. And I expect there to be a bunch of polishing work to
make the APIs as good as they can be.

To answer Peter's question about HSMs: many enterprise deployments sit
on top of shared storage systems. For regulatory reasons, or to limit
the security clearance required of storage administrators, the data on
shared storage should be encrypted. For there to be any point to this
endeavor, the key needs to be stored somewhere else, and this is where
hardware security modules come in. They are basically hardware
key-storage appliances that can either output the key when requested
or, for higher security, hold onto the key and perform
encryption/decryption on behalf of the user. The patch enables the
user to run a custom shell command to go and fetch the key from the
HSM, for example using the KMIP protocol. Or a motivated person could
write an extension that implements the encryption hooks to delegate
encryption/decryption of blocks to an HSM.
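
The shell-command variant is the simple one; a sketch of what the
server-side fetch could look like (the function name and the command
being configured are hypothetical):

/* sketch: run the configured command, read a 256-bit key from stdout */
static void
fetch_encryption_key(const char *key_command, unsigned char key[32])
{
	FILE	   *fh = popen(key_command, "r");
	size_t		nread = 0;

	if (fh != NULL)
	{
		nread = fread(key, 1, 32, fh);
		pclose(fh);
	}
	if (nread != 32)
		ereport(FATAL,
				(errmsg("could not read encryption key using \"%s\"",
						key_command)));
}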

Fundamentally there doesn't seem to be a big benefit to implementing
the encryption at the PostgreSQL level instead of the filesystem
level. The patch doesn't take any real advantage of the higher-level
knowledge of the system, nor do I see much possibility for it to do
so. The main benefit for us is that it's much easier to get a
PostgreSQL-based solution deployed.

I'm curious whether the community thinks this is a feature worth
having, even considering that security experts would classify this
kind of encryption as a checkbox feature.

Regards,
Ants Aasma




Re: [HACKERS] WIP: Faster Expression Processing v4

2017-03-25 Thread Ants Aasma
On Sun, Mar 26, 2017 at 12:22 AM, Andres Freund <and...@anarazel.de> wrote:
>> At least with current gcc (6.3.1 on Fedora 25) at -O2,
>> what I see is multiple places jumping to the same indirect jump
>> instruction :-(.  It's not a total disaster: as best I can tell, all the
>> uses of EEO_JUMP remain distinct.  But gcc has chosen to implement about
>> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
>> instructions that increment the "op" register and then do an indirect
>> jump :-(.
>
> Yea, I see some of that too - "usually" when there's more than just the
> jump in common.  I think there's some gcc variables that influence this
> (min-crossjump-insns (5), max-goto-duplication-insns (8)).  Might be
> worthwhile experimenting with setting them locally via a pragma or such.
> I think Ants wanted to experiment with that, too.

I haven't had the time to research this properly, but initial tests
show that with GCC 6.2, adding

#pragma GCC optimize ("no-crossjumping")

stops the merging of the op tail jumps.

Some quick and dirty benchmarking suggests that the benefit for the
interpreter is about 15% (a 5% speedup on a workload that spends 1/3
of its time in ExecInterpExpr). My idea of prefetching
op->resnull/resvalue into local vars before the indirect jump is
somewhere between a tiny benefit and no effect, certainly not worth
introducing extra complexity. Clang 3.8 does the correct thing out of
the box and is a couple of percent faster than GCC with the pragma.
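
To make that concrete, here is a toy computed-goto dispatcher in the
same style as ExecInterpExpr (a self-contained sketch, not the actual
executor code). The point of the pragma is that each opcode keeps its
own indirect-jump tail, letting the branch predictor learn per-opcode
successor patterns:

#pragma GCC optimize ("no-crossjumping")

typedef enum { OP_ADD, OP_MUL, OP_DONE } OpCode;
typedef struct { OpCode opcode; int arg; } Op;

static int
run(const Op *op)
{
	static const void *dispatch[] = { &&do_add, &&do_mul, &&do_done };
	int			acc = 1;

/* without the pragma, GCC merges most of these tails into one shared
 * indirect jump; with it, each opcode body ends in its own "goto *" */
#define DISPATCH() goto *dispatch[(++op)->opcode]

	goto *dispatch[op->opcode];
do_add:	acc += op->arg; DISPATCH();
do_mul:	acc *= op->arg; DISPATCH();
do_done: return acc;
}

Called on the program {{OP_ADD, 2}, {OP_MUL, 3}, {OP_DONE, 0}} this
returns 9.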

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 10:30 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote:
>> On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian <br...@momjian.us> wrote:
>> > Oh, that's why we will hopefully eventually change the page checksum
>> > algorithm to use the special CRC32 instruction, and set a new checksum
>> > version --- got it.  I assume there is currently no compile-time way to
>> > do this.
>>
>> Using CRC32 as implemented now for the WAL would be significantly
>> slower than what we have now due to instruction latency. Even the best
>> theoretical implementation using the CRC32 instruction would still be
>> about the same speed as what we have now. I haven't seen anybody
>> working on swapping out the current algorithm. And I don't really see
>> a reason to, it would introduce a load of headaches for no real gain.
>
> Uh, I am confused.  I thought you said we were leaving some performance
> on the table.  What is that?   I thought CRC32 was SSE4.1.  Why is CRC32
> good for the WAL but bad for the page checksums?  What about the WAL
> page images?

The page checksum algorithm was designed to take advantage of CPUs
that provide vectorized 32-bit integer multiplication. On x86 this was
introduced with the SSE4.1 extensions. This means that by default we
can't take advantage of the design. The code is written so that
compiler auto-vectorization works on it, so only the appropriate
compilation flags are needed to build a version that does use vector
instructions. However, to enable this in generic builds, a runtime
switch between different levels of vectorization support is needed.
This is what is leaving the performance on the table.
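
A sketch of what such a runtime switch could look like using GCC's
per-function target attribute and CPU detection (the checksum body
below is a vectorizable stand-in, not the real algorithm; compile with
-O2 -ftree-vectorize):

#include <stddef.h>
#include <stdint.h>

/* stand-in inner loop; 32-bit multiplies need pmulld (SSE4.1) to vectorize */
static inline uint32_t
sum_loop(const uint32_t *d, size_t n)
{
	uint32_t	s = 0;
	size_t		i;

	for (i = 0; i < n; i++)
		s ^= d[i] * 0x9E3779B9u;
	return s;
}

__attribute__((target("sse4.1")))
static uint32_t sum_sse41(const uint32_t *d, size_t n) { return sum_loop(d, n); }

static uint32_t sum_generic(const uint32_t *d, size_t n) { return sum_loop(d, n); }

uint32_t
checksum_block(const uint32_t *d, size_t n)
{
	static uint32_t (*impl) (const uint32_t *, size_t) = NULL;

	if (impl == NULL)		/* detect the CPU once, then call through */
		impl = __builtin_cpu_supports("sse4.1") ? sum_sse41 : sum_generic;
	return impl(d, n);
}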

The page checksum algorithm we have is extremely fast, memcpy fast.
Even without vectorization it is right up there with MurmurHash3a and
xxHash. With vectorization it's 4x faster. And it works this fast on
most modern CPUs, not only Intel. The downside is that it only works
well for large blocks, and only at a fixed power-of-2 size with the
current implementation. WAL page images have the page hole removed, so
they can't easily take advantage of this.

That said, I haven't really seen either the hardware accelerated CRC32
calculation nor the non-vectorized page checksum take a noticeable
amount of time on real world workloads. The benchmarks presented in
this thread seem to corroborate this observation.

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 9:49 PM, Jim Nasby  wrote:
> On 2/24/17 12:30 PM, Tomas Vondra wrote:
>>
>> In any case, we can't just build x86-64 packages with compile-time
>> SSE4.1 checks.
>
>
> Dumb question... since we're already discussing llvm for the executor, would
> that potentially be an option here? AIUI that also opens the possibility of
> using the GPU as well.

Just transferring the block to the GPU would be slower than what we
have now. Theoretically LLVM could be used to JIT the checksum
calculation, but just precompiling a couple of versions and switching
between them at runtime would be simpler and would give the same
speedup.

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian <br...@momjian.us> wrote:
> Oh, that's why we will hopefully eventually change the page checksum
> algorithm to use the special CRC32 instruction, and set a new checksum
> version --- got it.  I assume there is currently no compile-time way to
> do this.

Using CRC32 as implemented now for the WAL would be significantly
slower than what we have now due to instruction latency. Even the best
theoretical implementation using the CRC32 instruction would still be
about the same speed as what we have now. I haven't seen anybody
working on swapping out the current algorithm, and I don't really see
a reason to; it would introduce a load of headaches for no real gain.

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 10:02 PM, Bruce Momjian <br...@momjian.us> wrote:
> Uh, as far as I know, the best you are going to get from llvm is
> standard assembly, while the SSE4.1 instructions use special assembly
> instructions, so they would be faster, and in a way they are a GPU built
> into CPUs.

Both LLVM and GCC are capable of compiling the code we have into a
vectorized loop using SSE4.1 or AVX2 instructions, given the proper
compilation flags. This is exactly what was giving the speedup in the
test I showed in my e-mail.

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-02-24 Thread Ants Aasma
On Fri, Feb 24, 2017 at 7:47 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> > It might be worth looking into using the CRC CPU instruction to reduce this
>> > overhead, like we do for the WAL checksums. Since that is a different
>> > algorithm it would be a compatibility break and we would need to support 
>> > the
>> > old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table. A
>> low hanging fruit would be to do CPU detection like the CRC case and
>> enable a SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> Uh, I thought already did compile-time testing for SSE4.1 and used them
> if present.  Why do you say "with default compiles there is a lot of
> performance left on table?"

Compile-time checks don't help because the compiled binary could be
run on a different host that does not have SSE4.1 (however unlikely
that is at this point in time). A runtime check is done for WAL
checksums, which use a special CRC32 instruction. Block checksums
predate that and use a different algorithm that was picked because it
could be accelerated with vectorized execution on non-Intel
architectures as well. We just never got around to adding runtime
checks for the architecture to enable this speedup.

The attached test runs 1M iterations of the checksum about 3x faster
when compiled with SSE4.1 and vectorization, and 4x faster if AVX2 is
added into the mix.

Test:
gcc $CFLAGS -Isrc/include -DN=1000000 testchecksum.c -o testchecksum
&& time ./testchecksum

Results:
CFLAGS="-O2": 2.364s
CFLAGS="-O2 -msse4.1 -ftree-vectorize": 0.752s
CFLAGS="-O2 -mavx2 -ftree-vectorize": 0.552s

That 0.552s is 15GB/s per core on a 3 year old laptop.

Regards,
Ants Aasma
#include "postgres.h"
#include "storage/checksum_impl.h"

/* Time N checksum computations of an all-zero 8kB page. */
int main(void) {
	char page[8192] = {0};
	uint32 i, sum = 0;

	for (i = 0; i < N; i++)
		sum ^= pg_checksum_page(page, i);

	/* use the result so the loop cannot be optimized away */
	return sum & 1;
}



Re: [HACKERS] Checksums by default?

2017-01-25 Thread Ants Aasma
On Wed, Jan 25, 2017 at 8:18 PM, Robert Haas <robertmh...@gmail.com> wrote:
> Also, it's not as if there are no other ways of checking whether your
> disks are failing.  SMART, for example, is supposed to tell you about
> incipient hardware failures before PostgreSQL ever sees a bit flip.
> Surely an average user would love to get a heads-up that their
> hardware is failing even when that hardware is not being used to power
> PostgreSQL, yet many people don't bother to configure SMART (or
> similar proprietary systems provided by individual vendors).

You really can't rely on SMART to tell you about hardware failures: 1
in 4 drives fails completely with zero SMART indication [1]. And for
the 1-in-1000 annual checksum failure rate, indicators other than
system restarts had only a weak correlation [2]. And this is without
the filesystem and other OS bugs that SMART knows nothing about.

My view may be biased by mostly seeing the cases where things have
already gone wrong, but I recommend that support clients turn
checksums on unless it's known that write IO is going to be an issue.
Especially because I know that if it turns out to be a problem, I can
go in and quickly hack together a tool to help them turn it off. I do
agree that before changing the PostgreSQL default, at least some tool
to turn it off online should be included.

[1] 
https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
[2] 
https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-01-24 Thread Ants Aasma
On Tue, Jan 24, 2017 at 4:07 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Peter Geoghegan <p...@heroku.com> writes:
>> I thought that checksums went in in part because we thought that there
>> was some chance that they'd find bugs in Postgres.
>
> Not really.  AFAICS the only point is to catch storage-system malfeasance.

This matches my understanding. Actual physical media errors are caught
by lower-level checksums/error correction codes, and memory errors are
caught by ECC RAM. Checksums do very little for PostgreSQL bugs, which
leaves only filesystem and storage firmware bugs. However, the latter
are still reasonably common faults. I have seen multiple cases where,
after reviewing the corruption with a hex editor, the only reasonable
conclusion was a bug in the storage system: data shifted around by
non-page-size amounts, non-page-aligned extents that are zeroed out,
etc. Unfortunately none of those customers had checksums turned on at
the time. I feel that reacting to such errors with a non-cryptic and
easily debuggable checksum error is much better than failing with huge
bogus memory allocations, crashing, or returning bogus data. Timely
reaction to data corruption is really important for minimizing data
loss.

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-01-21 Thread Ants Aasma
On Sat, Jan 21, 2017 at 10:16 PM, Michael Banck
<michael.ba...@credativ.de> wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> On Sat, Jan 21, 2017 at 6:41 PM, Andreas Karlsson <andr...@proxel.se> wrote:
>> > It might be worth looking into using the CRC CPU instruction to reduce this
>> > overhead, like we do for the WAL checksums. Since that is a different
>> > algorithm it would be a compatibility break and we would need to support 
>> > the
>> > old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on the table. A
>> low hanging fruit would be to do CPU detection like the CRC case and
>> enable a SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> I am afraid that won't fly with most end-user packages, cause
> distributions can't just build packages on their newest machine and then
> users get SIGILL or whatever cause their 2014 server doesn't have that
> instruction, or would they still work?
>
> So you would have to do runtime detection of the CPU features, and use
> the appropriate code if they are available.  Which also makes regression
> testing harder, as not all codepaths would get exercised all the time.

Runtime detection is exactly what I had in mind. The code path would
also be the same, as at least the two most important compilers only
need a compilation flag change. And the required instruction was
introduced in 2007, so I think anybody who is concerned about
performance is covered.

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-01-21 Thread Ants Aasma
On Sat, Jan 21, 2017 at 7:39 PM, Petr Jelinek
<petr.jeli...@2ndquadrant.com> wrote:
> So in summary "postgresql.conf options are easy to change" while "initdb
> options are hard to change", I can see this argument used both for
> enabling or disabling checksums by default. As I said I would be less
> worried if it was easy to turn off, but we are not there afaik. And even
> then I'd still want benchmark first.

Adding support for disabling checksums is almost trivial, as it only
requires flipping a value in the control file. And I have a similarly
simple tool sitting around somewhere for turning on checksums while
the database is offline. FWIW, based on the customers and fellow
conference-goers I have talked to, most would gladly take the
performance hit, but not the downtime, to turn it on for an existing
database.
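
For reference, the core of such an offline tool really is just that
flip; a heavily hedged sketch, with error handling, locking and fsync
elided (run only against a cleanly shut down cluster):

/* sketch: open global/pg_control, flip the flag, recompute its CRC */
ControlFileData control;

read(fd, &control, sizeof(control));

control.data_checksum_version = 0;	/* 0 = checksums disabled */

/* pg_control carries its own CRC over the preceding fields */
INIT_CRC32C(control.crc);
COMP_CRC32C(control.crc, (char *) &control, offsetof(ControlFileData, crc));
FIN_CRC32C(control.crc);

lseek(fd, 0, SEEK_SET);
write(fd, &control, sizeof(control));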

Regards,
Ants Aasma




Re: [HACKERS] Checksums by default?

2017-01-21 Thread Ants Aasma
On Sat, Jan 21, 2017 at 6:41 PM, Andreas Karlsson <andr...@proxel.se> wrote:
> On 01/21/2017 04:48 PM, Stephen Frost wrote:
>>
>> * Fujii Masao (masao.fu...@gmail.com) wrote:
>>>
>>> If the performance overhead by the checksums is really negligible,
>>> we may be able to get rid of wal_log_hints parameter, as well.
>>
>>
>> Prior benchmarks showed it to be on the order of a few percent, as I
>> recall, so I'm not sure that we can say it's negligible (and that's not
>> why Magnus was proposing changing the default).
>
>
> It might be worth looking into using the CRC CPU instruction to reduce this
> overhead, like we do for the WAL checksums. Since that is a different
> algorithm it would be a compatibility break and we would need to support the
> old algorithm for upgraded clusters..

We looked at that when picking the algorithm. At that point it seemed
that CRC CPU instructions were not universal enough to rely on them.
The algorithm we ended up on was designed to be fast on SIMD hardware.
Unfortunately on x86-64 that required SSE4.1 integer instructions, so
with default compiles there is a lot of performance left on the table. A
low hanging fruit would be to do CPU detection like the CRC case and
enable a SSE4.1 optimized variant when those instructions are
available. IIRC it was actually a lot faster than the naive hardware
CRC that is used for WAL and about on par with interleaved CRC.

That said, the actual checksum calculation was not a big issue for
performance. The only way to make it really matter was a
larger-than-shared-buffers, smaller-than-RAM workload with tiny
per-page execution overhead. My test case was SELECT COUNT(*) on wide
rows with a small fill factor. Having to WAL-log hint bits is the
major issue.

Regards,
Ants Aasma




Re: [HACKERS] Causal reads take II

2017-01-19 Thread Ants Aasma
On Thu, Jan 19, 2017 at 2:22 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Thu, Jan 19, 2017 at 8:11 PM, Ants Aasma <ants.aa...@eesti.ee> wrote:
>> On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro
>> <thomas.mu...@enterprisedb.com> wrote:
>>> Long term, I think it would be pretty cool if we could develop a set
>>> of features that give you distributed sequential consistency on top of
>>> streaming replication.  Something like (this | causality-tokens) +
>>> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
>>> distributed-dirty-read-prevention[4].
>>
>> Is it necessary that causal writes wait for replication before making
>> the transaction visible on the master? I'm asking because the per tx
>> variable wait time between logging commit record and making
>> transaction visible makes it really hard to provide matching
>> visibility order on master and standby.
>
> Yeah, that does seem problematic.  Even with async replication or no
> replication, isn't there already a race in CommitTransaction() where
> two backends could reach RecordTransactionCommit() in one order but
> ProcArrayEndTransaction() in the other order?  AFAICS using
> synchronous replication in one of the transactions just makes it more
> likely you'll experience such a visibility difference between the DO
> and REDO histories (!), by making RecordTransactionCommit() wait.
> Nothing prevents you getting a snapshot that can see t2 but not t1 in
> the DO history, while someone doing PITR or querying an asynchronous
> standby gets a snapshot that can see t1 but not t2 because those
> replay the REDO history.

Yes, there is a race even with all transactions having the same
synchronization level. But nobody will mind if we some day fix that
race. :) With different synchronization levels it is much trickier to
fix, as either async commits must wait behind sync commits before
becoming visible, or return without becoming visible, or the
visibility order must differ from commit record LSN order. The first
makes the async commit feature useless, the second seems a no-go for
semantic reasons, and the third requires extra information to be sent
to standbys so they know the actual commit order.

>> In CSN based snapshot
>> discussions we came to the conclusion that to make standby visibility
>> order match master while still allowing for async transactions to
>> become visible before they are durable we need to make the commit
>> sequence a vector clock and transmit extra visibility ordering
>> information to standbys. Having one more level of delay between WAL
>> logging of the commit and making it visible would make the problem
>> even worse.
>
> I'd like to read that... could you please point me at the right bit of
> that discussion?

Some of that discussion was face to face at pgconf.eu, some of it is
here: 
https://www.postgresql.org/message-id/CA+CSw_vbt=cwluogr7gxdpnho_y4cz7x97+o_bh-rfo7ken...@mail.gmail.com

Let me know if you have any questions.

>> It seems that fixing that would
>> require either keeping some per client state or a global agreement on
>> what snapshots are safe to provide, both of which you tried to avoid
>> for this feature.
>
> Agreed.  You briefly mentioned this problem in the context of pairs of
> read-only transactions a while ago[1].  As you said then, it does seem
> plausible to do that with a token system that gives clients the last
> commit LSN from the snapshot used by a read only query, so that you
> can ask another standby to make sure that LSN has been replayed before
> running another read-only transaction.  This could be handled
> explicitly by a motivated client that is talking to multiple nodes.  A
> more general problem is client A telling client B to go and run
> queries and expecting B to see all transactions that A has seen; it
> now has to pass the LSN along with that communication, or rely on some
> kind of magic proxy that sees all transactions, or a radically
> different system with a GTM.

If/when we do CSN-based snapshots, adding a GTM could be relatively
straightforward. It's basically not all that far from what Spanner
does by using a timestamp as the snapshot. But this is all relatively
independent of this patch.

Regards,
Ants Aasma




Re: [HACKERS] Causal reads take II

2017-01-18 Thread Ants Aasma
On Tue, Jan 3, 2017 at 3:43 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> Here is a new version of my "causal reads" patch (see the earlier
> thread from the 9.6 development cycle[1]), which provides a way to
> avoid stale reads when load balancing with streaming replication.

Thanks for working on this. It will let us do something a lot of
people have been asking for.

> Long term, I think it would be pretty cool if we could develop a set
> of features that give you distributed sequential consistency on top of
> streaming replication.  Something like (this | causality-tokens) +
> SERIALIZABLE-DEFERRABLE-on-standbys[3] +
> distributed-dirty-read-prevention[4].

Is it necessary that causal writes wait for replication before making
the transaction visible on the master? I'm asking because the
per-transaction variable wait time between logging the commit record
and making the transaction visible makes it really hard to provide a
matching visibility order on the master and the standby. In the CSN
based snapshot discussions we came to the conclusion that, to make the
standby visibility order match the master while still allowing async
transactions to become visible before they are durable, we need to
make the commit sequence a vector clock and transmit extra visibility
ordering information to standbys. Having one more level of delay
between WAL logging of the commit and making it visible would make the
problem even worse.

One other thing that might be an issue for some users is that this
patch only ensures that clients observe forward progress of database
state after a writing transaction. With two consecutive read-only
transactions that go to different servers, a client could still
observe database state going backwards. It seems that fixing that
would require either keeping some per-client state or a global
agreement on what snapshots are safe to provide, both of which you
tried to avoid for this feature.

Regards,
Ants Aasma




Re: [HACKERS] emergency outage requiring database restart

2017-01-18 Thread Ants Aasma
On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> Still getting checksum failures.   Over the last 30 days, I see the
> following.  Since enabling checksums FWICT none of the damage is
> permanent and rolls back with the transaction.   So creepy!

The checksums still only differ in the least significant digits, which
pretty much means that there is a block number mismatch. So if you
rule out the filesystem not doing its job correctly and transposing
blocks, it could be something else that is resulting in blocks getting
read from a location that happens to differ by a small multiple of the
page size. Maybe somebody is racily mucking with the table's fds
between seeking and reading. That would explain the issue disappearing
after a retry.
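
The reason the low-digits observation points at the block number is
that the block number is folded in only at the very end of the
computation; paraphrasing the tail of pg_checksum_page() in
checksum_impl.h:

/* the block number is XORed in after hashing the page contents, so a
 * page read from a nearby wrong offset yields a checksum that differs
 * roughly by (blkno ^ wrong_blkno), i.e. only in the low bits */
checksum = pg_checksum_block(page, BLCKSZ);
return (checksum ^ blkno) % 65535 + 1;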

Maybe you can arrange for the RelFileNode and block number to be
logged for the checksum failures and check what the actual checksums
are in data files surrounding the failed page. If the requested block
number contains something completely else, but the page that follows
contains the expected checksum value, then it would support this
theory.

Regards,
Ants Aasma




Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2017-01-13 Thread Ants Aasma
On 5 Jan 2017 2:54 a.m., "Craig Ringer" <cr...@2ndquadrant.com> wrote:
> Ants, do you think you'll have a chance to convert your shell script
> test into a TAP test in src/test/recovery?
>
> Simon has said he would like to commit this fix. I'd personally be
> happier if it had a test to go with it.
>
> You could probably just add to src/test/recover/t/001 which now
> contains my additions for hot standby.

Do you feel the test in the attached patch is enough, or would you
like to see anything else covered?

Regards,
Ants Aasma
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c6b54ec..0ab936f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,10 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot. (If we're not using a slot it's harmless to
+ * send a feedback message explicitly setting InvalidTransactionId).
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1170,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	/* initially true so we always send at least one feedback message */
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,14 +1196,17 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
+	 * If Hot Standby is not yet accepting connections there is nothing to
+	 * send. Check this after the interval has expired to reduce number of
+	 * calls.
+	 *
+	 * Bailing out here also ensures that we don't send feedback until we've
+	 * read our own replication slot state, so we don't tell the master to
+	 * discard needed xmin or catalog_xmin from any slots that may exist
+	 * on this replica.
 	 */
 	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
 		return;
-	}
 
 	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index eef512d..7fb2e9e 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 22;
+use Test::More tests => 24;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -161,3 +161,22 @@ is($catalog_xmin, '', 'non-cascaded slot xmin still null with hs_feedback reset'
 ($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
 is($xmin, '', 'cascaded slot xmin null with hs feedback reset');
 is($catalog_xmin, '', 'cascaded slot xmin still null with hs_feedback reset');
+
+diag "re-enabling hot_standby_feedback and disabling while stopped";
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = on;');
+$node_standby_2->reload;
+
+$node_master->safe_psql('postgres', qq[INSERT INTO tab_int VALUES (11000);]);
+replay_check();
+
+$node_standby_2->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;');
+$node_standby_2->stop;
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+isnt($xmin, '', 'cascaded slot xmin non-null with postgres shut down');
+
+# Xmin from a previous run should be cleared on startup.
+$node_standby_2->start;
+
+($xmin, $catalog_xmin) = get_slot_xmins($node_standby_1, $slotname_2);
+is($xmin, '', 'cascaded slot xmin reset after startup with hs feedback reset');



Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2017-01-05 Thread Ants Aasma
On 5 Jan 2017 2:54 a.m., "Craig Ringer" <cr...@2ndquadrant.com> wrote:

On 2 January 2017 at 22:24, Craig Ringer <cr...@2ndquadrant.com> wrote:
>
>
> On 2 Jan. 2017 20:20, "Simon Riggs" <si...@2ndquadrant.com> wrote:
>
> On 21 December 2016 at 13:23, Simon Riggs <si...@2ndquadrant.com> wrote:
>
>> Fix it up and I'll commit. Thanks for the report.
>
> I was hoping for some more effort from Ants to correct this.
>
> I'll commit Craig's new tests for hs feedback before this, so it can
> go in with a Tap test, so I imagine we're about a week away from
> commit on this.
>
>
> I posted a revised version of Ants's patch that passes the shell script
> test.
>
> A TAP test would likely make sense though, I agree.


Ants, do you think you'll have a chance to convert your shell script
test into a TAP test in src/test/recovery?

Simon has said he would like to commit this fix. I'd personally be
happier if it had a test to go with it.

You could probably just add to src/test/recover/t/001 which now
contains my additions for hot standby.


I'm travelling right now, but I should be able to give it a shot next week.

Regards,
Ants Aasma


Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-21 Thread Ants Aasma
On Wed, Dec 21, 2016 at 2:09 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> On 21 December 2016 at 15:40, Ants Aasma <ants.aa...@eesti.ee> wrote:
>
>>> So -1 on this part of the patch, unless there's something I've 
>>> misunderstood.
>>
>> Currently there was no feedback sent if hot standby was not active. I
>> was not sure if it was safe to call GetOldestXmin() in that case.
>> However I did not consider cascading replica slots wanting to hold
>> back xmin, where resetting the parents xmin is indeed wrong. Do you
>> know if GetOldestXmin() is safe at this point and we can just remove
>> the HotStandbyActive() check? Otherwise I think the correct approach
>> is to move the check and return inside the hot_standby_feedback case
>> like this:
>>
>> if (hot_standby_feedback)
>> {
>> if (!HotStandbyActive())
>>return;
>
> I feel like I'm missing something obvious here. If we force sending
> hot standby feedback at least once, by assuming
> master_has_standby_xmin = true at startup, why isn't that sufficient?
> We'll send InvalidTransactionId if hot_standby_feedback is off. Isn't
> that the point?
>
> It's safe to call GetOldestXmin pretty much as soon as we load the
> recovery start checkpoint. It won't consider the state of replication
> slots until later in startup, but that's a pre-existing flaw that
> should be addressed separately.
>
> Why do we need to do more than master_has_standby_xmin = true ?

There was a !HotStandbyActive() check there previously; I was not sure
whether it was only there to avoid sending useless messages or was
necessary because something was not initialized otherwise. It looks
like it is not needed and can be removed. A revised patch that does so
is attached.

Regards,
Ants Aasma
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..84ffa91 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot.
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1169,8 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	/* initially true so we always send at least one feedback message */
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,16 +1195,6 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
-	 */
-	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
-		return;
-	}
-
-	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
 	 * everything else has been checked.
 	 */



Re: [HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-20 Thread Ants Aasma
On Wed, Dec 21, 2016 at 4:14 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> Re the patch, I don't like
>
> -   static bool master_has_standby_xmin = false;
> +   static bool master_has_standby_xmin = true;
>
> without any comment. It's addressed in the comment changes on the
> header func, but the link isn't obvious. Maybe a oneliner to say
> "ensure we always send at least one feedback message" ?

Will fix.

> I think this part of the patch is correct and useful.
>
>
>
> I don't see the point of
>
> +*
> +* If Hot Standby is not yet active we reset the xmin value. Check 
> this
> +* after the interval has expired to reduce number of calls.
>  */
> -   if (hot_standby_feedback)
> +   if (hot_standby_feedback && HotStandbyActive())
>
> though. Forcing feedback to send InvalidTransactionId until hot
> standby feedback is actively running seems counterproductive; we want
> to lock in feedback as soon as possible, not wait until we're
> accepting transactions. Simon and I are in fact working on changes to
> do the opposite of what you've got here and ensure that we don't allow
> hot standby connections until we know hot_standby_feedback is in
> effect and a usable xmin is locked in. That way the master won't
> remove tuples we need as soon as we start an xact and cause a conflict
> with recovery.
>
> If there are no running xacts, GetOldestXmin() will return the slot
> xmin if any. We should definitely not be clearing that just because
> we're not accepting hot standby connections yet; we want to make very
> sure it remains in effect unless the user explicitly de-configures hot
> standby.
>
>  (There's another pre-existing problem there, where we can start the
> walsender before slots are initialized, that I'll be addressing
> separately).
>
> If there's no slot xmin either, we'll send nextXid. That's a sensible
> starting point for hot standby feedback until we start some
> transactions.
>
> So -1 on this part of the patch, unless there's something I've misunderstood.

Currently no feedback is sent if hot standby is not active. I
was not sure if it is safe to call GetOldestXmin() in that case.
However I did not consider cascading replica slots wanting to hold
back xmin, where resetting the parent's xmin is indeed wrong. Do you
know if GetOldestXmin() is safe at this point and we can just remove
the HotStandbyActive() check? Otherwise I think the correct approach
is to move the check and return inside the hot_standby_feedback case
like this:

if (hot_standby_feedback)
{
if (!HotStandbyActive())
   return;

Regards,
Ants Aasma




[HACKERS] Replication slot xmin is not reset if HS feedback is turned off while standby is shut down

2016-12-20 Thread Ants Aasma
I just had a client issue with table bloat that I traced back to a
stale xmin value in a replication slot. The xmin value from hot standby
feedback is stored in the replication slot and used for the vacuum xmin
calculation. If hot standby feedback is turned off while the walreceiver
is active, the xmin gets reset by an HS feedback message containing
InvalidTransactionId. However, if feedback gets turned off while the
standby is shut down, this message never gets sent and a stale value
gets left behind in the replication slot, holding back vacuum.

The simple fix seems to be to always send out at least one feedback
message on each connect, regardless of the hot_standby_feedback setting.
Patch attached. Looks like this goes back to version 9.4. It could
conceivably cause issues for replication middleware that does not know
how to handle hot standby feedback messages. Not sure if any exist and
if that is a concern.

A shell script to reproduce the problem is also attached, adjust the
PGPATH variable to your postgres install and run in an empty
directory.

Regards,
Ants Aasma
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index cc3cf7d..31333ec 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1157,7 +1157,9 @@ XLogWalRcvSendReply(bool force, bool requestReply)
  * in case they don't have a watch.
  *
  * If the user disables feedback, send one final message to tell sender
- * to forget about the xmin on this standby.
+ * to forget about the xmin on this standby. We also send this message
+ * on first connect because a previous connection might have set xmin
+ * on a replication slot.
  */
 static void
 XLogWalRcvSendHSFeedback(bool immed)
@@ -1167,7 +1169,7 @@ XLogWalRcvSendHSFeedback(bool immed)
 	uint32		nextEpoch;
 	TransactionId xmin;
 	static TimestampTz sendTime = 0;
-	static bool master_has_standby_xmin = false;
+	static bool master_has_standby_xmin = true;
 
 	/*
 	 * If the user doesn't want status to be reported to the master, be sure
@@ -1192,20 +1194,13 @@ XLogWalRcvSendHSFeedback(bool immed)
 	}
 
 	/*
-	 * If Hot Standby is not yet active there is nothing to send. Check this
-	 * after the interval has expired to reduce number of calls.
-	 */
-	if (!HotStandbyActive())
-	{
-		Assert(!master_has_standby_xmin);
-		return;
-	}
-
-	/*
 	 * Make the expensive call to get the oldest xmin once we are certain
 	 * everything else has been checked.
+	 *
+	 * If Hot Standby is not yet active we reset the xmin value. Check this
+	 * after the interval has expired to reduce number of calls.
 	 */
-	if (hot_standby_feedback)
+	if (hot_standby_feedback && HotStandbyActive())
 		xmin = GetOldestXmin(NULL, false);
 	else
 		xmin = InvalidTransactionId;


slot-xmin-not-reset-reproduce.sh
Description: Bourne shell script



Re: [HACKERS] emergency outage requiring database restart

2016-10-27 Thread Ants Aasma
On Wed, Oct 26, 2016 at 8:43 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> /var/lib/pgsql/9.5/data/pg_log/postgresql-26.log | grep "page
> verification"
> 2016-10-26 11:26:42 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 37251 but expected 37244
> 2016-10-26 11:27:55 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 37249 but expected 37244
> 2016-10-26 12:16:44 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 44363 but expected 44364
> 2016-10-26 12:18:58 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 49525 but expected 49539
> 2016-10-26 12:19:12 CDT [postgres@castaging]: WARNING:  page
> verification failed, calculated checksum 37345 but expected 37340

The checksum values are improbably close. The checksum algorithm has
decently good mixing of all bits in the page, so having the first byte
match in 5 checksums puts the odds at about 1 in 2^40. What is not mixed
in properly is the block number; it only gets xor'ed in before the value
is packed into 16 bits using modulo 0xFFFF. So I'm pretty sure
different block numbers were used for writing out and reading in the
page. Either the blocknum gets corrupted between calculating the
checksum and writing the page out (unlikely given the proximity), or
the pages are somehow getting transposed in the storage.
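
For reference, the final folding step looks roughly like this (a
simplified sketch of the logic in src/include/storage/checksum_impl.h;
the page hash itself comes from the FNV-style loop over the block):

#include <stdint.h>

/* Fold the page hash and the block number into the stored 16-bit
 * checksum. Because blkno is only xor'ed in rather than mixed through
 * the hash, transposed pages with identical contents yield values that
 * differ only in the bits where the block numbers differ. */
static uint16_t
fold_checksum(uint32_t page_hash, uint32_t blkno)
{
    return (uint16_t) (((page_hash ^ blkno) % 65535) + 1);
}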

Regards,
Ants Aasma




Re: [HACKERS] Pluggable storage

2016-08-18 Thread Ants Aasma
On Tue, Aug 16, 2016 at 9:46 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-08-15 12:02:18 -0400, Robert Haas wrote:
>> I am somewhat inclined to
>> believe that we need to restructure the executor in a bigger way so
>> that it passes around datums instead of tuples; I'm inclined to
>> believe that the current tuple-centric model is probably not optimal
>> even for the existing storage format.
>
> I actually prototyped that, and it's not an easy win so far. Column
> extraction cost, even after significant optimization, is still often a
> significant portion of the runtime. And e.g. projection only extracting
> all columns, after evaluating a restrictive qual referring to an "early"
> column, can be a significant win.  We'd definitely have to give up on
> extracting columns 0..n when accessing later columns... Hm.

What about going even further than [1] in converting the executor to
being opcode based and merging projection and qual evaluation to a
single pass? Optimizer would then have some leeway about how to order
column extraction and qual evaluation. Might even be worth it to
special case some functions as separate opcodes (e.g. int4eq,
timestamp_lt).

Regards,
Ants Aasma

[1] 
https://www.postgresql.org/message-id/20160714011850.bd5zhu35szle3...@alap3.anarazel.de




Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Ants Aasma
On Wed, Aug 10, 2016 at 7:54 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> Oh, I found that I underestimated complexity of async commit...  :)
>
> Do I understand right that now async commit right as follows?
> 1) Async transaction confirms commit before flushing WAL.
> 2) Other transactions sees effect of async transaction only when its WAL
> flushed.
> 3) In the session which just committed async transaction, effect of this
> transaction is visible immediately (before WAL flushed). Is it true?

Current code, simplified:

XactLogCommitRecord()
if (synchronous_commit)
    XLogFlush()
ProcArrayEndTransaction() // Become visible

The issue we are discussing is that with CSNs, the "become visible"
portion must occur in CSN order. If CSN == LSN, then async
transactions that have their commit record after a sync record must
wait for the sync record to flush and become visible. Simplest
solution is to not require CSN == LSN and just assign a CSN value
immediately before becoming visible.
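
A sketch of that ordering; the CSN calls are hypothetical, only the
first three lines correspond to current code:

XactLogCommitRecord();          /* commit record gets an LSN */
if (synchronous_commit)
    XLogFlush(commit_lsn);      /* async commits skip the flush */
/* Assign the CSN only now, so visibility follows CSN order no matter
 * where the commit record landed in the WAL. */
csn = GetNextCommitSeqNo();     /* hypothetical */
CSNLogSetCommitSeqNo(xid, csn); /* hypothetical */
ProcArrayEndTransaction();      /* become visible */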

Regards,
Ants Aasma




Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Ants Aasma
On Wed, Aug 10, 2016 at 6:09 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Hmm. There's one more possible way this could all work. Let's have CSN ==
> LSN, also for asynchronous commits. A snapshot is the current insert
> position, but also make note of the current flush position, when you take a
> snapshot. Now, when you use the snapshot, if you ever see an XID that
> committed between the snapshot's insert position and the flush position,
> wait for the WAL to be flushed up to the snapshot's insert position at that
> point. With that scheme, an asynchronous commit could return to the
> application without waiting for a flush, but if someone actually looks at
> the changes the transaction made, then that transaction would have to wait.
> Furthermore, we could probably skip that waiting too, if the reading
> transaction is also using synchronous_commit=off.
>
> That's slightly different from the current behaviour. A transaction that
> runs with synchronous_commit=on, and reads data that was modified by an
> asynchronous transaction, would take a hit. But I think that would be
> acceptable.

My proposal of vector clocks would allow for pretty much exactly
current behavior.

To simplify, there would be lastSyncCommitSeqNo and
lastAsyncCommitSeqNo variables in ShmemVariableCache. Transaction
commit would choose which one to update based on synchronous_commit
setting and embed the value of the setting into CSN log. Snapshots
would contain both values, when checking for CSN visibility use the
value of the looked up synchronous_commit setting to decide which
value to compare against. Standby's replaying commit records would
just update both values, resulting in transactions becoming visible in
xlog order, as they do today. The scheme would allow for inventing a
new xlog record/replication message communicating visibility ordering.
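
To make that concrete, a minimal sketch with illustrative names (none
of these exist in the tree today):

/* In ShmemVariableCache: */
uint64      lastSyncCommitSeqNo;    /* highest visible CSN, WAL flushed */
uint64      lastAsyncCommitSeqNo;   /* highest visible CSN, flush pending */

/* Copied into every snapshot when it is taken: */
typedef struct SnapshotCsnBounds
{
    uint64      syncCsn;
    uint64      asyncCsn;
} SnapshotCsnBounds;

/* The CSN log also records the committer's synchronous_commit setting. */
static bool
CommitVisibleInSnapshot(SnapshotCsnBounds *snap, uint64 xactCsn,
                        bool xactWasSync)
{
    uint64      bound = xactWasSync ? snap->syncCsn : snap->asyncCsn;

    return xactCsn <= bound;
}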

However I don't see why inventing a separate CSN concept is a large
problem. Quite the opposite, unless there is a good reason that I'm
missing, it seems better to not unnecessarily conflate commit record
durability and transaction visibility ordering. Not having them tied
together allows for an external source to provide CSN values, allowing
for interesting distributed transaction implementations. E.g. using a
timestamp as the CSN a'la Google Spanner and the TrueTime API.

Regards,
Ants Aasma




Re: [HACKERS] multivariate statistics (v19)

2016-08-10 Thread Ants Aasma
On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
> haven't written the SGML docs about this part of the patch series yet).

While researching this topic a few years ago I came across a paper on
this exact topic called "Consistently Estimating the Selectivity of
Conjuncts of Predicates" [1]. While effective it seems to be quite
heavy-weight, so would probably need support for tiered optimization.

[1] 
https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf

Regards,
Ants Aasma




Re: [HACKERS] Proposal for CSN based snapshots

2016-08-10 Thread Ants Aasma
optimistic oldest
snapshot, publish and then recheck, if necessary, republish.

* While commits are not as performance critical it may still be
beneficial to defer clog updates and perform them in larger batches to
reduce clog lock traffic.

I think I have achieved some clarity on my idea of snapshots that
migrate between being CSN and XID based. The essential problem with
old CSN based snapshots is that the CSN log for their xmin-xmax
interval needs to be kept around in a quick to access datastructure.
And the problem with long running transactions is twofold, first is
that they need a well defined location where to publish the visibility
info for their xids and secondly, they enlarge the xmin-xmax interval
of all concurrent snapshots, needing a potentially huge amount of CSN
log to be kept around. One way to deal with it is to just accept it
and use a SLRU as in this patch.

My new insight is that a snapshot doesn't need to be either-or CSN or
XIP (xid in progress) array based, but it can also be a hybrid. There
would be a sorted array of in progress xids and a non-overlapping CSN
xmin-xmax interval where CSN log needs to be consulted. As snapshots
age they scan the CSN log and incrementally build their XIP array,
basically lazily constructing same data structure used in snapshots
today. Old in progress xids need a slot for themselves to be kept
around, but all new snapshots being taken would immediately classify
them as old, store them in XIP and not include them in their CSN xmin.
Under this scheme the amount of CSN log strictly needed is a
reasonably sized ring buffer for recent xids (probably sized based on
max conn), a sparse map for long transactions (bounded by max conn)
and some slack for snapshots still being converted. A SLRU or similar
is still needed because there is no way to ensure timely conversion of
all snapshots, but by properly timing the notification the probability
of actually using it should be extremely low. Datastructure
maintenance operations occur naturally on xid assignment and are
easily batched. Long running transactions still affect all snapshots,
but the effect is small and not dependent on transaction age, old
snapshots only affect themselves. Subtransactions are still a pain,
and would need something similar to the current suboverflowed scheme.
On the plus side, it seems like the number of subxids to overflow
could be global, not per backend.
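
In code the hybrid snapshot might look something like the sketch below;
all of the names are made up for illustration:

typedef struct HybridSnapshot
{
    uint64      snapshotCsn;    /* visibility cutoff */
    TransactionId csnXmin;      /* CSN log is consulted at or above this */
    TransactionId *xip;         /* sorted in-progress xids below csnXmin,
                                 * filled in lazily as the snapshot ages */
    int         xcnt;
} HybridSnapshot;

static bool
XidInProgress(HybridSnapshot *snap, TransactionId xid)
{
    if (TransactionIdPrecedes(xid, snap->csnXmin))
        return XipArrayContains(snap->xip, snap->xcnt, xid); /* bsearch */

    /* Committed at or after the cutoff, or not committed at all, counts
     * as still in progress for this snapshot. */
    return CsnLogGetCommitSeqNo(xid) >= snap->snapshotCsn;
}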

In short:

* A directly mapped ring buffer for recent xids could be accessed lock
free, would take care most of the traffic in most of the cases. Looks
like it would be a good trade-off for complexity/performance.

* To keep workloads with wildly varying transaction lengths in bounded
amount of memory, a significantly more complex hybrid snapshot scheme
is needed. It remains to be seen if these workloads are actually a
significant issue. But if they are, the approach described might
provide a way out.

Regards,
Ants Aasma




Re: [HACKERS] WIP: Data at rest encryption

2016-06-13 Thread Ants Aasma
On Mon, Jun 13, 2016 at 5:17 AM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> On Sun, Jun 12, 2016 at 4:13 PM, Ants Aasma <ants.aa...@gmail.com> wrote:
>>> I feel separate file is better to include the key data instead of pg_control
>>> file.
>>
>> I guess that would be more flexible. However I think at least the fact
>> that the database is encrypted should remain in the control file to
>> provide useful error messages for faulty backup procedures.
>
> Another possibility could be always to do some encryption at data-type
> level for text data. For example I recalled the following thing while
> going through this thread:
> https://github.com/nec-postgres/tdeforpg
> Though I don't quite understand the use for encrypt.enable in this
> code... This has the advantage to not patch upstream.

While certainly possible, this does not cover the requirements I want
to satisfy - user data never gets stored on disk unencrypted without
making changes to the application or schema. This seems to be mostly
about separating administrator roles, specifically that centralised
storage and backup administrators should not have access to database
contents. I see this as orthogonal to per column encryption, which in
my opinion is better done in the application.

Regards,
Ants Aasma




Re: [HACKERS] WIP: Data at rest encryption

2016-06-12 Thread Ants Aasma
On Fri, Jun 10, 2016 at 5:23 AM, Haribabu Kommi
<kommi.harib...@gmail.com> wrote:
> 1. Instead of encrypting the entire database's files, how about
> providing the user an option to protect only the particular tables that
> want encryption, at table/tablespace level? This not only gives
> the user an option, it reduces the performance impact on tables
> that don't need any encryption. The problem with this approach
> is that every xlog record needs to be validated to handle the
> encryption/decryption, instead of doing it at page level.

Is there a real need for this? The customers I have talked to want to
encrypt the whole database and my goal is to make the feature fast
enough to make that feasible for pretty much everyone. I guess
switching encryption off per table would be feasible, but the key
setup would still need to be done at server startup. Per record
encryption would result in some additional information leakage though.
Overall I thought it would not be worth it, but I'm willing to have my
mind changed on this.

> 2. Instead of depending on a contrib module for the encryption, how
> about integrating the pgcrypto contrib into the core and adding that as
> a default encryption method, and also providing an option for the user
> to use a different encryption method if needed.

Technically that would be simple enough, this is more of a policy
decision. I think having builtin encryption provided by pgcrypto is
completely fine. If a consensus emerges that it needs to be
integrated, it would need to be a separate patch anyway.

> 3. Currently entire xlog pages are encrypted and stored in the file.
> Can pg_xlogdump work with those files?

Technically it could be made to; with the patch as it stands, no. Added this to my todo list.

> 4. For logical decoding, how about adding a decoding behavior based on
> the module to decide whether data is to be encrypted/decrypted?

The data to be encrypted does not depend on the module used, so I
don't think it should be module controlled. The reorder buffer
contains pretty much the same stuff as the xlog, so not encrypting it
does not look like a valid choice. For logical heap rewrites it could
be argued that nothing useful is leaked in most cases, but encrypting
it is not hard. Just a small matter of programming.

> 5. Instead of providing the passphrase through an environment variable,
> it would be better to provide some options to pg_ctl etc.

That looks like it would be worse from a security perspective.
Integrating a passphrase prompt would be an option, but a way for
scripts to provide passphrases would still be needed.

> 6. I don't have any idea whether it is possible to integrate the checksum
> and encryption in a single shot to avoid the performance penalty.

Currently no, the checksum gets stored in the page header and for any
decent cipher mode the encryption of the rest of the page will depend
on it. However, the performance difference should be negligible
because both algorithms are compute bound for cached data. The data is
very likely to be completely in L1 cache as the operations are done in
quick succession.

The non-cryptographic checksum algorithm could actually be an attack
vector for an adversary that can trigger repeated encryption by
tweaking a couple of bytes at the end of the page to see when the
checksum matches and try to infer the data from that. Similarly to the
CRIME attack. However the LSN stored at the beginning of the page
header basically provides a nonce that makes this impossible.

This also means that encryption needs to imply wal_log_hints. Will
include this in the next version of the patch.

>> I would also like to incorporate some database identifier as a salt in
>> key setup. However, system identifier stored in control file doesn't
>> fit this role well. It gets initialized somewhat too late in the
>> bootstrap process, and more importantly, gets changed on pg_upgrade.
>> This will make link mode upgrades impossible, which seems like a no
>> go. I'm torn whether to add a new value for this purpose (perhaps
>> stored outside the control file) or allow setting of system identifier
>> via initdb. The first seems like a better idea, the file could double
>> as a place to store additional encryption parameters, like key length
>> or different cipher primitive.
>
> I feel separate file is better to include the key data instead of pg_control
> file.

I guess that would be more flexible. However I think at least the fact
that the database is encrypted should remain in the control file to
provide useful error messages for faulty backup procedures.

Thanks for your input.

Regards,
Ants Aasma




[HACKERS] WIP: Data at rest encryption

2016-06-07 Thread Ants Aasma
Hi all,

I have been working on data-at-rest encryption support for PostgreSQL.
In my experience this is a common request that customers make. The
short of the feature is that all PostgreSQL data files are encrypted
with a single master key and are decrypted when read from the OS. It
does not provide column level encryption which is an almost orthogonal
feature, arguably better done client side.

Similar things can be achieved with filesystem level encryption.
However this is not always optimal for various reasons. One of the
better reasons is the desire for HSM based encryption in a storage
area network based setup.

Attached to this mail is a work in progress patch that adds an
extensible encryption mechanism. There are some loose ends left to tie
up, but the general concept and architecture is at a point where it's
ready for some feedback, fresh ideas and bikeshedding.

Usage
=====

Set up database like so:

(read -sp "Postgres passphrase: " PGENCRYPTIONKEY; echo;
 export PGENCRYPTIONKEY
 initdb -k -K pgcrypto $PGDATA )

Start PostgreSQL:

(read -sp "Postgres passphrase: " PGENCRYPTIONKEY; echo;
 export PGENCRYPTIONKEY
 postgres $PGDATA )

Design
======

The patch adds a new GUC called encryption_library, when specified the
named library is loaded before shared_preload_libraries and is
expected to register its encryption routines. For now the API is
pretty narrow, one parameterless function that lets the extension do
key setup on its own terms, and two functions for
encrypting/decrypting an arbitrary sized block of data with tweak. The
tweak should alter the encryption function so that identical block
contents are encrypted differently based on their location. The GUC
needs to be set at bootstrap time, so it gets set by a new option for
initdb. During bootstrap an encryption sample gets stored in the
control file, enabling useful error messages.
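
The registration interface is roughly of this shape (a condensed
sketch; the attached patch is authoritative and details may differ):

typedef struct EncryptionRoutines
{
    /* key setup on the extension's own terms, e.g. passphrase + KDF */
    void        (*SetupKeys) (void);
    /* encrypt/decrypt an arbitrary sized block; the tweak varies the
     * result based on the block's location */
    void        (*Encrypt) (char *dst, const char *src, Size len,
                            const char *tweak);
    void        (*Decrypt) (char *dst, const char *src, Size len,
                            const char *tweak);
} EncryptionRoutines;

/* called by the library named in encryption_library when it loads */
extern void register_encryption_module(EncryptionRoutines *routines);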

The library name is not stored in controldata. I'm not quite sure
about this decision. On one hand it would be very useful to tell the
user what he needs to get at his data if the configuration somehow
goes missing and it would get rid of the extra GUC. On the other hand
I don't really want to bloat control data, and the same encryption
algorithm could be provided by different implementations.

For now the encryption is done for everything that goes through md,
xlog and slru. Based on a review of read/write/fread/fwrite calls this
list is missing:

* BufFile - needs refactoring
* Logical reorder buffer serialization - probably needs a stream mode
cipher API addition.
* logical_heap_rewrite - can be encrypted as one big block
* 2PC state data - ditto
* pg_stat_statements - query texts get appended so a stream mode
cipher might be needed here too.

copydir needed some changes too because tablespace and database oid
are included in the tweak and so copying also needs to decrypt and
encrypt with the new tweak value.

For demonstration purposes I imported Brian Gladman's AES-128-XTS mode
implementation into pgcrypto and used an environment variable for key
setup. This part is not really in any reviewable state, the XTS code
needs heavy cleanup to bring it up to PostgreSQL coding standards,
keysetup needs something secure, like PBKDF2 or scrypt.

Performance with the current AES implementation is not great, but not
horrible either: I'm seeing around a 2x slowdown for workloads larger
than shared_buffers but smaller than free memory. However the plan
is to fix this - I have a prototype AES-NI implementation that does
3GB/s per core on my Haswell based laptop (1.25 B/cycle).

Open questions
==============

The main questions is what to do about BufFile? It currently provides
both unaligned random access and a block based interface. I wonder if
it would be a good idea to refactor it to be fully block based under
the covers.

I would also like to incorporate some database identifier as a salt in
key setup. However, system identifier stored in control file doesn't
fit this role well. It gets initialized somewhat too late in the
bootstrap process, and more importantly, gets changed on pg_upgrade.
This will make link mode upgrades impossible, which seems like a no
go. I'm torn whether to add a new value for this purpose (perhaps
stored outside the control file) or allow setting of system identifier
via initdb. The first seems like a better idea, the file could double
as a place to store additional encryption parameters, like key length
or different cipher primitive.

Regards,
Ants Aasma
diff --git a/contrib/pgcrypto/Makefile b/contrib/pgcrypto/Makefile
index 18bad1a..04ce887 100644
--- a/contrib/pgcrypto/Makefile
+++ b/contrib/pgcrypto/Makefile
@@ -20,7 +20,7 @@ SRCS		= pgcrypto.c px.c px-hmac.c px-crypt.c \
 		mbuf.c pgp.c pgp-armor.c pgp-cfb.c pgp-compress.c \
 		pgp-decrypt.c pgp-encrypt.c pgp-info.c pgp-mpi.c \
 		pgp-pubdec.c pgp-pubenc.c pgp-pubkey.c pgp-s2k.c \
-		pgp-pgsql.c
+		pgp-pgsql.c xts.c
 
 MODULE_big	= pgcrypto
 

Re: [HACKERS] Does people favor to have matrix data type?

2016-05-25 Thread Ants Aasma
On Wed, May 25, 2016 at 10:38 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 25 May 2016 at 03:52, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
>>
>> In a few days, I'm working for a data type that represents matrix in
>> mathematical area. Does people favor to have this data type in the core,
>> not only my extension?
>
>
> If we understood the use case, it might help understand whether to include
> it or not.
>
> Multi-dimensionality of arrays isn't always useful, so this could be good.

Many natural language and image processing methods extract feature
vectors and then use some simple distance metric, like the dot product,
to calculate vector similarity. For example we presented a latent
semantic analysis prototype at pgconf.eu 2015 that used real[] to
store the features and a dotproduct(real[], real[]) real function to
do similarity matching. However using real[] instead of a hypothetical
realvector or realmatrix did not prove to be a huge overhead, so
overall I'm on the fence for the usefulness of a special type. Maybe a
helper function or two to validate the additional restrictions in a
check constraint would be enough.
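
The helper itself is just a tight loop; a minimal sketch of such a
function, assuming non-null, equal-length one-dimensional arrays and
omitting error checking:

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

PG_FUNCTION_INFO_V1(dotproduct);

Datum
dotproduct(PG_FUNCTION_ARGS)
{
    ArrayType  *a = PG_GETARG_ARRAYTYPE_P(0);
    ArrayType  *b = PG_GETARG_ARRAYTYPE_P(1);
    float4     *da = (float4 *) ARR_DATA_PTR(a);
    float4     *db = (float4 *) ARR_DATA_PTR(b);
    int         n = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
    float4      sum = 0.0f;
    int         i;

    for (i = 0; i < n; i++)
        sum += da[i] * db[i];

    PG_RETURN_FLOAT4(sum);
}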

Regards,
Ants Aasma




Re: [HACKERS] Is the unfair lwlock behavior intended?

2016-05-24 Thread Ants Aasma
On Tue, May 24, 2016 at 9:03 AM, Tsunakawa, Takayuki
<tsunakawa.ta...@jp.fujitsu.com> wrote:
> I encountered a strange behavior of lightweight lock in PostgreSQL 9.2.  That
> appears to apply to 9.6 too, as far as I can tell from examining the code.
> Could you tell me if the behavior is intended or needs a fix?
>
> Simply put, the unfair behavior is that waiters for exclusive mode are 
> overtaken by share-mode lockers who arrive later.

9.5 had significant LWLock scalability improvements. This might
improve performance enough so that exclusive lockers don't get
completely starved. It would be helpful if you could test if it's
still possible to trigger starvation with the new code.

Regards,
Ants Aasma




Re: [HACKERS] Is the unfair lwlock behavior intended?

2016-05-24 Thread Ants Aasma
On Tue, May 24, 2016 at 1:29 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> On Tue, May 24, 2016 at 9:03 AM, Tsunakawa, Takayuki
> <tsunakawa.ta...@jp.fujitsu.com> wrote:
>> Is this intentional, or should we make the later share-lockers if someone
>> is in the wait queue?
>
> I've already observed such behavior, see [1].  I think that now there is no
> consensus on how to fix that.  For instance, Andres expressed the opinion
> that this shouldn't be fixed from the LWLock side [2].
> FYI, I'm planning to pickup work on CSN patch [3] for 10.0.  CSN should fix
> various scalability issues including high ProcArrayLock contention.

Some amount of non-fairness is ok, but degrading to the point of
complete denial of service is not very graceful. I don't think it's
realistic to hope that all lwlock contention issues will be fixed any
time soon. Some fallback mechanism would be extremely nice until then.

It seems to me that we could fix complete starvation with no
noticeable effect on common cases. The general idea would be to have a
separate flag bit for starving exclusive lockers. When the lock is taken
in shared mode an exclusive locker puts itself on the wait queue like
it does now, and then proceeds to wait on its semaphore. But
differently from now, the semaphore wait has a (configurable?) timeout
after which the exclusive locker wakes up and sets the new
LW_FLAG_STARVING flag. When this flag is set new shared lockers will
now consider the lock taken and will queue up to wait. The exclusive
locker that marked itself as starving will get a chance to run once
the existing shared lockers complete and wake it up. If there are
other nearly starved exclusive lockers they are likely to be at the
head of the wait queue and will also get a chance to run. If the wait
queue is empty or has a shared locker at the head the starvation flag
gets reset and all of the queued up shared lockers get their turn.
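
The shared-lock fast path would change to roughly the following,
LW_FLAG_STARVING being the hypothetical new flag bit:

/* Sketch only, helper names made up: availability check for an
 * arriving shared locker. */
uint32      state = pg_atomic_read_u32(&lock->state);

if (state & (LW_VAL_EXCLUSIVE | LW_FLAG_STARVING))
    queue_and_sleep(lock);          /* treat the lock as taken */
else
    attempt_shared_acquire(lock);   /* fast path, same as today */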

With the proposed scheme non-starved cases will execute pretty much
exactly what they do now, except for a slightly different check for
shared lock availability. In starvation scenarios the exclusive
lockers will get a chance to run once per starvation grace period.
That might still not be ideal or "fair", but it is a lot better than
the status quo of indefinitely blocking.

PS: if/when you are picking up the CSN work, ping me to write up some
of the insights and ideas I have had while thinking about this topic.

Regards,
Ants Aasma




Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Ants Aasma
On Wed, May 11, 2016 at 3:52 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-05-11 03:20:12 +0300, Ants Aasma wrote:
>> On Tue, May 10, 2016 at 7:56 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> > On Mon, May 9, 2016 at 8:34 PM, David Rowley
>> > <david.row...@2ndquadrant.com> wrote:
>> > I don't have any at the moment, but I'm not keen on hundreds of new
>> > vector functions that can all have bugs or behavior differences versus
>> > the unvectorized versions of the same code.  That's a substantial tax
>> > on future development.  I think it's important to understand what
>> > sorts of queries we are targeting here.  KaiGai's GPU-acceleration
>> > stuff does great on queries with complex WHERE clauses, but most
>> > people don't care not only because it's out-of-core but because who
>> > actually looks for the records where (a + b) % c > (d + e) * f / g?
>> > This seems like it has the same issue.  If we can speed up common
>> > queries people are actually likely to run, OK, that's interesting.
>>
>> I have seen pretty complex expressions in the projection and
>> aggregation. Couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
>> END) type of expressions. In critical places had to replace them with
>> a C coded function that processed a row at a time to avoid the
>> executor dispatch overhead.
>
> I've seen that as well, but was it the actual fmgr indirection causing
> the overhead, or was it ExecQual/ExecMakeFunctionResultNoSets et al?

I don't remember what the exact profile looked like, but IIRC it was
mostly Exec* stuff with advance_aggregates also up there.

Regards,
Ants Aasma




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Ants Aasma
On Tue, May 10, 2016 at 7:56 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, May 9, 2016 at 8:34 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
> I don't have any at the moment, but I'm not keen on hundreds of new
> vector functions that can all have bugs or behavior differences versus
> the unvectorized versions of the same code.  That's a substantial tax
> on future development.  I think it's important to understand what
> sorts of queries we are targeting here.  KaiGai's GPU-acceleration
> stuff does great on queries with complex WHERE clauses, but most
> people don't care not only because it's out-of-core but because who
> actually looks for the records where (a + b) % c > (d + e) * f / g?
> This seems like it has the same issue.  If we can speed up common
> queries people are actually likely to run, OK, that's interesting.

I have seen pretty complex expressions in the projection and
aggregation. Couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
END) type of expressions. In critical places had to replace them with
a C coded function that processed a row at a time to avoid the
executor dispatch overhead.

> By the way, I think KaiGai's GPU-acceleration stuff points to another
> pitfall here.  There's other stuff somebody might legitimately want to
> do that requires another copy of each function. For example, run-time
> code generation likely needs that (a function to tell the code
> generator what to generate for each of our functions), and
> GPU-acceleration probably does, too.  If fixing a bug in numeric_lt
> requires changing not only the regular version and the vectorized
> version but also the GPU-accelerated version and the codegen version,
> Tom and Dean are going to kill us.  And justifiably so!  Granted,
> nobody is proposing those other features in core right now, but
> they're totally reasonable things to want to do.

My thoughts in this area have been circling around getting LLVM to do
the heavy lifting. LLVM/clang could compile existing C functions to IR
and bundle those with the DB. At query planning time or maybe even
during execution the functions can be inlined into the compiled query
plan, LLVM can then be coaxed to copy propagate, constant fold and
dead code eliminate the bejeezus out of the expression tree. This way
duplication of the specialized code can be kept to a minimum while at
least the common cases can completely avoid the fmgr overhead.

This approach would also mesh together fine with batching. Given
suitably regular data structures and simple functions, LLVM will be
able to vectorize the code. If not it will still end up with a nice
tight loop that is an order of magnitude or two faster than the
current executor.

The first cut could take care of ExecQual, ExecTargetList and friends.
Later improvements could let execution nodes provide basic blocks that
would then be threaded together into the main execution loop. If some
node does not implement the basic block interface a default
implementation is used that calls the current interface. It gets a bit
handwavy at this point, but the main idea would be to enable data
marshaling so that values can be routed directly to the code that
needs them without being written to intermediate storage.

> I suspect the number of queries that are being hurt by fmgr overhead
> is really large, and I think it would be nice to attack that problem
> more directly.  It's a bit hard to discuss what's worthwhile in the
> abstract, without performance numbers, but when you vectorize, how
> much is the benefit from using SIMD instructions and how much is the
> benefit from just not going through the fmgr every time?

My feeling is the same, fmgr overhead and data marshaling, dynamic
dispatch through the executor is the big issue. This is corroborated
by what I have seen from other VM implementations. Once you get
the data into an uniform format where vectorized execution could be
used, the CPU execution resources are no longer the bottleneck. Memory
bandwidth gets in the way, unless each input value is used in multiple
calculations. And even then, we are looking at a 4x speedup at best.

Regards,
Ants Aasma




Re: [HACKERS] Reviewing freeze map code

2016-05-09 Thread Ants Aasma
On Mon, May 9, 2016 at 10:53 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>> Attached draft patch adds SCANALL option to VACUUM in order to scan
>> all pages forcibly while ignoring visibility map information.
>> The option name is SCANALL for now but we could change it after we get
>> consensus.
>
> If we're going to go that way, I'd say it should be scan_all rather
> than scanall.  Makes it clearer, at least IMHO.

Just to add some diversity to opinions, maybe there should be a
separate command for performing integrity checks. Currently the best
ways to actually verify database correctness do so as a side effect.
The question that I get pretty much every time after I explain why we
have data checksums, is "how do I check that they are correct" and we
don't have a nice answer for that now. We could also use some ways to
sniff out corrupted rows that don't involve crashing the server in a
loop. Vacuuming pages that supposedly don't need vacuuming just to
verify integrity seems very much in the same vein.

I know right now isn't exactly the best time to hastily slap on such a
feature, but I just wanted the thought to be out there for
consideration.

Regards,
Ants Aasma




Re: [HACKERS] what to revert

2016-05-05 Thread Ants Aasma
On 5 May 2016 at 6:14 AM, "Andres Freund" <and...@anarazel.de> wrote:
>
> On 2016-05-05 06:08:39 +0300, Ants Aasma wrote:
> > On 5 May 2016 1:28 a.m., "Andres Freund" <and...@anarazel.de> wrote:
> > > On 2016-05-04 18:22:27 -0400, Robert Haas wrote:
> > > > How would the semantics change?
> > >
> > > Right now the time for computing the snapshot is relevant, if
> > > maintenance of xids is moved, it'll likely be tied to the time xids are
> > > assigned. That seems perfectly alright, but it'll change behaviour.
> >
> > FWIW moving the maintenance to a clock tick process will not change user
> > visible semantics in any significant way. The change could easily be made
> > in the next release.
>
> I'm not convinced of that - right now the timeout is computed as an
> offset to the time a snapshot with a certain xmin horizon is
> taken. Moving the computation to GetNewTransactionId() or a clock tick
> process will make it relative to the time an xid has been generated
> (minus a fuzz factor).  That'll behave differently in a number of cases, no?

Timeout is calculated in TransactionIdLimitedForOldSnapshots() as
GetSnapshotCurrentTimestamp() minus old_snapshot_threshold, rounded down
to the previous minute. If the highest seen xmin in that minute is useful
in raising the cleanup xmin, the threshold is bumped. Snapshots switch
behavior when their whenTaken is past this threshold. Xid generation time
has no effect on this timing; it's completely based on passage of time.

The needed invariant is that no snapshot having whenTaken after the
timeout timestamp can have an xmin less than the calculated bound. Moving
the xmin calculation and storage to the clock tick actually makes the
bound tighter. The ordering between xmin calculation and clock
update/read needs to be correct to ensure the invariant.
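
In rough pseudocode, paraphrasing the current logic rather than quoting
exact source:

/* when limiting xmin for early cleanup: */
ts = AlignTimestampToMinuteBoundary(GetSnapshotCurrentTimestamp())
    - old_snapshot_threshold * USECS_PER_MINUTE;
if (ts > threshold_timestamp)
    SetOldSnapshotThresholdTimestamp(ts, latest_xmin);  /* bump the bound */

/* a snapshot is subject to the limited xmin only if it is older: */
if (snapshot->whenTaken < threshold_timestamp)
    xlimit = threshold_xid;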

Regards,
Ants Aasma


Re: [HACKERS] what to revert

2016-05-04 Thread Ants Aasma
On 5 May 2016 1:28 a.m., "Andres Freund" <and...@anarazel.de> wrote:
> On 2016-05-04 18:22:27 -0400, Robert Haas wrote:
> > How would the semantics change?
>
> Right now the time for computing the snapshot is relevant, if
> maintenance of xids is moved, it'll likely be tied to the time xids are
> assigned. That seems perfectly alright, but it'll change behaviour.

FWIW moving the maintenance to a clock tick process will not change user
visible semantics in any significant way. The change could easily be made
in the next release.

Regards,
Ants Aasma


Re: [HACKERS] what to revert

2016-05-03 Thread Ants Aasma
On Tue, May 3, 2016 at 9:57 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> If you tell me how to best test it, I do have a 4-socket server sitting idly
> in the corner (well, a corner reachable by SSH). I can get us some numbers,
> but I haven't been following the snapshot_too_old so I'll need some guidance
> on what to test.

I worry about two contention points with the current implementation.

The main one is the locking within MaintainOldSnapshotTimeMapping(),
which gets called every time a snapshot is taken. AFAICS this should
show up by setting old_snapshot_threshold to any positive value and
then running a simple read-only pgbench, with a scale factor that fits
within shared_buffers, at high concurrency (the number of CPUs or a
small multiple). On a single socket system this does not show up.

The second one is probably a bit harder to hit:
GetOldSnapshotThresholdTimestamp() has a spinlock that gets hit
every time a scan sees a page that has been modified after the snapshot
was taken. A workload that would tickle this is something that uses a
repeatable read snapshot, builds a non-temporary table and runs
reporting on it. Something like this would work:

BEGIN ISOLATION LEVEL REPEATABLE READ;
DROP TABLE IF EXISTS test_:client_id;
CREATE TABLE test_:client_id (x int, filler text);
INSERT INTO test_:client_id  SELECT x, repeat(' ', 1000) AS filler
FROM generate_series(1,1000) x;
SELECT (SELECT COUNT(*) FROM test_:client_id WHERE x != y) FROM
generate_series(1,1000) y;
COMMIT;

With this script running with -c4 on a 4 core workstation I'm seeing
the following kind of contention and a >2x loss in throughput:

+   14.77%  postgres  postgres   [.] GetOldSnapshotThresholdTimestamp
-    8.01%  postgres  postgres   [.] s_lock
   - s_lock
      + 88.15% GetOldSnapshotThresholdTimestamp
      + 10.47% TransactionIdLimitedForOldSnapshots
      +  0.71% TestForOldSnapshot_impl
      +  0.57% GetSnapshotCurrentTimestamp

Now this is kind of an extreme example, but I'm willing to bet that on
multi socket hosts similar issues can crop up with common real world
use cases.

Regards,
Ants Aasma




Re: [HACKERS] Re: [COMMITTERS] pgsql: Avoid extra locks in GetSnapshotData if old_snapshot_threshold

2016-05-02 Thread Ants Aasma
On Mon, May 2, 2016 at 5:21 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2016-05-02 09:03:19 -0400, Robert Haas wrote:
>> On Fri, Apr 29, 2016 at 6:08 PM, Kevin Grittner <kgri...@gmail.com> wrote:
>> > Now to continue with the performance benchmarks.  I'm pretty sure
>> > we've fixed the problems when the feature is disabled
>> > (old_snapshot_threshold = -1), and there are several suggestions
>> > for improving performance while it is on that need to be compared
>> > and benchmarked.
>>
>> If anyone thinks that the issue with the feature disabled is NOT
>> fixed, please speak up!  I'm moving the corresponding open item to
>> CLOSE_WAIT status, meaning that it will be closed if nobody shows up
>> to say that there is still an issue.
>
> Well, I don't agree that the feature is in a releaseable state. The
> datastructure is pretty much non-scalable, and maintained on the wrong
> side (every read, instead of once in writing writing xacts). There's no
> proposal actually addressing the scalability issues.

Unless I'm missing something fundamental, the feature only requires
tracking an upper bound on the xmin observed by snapshots between clock
ticks. The simplest way to do this would be a periodic process that
increments a clock counter (a 32bit counter would be plenty) and then
calculates the xmin for the preceding range. With this scheme
GetSnapshotData would need two atomic fetches to get the current LSN and
the timestamp. The test for an old snapshot can also run completely lock
free with a single atomic fetch of the threshold timestamp. The negative
side is that we need to have a process running that runs the clock ticks
and the ticks may sometimes be late. Giving something like autovacuum
launcher this task doesn't seem too bad and the consequence of falling
behind is just delayed timing out of old snapshots.
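
A sketch with made-up variable names, to make the reads and writes
concrete (the clock keeper could live in e.g. the autovacuum launcher):

/* clock keeper, once per tick: */
pg_atomic_fetch_add_u32(&osc->tick, 1);
pg_atomic_write_u64(&osc->current_timestamp, now);
pg_atomic_write_u64(&osc->threshold_timestamp, now - threshold_usecs);
/* ...and record the xmin upper bound observed during the previous tick */

/* GetSnapshotData, two atomic fetches (assumes atomic 64bit reads): */
snapshot->lsn = (XLogRecPtr) pg_atomic_read_u64(&xlog_insert_pos);
snapshot->whenTaken = (int64) pg_atomic_read_u64(&osc->current_timestamp);

/* old snapshot test, a single atomic fetch: */
if (snapshot->whenTaken < pg_atomic_read_u64(&osc->threshold_timestamp) &&
    PageGetLSN(page) > snapshot->lsn)
    SnapshotTooOldError();      /* hypothetical */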

As far as I can see this approach would get rid of any scalability
issues, but it is a pretty significant change and requires 64bit
atomic reads to get rid of contention on xlog insert lock.

Regards,
Ants Aasma




Re: [HACKERS] Re: [COMMITTERS] pgsql: Avoid extra locks in GetSnapshotData if old_snapshot_threshold

2016-04-21 Thread Ants Aasma
On Thu, Apr 21, 2016 at 5:16 PM, Kevin Grittner <kgri...@gmail.com> wrote:
> On Wed, Apr 20, 2016 at 8:08 PM, Ants Aasma <ants.aa...@eesti.ee> wrote:
>
>> However, while checking out if my proof of concept patch actually
>> works I hit another issue. I couldn't get my test for the feature to
>> actually work. The test script I used is attached.
>
> Could you provide enough to make that a self-contained reproducible
> test case (i.e., that I don't need to infer or re-write any steps
> or guess how to call it)?  In previous cases people have given me
> where they felt that the feature wasn't working there have have
> been valid reasons for it to behave as it was (e.g., a transaction
> with a transaction ID and an xmin which prevented cleanup from
> advancing).  I'll be happy to look at your case and see whether
> it's another such case or some bug, but it seems a waste to reverse
> engineer or rewrite parts of the test case to do so.

Just to be sure I didn't have anything screwy in my build environment
I redid the test on a freshly installed Fedora 23 VM. Steps to
reproduce:

1. Build postgresql from git. I used ./configure --enable-debug
--enable-cassert --prefix=/home/ants/pg-master
2. Set up database:

cat << EOF > test-settings.conf
old_snapshot_threshold = 1min

logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql.log'
log_line_prefix = '[%m] '
log_autovacuum_min_duration = 0
EOF

pg-master/bin/initdb data/
cat test-settings.conf >> data/postgresql.conf
pg-master/bin/pg_ctl -D data/ start
pg-master/bin/createdb

3. Install python-psycopg2 and get the test script from my earlier e-mail [1]
4. Run the test:

python test_oldsnapshot.py "host=/tmp"

5. Observe that the table keeps growing even after the old snapshot
threshold is exceeded and autovacuum has run. Autovacuum log shows 0
tuples removed.

Only the write workload has an xid assigned; the other two backends
only have snapshot held:

[ants@localhost ~]$ pg-master/bin/psql -c "SELECT application_name,
backend_xid, backend_xmin, NOW()-xact_start AS tx_age, state FROM
pg_stat_activity"
   application_name   | backend_xid | backend_xmin |     tx_age      |        state
----------------------+-------------+--------------+-----------------+---------------------
 write-workload       |       95637 |              | 00:00:00.009314 | active
 long-unrelated-query |             |         1806 | 00:04:33.914048 | active
 interfering-query    |             |         2444 | 00:04:32.910742 | idle in transaction
 psql                 |             |        95637 | 00:00:00        | active

Output from the test tool attached. After killing the test tool and
the long running query autovacuum cleans stuff as expected.

I'm too tired right now to chase this down myself. The mental toll
that two small kids can take is pretty staggering. But I might find
the time to fire up a debugger sometime tomorrow.

Regards,
Ants Aasma

[1] http://www.postgresql.org/message-id/attachment/43859/test_oldsnapshot.py
[ants@localhost ~]$ python test_oldsnapshot.py "host=/tmp"
[21:37:56]  Starting 1800s long query
[21:37:56]  old_snapshot_threshold = 1min
[21:37:56]  High throughput table size @     0s. Size    176kB Last vacuum None ago
[21:37:57]  Counted 1000 rows with max 1637 in high_throughput table
[21:38:06]  High throughput table size @    10s. Size    952kB Last vacuum None ago
[21:38:07]  Counted 1000 rows with max 1637 in high_throughput table
[21:38:16]  High throughput table size @    20s. Size   1768kB Last vacuum None ago
[21:38:17]  Counted 1000 rows with max 1637 in high_throughput table
[21:38:26]  High throughput table size @    30s. Size   2640kB Last vacuum None ago
[21:38:27]  Counted 1000 rows with max 1637 in high_throughput table
[21:38:36]  High throughput table size @    40s. Size   3488kB Last vacuum None ago
[21:38:37]  Counted 1000 rows with max 1637 in high_throughput table
[21:38:46]  High throughput table size @    50s. Size   4328kB Last vacuum 0:00:02.339652 ago
[21:38:47]  Counted 1000 rows with max 1637 in high_throughput table
[21:38:56]  High throughput table size @    60s. Size   4832kB Last vacuum 0:00:12.342278 ago
[21:38:57]  Counted 1000 rows with max 1637 in high_throughput table
[21:39:06]  High throughput table size @    70s. Size   4920kB Last vacuum 0:00:22.425250 ago
[21:39:07]  Counted 1000 rows with max 1637 in high_throughput table
[21:39:16]  High throughput table size @    80s. Size   5016kB Last vacuum 0:00:32.431971 ago
[21:39:17]  Counted 1000 rows with max 1637 in high_throughput table
[21:39:26]  High throughput table size @    90s. Size   5112kB Last vacuum 0:00:42.431730 ago
[21:39:27]  Counted 1000 rows with max 1637 in high_throughput table
[21:39:36]  High throughput table size @   100s. Size   5200kB Last vacuum 0:00:52.448369 ago
[21:39:37]  Counted 1000

Re: [HACKERS] Re: [COMMITTERS] pgsql: Avoid extra locks in GetSnapshotData if old_snapshot_threshold

2016-04-20 Thread Ants Aasma
On Tue, Apr 19, 2016 at 6:11 PM, Kevin Grittner <kgri...@gmail.com> wrote:
> On Tue, Apr 19, 2016 at 9:57 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Sun, Apr 17, 2016 at 2:26 AM, Andres Freund <and...@anarazel.de> wrote:
>>>
>>> On 2016-04-16 16:44:52 -0400, Noah Misch wrote:
>>> > That is more controversial than the potential ~2% regression for
>>> > old_snapshot_threshold=-1.  Alvaro[2] and Robert[3] are okay releasing
>>> > that way, and Andres[4] is not.
>>>
>>> FWIW, I could be kinda convinced that it's temporarily ok, if there'd be
>>> a clear proposal on the table how to solve the scalability issue around
>>> MaintainOldSnapshotTimeMapping().
>>
>> It seems that for read-only workloads, MaintainOldSnapshotTimeMapping()
>> takes EXCLUSIVE LWLock which seems to be a probable reason for a performance
>> regression.  Now, here the question is do we need to acquire that lock if
>> xmin is not changed since the last time value of
>> oldSnapshotControl->latest_xmin is updated or xmin is lesser than equal to
>> oldSnapshotControl->latest_xmin?
>> If we don't need it for above cases, I think it can address the performance
>> regression to a good degree for read-only workloads when the feature is
>> enabled.
>
> Thanks, Amit -- I think something along those lines is the right
> solution to the scaling issues when the feature is enabled.  For
> now I'm focusing on the back-patching issues and the performance
> regression when the feature is disabled, but I'll shift focus to
> this once the "killer" issues are in hand.

I had an idea I wanted to test out. The gist of it is to effectively
have the last slot of the timestamp-to-xid map stored in the latest_xmin
field and only update the mapping when slot boundaries are crossed.
See attached WIP patch for details. This way the exclusive lock only
needs to be acquired once per minute. The common case is a spinlock
that could be replaced with atomics later. And it seems to me that the
mutex_threshold taken in TestForOldSnapshot() can also get pretty hot
under some workloads, so that may also need some tweaking.
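To make the intended locking pattern concrete, the fast path would look
roughly like the sketch below (names are modeled on the snapmgr.c
structures but simplified; treat this as pseudocode for the approach,
not the literal patch contents):

	/* Fast path: still inside the current one-minute slot. */
	if (ts < oldSnapshotControl->next_map_update)
	{
		SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
		if (TransactionIdPrecedes(oldSnapshotControl->latest_xmin, xmin))
			oldSnapshotControl->latest_xmin = xmin;
		SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
	}
	else
	{
		/*
		 * Slot boundary crossed: fold latest_xmin into the timestamp
		 * to xid map and advance next_map_update, so the exclusive
		 * lock is taken at most once per minute.
		 */
		LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
		MaintainOldSnapshotTimeMapping(ts, xmin);
		LWLockRelease(OldSnapshotTimeMapLock);
	}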

I think a better approach would be to base the whole mechanism on a
periodically updated counter instead of timestamps. The autovacuum
launcher looks like a good candidate to play the clock keeper; without
it the feature has little point anyway. AFAICS only the clock keeper
needs to have the timestamp-to-xid mapping, others can make do with a
couple of periodically updated values. I haven't worked it out in
detail, but it feels like the code would be simpler. But this was a
larger change than I felt comfortable trying out, so I went with the
simple change first.

However, while checking out if my proof of concept patch actually
works I hit another issue. I couldn't get my test for the feature to
actually work. The test script I used is attached. Basically I have a
table with 1000 rows, one high throughput worker deleting old rows and
inserting new ones, one long query that acquires a snapshot and sleeps
for 30min, and one worker that has a repeatable read snapshot and
periodically does count(*) on the table. Based on the documentation I
would expect the following:

* The interfering query gets cancelled
* The long running query gets to run
* Old rows will start to be cleaned up after the threshold expires.

However, testing on commit 9c75e1a36b6b2f3ad9f76ae661f42586c92c6f7c,
I'm seeing that the old rows do not get cleaned up, and that I'm only
seeing the interfering query get cancelled when old_snapshot_threshold
= 0. Larger values do not result in cancellation. Am I doing something
wrong or is the feature just not working at all?

Regards,
Ants Aasma
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8aa1f49..dc00d91 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -80,8 +80,11 @@ typedef struct OldSnapshotControlData
 	 */
 	slock_t		mutex_current;			/* protect current timestamp */
 	int64		current_timestamp;		/* latest snapshot timestamp */
-	slock_t		mutex_latest_xmin;		/* protect latest snapshot xmin */
+	slock_t		mutex_latest_xmin;		/* protect latest snapshot xmin
+		 * and next_map_update
+		 */
 	TransactionId latest_xmin;			/* latest snapshot xmin */
+	int64		next_map_update;		/* latest snapshot valid up to */
 	slock_t		mutex_threshold;		/* protect threshold fields */
 	int64		threshold_timestamp;	/* earlier snapshot is old */
 	TransactionId threshold_xid;		/* earlier xid may be gone */
@@ -95,7 +98,9 @@ typedef struct OldSnapshotControlData
 	 * count_used value of old_snapshot_threshold means that the buffer is
 	 * full and the head must be advanced to add new entries.  Use timestamps
 	 * aligned to minute boundaries, since that seems less 

Re: [HACKERS] raw output from copy

2016-04-12 Thread Ants Aasma
On 8 Apr 2016 9:14 pm, "Pavel Stehule" <pavel.steh...@gmail.com> wrote:
> 2016-04-08 20:54 GMT+02:00 Andrew Dunstan <and...@dunslane.net>:
>> I should add that I've been thinking about this some more, and that I now 
>> agree that something should be done to support this at the SQL level, mainly 
>> so that clients can manage very large pieces of data in a stream-oriented 
>> fashion rather than having to marshall the data in memory to load/unload via 
>> INSERT/SELECT. Anything that is client-side only is likely to have this 
>> memory issue.
>>
>> At the same time I'm still not entirely convinced that COPY is a good 
>> vehicle for this. It's designed for bulk records, and already quite complex. 
>> Maybe we need something new that uses the COPY protocol but is more 
>> specifically tailored for loading or sending large singleton pieces of data.
>
>
> Now it is little bit more time to think more about. But It is hard to design 
> some more simpler than is COPY syntax. What will support both directions.

Sorry for arriving late and adding to the bikeshedding. Maybe the
answer is to make COPY pluggable. It seems to me that it would be
relatively straightforward to add an extension mechanism for copy
output and input plugins that could support any format expressible as
a binary stream. Raw output would then be an almost trivial plugin.
Others could implement JSON, protocol buffers, Redis bulk load, BSON,
ASN.1 or whatever other serialisation format is du jour. It will still
have the same backwards compatibility issues as adding the raw output,
but the payoff is greater.
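To make that concrete, the extension point could be as small as a
struct of callbacks. Everything below is invented for illustration (no
such API exists today), with TupleDesc and Datum as in the backend:

typedef struct CopyFormatRoutine
{
	/* called once before the first row, may inspect the tuple descriptor */
	void		(*startup) (void *state, TupleDesc desc);
	/* serialize one tuple into the output byte stream */
	void		(*emit_row) (void *state, Datum *values, bool *nulls);
	/* parse the next row from the input stream, returning false at EOF */
	bool		(*read_row) (void *state, Datum *values, bool *nulls);
	/* called once after the last row */
	void		(*shutdown) (void *state);
} CopyFormatRoutine;

A raw plugin would then be nearly trivial: startup checks for a single
bytea or text column, emit_row writes the datum verbatim, and read_row
slurps the whole remaining stream as one value.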

Regards,
Ants Aasma




Re: [HACKERS] Using quicksort for every external sort run

2015-12-12 Thread Ants Aasma
On Sat, Dec 12, 2015 at 12:41 AM, Greg Stark <st...@mit.edu> wrote:
> On Wed, Dec 9, 2015 at 2:44 AM, Peter Geoghegan <p...@heroku.com> wrote:
>> On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark <st...@mit.edu> wrote:
>>
>> I guess you mean insertion sort. What's the theoretical justification
>> for the change?
>
> Well my thinking was that hard coding a series of comparisons would be
> faster than a loop doing a O(n^2) algorithm even for small constants.
> And sort networks are perfect for hard coded sorts because they do the
> same comparisons regardless of the results of previous comparisons so
> there are no branches. And even better the comparisons are as much as
> possible independent of each other -- sort networks are typically
> measured by the depth which assumes any comparisons between disjoint
> pairs can be done in parallel. Even if it's implemented in serial the
> processor is probably parallelizing some of the work.
>
> So I implemented a quick benchmark outside Postgres based on sorting
> actual SortTuples with datum1 defined to be random 64-bit integers (no
> nulls). Indeed the sort networks perform faster on average despite
> doing more comparisons. That makes me think the cpu is indeed doing
> some of the work in parallel.

The open-coded version you shared bloats the code by 37kB; I'm not
sure it is pulling its weight, especially given relatively heavy
comparators. A quick index creation test on int4's profiled with perf
shows about 3% of CPU being spent in the code being replaced. Any
improvement on that is going to be too small to easily quantify.

Since the open coding doesn't help with eliminating control-flow
dependencies, my idea is to encode the sort network comparison
order in an array and use that to drive a simple loop (a sketch
follows below). The code size would be pretty similar to insertion
sort and the loop overhead should mostly be hidden by the CPU's
out-of-order machinery. Probably won't help much, but it would be
interesting and simple enough to try out. Can you share your code
for the benchmark so I can try it out?
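For illustration, the table-driven variant for four elements could look
like this standalone sketch, sorting bare 64-bit keys instead of
SortTuples:

#include <stddef.h>
#include <stdint.h>

/* optimal 5-comparator sorting network for 4 inputs, as index pairs */
static const uint8_t net4[][2] = {
	{0, 1}, {2, 3}, {0, 2}, {1, 3}, {1, 2}
};

static void
sort_network4(int64_t *a)
{
	for (size_t i = 0; i < sizeof(net4) / sizeof(net4[0]); i++)
	{
		int64_t		x = a[net4[i][0]];
		int64_t		y = a[net4[i][1]];

		/*
		 * Compare-and-swap that the compiler can turn into conditional
		 * moves: the same comparisons run regardless of the data.
		 */
		a[net4[i][0]] = (x < y) ? x : y;
		a[net4[i][1]] = (x < y) ? y : x;
	}
}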

Regards,
Ants Aasma




Re: [HACKERS] W-TinyLfu for cache eviction

2015-12-09 Thread Ants Aasma
On Wed, Dec 9, 2015 at 11:31 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
> I expect synchronization issues with implementation of this algorithm. It
> seems to be hard to avoid some global critical section which can cause
> significant performance degradation at MPP systems (see topic "Move
> PinBuffer and UnpinBuffer to atomics").

The sketch updates don't really need to be globally consistent;
anything that is faster than the cycling of buffers through the LRU
tier would be enough. In principle atomic increments could be used, but
funneling all buffer hits onto a small number of cachelines and then
doing 4x the amount of writes would result in extremely nasty cache
line ping-ponging. The Caffeine implementation appears to solve this by
batching the frequency sketch updates using a striped set of circular
buffers. Re-implementing this is not exactly trivial, and some prior
experience with lock-free programming seems well advised.

Based on the paper it looks like this algorithm can work with a low
quality primary eviction algorithm. Swapping our GCLOCK out for
something simpler like plain CLOCK could simplify an atomics based
PinBuffer approach. Another interesting avenue for research would be
to use ideas in TinyLFU to implement a tier of "nailed" buffers that
have backend local or striped pin-unpin accounting.

But for checking if the replacement policy implemented by W-TinyLFU is
good it isn't necessary to have a performant locking solution. I think
a global lock would be good enough for a proof of concept that only
evaluates cache hit ratios.
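For reference, the frequency sketch at the heart of TinyLFU is small
enough to prototype in a few dozen lines. A minimal single-threaded
version (illustrative only, with byte counters instead of packed 4-bit
ones and no striped update buffers) could look like:

#include <stdint.h>

#define CMS_WIDTH	4096			/* counters per row, power of two */
#define CMS_DEPTH	4				/* independent rows */

typedef struct
{
	uint8_t		counts[CMS_DEPTH][CMS_WIDTH];
	uint64_t	additions;
} cm_sketch;

static uint32_t
cms_slot(uint64_t key, int row)
{
	/* cheap per-row mixing, good enough for a proof of concept */
	key ^= (uint64_t) (row + 1) * 0x9e3779b97f4a7c15ULL;
	key *= 0xff51afd7ed558ccdULL;
	key ^= key >> 33;
	return (uint32_t) key & (CMS_WIDTH - 1);
}

static void
cms_add(cm_sketch *s, uint64_t key)
{
	for (int row = 0; row < CMS_DEPTH; row++)
	{
		uint8_t    *c = &s->counts[row][cms_slot(key, row)];

		if (*c < UINT8_MAX)
			(*c)++;
	}

	/* periodically halve every counter so stale popularity decays */
	if (++s->additions % (10 * CMS_WIDTH) == 0)
		for (int row = 0; row < CMS_DEPTH; row++)
			for (int i = 0; i < CMS_WIDTH; i++)
				s->counts[row][i] >>= 1;
}

static uint8_t
cms_estimate(const cm_sketch *s, uint64_t key)
{
	uint8_t		min = UINT8_MAX;

	for (int row = 0; row < CMS_DEPTH; row++)
	{
		uint8_t		c = s->counts[row][cms_slot(key, row)];

		if (c < min)
			min = c;
	}
	return min;
}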

Regards,
Ants Aasma




Re: [HACKERS] Re: Revisiting pg_stat_statements and IN() (Was: Re: pg_stat_statements fingerprinting logic and ArrayExpr)

2015-11-24 Thread Ants Aasma
On Tue, Nov 24, 2015 at 2:25 PM, Greg Stark <st...@mit.edu> wrote:
> On Tue, Nov 24, 2015 at 7:53 AM, Peter Geoghegan <p...@heroku.com> wrote:
>> * How do other people feel about this? Personally, I've seen enough
>> problems of this kind in the field that "slippery slope" arguments
>> against this don't seem very compelling.

+1 for changing jumbling logic to consider any number of constants as identical.

> I have also seen code where I would have needed *not* to have this
> jumbling though. I think this is a general problem with jumbling that
> there needs to be some kind of intelligence that deduces when it's
> important to break out the statistics by constant. In my case it was
> an IN query where specific values were very common but others very
> rare. Partial indexes ended up being the solution and we had to
> identify which partial indexes were needed.

This is an issue with jumbling in general and only very vaguely related
to merging the number of constants in IN. I think a better solution for
this type of issue would be a way to sample the parameter values and
possibly EXPLAIN ANALYZE results to the logs. Having runtime intelligence
in PostgreSQL capable of distinguishing interesting variations from
irrelevant ones doesn't seem like a feasible plan. In my view the best
we could do is aim to have entries roughly correspond to application
query invocation points and leave the more complex statistical analysis
use cases to more specialized tools.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] parallelism and sorting

2015-11-24 Thread Ants Aasma
On Tue, Nov 24, 2015 at 5:29 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Nov 23, 2015 at 8:45 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> Beyond that, CREATE INDEX and CLUSTER utility
>> cases will also need to be parallelized without all this executor
>> infrastructure.
>
> Or, alternatively, CREATE INDEX and CLUSTER could be refactored to use
> the executor.  This is might sound crazy, but maybe it's not.  Perhaps
> we could have the executor tree output correctly-formed index tuples
> that get funneled into a new kind of DestReceiver that puts them into
> the index.  I don't know if that's a GOOD idea, but it's an idea.

Having CREATE INDEX use the executor seems like a useful idea for
reasons unrelated to parallelism.

The use case I have in mind is a table containing multiple years' worth
of (approximately) time series data, where the overwhelming majority of
queries are explicitly interested in recent data. Having a partial
index with WHERE tstamp > $some_recent_tstamp cutting out 90+% of
tuples was extremely helpful for performance, both for index size
reasons and for having to process fewer tuples. This index needs to be
periodically rebuilt with a newer timestamp constant, and the rebuild
would be a lot faster if it could use the existing index to perform an
index only scan of 10% of the data instead of scanning and sorting the
full table.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-11 Thread Ants Aasma
On Wed, Nov 11, 2015 at 11:22 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Wed, Nov 11, 2015 at 9:42 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> On 11/11/2015 10:23 AM, Simon Riggs wrote:
>>> Thanks for working on this issue.
>>
>> +1.

+1. I have seen a lot of interest for something along these lines.

>> I'm thinking the client should get some kind of a token back from the
>> commit, and it could use the token on the standby, to wait for that commit
>> to be applied. The token could be just the XID, or the LSN of the commit
>> record. Or the application could generate the token and pass it to the
>> server in the commit, similar to how 2PC works. So the interaction would be
>> something like:
>>
>> In master:
>> BEGIN;
>> INSERT INTO FOO ...;
>> COMMIT;
>> Server returns: COMMITted with token 1234
>>
>> Later, in standby:
>> BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;
>> SELECT * FROM foo;
>> ...

To avoid read anomalies (backwards time travel) it should also be
possible to receive a token from read-only transactions based on the
latest snapshot used.

> My thinking was that the reason for wanting to load balance over a set of
> hot standbys is because you have a very read-heavy workload, so it makes
> sense to tax the writers and leave the many dominant readers unburdened, so
> (3) should be better than (2) for the majority of users who want such a
> configuration.  (Note also that it's not a requirement to tax every write;
> with this proposal you can set causal_reads to off for those transactions
> where you know there is no possibility of a causally dependent read).
>
> As for (1), my thinking was that most application developers would probably
> prefer not to have to deal with that type of interface.  For users who do
> want to do that, it would be comparatively simple to make that possible, and
> would not conflict with this proposal.  This proposal could be used by
> people retrofitting load balancing to an existing applications with relative
> ease, or simply not wanting to have to deal with LSNs and complexity.  (I
> have considered proposing pg_wait_for_xlog_replay_location(lsn, timeout)
> separately, which could be called on a standby with the lsn obtained from
> pg_current_xlog_location() on the primary any time after a COMMIT completes,
> but I was thinking of that as a different feature addressing a different
> user base: people prepared to do more work to squeeze out some extra
> performance.)

Although I still think that 1) is the correct long-term solution, I
must say that I agree with the reasoning presented. I think we should
review the API in light of the fact that in the future we might have a
mix of clients: some able to keep track of causality tokens, which
either want to wait when a read request arrives or pick a host based
on the token, and "dumb" clients that want to use write-side waits.

Also, it should be possible to configure which standbys are considered
for waiting on. Otherwise a reporting slave will occasionally catch up
enough to be considered "available" and then cause a latency peak when
a long query blocks apply again.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Allow snapshot too old error, to prevent bloat

2015-02-19 Thread Ants Aasma
On Thu, Feb 19, 2015 at 6:01 PM, Kevin Grittner <kgri...@ymail.com> wrote:
> I can see how they would be, provided we can be confident that we're
> going to actually throw an error when the snapshot is out of date and
> not end up returning incorrect results.  We need to be darn sure of
> that, both now and in a few years from now when many of us may have
> forgotten about this knob.. ;)
>
> I get that this is not zero-cost to maintain, and that it might not
> be of interest to the community largely for that reason.  If you
> have any ideas about how to improve that, I'm all ears.  But do
> consider the actual scope of what was needed for the btree coverage
> (above).  (Note that the index-only scan issue doesn't look like
> anything AM-specific; if something is needed for that it would be
> in the pruning/vacuum area.)

If I understood the issue correctly, you have long running snapshots
accessing one part of the data, while you have high churn on a
disjoint part of the data. We would need to enable vacuum on the high
churn data while still being able to run those long queries. The
"snapshot too old" solution limits the maximum age of global xmin at
the cost of having to abort transactions that are older than this and
happen to touch a page that was modified after they were started.

What about having the long running snapshots declare their working
set, and then only take them into account for global xmin for
relations that are in the working set? Like a SET TRANSACTION WORKING
SET command. This way the error is deterministic, vacuum on the high
churn tables doesn't have to wait for the old transaction delay to
expire, and we avoid a hard to tune GUC (what do I need to set
old_snapshot_threshold to, to reduce bloat while not having normal
transactions abort?).

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Allow snapshot too old error, to prevent bloat

2015-02-19 Thread Ants Aasma
On Feb 19, 2015 10:31 PM, Kevin Grittner <kgri...@ymail.com> wrote:
>> What about having the long running snapshots declare their working
>> set, and then only take them into account for global xmin for
>> relations that are in the working set? Like a SET TRANSACTION WORKING
>> SET command. This way the error is deterministic, vacuum on the high
>> churn tables doesn't have to wait for the old transaction delay to
>> expire and we avoid a hard to tune GUC (what do I need to set
>> old_snapshot_threshold to, to reduce bloat while not having normal
>> transactions abort).
>
> Let me make sure I understand your suggestion.  You are suggesting
> that within a transaction you can declare a list of tables which
> should get the "snapshot too old" error (or something like it) if a
> page is read which was modified after the snapshot was taken?
> Snapshots within that transaction would not constrain the effective
> global xmin for purposes of vacuuming those particular tables?

Sorry, I should have been clearer. I'm proposing that a transaction can
declare what tables it will access. After that the transaction will
constrain xmin for only those tables. Accessing any other table would give
an error immediately.

> If I'm understanding your proposal, that would help some of the
> cases the "snapshot too old" case addresses, but it would not
> handle the "accidental idle in transaction" case (e.g., start a
> repeatable read transaction, run one query, then sit idle
> indefinitely).  I don't think it would work quite as well for some
> of the other cases I've seen, but perhaps it could be made to fit
> if we could figure out the "accidental idle in transaction" case.

Accidental idle in transaction seems better handled by just terminating
transactions older than some timeout. This is already doable externally,
but an integrated solution would be nice if we could figure out how the
permissions for setting such a timeout work.

> I also think it might be hard to develop a correct global xmin for
> vacuuming a particular table with that solution.  How do you see
> that working?

I'm imagining the long running transaction would register its xmin in a
shared map keyed by relation for each referenced relation and remove the
xmin from procarray. Vacuum would access this map by relation, determine
the minimum and use that if it's earlier than the global xmin. I'm being
deliberately vague here about the data structure in shared memory as I
don't have a great idea what to use there. It's somewhat similar to the
lock table in that in theory the size is unbounded, but in practice it's
expected to be relatively tiny.

Regards,
Ants Aasma


[HACKERS] Using RTLD_DEEPBIND to handle symbol conflicts in loaded libraries

2014-11-26 Thread Ants Aasma
I had to make oracle_fdw work with PostgreSQL compiled using
--with-ldap. The issue there is that Oracle's client library has the
delightful property of linking against a ldap library they bundle that
has symbol conflicts with OpenLDAP. At PostgreSQL startup libldap is
loaded, so when libclntsh.so (the Oracle client) is loaded it gets
bound to OpenLDAP symbols, and unsurprisingly crashes with a segfault
when those functions get used.

glibc-2.3.4+ has a flag called RTLD_DEEPBIND for dlopen that prefers
symbols loaded by the library to those provided by the caller. Using
this flag fixes my issue, PostgreSQL gets the ldap functions from
libldap, Oracle client gets them from whatever it links to. Both work
fine.

Attached is a patch that enables this flag on Linux when available.
This specific case could also be fixed by rewriting oracle_fdw to use
dlopen for libclntsh.so and pass this flag, but I think it would be
better to enable it for all PostgreSQL loaded extension modules. I
can't think of a sane use case where it would be correct to prefer
PostgreSQL loaded symbols to those the library was actually linked
against.

Does anybody know of a case where this flag wouldn't be a good idea?
Are there any similar options for other platforms? Alternatively, does
anyone know of linker flags that would give a similar effect?

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
diff --git a/src/backend/port/dynloader/linux.h b/src/backend/port/dynloader/linux.h
index db2ac66..34fd35c 100644
--- a/src/backend/port/dynloader/linux.h
+++ b/src/backend/port/dynloader/linux.h
@@ -25,8 +25,12 @@
 /*
  * In some older systems, the RTLD_NOW flag isn't defined and the mode
  * argument to dlopen must always be 1.  The RTLD_GLOBAL flag is wanted
- * if available, but it doesn't exist everywhere.
- * If it doesn't exist, set it to 0 so it has no effect.
+ * if available, but it doesn't exist everywhere. The RTLD_DEEPBIND flag
+ * may also be missing, but if it is available we want to enable it so
+ * extensions we load prefer symbols from libraries they were linked
+ * against to conflicting symbols from unrelated libraries loaded by
+ * PostgreSQL.
+ * If the optional flags don't exist, set them to 0 so they have no effect.
  */
 #ifndef RTLD_NOW
 #define RTLD_NOW 1
@@ -34,8 +38,11 @@
 #ifndef RTLD_GLOBAL
 #define RTLD_GLOBAL 0
 #endif
+#ifndef RTLD_DEEPBIND
+#define RTLD_DEEPBIND 0
+#endif
 
-#define pg_dlopen(f)	dlopen((f), RTLD_NOW | RTLD_GLOBAL)
+#define pg_dlopen(f)	dlopen((f), RTLD_NOW | RTLD_GLOBAL | RTLD_DEEPBIND)
 #define pg_dlsym		dlsym
 #define pg_dlclose		dlclose
 #define pg_dlerror		dlerror



Re: [HACKERS] What exactly is our CRC algorithm?

2014-11-21 Thread Ants Aasma
On Fri, Nov 21, 2014 at 12:11 PM, Abhijit Menon-Sen <a...@2ndquadrant.com> wrote:
> If anyone has other suggestions, I'm all ears.

Do you have a WIP patch I could take a look at and tweak? Maybe
there's something about the compiler's code generation that could be
improved.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Failback to old master

2014-11-13 Thread Ants Aasma
On Thu, Nov 13, 2014 at 10:05 AM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
>> In this case the
>> old master will request recovery from a point after the timeline
>> switch and the new master will reply with an error.  So it is safe to
>> try re-adding a crashed master as a slave, but this might fail.
>
> Are you sure it's guaranteed to fail, when the master had some WAL that was
> not streamed before the crash? I'm not 100% sure about that. I thought the
> old master might continue streaming and replaying, if there just happens to
> be a record start at the same point in WAL on both timelines. I think you'll
> get an error at the next checkpoint record, because its timeline ID isn't
> what the old master expects, but it would've started up already.

It seems to me like it's guaranteed. Given slave promotion at x1 and
end of xlog on the old master at x2, x1 < x2, the master will request
streaming at tli1.x2, the WAL sender does tliSwitchPoint(tli1) to look up
x1, finds that x1 < x2 and gives the error "requested starting point
%X/%X on timeline %u is not in this server's history". The alignment
of x2 on tli2 does not play a role here.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Failback to old master

2014-11-12 Thread Ants Aasma
On Tue, Nov 11, 2014 at 11:52 PM, Maeldron T. <maeld...@gmail.com> wrote:
> As far as I remember (I can’t test it right now but I am 99% sure) promoting
> the slave makes it impossible to connect the old master to the new one
> without making a base_backup. The reason is the timeline change. It complains.

A safely shut down master (-m fast is safe) can be safely restarted as
a slave to the newly promoted master. Fast shutdown shuts down all
normal connections, does a shutdown checkpoint and then waits for this
checkpoint to be replicated to all active streaming clients. Promoting
slave to master creates a timeline switch, that prior to version 9.3
was only possible to replicate using the archive mechanism. As of
version 9.3 you don't need to configure archiving to follow timeline
switches, just add a recovery.conf to the old master to start it up as
a slave and it will fetch everything it needs from the new master.

In case of a unsafe shut down (crash) it is possible that you have WAL
lying around that was not streamed out to the slave. In this case the
old master will request recovery from a point after the timeline
switch and the new master will reply with an error. So it is safe to
try re-adding a crashed master as a slave, but this might fail.
Success is more likely when the whole operating system went down, as
then it's somewhat likely that any WAL got streamed out before it made
it to disk.

In general my suggestion is to avoid slave promotion by removal of
recovery.conf, it's too easy to get confused and end up with hard to
diagnose data corruption.

In your example, if B happens to disconnect at WAL
position x1 and remains disconnected while the shutdown on A occurs at
WAL position x2, it will be missing the WAL interval A(x1..x2). Now B
is restarted as master from position x1, generates some new WAL past
x2, then A is restarted as slave and starts streaming at x2 since, to the
best of its knowledge, that was where things left off. At this point
the slave A is corrupted: it has the x1..x2 changes from A that are not
on the master, and it is also missing some changes that are on the
master. Wrong data and/or crashes ensue.

Always use the promotion mechanism because then you are likely to get
errors when something is screwy. Unfortunately it's still possible to
end up in a corrupted state with no errors, as timeline identifiers
are sequential integers, not GUIDs, but at least it's significantly
harder.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-16 Thread Ants Aasma
On Oct 15, 2014 7:32 PM, Ants Aasma <a...@cybertec.at> wrote:
> I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
> associativity). This allows us to fit the bucket onto 2 regular sized
> cache lines and have 8 bytes left over. Buckets would be protected by
> seqlocks stored in the extra space. On the read side we would only
> need 2 read barriers (basically free on x86), and we are guaranteed to
> have an answer by fetching two pairs of cache lines. We can trade
> memory bandwidth for latency by issuing prefetches (once we add
> primitives for this). Alternatively we can trade insert speed for
> lookup speed by using asymmetrically sized tables.

I was thinking about this. It came to me that we might not even need
BufferTags to be present in the hash table. In the hash table itself
we store just a combination of a 4-byte buffer tag hash and a 4-byte
buffer id. This way we can store 8 pairs in one cacheline. Lookup must
account for the fact that due to hash collisions we may find multiple
matches, only one of which may be the buffer we are looking for. We can
make this condition exceedingly unlikely by taking advantage of the fact
that we have two hashes anyway: in each table we use the lookup hash of
the other table for tagging (a 32-bit hash collision within 8 items). By
having a reasonably high load factor (75% should work) and 8 bytes per
key we even have a pretty good chance of fitting the hotter parts of the
buffer lookup table in CPU caches.

We use a separate array for the locks (spinlocks should be OK here)
for fine-grained locking; this may be 1:1 with the buckets or we can
map many buckets to a single lock. Otherwise the scheme should work
mostly the same. Using this scheme we can get by on the lookup side
using 64-bit atomic reads with no barriers; we only need atomic ops
for pinning the found buffer.

I haven't had the chance to check out how second-chance hashing works
and if this scheme would be applicable to it.
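A rough sketch of that bucket layout and probe loop (standalone
illustration, not PostgreSQL code; verification of the candidate's
full BufferTag is left to the caller):

#include <stdint.h>

#define PAIRS_PER_BUCKET 8		/* 8 x 8 bytes = one 64-byte cache line */

typedef struct
{
	uint32_t	tag_hash;		/* the *other* table's hash of the tag */
	int32_t		buffer_id;
} BucketPair;

typedef struct
{
	BucketPair	pairs[PAIRS_PER_BUCKET];
} Bucket;

/*
 * Yields candidate buffer ids one at a time; the caller must verify
 * each candidate's full BufferTag, since two 32-bit hashes can still
 * collide. Each pair is an aligned 8-byte value, so on x86-64 a plain
 * load reads it atomically with no barriers.
 */
static int32_t
bucket_next_candidate(const Bucket *b, uint32_t other_hash, int *i)
{
	while (*i < PAIRS_PER_BUCKET)
	{
		BucketPair	p = b->pairs[(*i)++];

		if (p.tag_hash == other_hash)
			return p.buffer_id;
	}
	return -1;					/* no more candidates in this bucket */
}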

Regards,
Ants Aasma
-- 
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-15 Thread Ants Aasma
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> With regard for using a hash table for the buffer mapping lock I'm
>> doubtful that any form of separate chaining is the right one. We
>> currently have a quite noticeable problem with the number of cache
>> misses in the buffer mapping hash (and also the heavyweight lock mgr) -
>> if we stay with hashes that's only going to be fundamentally lower than
>> dynahash if we change the type of hashing. I've had good, *preliminary*,
>> results using an open addressing + linear probing approach.
>
> I'm very skeptical of open addressing + linear probing.  Such hash
> tables tend to degrade over time, because you have to leave tombstones
> behind to ensure that future scans advance far enough.  There's really
> no way to recover without rebuilding the whole hash table, and
> eventually it degrades to linear search.  If we're spending too much
> time walking hash chains, I think the solution is to increase the
> number of buckets so that the chains get shorter.

What about cuckoo hashing? There was a recent paper on how to do fine
grained locking with cuckoo hashes. [1]

I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.

[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
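The read side of the seqlock scheme sketched above would look roughly
like this in C11 atomics (illustrative only; it glosses over the formal
data-race rules for the unprotected key/value reads):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUCKET_SLOTS 5			/* 5-way associative bucket */

typedef struct
{
	uint64_t	keys[BUCKET_SLOTS];		/* stand-in for the BufferTag */
	int32_t		values[BUCKET_SLOTS];	/* buffer ids */
	atomic_uint	seq;					/* odd while a writer is active */
} Bucket;

static bool
bucket_lookup(Bucket *b, uint64_t key, int32_t *value)
{
	for (;;)
	{
		unsigned	s1 = atomic_load_explicit(&b->seq, memory_order_acquire);
		bool		found = false;

		if (s1 & 1)
			continue;			/* writer active, retry */

		for (int i = 0; i < BUCKET_SLOTS; i++)
		{
			if (b->keys[i] == key)
			{
				*value = b->values[i];
				found = true;
				break;
			}
		}

		/* second read barrier: retry if a writer interleaved with us */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&b->seq, memory_order_relaxed) == s1)
			return found;
	}
}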

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Value locking Wiki page

2014-10-01 Thread Ants Aasma
On Wed, Oct 1, 2014 at 2:42 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> GiST supports exclusion constraints. That is one of the main reasons I want
>> to do promise tuples, instead of locking within the indexam: to support this
>> feature with exclusion constraints.
>
> That does sound interesting, but I am concerned the semantics may cause
> issues.
>
> If I go to insert a row for 'UK' and find an existing row for
> 'Europe', do we really want to update the population of Europe to be
> the population of the UK, simply because the UK and Europe have an
> exclusion conflict?
>
> Please give some concrete examples of a business request that might be
> satisfied by such a feature.

The ON CONFLICT UPDATE semantics don't seem particularly useful for
exclusion constraints. However, a reasonable business request for
exclusion constraints would be a "boss mode" for the canonical
room reservation example: an INSERT that is guaranteed not to fail by
either deleting conflicting rows or updating them so the exclusion
constraints don't overlap (e.g. truncate the time intervals) or the
rows fail the index predicate (e.g. soft delete). AFAICS this is
currently not possible to implement correctly without a retry loop.

The hypothetical ON CONFLICT REPLACE and ON CONFLICT
UPDATE-AND-THEN-INSERT modes would also make sense in the unique index
case.

Not saying that I view this as necessary for the first cut of the
feature, just providing an example where it could be useful.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Scaling shared buffer eviction

2014-09-26 Thread Ants Aasma
On Fri, Sep 26, 2014 at 3:26 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> Neither, really. The hash calculation is visible in the profile, but not
> that pronounced yet. The primary thing noticeable in profiles (besides
> cache efficiency) is the comparison of the full tag after locating a
> possible match in a bucket. 20 byte memcmp's aren't free.

I'm not arguing against a more efficient hash table, but one simple
win would be to have a buffer tag specific comparison function. I mean
something like the attached patch. This way we avoid the loop counter
overhead, can check the blocknum first and possibly have better branch
prediction.

Do you have a workload where I could test if this helps alleviate the
comparison overhead?

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 7a38f2f..ba37b5f 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -34,6 +34,7 @@ typedef struct
 
 static HTAB *SharedBufHash;
 
+static int BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n);
 
 /*
  * Estimate space needed for mapping hashtable
@@ -46,6 +47,19 @@ BufTableShmemSize(int size)
 }
 
 /*
+ * Compare contents of two buffer tags. We use this instead of the default
+ * memcmp to minimize comparison overhead.
+ */
+static int
+BufTableCmpBufferTag(BufferTag *a, BufferTag *b, int n)
+{
+	Assert(offsetof(BufferTag, blockNum) == 2*sizeof(uint64));
+	return (a->blockNum == b->blockNum
+			&& ((uint64*)a)[0] == ((uint64*)b)[0]
+			&& ((uint64*)a)[1] == ((uint64*)b)[1]) ? 0 : 1;
+}
+
+/*
  * Initialize shmem hash table for mapping buffers
  *		size is the desired hash table size (possibly more than NBuffers)
  */
@@ -61,6 +75,7 @@ InitBufTable(int size)
 	info.entrysize = sizeof(BufferLookupEnt);
 	info.hash = tag_hash;
 	info.num_partitions = NUM_BUFFER_PARTITIONS;
+	info.match = BufTableCmpBufferTag;
 
 	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
   size, size,



Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-24 Thread Ants Aasma
On Tue, Sep 23, 2014 at 8:15 PM, Florian Weimer <f...@deneb.enyo.de> wrote:
> * Ants Aasma:
>
>> CRC has exactly one hardware implementation in general purpose CPUs
>
> I'm pretty sure that's not true.  Many general purpose CPUs have CRC
> circuitry, and there must be some which also expose them as
> instructions.

I must eat my words here; indeed AMD processors starting from
Bulldozer do implement the CRC32 instruction. However, according to
Agner Fog, AMD's implementation has a 6 cycle latency and, more
importantly, a throughput of 1/6 per cycle, while Intel's
implementation on all CPUs except the new Atom has 3 cycle latency and
1 instruction/cycle throughput. This means that there still is a
significant handicap for AMD platforms, not to mention Power or SPARC
with no hardware support. Some ARMs implement CRC32, but I haven't
researched what their performance is.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Scaling shared buffer eviction

2014-09-12 Thread Ants Aasma
On Thu, Sep 11, 2014 at 4:22 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>>> Hm. Perhaps we should do a bufHdr->refcount != zero check without
>>> locking here? The atomic op will transfer the cacheline exclusively to
>>> the reclaimer's CPU. Even though it very shortly afterwards will be
>>> touched by the pinning backend.
>>
>> Meh.  I'm not in favor of adding more funny games with locking unless
>> we can prove they're necessary for performance.
>
> Well, this in theory increases the number of processes touching buffer
> headers regularly. Currently, if you have one read IO intensive backend,
> there's pretty much only one process touching the cachelines. This will make
> it two. I don't think it's unreasonable to try to reduce the cacheline
> pingpong caused by that...

I don't think it will help much. A pinned buffer is pretty likely to
be in modified state in the cache of the CPU of the pinning backend.
Even taking a look at the refcount will trigger a writeback and
demotion to shared state. When the time comes to unpin the buffer, the
cacheline must again be promoted to exclusive state, introducing
coherency traffic. Not locking the buffer only saves transferring the
cacheline back to the pinning backend, not a huge amount of savings.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread Ants Aasma
On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> I don't mean that we should abandon this patch - compression makes the WAL
> smaller which has all kinds of other benefits, even if it makes the raw TPS
> throughput of the system worse. But I'm just saying that these TPS
> comparisons should be taken with a grain of salt. We probably should
> consider switching to a faster CRC algorithm again, regardless of what we do
> with compression.

CRC is a pretty awfully slow algorithm for checksums. We should
consider switching it out for something more modern. CityHash,
MurmurHash3 and xxhash look like pretty good candidates, being around
an order of magnitude faster than CRC. I'm hoping to investigate
substituting the WAL checksum algorithm for 9.5.

Given the room for improvement in this area I think it would make
sense to just short-circuit the CRC calculations for testing this
patch to see if the performance improvement is due to less data being
checksummed.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-09-12 Thread Ants Aasma
On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthur...@gmail.com> wrote:
> That's not entirely true. CRC-32C beats pretty much everything with the same
> length quality-wise and has both hardware implementations and highly
> optimized software versions.

For better or for worse CRC is biased by detecting all single bit
errors; the detection capability of larger errors is slightly
diminished. The quality of the other algorithms I mentioned is also
very good, while producing uniformly varying output. CRC has exactly
one hardware implementation in general purpose CPUs, and Intel has a
patent on the techniques they used to implement it. The fact that AMD
hasn't yet implemented this instruction shows that this patent is
non-trivial to work around. The hardware CRC is about as fast as
xxhash. The highly optimized software CRCs are an order of magnitude
slower and require large cache-thrashing lookup tables.

If we choose to stay with CRC we must accept that we can only solve
the performance issues for Intel CPUs and provide slight alleviation
for others.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-04 Thread Ants Aasma
On Sat, Aug 30, 2014 at 8:50 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Andres Freund <and...@2ndquadrant.com> writes:
>> On 2014-08-27 19:23:04 +0300, Heikki Linnakangas wrote:
>>> A long time ago, Itagaki Takahiro wrote a patch to sort the buffers and write
>>> them out in order
>>> (http://www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp).
>>> The performance impact of that was inconclusive, but one thing that it
>>> allows nicely is to interleave the fsyncs, so that you write all the buffers
>>> for one file, then fsync it, then next file and so on.
>>
>> ...
>> So, *very* clearly sorting is a benefit.
>
> pg_bench alone doesn't convince me on this.  The original thread found
> cases where it was a loss, IIRC; you will need to test many more than
> one scenario to prove the point.

The same objection came up the last time I tried to push for sorted
checkpoints. I did not find any reference to where it caused a loss,
nor was I able to come up with a case where writing out in arbitrary
order would be better than writing out in file-sequential order. In
fact, if we ask for low latency, the OS must keep the backlog small,
eliminating any chance of write-combining writes that arrive out of
order.

I have a use case where the system continuously loads data into time
partitioned indexed tables; at every checkpoint all of the indexes of
the latest partition need to be written out. The only way I could get
the write out to happen with sequential I/O was to set
checkpoint_completion_target to zero and ensure OS cache allows for
enough dirty pages to absorb the whole checkpoint. The fsync that
followed did obviously nasty things to latency. Whereas sorted
checkpoints were able to do sequential I/O with checkpoint spreading
and low latency tuned OS virtual memory settings. I can create a
benchmark that shows this behavior if you need additional data points
to pgbench's OLTP workload to convince you that sorting checkpoint
writes is a good idea.

I did just come up with a case where plain sorting might cause an
issue. If the writes go to different I/O devices then naive sorting
will first use one device, then the other, whereas arbitrary writing
will load balance between the devices. Assuming that separate
tablespaces are used for separate I/O devices, it should be enough to
just interleave the writes for each tablespace, weighted by the amount
of writes per tablespace.
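One way to do that weighting (a standalone sketch, not patch code):
after sorting each tablespace's dirty buffers into file order, always
emit from the tablespace that has made the least relative progress
through its own list, which interleaves output proportionally to
per-tablespace write counts.

typedef struct
{
	int		   *buffers;		/* buffer ids in file-sequential order */
	int			nbuffers;
	int			pos;
} TablespaceWrites;

static int
next_buffer_to_write(TablespaceWrites *ts, int nts)
{
	int			best = -1;
	double		best_progress = 2.0;	/* above any real fraction */

	for (int i = 0; i < nts; i++)
	{
		double		progress;

		if (ts[i].pos >= ts[i].nbuffers)
			continue;			/* this tablespace is done */
		progress = (double) ts[i].pos / ts[i].nbuffers;
		if (progress < best_progress)
		{
			best_progress = progress;
			best = i;
		}
	}
	return (best >= 0) ? ts[best].buffers[ts[best].pos++] : -1;
}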

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] postgresql latency bgwriter not doing its job

2014-09-04 Thread Ants Aasma
On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> It's imo quite clearly better to keep it allocated. For one after
> postmaster started the checkpointer successfully you don't need to be
> worried about later failures to allocate memory if you allocate it once
> (unless the checkpointer FATALs out which should be exceedingly rare -
> we're catching ERRORs). It's much much more likely to succeed
> initially. Secondly it's not like there's really that much time where no
> checkpointer isn't running.

In principle you could do the sort with the full sized array and then
compress it to a list of buffer IDs that need to be written out. This
way most of the time you only need a small array and the large array
is only needed for a fraction of a second.
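Roughly like the following sketch, where buffer_is_dirty() and
buffer_sort_key() are hypothetical stand-ins for inspecting the real
buffer headers:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
	uint64_t	sort_key;		/* e.g. tablespace/relation/block packed */
	int			buf_id;
} SortEntry;

extern bool buffer_is_dirty(int buf_id);		/* hypothetical */
extern uint64_t buffer_sort_key(int buf_id);	/* hypothetical */

static int
sort_entry_cmp(const void *a, const void *b)
{
	uint64_t	ka = ((const SortEntry *) a)->sort_key;
	uint64_t	kb = ((const SortEntry *) b)->sort_key;

	return (ka > kb) - (ka < kb);
}

/*
 * The full-sized scratch array lives only for the duration of this
 * function; only the compact list of buffer ids is kept around.
 */
static int *
collect_checkpoint_buffers(int nbuffers, int *nresult)
{
	SortEntry  *scratch = malloc(nbuffers * sizeof(SortEntry));
	int		   *result;
	int			n = 0;

	for (int i = 0; i < nbuffers; i++)
		if (buffer_is_dirty(i))
			scratch[n++] = (SortEntry) {buffer_sort_key(i), i};

	qsort(scratch, n, sizeof(SortEntry), sort_entry_cmp);

	result = malloc(n * sizeof(int));
	for (int i = 0; i < n; i++)
		result[i] = scratch[i].buf_id;
	free(scratch);

	*nresult = n;
	return result;
}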

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-08-08 Thread Ants Aasma
On Fri, Aug 8, 2014 at 7:35 PM, Andrew Dunstan <and...@dunslane.net> wrote:
>> I took a quick look and saw that this wouldn't be that easy to get around.
>> As I'd suspected upthread, there are places that do random access into a
>> JEntry array, such as the binary search in findJsonbValueFromContainer().
>> If we have to add up all the preceding lengths to locate the corresponding
>> value part, we lose the performance advantages of binary search.  AFAICS
>> that's applied directly to the on-disk representation.  I'd thought
>> perhaps there was always a transformation step to build a pointer list,
>> but nope.
>
> It would be interesting to know what the performance hit would be if we
> calculated the offsets/pointers on the fly, especially if we could cache it
> somehow. The main benefit of binary search is in saving on comparisons,
> especially of strings, ISTM, and that could still be available - this would
> just be a bit of extra arithmetic.

I don't think binary search is the main problem here. Objects are
usually reasonably sized, while arrays are more likely to be huge. To
make matters worse, jsonb -> int goes from O(1) to O(n).

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Reporting the commit LSN at commit time

2014-08-07 Thread Ants Aasma
On Thu, Aug 7, 2014 at 4:15 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> I'm thinking about adding a new message type in the protocol that gets
> sent immediately before CommandComplete, containing the LSN of the
> commit. Clients would need to enable the sending of this message with a
> GUC that they set when they connect, so it doesn't confuse clients that
> aren't expecting it or aware of it.
>
> Is this something you can see being useful for other non-BDR purposes?

I have been thinking about something similar.

For regular streaming replication you could keep track of the commit
LSN on a per client basis and automatically redirect read queries to a
standby if standby apply location is larger than the commit LSN of
this particular client. This can be done mostly transparently for the
application while not running into the issue that recent modifications
disappear for a while after commit.
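The routing rule itself is a one-liner; a minimal sketch with made-up
types (the LSN being whatever the proposed protocol message carries):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	uint64_t	last_commit_lsn;	/* from the proposed commit message */
} ClientState;

/*
 * A standby is safe for this client's reads once it has applied at
 * least everything the client has seen committed.
 */
static bool
standby_ok_for_client(const ClientState *client, uint64_t standby_apply_lsn)
{
	return standby_apply_lsn >= client->last_commit_lsn;
}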

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] better atomics - v0.5

2014-07-02 Thread Ants Aasma
On Fri, Jun 27, 2014 at 9:00 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> Imagine the situation for the buffer header spinlock which is one of the
> bigger performance issues atm. We could aim to replace all usages of
> that with clever and complicated logic, but it's hard.
>
> The IMO reasonable (and prototyped) way to do it is to make the common
> paths lockless, but fall back to the spinlock for the more complicated
> situations. For the buffer header that means that pin/unpin and buffer
> lookup are lockless, but IO and changing the identity of a buffer still
> require the spinlock. My attempts to avoid the latter basically required
> a buffer header specific reimplementation of spinlocks.

There is a 2010 paper [1] that demonstrates a fully non-blocking
approach to buffer management using the same generalized clock
algorithm that PostgreSQL has. The site also has an implementation for
Apache Derby. You may find some interesting ideas in there.

[1] 
http://code.google.com/p/derby-nb/source/browse/trunk/derby-nb/ICDE10_conf_full_409.pdf

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] gettimeofday is at the end of its usefulness?

2014-05-14 Thread Ants Aasma
On Wed, May 14, 2014 at 6:34 AM, Greg Stark <st...@mit.edu> wrote:
> I always assumed the kernel used rdtsc to implement some of the high
> performance timers. It can save the current time in a mapped page when
> it schedules a process and then in the vdso syscall (ie in user-space)
> it can use rdtsc to calculate the offset needed to adjust that
> timestamp to the current time. This seems consistent with your
> calculations that showed the 40ns overhead with +/- 10ns precision.

Both gettimeofday and clock_gettime do exactly that. [1]
clock_gettime(CLOCK_MONOTONIC) is the mode of operation we would want
to use here.

> I actually think it would be more interesting if we could measure the
> overhead and adjust for it. I don't think people are really concerned
> with how long EXPLAIN ANALYZE takes to run if they could get accurate
> numbers out of it.

Measuring would also be a good idea so we can automatically turn on
performance counters like IO timing when we know it's not obscenely
expensive.

However, subtracting the overhead will still skew the numbers somewhat
by giving more breathing time to memory and IO prefetching
mechanisms. Another option to consider would be to add a sampling-based
mechanism for low overhead time attribution. It would be even
better if we could distinguish between time spent waiting on locks vs.
waiting on IO vs. waiting to be scheduled vs. actually executing.

[1] 
https://github.com/torvalds/linux/blob/master/arch/x86/vdso/vclock_gettime.c#L223
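Measuring the overhead is simple enough to do standalone; a quick
sketch (Linux) that times the timer itself:

#include <stdio.h>
#include <time.h>

int
main(void)
{
	const long	iterations = 10 * 1000 * 1000;
	struct timespec start, end, tmp;
	double		elapsed_ns;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iterations; i++)
		clock_gettime(CLOCK_MONOTONIC, &tmp);
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec);
	printf("%.1f ns per clock_gettime() call\n", elapsed_ns / iterations);
	return 0;
}

On hardware with a vDSO fast path this typically lands in the tens of
nanoseconds per call.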

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Proposal for CSN based snapshots

2014-05-12 Thread Ants Aasma
On Mon, May 12, 2014 at 4:56 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> On 01/24/2014 02:10 PM, Rajeev rastogi wrote:
>>
>> We are also planning to implement CSN based snapshot.
>> So I am curious to know whether any further development is happening on
>> this.
>
>
> I started looking into this, and plan to work on this for 9.5. It's a big
> project, so any help is welcome. The design I have in mind is to use the LSN
> of the commit record as the CSN (as Greg Stark suggested).

I did do some coding work on this, but the free time I used to work on
it basically disappeared with a child in the family. I guess what I
have has bitrotted beyond recognition. However I may still have some
insight that may be of use.

From your comments I presume that you are going with the original,
simpler approach proposed by Robert: simply keep the XID->CSN map
around forever and probe it for all visibility lookups that lie
outside of the xmin-xmax range? As opposed to the more complex hybrid
approach I proposed that keeps a short-term XID->CSN map and lazily
builds conventional list-of-concurrent-XIDs snapshots for long lived
snapshots. I think that would be prudent, as the simpler approach
needs mostly the same groundwork, and if it turns out to work well
enough, simpler is always better.
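For what it's worth, the simple approach makes the visibility test
almost trivial; a sketch with invented names, assuming the caller has
already established via clog that the xid committed, and ignoring xid
wraparound for brevity:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
typedef uint64_t CommitSeqNo;

#define InvalidCommitSeqNo 0

extern CommitSeqNo CSNLogGetCSN(Xid xid);	/* hypothetical accessor */

static bool
XidVisibleInSnapshot(Xid xid, Xid snap_xmin, Xid snap_xmax,
					 CommitSeqNo snap_csn)
{
	if (xid < snap_xmin)
		return true;			/* finished before any concurrent xact */
	if (xid >= snap_xmax)
		return false;			/* started after the snapshot was taken */

	/* in between: probe the XID->CSN map */
	{
		CommitSeqNo csn = CSNLogGetCSN(xid);

		return csn != InvalidCommitSeqNo && csn <= snap_csn;
	}
}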

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Proposal for CSN based snapshots

2014-05-12 Thread Ants Aasma
On Mon, May 12, 2014 at 6:09 PM, Robert Haas <robertmh...@gmail.com> wrote:
> However, I wonder what
> happens if you write the commit record and then the attempt to update
> pg_clog fails.  I think you'll have to PANIC, which kind of sucks.

A CLOG IO error while committing is already a PANIC: SimpleLruReadPage()
does SlruReportIOError(), which in turn does ereport(ERROR) while
inside a critical section initiated in RecordTransactionCommit(), and
an ERROR inside a critical section is promoted to PANIC.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Proposal for CSN based snapshots

2014-05-12 Thread Ants Aasma
On Mon, May 12, 2014 at 7:10 PM, Greg Stark <st...@mit.edu> wrote:
> Would it be useful to store the current WAL insertion point along with
> the "about to commit" flag so it's effectively a promise that this
> transaction will commit no earlier than XXX. That should allow most
> transactions to decide if those records are visible or not unless
> they're very recent transactions which started in that short window
> while the committing transaction was in the process of committing.

I don't believe this is worth the complexity. The contention window is
extremely short here.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-16 Thread Ants Aasma
On Wed, Apr 16, 2014 at 7:44 AM, Robert Haas robertmh...@gmail.com wrote:
 I think that the basic problem here is that usage counts increase when
 buffers are referenced, but they decrease when buffers are evicted,
 and those two things are not in any necessary way connected to each
 other.  In particular, if no eviction is happening, reference counts
 will converge to the maximum value.  I've read a few papers about
 algorithms that attempt to segregate the list of buffers into hot
 and cold lists, and an important property of such algorithms is that
 they mustn't be allowed to make everything hot.  It's easy to be too
 simplistic, here: an algorithm that requires that no more than half
 the list be hot will fall over badly on a workload where the working
 set exceeds the available cache and the really hot portion of the
 working set is 60% of the available cache.  So you need a more
 sophisticated algorithm than that.  But that core property that not
 all buffers can be hot must somehow be preserved, and our algorithm
 doesn't.

FWIW, in CLOCK-Pro segregating buffers between hot and cold is tied
to eviction and the clock sweep; the ratio between hot and cold is
dynamically adapted based on prior experience. The main downside is
that it seems to require an indirection somewhere either in the clock
sweep or buffer lookup. Maybe it's possible to avoid that with some
clever engineering if we think hard enough.

CLOCK-Pro may also have too little memory of hotness, making it too
easy to blow the whole cache away with a burst of activity. It may be
useful to have a (possibly tunable) notion of fairness where one
query/backend can't take over the cache even though it may be an
overall win in terms of total number of I/Os performed. Maybe we need
to invent Generalized CLOCK-Pro with a larger number of levels,
ranging from cold, hot and scalding to infernal. :)

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2014-04-15 Thread Ants Aasma
On Mon, Apr 14, 2014 at 8:11 PM, Peter Geoghegan p...@heroku.com wrote:
 PostgreSQL implements a clock sweep algorithm, which gets us something
 approaching an LRU for the buffer manager in trade-off for less
 contention on core structures. Buffers have a usage_count/popularity
 that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
 algorithm only has one bit for what approximates our usage_count (so
 it's either 0 or 1). I think that at its core CLOCK is an algorithm
 that has some very desirable properties that I am sure must be
 preserved. Actually, I think it's more accurate to say we use a
 variant of clock pro, a refinement of the original CLOCK.

PostgreSQL's replacement algorithm is more similar to Generalized CLOCK
or GCLOCK, as described in [1]. CLOCK-Pro [2] is a different algorithm
that approximates LIRS[3]. LIRS is what MySQL implements[4] and
CLOCK-Pro is implemented by NetBSD [5] and there has been some work on
trying it on Linux [6]. Both LIRS and CLOCK-Pro work by keeping
metadata entries for double the cache size and detecting pages that have been
recently referenced. Basically they provide an adaptive tradeoff
between LRU and LFU.

 In the past, various hackers have noted problems they've observed with
 this scheme. A common pathology is to see frantic searching for a
 victim buffer only to find all buffer usage_count values at 5. It may
 take multiple revolutions of the clock hand before a victim buffer is
 found, as usage_count is decremented for each and every buffer.  Also,
 BufFreelistLock contention is considered a serious bottleneck [1],
 which is related.

There's a paper on a non blocking GCLOCK algorithm, that does lock
free clock sweep and buffer pinning[7]. If we decide to stay with
GCLOCK it may be interesting, although I still believe that some
variant of buffer nailing [8] is a better idea; my experience shows
that most of the locking overhead is cache line bouncing, ignoring the
extreme cases where our naive spinlock implementation blows up.
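
To give a flavor of what a lock free sweep would look like, here is a
rough sketch using C11 atomics; the descriptor layout and
TryPinForEviction() are invented for illustration and are not the
paper's actual algorithm:

#include <stdatomic.h>

#define NBUFFERS 16384

typedef struct
{
    atomic_uint usage_count;   /* GCLOCK popularity, saturates at 5 */
    /* ... buffer tag, flags, pin count ... */
} BufferDescSketch;

static BufferDescSketch buffers[NBUFFERS];
static atomic_uint clock_hand;

/* Assumed helper: pins the buffer iff it is currently unpinned. */
extern _Bool TryPinForEviction(unsigned buf);

static int
ClockSweepVictim(void)
{
    for (;;)
    {
        unsigned buf = atomic_fetch_add(&clock_hand, 1) % NBUFFERS;
        unsigned count = atomic_load(&buffers[buf].usage_count);

        while (count > 0)
        {
            /* CAS refreshes count on failure, so retry or fall out. */
            if (atomic_compare_exchange_weak(&buffers[buf].usage_count,
                                             &count, count - 1))
                break;
        }
        if (count == 0 && TryPinForEviction(buf))
            return (int) buf;  /* unpopular and unpinned: evict */
    }
}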


 Let's leave aside inner/root pages though, because they're so
 dramatically useful when in a primary index on a TPC-B table that
 they'll always be cached by any non-terrible algorithm. It beggars
 belief that the still relatively dense (and consequently *popular*)
 B+Tree leaf pages get so little credit for being of such long-term
 utility (in the view of our clock sweep algorithm as implemented). The
 algorithm has what could be loosely described as an excessively
 short-term perspective. There is clearly a better balance to be had
 here. I don't think the answer is that we have the B-Tree code give
 its pages some special treatment that makes them harder to evict,
 although I will admit that I briefly entertained the idea.

There has been some research that indicates that for TPC-A workloads
giving index pages higher weights increases hit rates [1].


I think the hardest hurdle for any changes in this area will be
showing that we don't have any nasty regressions. I think the best way
to do that would be to study separately the performance overhead of
the replacement algorithm and optimality of the replacement choices.
If we capture a bunch of buffer reference traces by instrumenting
PinBuffer, we can pretty accurately simulate the behavior of different
algorithm and tuning choices with different shared buffer sizes.
Obviously full scale tests are still needed due to interactions with
OS, controller and disk caches and other miscellaneous influences. But
even so, simulation would get us much better coverage of various
workloads and at least some confidence that it's a good change
overall. It will be very hard and time consuming to gather equivalent
evidence with full scale tests.
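
For example, a trace driven simulator for the current GCLOCK policy
fits on a page; here is a sketch that assumes the trace is a stream of
block numbers on stdin, one per PinBuffer call:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COUNT 5                 /* BM_MAX_USAGE_COUNT */

int
main(int argc, char **argv)
{
    int     nbuffers;
    long   *page;                   /* which page each buffer holds */
    int    *count;                  /* GCLOCK usage counts */
    int     hand = 0;
    long    ref, hits = 0, misses = 0;

    if (argc < 2)
        return 1;
    nbuffers = atoi(argv[1]);
    page = malloc(nbuffers * sizeof(long));
    count = calloc(nbuffers, sizeof(int));
    memset(page, -1, nbuffers * sizeof(long));  /* all slots empty */

    while (scanf("%ld", &ref) == 1)
    {
        int     i, victim = -1;

        for (i = 0; i < nbuffers; i++)
            if (page[i] == ref)
                break;
        if (i < nbuffers)           /* hit: bump the usage count */
        {
            hits++;
            if (count[i] < MAX_COUNT)
                count[i]++;
            continue;
        }
        misses++;
        while (victim < 0)          /* clock sweep for a victim */
        {
            if (count[hand] == 0)
                victim = hand;
            else
                count[hand]--;
            hand = (hand + 1) % nbuffers;
        }
        page[victim] = ref;
        count[victim] = 1;          /* newly loaded buffers start at 1 */
    }
    if (hits + misses > 0)
        printf("hit rate: %.3f\n", (double) hits / (hits + misses));
    return 0;
}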


[1] http://www.csd.uoc.gr/~hy460/pdf/p35-nicola.pdf
[2] http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf
[3] http://www.ece.eng.wayne.edu/~sjiang/pubs/papers/jiang02_LIRS.pdf
[4] http://lists.mysql.com/commits/28601
[5] http://fxr.watson.org/fxr/source/uvm/uvm_pdpolicy_clockpro.c?v=NETBSD
[6] http://lwn.net/Articles/147879/
[7] 
http://derby-nb.googlecode.com/svn-history/r41/trunk/derby-nb/ICDE10_conf_full_409.pdf
[8] 
http://www.postgresql.org/message-id/ca+tgmozypeyhwauejvyy9a5andoulcf33wtnprfr9sycw30...@mail.gmail.com

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


[HACKERS] Missing pfree in logical_heap_rewrite_flush_mappings()

2014-03-26 Thread Ants Aasma
It seems to me that when flushing logical mappings to disk, each
mapping file leaks the buffer used to pass the mappings to XLogInsert.
Also, it seems consistent to allocate that buffer in the RewriteState
memory context. Patch attached.

Regards,
Ants Aasma
-- 
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 4cf07ea..ae439e8 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -897,7 +897,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 
 		/* write all mappings consecutively */
 		len = src->num_mappings * sizeof(LogicalRewriteMappingData);
-		waldata = palloc(len);
+		waldata = MemoryContextAlloc(state->rs_cxt, len);
 		waldata_start = waldata;
 
 		/*
@@ -943,6 +943,7 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 		/* write xlog record */
 		XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_REWRITE, rdata);
 
+		pfree(waldata);
 	}
 	Assert(state->rs_num_rewrite_mappings == 0);
 }


Re: [HACKERS] Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

2014-02-12 Thread Ants Aasma
On Wed, Feb 12, 2014 at 4:04 PM, knizhnik knizh...@garret.ru wrote:
 Even if reordering was not done by compiler, it still can happen while
 execution.
 There is no warranty that two subsequent assignments will be observed by all
 CPU cores in the same order.

The x86 memory model (total store order) provides that guarantee in
this specific case.
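
Spelled out as the classic store ordering litmus test (a sketch; note
that a compiler barrier is still needed in real code to stop the
compiler itself from reordering):

/* Initially x = y = 0; one CPU runs writer(), another reader(). */
int x, y;
int r1, r2;

void
writer(void)
{
    x = 1;
    y = 1;     /* under TSO this store cannot overtake the store to x */
}

void
reader(void)
{
    r1 = y;
    r2 = x;    /* under TSO, r1 == 1 implies r2 == 1 */
}

/*
 * The outcome r1 == 1 && r2 == 0 is forbidden on x86, but allowed on
 * weakly ordered hardware unless a write barrier separates the stores
 * and a read barrier separates the loads.
 */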

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2013-12-15 Thread Ants Aasma
On Dec 15, 2013 6:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 David Rowley dgrowle...@gmail.com writes:
  I've attached an updated patch which includes some documentation.
  I've also added support for negfunc in CREATE AGGREGATE. Hopefully
that's
  an ok name for the option, but if anyone has any better ideas please let
  them be known.

 I'd be a bit inclined to build the terminology around reverse instead of
 negative --- the latter seems a bit too arithmetic-centric.  But that's
 just MHO.

To contribute to the bike shedding, inverse is often used in similar
contexts.

--
Ants Aasma


Re: [HACKERS] better atomics

2013-11-06 Thread Ants Aasma
On Tue, Nov 5, 2013 at 5:51 PM, Andres Freund and...@2ndquadrant.com wrote:
 So what I've previously proposed did use plain types for the
 to-be-manipulated atomic integers. I am not sure about that anymore
 though, C11 includes support for atomic operations, but it assumes that
 the to-be-manipulated variable is of a special atomic type. While I am
 not propose that we try to emulate C11's API completely (basically
 impossible as it uses type agnostic functions, it also exposes too
 much), it seems to make sense to allow PG's atomics to be implemented by
 C11's.

 The API is described at: http://en.cppreference.com/w/c/atomic

 The fundamental difference to what I've suggested so far is that the
 atomic variable isn't a plain integer/type but a distinct datatype that
 can *only* be manipulated via the atomic operators. That has the nice
 property that it cannot be accidentally manipulated via plain operators.

 E.g. it wouldn't be
 uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_);
 but
 uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_);

 where, for now, atomic_uint32 is something like:
 typedef struct pg_atomic_uint32
 {
 volatile uint32 value;
 } pg_atomic_uint32;
 a future C11 implementation would just typedef C11's respective atomic
 type to pg_atomic_uint32.

 Comments?

C11 atomics need to be initialized through atomic_init(), so a simple
typedef will not be correct here. Also, C11 atomics are designed to
have the compiler take care of memory barriers - so they might not
always be a perfect match to our API's, or any API implementable
without compiler support.
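
For illustration, a C11 based implementation of the proposed wrapper
would look roughly like this sketch (the example() function is only
there to show usage):

#include <stdatomic.h>
#include <stdint.h>

/* A plain typedef is not enough: the variable must go through
 * atomic_init() before first use. */
typedef atomic_uint pg_atomic_uint32;

void
example(void)
{
    pg_atomic_uint32 counter;
    uint32_t    v;

    atomic_init(&counter, 0);        /* mandatory under C11 */
    atomic_fetch_add(&counter, 1);   /* cf. pg_atomic_fetch_add_u32 */
    v = atomic_load(&counter);       /* compiler emits any barriers */
    (void) v;
}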

However, I'm mildly supportive of having a separate type for variables
accessed by atomics. It can result in some unnecessary code churn, but
in general if an atomic access to a variable is added, then all other
accesses to it need to be audited for memory barriers, even if they
were previously correctly synchronized by a lock.

I guess the best approach for deciding would be to try to convert a
couple of the existing unlocked accesses to the API and see what the
patch looks like.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] better atomics

2013-11-06 Thread Ants Aasma
On Wed, Nov 6, 2013 at 4:54 PM, Andres Freund and...@2ndquadrant.com wrote:
 FWIW, I find the requirement for atomic_init() really, really
 annoying. Not that that will change anything ;)

Me too, but I guess this allows them to emulate atomics on platforms
that don't have native support. Whether that is a good idea is a
separate question.

 However, I'm mildly supportive of having a separate type for variables
 accessed by atomics. It can result in some unnecessary code churn, but
 in general if an atomic access to a variable is added, then all other
 accesses to it need to be audited for memory barriers, even if they
 were previously correctly synchronized by a lock.

 Ok, that's what I am writing right now.

 I guess the best approach for deciding would be to try to convert a
 couple of the existing unlocked accesses to the API and see what the
 patch looks like.

 I don't think there really exist any interesting ones? I am using my
 lwlock reimplementation as a testbed so far.

BufferDesc management in src/backend/storage/buffer/bufmgr.c.
The seqlock-like thing used to provide consistent reads of
PgBackendStatus in src/backend/postmaster/pgstat.c (this code looks
like it's broken on weakly ordered machines).
The unlocked reads that are done from various procarray variables.
The XLogCtl accesses in xlog.c.

IMO the volatile keyword should be confined to the code actually
implementing atomics and locking. The "use volatile pointer to prevent
code rearrangement" comment is evil. If there are data races in the
code, they need comments pointing this out and probably memory
barriers too. If there are no data races, then the volatile
declaration is useless and a pessimization. Currently it's more of a
catch-all "here be dragons" declaration for data structures.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-21 Thread Ants Aasma
On Tue, Oct 22, 2013 at 1:09 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Gavin Flower wrote:

 One way it could be done, but even this would consume far too much
 storage and processing power (hence totally impractical), would be
 to 'simply' store a counter for each value found and increment it
 for each occurrence...

 An histogram?  Sounds like a huge lot of code complexity to me.  Not
 sure the gain is enough.

I have a proof of concept patch somewhere that does exactly this. I
used logarithmic bin widths. With 8 log10 bins you can tell the
fraction of queries running at each order of magnitude from less than
1ms to more than 1000s. Or with 31 bins you can cover factor of 2
increments from 100us to over 27h. And the code is almost trivial:
just take a log of the duration, calculate the bin number from that,
and increment the value in the corresponding bin.
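
A sketch of that binning, with 8 log10 bins covering everything from
under 1 ms to 1000 s and beyond (names invented, not from the actual
proof of concept patch):

#include <math.h>

#define NUM_BINS 8      /* <1ms, 1-10ms, ..., 100-1000s, >=1000s */

static long histogram[NUM_BINS];

static void
record_duration(double duration_ms)
{
    int     bin = 0;

    if (duration_ms >= 1.0)
        bin = (int) log10(duration_ms) + 1;  /* one bin per decade */
    if (bin >= NUM_BINS)
        bin = NUM_BINS - 1;                  /* clamp at >= 1000 s */
    histogram[bin]++;
}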

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] Add min and max execute statement time in pg_stat_statement

2013-10-21 Thread Ants Aasma
On Tue, Oct 22, 2013 at 4:00 AM, Gavin Flower
gavinflo...@archidevsys.co.nz wrote:
 I have a proof of concept patch somewhere that does exactly this. I
 used logarithmic bin widths. With 8 log10 bins you can tell the
 fraction of queries running at each order of magnitude from less than
 1ms to more than 1000s. Or with 31 bins you can cover factor of 2
 increments from 100us to over 27h. And the code is almost trivial,
 just take a log of the duration and calculate the bin number from that
 and increment the value in the corresponding bin.

 I suppose this has to be decided at compile time to keep the code both
 simple and efficient - if so, I like the binary approach.

For efficiency's sake it can easily be done at run time; one extra
logarithm calculation per query will not be noticeable. Having a
proper user interface to make it configurable and changeable is where
the complexity is. We might just decide to go with something good
enough as even the 31 bin solution would bloat the pg_stat_statements
data structure only by about 10%.

 Curious, why start at 100us? I suppose this might be of interest if
 everything of note is in RAM and/or stuff is on SSD's.

Selecting a single row takes about 20us on my computer; I picked 100us
as a reasonable limit below which the exact speed doesn't matter
anymore.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] removing old ports and architectures

2013-10-18 Thread Ants Aasma
On Thu, Oct 17, 2013 at 3:10 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Oct 17, 2013 at 12:22 AM, Noah Misch n...@leadboat.com wrote:
 Removing support for alpha is a different animal compared to removing support
 for non-gcc MIPS and most of the others in your list.  A hacker wishing to
 restore support for another MIPS compiler would fill in the assembly code
 blanks, probably using code right out of an architecture manual.  A hacker
 wishing to restore support for alpha would find himself auditing every
 lock-impoverished algorithm in the backend.

 I had much the same thought last night.  So I reverse my vote on
 Alpha: let's drop it.  I had thought that perhaps there'd be some
 value in keeping it to force ourselves to consider what will happen
 under the weakest generally-understood memory model, but in fact
 that's probably a doomed effort without having the hardware available
 to test the code.  As you say, any future atomics support for such a
 platform will be a major undertaking.

FWIW, I think that if we approach coding lock free algorithms
correctly - i.e. which memory barriers can we avoid while being
safe, instead of which memory barriers we need to add to become
safe - then supporting Alpha isn't a huge amount of extra work. We
only need a couple of extra barriers after atomic reads where I think
we should have a comment anyway explaining that we don't need a read
barrier because a data dependency provides ordering.
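
For example (a sketch; pg_read_barrier_depends() is a hypothetical
Alpha-only macro in the style of the Linux kernel's
smp_read_barrier_depends(), a no-op everywhere else):

typedef struct Node { int value; } Node;
typedef struct { Node *volatile head; } Shared;

/* Hypothetical: a real barrier on Alpha, a no-op on everything else. */
#define pg_read_barrier_depends()   ((void) 0)

extern void use(int value);

void
reader(Shared *shared)
{
    Node   *node = shared->head;   /* read the published pointer */

    /*
     * The load of node->value is data dependent on the load of the
     * pointer, which provides the ordering on every architecture
     * except Alpha; only there does this need to be a real barrier.
     */
    pg_read_barrier_depends();

    use(node->value);
}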

In general I agree that we are unlikely to provide a bug-free result
without a buildfarm animal, so I'm ±0 on removing support. We can try
to support it, but we are unlikely to succeed. I also find it unlikely
that anyone will create a new architecture with a similarly loose
memory model. The experience with Alpha and other microprocessors
shows that the extra hardware needed for fast and strong memory
ordering guarantees more than pays for itself in performance.

Regards,
Ants Aasma


Re: [HACKERS] removing old ports and architectures

2013-10-18 Thread Ants Aasma
On Fri, Oct 18, 2013 at 8:04 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Oct 18, 2013 at 9:55 AM, Ants Aasma a...@cybertec.at wrote:
 FWIW, I think that if we approach coding lock free algorithms
 correctly - i.e. which memory barriers can we avoid while being
 safe, instead of which memory barriers we need to add to become
 safe - then supporting Alpha isn't a huge amount of extra work.

 Alpha is completely irrelevant, so I would not like to expend the
 tiniest effort on supporting it. If there is someone using a very much
 legacy architecture like this, I doubt that even they will appreciate
 the ability to upgrade to the latest major version.

It's mostly irrelevant and I wouldn't shed a tear for Alpha support,
but I'd like to point out that it's a whole lot less irrelevant than
some of the architectures being discussed here. The latest Alpha
machines were sold only 6 years ago and supported up to 512GB of
memory with 64 1.3 GHz cores, something that can run a very reasonable
database load even today.

Regards,
Ants Aasma


Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-10-02 Thread Ants Aasma
On Wed, Oct 2, 2013 at 4:39 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Mon, Sep 30, 2013 at 7:51 PM, Ants Aasma a...@cybertec.at wrote:
 So we need a read barrier somewhere *after* reading the flag in
 RecoveryInProgress() and reading the shared memory structures, and in
 theory a full barrier if we are going to be writing data.

 wow -- thanks for your review and provided detail.  Considering there
 are no examples of barrier instructions to date, I think some of your
 commentary should be included in the in-source documentation.

 In this particular case, a read barrier should be sufficient?  By
 'writing data', do you mean to the xlog control structure?  This
 routine only sets a backend local flag so that should be safe?

I haven't reviewed the code in as much detail to say if there is an
actual race here, I tend to think there's probably not, but the
specific pattern that I had in mind is that with the following actual
code:

Process A:
    globalVar = 1;
    write_barrier();
    recoveryInProgress = false;

Process B:
    if (!recoveryInProgress) {
        globalVar = 2;
        doWork();
    }

If process B speculatively executes line 2 before reading the flag for
line 1, then it's possible that the store in process B is executed
before the store in process A, making globalVar move backwards. The
barriers as they are defined don't make this scenario impossible. That
said, I don't know of any hardware that would make speculatively
executed stores visible to non-speculative state; as I said, that
would be completely insane. However, currently compilers consider it
completely legal to rewrite the code into the following form:

tmp = globalVar;
globalVar = 2;
if (!recoveryInProgress) {
    doWork();
} else {
    globalVar = tmp;
}

That still exhibits the same problem. An abstract read barrier would
not be enough here, as this requires a LoadStore barrier. However, the
control dependency is enough for the hardware, and PostgreSQL's
pg_read_barrier() is a full compiler barrier, so in practice a simple
pg_read_barrier() is enough.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-10-02 Thread Ants Aasma
On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma a...@cybertec.at wrote:
 I haven't reviewed the code in as much detail to say if there is an
 actual race here, I tend to think there's probably not, but the
 specific pattern that I had in mind is that with the following actual
 code:

 hm.  I think there *is* a race.  2+ threads could race to the line:

 LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

 and simultaneously set the value of LocalRecoveryInProgress to false,
 and both engage InitXLOGAccess, which is destructive.   The move
 operation is atomic, but I don't think there's any guarantee the reads
 to xlogctl->SharedRecoveryInProgress are ordered between threads
 without a lock.

 I don't think the memory barrier will fix this.  Do you agree?  If so,
 my earlier patch (recovery4) is another take on this problem, and
 arguable safer; the unlocked read is in a separate path that does not
 engage InitXLOGAccess()

InitXLOGAccess only writes backend local variables, so it can't be
destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock
acquisition cycle, which should be a full memory barrier. A read
barrier after accessing SharedRecoveryInProgress is good enough, and
it seems to be necessary to avoid a race condition for
XLogCtl->ThisTimeLineID.

Admittedly the race is so improbable that in practice it probably
doesn't matter. I just wanted to spell out the correct way to do
unlocked accesses as it can get quite confusing. I found Herb Sutter's
"atomic weapons" talk very useful; thinking about the problem in
terms of acquire and release makes it much clearer to me.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] Freezing without write I/O

2013-10-01 Thread Ants Aasma
On Tue, Oct 1, 2013 at 2:13 PM, Andres Freund and...@2ndquadrant.com wrote:
 Agreed. The wait free LW_SHARED thing[1] I posted recently had a simple

 #define pg_atomic_read(atomic) (*(volatile uint32 *)(atomic))

 That should be sufficient and easily greppable, right?

Looks good enough for me. I would consider using a naming scheme that
accounts for possible future uint64 atomics.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-10-01 Thread Ants Aasma
On Tue, Oct 1, 2013 at 2:08 PM, Andres Freund and...@2ndquadrant.com wrote:
 As for the specific patch being discussed here. The read barrier is in
 the wrong place and with the wrong comment, and the write side is
 assuming that SpinLockAcquire() is a write barrier, which it isn't
 documented to do at the moment.

 Lots of things we do pretty much already assume it is one - and I have a
 hard time imagining any spinlock implementation that isn't a write
 barrier. In fact it's the only reason we use spinlocks in quite some
 places.
 What we are missing is that it's not guaranteed to be a compiler barrier
 on all platforms :(. But that's imo a bug.

Agree on both counts.

 The correct way to think of this is
 that StartupXLOG() does a bunch of state modifications and then
 advertises the fact that it's done by setting
 xlogctl->SharedRecoveryInProgress = false; The state modifications
 should better be visible to anyone seeing that last write, so you need
 one write barrier between the state modifications and setting the
 flag.

 SpinLockAcquire() should provide that.

Yes. It's notable that in this case it's a matter of correctness that
the global state modifications do *not* share the critical section
with the flag update. Otherwise the flag update may become visible
before the state updates.

 So we need a read barrier somewhere *after* reading the flag in
 RecoveryInProgress() and reading the shared memory structures, and in
 theory a full barrier if we are going to be writing data. In practice
 x86 is covered thanks to its memory model, Power is covered thanks to
 the control dependency, and ARM would need a read barrier, but I don't
 think any real ARM CPU does speculative stores as that would be
 insane.

 Does there necessarily have to be a visible control dependency?

Unfortunately no, the compiler is quite free to hoist loads and even
stores out from the conditional block, losing the control dependency.
:( It's quite unlikely to do so as it would be a very large code
motion and it probably has no reason to do it, but I don't see
anything that would disallow it. I wonder if we should just emit a
full fence there for platforms with weak memory models. Trying to
enforce the control dependency seems quite fragile.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-30 Thread Ants Aasma
On Sat, Sep 28, 2013 at 12:53 AM, Andres Freund and...@2ndquadrant.com wrote:
 What confuses me is that pg_read_barrier() is just a compiler barrier on
 x86[-64] in barrier.h. According to my knowledge it needs to be an
 lfence or the full barrier?
 The linked papers from Paul McKenney - which are a great read - seem to
 agree?

x86 provides a pretty strong memory model; the only (relevant) allowed
reordering is that stores may be delayed beyond subsequent loads. A
simple and functionally accurate model of this is to ignore all caches
and think of the system as a bunch of CPUs accessing the main memory
through a shared bus, where each CPU has a small buffer that holds
writes done by that CPU. Intel and AMD have performed such feats of
black magic in the memory pipelines that this isn't a good performance
model, but for correctness it is valid. The exceptions are a few
unordered memory instructions that you have to specifically ask for,
and non-writeback memory types that mostly concern the kernel. (See
section 8.2 in
[1] for details) The note by McKenney stating that lfence is required
for a read barrier is for those unordered instructions. I don't think
it's necessary in PostgreSQL.

As for the specific patch being discussed here. The read barrier is in
the wrong place and with the wrong comment, and the write side is
assuming that SpinLockAcquire() is a write barrier, which it isn't
documented to do at the moment. The correct way to think of this is
that StartupXLOG() does a bunch of state modifications and then
advertises the fact that it's done by setting
xlogctl->SharedRecoveryInProgress = false; The state modifications
should better be visible to anyone seeing that last write, so you need
one write barrier between the state modifications and setting the
flag. On the other side, after reading the flag as false in
RecoveryInProgress() the state modified by StartupXLOG() may be
investigated, those loads better not happen before we read the flag.
So we need a read barrier somewhere *after* reading the flag in
RecoveryInProgress() and reading the shared memory structures, and in
theory a full barrier if we are going to be writing data. In practice
x86 is covered thanks to its memory model, Power is covered thanks to
the control dependency, and ARM would need a read barrier, but I don't
think any real ARM CPU does speculative stores as that would be
insane.
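
In code form the pattern is roughly the following (a sketch;
InitSharedState() and UseSharedState() stand in for the real state
manipulation):

/* Writer, at the end of StartupXLOG(): */
InitSharedState();                   /* modify shared structures */
pg_write_barrier();                  /* order the state before the flag */
xlogctl->SharedRecoveryInProgress = false;

/* Reader, in RecoveryInProgress() and its callers: */
if (!xlogctl->SharedRecoveryInProgress)
{
    pg_read_barrier();               /* order the flag before the state */
    UseSharedState();
}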

[1] 
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] Freezing without write I/O

2013-09-30 Thread Ants Aasma
Just found this in my drafts folder. Sorry for the late response.

On Fri, Sep 20, 2013 at 3:32 PM, Robert Haas robertmh...@gmail.com wrote:
 I am entirely unconvinced that we need this.  Some people use acquire
 + release fences, some people use read + write fences, and there are
 all combinations (read-acquire, read-release, read-acquire-release,
 write-acquire, write-release, write-acquire-release, both-acquire,
 both-release, both-acquire-release).  I think we're going to have
 enough trouble deciding between the primitives we already have, let
 alone with a whole mess of new ones.  Mistakes will only manifest
 themselves on certain platforms and in many cases the races are so
 tight that actual failures are very unlikely to be observed in
 regression testing.

I have to retract my proposal to try to emulate C11 atomics in C89. I
guess this goes to show why one shouldn't try to conjure up atomic
API's at 2AM. I forgot the fact that while acquire-release semantics
are enough to ensure sequentially consistent behavior, the actual
barriers need to be paired with specific atomic accesses to achieve
that. It's not possible to use freestanding acquire/release barriers
to implement this, nor would it be possible to include barriers in
the atomic accesses themselves without causing significant
pessimization.

Sequential consistency (along with causal transitivity and total
store ordering that come with it) should be regarded as a goal. I'm
not able to reason about concurrent programs without those guarantees,
and I suspect no one else is either. Sequential consistency is
guaranteed if atomic variables (including locks) are accessed with
appropriate acquire and release semantics. We just need to use a
hodge-podge of read/write/full barriers and volatile memory accesses
to actually implement the semantics until some distant future date
where we can start relying on compilers getting it right.

I still think we should have a macro for the volatile memory accesses.
As a rule, each one of those needs a memory barrier, and if we
consolidate them, or optimize them out, the considerations why this is
safe should be explained in a comment. Race-prone memory accesses
should stick out like a sore thumb.

 Personally, I think the biggest change that would help here is to
 mandate that spinlock operations serve as compiler fences.  That would
 eliminate the need for scads of volatile references all over the
 place.

+1. The commits that you showed fixing issues in this area show quite
well why this is a good idea.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] Freezing without write I/O

2013-09-19 Thread Ants Aasma
On Thu, Sep 19, 2013 at 2:42 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 * switchFinishXmin and nextSwitchXid should probably be either volatile
or have a compiler barrier between accessing shared memory and
checking them. The compiler very well could optimize them away and
access shmem all the time which could lead to weird results.

 Hmm, that would be a weird optimization. Is that really legal for the
 compiler to do? Better safe than sorry, I guess.

C doesn't really have a multi-threaded memory model before C11, so the
compiler makers have just made one up that suits them best. For unlocked
memory reads the compiler is free to assume that no one else is accessing the
variable, so yes that optimization would be legal according to their rules.
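
As a concrete example of a transformation that is legal by their
rules (a sketch):

extern int flag;        /* written by another backend, read unlocked */

long
spin_until_set(void)
{
    long    iterations = 0;

    /* As written: re-read flag on every iteration. */
    while (!flag)
        iterations++;

    /*
     * Nothing in the loop can change flag as far as the compiler can
     * prove, so it may legally rewrite the loop as
     *
     *     if (!flag) for (;;) iterations++;
     *
     * hoisting the read out and spinning forever.  A volatile access
     * or a compiler barrier inside the loop prevents this.
     */
    return iterations;
}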

I'm tackling similar issues in my patch. What I'm thinking is that we should
change our existing supposedly atomic accesses to be more explicit and make
the API compatible with C11 atomics[1]. For now I think the changes should be
something like this:

* Have separate typedefs for atomic variants of datatypes (I don't think we
  have a whole lot of them). With C11 atomics support, this would change to

typedef _Atomic TransactionId AtomicTransactionId;

* Require that pg_atomic_init(type, var, val) be used to init atomic
  variables. Initially it would just pass through to assignment, as all
  supported platforms can do 32bit atomic ops. C11 compatible compilers will
  delegate to atomic_init().

* Create atomic read and write macros that look something like:

#define pg_atomic_load(type, var) (*((volatile type*) (var)))

  and

#define pg_atomic_store(type, var, val) do { \
*((volatile type*) (var)) = (val); \
} while(0)

  C11 would pass through to the compiler's implementation with relaxed memory
  ordering.

* Rename pg_read_barrier()/pg_write_barrier()/pg_memory_barrier() to
  pg_acquire_fence()/pg_release_fence()/pg_acq_rel_fence(). Herb Sutter makes
  a convincing argument why loosening up the barrier semantics is the sane
  choice here. [2] C11 support can then pass through to atomic_thread_fence().

This way we have a relatively future proof baseline for lockfree programming,
with options to expand with other primitives. We could also only do the
volatile access macros part, at least it would make the intention more clear
than the current approach of sprinkling around volatile pointers.
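
Usage of the proposed (not existing) API would then look something
like this sketch, with latestCompletedXid as a made-up example
variable:

AtomicTransactionId latestCompletedXid;

/* At startup:
 *   pg_atomic_init(TransactionId, &latestCompletedXid,
 *                  InvalidTransactionId);
 */

void
publish(TransactionId xid)
{
    pg_release_fence();    /* order prior writes before the store */
    pg_atomic_store(TransactionId, &latestCompletedXid, xid);
}

TransactionId
observe(void)
{
    TransactionId xid = pg_atomic_load(TransactionId,
                                       &latestCompletedXid);

    pg_acquire_fence();    /* order subsequent reads after the load */
    return xid;
}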

Regards,
Ants Aasma

[1] http://en.cppreference.com/w/c/atomic
[2] (long video about atomics)
http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] Questions about checksum feature in 9.3

2013-09-16 Thread Ants Aasma
On Sun, Sep 15, 2013 at 8:13 AM, Kevin k...@gatorgraphics.com wrote:
 My attempts to compile it vectorized on OS X seemed to have failed since I 
 don't find a vector instruction in the .o file even though the options 
 -msse4.1 -funroll-loops -ftree-vectorize should be supported according to the 
 man page for Apple's llvm-gcc.

I'm not sure what version of LLVM Apple is using for llvm-gcc. I know
that clang+llvm 3.3 can successfully vectorize the checksum algorithm
when -O3 is used.

 So, has anyone compiled checksum vectorized on OS X? Are there any 
 performance data that would indicate whether or not I should worry with this 
 in the first place?

Even without vectorization the worst case performance hit is about
20%. This is for a workload that is fully bottlenecked on swapping
pages between shared buffers and the OS cache. In real-world cases it's
hard to imagine it having any measurable effect. A single core can
checksum several gigabytes per second of I/O without vectorization,
and about 30GB/s with vectorization.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: [HACKERS] Master-slave visibility order

2013-08-29 Thread Ants Aasma
Hi, thanks for your reply.

On Thu, Aug 29, 2013 at 6:40 PM, Robert Haas robertmh...@gmail.com wrote:
 I think approach #2 is dead on arrival, at least as a default policy.
 It essentially amounts to requiring two commit records per transaction
 rather than one, and I think that has no chance of being acceptable.
 It's not just or even primarily the *volume* of WAL that I'm concerned
 about so much as the feeling that hitting WAL twice rather than once
 at the end of a transaction that may have only written one or two WAL
 records to begin with is going to slow things down pretty
 substantially, especially in high-concurrency scenarios.

Heikki's excellent work on WAL insert scaling improves this, so the hit
might not be all that big, considering that the visibility record only
needs to be inserted - relatively cheap compared to a WAL sync. But
it's still not likely to be free. I guess the only way to know for
sure would be to build it and bench it.

 I wouldn't entirely dismiss the idea of changing the user-visible
 semantics.  In addition to a WAL insertion pointer and a WAL flush
 pointer, you'd have a WAL snapshot pointer, which could run ahead of
 the flush pointer if the transactions were all asynchronous, but which
 for synchronous transactions could not advance faster than the flush
 pointer.  Only users running a mix of synchronous_commit=on and
 synchronous_commit=off would be harmed, and maybe we could convince
 ourselves that's OK.

Do you mean that mixed durability workloads with replication would
make async transactions wait or delay the visibility? We have the
additional complication of different synchronous_commit levels, so
this decision also affects different levels of synchronous commits.

 Still, there's no doubt that there is a downside there.  Therefore,
 I'm inclined to suggest that you implement #1.  If, at a later time,
 we want to make progress on the issue of cluster-wide snapshot
 consistency, you could implement #2 or #3 as an optional feature that
 can be turned on via some flag.  However, I would recommend against
 trying to do that in the initial patch; I think that doing either #2
 or #3 is really a separate feature, and I think if you try to
 incorporate all of that code into the main CSN patch it's just going
 to be a distraction from what figures to be a very complicated patch
 even in minimal form.

I'll go with #1. I agree that snapshot consistency is a separate feature
that is mostly orthogonal to CSN snapshots. I wanted to get this
decision out of the way, so when it's time to discuss the actual patch
we don't have the distraction of discussing why LSNs are not workable
for determining visibility order.

 If you did choose to implement #2 as an option at some point, it would
 probably be worth optimizing for the case where commit ordering and
 visibility ordering match, and try to find a design where you only
 need the extra WAL record when the orderings don't match.  I'm not
 sure exactly how to do that, but it might be worth investigating.  I
 don't think that's enough to save #2 as a default behavior, but it
 might make it more palatable as an option.

Without a side channel the extra WAL record is necessary. Suppose that
we want to determine the ordering with a single commit record. The
slave must be able to deduce from the single record whether it can make
the commit immediately visible or should wait for additional
information. If it waits for additional information, that may never
come, as the master could have committed and then gone idle. If it
doesn't wait, then an async transaction could arrive on master, commit
and would want to become visible, but the master can't make it visible
without either violating the visibility order or letting the async
transaction wait behind the sync. In other words, without an oracle
(in the computer science sense :) ) the master can't determine at the
time of commit record generation whether the orderings can differ, and as WAL
the only communication channel, neither can the slave. Timeouts won't
help either as that would need clock synchronization between servers,
similarly to Google's F1 system.

Speaking of F1, they solve the same problem by having clients be aware
of how fresh they want their snapshot to be. If we add this capability
then clients aware of this functionality could shift the visibility
wait from commit to the start of the next transaction that needs to see
the changes.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

