Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-10 Thread Ants Aasma
On Mon, 8 Apr 2024 at 16:26, Robert Haas  wrote:

> And maybe we need to think of a way to further mitigate this crush of
> last minute commits. e.g. In the last week, you can't have more
> feature commits, or more lines of insertions in your commits, than you
> did in the prior 3 weeks combined. I don't know. I think this mad rush
> of last-minute commits is bad for the project.
>

I think some part of this rush of commits could also be explained as a form
of entrainment [1]. Only patches reasonably close to commit will get picked
up with extra attention to get them ready before the deadline. After the
release hammer drops, the pool of remaining patches will have few patches
that are close to commit. And to make matters worse, the attention spent
working on them will be spread thinner. When repeated, this pattern can be
self-reinforcing.

If this hypothesis is true, maybe some forces could be introduced to
counteract this natural tendency. I don't have any bright ideas on how
exactly yet.

Ants

[1] Emergent synchronization of interacting oscillators, see:
https://en.wikipedia.org/wiki/Injection_locking#Entrainment
https://en.wikipedia.org/wiki/Entrainment_(biomusicology)


Re: Popcount optimization using AVX512

2024-04-05 Thread Ants Aasma
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart  wrote:
> Here is an updated patch set.  IMHO this is in decent shape and is
> approaching committable.

I checked the code generation on various gcc and clang versions. It
looks mostly fine starting from versions where avx512 is supported,
gcc-7.1 and clang-5.

The main issue I saw was that clang was able to peel off the first
iteration of the loop and then eliminate the mask assignment and
replace the masked load with a memory operand for vpopcnt. I was not able
to convince gcc to do that regardless of optimization options.
Generated code for the inner loop:

clang:
:
  50:  add rdx, 64
  54:  cmp rdx, rdi
  57:  jae 
  59:  vpopcntq zmm1, zmmword ptr [rdx]
  5f:  vpaddq zmm0, zmm1, zmm0
  65:  jmp 

gcc:
:
  38:  kmovq k1, rdx
  3d:  vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
  43:  add rax, 64
  47:  mov rdx, -1
  4e:  vpopcntq zmm0, zmm0
  54:  vpaddq zmm0, zmm0, zmm1
  5a:  vmovdqa64 zmm1, zmm0
  60:  cmp rax, rsi
  63:  jb 

I'm not sure how much that matters in practice. Attached is a patch that
does this manually, giving essentially the same result with gcc. As most
distro packages are built using gcc, I think it would make sense to
have the extra code if it gives a noticeable benefit for large cases.

The visibility map patch has the same issue, otherwise looks good.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index dacc7553d29..f6e718b86e9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -52,13 +52,21 @@ pg_popcount_avx512(const char *buf, int bytes)
 	 * Iterate through all but the final iteration.  Starting from second
 	 * iteration, the start index mask is ignored.
 	 */
-	for (; buf < final; buf += sizeof(__m512i))
+	if (buf < final)
 	{
 		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
 		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
 
+		buf += sizeof(__m512i);
 		mask = ~UINT64CONST(0);
+
+		for (; buf < final; buf += sizeof(__m512i))
+		{
+			val = _mm512_load_si512((const __m512i *) buf);
+			cnt = _mm512_popcnt_epi64(val);
+			accum = _mm512_add_epi64(accum, cnt);
+		}
 	}
 
 	/* Final iteration needs to ignore bytes that are not within the length */


Re: Popcount optimization using AVX512

2024-04-04 Thread Ants Aasma
On Thu, 4 Apr 2024 at 01:50, Nathan Bossart  wrote:
> If we can verify this approach won't cause segfaults and can stomach the
> regression between 8 and 16 bytes, I'd happily pivot to this approach so
> that we can avoid the function call dance that I have in v25.

The approach I posted does not rely on masked loads performing page
fault suppression. All loads are 64-byte aligned and always contain at
least one byte of the buffer, and are therefore guaranteed to be within
a valid page.

I personally don't mind it being slower for the very small cases,
because when performance on those sizes really matters it makes much
more sense to shoot for an inlined version instead.

Speaking of which, what does bumping up the inlined version threshold
to 16 do with and without AVX-512 available? Linearly extrapolating
the 2 and 4 byte numbers, it might just come out ahead in both cases,
making the choice easy.

Regards,
Ants Aasma




Re: Popcount optimization using AVX512

2024-04-02 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> > What about using the masking capabilities of AVX-512 to handle the
> > tail in the same code path? Masked out portions of a load instruction
> > will not generate an exception. To allow byte level granularity
> > masking, -mavx512bw is needed. Based on wikipedia this will only
> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
> > VPOPCNTQ implies availability of BW.
>
> Sounds promising.  IMHO we should really be sure that these kinds of loads
> won't generate segfaults and the like due to the masked-out portions.  I
> searched around a little bit but haven't found anything that seemed
> definitive.

After sleeping on the problem, I think we can avoid this question
altogether while making the code faster by using aligned accesses.
Loads that straddle cache line boundaries run internally as 2 load
operations. Gut feel says that there are enough out-of-order resources
available to make it not matter in most cases. But even so, not doing
the extra work is surely better. Attached is another approach that
does aligned accesses, and thereby avoids going outside bounds.

Would be interesting to see how well that fares in the small use case.
Anything that fits into one aligned cache line should be constant
speed, and there is only one branch, but the mask setup and folding
the separate popcounts together should add up to about 20-ish cycles
of overhead.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..e1fbd98fa14 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,44 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
 	__m512i		accum = _mm512_setzero_si512();
+	const char *final;
+	int 		tail_idx;
+	__mmask64	mask = -1;
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
-	{
-		const		__m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const		__m512i cnt = _mm512_popcnt_epi64(val);
+	/*
+	 * Align buffer down to avoid double load overhead from unaligned access.
+	 * Calculate a mask to ignore preceding bytes. Find start offset of final
+	 * iteration and number of valid bytes making sure that final iteration
+	 * is not empty.
+	 */
+	mask <<= ((uintptr_t) buf) % sizeof(__m512i);
+	tail_idx = (((uintptr_t) buf + bytes - 1) % sizeof(__m512i)) + 1;
+	final = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf + bytes - 1);
+	buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
 
+	/*
+	 * Iterate through all but the final iteration. Starting from second
+	 * iteration, the start index mask is ignored.
+	 */
+	for (; buf < final; buf += sizeof(__m512i))
+	{
+		val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 		accum = _mm512_add_epi64(accum, cnt);
-		buf += sizeof(__m512i);
+
+		mask = -1;
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	/* Final iteration needs to ignore bytes that are not within the length */
+	mask &= ((~0ULL) >> (64 - tail_idx));
+
+	val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif			/* TRY_POPCNT_FAST */


Re: Popcount optimization using AVX512

2024-04-01 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart  wrote:
>
> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote:
> > What about using the masking capabilities of AVX-512 to handle the
> > tail in the same code path? Masked out portions of a load instruction
> > will not generate an exception. To allow byte level granularity
> > masking, -mavx512bw is needed. Based on wikipedia this will only
> > disable this fast path on Knights Mill (Xeon Phi), in all other cases
> > VPOPCNTQ implies availability of BW.
>
> Sounds promising.  IMHO we should really be sure that these kinds of loads
> won't generate segfaults and the like due to the masked-out portions.  I
> searched around a little bit but haven't found anything that seemed
> definitive.

Interestingly the Intel software developer manual is not exactly
crystal clear on how memory faults with masks work, but volume 2A
chapter 2.8 [1] does specify that MOVDQU8 is of exception class E4.nb
that supports memory fault suppression on page fault.

Regards,
Ants Aasma

[1] https://cdrdv2-public.intel.com/819712/253666-sdm-vol-2a.pdf




Re: Popcount optimization using AVX512

2024-04-01 Thread Ants Aasma
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart  wrote:
>
> On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> > On 2024-Mar-31, Nathan Bossart wrote:
> >> +popcnt = _mm512_reduce_add_epi64(accum);
> >> +return popcnt + pg_popcount_fast(buf, bytes);
> >
> > Hmm, doesn't this arrangement cause an extra function call to
> > pg_popcount_fast to be used here?  Given the level of micro-optimization
> > being used by this code, I would have thought that you'd have tried to
> > avoid that.  (At least, maybe avoid the call if bytes is 0, no?)
>
> Yes, it does.  I did another benchmark on very small arrays and can see the
> overhead.  This is the time in milliseconds to run pg_popcount() on an
> array 1 billion times:
>
> size (bytes)  HEAD      AVX512-POPCNT
> 1             1707.685  3480.424
> 2             1926.694  4606.182
> 4             3210.412  5284.506
> 8             1920.703  3640.968
> 16            2936.91   4045.586
> 32            3627.956  5538.418
> 64            5347.213  3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is
> the earliest point where there are enough bytes for ZMM registers.

What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.

Attached is an example of what I mean. I did not have a machine to
test it with, but the code generated looks sane. I added the clang
pragma because it insisted on unrolling otherwise, and based on how the
instruction dependencies look, that is probably not too helpful even
for large cases (needs to be tested). The configure check and compile
flags of course need to be amended for BW.

Regards,
Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index f86558d1ee5..7fb2ada16c9 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -30,20 +30,27 @@
 uint64
 pg_popcount_avx512(const char *buf, int bytes)
 {
-	uint64		popcnt;
+	__m512i		val, cnt;
+	__mmask64	remaining_mask;
 	__m512i		accum = _mm512_setzero_si512();
 
-	for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i))
+	#pragma clang loop unroll(disable)
+	for (; bytes > sizeof(__m512i); bytes -= sizeof(__m512i))
 	{
-		const		__m512i val = _mm512_loadu_si512((const __m512i *) buf);
-		const		__m512i cnt = _mm512_popcnt_epi64(val);
+		val = _mm512_loadu_si512((const __m512i *) buf);
+		cnt = _mm512_popcnt_epi64(val);
 
 		accum = _mm512_add_epi64(accum, cnt);
 		buf += sizeof(__m512i);
 	}
 
-	popcnt = _mm512_reduce_add_epi64(accum);
-	return popcnt + pg_popcount_fast(buf, bytes);
+	remaining_mask = ~0ULL >> (sizeof(__m512i) - bytes);
+	val = _mm512_maskz_loadu_epi8(remaining_mask, (const __m512i *) buf);
+	cnt = _mm512_popcnt_epi64(val);
+
+	accum = _mm512_add_epi64(accum, cnt);
+
+	return _mm512_reduce_add_epi64(accum);
 }
 
 #endif			/* TRY_POPCNT_FAST */


Re: Infinite loop in XLogPageRead() on standby

2024-03-15 Thread Ants Aasma
On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi  wrote:
>
> At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > Oh, I once saw the fix work, but seems not to be working after some
> > point. The new issue was a corruption of received WAL records on the
> > first standby, and it may be related to the setting.
>
> I identified the cause of the second issue. When I tried to replay the
> issue, the second standby accidentally received the old timeline's
> last page-spanning record till the end while the first standby was
> promoting (but it had not been read by recovery). In addition to that,
> on the second standby, there's a time window where the timeline
> increased but the first segment of the new timeline is not available
> yet. In this case, the second standby successfully reads the
> page-spanning record in the old timeline even after the second standby
> noticed that the timeline ID has been increased, thanks to the
> robustness of XLogFileReadAnyTLI().
>
> I think the primary change to XLogPageRead that I suggested is correct
> (assuming the use of wal_segment_size instead of the
> constant). However, still XLogFileReadAnyTLI() has a chance to read
> the segment from the old timeline after the second standby notices a
> timeline switch, leading to the second issue. The second issue was
> fixed by preventing XLogFileReadAnyTLI from reading segments from
> older timelines than those suggested by the latest timeline
> history. (In other words, disabling the "AnyTLI" part).
>
> I recall that there was a discussion for commit 4bd0ad9e44, about the
> objective of allowing reading segments from older timelines than the
> timeline history suggests. In my faint memory, we concluded to
> postpone making the decision to remove the feature due to uncertainity
> about the objective. If there's no clear reason to continue using
> XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
> XLogFileReadOnTLHistory(), which reads segments that align precisely
> with the timeline history.


This sounds very similar to the problem described in [1]. And I think
both will be resolved by that change.

[1] 
https://postgr.es/m/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw%40mail.gmail.com




Re: Change GUC hashtable to use simplehash?

2024-01-30 Thread Ants Aasma
On Tue, 30 Jan 2024 at 12:04, John Naylor  wrote:
>
> On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma  wrote:
> > But given that we know the data length and we have it in a register
> > already, it's easy enough to just mask out data past the end with a
> > shift. See patch 1. Performance benefit is about 1.5x Measured on a
> > small test harness that just hashes and finalizes an array of strings,
> > with a data dependency between consecutive hashes (next address
> > depends on the previous hash output).
>
> Interesting work! I've taken this idea and (I'm guessing, haven't
> tested) improved it by re-using an intermediate step for the
> conditional, simplifying the creation of the mask, and moving the
> bitscan out of the longest dependency chain. Since you didn't attach
> the test harness, would you like to run this and see how it fares?
> (v16-0001 is same as your 0001, and v16-0002 builds upon it.) I plan
> to test myself as well, but since your test tries to model true
> latency, I'm more interested in that one.

It didn't calculate the same result because the if (mask) condition
was incorrect. Changed it to if (chunk & 0xFF) and removed the right
shift from the mask. It seems to be half a nanosecond faster, but as I
don't have a machine set up for microbenchmarking it's quite close to
measurement noise.

I didn't post the harness as it's currently so messy as to be nearly
useless to others. But if you'd like to play around, I can tidy it up
a bit and post it.

> > Not sure if the second one is worth the extra code.
>
> I'd say it's not worth optimizing the case we think won't be taken
> anyway. I also like having a simple path to assert against.

Agreed.

As an addendum, I couldn't resist trying out 256-bit vectors with
two parallel AES hashes running, using unaligned loads with special
casing for loads that straddle a page boundary. Requires
-march=x86-64-v3 -maes. About 20% faster than fasthash on short
strings, 2.2x faster on 4k strings. Right now it requires 4-byte
alignment (uses vpmaskmovd), but could be made to work with any
alignment.

Regards,
Ants Aasma
#include <immintrin.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000

uint64_t
fast_vec_hash_cstring_avx2(char *buf)
{
__m128i hash0 = {0, 0};
__m128i hash1 = {0, 0};

__m128i k0 = {0x0807060504030201, 0x100F0E0D0C0B0A09};
__m128i k1 = {0x1117161514131211, 0x201F1E1D1C1B1A19};

char *cur = buf;

int mask;
__m256i chunk;
int offset = (uintptr_t) buf & (sizeof(chunk) - 1);
int endpos;


do {

char *end_of_page = (char*) ((((uintptr_t) cur) | (PAGE_SIZE-1)) + 1);
for (; cur + sizeof(chunk) <= end_of_page; cur += sizeof(chunk))
{
chunk = _mm256_loadu_si256((__m256i*) cur);
__m256i ends = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
mask = _mm256_movemask_epi8(ends);
if (mask)
goto last_iteration;
hash0 = _mm_aesenc_si128(hash0, k0);
hash1 = _mm_aesenc_si128(hash1, k1);
hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
}
if (offset)
{
__m256i load_mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(offset / 4), _mm256_setr_epi32(0,1,2,3,4,5,6,7));
chunk = _mm256_maskload_epi32((const int*) cur, load_mask);
__m256i ends = load_mask & _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
mask = _mm256_movemask_epi8(ends);
if (mask)
goto last_iteration;
chunk |= _mm256_maskload_epi32((const int*) cur, load_mask);
ends = load_mask & _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
mask = _mm256_movemask_epi8(ends);
if (mask)
goto last_iteration;
hash0 = _mm_aesenc_si128(hash0, k0);
hash1 = _mm_aesenc_si128(hash1, k1);
hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
cur += sizeof(chunk);
}
} while(1);


last_iteration:
// chunk contains data, mask contains location of end of line
endpos = _tzcnt_u32(mask);
_mm256_cmpgt_epi8(_mm256_set1_epi8(endpos), _mm256_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31));
hash0 = _mm_aesenc_si128(hash0, k0);
hash1 = _mm_aesenc_si128(hash1, k1);
hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));

hash0 = _mm_aesenc_si128(hash0, k0);
hash1 = _mm_aesenc_si128(hash1, k1);
hash0 = _mm_aesenc_si128(hash0, k1);
hash1 = _mm_aesenc_si128(hash1, k0);
hash0 = _mm_aesenc_si128(hash0, k0);
hash1 = _mm_aesenc_si128(hash1, k1);

__m128i intermediate = hash1 ^ hash0;
return intermediate[1] ^ intermediate[0];
}



Re: Change GUC hashtable to use simplehash?

2024-01-29 Thread Ants Aasma
On Sun, 21 Jan 2024 at 03:06, Jeff Davis  wrote:
> Yes, thank you. I don't think we need to change the algorithm.

Jumping in here at a random point just to share my findings from
poking around this on and off. I am concentrating here on cstring
hashing as that is the most complicated one.

One thing that caught my eye in testing was that the unaligned cstring
code was unexpectedly faster for short strings (3-18B uniform
distribution). Looking into it, the cause was fasthash_accum() called
in the final iteration. In the unaligned case the compiler (clang-15)
unrolled the inner loop, which allowed it to jump directly into the
correct place in the switch. In the aligned case clang decided to
use a data dependent jump which then mispredicts all of the time.

But given that we know the data length and we have it in a register
already, it's easy enough to just mask out data past the end with a
shift. See patch 1. The performance benefit is about 1.5x, measured on a
small test harness that just hashes and finalizes an array of strings,
with a data dependency between consecutive hashes (the next address
depends on the previous hash output).
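
For illustration, here is a minimal sketch of the masking idea (not the
actual patch; it assumes a little-endian layout and that the final word
was loaded from within the string, with 1 <= len <= 8):

#include <stdint.h>

/* Zero out the bytes past the end of the string in the final word. */
static inline uint64_t
mask_tail(uint64_t word, int len)
{
    return word & (~UINT64_C(0) >> (64 - len * 8));
}

With that, the tail can be mixed in as one word instead of going through
the byte-at-a-time switch.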

Unaligned case can actually take advantage of the same trick as the
aligned case, it just has to shuffle the data from two consecutive
words before applying the combine function. Patch 2 implements this.
It makes the unaligned case almost as fast as the aligned one, both on
short and long strings. 10% benefit on short strings, 50% on long
ones.

Not sure if the second one is worth the extra code. A different
approach would be to use the simple word at a time hashing for the
unaligned case too and handle word accesses that straddle a page
boundary as a special case. Obviously this only makes sense for
platforms that support unaligned access. On x86 unaligned access
within a cache line is basically free, and across cache lines is only
slightly more expensive. On benchmarks calling the aligned code on
unaligned strings only has a 5% penalty on long strings, short ones
are indistinguishable.
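
The special case could be as simple as something like this rough sketch
(the 4 kB page size is an assumption and would really come from the
platform):

#include <stdbool.h>
#include <stdint.h>

/* Would an 8-byte load starting at p cross a (4 kB) page boundary? */
static inline bool
word_load_crosses_page(const char *p)
{
    const uintptr_t page_size = 0x1000;

    return ((uintptr_t) p & (page_size - 1)) > page_size - sizeof(uint64_t);
}

Only when this returns true would the word-at-a-time path need to fall
back to handling the last word byte by byte.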

I also took a look at using SIMD for implementing the hash using the
same aligned access + shuffle trick. The good news is that the
shuffling works well enough that neither it nor checking for string
end are the longest chain. The bad news is that the data load,
alignment, zero finding and masking form a big dependency chain on the
first iteration. Mixing and finalization is even worse, fasthash uses
64bit imul instruction that has a 3 cycle latency, the iteration to
iteration chain is imul + xor, for 4 cycles or 2 B/cycle (in practice
a bit less due to ALU port contention). In SIMD registers there is no
64bit multiply, and 32 bit multiply has a terrible 10 cycle latency on
Intel. AES instructions are an interesting option, but it seems that 2
are needed for good enough mixing, at 4 cycles each, we again end up
at 2B/cycle. Finalization needs another 3 AES instructions, a shuffle
and a xor fold to pass SMHasher, for 17 cycles. The mix latency issue
could be worked around by doing more mixing in parallel, potentially
up to 8x faster, but this does not help short strings at all and would
make the code way bigger. SIMD code does use fewer instructions so it
interleaves better with nearby code that is not dependent on it, not
sure if that matters anywhere.

The short version is that for very long (4k+) strings the attached
SIMD code is 35% faster, for short strings it is 35% slower, and this
is very much x86-64-v3 only and would need a fallback when AVX and
AES-NI are not available. Basically a dead end for the use cases this
hash function is used for.

Regards,
Ants Aasma
From 912f46be12536985dda7bcfb669d4ec13e79d073 Mon Sep 17 00:00:00 2001
From: Ants Aasma 
Date: Mon, 29 Jan 2024 21:07:44 +0200
Subject: [PATCH 2/2] Unaligned fasthash word at a time hashing

About 10% performance benefit on short strings, 50% on long ones,
making the performance almost identical to the aligned case.
---
 src/include/common/hashfn_unstable.h | 156 +++
 1 file changed, 138 insertions(+), 18 deletions(-)

diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 8ee1b99a204..1e44814d84a 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -189,6 +189,38 @@ first_byte_nonzero(uint64 v)
 #endif
 }
 
+/*
+ * Selects first n bits in memory order and masks the rest with NUL.
+ * Using value 0 for n results in undefined behavior.
+ */
+static inline uint64
+first_n64(uint64 v, uint64 n)
+{
+	Assert(0 < n && n <= 64);
+#ifdef WORDS_BIGENDIAN
+		return v & ((~0ULL) << (64 - n));
+#else
+		return v & ((~0ULL) >> (64 - n));
+#endif
+}
+
+/*
+ * Does the equivalent of an unaligned word access into two consecutive
+ * words, taking the last 8 - offset bytes from first and adding first
+ * offset bytes from second word. offset must be in range [1..7]
+ */
+static inline uint64
+align_n64(uint64

Re: add AVX2 support to simd.h

2024-01-09 Thread Ants Aasma
On Tue, 9 Jan 2024 at 18:20, Nathan Bossart  wrote:
>
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart  
> > wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path. There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31. That's getting
> >> > big enough to be noticeable, I suspect. It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now. It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors.  But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations. And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this.  Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?

For some operations (min, max, = any) processing the same elements
multiple times doesn't change the result. So the vectors for first
and/or last iterations can overlap with the main loop. In other cases
it's possible to mask out the invalid elements and replace them with
zeroes. Something along the lines of:

static inline Vector8
vector8_mask_right(int num_valid)
{
__m256i seq = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24,
  23, 22, 21, 20, 19, 18, 17, 16,
  15, 14, 13, 12, 11, 10, 9, 8,
  7, 6, 5, 4, 3, 2, 1, 0);
return _mm256_cmpgt_epi8(_mm256_set1_epi8(num_valid), seq);
}

/* final incomplete iteration */
Vector8 mask = vector8_mask_right(end - cur);
final_vec = vector8_and(*(Vector8 *) (end - sizeof(Vector8)), mask);
accum = vector8_add(accum, final_vec);
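
And for comparison, a rough sketch of the overlapping variant for an
order-insensitive membership test, written with raw AVX2 intrinsics
rather than the simd.h wrappers (the function name and the nelem >= 8
assumption are mine):

#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Is needle present in base[0 .. nelem-1]? Assumes nelem >= 8. */
static inline bool
lfind32_overlap_avx2(const uint32_t *base, uint32_t needle, size_t nelem)
{
    const __m256i key = _mm256_set1_epi32((int) needle);
    size_t i;

    for (i = 0; i + 8 <= nelem; i += 8)
    {
        __m256i cmp = _mm256_cmpeq_epi32(
            _mm256_loadu_si256((const __m256i *) (base + i)), key);

        if (!_mm256_testz_si256(cmp, cmp))
            return true;
    }

    if (i < nelem)
    {
        /* Overlapping final load re-checks up to 7 already-seen elements. */
        __m256i cmp = _mm256_cmpeq_epi32(
            _mm256_loadu_si256((const __m256i *) (base + nelem - 8)), key);

        if (!_mm256_testz_si256(cmp, cmp))
            return true;
    }

    return false;
}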

It helps that on any halfway recent x86 unaligned loads only have a
minor performance penalty and only when straddling cache line
boundaries. Not sure what the state on ARM is. If we don't care about
unaligned loads then we only need to care about the load not crossing
page boundaries which could cause segfaults. Though I'm sure memory
sanitizer tools will have plenty to complain about around such hacks.




Re: add AVX2 support to simd.h

2024-01-09 Thread Ants Aasma
On Tue, 9 Jan 2024 at 16:03, Peter Eisentraut  wrote:
> On 29.11.23 18:15, Nathan Bossart wrote:
> > Using the same benchmark as we did for the SSE2 linear searches in
> > XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
> >
> >   writers  sse2  avx2    %
> >       256  1195  1188   -1
> >       512   928  1054  +14
> >      1024   633   716  +13
> >      2048   332   420  +27
> >      4096   162   203  +25
> >      8192   162   182  +12
>
> AFAICT, your patch merely provides an alternative AVX2 implementation
> for where currently SSE2 is supported, but it doesn't provide any new
> API calls or new functionality.  One might naively expect that these are
> just two different ways to call the underlying primitives in the CPU, so
> these performance improvements are surprising to me.  Or do the CPUs
> actually have completely separate machinery for SSE2 and AVX2, and just
> using the latter to do the same thing is faster?

The AVX2 implementation uses a wider vector register. On most current
processors the throughput of the instructions in question is the same
on 256bit vectors as on 128bit vectors. Basically, the chip has AVX2
worth of machinery and using SSE2 leaves half of it unused. Notable
exceptions are efficiency cores on recent Intel desktop CPUs and AMD
CPUs pre Zen 2 where AVX2 instructions are internally split up into
two 128bit wide instructions.

For AVX512 the picture is much more complicated. Some instructions run
at half rate, some at full rate, but not on all ALU ports, some
instructions cause aggressive clock rate reduction on some
microarchitectures. AVX-512 adds mask registers and masked vector
instructions that enable quite a bit simpler code in many cases.
Interestingly I have seen Clang make quite effective use of these
masked instructions even when using AVX2 intrinsics, but targeting an
AVX-512 capable platform.

The vector width independent approach used in the patch is nice for
simple cases by not needing a separate implementation for each vector
width. However for more complicated cases where "horizontal"
operations are needed it's going to be much less useful. But these
cases can easily just drop down to using intrinsics directly.




Re: autovectorize page checksum code included elsewhere

2023-11-22 Thread Ants Aasma
On Wed, 22 Nov 2023 at 11:44, John Naylor  wrote:
>
> On Tue, Nov 7, 2023 at 9:47 AM Nathan Bossart  
> wrote:
> >
> > Presently, we ask compilers to autovectorize checksum.c and numeric.c.  The
> > page checksum code actually lives in checksum_impl.h, and checksum.c just
> > includes it.  But checksum_impl.h is also used in pg_upgrade/file.c and
> > pg_checksums.c, and since we don't ask compilers to autovectorize those
> > files, the page checksum code may remain un-vectorized.
>
> Poking in those files a bit, I also see references to building with
> SSE 4.1. Maybe that's an avenue that we should pursue? (an indirect
> function call is surely worth it for page-sized data)

For reference, executing the page checksum 10M times on a AMD 3900X CPU:

clang-14 -O2                  4.292s (17.8 GiB/s)
clang-14 -O2 -msse4.1         2.859s (26.7 GiB/s)
clang-14 -O2 -msse4.1 -mavx2  1.378s (55.4 GiB/s)

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-08 Thread Ants Aasma
On Sat, 4 Nov 2023 at 22:08, Andrey M. Borodin  wrote:

> On 30 Oct 2023, at 09:20, Dilip Kumar  wrote:
>
> changed the logic of SlruAdjustNSlots() in 0002, such that now it
> starts with the next power of 2 value of the configured slots and
> keeps doubling the number of banks until we reach the number of banks
> to the max SLRU_MAX_BANKS(128) and bank size is bigger than
> SLRU_MIN_BANK_SIZE (8).  By doing so, we will ensure we don't have too
> many banks
>
> There was nothing wrong with having too many banks. Until bank-wise locks
> and counters were added in later patchsets.
> Having hashtable to find SLRU page in the buffer IMV is too slow. Some
> comments on this approach can be found here [0].
> I'm OK with having HTAB for that if we are sure performance does not
> degrade significantly, but I really doubt this is the case.
> I even think SLRU buffers used HTAB in some ancient times, but I could not
> find commit when it was changed to linear search.
>
> Maybe we could decouple locks and counters from SLRU banks? Banks were
> meant to be small to exploit performance of local linear search. Lock
> partitions have to be bigger for sure.
>

Is there a particular reason why lock partitions need to be bigger? We have
one lock per buffer anyway, so bank-wise locks will increase the number of
locks by less than 10%.

I am working on trying out a SIMD based LRU mechanism that uses a 16 entry
bank. The data layout is:

struct CacheBank {
int page_numbers[16];
char access_age[16];
};

The first part uses up one cache line, and the second cache line has 48 bytes
of space left over that could fit an LWLock and the page_status and page_dirty
arrays.

Lookup + LRU maintenance has 20 instructions/14 cycle latency and the only
branch is for found/not found. Hoping to have a working prototype of SLRU
on top in the next couple of days.
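
To give an idea of the shape of it, here is a rough sketch of the lookup
path with AVX2 intrinsics (the function name is made up and the real
prototype may differ; ages are kept as a permutation of 0..15, so the
replacement victim is always the slot with age 15):

#include <immintrin.h>
#include <stdint.h>

typedef struct CacheBank
{
    int  page_numbers[16];   /* fills one 64-byte cache line */
    char access_age[16];     /* permutation of 0..15, 0 = most recent */
} CacheBank;

/* Return the slot holding pageno and update the LRU ages, or -1 on miss. */
static inline int
bank_lookup(CacheBank *bank, int pageno)
{
    __m256i  key = _mm256_set1_epi32(pageno);
    __m256i  lo = _mm256_loadu_si256((const __m256i *) &bank->page_numbers[0]);
    __m256i  hi = _mm256_loadu_si256((const __m256i *) &bank->page_numbers[8]);
    uint32_t hits;
    int      slot;
    __m128i  ages;
    __m128i  bump;

    hits = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(lo, key))) |
        (_mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(hi, key))) << 8);
    if (hits == 0)
        return -1;              /* the only branch: found/not found */
    slot = __builtin_ctz(hits);

    /* Entries younger than the hit age by one, the hit becomes youngest. */
    ages = _mm_loadu_si128((const __m128i *) bank->access_age);
    bump = _mm_and_si128(_mm_cmpgt_epi8(_mm_set1_epi8(bank->access_age[slot]), ages),
                         _mm_set1_epi8(1));
    _mm_storeu_si128((__m128i *) bank->access_age, _mm_add_epi8(ages, bump));
    bank->access_age[slot] = 0;

    return slot;
}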

Regards,
Ants Aasma


Re: Lowering the default wal_blocksize to 4K

2023-10-12 Thread Ants Aasma
On Thu, 12 Oct 2023 at 16:36, Robert Haas  wrote:

> On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro 
> wrote:
> > That leaves only the segments where a record starts exactly on the
> > first usable byte of a segment, which is why I was trying to think of
> > a way to cover that case too.  I suggested we could notice and insert
> > a new record at that place.  But Andres suggests it would be too
> > expensive and not worth worrying about.
>
> Hmm. Even in that case, xl_prev has to match. It's not like it's the
> wild west. Sure, it's not nearly as good of a cross-check, but it's
> something. It seems to me that it's not worth worrying very much about
> xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
> you're doing that kind of advanced magic, you need to be careful
> enough to not mess it up, and if we still cross-check once per
> checkpoint cycle that's pretty good. I do worry a bit about the sysid
> changing under us, though. It's not that hard to get your WAL archives
> mixed up, and it'd be nice to catch that right away.
>

This reminds me that xlp_tli is not being used to its full potential right
now either. We only check that it's not going backwards, but there is at
least one not-very-hard-to-hit way to get postgres to silently replay on
the wrong timeline. [1]

[1]
https://www.postgresql.org/message-id/canwkhkmn3qwacvudzhb6wsvlrtkwebiyso-klfykkqvwuql...@mail.gmail.com
-- 

Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com


Re: Disabling Heap-Only Tuples

2023-07-07 Thread Ants Aasma
On Fri, 7 Jul 2023 at 13:18, Tomas Vondra  wrote:
> On 7/7/23 11:55, Matthias van de Meent wrote:
> > On Fri, 7 Jul 2023 at 06:53, Dilip Kumar  wrote:
> >>
> >> On Fri, Jul 7, 2023 at 1:48 AM Matthias van de Meent
> >>  wrote:
> >>>
> >>> On Wed, 5 Jul 2023 at 19:55, Thom Brown  wrote:
> >>>>
> >>>> On Wed, 5 Jul 2023 at 18:05, Matthias van de Meent
> >>>>  wrote:
> >>>>> So what were you thinking of? A session GUC? A table option?
> >>>>
> >>>> Both.
> >>>
> >>> Here's a small patch implementing a new table option max_local_update
> >>> (name very much bikesheddable). Value is -1 (default, disabled) or the
> >>> size of the table in MiB that you still want to allow to update on the
> >>> same page. I didn't yet go for a GUC as I think that has too little
> >>> control on the impact on the system.
> >>
> >> So IIUC, this parameter we can control that instead of putting the new
> >> version of the tuple on the same page, it should choose using
> >> RelationGetBufferForTuple(), and that can reduce the fragmentation
> >> because now if there is space then most of the updated tuple will be
> >> inserted in same pages.  But this still can not truncate the pages
> >> from the heap right? because we can not guarantee that the new page
> >> selected by RelationGetBufferForTuple() is not from the end of the
> >> heap, and until we free the pages from the end of the heap, the vacuum
> >> can not truncate any page.  Is my understanding correct?
> >
> > Yes. If you don't have pages with (enough) free space for the updated
> > tuples in your table, or if the FSM doesn't accurately reflect the
> > actual state of free space in your table, this won't help (which is
> > also the reason why I run vacuum in the tests). It also won't help if
> > you don't update the tuples physically located at the end of your
> > table, but in the targeted workload this would introduce a bias where
> > new tuple versions are moved to the front of the table.
> >
> > Something to note is that this may result in very bad bloat when this
> > is combined with a low fillfactor: All blocks past max_local_update
> > will be unable to use space reserved by fillfactor because FSM lookups
> > always take fillfactor into account, and all updates (which ignore
> > fillfactor when local) would go through the FSM instead, thus reducing
> > the space available on each block to exactly the fillfactor. So, this
> > might need some extra code to make sure we don't accidentally blow up
> > the table's size with UPDATEs when max_local_update is combined with
> > low fillfactors. I'm not sure where that would fit best.
> >
>
> I know the thread started as "let's disable HOT" and this essentially
> just proposes to do that using a table option. But I wonder if that's
> far too simple to be reliable, because hoping RelationGetBufferForTuple
> happens to do the right thing does not seem great.
>
> I wonder if we should invent some definition of "strategy" that would
> tell RelationGetBufferForTuple what it should aim for ...
>
> I'm imagining either a table option with a couple possible values
> (default, non-hot, first-page, ...) or maybe something even more
> elaborate (perhaps even a callback?).
>
> Now, it's not my intention to hijack this thread, but this discussion
> reminds me one of the ideas from my "BRIN improvements" talk, about
> maybe using BRIN indexes for routing. UPDATEs may be a major issue for
> BRIN, making them gradually worse over time. If we could "tell"
> RelationGetBufferForTuple() which buffers are more suitable (by looking
> at an index, histogram or some approximate mapping), that might help.

Just as another point in support of strategy-based/extensible tuple
placement, I would at some point like to try out placing INSERT ON CONFLICT
tuples on the same page as the preceding key in the index. The use case is
tables with a (series, timestamp) primary key, to get locality of
access when range scanning a single series. Placement will always be a
tradeoff that is dependent on hardware and workload, and the effect
can be pretty large. For the mentioned use case, if placement can
maintain some semblance of clustering, there will be a 10-100x
reduction in buffers accessed for a relatively minor increase in
bloat.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Re: ReadRecentBuffer() doesn't scale well

2023-06-27 Thread Ants Aasma
On Tue, 27 Jun 2023 at 18:40, Andres Freund  wrote:
> On 2023-06-27 14:49:48 +0300, Ants Aasma wrote:
> > If you want to experiment, here is a rebased version of something I
> > hacked up a couple of years back on the way to Fosdem Pgday. I didn't
> > pursue it further because I didn't have a use case where it showed a
> > significant difference.
>
> Thanks for posting!
>
> Based on past experiments, anything that requires an atomic op during spinlock
> release on x86 will be painful :/. I'm not sure there's a realistic way to
> avoid that with futexes though :(.

Do you happen to know if a plain xchg instruction counts as an atomic
for this? I haven't done atomics stuff in a while, so I might be
missing something, but at first glance I think using a plain xchg
would be enough for the releasing side.

-- 
Ants




Re: ReadRecentBuffer() doesn't scale well

2023-06-27 Thread Ants Aasma
On Tue, 27 Jun 2023 at 07:09, Andres Freund  wrote:
> On 2023-06-27 15:33:57 +1200, Thomas Munro wrote:
> > On Tue, Jun 27, 2023 at 2:05 PM Andres Freund  wrote:
> > > Unfortunately it scaled way worse at first. This is not an inherent 
> > > issue, but
> > > due to an implementation choice in ReadRecentBuffer().  Whereas the normal
> > > BufferAlloc() path uses PinBuffer(), ReadRecentBuffer() first does
> > > LockBufHdr(), checks if the buffer ID is the same and then uses
> > > PinBuffer_Locked().
> > >
> > > The problem with that is that PinBuffer() takes care to not hold the 
> > > buffer
> > > header spinlock, it uses compare_exchange to atomically acquire the pin, 
> > > while
> > > guaranteing nobody holds the lock.  When holding the buffer header 
> > > spinlock,
> > > there obviously is the risk of being scheduled out (or even just not have
> > > exclusive access to the cacheline).
> >
> > Yeah.  Aside from inherent nastiness of user-space spinlocks
>
> I've been wondering about making our backoff path use futexes, after some
> adaptive spinning.

If you want to experiment, here is a rebased version of something I
hacked up a couple of years back on the way to Fosdem Pgday. I didn't
pursue it further because I didn't have a use case where it showed a
significant difference.

--
Ants
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 327ac64f7c2..67a5e8a0246 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -92,6 +92,7 @@ s_lock_stuck(const char *file, int line, const char *func)
 int
 s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 {
+#ifndef HAS_FUTEX
 	SpinDelayStatus delayStatus;
 
	init_spin_delay(&delayStatus, file, line, func);
@@ -104,6 +105,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
	finish_spin_delay(&delayStatus);
 
 	return delayStatus.delays;
+#endif
+	elog(FATAL, "Should not be called");
 }
 
 #ifdef USE_DEFAULT_S_UNLOCK
@@ -230,6 +233,71 @@ update_spins_per_delay(int shared_spins_per_delay)
 	return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
 }
 
+#ifdef HAS_FUTEX
+#include <linux/futex.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+static int
+futex(volatile uint32 *uaddr, int futex_op, int val,
+	  const struct timespec *timeout, int *uaddr2, int val3)
+{
+	return syscall(SYS_futex, uaddr, futex_op, val,
+   timeout, uaddr, val3);
+}
+
+int
+futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func)
+{
+	int i, s;
+	/*
+	 * First lets wait for a bit without involving the kernel, it is quite likely
+	 * the lock holder is still running.
+	 **/
+	if (likely(current < 2))
+	{
+		uint32 expected;
+		for (i = 0; i < DEFAULT_SPINS_PER_DELAY; i++)
+		{
+			SPIN_DELAY();
+			expected = lock->value;
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 1))
+return i;
+		}
+
+		while (expected != 2 && !pg_atomic_compare_exchange_u32(lock, &expected, 2)) {
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 2))
+return i;
+		}
+	}
+
+	/* At this point lock value is 2 and we will get waken up */
+	while (true)
+	{
+		uint32 expected = 0;
+		s = futex(&(lock->value), FUTEX_WAIT, 2, NULL, NULL, 0);
+		if (s == -1 && errno != EAGAIN)
+			elog(FATAL, "Futex wait failed with error: %m");
+
+		/* Maybe someone else was waiting too, we will try to wake them up. */
+		if (pg_atomic_compare_exchange_u32(lock, &expected, 2))
+			break;
+
+	}
+
+	return i;
+}
+
+int futex_unlock(volatile slock_t *lock, uint32 current)
+{
+	lock->value = 0;
+	if (futex(&(lock->value), FUTEX_WAKE, 1, NULL, NULL, 0) == -1)
+		elog(FATAL, "Futex wake failed with error: %m");
+
+	return 0;
+}
+
+#endif /* HAS_FUTEX */
 
 /*/
 #if defined(S_LOCK_TEST)
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c9fa84cc43c..6351ec0804e 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -205,6 +205,52 @@ spin_delay(void)
 #ifdef __x86_64__		/* AMD Opteron, Intel EM64T */
 #define HAS_TEST_AND_SET
 
+#if defined(__linux__)
+#define HAS_FUTEX 1 	/* TODO: move to configure to check for old kernels */
+#endif
+
+#ifdef HAS_FUTEX
+
+#include "port/atomics.h"
+
+typedef pg_atomic_uint32 slock_t;
+
+#define S_LOCK(lock) \
+	do { \
+		uint32 expected = 0; \
+		if (unlikely(!pg_atomic_compare_exchange_u32((lock), &expected, 1))) \
+			futex_lock((lock), expected, __FILE__, __LINE__, __func__); \
+	} while (0)
+
+
+#define S_UNLOCK(lock) \
+	do { \
+		uint32 actual = pg_atomic_exchange_u32((lock), 0); \
+		if (unlikely(actual == 2)) \
+			futex_unlock((lock), actual); \
+	} while (0)
+extern int futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func);
+extern int futex_unlock(volatile slock_t *lock, uint32 current);
+
+/* TAS only needed for regress */
+#define TAS(lock) tas(lock)
+
+static __inline__ int
+tas(volatile slock_t 

Re: Do we want a hashset type?

2023-06-02 Thread Ants Aasma
On Wed, 31 May 2023 at 18:40, Joel Jacobson  wrote:
>
> On Wed, May 31, 2023, at 16:53, Tomas Vondra wrote:
> > I think this needs a better explanation - what exactly is a hashset in
> > this context? Something like an array with a hash for faster lookup of
> > unique elements, or what?
>
> In this context, by "hashset" I am indeed referring to a data structure 
> similar
> to an array, where each element would be unique, and lookups would be faster
> than arrays for larger number of elements due to hash-based lookups.
>
> This data structure would store identifiers (IDs) of the nodes, not the 
> complete
> nodes themselves.

Have you looked at roaring bitmaps? There is a pg_roaringbitmap
extension [1] already available that offers very fast unions,
intersections and membership tests over integer sets. I used it to get
some pretty impressive performance results for faceted search on
large document sets. [2]

Depending on the graph fan-outs and operations it might make sense in
the graph use case. For small sets it's probably not too different
from the intarray extension in contrib. But for finding intersections
over large sets (i.e. a join) it's very-very fast. If the workload is
traversal heavy it might make sense to even cache materialized
transitive closures up to some depth (a friend-of-a-friend list).

Roaring bitmaps only support int4 right now, but that is easily
fixable. And they need a relatively dense ID space to get the
performance boost, which seems essential to the approach. The latter
issue means that it can't be easily dropped into GIN or B-tree indexes
for ctid storage.

[1] https://github.com/ChenHuajun/pg_roaringbitmap
[2] https://github.com/cybertec-postgresql/pgfaceting
-- 
Ants Aasma
www.cybertec-postgresql.com




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-21 Thread Ants Aasma
On Mon, 20 Mar 2023 at 00:59, Melanie Plageman
 wrote:
>
> On Wed, Mar 15, 2023 at 6:46 AM Ants Aasma  wrote:
> >
> > On Wed, 15 Mar 2023 at 02:29, Melanie Plageman
> >  wrote:
> > > As for routine vacuuming and the other buffer access strategies, I think
> > > there is an argument for configurability based on operator knowledge --
> > > perhaps your workload will use the data you are COPYing as soon as the
> > > COPY finishes, so you might as well disable a buffer access strategy or
> > > use a larger fraction of shared buffers. Also, the ring sizes were
> > > selected sixteen years ago and average server memory and data set sizes
> > > have changed.
> >
> > To be clear I'm not at all arguing against configurability. I was
> > thinking that dynamic use could make the configuration simpler by self
> > tuning to use no more buffers than is useful.
>
> Yes, but I am struggling with how we would define "useful".

For copy and vacuum, the only reason I can see for keeping visited
buffers around is to avoid flushing WAL, or at least to do it in larger
batches. Once the ring is big enough that WAL doesn't need to be
flushed on eviction, making it bigger only wastes space that could be
used by something that is not going to be evicted soon.

> > > StrategyRejectBuffer() will allow bulkreads to, as you say, use more
> > > buffers than the original ring size, since it allows them to kick
> > > dirty buffers out of the ring and claim new shared buffers.
> > >
> > > Bulkwrites and vacuums, however, will inevitably dirty buffers and
> > > require flushing the buffer (and thus flushing the associated WAL) when
> > > reusing them. Bulkwrites and vacuum do not kick dirtied buffers out of
> > > the ring, since dirtying buffers is their common case. A dynamic
> > > resizing like the one you suggest would likely devolve to vacuum and
> > > bulkwrite strategies always using the max size.
> >
> > I think it should self stabilize around the point where the WAL is
> > either flushed by other commit activity, WAL writer or WAL buffers
> > filling up. Writing out their own dirtied buffers will still happen,
> > just the associated WAL flushes will be in larger chunks and possibly
> > done by other processes.
>
> They will have to write out any WAL associated with modifications to the
> dirty buffer before flushing it, so I'm not sure I understand how this
> would work.

By the time the dirty buffer needs eviction the WAL associated with it
can already be written out by concurrent commits, WAL writer or by WAL
buffers filling up. The bigger the ring is, the higher the chance that
one of these will happen before we loop around.

> > > As for decreasing the ring size, buffers are only "added" to the ring
> > > lazily and, technically, as it is now, buffers which have been added
> > > added to the ring can always be reclaimed by the clocksweep (as long as
> > > they are not pinned). The buffer access strategy is more of a
> > > self-imposed restriction than it is a reservation. Since the ring is
> > > small and the buffers are being frequently reused, odds are the usage
> > > count will be 1 and we will be the one who set it to 1, but there is no
> > > guarantee. If, when attempting to reuse the buffer, its usage count is
> > > > 1 (or it is pinned), we also will kick it out of the ring and go look
> > > for a replacement buffer.
> >
> > Right, but while the buffer is actively used by the ring it is
> > unlikely that clocksweep will find it at usage 0 as the ring buffer
> > should cycle more often than the clocksweep. Whereas if the ring stops
> > using a buffer, clocksweep will eventually come and reclaim it. And if
> > the ring shrinking decision turns out to be wrong before the
> > clocksweep gets around to reusing it, we can bring the same buffer
> > back into the ring.
>
> I can see what you mean about excluding a buffer from the ring being a
> more effective way of allowing it to be reclaimed. However, I'm not sure
> I understand the use case. If the operation, say vacuum, is actively
> using the buffer and keeping its usage count at one, then what would be
> the criteria for it to decide to stop using it?

The criteria for reducing ring size could be that we have cycled the
ring buffer n times without having to do any WAL flushes.

> Also, if vacuum used the buffer once and then didn't reuse it but, for
> some reason, the vacuum isn't over, it isn't any different at that point
> than some other buffer with a usage count of one. It isn't any harder
> for it to be reclaimed by the clocksweep

Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-15 Thread Ants Aasma
On Wed, 15 Mar 2023 at 02:57, Melanie Plageman
 wrote:
> > > Subject: [PATCH v3 3/3] add vacuum option to specify ring_size and guc
> >
> > >  #define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)&(var
> > > +#define bufsize_limit_to_nbuffers(bufsize) (bufsize * 1024 / BLCKSZ)
> >
> > Macros are normally be capitalized
>
> Yes, there doesn't seem to be a great amount of consistency around
> this... See pgstat.c read_chunk_s and bufmgr.c BufHdrGetBlock and
> friends. Though there are probably more capitalized than not. Since it
> does a bit of math and returns a value, I wanted to convey that it was
> more like a function. Also, since the name was long, I thought all-caps
> would be hard to read. However, if you or others feel strongly, I am
> attached neither to the capitalization nor to the name at all (what do
> you think of the name?).

A static inline function seems like a less surprising and more type
safe solution for this.
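
Something like this, keeping the same math as the macro (assuming the size
is given in kB, as in the patch):

static inline int
bufsize_limit_to_nbuffers(int bufsize)
{
    return bufsize * 1024 / BLCKSZ;
}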

-- 
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-15 Thread Ants Aasma
On Wed, 15 Mar 2023 at 02:29, Melanie Plageman
 wrote:
> As for routine vacuuming and the other buffer access strategies, I think
> there is an argument for configurability based on operator knowledge --
> perhaps your workload will use the data you are COPYing as soon as the
> COPY finishes, so you might as well disable a buffer access strategy or
> use a larger fraction of shared buffers. Also, the ring sizes were
> selected sixteen years ago and average server memory and data set sizes
> have changed.

To be clear I'm not at all arguing against configurability. I was
thinking that dynamic use could make the configuration simpler by self
tuning to use no more buffers than is useful.

> StrategyRejectBuffer() will allow bulkreads to, as you say, use more
> buffers than the original ring size, since it allows them to kick
> dirty buffers out of the ring and claim new shared buffers.
>
> Bulkwrites and vacuums, however, will inevitably dirty buffers and
> require flushing the buffer (and thus flushing the associated WAL) when
> reusing them. Bulkwrites and vacuum do not kick dirtied buffers out of
> the ring, since dirtying buffers is their common case. A dynamic
> resizing like the one you suggest would likely devolve to vacuum and
> bulkwrite strategies always using the max size.

I think it should self stabilize around the point where the WAL is
either flushed by other commit activity, WAL writer or WAL buffers
filling up. Writing out their own dirtied buffers will still happen,
just the associated WAL flushes will be in larger chunks and possibly
done by other processes.

> As for decreasing the ring size, buffers are only "added" to the ring
> lazily and, technically, as it is now, buffers which have been added
> added to the ring can always be reclaimed by the clocksweep (as long as
> they are not pinned). The buffer access strategy is more of a
> self-imposed restriction than it is a reservation. Since the ring is
> small and the buffers are being frequently reused, odds are the usage
> count will be 1 and we will be the one who set it to 1, but there is no
> guarantee. If, when attempting to reuse the buffer, its usage count is
> > 1 (or it is pinned), we also will kick it out of the ring and go look
> for a replacement buffer.

Right, but while the buffer is actively used by the ring it is
unlikely that clocksweep will find it at usage 0 as the ring buffer
should cycle more often than the clocksweep. Whereas if the ring stops
using a buffer, clocksweep will eventually come and reclaim it. And if
the ring shrinking decision turns out to be wrong before the
clocksweep gets around to reusing it, we can bring the same buffer
back into the ring.

> I do think that it is a bit unreasonable to expect users to know how
> large they would like to make their buffer access strategy ring. What we
> want is some way of balancing different kinds of workloads and
> maintenance tasks reasonably. If your database has no activity because
> it is the middle of the night or it was shutdown because of transaction
> id wraparound, there is no reason why vacuum should limit the number of
> buffers it uses. I'm sure there are many other such examples.

Ideally yes, though I am not hopeful of finding a solution that does
this any time soon. Just to take your example: if a nightly
maintenance job wipes out the shared buffer contents to slightly
optimize its non-time-critical work, and then causes the morning's
user-visible load to have big latency spikes due to cache misses,
that's not a good tradeoff either.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Re: Option to not use ringbuffer in VACUUM, using it in failsafe mode

2023-03-13 Thread Ants Aasma
On Sat, 11 Mar 2023 at 16:55, Melanie Plageman
 wrote:
>
> > On Tue, Feb 28, 2023 at 3:16 AM Bharath Rupireddy
> >  wrote:
> >
> > > On Thu, Jan 12, 2023 at 6:06 AM Andres Freund  wrote:
> > > >
> > > > On 2023-01-11 17:26:19 -0700, David G. Johnston wrote:
> > > > > Should we just add "ring_buffers" to the existing "shared_buffers" and
> > > > > "temp_buffers" settings?
> > > >
> > > > The different types of ring buffers have different sizes, for good 
> > > > reasons. So
> > > > I don't see that working well. I also think it'd be more often useful to
> > > > control this on a statement basis - if you have a parallel import tool 
> > > > that
> > > > starts NCPU COPYs you'd want a smaller buffer than a single threaded 
> > > > COPY. Of
> > > > course each session can change the ring buffer settings, but still.
> > >
> > > How about having GUCs for each ring buffer (bulk_read_ring_buffers,
> > > bulk_write_ring_buffers, vacuum_ring_buffers - ah, 3 more new GUCs)?
> > > These options can help especially when statement level controls aren't
> > > easy to add (COPY, CREATE TABLE AS/CTAS, REFRESH MAT VIEW/RMV)? If
> > > needed users can also set them at the system level. For instance, one
> > > can set bulk_write_ring_buffers to other than 16MB or -1 to disable
> > > the ring buffer to use shared_buffers and run a bunch of bulk write
> > > queries.
>
> In attached v3, I've changed the name of the guc from buffer_usage_limit
> to vacuum_buffer_usage_limit, since it is only used for vacuum and
> autovacuum.

Sorry for arriving late to this thread, but what about sizing the ring
dynamically? From what I gather the primary motivation for larger ring
size is avoiding WAL flushes due to dirty buffer writes. We already
catch that event with StrategyRejectBuffer(). So maybe a dynamic
sizing algorithm could be applied to the ringbuffer. Make the buffers
array in strategy capable of holding up to the limit of buffers, but
set ring size conservatively. If we have to flush WAL, double the ring
size (up to the limit). If we loop around the ring without flushing,
decrease the ring size by a small amount to let clock sweep reclaim
them for use by other backends.
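
To illustrate the idea, here is a rough sketch of what such an adaptive
scheme could look like. The names ring_size, max_ring_size and
MIN_RING_SIZE are assumptions made up for this example, not existing
fields of BufferAccessStrategyData:

/*
 * Called when StrategyRejectBuffer() reports a dirty buffer that would
 * force a WAL flush: grow the ring aggressively, up to the limit.
 */
static void
ring_grow_on_wal_flush(BufferAccessStrategy strategy)
{
	strategy->ring_size = Min(strategy->ring_size * 2,
							  strategy->max_ring_size);
}

/*
 * Called after completing a full loop around the ring without any WAL
 * flush: shrink slowly so the clock sweep can reclaim the excess
 * buffers for other backends.
 */
static void
ring_shrink_on_quiet_loop(BufferAccessStrategy strategy)
{
	strategy->ring_size = Max(strategy->ring_size - 1, MIN_RING_SIZE);
}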

-- 
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Re: Standby recovers records from wrong timeline

2022-10-21 Thread Ants Aasma
On Fri, 21 Oct 2022 at 11:44, Kyotaro Horiguchi  wrote:
>
> At Fri, 21 Oct 2022 17:12:45 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > latest works. It dones't consider the case of explict target timlines
> > so it's just a PoC.  (So this doesn't work if recovery_target_timeline
> > is set to 2 for the "standby" in the repro.)
>
> So, finally I noticed that the function XLogFileReadAnyTLI is not
> needed at all if we are going this direction.
>
> Regardless of recvoery_target_timeline is latest or any explicit
> imeline id or checkpoint timeline, what we can do to reach the target
> timline is just to follow the history file's direction.
>
> If segments are partly gone while reading on a timeline, a segment on
> the older timelines is just a crap since it should be incompatible.

I came to the same conclusion. I adjusted XLogFileReadAnyTLI to not use any
timeline that ends within the segment (attached patch). At this point the
name of the function becomes really wrong, XLogFileReadCorrectTLI or
something to that effect would be much more descriptive and the code could
be simplified.

However, I'm not particularly happy with this approach, as it will refuse
to use valid WAL when the target timeline's segment is not available.
Consider the scenario of a cascading failure: node A has a hard failure,
then node B promotes, archives the history file, but doesn't see enough
traffic to archive a full segment before failing itself. While this is
happening we restore node A from backup and start it up as a standby.

If node B fails before node A has a chance to connect, then either we
continue recovery on the wrong timeline (current behavior) or we will
not try to recover the first portion of the archived WAL file (with the patch).

So I think the correct approach would still be to have ReadRecord() or
ApplyWalRecord() determine that switching timelines is needed.

-- 
Ants Aasma
www.cybertec-postgresql.com
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea6..73bde98b920 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -4171,6 +4171,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 	{
 		TimeLineHistoryEntry *hent = (TimeLineHistoryEntry *) lfirst(cell);
 		TimeLineID	tli = hent->tli;
+		XLogSegNo	beginseg = 0;
 
 		if (tli < curFileTLI)
 			break;/* don't bother looking at too-old TLIs */
@@ -4181,7 +4182,6 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 		 */
 		if (hent->begin != InvalidXLogRecPtr)
 		{
-			XLogSegNo	beginseg = 0;
 
 			XLByteToSeg(hent->begin, beginseg, wal_segment_size);
 
@@ -4223,6 +4223,14 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 return fd;
 			}
 		}
+
+		/*
+		 * For segments containing known timeline switches, only consider the
+		 * last timeline, as redo otherwise doesn't know when to switch
+		 * timelines.
+		 */
+		if (segno == beginseg && beginseg > 0)
+			break;
 	}
 
 	/* Couldn't find it.  For simplicity, complain about front timeline */


Re: Standby recovers records from wrong timeline

2022-10-20 Thread Ants Aasma
On Thu, 20 Oct 2022 at 11:30, Kyotaro Horiguchi  wrote:
>
> primary_restored did a time-travel to past a bit because of the
> recovery_target=immediate. In other words, the primary_restored and
> the replica diverge. I don't think it is legit to connect a diverged
> standby to a primary.

primary_restored did time-travel to the past; as we're doing PITR on the
primary, that's the expected behavior. However, the replica has not diverged,
it's a copy of the exact same basebackup. The use case is restoring a
cluster from backup using PITR and using the same backup to create a
standby. Currently this breaks when the primary has not yet archived any
segments.

> So, about the behavior in doubt, it is the correct behavior to
> seemingly ignore the history file in the archive. Recovery assumes
> that the first half of the first segment of the new timeline is the
> same with the same segment of the old timeline (.partial) so it is
> legit to read the  file til the end and that causes the
> replica goes beyond the divergence point.

What is happening is that primary_restored has a timeline switch at
tli 2, lsn 0/2000100, and the next insert record starts in the same
segment. The replica starts on the same backup on timeline 1, tries to
find tli 2 seg 2, which is not archived yet, so it falls back to tli 1 seg 2
and replays tli 1 seg 2, continuing on to tli 1 seg 3, then connects to the
primary and starts applying WAL from tli 2 seg 4. To me that seems
completely broken.

> As you know, when new primary starts a diverged history, the
> recommended way is to blow (or stash) away the archive, then take a
> new backup from the running primary.

My understanding is that backup archives are supposed to remain valid
even after PITR or equivalently a lagging standby promoting.

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com




Standby recovers records from wrong timeline

2022-10-19 Thread Ants Aasma
When a standby is recovering to a timeline that doesn't have any segments
archived yet, it will just blindly blow past the timeline switch point and
keep on recovering on the old timeline. Typically that will eventually
result in an error about an incorrect prev-link, but under unhappy
circumstances it can result in the standby silently having different contents.

Attached is a shell script that reproduces the issue. Goes back to at least
v12, probably longer.

I think we should be keeping track of where the current replay timeline is
going to end and not read any records past it on the old timeline. Maybe
while at it, we should also track that the next record should be a
checkpoint record for the timeline switch and error out if not. Thoughts?

-- 

Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


recoverytest.sh
Description: application/shellscript


Re: storing an explicit nonce

2021-10-13 Thread Ants Aasma
On Wed, 13 Oct 2021 at 02:20, Bruce Momjian  wrote:

> On Wed, Oct 13, 2021 at 12:48:51AM +0300, Ants Aasma wrote:
> > On Wed, 13 Oct 2021 at 00:25, Bruce Momjian  wrote:
> >
> > On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > > Page encrypting to all zeros is for all practical purposes
> impossible to
> > hit.
> > > Basically an attacker would have to be able to arbitrarily set the
> whole
> > > contents of the page and they would then achieve that this page
> gets
> > ignored.
> >
> > Uh, how do we know that valid data can't produce an encrypted
> all-zero
> > page?
> >
> >
> > Because the chances of that happening by accident are equivalent to
> making a
> > series of commits to postgres and ending up with the same git commit
> hash 400
> > times in a row.
>
> Yes, 256^8192 is 1e+19728, but why not just assume a page LSN=0 is an
> empty page, and if not, an error?  Seems easier than checking if each
> page contains all zeros every time.
>

We already check it anyway, see PageIsVerifiedExtended().

-- 

Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


Re: storing an explicit nonce

2021-10-12 Thread Ants Aasma
On Wed, 13 Oct 2021 at 00:25, Bruce Momjian  wrote:

> On Tue, Oct 12, 2021 at 11:21:28PM +0300, Ants Aasma wrote:
> > On Tue, 12 Oct 2021 at 16:14, Bruce Momjian  wrote:
> >
> > Well, how do you detect an all-zero page vs a page that encrypted to
> all
> > zeros?
> >
> > Page encrypting to all zeros is for all practical purposes impossible to
> hit.
> > Basically an attacker would have to be able to arbitrarily set the whole
> > contents of the page and they would then achieve that this page gets
> ignored.
>
> Uh, how do we know that valid data can't produce an encrypted all-zero
> page?
>

Because the chances of that happening by accident are equivalent to making
a series of commits to postgres and ending up with the same git commit hash
400 times in a row.

--

Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


Re: storing an explicit nonce

2021-10-12 Thread Ants Aasma
On Tue, 12 Oct 2021 at 16:14, Bruce Momjian  wrote:

> Well, how do you detect an all-zero page vs a page that encrypted to all
> zeros?
>
A page encrypting to all zeros is for all practical purposes impossible to
hit. Basically, an attacker would have to be able to arbitrarily set the
whole contents of the page, and all they would achieve is that this page gets
ignored.

-- 

Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


Re: storing an explicit nonce

2021-10-11 Thread Ants Aasma
On Mon, 11 Oct 2021 at 22:15, Bruce Momjian  wrote:

> > Yes, that's the direction that I was thinking also and specifically with
> > XTS as the encryption algorithm to allow us to exclude the LSN but keep
> > everything else, and to address the concern around the nonce/tweak/etc
> > being the same sometimes across multiple writes.  Another thing to
> > consider is if we want to encrypt zero'd page.  There was a point
> > brought up that if we do then we are encrypting a fair bit of very
> > predictable bytes and that's not great (though there's a fair bit about
> > our pages that someone could quite possibly predict anyway based on
> > table structures and such...).  I would think that if it's easy enough
> > to not encrypt zero'd pages that we should avoid doing so.  Don't recall
> > offhand which way zero'd pages were being handled already but thought it
> > made sense to mention that as part of this discussion.
>
> Yeah, I wanted to mention that.  I don't see any security difference
> between fully-zero pages, pages with headers and no tuples, and pages
> with headers and only a few tuples.  If any of those are insecure, they
> all are.  Therefore, I don't see any reason to treat them differently.
>

We had to special-case zero pages and not encrypt them because, as far as I
can tell, there is no atomic way to extend a file and initialize it to
Enc(zero) in the same step.

-- 

Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


Re: storing an explicit nonce

2021-10-07 Thread Ants Aasma
On Thu, 7 Oct 2021 at 21:52, Stephen Frost  wrote:

> With XTS this isn't actually the case though, is it..?  Part of the
> point of XTS is that the last block doesn't have to be a full 16 bytes.
> What you're saying is true for XEX, but that's also why XEX isn't used
> for FDE in a lot of cases, because disk sectors aren't typically
> divisible by 16.
>
> https://en.wikipedia.org/wiki/Disk_encryption_theory
>
> Assuming that's correct, and I don't see any reason to doubt it, then
> perhaps it would make sense to have the LSN be unencrypted and include
> it in the tweak as that would limit the risk from re-use of the same
> tweak over time.
>

Right, my thought was to leave the first 8 bytes of pages, the LSN,
unencrypted and include the value in the tweak. Just tested that OpenSSL
aes-256-xts handles non multiple-of-16 messages just fine.
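
For reference, the kind of standalone check I mean looks roughly like the
following. This is an illustrative test harness only, not code from any
patch. OpenSSL's aes-256-xts takes a 64-byte key (two 256-bit halves) and a
16-byte tweak, and the whole message has to be passed in a single
EVP_EncryptUpdate() call; ciphertext stealing takes care of the trailing
partial block:

#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned char key[64];
	unsigned char tweak[16] = {0};	/* LSN + block number would go here */
	unsigned char msg[24] = "abcdefghijklmnopqrstuvw";	/* 24 bytes, not a multiple of 16 */
	unsigned char out[sizeof(msg)];
	int			len = 0;
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();

	/* use distinct key halves; some OpenSSL builds reject identical XTS halves */
	memset(key, 0x11, 32);
	memset(key + 32, 0x22, 32);

	if (!EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) ||
		!EVP_EncryptUpdate(ctx, out, &len, msg, sizeof(msg)))
	{
		fprintf(stderr, "XTS encryption failed\n");
		return 1;
	}
	printf("encrypted %d bytes\n", len);	/* prints 24 */
	EVP_CIPHER_CTX_free(ctx);
	return 0;
}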

-- 

Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


Re: storing an explicit nonce

2021-10-07 Thread Ants Aasma
On Wed, 6 Oct 2021 at 23:08, Bruce Momjian  wrote:

> Yes, I would prefer we don't use the LSN.  I only mentioned it since
> Ants Aasma mentioned LSN use above.
>

Is there a particular reason why you would prefer not to use LSN? I
suggested it because in my view having a variable tweak is still better
than not having it even if we deem the risks of XTS tweak reuse not
important for our use case. The comment was made under the assumption that
requiring wal_log_hints for encryption is acceptable.

-- 

Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


Re: storing an explicit nonce

2021-09-28 Thread Ants Aasma
On Mon, 27 Sept 2021 at 23:34, Bruce Momjian  wrote:

> On Sun, Sep  5, 2021 at 10:51:42PM +0800, Sasasu wrote:
> > Hi, community,
> >
> > It looks like we are still considering AES-CBC, AES-XTS, and
> AES-GCM(-SIV).
> > I want to say something that we don't think about.
> >
> > For AES-CBC, the IV should be not predictable. I think LSN or HASH(LSN,
> > block number or something) is predictable. There are many CVE related to
> > AES-CBC with a predictable IV.
>
> The LSN would change every time the page is modified, so while the LSN
> could be predicted, it would not be reused.  However, there is currently
> no work being done on page-level encryption of Postgres.
>

We are still working on our TDE patch. Right now the focus is on
refactoring temporary file access to make the TDE patch itself smaller.
Reconsidering the encryption mode choices, given the concerns expressed, is next.
Currently a viable option seems to be AES-XTS with the LSN added into the IV.
XTS doesn't have an issue with a predictable IV and isn't totally broken in
case of IV reuse.

-- 

Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


Re: track_planning causing performance regression

2020-06-30 Thread Ants Aasma
On Tue, 30 Jun 2020 at 08:43, Fujii Masao 
wrote:

> > The problem looks to be that spinlocks are terrible with overloaded
> CPU and a contended spinlock. A process holding the spinlock might easily
> get scheduled out leading to excessive spinning by everybody. I think a
> simple thing to try would be to replace the spinlock with LWLock.
>
> Yes. Attached is the POC patch that replaces per-counter spinlock with
> LWLock.
>

Great. I think this is the one that should get considered for testing.


> > I did a prototype patch that replaces spinlocks with futexes, but was
> not able to find a workload where it mattered.
>
> I'm not familiar with futex, but could you tell me why you used futex
> instead
> of LWLock that we already have? Is futex portable?
>

Futex is a Linux kernel facility that allows building a lock whose
uncontended case works fully in user space, almost exactly like a spinlock,
while falling back to syscalls that wait for a wakeup in case of contention.
It's not portable, but probably something similar could be implemented for
other operating systems. I did not pursue this further because it became
apparent that every performance-critical spinlock had already been removed.

To be clear, I am not advocating for this patch to get included. I just had
the patch immediately available and it could have confirmed that using a
better lock fixes things.
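
For anyone curious, the general shape of such a lock looks roughly like the
minimal sketch below (in the style of Drepper's "Futexes Are Tricky" mutex,
not the attached patch): the uncontended path is a single compare-and-swap
in user space, and only contention falls through to FUTEX_WAIT/FUTEX_WAKE.

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked, 2 = locked with possible waiters */
typedef struct { atomic_uint state; } futex_mutex;

static void
futex_mutex_lock(futex_mutex *m)
{
	unsigned	c = 0;

	/* Fast path: an uncontended acquire never enters the kernel. */
	if (atomic_compare_exchange_strong(&m->state, &c, 1))
		return;

	/* Slow path: mark the lock contended and sleep until woken. */
	if (c != 2)
		c = atomic_exchange(&m->state, 2);
	while (c != 0)
	{
		syscall(SYS_futex, &m->state, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
		c = atomic_exchange(&m->state, 2);
	}
}

static void
futex_mutex_unlock(futex_mutex *m)
{
	/* Only issue a syscall if somebody may actually be sleeping. */
	if (atomic_exchange(&m->state, 0) == 2)
		syscall(SYS_futex, &m->state, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}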

-- 
Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com


Re: track_planning causing performance regression

2020-06-29 Thread Ants Aasma
On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud  wrote:

> On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
>  wrote:
> >
> > On 2020/06/29 16:05, Julien Rouhaud wrote:
> > > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins 
> wrote:
> > >>
> > >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1
> shows
> >
> > Thanks for the benchmark!
> >
> >
> > >> ~45% performance drop [2] at high DB connection counts (when compared
> with v12.3)
> >
> > That's bad :(
> >
> >
> > >>
> > >> Disabling pg_stat_statements.track_planning (which is 'On' by default)
> > >> brings the TPS numbers up to v12.3 levels.
> > >>
> > >> The inflection point (in this test-case) is 128 Connections, beyond
> which the
> > >> TPS numbers are consistently low. Looking at the mailing list [1],
> this issue
> > >> didn't surface earlier possibly since the regression is trivial at
> low connection counts.
> > >>
> > >> It would be great if this could be optimized further, or
> track_planning
> > >> disabled (by default) so as to not trip users upgrading from v12 with
> pg_stat_statement
> > >> enabled (but otherwise not particularly interested in track_planning).
> >
> > Your benchmark result seems to suggest that the cause of the problem is
> > the contention of per-query spinlock in pgss_store(). Right?
> > This lock contention is likely to happen when multiple sessions run
> > the same queries.
> >
> > One idea to reduce that lock contention is to separate per-query spinlock
> > into two; one is for planning, and the other is for execution.
> pgss_store()
> > determines which lock to use based on the given "kind" argument.
> > To make this idea work, also every pgss counters like shared_blks_hit
> > need to be separated into two, i.e., for planning and execution.
>
> This can probably remove some overhead, but won't it eventually hit
> the same issue when multiple connections try to plan the same query,
> given the number of different queries and very low execution runtime?
> It'll also quite increase the shared memory consumption.
>
> I'm wondering if we could instead use atomics to store the counters.
> The only downside is that we won't guarantee per-row consistency
> anymore, which may be problematic.
>


The problem looks to be that spinlocks behave terribly when the CPU is
overloaded and a spinlock is contended. A process holding the spinlock might
easily get scheduled out, leading to excessive spinning by everybody else.
I think a simple thing to try would be to replace the spinlock with an LWLock.

I did a prototype patch that replaces spinlocks with futexes, but was not
able to find a workload where it mattered. We have done a great job at
eliminating spinlocks from contended code paths. Robins, perhaps you could
try it to see if it reduces the regression you are observing. The patch is
against v13 stable branch.

-- 
Ants Aasma
Senior Database Engineerwww.cybertec-postgresql.com
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 7fac0703419..56d45b7cfce 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -90,6 +90,7 @@ s_lock_stuck(const char *file, int line, const char *func)
 int
 s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 {
+#ifndef HAS_FUTEX
 	SpinDelayStatus delayStatus;
 
	init_spin_delay(&delayStatus, file, line, func);
@@ -102,6 +103,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
	finish_spin_delay(&delayStatus);
 
 	return delayStatus.delays;
+#endif
+	elog(FATAL, "Should not be called");
 }
 
 #ifdef USE_DEFAULT_S_UNLOCK
@@ -218,6 +221,71 @@ update_spins_per_delay(int shared_spins_per_delay)
 	return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
 }
 
+#ifdef HAS_FUTEX
+#include <linux/futex.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+static int
+futex(volatile uint32 *uaddr, int futex_op, int val,
+	  const struct timespec *timeout, int *uaddr2, int val3)
+{
+	return syscall(SYS_futex, uaddr, futex_op, val,
+   timeout, uaddr, val3);
+}
+
+int
+futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func)
+{
+	int i, s;
+	/*
+	 * First lets wait for a bit without involving the kernel, it is quite likely
+	 * the lock holder is still running.
+	 **/
+	if (likely(current < 2))
+	{
+		uint32 expected;
+		for (i = 0; i < DEFAULT_SPINS_PER_DELAY; i++)
+		{
+			SPIN_DELAY();
+			expected = lock->value;
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 1))
+return i;
+		}
+
+		while (expected != 2 && !pg_atomic_compare_exchange_u32(lock, &expected, 2)) {
+			if (

Re: what can go in root.crt ?

2020-06-03 Thread Ants Aasma
On Tue, 2 Jun 2020 at 20:14, Bruce Momjian  wrote:

> The server certificate should be issued by a certificate authority root
> outside of your organization only if you want people outside of your
> organization to trust your server certificate, but you are then asking
> for the client to only trust an intermediate inside your organization.
> The big question is why bother having the server certificate chain to a
> root certificat you don't trust when you have no intention of having
> clients outside of your organization trust the server certificate.
> Postgres could be made to handle such cases, but is is really a valid
> configuration we should support?
>

I think the "why" of the org cert not being the root was already made
clear: that is the company policy. I don't think postgres should take a
stance on whether the certificate designated as the root of trust is
self-signed or claims to get its power from somewhere else.

It's pretty easy to conceive of certificate management procedures that make
use of this chain to implement certificate replacement securely. For
example, one might trust the global issuer to verify that a CSR is coming
from the O= value it claims to come from, in order to automate replacement
of intermediate certificates, while not trusting that every other sub-CA
signed by the root and all of their sub-sub-CAs are completely honest and
secure.

Regards,
Ants Aasma


Re: spin_delay() for ARM

2020-04-17 Thread Ants Aasma
On Thu, 16 Apr 2020 at 10:33, Pavel Stehule  wrote:
> what I know, pgbench cannot be used for testing spinlocks problems.
>
> Maybe you can see this issue when a) use higher number clients - hundreds, 
> thousands. Decrease share memory, so there will be press on related spin lock.

There really aren't many spinlocks left that could be tickled by a
normal workload. I looked for a way to trigger spinlock contention
when I prototyped a patch to replace spinlocks with futexes. The only
one that I could figure out a way to make contended was the lock
protecting parallel btree scan. A highly parallel index only scan on a
fully cached index should create at least some spinlock contention.

Regards,
Ants Aasma




Re: Parallel copy

2020-04-15 Thread Ants Aasma
On Mon, 13 Apr 2020 at 23:16, Andres Freund  wrote:
> > Still, if the reader does the splitting, then you don't need as much
> > IPC, right? The shared memory data structure is just a ring of bytes,
> > and whoever reads from it is responsible for the rest.
>
> I don't think so. If only one process does the splitting, the
> exclusively locked section is just popping off a bunch of offsets of the
> ring. And that could fairly easily be done with atomic ops (since what
> we need is basically a single producer multiple consumer queue, which
> can be done lock free fairly easily ). Whereas in the case of each
> process doing the splitting, the exclusively locked part is splitting
> along lines - which takes considerably longer than just popping off a
> few offsets.

I see the benefit of having one process responsible for splitting as
being able to run ahead of the workers to queue up work when many of
them need new data at the same time. I don't think the locking
benefits of a ring are important in this case. At the current rather
conservative chunk sizes we are looking at ~100k chunks per second at
best, so normal locking should be perfectly adequate. And the chunk size
can easily be increased. I see the main value in it being simple.

But there is a point that having a layer of indirection instead of a
linear buffer allows for some workers to fall behind. Either because
the kernel scheduled them out for a time slice, or they need to do I/O
or because inserting some tuple hit an unique conflict and needs to
wait for a tx to complete or abort to resolve. With a ring buffer
reading has to wait on the slowest worker reading its chunk. Having
workers copy the data to a local buffer as the first step would reduce
the probability of hitting any issues. But still, at GB/s rates,
hiding a 10ms timeslice of delay would need 10's of megabytes of
buffer.

FWIW. I think just increasing the buffer is good enough - the CPUs
processing this workload are likely to have tens to hundreds of
megabytes of cache on board.




Re: Parallel copy

2020-04-15 Thread Ants Aasma
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh  wrote:
> 1. Each worker scans a distinct fixed sized chunk of the CSV file and
> collects the following three stats from the chunk:
> a) number of quotes
> b) position of the first new line after even number of quotes
> c) position of the first new line after odd number of quotes
> 2. Once stats from all the chunks are collected, the leader identifies
> the adjusted chunk boundaries by iterating over the stats linearly:
> - For the k-th chunk, the leader adds the number of quotes in k-1 chunks.
> - If the number is even, then the k-th chunk does not start in the
> middle of a quoted field, and the first newline after an even number
> of quotes (the second collected information) is the first record
> delimiter in this chunk.
> - Otherwise, if the number is odd, the first newline after an odd
> number of quotes (the third collected information) is the first record
> delimiter.
> - The end position of the adjusted chunk is obtained based on the
> starting position of the next adjusted chunk.

The trouble is that, at least with current coding, the number of
quotes in a chunk can depend on whether the chunk started in a quote
or not. That's because escape characters only count inside quotes. See
for example the following csv:

foo,\"bar
baz",\"xyz"

This currently parses as one line and the number of parsed quotes
doesn't change if you add a quote in front.

But the general approach of doing the tokenization in parallel and
then a serial pass over the tokenization would still work. The quote
counting and newline finding just has to be done for both the
starting-in-quote and the not-starting-in-quote case.
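
A per-chunk summary for that kind of speculative scan might look roughly
like this (purely illustrative, not from any posted patch):

/*
 * Each parallel worker fills in both variants for its chunk; the serial
 * pass later picks variant[0] or variant[1] once the true quote state at
 * the chunk boundary is known from the preceding chunk.
 */
typedef struct ChunkScanResult
{
	struct
	{
		bool		ends_inside_quote;	/* quote state at end of chunk */
		size_t		first_record_end;	/* offset of first unquoted newline */
		size_t		last_record_end;	/* offset of last unquoted newline */
	}			variant[2];		/* [0]: chunk starts outside quotes, [1]: inside */
} ChunkScanResult;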

Using phases doesn't look like the correct approach - the tokenization
can be prepared just in time for the serial pass and processing the
chunk can proceed immediately after. This could all be done by having
the data in a single ringbuffer with a processing pipeline where one
process does the reading, then workers grab tokenization chunks as
they become available, then one process handles determining the chunk
boundaries, after which the chunks are processed.

But I still don't think this is something to worry about for the first
version. Just a better line splitting algorithm should go a looong way
in feeding a large number of workers, even when inserting to an
unindexed unlogged table. If we get the SIMD line splitting in, it
will be enough to overwhelm most I/O subsystems available today.

Regards,
Ants Aasma




Re: Parallel copy

2020-04-08 Thread Ants Aasma
On Wed, 8 Apr 2020 at 22:30, Robert Haas  wrote:
> - If we're unable to supply data to the COPY process as fast as the
> workers could load it, then speed will be limited at that point. We
> know reading the file from disk is pretty fast compared to what a
> single process can do. I'm not sure we've tested what happens with a
> network socket. It will depend on the network speed some, but it might
> be useful to know how many MB/s we can pump through over a UNIX
> socket.

This raises a good point. If at some point we want to minimize the
amount of memory copies then we might want to allow for RDMA to
directly write incoming network traffic into a distributing ring
buffer, which would include the protocol level headers. But at this
point we are so far off from network reception becoming a bottleneck I
don't think it's worth holding anything up for not allowing for zero
copy transfers.

> - The portion of the time that is used to split the lines is not
> easily parallelizable. That seems to be a fairly small percentage for
> a reasonably wide table, but it looks significant (13-18%) for a
> narrow table. Such cases will gain less performance and be limited to
> a smaller number of workers. I think we also need to be careful about
> files whose lines are longer than the size of the buffer. If we're not
> careful, we could get a significant performance drop-off in such
> cases. We should make sure to pick an algorithm that seems like it
> will handle such cases without serious regressions and check that a
> file composed entirely of such long lines is handled reasonably
> efficiently.

I don't have a proof, but my gut feeling is that it's fundamentally
impossible to ingest CSV without a serial line-ending/comment
tokenization pass. The current line splitting algorithm is terrible.
I'm currently working with some scientific data where on ingestion
CopyReadLineText() is about 25% of the profile. I prototyped a
replacement that can do ~8GB/s on narrow rows, more on wider ones.

For rows that are consistently wider than the input buffer I think
parallelism will still give a win - the serial phase is just memcpy
through a ringbuffer, after which a worker goes away to perform the
actual insert, letting the next worker read the data. The memcpy is
already happening today, CopyReadLineText() copies the input buffer
into a StringInfo, so the only extra work is synchronization between
leader and worker.

> - There could be index contention. Let's suppose that we can read data
> super fast and break it up into lines super fast. Maybe the file we're
> reading is fully RAM-cached and the lines are long. Now all of the
> backends are inserting into the indexes at the same time, and they
> might be trying to insert into the same pages. If so, lock contention
> could become a factor that hinders performance.

Different data distribution strategies can have an effect on that.
Dealing out input data in larger or smaller chunks will have a
considerable effect on contention, btree page splits and all kinds of
things. I think the common theme would be a push to increase chunk
size to reduce contention.

> - There could also be similar contention on the heap. Say the tuples
> are narrow, and many backends are trying to insert tuples into the
> same heap page at the same time. This would lead to many lock/unlock
> cycles. This could be avoided if the backends avoid targeting the same
> heap pages, but I'm not sure there's any reason to expect that they
> would do so unless we make some special provision for it.

I thought there already was a provision for that. Am I mis-remembering?

> - What else? I bet the above list is not comprehensive.

I think parallel copy patch needs to concentrate on splitting input
data to workers. After that any performance issues would be basically
the same as a normal parallel insert workload. There may well be
bottlenecks there, but those could be tackled independently.

Regards,
Ants Aasma
Cybertec




Re: Parallel copy

2020-04-07 Thread Ants Aasma
On Tue, 7 Apr 2020 at 08:24, vignesh C  wrote:
> Leader will create a circular queue
> and share it across the workers. The circular queue will be present in
> DSM. Leader will be using a fixed size queue to share the contents
> between the leader and the workers. Currently we will have 100
> elements present in the queue. This will be created before the workers
> are started and shared with the workers. The data structures that are
> required by the parallel workers will be initialized by the leader,
> the size required in dsm will be calculated and the necessary keys
> will be loaded in the DSM. The specified number of workers will then
> be launched. Leader will read the table data from the file and copy
> the contents to the queue element by element. Each element in the
> queue will have 64K size DSA. This DSA will be used to store tuple
> contents from the file. The leader will try to copy as much content as
> possible within one 64K DSA queue element. We intend to store at least
> one tuple in each queue element. There are some cases where the 64K
> space may not be enough to store a single tuple. Mostly in cases where
> the table has toast data present and the single tuple can be more than
> 64K size. In these scenarios we will extend the DSA space accordingly.
> We cannot change the size of the dsm once the workers are launched.
> Whereas in case of DSA we can free the dsa pointer and reallocate the
> dsa pointer based on the memory size required. This is the very reason
> for choosing DSA over DSM for storing the data that must be inserted
> into the relation.

I think the element based approach and requirement that all tuples fit
into the queue makes things unnecessarily complex. The approach I
detailed earlier allows for tuples to be bigger than the buffer. In
that case a worker will claim the long tuple from the ring queue of
tuple start positions, and starts copying it into its local line_buf.
This can wrap around the buffer multiple times until the next start
position shows up. At that point this worker can proceed with
inserting the tuple and the next worker will claim the next tuple.

This way nothing needs to be resized, there is no risk of a file with
huge tuples running the system out of memory because each element will
be reallocated to be huge and the number of elements is not something
that has to be tuned.

> We had a couple of options for the way in which queue elements can be stored.
> Option 1:  Each element (DSA chunk) will contain tuples such that each
> tuple will be preceded by the length of the tuple.  So the tuples will
> be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2,
> tuple-2),  Or Option 2: Each element (DSA chunk) will contain only
> tuples (tuple-1), (tuple-2), .  And we will have a second
> ring-buffer which contains a start-offset or length of each tuple. The
> old design used to generate one tuple of data and process tuple by
> tuple. In the new design, the server will generate multiple tuples of
> data per queue element. The worker will then process data tuple by
> tuple. As we are processing the data tuple by tuple, I felt both of
> the options are almost the same. However Design1 was chosen over
> Design 2 as we can save up on some space that was required by another
> variable in each element of the queue.

With option 1 it's not possible to read input data into shared memory,
and there needs to be an extra memcpy in the time-critical sequential
flow of the leader. With option 2 data could be read directly into the
shared memory buffer. With future async I/O support, reading and
looking for tuple boundaries could be performed concurrently.


Regards,
Ants Aasma
Cybertec




Re: Parallel copy

2020-02-26 Thread Ants Aasma
On Tue, 25 Feb 2020 at 18:00, Tomas Vondra  wrote:
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first. I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.

I agree that getting a nice varied dataset would be nice. Including
things like narrow integer only tables, strings with newlines and
escapes in them, extremely wide rows.

I tried to capture a quick profile just to see what it looks like.
Grabbed a random open data set from the web, about 800MB of narrow
rows CSV [1].

Script:
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Profile:
# Samples: 59K of event 'cycles:u'
# Event count (approx.): 57644269486
#
# Overhead  Command   Shared Object   Symbol
# ........  ........  ..............  ..........................
#
18.24%  postgres  postgres[.] CopyReadLine
 9.23%  postgres  postgres[.] NextCopyFrom
 8.87%  postgres  postgres[.] NextCopyFromRawFields
 5.82%  postgres  postgres[.] pg_verify_mbstr_len
 5.45%  postgres  postgres[.] pg_strtoint32
 4.16%  postgres  postgres[.] heap_fill_tuple
 4.03%  postgres  postgres[.] heap_compute_data_size
 3.83%  postgres  postgres[.] CopyFrom
 3.78%  postgres  postgres[.] AllocSetAlloc
 3.53%  postgres  postgres[.] heap_form_tuple
 2.96%  postgres  postgres[.] InputFunctionCall
 2.89%  postgres  libc-2.30.so[.] __memmove_avx_unaligned_erms
 1.82%  postgres  libc-2.30.so[.] __strlen_avx2
 1.72%  postgres  postgres[.] AllocSetReset
 1.72%  postgres  postgres[.] RelationPutHeapTuple
 1.47%  postgres  postgres[.] heap_prepare_insert
 1.31%  postgres  postgres[.] heap_multi_insert
 1.25%  postgres  postgres[.] textin
 1.24%  postgres  postgres[.] int4in
 1.05%  postgres  postgres[.] tts_buffer_heap_clear
 0.85%  postgres  postgres[.] pg_any_to_server
 0.80%  postgres  postgres[.] pg_comp_crc32c_sse42
 0.77%  postgres  postgres[.] cstring_to_text_with_len
 0.69%  postgres  postgres[.] AllocSetFree
 0.60%  postgres  postgres[.] appendBinaryStringInfo
 0.55%  postgres  postgres[.] tts_buffer_heap_materialize.part.0
 0.54%  postgres  postgres[.] palloc
 0.54%  postgres  libc-2.30.so[.] __memmove_avx_unaligned
 0.51%  postgres  postgres[.] palloc0
 0.51%  postgres  postgres[.] pg_encoding_max_length
 0.48%  postgres  postgres[.] enlargeStringInfo
 0.47%  postgres  postgres[.] ExecStoreVirtualTuple
 0.45%  postgres  postgres[.] PageAddItemExtended

So that confirms that the parsing is a huge chunk of the overhead, with
the current splitting into lines being the largest portion. Amdahl's law
says that splitting into tuples needs to be made fast before
parallelizing the rest makes any sense.

Regards,
Ants Aasma

[1] 
https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip




Re: Parallel copy

2020-02-21 Thread Ants Aasma
On Thu, 20 Feb 2020 at 18:43, David Fetter  wrote:>
> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> > I think the wc2 is showing that maybe instead of parallelizing the
> > parsing, we might instead try using a different tokenizer/parser and
> > make the implementation more efficient instead of just throwing more
> > CPUs on it.
>
> That was what I had in mind.
>
> > I don't know if our code is similar to what wc does, maytbe parsing
> > csv is more complicated than what wc does.
>
> CSV parsing differs from wc in that there are more states in the state
> machine, but I don't see anything fundamentally different.

The trouble with a state machine based approach is that the state
transitions form a dependency chain, which means that at best the
processing rate will be 4-5 cycles per byte (L1 latency to fetch the
next state).

I whipped together a quick prototype that uses SIMD and bitmap
manipulations to do the equivalent of CopyReadLineText() in CSV mode,
including quote and escape handling; it runs at 0.25-0.5 cycles per
byte.

Regards,
Ants Aasma
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define likely(x)   __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

/*
 * Create a bitmap of matching characters in the next 64 bytes
 **/
static inline uint64_t
find_chars(__m256i *data, char c)
{
	const __m256i mask = _mm256_set1_epi8(c);
	uint64_t result = (uint32_t) _mm256_movemask_epi8(_mm256_cmpeq_epi8(data[0], mask));
	result |= ((uint64_t) _mm256_movemask_epi8(_mm256_cmpeq_epi8(data[1], mask))) << 32;
	return result;
}

/*
 * Creates a bitmap of unpaired escape characters
 **/
static inline uint64_t
find_unpaired_escapes(uint64_t escapes)
{
	// TODO: handle unpaired escape from end of last iteration
	uint64_t p, e, r;
	p = escapes;
	e = escapes;
	r = escapes;
	while (e) {
		p = e;
		e = (e << 1) & escapes;
		r ^= e;
	}
	return r & p;
}

/*
 * Creates a bitmap mask of quoted sections given locations of 
 * quote chatacters.
 **/
static inline uint64_t
find_quote_mask(uint64_t quote_bits, uint64_t *prev_inside_quote)
{
	uint64_t mask = _mm_cvtsi128_si64(_mm_clmulepi64_si128(
			_mm_set_epi64x(0ULL, quote_bits), _mm_set1_epi8(0xFF), 0));
	mask ^= *prev_inside_quote;
	*prev_inside_quote = ((int64_t) mask) >> 63;
	return mask;
}

/*
 * Parses len bytes from buf according to csv rules and writes start positions of
 * records to output. Returns number of rows found.
 **/
int64_t
parseIntoLines(char *buf, size_t len, size_t *output)
{
	__m256i* input = (__m256i*) buf;
	uint64_t prev_inside_quote = 0;
	size_t pos = 0;
	uint64_t numfound = 0;

	*output++ = 0;
	numfound++;

	while (pos < len - 64) {
		uint64_t quotes = find_chars(input, '"');
		uint64_t escapes = find_chars(input, '\\');
		uint64_t unpaired_escapes = find_unpaired_escapes(escapes);
		uint64_t unescaped_quotes = quotes & ~(unpaired_escapes << 1);
		uint64_t newlines = find_chars(input, '\n');
		uint64_t quote_mask = find_quote_mask(unescaped_quotes, &prev_inside_quote);
		uint64_t tokenpositions = newlines & ~quote_mask;
		uint64_t carriages = find_chars(input, '\r') & ~quote_mask;
		if (unlikely(carriages != 0))
			exit(1);

		uint64_t offset = 0;
		while (tokenpositions > 0) {
			int numchars = __builtin_ctzll(tokenpositions);
			tokenpositions >>= numchars;
			tokenpositions >>= 1;
			offset += numchars + 1;
			*output++ = pos + offset;
			numfound++;
		}

		pos += 64;
		input += 2;
	}
	// TODO: handle tail
	return numfound;
}

int main(int argc, char *argv[])
{
	char *buf;
	uint64_t *lines;
	uint64_t iters = 1;

	if (argc < 2)
	{
		printf("Usage: simdcopy csvfile [iterations]\n");
		return 1;
	}
	if (argc > 2)
	{
		iters = atol(argv[2]);
	}

	buf = aligned_alloc(64, 1024*1024*1024);
	lines = aligned_alloc(8, 128*1024*1024*sizeof(uint64_t));

	if (!buf || !lines)
		return 1;

	FILE *f = fopen(argv[1], "r");
	if (!f)
		return 1;

#define READBLOCK (1024*1024)
	size_t len = 0;
	while (len < (size_t) 1024*1024*1024 - READBLOCK)	/* stay within the 1GB buffer allocated above */
	{
		size_t result = fread(buf + len, 1, READBLOCK, f);
		if (!result)
			break;
		len += result;
	}
	fclose(f);

	struct timespec start;
	struct timespec end;

	printf("Parsing %lu bytes, %lu times\n", len, iters);
	uint64_t numfound;
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (uint64_t i = 0; i < iters; i++) {
		numfound = parseIntoLines(buf, len, lines);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	double delta = (end.tv_sec - start.tv_sec) + (1.e-9)*(end.tv_nsec - start.tv_nsec);

	printf("Found %lu rows in %lu bytes in %f milliseconds\n", numfound, len*iters, delta*1000);
	printf("  Speed: %0.3f GB/s\n", len/delta/1e9*iters);

	return 0;
}


Re: Parallel copy

2020-02-19 Thread Ants Aasma
On Wed, 19 Feb 2020 at 06:22, Amit Kapila  wrote:
>
> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma  wrote:
> >
> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila  wrote:
> > >
> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma  wrote:
> > > >
> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila  
> > > > wrote:
> > > > > This is something similar to what I had also in mind for this idea.  I
> > > > > had thought of handing over complete chunk (64K or whatever we
> > > > > decide).  The one thing that slightly bothers me is that we will add
> > > > > some additional overhead of copying to and from shared memory which
> > > > > was earlier from local process memory.  And, the tokenization (finding
> > > > > line boundaries) would be serial.  I think that tokenization should be
> > > > > a small part of the overall work we do during the copy operation, but
> > > > > will do some measurements to ascertain the same.
> > > >
> > > > I don't think any extra copying is needed.
> > > >
> > >
> > > I am talking about access to shared memory instead of the process
> > > local memory.  I understand that an extra copy won't be required.
> > >
> > > > The reader can directly
> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > > > CopyReadLineText() inner loop directly off of the buffer in shared 
> > > > memory.
> > > >
> > >
> > > I am slightly confused here.  AFAIU, the for(;;) loop in
> > > CopyReadLineText is about finding the line endings which we thought
> > > that the reader process will do.
> >
> > Indeed, I somehow misread the code while scanning over it. So 
> > CopyReadLineText
> > currently copies data from cstate->raw_buf to the StringInfo in
> > cstate->line_buf. In parallel mode it would copy it from the shared data 
> > buffer
> > to local line_buf until it hits the line end found by the data reader. The
> > amount of copying done is still exactly the same as it is now.
> >
>
> Yeah, on a broader level it will be something like that, but actual
> details might vary during implementation.  BTW, have you given any
> thoughts on one other approach I have shared above [1]?  We might not
> go with that idea, but it is better to discuss different ideas and
> evaluate their pros and cons.
>
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com

It seems to me that, at least for the general CSV case, the tokenization into
tuples is an inherently serial task. Adding thread synchronization to that path
for coordinating between multiple workers is only going to make it slower. It
may be possible to enforce limitations on the input (e.g. no quotes allowed) or
do some speculative tokenization (e.g. if we encounter a quote before a newline,
assume the chunk started in a quoted section) to make it possible to do the
tokenization in parallel. But given that the simpler and more fully featured
approach of handling it in a single reader process looks to be fast enough,
I don't see the point. I rather think that the next big step would be to overlap
reading the input and tokenization, hopefully by utilizing Andres's work on async I/O.

Regards,
Ants Aasma




Re: Parallel copy

2020-02-18 Thread Ants Aasma
On Tue, 18 Feb 2020 at 15:21, Amit Kapila  wrote:
>
> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma  wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila  wrote:
> > > This is something similar to what I had also in mind for this idea.  I
> > > had thought of handing over complete chunk (64K or whatever we
> > > decide).  The one thing that slightly bothers me is that we will add
> > > some additional overhead of copying to and from shared memory which
> > > was earlier from local process memory.  And, the tokenization (finding
> > > line boundaries) would be serial.  I think that tokenization should be
> > > a small part of the overall work we do during the copy operation, but
> > > will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
> >
>
> I am talking about access to shared memory instead of the process
> local memory.  I understand that an extra copy won't be required.
>
> > The reader can directly
> > fread()/pq_copymsgbytes() into shared memory, and the workers can run
> > CopyReadLineText() inner loop directly off of the buffer in shared memory.
> >
>
> I am slightly confused here.  AFAIU, the for(;;) loop in
> CopyReadLineText is about finding the line endings which we thought
> that the reader process will do.

Indeed, I somehow misread the code while scanning over it. So CopyReadLineText
currently copies data from cstate->raw_buf to the StringInfo in
cstate->line_buf. In parallel mode it would copy it from the shared data buffer
to local line_buf until it hits the line end found by the data reader. The
amount of copying done is still exactly the same as it is now.

Regards,
Ants Aasma




Re: Parallel copy

2020-02-18 Thread Ants Aasma
On Tue, 18 Feb 2020 at 12:20, Amit Kapila  wrote:
> This is something similar to what I had also in mind for this idea.  I
> had thought of handing over complete chunk (64K or whatever we
> decide).  The one thing that slightly bothers me is that we will add
> some additional overhead of copying to and from shared memory which
> was earlier from local process memory.  And, the tokenization (finding
> line boundaries) would be serial.  I think that tokenization should be
> a small part of the overall work we do during the copy operation, but
> will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can directly
fread()/pq_copymsgbytes() into shared memory, and the workers can run
CopyReadLineText() inner loop directly off of the buffer in shared memory.

For serial performance of tokenization into lines, I really think a SIMD
based approach will be fast enough for quite some time. I hacked up the code in
the simdcsv project to only tokenize on line endings and it was able to
tokenize a CSV file with short lines at 8+ GB/s. There are going to be many
other bottlenecks before this one starts limiting. Patch attached if you'd
like to try that out.

Regards,
Ants Aasma
diff --git a/src/main.cpp b/src/main.cpp
index 9d33a85..2cf775c 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -185,7 +185,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #endif
 simd_input in = fill_input(buf+internal_idx);
 uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote);
-uint64_t sep = cmp_mask_against_input(in, ',');
 #ifdef CRLF
 uint64_t cr = cmp_mask_against_input(in, 0x0d);
 uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end;
@@ -195,7 +194,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #else
 uint64_t end = cmp_mask_against_input(in, 0x0a);
 #endif
-fields[b] = (end | sep) & ~quote_mask;
+fields[b] = (end) & ~quote_mask;
   }
   for(size_t b = 0; b < SIMDCSV_BUFFERSIZE; b++){
 size_t internal_idx = 64 * b + idx;
@@ -211,7 +210,6 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 #endif
   simd_input in = fill_input(buf+idx);
   uint64_t quote_mask = find_quote_mask(in, prev_iter_inside_quote);
-  uint64_t sep = cmp_mask_against_input(in, ',');
 #ifdef CRLF
   uint64_t cr = cmp_mask_against_input(in, 0x0d);
   uint64_t cr_adjusted = (cr << 1) | prev_iter_cr_end;
@@ -226,7 +224,7 @@ bool find_indexes(const uint8_t * buf, size_t len, ParsedCSV & pcsv) {
 // then outside the quotes with LF so it's OK to "and off"
 // the quoted bits here. Some other quote convention would
 // need to be thought about carefully
-  uint64_t field_sep = (end | sep) & ~quote_mask;
+  uint64_t field_sep = (end) & ~quote_mask;
   flatten_bits(base_ptr, base, idx, field_sep);
   }
 #undef SIMDCSV_BUFFERSIZE


Re: Parallel copy

2020-02-17 Thread Ants Aasma
On Tue, 18 Feb 2020 at 04:40, Thomas Munro  wrote:
> +1.  That sort of two-queue scheme is exactly how I sketched out a
> multi-consumer queue for a hypothetical Parallel Scatter node.  It
> probably gets a bit trickier when the payload has to be broken up into
> fragments to wrap around the "data" buffer N times.

At least for copy it should be easy enough - it already has to handle
reading data block by block. If the worker updates its position while
doing so, the reader can wrap around the data buffer.

There will be no parallelism while one worker is buffering up a line
larger than the data buffer, but that doesn't seem like a major issue.
Once the line is buffered and the worker begins inserting it, the next
worker can start buffering the next tuple.

Regards,
Ants Aasma




Re: Parallel copy

2020-02-17 Thread Ants Aasma
On Sat, 15 Feb 2020 at 14:32, Amit Kapila  wrote:
> Good point and I agree with you that having a single process would
> avoid any such stuff.   However, I will think some more on it and if
> you/anyone else gets some idea on how to deal with this in a
> multi-worker system (where we can allow each worker to read and
> process the chunk) then feel free to share your thoughts.

I think having a single process handle splitting the input into tuples makes
the most sense. It's possible to parse CSV at multiple GB/s rates [1], and
finding tuple boundaries is a subset of that task.

My first thought for a design would be to have two shared memory ring buffers,
one for data and one for tuple start positions. Reader process reads the CSV
data into the main buffer, finds tuple start locations in there and writes
those to the secondary buffer.

Worker processes claim a chunk of tuple positions from the secondary buffer and
set their "keep this data around" position to the first position. They then
proceed to parse and insert the tuples, updating their position until they find
the end of the last tuple in the chunk.

Buffer size, maximum and minimum chunk size could be tunable. Ideally the
buffers would be at least big enough to absorb one of the workers getting
scheduled out for a timeslice, which could be up to tens of megabytes.
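
In shared memory the layout could be roughly the following (a sketch only;
DATA_RING_SIZE, POS_RING_SIZE and MAX_PARALLEL_COPY_WORKERS are made-up
names for this example, not from any patch):

typedef struct ParallelCopyShared
{
	/* ring of raw input bytes, filled by the reader process */
	char		data[DATA_RING_SIZE];
	uint64		data_write_pos;		/* next byte the reader will write */

	/* ring of tuple start offsets into the data ring, also reader-filled */
	uint64		tuple_start[POS_RING_SIZE];
	uint64		pos_write_pos;		/* next slot the reader will fill */
	uint64		pos_claim_pos;		/* next slot a worker will claim from */

	/*
	 * Per-worker "keep this data around" watermark: the oldest data ring
	 * offset the worker still needs.  The reader may only overwrite bytes
	 * older than the minimum of these.
	 */
	uint64		worker_keep_pos[MAX_PARALLEL_COPY_WORKERS];
} ParallelCopyShared;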

Regards,
Ants Aasma

[1] https://github.com/geofflangdale/simdcsv/




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-22 Thread Ants Aasma
On Wed, 22 Jan 2020 at 09:02, Hamid Akhtar  wrote:
>
> At this stage, I'm not sure of the scale of changes this will require, 
> however, I wanted to get an understanding and consensus on whether (a) this 
> is something we should work on, and (b) whether an approach to implementing a 
> timeout makes sense.
>
> Please feel free to share your thoughts here.

The intended use case of two-phase transactions is ensuring atomic
durability of transactions across multiple database systems. This
necessarily means that there needs to be a failure-tolerant agent that
ensures there is consensus about the status of the transaction and
then executes that consensus across all systems. In other words, there
needs to be a transaction manager for prepared transactions to actually
fulfil their purpose. Therefore I think that unilaterally timing out
prepared transactions just shifts the consequences of a broken
client from availability to durability. But if durability was never a
concern, why is the client even using two-phase commit?

Citing the documentation:

> PREPARE TRANSACTION is not intended for use in applications or interactive 
> sessions. Its purpose is to allow an external transaction manager to perform 
> atomic global transactions across multiple databases or other transactional 
> resources. Unless you're writing a transaction manager, you probably 
> shouldn't be using PREPARE TRANSACTION.

Regards,
Ants Aasma




Re: Remove size limitations of vacuums dead_tuples array

2019-10-11 Thread Ants Aasma
On Thu, 10 Oct 2019 at 17:05, Tomas Vondra 
wrote:

> There already was a attempt to make this improvement, see [1]. There was
> a fairly long discussion about how to best do that (using other data
> structure, not just a simple array). It kinda died about a year ago, but
> I suppose there's a lot of relevant info in that thread.
>
> [1]
> https://www.postgresql.org/message-id/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com


Thanks for the pointer, wow that's a long thread. For some reason it did
not consider lifting the INT_MAX tuples/12GB limitation. I'll see if I can
pick up where that thread left off and push it along.

Regards,
Ants Aasma
Web: https://www.cybertec-postgresql.com


Remove size limitations of vacuums dead_tuples array

2019-10-09 Thread Ants Aasma
When dealing with a case where a 2TB table had 3 billion dead tuples I
discovered that vacuum currently can't make use of more than 1GB of
maintenance_work_mem, which caps the dead tuples list at about 179M tuples.
This caused excessive amounts of index scanning even though there was plenty
of memory available.

I didn't see any good reason for having this limit, so here is a patch that
makes use of MemoryContextAllocHuge, and converts the array indexing to use
size_t to lift a second limit at 12GB.

One potential problem with allowing larger arrays is that bsearch might no
longer be the best way of determining if a ctid was marked dead. It might
pay off to convert the dead tuples array to a hash table to avoid O(n log
n) runtime when scanning indexes. I haven't done any profiling yet to see
how big of a problem this is.
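
For context, the per-index-entry check is essentially a binary search over
the sorted TID array, roughly like the sketch below (illustrative only;
the real logic lives in lazy_tid_reaped() in vacuumlazy.c, and the field
names follow the LVRelStats struct in the attached patch):

static int
dead_tuple_cmp(const void *left, const void *right)
{
	return ItemPointerCompare((ItemPointer) left, (ItemPointer) right);
}

static bool
tid_is_dead(ItemPointer itemptr, LVRelStats *vacrelstats)
{
	/* O(log n) lookup per index entry over the dead_tuples array */
	return bsearch(itemptr,
				   vacrelstats->dead_tuples,
				   vacrelstats->num_dead_tuples,
				   sizeof(ItemPointerData),
				   dead_tuple_cmp) != NULL;
}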

A second issue I noticed is that the dead_tuples array is always allocated
at the maximum allowed size, unless the table can't possibly have that many
tuples. It may make sense to allocate it based on the estimated number of
dead tuples and resize it if needed.

Regards,
Ants Aasma
Web: https://www.cybertec-postgresql.com
From 6101b360ea85a66aba093f98a83ae335983aa4a5 Mon Sep 17 00:00:00 2001
From: Ants Aasma 
Date: Wed, 2 Oct 2019 20:11:20 +0300
Subject: [PATCH] Allow vacuum to use more than 1GB of memory

Use huge allocation for vacuum dead tuples list and lift the 1GB
limitation that caps maximum number of dead tuples to approximately
179M rows. Now that huge allocations are supported INT_MAX limitation
of array indexing can plausibly be hit (at maintenance_work_mem 12GB).
Use size_t to index the dead tuples array.
---
 src/backend/access/heap/vacuumlazy.c | 34 +---
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a3c4a1df3b4..612b2f51cd7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -130,8 +130,8 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is ordered by TID address */
-	int			num_dead_tuples;	/* current # of entries */
-	int			max_dead_tuples;	/* # slots allocated in array */
+	size_t		num_dead_tuples;	/* current # of entries */
+	size_t		max_dead_tuples;	/* # slots allocated in array */
 	ItemPointer dead_tuples;	/* array of ItemPointerData */
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
@@ -161,8 +161,8 @@ static void lazy_vacuum_index(Relation indrel,
 static void lazy_cleanup_index(Relation indrel,
 			   IndexBulkDeleteResult *stats,
 			   LVRelStats *vacrelstats);
-static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
-			 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+static size_t lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
+			   size_t tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
 static bool should_attempt_truncation(VacuumParams *params,
 	  LVRelStats *vacrelstats);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
@@ -1525,7 +1525,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 static void
 lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 {
-	int			tupindex;
+	size_t		tupindex;
 	int			npages;
 	PGRUsage	ru0;
 	Buffer		vmbuffer = InvalidBuffer;
@@ -1571,7 +1571,7 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 	}
 
 	ereport(elevel,
-			(errmsg("\"%s\": removed %d row versions in %d pages",
+			(errmsg("\"%s\": removed %zu row versions in %d pages",
 	RelationGetRelationName(onerel),
 	tupindex, npages),
			 errdetail_internal("%s", pg_rusage_show(&ru0))));
@@ -1587,9 +1587,9 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
  * tuple for this page.  We assume the rest follow sequentially.
  * The return value is the first tupindex after the tuples of this page.
  */
-static int
+static size_t
 lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
+ size_t tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer)
 {
 	Page		page = BufferGetPage(buffer);
 	OffsetNumber unused[MaxOffsetNumber];
@@ -1762,7 +1762,7 @@ lazy_vacuum_index(Relation indrel,
 			   lazy_tid_reaped, (void *) vacrelstats);
 
 	ereport(elevel,
-			(errmsg("scanned index \"%s\" to remove %d row versions",
+			(errmsg("scanned index \"%s\" to remove %zu row versions",
 	RelationGetRelationName(indrel),
 	vacrelstats->num_dead_tuples),
			 errdetail_internal("%s", pg_rusage_show(&ru0))));
@@ -2141,7 +2141,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 static void
 lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 {
-	long		maxtuples

Re: Transparent Data Encryption (TDE) and encrypted files

2019-10-08 Thread Ants Aasma
On Mon, 7 Oct 2019 at 18:02, Bruce Momjian  wrote:

> Well, do to encryption properly, there is the requirement of the nonce.
> If you ever rewrite a bit, you technically have to have a new nonce.
> For WAL, since it is append-only, you can use the WAL file name.  For
> heap/index files, we change the LSN on every rewrite (with
> wal_log_hints=on), and we never use the same LSN for writing multiple
> relations, so LSN+page-offset is a sufficient nonce.
>
> For clog, it is not append-only, and bytes are rewritten (from zero to
> non-zero), so there would have to be a new nonce for every clog file
> write to the file system.  We can store the nonce in a separate file,
> but the clog contents and nonce would have to be always synchronized or
> the file could not be properly read.  Basically every file we want to
> encrypt, needs this kind of study.
>

Yes. That is the reason why our current version doesn't encrypt SLRUs.
There is some security in encrypting without a nonce when considering an
attack vector that only sees one version of the encrypted page. But to make
headway on this we need to figure out whether the TDE feature is useful
without SLRU encryption (I think yes), how hard it would be to properly
encrypt SLRUs, and whether such a solution would be acceptable for inclusion.

I can think of 3 options:

a) Separate nonce storage. Seems pretty bad complexity-wise: new data
structures would need to be created, and SLRU writes would need to be
WAL-logged with a full page image.
b) Inline nonces, where the number of items per SLRU page varies depending
on whether encryption is enabled.
c) Inline nonces in a header structure reserved on all SLRU pages.
pg_upgrade needs to rewrite persistent SLRUs.

None of the options seem great, but c) has the benefit of also carving out
the space for SLRU checksums.
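
To make option c) slightly more concrete, a header along these lines could
be reserved at the start of every SLRU page. Field names and sizes are
purely illustrative, not a worked-out proposal:

#include <stdint.h>

/* Hypothetical per-page header reserved at the start of every SLRU page. */
typedef struct SlruPageHeaderData
{
	uint64_t	nonce;		/* bumped on every write of the page */
	uint16_t	checksum;	/* page checksum, which SLRUs currently lack */
	uint16_t	version;	/* layout version, useful for pg_upgrade rewrites */
	uint32_t	reserved;	/* padding / future use */
} SlruPageHeaderData;

/* Usable payload per page shrinks by sizeof(SlruPageHeaderData), which is
 * why persistent SLRUs would need a pg_upgrade rewrite. */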

> > As I also said to Stephen, the people who are discussing this here
> > should *really really really* be looking at the Cybertec patch instead
> > of trying to invent everything from scratch - unless that patch has,
>
> Someone from Cybertec is on the voice calls we have, and is actively
> involved.
>

As far as I can tell no one from us is on the call. I personally missed the
invitation when it was sent out. I would gladly share our learnings; a lot
of what I see here is retreading what we already went through with our
patch. However, I think that at the very least the conclusions, the problems
to work on and the WIP patch should be shared on the list. It's hard for
anybody outside to have any input if there are no concrete design proposals
or code to review. Moreover, I think e-mail is a much better medium for
having a reasoned discussion about technical design decisions.


> > In other words: maybe I'm wrong here, but it looks to me like we're
> > laboriously reinventing the wheel when we could be working on
> > improving the working prototype.
>
> The work being done is building on that prototype.
>

We would like to help on that front.

Regards,
Ants Aasma
Web: https://www.cybertec-postgresql.com


Re: Enable data checksums by default

2019-03-29 Thread Ants Aasma
On Thu, Mar 28, 2019 at 10:38 AM Christoph Berg  wrote:

> Re: Ants Aasma 2019-03-27 <
> ca+csw_twxdrzdn2xsszbxej63dez+f6_hs3qf7hmxfenxsq...@mail.gmail.com>
> > Can you try with postgres compiled with CFLAGS="-O2 -march=native"?
> There's
> > a bit of low hanging fruit there to use a runtime CPU check to pick a
> > better optimized checksum function.
>
> Frankly, no. This is with the apt.pg.o packages which are supposed to
> be usable by everyone. If there is a better per-CPU checksum function,
> PG should pick it at runtime. Special compiler flags are a no-go here.
>

I went ahead and tested it on the count(*) test, same settings as upthread.
Median of 5 runs of 20txs on Intel i5-2500k @ 4GHz.

No checksum: 344ms
Checksums: 384ms (+12%)
No checksum march=native: 344ms
Checksums march=native: 369ms (+7%)

The checksum code was written to be easily auto-vectorized by the compiler.
So if we just compile the same function with different compiler flags and
pick between the variants at runtime, the overhead can be approximately
halved. I'm not saying that this needs to be done before enabling checksums
by default, just that when considering the overhead, we can foresee it being
much lower in future versions.
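
A minimal sketch of such runtime selection, assuming the same checksum
source is compiled twice with different target flags; the function names
are illustrative and not existing PostgreSQL symbols:

#include <stdint.h>

typedef uint16_t (*checksum_fn) (const void *page, uint32_t blkno);

/* Same checksum source compiled twice, e.g. baseline and with -mavx2. */
extern uint16_t pg_checksum_page_default(const void *page, uint32_t blkno);
extern uint16_t pg_checksum_page_avx2(const void *page, uint32_t blkno);

static checksum_fn pg_checksum_page_impl;

static void
choose_checksum_impl(void)
{
	__builtin_cpu_init();		/* GCC/Clang runtime CPU detection */
	pg_checksum_page_impl = __builtin_cpu_supports("avx2")
		? pg_checksum_page_avx2
		: pg_checksum_page_default;
}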

Regards,
Ants Aasma


Re: Enable data checksums by default

2019-03-27 Thread Ants Aasma
On Wed, Mar 27, 2019, 15:57 Christoph Berg  wrote:

> Re: To Tom Lane 2019-03-26 <20190326151446.gg3...@msg.df7cb.de>
> > I run a benchmark with checksums disabled/enabled. shared_buffers is
> > 512kB to make sure almost any read will fetch the page from the OS
> > cache; scale factor is 50 (~750MB) to make sure the whole cluster fits
> > into RAM.
> [...]
> > So the cost is 5% in this very contrived case. In almost any other
> > setting, the cost would be lower, I'd think.
>
> (That was on 12devel, btw.)
>
> That was about the most extreme OLTP read-only workload. After
> thinking about it some more, I realized that exercising large seqscans
> might be an even better way to test it because of less per-query
> overhead.
>
> Same setup again, shared_buffers = 16 (128kB), jit = off,
> max_parallel_workers_per_gather = 0:
>
> select count(bid) from pgbench_accounts;
>
> no checksums: ~456ms
> with checksums: ~489ms
>
> 456.0/489 = 0.9325
>
> The cost of checksums is about 6.75% here.
>

Can you try with postgres compiled with CFLAGS="-O2 -march=native"? There's
a bit of low-hanging fruit there: using a runtime CPU check to pick a
better-optimized checksum function.

Regards,
Ants Aasma



Re: CPU costs of random_zipfian in pgbench

2019-02-22 Thread Ants Aasma
On Sun, Feb 17, 2019 at 10:52 AM Fabien COELHO  wrote:

> > I'm trying to use random_zipfian() for benchmarking of skewed data sets,
> > and I ran head-first into an issue with rather excessive CPU costs.
> > [...] This happens because generalizedHarmonicNumber() does this:
> >
> >   for (i = n; i > 1; i--)
> >   ans += pow(i, -s);
> >
> > where n happens to be 10 (range passed to random_zipfian), so
> > the loop takes quite a bit of time.
>
> If you find a better formula for the harmonic number, you are welcome
> and probably get your name on it:-)
>

There are pretty good approximations for s > 1.0 using the Riemann zeta
function, and Euler derived a formula (ln(n) plus the Euler-Mascheroni
constant) for the s = 1 case.
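
For illustration, one way to get rid of the O(n) loop is to sum only the
first terms exactly and estimate the remainder with the Euler-Maclaurin
formula. A rough sketch, with an arbitrary cutoff and assuming s != 1 (the
s = 1 case would use the ln(n) + gamma formula instead):

#include <math.h>
#include <stdint.h>

static double
generalized_harmonic_approx(int64_t n, double s)
{
	const int64_t cutoff = 1000;	/* exact terms; raise for more accuracy */
	double		ans = 0.0;
	int64_t		i;

	if (n <= cutoff)
	{
		for (i = 1; i <= n; i++)
			ans += pow((double) i, -s);
		return ans;
	}

	for (i = 1; i < cutoff; i++)
		ans += pow((double) i, -s);

	/* Euler-Maclaurin estimate of sum_{i=cutoff..n} i^-s */
	ans += (pow((double) n, 1.0 - s) - pow((double) cutoff, 1.0 - s)) / (1.0 - s);
	ans += (pow((double) n, -s) + pow((double) cutoff, -s)) / 2.0;
	ans += s * (pow((double) cutoff, -s - 1.0) - pow((double) n, -s - 1.0)) / 12.0;

	return ans;
}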

I also noticed that i is int in this function, but n is int64. That seems
like an oversight.

Regards,
Ants Aasma


Re: WAL insert delay settings

2019-02-21 Thread Ants Aasma
On Thu, Feb 21, 2019 at 12:50 PM Stephen Frost  wrote:

> > Rate limit in front of WAL insertion would allow for allocating the
> > throughput between foreground and background tasks, and even allow for
> > priority inheritance to alleviate priority inversion due to locks.
>
> I'm not sure how much we have to worry about priority inversion here as
> you need to have conflicts for that and if there's actually a conflict,
> then it seems like we should just press on.
>
> That is, a non-concurrent REINDEX is going to prevent an UPDATE from
> modifying anything in the table, which if the UPDATE is a higher
> priority than the REINDEX would be priority inversion, but that doesn't
> mean we should slow down the REINDEX to allow the UPDATE to happen
> because the UPDATE simply can't happen until the REINDEX is complete.
> Now, we might slow down the REINDEX because there's UPDATEs against
> *other* tables that aren't conflicting and we want those UPDATEs to be
> prioritized over the REINDEX but then that isn't priority inversion.
>

I was thinking along the lines that each backend gets a budget of WAL
insertion credits per time interval, and when the credits run out the
process sleeps. With this type of scheme it would be reasonably
straightforward to let UPDATEs that are blocked by a REINDEX transfer their
WAL insertion budgets to the REINDEX, giving it a larger piece of the
total throughput pie.
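
In essence that is a per-backend token bucket. A standalone sketch of the
accounting, where the struct, the names and the use of bytes as the credit
unit are all assumptions rather than a concrete design:

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

typedef struct WalCreditBucket
{
	double		credits;		/* bytes of WAL we may still insert */
	double		refill_rate;	/* bytes per second granted to this backend */
	double		max_credits;	/* burst cap */
	struct timespec last_refill;
} WalCreditBucket;

/* Returns true if nbytes may be inserted now, false if the caller should
 * sleep (or, with priority inheritance, donate its refill_rate to the
 * backend it is waiting on). */
static bool
wal_credit_consume(WalCreditBucket *b, size_t nbytes)
{
	struct timespec now;
	double		elapsed;

	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed = (now.tv_sec - b->last_refill.tv_sec) +
		(now.tv_nsec - b->last_refill.tv_nsec) / 1e9;
	b->last_refill = now;

	b->credits += elapsed * b->refill_rate;
	if (b->credits > b->max_credits)
		b->credits = b->max_credits;

	if (b->credits < (double) nbytes)
		return false;
	b->credits -= (double) nbytes;
	return true;
}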

Regards,
Ants Aasma


Re: WAL insert delay settings

2019-02-21 Thread Ants Aasma
On Thu, Feb 21, 2019 at 2:20 AM Stephen Frost  wrote:

> * Andres Freund (and...@anarazel.de) wrote:
> > On 2019-02-20 18:46:09 -0500, Stephen Frost wrote:
> > > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> > > > On 2/20/19 10:43 PM, Stephen Frost wrote:
> > > > > Just to share a few additional thoughts after pondering this for a
> > > > > while, but the comment Andres made up-thread really struck a
> chord- we
> > > > > don't necessairly want to throttle anything, what we'd really
> rather do
> > > > > is *prioritize* things, whereby foreground work (regular queries
> and
> > > > > such) have a higher priority than background/bulk work (VACUUM,
> REINDEX,
> > > > > etc) but otherwise we use the system to its full capacity.  We
> don't
> > > > > actually want to throttle a VACUUM run any more than a CREATE
> INDEX, we
> > > > > just don't want those to hurt the performance of regular queries
> that
> > > > > are happening.
> > > >
> > > > I think you're forgetting the motivation of this very patch was to
> > > > prevent replication lag caused by a command generating large amounts
> of
> > > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing
> to
> > > > do with prioritization or foreground/background split.
> > > >
> > > > I'm not arguing against ability to prioritize stuff, but I disagree
> it
> > > > somehow replaces throttling.
> > >
> > > Why is replication lag an issue though?  I would contend it's an issue
> > > because with sync replication, it makes foreground processes wait, and
> > > with async replication, it makes the actions of foreground processes
> > > show up late on the replicas.
> >
> > I think reaching the bandwidth limit of either the replication stream,
> > or of the startup process is actually more common than these. And for
> > that prioritization doesn't help, unless it somehow reduces the total
> > amount of WAL.
>
> The issue with hitting those bandwidth limits is that you end up with
> queues outside of your control and therefore are unable to prioritize
> the data going through them.  I agree, that's an issue and it might be
> necessary to ask the admin to provide what the bandwidth limit is, so
> that we could then avoid running into issues with downstream queues that
> are outside of our control causing unexpected/unacceptable lag.
>

If there is a global rate limit on WAL throughput it could be adjusted by a
control loop measuring replication queue length and/or apply delay. I don't
see any sane way to tune a per-command rate limit, or even worse, a
cost-delay parameter; it would have the same problems as the work_mem
setting.
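
To illustrate the control-loop idea, a simple proportional controller could
shrink the global limit when standby apply delay exceeds a target and let it
grow back otherwise. Everything in this sketch (names, gain, clamping) is
made up for illustration:

/* Adjust the global WAL rate limit (bytes/s) once per control interval. */
static double
adjust_wal_rate_limit(double current_limit, double apply_delay,
					  double target_delay)
{
	const double gain = 0.1;	/* how aggressively to react */
	double		error = (target_delay - apply_delay) / target_delay;
	double		new_limit = current_limit * (1.0 + gain * error);

	/* clamp to something sane so the system can always make progress */
	if (new_limit < 1024.0 * 1024.0)
		new_limit = 1024.0 * 1024.0;
	return new_limit;
}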

Rate limit in front of WAL insertion would allow for allocating the
throughput between foreground and background tasks, and even allow for
priority inheritance to alleviate priority inversion due to locks.

There is also an implicit assumption here that a maintenance command is a
background task and a normal DML query is a foreground task. This is not
true in all cases: users may want to throttle transactions doing lots of
DML to keep synchronous commit latencies for smaller transactions within
reasonable limits.

As a wild idea for how to handle the throttling: what if, when all of our
WAL insertion credits are used up, XLogInsert() just sets InterruptPending
and the actual sleep is done inside ProcessInterrupts()?
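
Roughly along these lines; WalThrottlePending and the helper functions are
hypothetical, while InterruptPending, CHECK_FOR_INTERRUPTS() and pg_usleep()
are the existing backend mechanisms this would piggyback on (the usual
backend includes are assumed):

static volatile bool WalThrottlePending = false;	/* hypothetical flag */

/* Called from XLogInsert() after accounting for the record's size. */
static void
xlog_throttle_check(bool credits_exhausted)
{
	if (credits_exhausted)
	{
		WalThrottlePending = true;
		InterruptPending = true;	/* next CHECK_FOR_INTERRUPTS() will fire */
	}
}

/* Called from ProcessInterrupts(), safely outside any critical section. */
static void
xlog_throttle_sleep(void)
{
	if (WalThrottlePending)
	{
		WalThrottlePending = false;
		pg_usleep(10000L);		/* placeholder: wait until credits refill */
	}
}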

Regards,
Ants Aasma


Re: Checkpoint start logging is done inside critical section

2018-10-18 Thread Ants Aasma
On Thu, Oct 18, 2018 at 9:02 AM Amit Kapila  wrote:
>
> On Thu, Oct 18, 2018 at 10:27 AM Andres Freund  wrote:
> > (that's why we mark the ctx as being ok with that).
> >
>
> Yeah, as the palloc for log message would be called in an ErrorContext
> where it is safe to do the allocation, so ideally this shouldn't be a
> problem.  So, it seems to me that this is not a problem, Ants, do you
> see any problem in any particular scenario or was this based on
> theoretical analysis?

This was purely theoretical, as also evidenced by the lack of complaints
even though the code has been like that for a very long time. I was
actually mostly worried about extension code run by a logging hook
causing the panic.

Regards,
Ants Aasma



Checkpoint start logging is done inside critical section

2018-10-17 Thread Ants Aasma
The LogCheckpointStart() call inside CreateCheckPoint() is done while
inside a critical section. The elog call could trigger errors due to
memory allocations or from a logging hook, resulting in a panic. It
seems better to postpone the logging until after the critical section
is done. It's only a few lwlock acquisitions away and shouldn't make
any material difference. Patch to do so is attached.

Regards,
Ants Aasma
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffc..faa9690e48 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8907,15 +8907,6 @@ CreateCheckPoint(int flags)
 	XLogCtl->RedoRecPtr = checkPoint.redo;
	SpinLockRelease(&XLogCtl->info_lck);
 
-	/*
-	 * If enabled, log checkpoint start.  We postpone this until now so as not
-	 * to log anything if we decided to skip the checkpoint.
-	 */
-	if (log_checkpoints)
-		LogCheckpointStart(flags, false);
-
-	TRACE_POSTGRESQL_CHECKPOINT_START(flags);
-
 	/*
 	 * Get the other info we need for the checkpoint record.
 	 *
@@ -8962,6 +8953,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * If enabled, log checkpoint start.  We postpone this until now so as not
+	 * to log anything if we decided to skip the checkpoint.
+	 */
+	if (log_checkpoints)
+		LogCheckpointStart(flags, false);
+
+	TRACE_POSTGRESQL_CHECKPOINT_START(flags);
+
 	/*
 	 * In some cases there are groups of actions that must all occur on one
 	 * side or the other of a checkpoint record. Before flushing the


Re: Skylake-S warning

2018-10-04 Thread Ants Aasma
On Thu, Oct 4, 2018 at 9:50 AM Adrien Nayrat  wrote:
>
> On 10/3/18 11:29 PM, Daniel Wood wrote:
> > If running benchmarks or you are a customer which is currently impacted by
> > GetSnapshotData() on high end multisocket systems be wary of Skylake-S.
> >
> >
> > Performance differences of nearly 2X can be seen on select only pgbench due 
> > to
> > nothing else but unlucky choices for max_connections.  Scale 1000, 192 local
> > clients on a 2 socket 48 core Skylake-S(Xeon Platinum 8175M @ 2.50-GHz) 
> > system.
> > pgbench -S
>
> Could it be related to :
> https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net
> ?


Unlikely. I understood from Daniel's email that profiling shows a
different hot spot. In the cited .NET issue the problem was mostly due
to issuing PAUSE in a loop without attempting to grab the lock. In
PostgreSQL, PAUSE is issued only once per retry attempt.

Regards,
Ants Aasma
--
PostgreSQL Senior Consultant
www.cybertec-postgresql.com

Austria (HQ), Wiener Neustadt  |  Switzerland, Zürich  |  Estonia,
Tallinn  |  Uruguay, Montevideo
Facebook: www.fb.com/cybertec.postgresql
Twitter: www.twitter.com/PostgresSupport



Re: Recovery performance of standby for multiple concurrent truncates on large tables

2018-07-10 Thread Ants Aasma
On Tue, Jul 10, 2018 at 10:05 AM Jamison, Kirk 
wrote:

> Since in the current implementation, the replay of each TRUNCATE/DROP
> TABLE scans the whole shared buffer.
>
> One approach (though idea is not really developed yet) is to improve the
> recovery by delaying the shared buffer scan and invalidation
> (DropRelFileNodeBuffers) and to put it after the next checkpoint (after
> failover completion). The replay of TRUNCATE/DROP TABLE just make the
> checkpointer process remember what relations should be invalidated in the
> shared buffer during subsequent checkpoint. The checkpointer then scans the
> shared buffer only once to invalidate the buffers of relations that was
> dropped and truncated.
>

How about using the background writer for this? It seems to me that the
main reason to invalidate buffers would be to free them up for buffer
allocation, which is precisely the task of the background writer. When
adding a filenode to be invalidated, take note of the bgwriter position and
add the filenode to a queue. As the bgwriter advances, it checks each buffer
tag against a hash table of filenodes being invalidated. Once the background
writer has completed a full loop it can remove the invalidated filenode from
the queue. If the bgwriter falls behind the clock sweep and there are
filenodes to invalidate, it should run the invalidation scan instead of
skipping ahead. If there are already too many filenodes being invalidated,
then whoever is trying to add a new one gets to run the invalidation scan
until something can be evicted.
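
To make the bookkeeping concrete, here is a toy standalone sketch of the
pending-invalidation queue. Real code would use PostgreSQL's RelFileNode and
BufferTag types and a shared-memory hash table, so everything below is
illustrative only:

#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING_INVAL 64

typedef struct PendingInval
{
	uint32_t	relfilenode;	/* relation whose buffers should be dropped */
	uint32_t	queued_at;		/* bgwriter clock position when queued */
	bool		in_use;
} PendingInval;

static PendingInval pending[MAX_PENDING_INVAL];

/* Queue a filenode; returns false if the queue is full, in which case the
 * caller has to run the full invalidation scan itself. */
static bool
queue_invalidation(uint32_t relfilenode, uint32_t bgwriter_pos)
{
	for (int i = 0; i < MAX_PENDING_INVAL; i++)
	{
		if (!pending[i].in_use)
		{
			pending[i].relfilenode = relfilenode;
			pending[i].queued_at = bgwriter_pos;
			pending[i].in_use = true;
			return true;
		}
	}
	return false;
}

/* Checked by the bgwriter for every buffer its clock hand passes over. */
static bool
buffer_needs_invalidation(uint32_t buf_relfilenode)
{
	for (int i = 0; i < MAX_PENDING_INVAL; i++)
	{
		if (pending[i].in_use && pending[i].relfilenode == buf_relfilenode)
			return true;
	}
	return false;
}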

--
Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com/


Re: WAL prefetch

2018-06-19 Thread Ants Aasma
On Tue, Jun 19, 2018 at 4:04 PM Tomas Vondra 
wrote:

> Right. My point is that while spawning bgworkers probably helps, I don't
> expect it to be enough to fill the I/O queues on modern storage systems.
> Even if you start say 16 prefetch bgworkers, that's not going to be
> enough for large arrays or SSDs. Those typically need way more than 16
> requests in the queue.
>
> Consider for example [1] from 2014 where Merlin reported how S3500
> (Intel SATA SSD) behaves with different effective_io_concurrency values:
>
> [1]
>
> https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azs...@mail.gmail.com
>
> Clearly, you need to prefetch 32/64 blocks or so. Consider you may have
> multiple such devices in a single RAID array, and that this device is
> from 2014 (and newer flash devices likely need even deeper queues).'
>

For reference, a typical datacenter SSD needs a queue depth of 128 to
saturate a single device. [1] Multiply that appropriately for RAID arrays.

Regards,
Ants Aasma

[1]
https://www.anandtech.com/show/12435/the-intel-ssd-dc-p4510-ssd-review-part-1-virtual-raid-on-cpu-vroc-scalability/3


Re: All Taxi Services need Index Clustered Heap Append

2018-03-05 Thread Ants Aasma
On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski
<m...@komzpa.net> wrote:
>> This approach mixes well with hash
>> partitioning. It would be neat indeed if PostgreSQL do something
>> equivalent on its own, and pluggable storage work being done could
>> enable index organized tables that would help. But you probably need
>> something right now.
>
>
> Fixing glaring issues (no vacuum and thus no Index-Only Scan on append-only
> tables, vacuum processing all of the eternity of btree) by 11 will get most
> of spike-nails out of the microservice code, and we can probably live with
> them until 11 gets to RDS.
>
> I also don't see why a pluggable storage is a must for the clustered write.
> Postgres does have a mechanism for selecting the next page to write tuple
> to, right now it's just looking at FSM - but what if it just peeked at
> existing index that already has enough the data to route tuple to correct
> page on write?

The mechanism you outlined would likely work for your use case, but it
has many issues that prevent it from being universally useful. Off the
top of my head:

* One extra index descent per insertion (I/O for this is necessary
anyway, but CPU work is duplicated).
* We don't currently track the amount of bloat. A mechanism that does
this needs to be added.
* If table hits the bloat limit there will be a sudden change in
behavior. This is pretty nasty from an operations point of view.
* With your (id,ts) clustering and data coming in mostly ordered by
timestamp, after initial warmup, each page will contain rows from a
single id, but different ids are arbitrarily interleaved. This is
better than current state, but people might want to have an
interleaving step bigger than 8kB to better utilize storage hardware.
* It seems that with a common (ts) clustering and age of timestamp
coming from an exponential distribution, this will quickly bloat to
threshold and then insert data in a rather arbitrary order. This is
much worse than the default behavior.

At least in my opinion these problems make it a special-case
optimization that is hard to justify in core. A decent alternative
would be a plugin mechanism for locating free space for a tuple, so
you could write an extension that finds a suitable location for the row.

>> I guess I don't have to tell you that it looks like your needs have
>> outgrown what RDS works well with and you are in for a painful move
>> sooner or later.
>
>
> Painful move where to? If we just run a Postgres instance without RDS we'll
> get the pain of setting up Postgres and replication and backups and
> autofailover, with no visible gain except if we get some private /
> unaccepted patches applied to it. If we can get these things right upstream
> why would we want to switch?

EC2, for example. Mainly because I3 instances with ephemeral storage
provide an order of magnitude or two of performance improvement while
costing less. Being able to run custom extensions and patches if necessary
is a nice bonus. Yes, setting up replication, autofailover and backups is
extra work that you have to weigh against the benefits. But don't
overestimate the effort: there are some pretty nice tools available
that make a proper cluster relatively simple to set up.

> Per my colleagues, MySQL offers clustered index, also MySQL is available on
> RDS without the need of "painful move", which is doable by writing to two
> locations for a day and then pointing readers to new DB. But if we can
> instead do no move and be sure the issues are gone upstream before we hit
> the limit of spike-nails we're running on currently, wouldn't that be
> better? :)

The move off of RDS is painful because getting data out of RDS
involves either downtime or building an ad-hoc logical replication
solution. You need to solve that regardless of where you move to.

Providing an out-of-the-box solution in core PostgreSQL would of
course be best, but realistically you will be waiting at least 2 years
to get it on RDS. In the meantime, either the buffer partition
approach I described or a buffering microservice in front of
PostgreSQL, as Aleksander recommended, should fix data locality for
you. If you weren't running on RDS I would even propose using Redis as
the buffer with one key per driver and redis_fdw to make the data
accessible from within PostgreSQL.

Regards,
Ants  Aasma
--
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: All Taxi Services need Index Clustered Heap Append

2018-03-04 Thread Ants Aasma
On Sat, Mar 3, 2018 at 4:53 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> On 3 March 2018 at 05:30, Darafei "Komяpa" Praliaskouski <m...@komzpa.net> 
> wrote:
>> Our options were:
>>
>>  - partitioning. Not entirely trivial when your id is uuid. To get visible
>> gains, we need to make sure each driver gets their own partition. That would
>> leave us with 50 000(+) tables, and rumors say that in that's what is done
>> in some bigger taxi service, and relcache then eats up all the RAM and
>> system OOMs.
>
> It's a good job someone invented HASH partitioning then.
>
> It would be interesting to hear how your benchmarks go using current
> master + the faster partition pruning patchset [1].  Currently, HASH
> partitioning does exist in master, just there's no partition pruning
> for the non-matching partitions, which is why you need [1].
>
> I think trying with something like 500-1000 partitions might be a good
> place to start.

I don't think that will actually help much. 1000 partitions means each
partition gets data from ~50 vehicles. At 60 tuples per page, each page
in the partitioned table will contain on average 60/50 = 1.2 interesting
tuples. So you still have almost one page read per row.


Regards,
Ants  Aasma
-- 
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: All Taxi Services need Index Clustered Heap Append

2018-03-04 Thread Ants Aasma
On Fri, Mar 2, 2018 at 6:30 PM, Darafei "Komяpa" Praliaskouski
<m...@komzpa.net> wrote:
> I gave this all some thought and it looks like it all could have not
> happened if Postgres was able to cluster heap insertions by (id, ts) index.
> We're ok with synchronuous_commit=off, so amplified write won't immediately
> hit disk and can get cooled down in progress. Clustering doesn't require
> perfect sorting: we need to minimize number of pages fetched, it's ok if the
> pages are not consecutive on disk.

Data locality is indeed the key here, specifically for non-cached
data. It is possible to manually implement some approximation of
clustering at the SQL level with current PostgreSQL features: insert
incoming data into a fresh partition, and have a background job
periodically swap input over to a new partition and then move the data
from the previous one into main storage, sorting it by vehicle in the
process. If you do this every few minutes or so you should be able to
tune the system so that the new partition data isn't even written to
disk; you only pay the cost of double WAL for the insertion and the
CPU work to perform the move. This approach mixes well with hash
partitioning. It would be neat indeed if PostgreSQL did something
equivalent on its own, and the pluggable storage work being done could
enable index-organized tables, which would help. But you probably need
something right now.

I guess I don't have to tell you that it looks like your needs have
outgrown what RDS works well with and you are in for a painful move
sooner or later.

Regards,
Ants  Aasma
-- 
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: RTLD_GLOBAL (& JIT inlining)

2018-02-26 Thread Ants Aasma
On Mon, Feb 26, 2018 at 11:28 PM, Andres Freund <and...@anarazel.de> wrote:
> So RTLD_LOCAL is out of the question, but I think we can get a good bit
> of the benefit by either specifying -Wl,-Bsymbolic at shlib build time,
> or RTLD_DEEPBIND at dlopen() time.  Either leads to the opened shared
> library effectively being put at the beginning of the search path,
> therefore avoiding the issue that an earlier loaded shared library or
> symbols from the main binary can accidentally overwrite things in the
> shared library itself. Which incidentally also makes loading a bit
> faster.

I think this would also fix oracle_fdw crashing when postgres is
compiled with --with-ldap. At least RTLD_DEEPBIND helped. [1]

[1] 
https://www.postgresql.org/message-id/CA%2BCSw_tPDYgnzCYW0S4oU0mTUoUhZ9pc7MRBPXVD-3Zbiwni9w%40mail.gmail.com
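
For reference, the difference is just one extra flag at dlopen() time;
RTLD_DEEPBIND is glibc-specific, and the wrapper here is only an
illustration:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* Load a shared library so that its own symbols (e.g. a bundled libldap in
 * the oracle_fdw case) take precedence over same-named symbols already
 * loaded into the process. */
static void *
load_with_deepbind(const char *path)
{
	return dlopen(path, RTLD_NOW | RTLD_GLOBAL | RTLD_DEEPBIND);
}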

Ants Aasma