Re: Experimenting with hash join prefetch
On Tue, Feb 4, 2020 at 2:31 PM Andres Freund wrote:
> How much of the benefit here comes from the prefetching, and how much
> just from writing the code in a manner that allows for more out-of-order
> execution? Because there are no dependencies preventing execution of the
> next queued tuple anymore, I'd assume that this is a good part of what
> helps?

A small part of the speed-up does indeed seem to come from that sort of thing.

> Code like this really should look something roughly like:
>
> while (true)
>     have_skew = False
>
>     # get tuples
>     for i in 0..batchsize:
>         tuples[i] = ExecProcNode(outerNode);
>         if tuples[i] == NULL:
>             # have slowpath handle this
>             break;
>
>     # compute their hash values
>     for i in 0..batchsize:
>         hashvalues[i] = ExecHashGetHashValue(tuples[i])
>
>     # check whether they go into skew buckets
>     if have_skew_table:
>         for i in 0..batchsize:
>             skewbuckets[i] = ExecHashGetSkewBucket(tuples[i], hashvalues[i])
>             if skewbuckets[i] != INVALID_SKEW_BUCKET_NO:
>                 have_skew = True
>         if have_skew:
>             # handle everything here
>             continue
>
>     # assume there are no skew tuples going forward, all handled above
>
>     # compute bucket/batch for all tuples
>     have_into_batchfile = False
>     for i in 0..batchsize:
>         hashbuckets[i] = ExecHashGetBucketAndBatch()
>         if hashbuckets[i] != hashtable->curbatch:
>             have_into_batchfile = True
>
>     if have_into_batchfile:
>         # slowpath
>         continue
>
>     # allocate all tuples
>     for i in 0..batchsize:
>         hjtuples[i] = alloc_mintuple(hashtuples[i])
>
>     # and finally insert them
>     for i in 0..batchsize:
>         hjtuples[i].next = buckets[hashbuckets[i]]
>         buckets[hashbuckets[i]] = hjtuples[i]

Hmm. I see your point: don't branch on the batch number immediately, and so on. I thought about a multi-pass system a bit like that too, purely for prefetching purposes, though I haven't tested it due to lack of the required infrastructure. I guess you need a way to get the next N tuples and make them all simultaneously available, without copying them yet.
For this experiment, I speculated that it might be better to be continually inserting a short distance behind, so that there are no batch-boundary stalls at all. Admittedly, it's pretty hard to choose the right queue depth if your loop includes ExecProcNode(), because you have no idea what that actually does; on the other hand, you do need to put enough cycles between prefetch and fetch to see benefits, so maybe that's not so crazy. Perhaps to get more speed-up I'd need to consider dependencies along the lines you're describing, but also find a way to keep prefetch and insertion far enough apart to win. Hmm.

> I would bet this helps significantly even if there's no prefetch
> instruction - but using explicit prefetching might help further. Also
> allows us to increase branch costs, because we can amortize them over a
> few tuples.

Yes, I already observe that performance improves a little bit with my simple insert-queue patch even if you comment out the pg_prefetch_mem() call, and figured it was something about execution order at work there (though I didn't study the effect up close with perf etc, due to lack of PMC access on this host), but the prefetch apparently supplies most of the speed-up I saw. It stands to reason that hash joins should benefit from explicit prefetching (even though lots of pixels have been expended explaining that explicit prefetching is often a mistake, cf the Linux experience), since hash joins are basically cache miss machines par excellence, at least in the build phase with unique keys.

> > create unlogged table t as select generate_series(1, 1)::int i;
> > select pg_prewarm('t');
> > set work_mem='8GB';
> >
> > select count(*) from t t1 join t t2 using (i);
> >
> >            master    patched/N=1      patched/N=4
> > workers=0  89.808s   80.556s (+11%)   76.864s (+16%)
> > workers=2  27.010s   22.679s (+19%)   23.503s (+15%)
> > workers=4  16.728s   14.896s (+12%)   14.167s (+18%)
> >
> > Just an early experiment, but I thought it looked promising enough to share.
>
> Nice!
It's starting to look like prefetching of build + prefetching of probe + reordering-friendly code + icache-friendly tight loops could add up to some serious gains, but some architectural work is needed for much of that, hence my lower aim :-)

Other things I noticed on that hacking escapade: the patch generates PREFETCHT0 instructions with my compiler, but there are also "write" and "locality" arguments you can pass to __builtin_prefetch() to get PREFETCHW, plus variants encoding predictions about how valuable the data will be after the next access. Setting write=1 slowed down my initial tests, for reasons I don't fully understand, but I didn't look further once I realised you need -march= for that anyway. I didn't look into the
Re: Experimenting with hash join prefetch
Hi,

On 2020-02-04 01:48:49 +1300, Thomas Munro wrote:
> On Fri, Apr 12, 2019 at 4:23 PM Thomas Munro wrote:
> > ... if we also prefetch during
> > the build phase ...
>
> Here's an experimental patch to investigate just that part. I tried
> initiating a prefetch of the bucket just before we copy the tuple and
> then finally insert it, but it doesn't seem to be far enough apart (at
> least for small tuples), which is a shame because that'd be a one-line
> patch. So I made a little insert queue that prefetches and defers the
> insertion until N tuples later, and then I managed to get between 10%
> and 20% speed-up for contrived tests like this:

How much of the benefit here comes from the prefetching, and how much just from writing the code in a manner that allows for more out-of-order execution? Because there are no dependencies preventing execution of the next queued tuple anymore, I'd assume that this is a good part of what helps?

Code like this really should look something roughly like:

while (true)
    have_skew = False

    # get tuples
    for i in 0..batchsize:
        tuples[i] = ExecProcNode(outerNode);
        if tuples[i] == NULL:
            # have slowpath handle this
            break;

    # compute their hash values
    for i in 0..batchsize:
        hashvalues[i] = ExecHashGetHashValue(tuples[i])

    # check whether they go into skew buckets
    if have_skew_table:
        for i in 0..batchsize:
            skewbuckets[i] = ExecHashGetSkewBucket(tuples[i], hashvalues[i])
            if skewbuckets[i] != INVALID_SKEW_BUCKET_NO:
                have_skew = True
        if have_skew:
            # handle everything here
            continue

    # assume there are no skew tuples going forward, all handled above

    # compute bucket/batch for all tuples
    have_into_batchfile = False
    for i in 0..batchsize:
        hashbuckets[i] = ExecHashGetBucketAndBatch()
        if hashbuckets[i] != hashtable->curbatch:
            have_into_batchfile = True

    if have_into_batchfile:
        # slowpath
        continue

    # allocate all tuples
    for i in 0..batchsize:
        hjtuples[i] = alloc_mintuple(hashtuples[i])

    # and finally insert them
    for i in 0..batchsize:
        hjtuples[i].next = buckets[hashbuckets[i]]
        buckets[hashbuckets[i]] = hjtuples[i]

Obviously it's a bit more complicated in reality than this, but I do think that's where we have to go to make crucial parts like this faster (same with hash aggs, and a few other places).

I would bet this helps significantly even if there's no prefetch instruction - but using explicit prefetching might help further. It also allows us to increase branch costs, because we can amortize them over a few tuples.

> create unlogged table t as select generate_series(1, 1)::int i;
> select pg_prewarm('t');
> set work_mem='8GB';
>
> select count(*) from t t1 join t t2 using (i);
>
>            master    patched/N=1      patched/N=4
> workers=0  89.808s   80.556s (+11%)   76.864s (+16%)
> workers=2  27.010s   22.679s (+19%)   23.503s (+15%)
> workers=4  16.728s   14.896s (+12%)   14.167s (+18%)
>
> Just an early experiment, but I thought it looked promising enough to share.

Nice!

Greetings,

Andres Freund
Re: Experimenting with hash join prefetch
On Fri, Apr 12, 2019 at 4:23 PM Thomas Munro wrote:
> ... if we also prefetch during
> the build phase ...

Here's an experimental patch to investigate just that part. I tried initiating a prefetch of the bucket just before we copy the tuple and then finally insert it, but it doesn't seem to be far enough apart (at least for small tuples), which is a shame because that'd be a one-line patch. So I made a little insert queue that prefetches and defers the insertion until N tuples later, and then I managed to get between 10% and 20% speed-up for contrived tests like this:

create unlogged table t as select generate_series(1, 1)::int i;
select pg_prewarm('t');
set work_mem='8GB';

select count(*) from t t1 join t t2 using (i);

           master    patched/N=1      patched/N=4
workers=0  89.808s   80.556s (+11%)   76.864s (+16%)
workers=2  27.010s   22.679s (+19%)   23.503s (+15%)
workers=4  16.728s   14.896s (+12%)   14.167s (+18%)

Just an early experiment, but I thought it looked promising enough to share.

0001-Prefetch-cache-lines-while-building-hash-join-table.patch
Description: Binary data
Re: Experimenting with hash join prefetch
On Fri, Apr 12, 2019 at 1:35 AM Robert Haas wrote:
> It would be interesting to see how this does with moderately-long text
> keys, say 32 or 64 byte strings, and with actually-long text keys, say
> several kB, and then with gigantic text keys, say several MB. At some
> point the key probably gets large enough that computing the hash value
> for the next key evicts the current key from the relevant CPU cache,
> and if I had to guess, at that point prefetching will become a loser.
> But I don't know where that point is. If it turns out for example
> that this technique is a winner for pass-by-value datatypes and a
> loser for pass-by-reference datatypes, or that it's a winner always,
> or some sort of simple rule like that, awesome! But if it turns out
> that there's no simple rule that we can use to know whether it wins or
> loses, then that might make things a little tricky.

Ok, I ran the attached script on an E5-2695 v3 @ 2.30GHz with 32K of L1, 256K of L2, 30M of L3. I used shared_buffers=16GB and prewarmed all tables. The short version: very mixed results. For small hash tables it clearly hurts, for large ones it looks promising. Much more digging required to draw any conclusions.

It'd be nice to understand exactly how the hidden parameters work. Hash bucket array size vs hardware cache sizes must be a factor. Another is surely timing; there is an optimal time to begin prefetching (too soon and your line might be replaced by the time you need it, too late and you won't avoid stalling). Here, I am doing a ham-fisted prefetch of t+1's bucket header and not yet trying to prefetch the tuple itself, but the Intel/CMU paper[1] of course shows a deeper pipeline and talks about calibrating the prefetch distance for the hardware. As far as I can see, that's quite tricky in general, and complicated by our executor design: the number of cycles before the node runs again depends on the higher-level plan!
Previously I have heard icache-based arguments for why we should be able to emit more than one tuple at a time, and I suppose this is a new argument for that: it makes it actually plausible to calibrate the prefetch distance for "software-pipelined prefetch" techniques.

Anyway, here are the results for what they are worth:

Table r has 50,000,000 rows. Table s was tested with 3 different sizes. Both tables have the same layout: key (various types and sizes) + 0, 2 or 4 extra columns. I ran each test 3 times, and compared the worst (w) and median (m) times and computed the speed-up provided by the patch. The absolute times (not shown) were all in the range 9-40 seconds depending on the parameters.

           s=1,000               s=100,000             s=10,000,000
int        w=-10.86%, m=-11.03%  w=-8.56%, m=-9.17%    w=+16.79%, m=+19.63%
 + 2 cols  w=-14.19%, m=-16.97%  w=-6.59%, m=-7.89%    w=+15.88%, m=+16.81%
 + 4 cols  w=-17.14%, m=-14.34%  w=-10.01%, m=-8.85%   w=+37.38%, m=+24.01%
text(8)    w=-12.42%, m=-12.32%  w=-4.04%, m=-1.96%    w=+13.52%, m=+16.17%
 + 2 cols  w=-13.50%, m=-14.98%  w=-4.29%, m=-3.40%    w=+15.98%, m=+19.45%
 + 4 cols  w=-11.53%, m=-9.61%   w=-3.70%, m=-5.91%    w=+46.66%, m=+51.06%
text(16)   w=-11.46%, m=-9.71%   w=-4.85%, m=-3.86%    w=+16.47%, m=+22.10%
 + 2 cols  w=-13.97%, m=-14.55%  w=-7.08%, m=-6.07%    w=+20.50%, m=+21.77%
 + 4 cols  w=-9.72%, m=-11.31%   w=-1.03%, m=-2.55%    w=+8.25%, m=+12.21%
text(32)   w=-14.86%, m=-15.48%  w=-9.36%, m=-9.27%    w=+19.86%, m=+15.34%
 + 2 cols  w=-12.35%, m=-11.71%  w=-10.61%, m=-9.87%   w=+98.29%, m=+97.40%
 + 4 cols  w=-10.71%, m=-10.40%  w=-2.42%, m=-1.25%    w=-8.34%, m=-10.44%
text(64)   w=-9.45%, m=-11.36%   w=-13.94%, m=-11.42%  w=+9.44%, m=+9.57%
 + 2 cols  w=-12.47%, m=-13.17%  w=-9.60%, m=-6.61%    w=-4.69%, m=+10.06%
 + 4 cols  w=-9.47%, m=-12.20%   w=-5.60%, m=-3.55%    w=-15.91%, m=-23.29%

I'd expect the right-hand column to get a further speed-up (or slow-down) of around 1/5 the given numbers, if we also prefetch during the build phase (= s/r).
Maybe a 2-stage pipeline would help, though I'm starting to see the complexity of organising a perfectly primed memory pipeline... ie what this topic is really all about.

Well, that's enough learning-basic-facts-about-computers by trying to whack PostgreSQL with database literature for now. Looks like I'll have to work quite a bit harder to make something useful out of this. I think I see where some of the problems lie. I think being able to store multiple tuples in a slot (via TTS-implementation-specific means, as we see with the heap scan in my patch, and as I think hash join would want to do to emit multiple tuples before relinquishing control) and then look at them via ExecScrollSlot() and perhaps also consume them via ExecNextSlot() are promising ideas, but I've only scratched the surface.

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf

--
Thomas Munro
Re: Experimenting with hash join prefetch
On Wed, Apr 10, 2019 at 2:10 AM Thomas Munro wrote:
> Here is an example of times for a trivial join on my laptop. Note
> that this is prefetching only the probing phase, not while building
> which should also be possible. I didn't get around to trying deeper
> prefetch pipelines as discussed earlier, but those seemed to produce
> diminishing returns with hardcoded tests like in the earlier message.

It would be interesting to see how this does with moderately-long text keys, say 32 or 64 byte strings, and with actually-long text keys, say several kB, and then with gigantic text keys, say several MB. At some point the key probably gets large enough that computing the hash value for the next key evicts the current key from the relevant CPU cache, and if I had to guess, at that point prefetching will become a loser. But I don't know where that point is. If it turns out for example that this technique is a winner for pass-by-value datatypes and a loser for pass-by-reference datatypes, or that it's a winner always, or some sort of simple rule like that, awesome! But if it turns out that there's no simple rule that we can use to know whether it wins or loses, then that might make things a little tricky.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Experimenting with hash join prefetch
On Sun, Oct 14, 2018 at 11:11 PM Andrey Borodin wrote:
> > On 14 Oct 2018, at 9:18, Thomas Munro wrote:
> >
> > +       /* Prefetch the bucket for the next key */
> > +       uint32          next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
> > +       uint32          next_bucket = next_hash % hashtable->nbuckets;
> > +       __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);
>
> +1, I also think that we should use __builtin_prefetch these days (years,
> actually). Right after listening to Anastassia Ailamaki's (author of the
> referenced paper) talk at VLDB I proposed to do that for B-trees [0], but
> did not really pursue that idea.

The above was obviously just a hard-coded hack that "knew" the next key would be n + 1. I've been wondering how you might abstract real look-ahead with the shiny new TupleTableSlot design. Here's a concept I came up with: ExecScrollSlot(slot, 1) -> bool, to try to look ahead to the next tuple if possible. I suppose there could be a kind of scrolling that actually consumes tuples (for true batch-mode tuple processing in tight inner loops, for example hash table build), and scrolling that merely peeks ahead (what I'm describing so far). I'm keen to hear other ideas about how that could look, because I know that "vector" and "batch" mode tuple processing are ideas that people have been bouncing around for a while. Flame away.

POC patch attached. I never got around to figuring out why it breaks anti-joins (hence some regression test failures) or filling out various other important details (see commit message for a list), but I figured it was better on the mailing list than hiding in a drawer, anyway. Here is an example of times for a trivial join on my laptop. Note that this is prefetching only the probing phase, not while building, which should also be possible. I didn't get around to trying deeper prefetch pipelines as discussed earlier, but those seemed to produce diminishing returns with hardcoded tests like in the earlier message.
shared_buffers = '3GB'

create table r as select generate_series(1, 4000)::int i;
create table s as select generate_series(1, 1000)::int i;
analyze;
set max_parallel_workers_per_gather = 0;
set work_mem = '2GB';
select pg_prewarm('r'::regclass);
select pg_prewarm('s'::regclass);
select count(*) from r join s using (i);

Master:  00:14.230
Patched: 00:11.818

--
Thomas Munro
https://enterprisedb.com

From a352a861372aa0738746fce0b5a5dc962bb87d95 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 10 Apr 2019 15:19:23 +1200
Subject: [PATCH] Experimental ExecScrollSlot() for hash join prefetch.

Incomplet, inkorrect, proof-of-concept code only.

Problems with ExecScrollSlot():
* needs a configure test
* not preserving deform state when scrolling back and forth
* should ExecScrollSlot(n) be absolute or relative?
* how could we have a "consuming" mode, so that hash build could use a
  tight inner loop over all the tuples in a page?
* look-ahead size of just 1 for now!
* only works with page-mode BufferHeapTupleTableSlot with no key test
* ...

Problems with hash join prefetch:
* anti-joins are broken (!), see make check, haven't looked into why yet
* ignoring skew hash table
* not prefetching in hash build phase
* need parallel hash join support
* ...
Author: Thomas Munro
Discussion: https://postgr.es/m/flat/CA%2Bq6zcXg5-Rc4k0JY%2B7%3DgEDGWjCVp0X9t7JdnCuaAfeNmtTEZQ%40mail.gmail.com#35d34634efe8315587efd0b0da9775fc
---
 src/backend/access/heap/heapam.c         | 21 +-
 src/backend/access/heap/heapam_handler.c | 10 +++
 src/backend/executor/execTuples.c        | 36 +--
 src/backend/executor/nodeHash.c          | 37
 src/backend/executor/nodeHashjoin.c      |  4 +++
 src/include/access/heapam.h              |  1 +
 src/include/executor/hashjoin.h          |  5
 src/include/executor/nodeHash.h          |  6
 src/include/executor/tuptable.h          | 26 -
 9 files changed, 136 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a05b6a07ad..6e66acb688 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -292,6 +292,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_numblocks = InvalidBlockNumber;
 	scan->rs_inited = false;
 	scan->rs_ctup.t_data = NULL;
+	scan->rs_ntup.t_data = NULL;
 	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
@@ -493,6 +494,8 @@ heapgettup(HeapScanDesc scan,
 	int			linesleft;
 	ItemId		lpp;
 
+	scan->rs_ntup.t_data = NULL;
+
 	/*
 	 * calculate next starting lineoff, given scan direction
 	 */
@@ -807,6 +810,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	int			linesleft;
 	ItemId		lpp;
 
+	scan->rs_ntup.t_data = NULL;
+
 	/*
 	 * calculate next starting lineindex, given scan direction
 	 */
@@
Re: Experimenting with hash join prefetch
On Mon, Oct 15, 2018 at 12:16 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > On Sun, 14 Oct 2018 at 06:19, Thomas Munro wrote:
> > Cache-oblivious hash joins cause a lot of TLB and cache misses.
> > ...
> > (There is another class of cache-aware hash join algorithms that partition
> > carefully up front to avoid them; that's not us.)
>
> Just out of curiosity, can you please elaborate more on this part (with
> references)? I'm thinking about this topic for a while, and I'm wondering, if
> by another class you mean something like this [1], then even if it's not us
> today, are there any issues that prevent from experimenting in this area?

Hmm, I didn't mean the term-of-art "cache-oblivious" (Wikipedia definition: "an algorithm designed to take advantage of a CPU cache without having the size of the cache"), I meant not "cache-conscious" (we don't do anything at all to reduce cache misses, though obviously we could add prefetching to improve on that).

The distinction I'm making is between "no partition" hash join (what we have), where you don't have to do any work up front, but you pay for a lot of cache misses during building and probing, and "partition" hash join, notably "radix join" (what MonetDB has?), where you have a partition phase before the build phase that aims to break the data up into enough partitions so that the hash tables will fit better in cache, making the later phases faster. There seems to be some disagreement about which is best -- passing through the data first is expensive, but so are cache misses on every probe, and there are claims that the winner depends on skew and tuple size.

Here's some of the stuff I read/watched about this subject: https://15721.courses.cs.cmu.edu/spring2018/schedule.html#apr-04-2018 Add to that http://www.adms-conf.org/2017/camera-ready/Analyzing_In_Memory_Hash_Joins__Granularity_Matters.pdf. Skimming very quickly through the paper you posted, yeah, I mean exactly that stuff.
Specifically I was thinking of the radix join mentioned in background section 2.3. (I see also that the same authors wrote a paper "Cache-Oblivious Hash Joins" which appears to describe a refinement of radix that doesn't need to be parameterised for cache size.) Sure, we could always consider these ideas. I am not personally working on that; to me it looked very hard to build, for a feature so uncertain to produce better results! (Note: we do of course have some kind of partitioning called "batching" when work_mem runs out, but it's not a kind of partitioning that cares about reducing cache misses, so if I understood correctly it's still "no partition" as far as this discussion goes.) -- Thomas Munro http://www.enterprisedb.com
Re: Experimenting with hash join prefetch
> On Sun, 14 Oct 2018 at 06:19, Thomas Munro wrote:
> > Cache-oblivious hash joins cause a lot of TLB and cache misses.
> > ...
> > (There is another class of cache-aware hash join algorithms that partition
> > carefully up front to avoid them; that's not us.)

Just out of curiosity, can you please elaborate more on this part (with references)? I've been thinking about this topic for a while, and I'm wondering: if by another class you mean something like this [1], then even if it's not us today, are there any issues that prevent experimenting in this area?

[1]: https://www.cse.ust.hk/catalac/papers/coqp_tods08.pdf
Re: Experimenting with hash join prefetch
Hi, Thomas!

> On 14 Oct 2018, at 9:18, Thomas Munro wrote:
>
> +       /* Prefetch the bucket for the next key */
> +       uint32          next_hash = hash_uint32(DatumGetInt32(keyval) + 1);
> +       uint32          next_bucket = next_hash % hashtable->nbuckets;
> +       __builtin_prefetch(&hashtable->buckets.unshared[next_bucket]);

+1, I also think that we should use __builtin_prefetch these days (years, actually). Right after listening to Anastassia Ailamaki's (author of the referenced paper) talk at VLDB I proposed to do that for B-trees [0], but did not really pursue that idea.

[0] https://www.postgresql.org/message-id/3B774C9E-01E8-46A7-9642-7830DC1108F1%40yandex-team.ru