Re: [HACKERS] Range Merge Join v1
On Thu, Aug 31, 2017 at 1:52 AM, Jeff Davis <pg...@j-davis.com> wrote:
> Updated patch attached. Changelog:
>
> * Rebased
> * Changed MJCompare to return an enum as suggested, but it has 4
>   possible values rather than 3.
> * Added support for joining on contains or contained by (@> or <@) and
>   updated tests.

The current patch does not work well with multiple keys, and I believe
it's important to solve this because it will make the feature usable for
multi-dimension spatial joins.

The problem is this: the algorithm for a single key demands that the
input is sorted by (lower(r) NULLS FIRST, upper(r) NULLS LAST). That's
easy, because that's also the sort order for the range operator class,
so everything just works.

For multiple keys, the required order of the input is:

  lower(r1) NULLS FIRST, lower(r2) NULLS FIRST,
  upper(r1) NULLS LAST, upper(r2) NULLS LAST

But that can't match up with any opclass, because an opclass can only
order one attribute at a time. In this case, the lower bound of r2 is
more significant than the upper bound of r1.

It's easy enough to adapt the execution to make this work as long as the
input is properly sorted. The challenge is making the optimizer choose
the sort keys properly. I have tried a few approaches. The current
approach that I'm using is:

* have a new, special range operator family with no operator classes
* in check_mergejoinable(), detect the && operator and set the
  mergeopfamilies to contain only the special operator family
* in try_mergejoin_path(), convert the pathkeys that use that operator
  family into pathkeys over a "lower" expression over the same EC, and
  then add additional pathkeys for the "upper" expressions onto the end
  of the pathkey list

Any comments or alternative suggestions welcome. This will probably take
a few days at least, so I put the patch in "waiting on author" state.

Regards,
    Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
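To make the required ordering concrete, here is a minimal standalone C
sketch of the comparator the executor effectively needs for two range
keys. The RangeJoinKeys struct and cmp_double helper are invented for
illustration, bounds are flattened to doubles, and NULLS FIRST/LAST and
empty-range handling are omitted; the point is only the order of the
comparisons, in which lower(r2) is tested before upper(r1) -- something
no single-attribute opclass can express.

typedef struct
{
    double lower_r1;
    double lower_r2;
    double upper_r1;
    double upper_r2;
} RangeJoinKeys;

static int
cmp_double(double a, double b)
{
    return (a > b) - (a < b);
}

/* Sort order required for a two-key range merge join. */
static int
range_join_key_cmp(const RangeJoinKeys *a, const RangeJoinKeys *b)
{
    int         c;

    if ((c = cmp_double(a->lower_r1, b->lower_r1)) != 0)
        return c;
    if ((c = cmp_double(a->lower_r2, b->lower_r2)) != 0)
        return c;
    if ((c = cmp_double(a->upper_r1, b->upper_r1)) != 0)
        return c;
    return cmp_double(a->upper_r2, b->upper_r2);
}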
Re: [HACKERS] Range Merge Join v1
On Fri, Aug 25, 2017 at 10:19 AM, Alexander Kuzmenkov <a.kuzmen...@postgrespro.ru> wrote: > Hi Jeff, Hi, Thank you for the review and suggestions! > * At the moment, "mergejoinable clause" and "equality clause" mean the same > thing to the planner, and those clauses are used both to create equivalence > classes and to perform merge joins. If we start performing merge joins for > different kinds of clauses, such as comparison or range intersection, it > makes sense to separate these two meanings. I tried this in my patch and it > didn't require many changes. I use RestrictInfo.equivopfamilies to build > equivalence classes, and RestrictInfo.mergeopfamilies for mergejoinable > clauses. I like the idea. I will look in more detail and I can either change my patch or piggyback on yours. > * Semantically, MJCompare() is a function that determines whether you should > emit a join result or advance one of the pointers. This meaning is not > explicit in the code and is not well encapsulated: the function returns and > int and 'compareNoMatch' flag, and the calling code combines them in various > ways to derive the final result. This meaning can be made explicit by making > MJCompare return enum values {Join, NextInner, NextOuter}, and putting > inside it all the logic that decides whether we join or advance. > ExecMergeJoin looks intimidating already, and I think this change would help > readability. Again, you can find an illustration in my patch. I actually tried doing something similar in my patch, but there are four cases to consider in EXEC_MJ_SKIP_TEST: 1. Overlaps: mark and then join the tuples 2. left-of: SKIPOUTER_ADVANCE 3. right-of: SKIPINNER_ADVANCE 4. None of the above: mark and NEXTINNER However, you are right that the current code is ugly, and I should refactor into 4 explicit cases. I don't think I will call them by the same names as the join states, though, because there's not a direct mapping. Updated patch attached. Changelog: * Rebased * Changed MJCompare to return an enum as suggested, but it has 4 possible values rather than 3. * Added support for joining on contains or contained by (@> or <@) and updated tests. Regards, Jeff Davis diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml index 9557c16..84578a7 100644 *** a/doc/src/sgml/rangetypes.sgml --- b/doc/src/sgml/rangetypes.sgml *** *** 522,525 INSERT 0 1 --- 522,595 + + Range Join + + + range type + join + + + + A Range Join is a join where one of the join conditions is the range + overlaps operator (see ). In other + words, rather than returning rows where corresponding fields are equal, it + returns rows where the corresponding fields overlap. 
+ + + For instance, to find users who were active on a website at the same time + as they were using a mobile app, you might issue a query like: + + CREATE TABLE web_session( + web_session_id text primary key, + username text, + during tstzrange); + CREATE TABLE app_session( + app_session_id text primary key, + username text, + during tstzrange); + INSERT INTO web_session VALUES + ('0cc175b9c0f1b6a8', 'user1', '[2017-02-01 09:30, 2017-02-01 10:45)'), + ('92eb5ffee6ae2fec', 'user2', '[2017-02-01 09:30, 2017-02-01 10:45)'), + ('4a8a08f09d37b737', 'user3', '[2017-02-01 09:30, 2017-02-01 10:45)'); + INSERT INTO app_session VALUES + ('8277e0910d750195', 'user1', '[2017-02-01 10:30, 2017-02-01 11:45)'), + ('b448797616e091ad', 'user3', '[2017-02-01 09:00, 2017-02-01 11:00)'), + ('95649038408b5f33', 'user4', '[2017-02-01 09:30, 2017-02-01 10:45)'); + + SELECT w.username, +w.during * a.during AS both_during + FROM web_session w, app_session a + WHERE w.username = a.username + AND w.during && a.during; + username | both_during + --+- + user1| ["2017-02-01 10:30:00-08","2017-02-01 10:45:00-08") + user3| ["2017-02-01 09:30:00-08","2017-02-01 10:45:00-08") + (2 rows) + + + + In addition to the general execution strategies available, Postgres also + has special support for a variant of Merge Join called Range Merge Join: + + EXPLAIN (costs off) SELECT w.username, +w.during * a.during AS both_during + FROM web_session w, app_session a + WHERE w.username = a.username + AND w.during && a.during; + QUERY PLAN + -- + Range Merge Join +Merge Cond: ((w.during && a.during) AND (w.username = a.username)) +-> Sort + Sort Key: w.during, w.username + -> Seq Scan on web_session w
Re: [HACKERS] Range Merge Join v1
On Tue, May 30, 2017 at 11:44 PM, Andrew Borodin <boro...@octonica.com> wrote: > Hi, Jeff! Hi! > Sorry for being late. Actually, I had several unsuccessful attempts to > find something wrong with the patch. > Here's my review. > > in pathkey.c > > ecs = (EquivalenceClass **) palloc(nClauses * sizeof(EquivalenceClass *)); > scores = (int *) palloc(nClauses * sizeof(int)); > range_ecs = palloc(nClauses * sizeof(bool)); > > Third assignment has no cast. Will fix. > And I have few questions: > 1. Are there any types, which could benefit from Range Merge and are > not covered by this patch? I thought about this for a while, and the only thing I can think of are range joins that don't explicitly use range types. > 2. Can Range Merge handle merge of different ranges? Like int4range() > && int8range() ? Right now, there aren't even casts between range types. I think the best way to handle that at this point would be to add casts among the numeric ranges. There may be advantages to supporting any two ranges where the contained types are part of the same opfamily, but it seems a little early to add that complication. > My perf test script from the previous message was broken, here's fixed > one in the attachment. > > This patch implements feature, contains new tests and passes old > tests, is documented and spec compliant. I do not see any reason why > not mark it "Ready for committer". Great! I think there are a couple more things that could be done if we want to. Let me know if you think these things should be done now, or if they should be a separate patch later when the need arises: * Support for r1 @> r2 joins (join on "contains" rather than "overlaps"). * Better integration with the catalog so that users could add their own types that support range merge join. Thank you for the review. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Thu, Jun 1, 2017 at 11:25 AM, Andres Freund <and...@anarazel.de> wrote: > Secondly, I think that's to a significant degree caused by > the fact that in practice people way more often partition on types like > int4/int8/date/timestamp/uuid rather than text - there's rarely good > reasons to do the latter. Once we support more pushdowns to partitions, the only question is: what are your join keys and what are your grouping keys? Text is absolutely a normal join key or group key. Consider joins on a user ID or grouping by a model number. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Thu, Jun 1, 2017 at 10:59 AM, Robert Haas <robertmh...@gmail.com> wrote:
> 1. Are the new problems worse than the old ones?
>
> 2. What could we do about it?

Exactly the right questions.

1. For range partitioning, I think it's "yes, a little". As you point
out, there are already some weird edge cases -- the main way range
partitioning would make the problem worse is simply by having more
users.

But for hash partitioning I think the problems will become more
substantial. Different encodings, endian issues, etc. will be a headache
for someone, and potentially a bad day if they are urgently trying to
restore on a new machine. We should remember that not everyone is a
full-time postgres DBA, and users might reasonably think that the
default options to pg_dump[all] will give them a portable dump.

2. I basically see two approaches to solve the problem:

(a) Tom suggested at PGCon that we could have a GUC that automatically
causes inserts to the partition to be re-routed through the parent. We
could discuss whether to always route through the parent, or do a
recheck on the partition constraints and only reroute tuples that would
fail them. If the user gets into trouble, the worst that would happen is
a helpful error message telling them to set the GUC. I like this idea.

(b) I had suggested before that we could make the default text dump (and
the default output from pg_restore, for consistency) route through the
parent. Advanced users would dump with -Fc, and pg_restore would support
an option to do partition-wise loading. To me, this is simpler, but
users might forget to use (or not know about) the pg_restore option and
then it would load more slowly.

Also, the ship is sailing on range partitioning, so we might prefer
option (a) just to avoid making any changes. I am fine with either
option.

> 2. Add an option like --dump-partition-data-with-parent. I'm not sure
> who originally proposed this, but it seems that everybody likes it.
> What we disagree about is the degree to which it's sufficient. Jeff
> Davis thinks it doesn't go far enough: what if you have an old
> plain-format dump that you don't want to hand-edit, and the source
> database is no longer available? Most people involved in the
> unconference discussion of partitioning at PGCon seemed to feel that
> wasn't really something we should be worry about too much. I had been
> taking that position also, more or less because I don't see that there
> are better alternatives.

If the suggestions above are unacceptable, and we don't come up with
anything better, then of course we have to move on. I am worrying now
primarily because now is the best time to worry; I don't expect any
horrible outcome.

> 3. Implement portable hash functions (Jeff Davis or me, not sure
> which). Andres scoffed at this idea, but I still think it might have
> legs.

I think it reduces the problem, which has value, but it's hard to make
it rock-solid.

> make fast. Those two things also solve different parts of the
> problem; one is insulating the user from a difference in hardware
> architecture, while the other is insulating the user from a difference
> in user-selected settings. I think that the first of those things is
> more important than the second, because it's easier to change your
> settings than it is to change your hardware.

Good point.

Regards,
    Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Hash Functions
On Thursday, May 18, 2017, Robert Haas <robertmh...@gmail.com> wrote:
> My experience with this area has led
> me to give up on the idea of complete uniformity as impractical, and
> instead look at it from the perspective of "what do we absolutely have
> to ban in order for this to be sane?".

I could agree to something like that. Let's explore some of the
challenges there and potential solutions:

1. Dump/reload of hash partitioned data.

Falling back to restore-through-the-root seems like a reasonable answer
here. Moving to a different encoding is not an edge case, but it's not
common either, so a performance penalty seems acceptable. I'm not
immediately sure how we'd implement this in pg_dump/restore, so I'd feel
a little more comfortable if I saw a sketch.

2. Having a lot of hash partitions would be cumbersome.

The user would need to create and manage each partition, and try to do
global operations in a sane way. The normal case would probably involve
scripts to do things like add an index to all partitions, or a column.
Many partitions would also just pollute the namespace unless you
remember to put them in a separate schema (yes, it's easy, but most
people will still forget). Some syntax sugar would go a long way here.

3. The user would need to specify details they really don't care about
for each partition.

Things like "modulus 16, remainder 0", "modulus 16, remainder 1" are
tedious boilerplate. And if the user makes a mistake, then 1/16 of
inserts start failing. Probably would be caught during testing, but not
exactly a good user experience. I'm not thrilled about this, considering
that all the user really wants is 16 partitions, but it's not the end of
the world.

4. Detach is a foot-gun.

If you detach a partition, random inserts will start failing. Not
thrilled about this, but a hapless user would accept most of the blame
if they stumble over it. Another way of saying this is with hash
partitioning you really need the whole set for the table to be online at
all. But we can't really enforce that, because it would limit some of
the flexibility that you have in mind.

Stepping back, your approach might be closer to the general postgres
philosophy of allowing the user to assemble from spare parts first, then
a few releases later we offer some pre-built subassemblies, and a few
releases later we make the typical cases work out of the box. I'm fine
with it as long as we don't paint ourselves into a corner.

Of course we still have work to do on the hash functions. We should
solve at least the most glaring portability problems, and try to
harmonize the hash opfamilies. If you agree, I can put together a patch
or two.

Regards,
    Jeff Davis
Re: [HACKERS] Hash Functions
On Wed, May 17, 2017 at 11:35 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > I think the question is whether we are going to make a distinction between > logical partitions (where the data division rule makes some sense to the > user) and physical partitions (where it needn't). I think it might be > perfectly reasonable for those to behave differently. Agreed. To summarize my perspective: * hash partitioning offers a nice way to divide the data for later processing by parallel query * range partitioning is good for partition elimination (constraint_exclusion) and separating hot/cold data (e.g. partitioning on date) * both offer some maintenance benefits (e.g. reindex one partition at a time), though range partitioning seems like it offers better flexibility here in some cases I lean toward separating the concepts, but Robert is making some reasonable arguments and I could be convinced. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Wed, May 17, 2017 at 12:10 PM, Robert Haas <robertmh...@gmail.com> wrote: > 1. To handle dump-and-reload the way we partitioning does today, hash > functions would need to be portable across encodings. > 2. That's impractically difficult. > 3. So let's always load data through the top-parent. > 4. But that could fail due to e.g. a UNIQUE index on an individual > child, so let's try to prohibit all of the things that could be done > to an individual partition that could cause a reload failure. > 5. And then for good measure let's hide the existence of the partitions, too. That is one thread of logic, but out of the discussion also highlighted some of the consequences of treating hash partitions like range/list partitions. For instance, it makes little sense to have individual check constraints, indexes, permissions, etc. on a hash-partitioned table. It doesn't mean that we should necessarily forbid them, but it should make us question whether combining range and hash partitions is really the right design. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Tuesday, May 16, 2017, Robert Haas <robertmh...@gmail.com> wrote:
> I don't really find this a very practical design. If the table
> partitions are spread across different relfilenodes, then those
> relfilenodes have to have separate pg_class entries and separate
> indexes, and those indexes also need to have separate pg_class
> entries. Otherwise, nothing works. And if they do have separate
> pg_class entries, then the partitions have to have their own names,
> and likewise for their indexes, and a dump-and-reload has to preserve
> those names. If it doesn't, and those objects get new system-assigned
> names after the dump-and-reload, then dump restoration can fail when a
> system-assigned name collides with an existing name that is first
> mentioned later in the dump.

Why can't hash partitions be stored in tables the same way as we do
TOAST? That should take care of the naming problem.

> If Java has portable hash functions, why can't we?

Java standardizes on a particular Unicode encoding (UTF-16). Are you
suggesting that we do the same? Or is there another solution that I am
missing?

Regards,
    Jeff Davis
Re: [HACKERS] Hash Functions
On Mon, May 15, 2017 at 1:04 PM, David Fetter <da...@fetter.org> wrote:
> As the discussion has devolved here, it appears that there are, at
> least conceptually, two fundamentally different classes of partition:
> public, which is to say meaningful to DB clients, and "private", used
> for optimizations, but otherwise opaque to DB clients.
>
> Mashing those two cases together appears to cause more problems than
> it solves.

I concur at this point. I originally thought hash functions might be
made portable, but I think Tom and Andres showed that to be too
problematic -- the issue with different encodings is the real killer.

But I also believe hash partitioning is important and we shouldn't give
up on it yet. That means we need to have a concept of hash partitions
that's different from range/list partitioning. The terminology
"public"/"private" does not seem appropriate. Logical/physical or
external/internal might be better.

With hash partitioning:

* User only specifies the number of partitions of the parent table; does
  not specify individual partition properties (modulus, etc.)
* Dump/reload goes through the parent table (though we may provide
  options so pg_dump/restore can optimize this)
* We could provide syntax to adjust the number of partitions, which
  would be expensive but still useful sometimes.
* All DDL should be on the parent table, including check constraints,
  FKs, unique constraints, exclusion constraints, indexes, etc.
  - Unique and exclusion constraints would only be permitted if the keys
    are a superset of the partition keys.
  - FKs would only be permitted if the two tables' partition schemes
    match and the keys are members of the same hash opfamily (this could
    be relaxed slightly, but it gets a little confusing if so)
* No attach/detach of partitions
* All partitions have the same permissions
* Individual partitions would only be individually addressable for
  maintenance (like reindex and vacuum), but not for arbitrary queries
  - perhaps also COPY for bulk loading/dumping, in case we get clients
    smart enough to do their own hashing.

The only real downside is that it could surprise users -- why can I add
a CHECK constraint on my range-partitioned table but not the
hash-partitioned one? We should try to document this so users don't find
that out too far along. As long as they aren't surprised, I think users
will understand why these aren't quite the same concepts.

Regards,
    Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Sun, May 14, 2017 at 6:22 PM, Robert Haas <robertmh...@gmail.com> wrote: > You'd have to prohibit a heck of a lot more than that in order for > this to work 100% reliably. You'd have to prohibit CHECK constraints, > triggers, rules, RLS policies, and UNIQUE indexes, at the least. You > might be able to convince me that some of those things are useless > when applied to partitions, but wanting a CHECK constraint that > applies to only one partition seems pretty reasonable (e.g. CHECK that > records for older years are all in the 'inactive' state, or whatever). > I think getting this to work 100% reliably in all cases would require > an impractically large hammer. The more I think about it the more I think hash partitions are "semi-logical". A check constraint on a single partition of a range-partitioned table makes sense, but it doesn't make sense on a single partition of a hash-partitioned table. I think hash partitioning is mainly useful for parallel query (so the optimizer needs to know about it) and maintenance commands (as you listed in another email). But those don't need hash portability. FKs are a little more interesting, but I actually think they still work regardless of hash portability. If the two sides are in the same hash opfamily, they should hash the same and it's fine. And if they aren't, the FK has no hope of working regardless of hash portability. This would mean we need to reload through the root as Andres and others suggested, and disable a lot of logical partitioning capabilities. I'd be a little worried about what we do with attaching/detaching, though. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Sun, May 14, 2017 at 8:00 PM, Bruce Momjian <br...@momjian.us> wrote:
> Do we even know that floats are precise enough to determine the
> partition. For example, if you have 6.1, is it possible for
> that to be 5.999 on some systems? Are IEEE systems all the same for
> these values? I would say we should disallow any approximate data type
> for partitioning completely.

I'm inclined in this direction, as well. Hash partitioning is mostly
useful for things that are likely to be join keys or group keys, and
floats aren't. Same for complex user-defined types.

The real problem here is what Tom pointed out: that we would have
trouble hashing strings in an encoding-insensitive way. Strings are
useful as join/group keys, so it would be painful to not support them.

Perhaps there's some kind of compromise, like a pg_dump mode that
reloads through the root. Or maybe hash partitions are really a
"semi-logical" partitioning that the optimizer understands, but where
things like per-partition check constraints don't make sense.

Regards,
    Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Fri, May 12, 2017 at 12:38 PM, Robert Haas <robertmh...@gmail.com> wrote: > That is a good question. I think it basically amounts to this > question: is hash partitioning useful, and if so, for what? Two words: parallel query. To get parallelism, one of the best approaches is dividing the data, then doing as much work as possible before combining it again. If you have hash partitions on some key, then you can do partition-wise joins or partition-wise aggregation on that key in parallel with no synchronization/communication overhead (until the final result). You've taken postgres pretty far in this direction already; hash partitioning will take it one step further by allowing more pushdowns and lower sync/communication costs. Some of these things can be done with range partitioning, too, but see my other message here: https://www.postgresql.org/message-id/CAMp0ubfNMSGRvZh7N7TRzHHN5tz0ZeFP13Aq3sv6b0H37fdcPg%40mail.gmail.com Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Fri, May 12, 2017 at 11:45 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Forget hash partitioning. There's no law saying that that's a good > idea and we have to have it. With a different set of constraints, > maybe we could do it, but I think the existing design decisions have > basically locked it out --- and I doubt that hash partitioning is so > valuable that we should jettison other desirable properties to get it. A lot of the optimizations that can make use of hash partitioning could also make use of range partitioning. But let me defend hash partitioning: * hash partitioning requires fewer decisions by the user * naturally balances data and workload among partitions in most cases * easy to match with a degree of parallelism But with range partitioning, you can have situations where different tables have different distributions of data. If you partition to balance the data between partitions in both cases, then that makes partition-wise join a lot harder because the boundaries don't line up. If you make the boundaries line up to do partition-wise join, the partitions might have wildly different amounts of data in them. Either way, it makes parallelism harder. Even without considering joins, range partitioning could force you to make a choice between balancing the data and balancing the workload. If you are partitioning based on date, then a lot of the workload will be on more recent partitions. That's desirable sometimes (e.g. for vacuum) but not always desirable for parallelism. Hash partitioning doesn't have these issues and goes very nicely with parallel query. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Functions
On Fri, May 12, 2017 at 10:34 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Maintaining such a property for float8 (and the types that depend on it) > might be possible if you believe that nobody ever uses anything but IEEE > floats, but we've never allowed that as a hard assumption before. This is not such a big practical problem (for me at least) because hashing of floats is of dubious value. > Even architecture dependence isn't the whole scope of the problem. > Consider for example dumping a LATIN1-encoded database and trying > to reload it into a UTF8-encoded database. People will certainly > expect that to be possible, and do you want to guarantee that the > hash of a text value is encoding-independent? That is a major problem. In an ideal world, we could make that work with something like ucol_getSortKey(), but we don't require ICU, and we can't mix getSortKey() with strxfrm(), or even strxfrm() results from different platforms. I don't think it's correct to hash the code points, either, because strings may be considered equal in a locale even if the code points aren't identical. But I don't think postgres lives up to that standard currently. But hash partitioning is too valuable to give up on entirely. I think we should consider supporting a limited subset of types for now with something not based on the hash am. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
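The encoding problem is easy to demonstrate outside the database: the
same logical string has different byte representations in LATIN1 and
UTF-8, so any hash computed over the raw bytes differs. A small
standalone C illustration (FNV-1a is used here only to keep the example
self-contained; the conclusion holds for any byte-oriented hash):

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t
fnv1a(const unsigned char *data, size_t len)
{
    uint32_t    h = 2166136261U;

    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= 16777619U;
    }
    return h;
}

int
main(void)
{
    /* The string "café": 'é' is one byte in LATIN1, two bytes in UTF-8. */
    const unsigned char latin1[] = { 'c', 'a', 'f', 0xE9 };
    const unsigned char utf8[]   = { 'c', 'a', 'f', 0xC3, 0xA9 };

    printf("latin1: %08x\n", (unsigned) fnv1a(latin1, sizeof(latin1)));
    printf("utf8:   %08x\n", (unsigned) fnv1a(utf8, sizeof(utf8)));
    return 0;
}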
[HACKERS] Hash Functions
https://www.postgresql.org/message-id/camp0ubeo3fzzefie1vmc1ajkkrpxlnzqooaseu6o-c+...@mail.gmail.com

In that thread, I pointed out some important considerations for the hash
functions themselves. This is a follow-up, after I looked more
carefully.

1. The hash functions as they exist today aren't portable -- they can
return different results on different machines. That means using these
functions for hash partitioning would yield different contents for the
same partition on different architectures (and that's bad, considering
they are logical partitions and not some internal detail).

The core hashing algorithm is based on 32-bit chunks, so it's only
portable if the input is an int32 (or array of int32s). That's not true
for varchar (etc.) or numeric (which is based on an array of int16s).
There is a hack for int8, but see issue #2 below.

We could try to mark portable vs. unportable hash functions, but it
seems quite valuable to be able to logically partition on varchar, so I
think we should have some portable answer there. Another annoyance is
that we would have to assume container types are unportable, or make
them inherit the portability of the underlying type's hash function.

We could revert 26043592 (copied Tom because that was his patch) to make
hash_any() go back to being portable -- do we know what that speedup
actually was? Maybe the benefit is smaller on newer processors? Another
option is to try to do some combination of byteswapping and
word-at-a-time, which might be better than byte-at-a-time if the
byteswapping is done with a native instruction.

2. hashint8() is a little dubious. If the input is positive, it XORs the
high bits; if negative, it XORs the complement of the high bits. That
works for compatibility with hashint2/4, but it does not produce good
hash properties[1].

I prefer that we either (a) upcast int2/4 to int8 before hashing and
then hash the whole 64 bits; or (b) if the value is within range,
downcast to int4, otherwise hash the whole 64 bits.

3. We should downcast numeric to int4/8 before hashing if it's in range,
so that it can be a part of the same opfamily as int2/4/8.

4. text/varchar should use hashbpchar() so that they can be part of the
same opfamily. Trailing blanks seem unlikely to be significant for most
real-world data anyway.

5. For salts[2], I don't think it's too hard to support them in an
optional way. We just allow the function to be a two-argument function
with a default. Places that care about specifying the salt may do so if
the function has pronargs==2, otherwise it just gets the default value.

If we have salts, I don't think having 64-bit hashes is very important.
If we run out of bits, we can just salt the hash function differently
and get more hash bits. This is not urgent and I believe we should just
implement salts when and if some algorithm needs them.

Regards,
    Jeff Davis

[1] You can see a kind of mirroring in the hash outputs indicating bad
mixing:

postgres=# select hashint8((2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((-2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)

postgres=# select hashint8((-2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)

[2] Not sure I'm using the term "salt" properly. I really just mean a
way to affect the initial state, which I think is good enough for our
purposes.
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
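To illustrate option (b) from point 2 above, here is a standalone C
sketch. The hash32/hash64 mixers are placeholders invented for the
example (this is not the existing hashint8() and not a proposal for a
specific mixing function); the point is only that an int8 value which
fits in 32 bits hashes exactly like the corresponding int4, so
int2/int4/int8 can share an opfamily without the XOR-the-high-bits hack.

#include <stdint.h>

/* Stand-in mixers; any reasonable 32-bit and 64-bit hash would do. */
static uint32_t
hash32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x85ebca6bU;
    x ^= x >> 13;
    return x;
}

static uint32_t
hash64(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 29;
    return (uint32_t) (x ^ (x >> 32));
}

/*
 * Sketch of option (b): a value within int4 range hashes as an int4;
 * anything wider gets all 64 bits mixed.
 */
static uint32_t
hash_int8_sketch(int64_t v)
{
    if (v >= INT32_MIN && v <= INT32_MAX)
        return hash32((uint32_t) (int32_t) v);
    return hash64((uint64_t) v);
}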
Re: [HACKERS] [POC] hash partitioning
On Tue, May 2, 2017 at 7:01 PM, Robert Haas <robertmh...@gmail.com> wrote: > On Tue, May 2, 2017 at 9:01 PM, Jeff Davis <pg...@j-davis.com> wrote: >> 1. Consider a partition-wise join of two hash-partitioned tables. If >> that's a hash join, and we just use the hash opclass, we immediately >> lose some useful bits of the hash function. Same for hash aggregation >> where the grouping key is the partition key. > > Hmm, that could be a problem in some cases. I think there's probably > much less of a problem if the modulus isn't a power of two? That's true, but it's awkward to describe that to users. And I think most people would be inclined to use a power-of-two number of partitions, perhaps coming from other systems. >> To fix this, I think we need to include a salt in the hash API. Each >> level of hashing can choose a random salt. > > Do you mean that we'd salt partitioning hashing differently from > grouping hashing which would be salted different from aggregation > hashing which, I suppose, would be salted differently from hash index > hashing? Yes. The way I think about it is that choosing a new random salt is an easy way to get a new hash function. > Or do you mean that you'd have to specify a salt when > creating a hash-partitioned table, and make sure it's the same across > all compatibly partitioned tables you might want to hash-join? That > latter sounds unappealing. I don't see a reason to expose the salt to users. If we found a reason in the future, we could, but it would create all of the problems you are thinking about. >> 2. Consider a partition-wise join where the join keys are varchar(10) >> and char(10). We can't do that join if we just use the existing hash >> strategy, because 'foo' = 'foo ' should match, but those values >> have different hashes when using the standard hash opclass. ... > You're basically describing what a hash opfamily already does, except > that we don't have a single opfamily that covers both varchar(10) and > char(10), nor do we have one that covers both int and numeric. We > have one that covers int2, int4, and int8, though. If somebody wanted > to make the ones you're suggesting, there's nothing preventing it, > although I'm not sure exactly how we'd encourage people to start using > the new one and deprecating the old one. We don't seem to have a good > infrastructure for that. OK. I will propose new hash opfamilies for varchar/bpchar/text, int2/4/8/numeric, and timestamptz/date. One approach is to promote the narrower type to the wider type, and then hash. The problem is that would substantially slow down the hashing of integers, so then we'd need to use one hash opfamily for partitioning and one for hashjoin, and it gets messy. The other approach is to check if the wider type is within the domain of the narrower type, and if so, *demote* the value and then hash. For instance, '4.2'::numeric would hash the same as it does today, but '4'::numeric would hash as an int2. I prefer this approach, and int8 already does something resembling it. For timestamptz/date, it's not nearly as important. >> My opinion is that we should work on this hashing infrastructure >> first, and then support the DDL. If we get the hash functions right, >> that frees us up to create better plans, with better push-downs, which >> will be good for parallel query. > > I am opposed to linking the fate of this patch to multiple > independent, possibly large, possibly difficult, possibly > controversial enhancements to the hashing mechanism. 
It's a little early in the v11 cycle to be having this argument. Really what I'm saying is that a small effort now may save us a lot of headache later. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [POC] hash partitioning
On Tue, Feb 28, 2017 at 6:33 AM, Yugo Nagata <nag...@sraoss.co.jp> wrote:
> In this patch, user can specify a hash function USING. However,
> we might need default hash functions which are useful and
> proper for hash partitioning.

I suggest that we consider the hash functions more carefully. This is
(effectively) an on-disk format so it can't be changed easily later.

1. Consider a partition-wise join of two hash-partitioned tables. If
that's a hash join, and we just use the hash opclass, we immediately
lose some useful bits of the hash function. Same for hash aggregation
where the grouping key is the partition key.

To fix this, I think we need to include a salt in the hash API. Each
level of hashing can choose a random salt.

2. Consider a partition-wise join where the join keys are varchar(10)
and char(10). We can't do that join if we just use the existing hash
strategy, because 'foo' = 'foo ' should match, but those values have
different hashes when using the standard hash opclass.

To fix this, we need to be smarter about normalizing values at a logical
level before hashing. We can take this to varying degrees, perhaps even
normalizing an integer to a numeric before hashing so that you can do a
cross-type join on int=numeric.

Furthermore, we need catalog metadata to indicate which hash functions
are suitable for which cross-type comparisons. Or, to put it the other
way, which typecasts preserve the partitioning.

3. We might want to use a hash function that is a little slower but more
resistant to collisions. We may even want to use a 64-bit hash.

My opinion is that we should work on this hashing infrastructure first,
and then support the DDL. If we get the hash functions right, that frees
us up to create better plans, with better push-downs, which will be good
for parallel query.

Regards,
    Jeff Davis

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
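As a sketch of what "a salt in the hash API" could mean at the C level:
each consumer (hash partitioning, hash join, hash aggregation) passes
its own seed, so the same key does not produce correlated bucket choices
at every level. The function name, the salt constants, and the use of
FNV-1a below are all invented for illustration; this is not an existing
PostgreSQL API.

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical salted hash entry point: same data, different salt gives
 * effectively a different hash function.
 */
static uint32_t
hash_bytes_salted(const unsigned char *data, size_t len, uint32_t salt)
{
    uint32_t    h = 2166136261U ^ salt; /* fold the salt into the initial state */

    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= 16777619U;
    }
    return h;
}

/*
 * Each level picks its own salt, so a partition-wise hash join does not
 * lose bits to the hash partitioning already done on the same key.
 */
enum
{
    PARTITION_SALT = 0x7f3a1c95,
    HASHJOIN_SALT = 0x2b1e9d47,
    HASHAGG_SALT = 0x5cd8a3e1
};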
Re: [HACKERS] Range Merge Join v1
On Thu, Apr 20, 2017 at 2:05 AM, Rafia Sabih <rafia.sa...@enterprisedb.com> wrote: > Looks like an interesting idea, however, in an attempt to test this patch I > found following error when compiling, > selfuncs.c: In function ‘mergejoinscansel’: > selfuncs.c:2901:12: error: ‘op_strategy’ undeclared (first use in this > function) >_strategy, > Looks like filterdiff destroyed my patch from git. Attaching unified version against master 3820c63d. Thanks! Jeff Davis diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml index 9557c16..84578a7 100644 --- a/doc/src/sgml/rangetypes.sgml +++ b/doc/src/sgml/rangetypes.sgml @@ -522,4 +522,74 @@ INSERT 0 1 + + Range Join + + +range type +join + + + +A Range Join is a join where one of the join conditions is the range +overlaps operator (see ). In other +words, rather than returning rows where corresponding fields are equal, it +returns rows where the corresponding fields overlap. + + +For instance, to find users who were active on a website at the same time +as they were using a mobile app, you might issue a query like: + +CREATE TABLE web_session( +web_session_id text primary key, +username text, +during tstzrange); +CREATE TABLE app_session( +app_session_id text primary key, +username text, +during tstzrange); +INSERT INTO web_session VALUES +('0cc175b9c0f1b6a8', 'user1', '[2017-02-01 09:30, 2017-02-01 10:45)'), +('92eb5ffee6ae2fec', 'user2', '[2017-02-01 09:30, 2017-02-01 10:45)'), +('4a8a08f09d37b737', 'user3', '[2017-02-01 09:30, 2017-02-01 10:45)'); +INSERT INTO app_session VALUES +('8277e0910d750195', 'user1', '[2017-02-01 10:30, 2017-02-01 11:45)'), +('b448797616e091ad', 'user3', '[2017-02-01 09:00, 2017-02-01 11:00)'), +('95649038408b5f33', 'user4', '[2017-02-01 09:30, 2017-02-01 10:45)'); + +SELECT w.username, + w.during * a.during AS both_during +FROM web_session w, app_session a +WHERE w.username = a.username +AND w.during && a.during; + username | both_during +--+- + user1| ["2017-02-01 10:30:00-08","2017-02-01 10:45:00-08") + user3| ["2017-02-01 09:30:00-08","2017-02-01 10:45:00-08") +(2 rows) + + + +In addition to the general execution strategies available, Postgres also +has special support for a variant of Merge Join called Range Merge Join: + + EXPLAIN (costs off) SELECT w.username, + w.during * a.during AS both_during +FROM web_session w, app_session a +WHERE w.username = a.username +AND w.during && a.during; + QUERY PLAN +-- + Range Merge Join + Merge Cond: ((w.during && a.during) AND (w.username = a.username)) + -> Sort + Sort Key: w.during, w.username + -> Seq Scan on web_session w + -> Sort + Sort Key: a.during, a.username + -> Seq Scan on app_session a +(8 rows) + + + diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 9359d0a..1010668 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -909,7 +909,11 @@ ExplainNode(PlanState *planstate, List *ancestors, pname = sname = "Nested Loop"; break; case T_MergeJoin: - pname = "Merge"; /* "Join" gets added by jointype switch */ + if (((MergeJoin*)plan)->mergeRangeJoin) +pname = "Range Merge"; /* "Join" gets added by jointype switch */ + else +pname = "Merge"; /* "Join" gets added by jointype switch */ + sname = "Merge Join"; break; case T_HashJoin: diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c index 8c483bf..bd312e6 100644 --- a/src/backend/executor/nodeMergejoin.c +++ b/src/backend/executor/nodeMergejoin.c @@ -89,14 +89,67 @@ * proceed to another state. 
This state is stored in the node's * execution state information and is preserved across calls to * ExecMergeJoin. -cim 10/31/89 + * + * RANGE MERGE JOIN + * + * Range Merge Join is a generalization of merge join. When a join + * predicate involves the range overlaps operator (&&), a merge join + * becomes a range join. + * + * The input ranges must be ordered by (lower-bound, upper-bound), which + * is the ordinary range_ops operator class. MJCompare must not simply + * use the operator class's comparison semantics though; instead it + * follows these rules: + * + * - return 0 if the outer range overlaps the inner range + * - return <0 if the outer range is strictly left-of the inner range + * - return >0 if the outer ran
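The comparison rule above can be restated in a few lines of standalone
C. SimpleRange with half-open [lower, upper) double bounds is an
assumption for illustration only; the patch's MJCompare works on range
datums and also has to handle empty ranges, infinite bounds, and bound
inclusivity.

typedef struct
{
    double lower;
    double upper;
} SimpleRange;

/*
 *  0: the ranges overlap, so the pair can be joined
 * <0: the outer range is strictly left of the inner range; advance outer
 * >0: the outer range is strictly right of the inner range; advance inner
 */
static int
range_merge_compare(SimpleRange outer, SimpleRange inner)
{
    if (outer.upper <= inner.lower)
        return -1;
    if (outer.lower >= inner.upper)
        return 1;
    return 0;
}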
Re: [HACKERS] Merge join for GiST
On Wed, Apr 12, 2017 at 10:44 PM, Andrew Borodin <boro...@octonica.com> wrote: >> How do you think we should proceed? Which projects do you think should >> eventually be in core, versus which are fine as extensions? > > Some points in favor of Range joins via nbtree: My patch doesn't require indexes, it can sort the input (and the 5X improvement that I got included the effort to sort). In fact, I expect using sort will be more common than maintaining btree indexes on a range column. > 1. It's more efficient than B-tree over GiST > 2. It is already in a patch form > > Point against: > 1. Particular implementation is somewhat leaked abstraction. Inside > the planner, you check not for capabilities of operator and type, but > for specific op and type. But I do not know how to fix this. It's not a specific type, it's the "anyrange" type, so you can define new range types to take advantage of range merge join. I can drive the planner changes from the catalog rather than hard-coded OIDs if we think range merge join is a good solution. > So, here is my opinion: if we can inside the planner understand that > join condition involves range specifics (for all available ranges) and > do Range Merge Join, then this is preferred solution. Yes, I can do that. > Yes, Spatial Join, when available, will provide roughly the same scan > performance. But B-trees are more well known to users new to > PostgreSQL, and B-trees are faster. I don't quite follow. I don't think any of these proposals uses btree, right? Range merge join doesn't need any index, your proposal uses gist, and PgSphere's crossmatch uses gist. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
On Tue, Apr 11, 2017 at 8:35 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote: > On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis <pg...@j-davis.com> wrote: >> Do you have a sense of how this might compare with range merge join? > > > If you have GiST indexes over ranges for both sides of join, then this > method could be used for range join. Hence, it could be compared with range > merge join. > However, particular implementation in pgsphere uses hardcoded datatypes and > operations. > Thus, for range join we need either generalized version of GiST-based join > or special implementation for ranges. Alexander, Andrew, How do you think we should proceed? Which projects do you think should eventually be in core, versus which are fine as extensions? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge join for GiST
On Tue, Apr 11, 2017 at 2:17 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote: > FYI, I've implemented this algorithm for pgsphere. See following branch. > https://github.com/akorotkov/pgsphere/tree/experimental > It's implemented as crossmatch() function which takes as arguments names of > two indexes over spoint and maximum distance (it checks not overlapping but > proximity of points). This function returns setof pairs of TIDs. > > Later, Dmitry Ivanov made it a custom scan node. > https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode > > You also can find some experimental evaluation here: > http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf Do you have a sense of how this might compare with range merge join? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Range Merge Join v1
On Tue, Apr 11, 2017 at 12:17 AM, Jeff Davis <pg...@j-davis.com> wrote: > Version 2 attached. Fixed a few issues, expanded tests, added docs. It looks like the CF app only listed my perf test script. Re-attaching rangejoin-v2.patch so that it appears in the CF app. Identical to other rangejoin-v2.patch. Regards, Jeff Davis diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml index 9557c16..84578a7 100644 *** a/doc/src/sgml/rangetypes.sgml --- b/doc/src/sgml/rangetypes.sgml *** *** 522,525 INSERT 0 1 --- 522,595 + + Range Join + + + range type + join + + + + A Range Join is a join where one of the join conditions is the range + overlaps operator (see ). In other + words, rather than returning rows where corresponding fields are equal, it + returns rows where the corresponding fields overlap. + + + For instance, to find users who were active on a website at the same time + as they were using a mobile app, you might issue a query like: + + CREATE TABLE web_session( + web_session_id text primary key, + username text, + during tstzrange); + CREATE TABLE app_session( + app_session_id text primary key, + username text, + during tstzrange); + INSERT INTO web_session VALUES + ('0cc175b9c0f1b6a8', 'user1', '[2017-02-01 09:30, 2017-02-01 10:45)'), + ('92eb5ffee6ae2fec', 'user2', '[2017-02-01 09:30, 2017-02-01 10:45)'), + ('4a8a08f09d37b737', 'user3', '[2017-02-01 09:30, 2017-02-01 10:45)'); + INSERT INTO app_session VALUES + ('8277e0910d750195', 'user1', '[2017-02-01 10:30, 2017-02-01 11:45)'), + ('b448797616e091ad', 'user3', '[2017-02-01 09:00, 2017-02-01 11:00)'), + ('95649038408b5f33', 'user4', '[2017-02-01 09:30, 2017-02-01 10:45)'); + + SELECT w.username, +w.during * a.during AS both_during + FROM web_session w, app_session a + WHERE w.username = a.username + AND w.during && a.during; + username | both_during + --+- + user1| ["2017-02-01 10:30:00-08","2017-02-01 10:45:00-08") + user3| ["2017-02-01 09:30:00-08","2017-02-01 10:45:00-08") + (2 rows) + + + + In addition to the general execution strategies available, Postgres also + has special support for a variant of Merge Join called Range Merge Join: + + EXPLAIN (costs off) SELECT w.username, +w.during * a.during AS both_during + FROM web_session w, app_session a + WHERE w.username = a.username + AND w.during && a.during; + QUERY PLAN + -- + Range Merge Join +Merge Cond: ((w.during && a.during) AND (w.username = a.username)) +-> Sort + Sort Key: w.during, w.username + -> Seq Scan on web_session w +-> Sort + Sort Key: a.during, a.username + -> Seq Scan on app_session a + (8 rows) + + + diff --git a/src/backend/commands/eindex 9359d0a..1010668 100644 *** a/src/backend/commands/explain.c --- b/src/backend/commands/explain.c *** *** 909,915 ExplainNode(PlanState *planstate, List *ancestors, pname = sname = "Nested Loop"; break; case T_MergeJoin: ! pname = "Merge"; /* "Join" gets added by jointype switch */ sname = "Merge Join"; break; case T_HashJoin: --- 909,919 pname = sname = "Nested Loop"; break; case T_MergeJoin: ! if (((MergeJoin*)plan)->mergeRangeJoin) ! pname = "Range Merge"; /* "Join" gets added by jointype switch */ ! else ! pname = "Merge"; /* "Join" gets added by jointype switch */ ! sname = "Merge Join"; break; case T_HashJoin: diff --git a/src/backend/executor/nodindex 8c483bf..bd312e6 100644 *** a/src/backend/executor/nodeMergejoin.c --- b/src/backend/executor/nodeMergejoin.c *** *** 89,102 --- 89,155 * proceed to another state. 
This state is stored in the node's * execution state information and is preserved across calls to * ExecMergeJoin. -cim 10/31/89 + * + * RANGE MERGE JOIN + * + * Range Merge Join is a generalization of merge join. When a join + * predicate involves the range overlaps operator (&&), a merge join + * becomes a range join. + * + * The input ranges must be ordered by (lower-bound, upper-bound), which + * is the ordinary range_ops operator class. MJCompare must not simply + * use the operator class's comparison semantics though; instead it + * follows these rules: + * + * - return 0 if the outer range overlaps the inner range + * - return <0 if the outer range is
Re: [HACKERS] Range Merge Join v1
Version 2 attached. Fixed a few issues, expanded tests, added docs. A simple performance test (script attached) shows about a 5X improvement when comparing against a nested loop with an inner index-only scan over a gist index. Even better, this doesn't require an index, so it will work even if the input relations are subqueries. Regards, Jeff Davis On Thu, Apr 6, 2017 at 1:43 AM, Jeff Davis <pg...@j-davis.com> wrote: > > Example: > > > Find different people using the same website at the same time: > > create table session(sessionid text, username text, during tstzrange); > SELECT s1.username, s2.username, s1.during * s2.during > FROM session s1, session s2 > WHERE s1.during && s2.during AND s1.username < s2.username > > > = > Brief summary of previous discussion: > = > > http://www.postgresql.org/message-id/1334554850.10878.52.camel@jdavis > > - can indexes solve it already (parameterized paths, etc.)? > - planner complexity (track path keys, mergejoinable equality, etc.) > - spatial join algorithm choice > - compare range merge join against existing indexes to see if any > indexing approach is viable > > == > Externals: > == > > No new syntax or other externals. Just improved execution strategy for > joining ranges using the overlaps operator (&&). > > New syntax is possible, but I don't see a strong reason to support > special range join syntax at this time. > > = > Optimizer Design > = > > I took the path of least resistance, so if I am doing something wrong > please let me know. I hardwired it to look for the range overlaps > operator when checking if it could do a merge join, and then add > range_ops to restrictinfo->mergeopfamilies. > > Then, I have to suppress adding it as an equivalence class, because > overlaps is not equality. It still adds single-member ECs to > restrictinfo->right_ec and left_ec, but (I think) that's OK. > > Also, I have to prevent generating paths for right/full join, because > the range join algorithm can't (currently) handle those. > > = > Costing > = > > Needs more consideration. Seems to do reasonable things in the few > examples I tried. > > = > Executor Design > = > > See detailed comments in nodeMergejoin.c > > = > Performance > = > > Seems much better than index nest loop join when the join is not > selective. I will post detailed numbers soon. > > If no index is available, range merge join is the only reasonable way > to execute a range join. For instance, if the inner is not a leaf in > the plan. > > = > Alternatives: > = > > It was suggested that I approach it as a general spatial-join > problem. That has merit, but I rejected it for now because the > algorithm that I think has the most promise is range-oriented > anyway. By that I mean we would need to extract all of the dimensions > into ranges first, and then perform the join. With that in mind, it > makes sense to implement range joins first; and then later provide the > APIs to get at the spatial dimensions so that we can implement other > spatial joins as range joins. > > Another suggestion was to base it on hash join, but I never understood > that proposal and it didn't seem to map very well to a spatial join. > > Yet another suggestion was to use some kind of temporary index. Some > brief experiments I did indicated that it would be fairly slow (though > more investigation might be useful here). Also, it doesn't provide any > alternative to the nestloop-with-inner-index we already offer at the > leaf level today. 
> > Regards, > Jeff Davis perf.sql Description: application/sql diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml index 9557c16..84578a7 100644 *** a/doc/src/sgml/rangetypes.sgml --- b/doc/src/sgml/rangetypes.sgml *** *** 522,525 INSERT 0 1 --- 522,595 + + Range Join + + + range type + join + + + + A Range Join is a join where one of the join conditions is the range + overlaps operator (see ). In other + words, rather than returning rows where corresponding fields are equal, it + returns rows where the corresponding fields overlap. + + + For instance, to find users who were active on a website at the same time + as they were using a mobile app, you might issue a query like: + + CREATE TABLE web_session( + web_session_id text primary key, + username text, + during tstzrange); + CREATE TABLE app_session( + app_ses
[HACKERS] Range Merge Join v1
Example: Find different people using the same website at the same time: create table session(sessionid text, username text, during tstzrange); SELECT s1.username, s2.username, s1.during * s2.during FROM session s1, session s2 WHERE s1.during && s2.during AND s1.username < s2.username = Brief summary of previous discussion: = http://www.postgresql.org/message-id/1334554850.10878.52.camel@jdavis - can indexes solve it already (parameterized paths, etc.)? - planner complexity (track path keys, mergejoinable equality, etc.) - spatial join algorithm choice - compare range merge join against existing indexes to see if any indexing approach is viable == Externals: == No new syntax or other externals. Just improved execution strategy for joining ranges using the overlaps operator (&&). New syntax is possible, but I don't see a strong reason to support special range join syntax at this time. = Optimizer Design = I took the path of least resistance, so if I am doing something wrong please let me know. I hardwired it to look for the range overlaps operator when checking if it could do a merge join, and then add range_ops to restrictinfo->mergeopfamilies. Then, I have to suppress adding it as an equivalence class, because overlaps is not equality. It still adds single-member ECs to restrictinfo->right_ec and left_ec, but (I think) that's OK. Also, I have to prevent generating paths for right/full join, because the range join algorithm can't (currently) handle those. = Costing = Needs more consideration. Seems to do reasonable things in the few examples I tried. = Executor Design = See detailed comments in nodeMergejoin.c = Performance = Seems much better than index nest loop join when the join is not selective. I will post detailed numbers soon. If no index is available, range merge join is the only reasonable way to execute a range join. For instance, if the inner is not a leaf in the plan. = Alternatives: = It was suggested that I approach it as a general spatial-join problem. That has merit, but I rejected it for now because the algorithm that I think has the most promise is range-oriented anyway. By that I mean we would need to extract all of the dimensions into ranges first, and then perform the join. With that in mind, it makes sense to implement range joins first; and then later provide the APIs to get at the spatial dimensions so that we can implement other spatial joins as range joins. Another suggestion was to base it on hash join, but I never understood that proposal and it didn't seem to map very well to a spatial join. Yet another suggestion was to use some kind of temporary index. Some brief experiments I did indicated that it would be fairly slow (though more investigation might be useful here). Also, it doesn't provide any alternative to the nestloop-with-inner-index we already offer at the leaf level today. Regards, Jeff Davis diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index a18ab43..1110c1e 100644 *** a/src/backend/commands/explain.c --- b/src/backend/commands/explain.c *** *** 909,915 ExplainNode(PlanState *planstate, List *ancestors, pname = sname = "Nested Loop"; break; case T_MergeJoin: ! pname = "Merge"; /* "Join" gets added by jointype switch */ sname = "Merge Join"; break; case T_HashJoin: --- 909,919 pname = sname = "Nested Loop"; break; case T_MergeJoin: ! if (((MergeJoin*)plan)->mergeRangeJoin) ! pname = "Range Merge"; /* "Join" gets added by jointype switch */ ! else ! pname = "Merge"; /* "Join" gets added by jointype switch */ ! 
sname = "Merge Join"; break; case T_HashJoin: diff --git a/src/backend/executor/nodindex 62784af..f3375c3 100644 *** a/src/backend/executor/nodeMergejoin.c --- b/src/backend/executor/nodeMergejoin.c *** *** 89,102 --- 89,151 * proceed to another state. This state is stored in the node's * execution state information and is preserved across calls to * ExecMergeJoin. -cim 10/31/89 + * + * RANGE MERGE JOIN + * + * Range Merge Join is a generalization of merge join. When a join + * predicate involves the range overlaps operator (&&), a merge join + * becomes a range join. + * + * The input ranges must be ordered by (lower-bound, upper-bound), which + * is the ordinary range_ops operator class. MJCompare must not simply + * use the operator class's comparison semantics though; instead it + * follows these rules: + * + * - return 0 if the outer range overlaps the inner range + * - retu
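[Editorial note] The comparison rules that the truncated comment above is describing can be sketched in a few lines of standalone C. This is illustrative only -- toy types and names, not the executor's actual symbols -- and it assumes ranges with an inclusive lower bound and exclusive upper bound; the real code must also handle empty ranges, NULLs, and infinite bounds:

/* Illustrative only; not the executor's actual types or symbols. */
typedef struct SimpleRange
{
    int     lower;          /* inclusive lower bound */
    int     upper;          /* exclusive upper bound */
} SimpleRange;

typedef enum RangeMergeAction
{
    RM_OVERLAP,             /* ranges overlap: emit a joined row */
    RM_OUTER_LEFT_OF,       /* outer entirely before inner: advance outer */
    RM_OUTER_RIGHT_OF       /* outer entirely after inner: advance inner */
} RangeMergeAction;

/*
 * With both inputs sorted by (lower, upper), decide what to do with the
 * current outer/inner pair.
 */
static RangeMergeAction
range_merge_compare(SimpleRange outer, SimpleRange inner)
{
    if (outer.upper <= inner.lower)
        return RM_OUTER_LEFT_OF;
    if (inner.upper <= outer.lower)
        return RM_OUTER_RIGHT_OF;
    return RM_OVERLAP;
}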
[HACKERS] New expression evaluator and indirect jumps
Andres, Thank you for your great work on the expression evaluator: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755 I was looking at the dispatch code, and it goes to significant effort (using computed goto) to generate many indirect jumps instead of just one. The reasoning is that CPUs can do better branch prediction when they can predict separately for each of the indirect jumps. But the paper here: https://hal.inria.fr/hal-01100647/document claims that it's not really needed on newer CPUs because they are better at branch prediction. I skimmed it, and if I understand correctly, modern branch predictors use some history, so they can predict based on the instructions executed before they got to the indirect jump. I tried looking through the discussion on this list, but most of it seemed to revolve around which compilers generated the assembly we wanted rather than how much it actually improved performance. Can someone please point me to the numbers? Do they refute the conclusions in the paper, or are we concerned about a wider range of processors? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
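[Editorial note] For anyone unfamiliar with the technique under discussion, here is a minimal, self-contained illustration of the two dispatch styles. These are toy opcodes, not PostgreSQL's expression evaluator; the computed-goto ("labels as values") form is a GCC/Clang extension:

typedef enum { OP_ADD, OP_SUB, OP_DONE } OpCode;

static int
run_switch(const OpCode *prog, int acc)
{
    for (int i = 0;; i++)
    {
        switch (prog[i])            /* one shared indirect jump */
        {
            case OP_ADD:  acc += 1; break;
            case OP_SUB:  acc -= 1; break;
            case OP_DONE: return acc;
        }
    }
}

static int
run_goto(const OpCode *prog, int acc)
{
    /* GCC/Clang "labels as values" extension */
    static void *dispatch[] = { &&do_add, &&do_sub, &&do_done };
    int         i = 0;

    goto *dispatch[prog[i]];

do_add:
    acc += 1;
    goto *dispatch[prog[++i]];      /* a distinct indirect jump per opcode, */
do_sub:
    acc -= 1;
    goto *dispatch[prog[++i]];      /* so the predictor can learn each site */
do_done:
    return acc;
}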
Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree
On Sun, Jan 22, 2017 at 10:32 PM, Jeff Davis <pg...@j-davis.com> wrote: > On Sat, Jan 21, 2017 at 4:25 AM, Andrew Borodin <boro...@octonica.com> wrote: > One idea I had that might be simpler is to use a two-stage page > delete. The first stage would remove the link from the parent and mark > the page deleted, but leave the right link intact and prevent > recycling. The second stage would follow the chain of right links > along each level, removing the right links to deleted pages and > freeing the page to be recycled. Do you think this approach is viable as a simplification? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree
On Tue, Jan 24, 2017 at 3:18 AM, Andrew Borodin <boro...@octonica.com> wrote: > Technically, approach of locking a subtree is not novel. Lehman and > Yao focused on "that any process for manipulating the tree uses only a > small (constant) number of locks at any time." We are locking unknown > and possibly large amount of pages. By the way, can you show me where the Lehman and Yao paper addresses page recycling? It says that one approach is to allow fewer than K entries on a leaf node; presumably as few as zero. But it doesn't seem to show how to remove all references to the page and recycle it in a new place in the tree. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree
On Mon, Jan 23, 2017 at 1:55 AM, Andrew Borodin <boro...@octonica.com> wrote: > Proposed patch, on a contrary, simplifies code. There is more code > deleted than added. It does not affect WAL, changes nothing in page > layout. I appreciate that it is fewer lines of code. My hesitation comes from a couple areas: 1. I haven't seen the approach before of finding the parent of any empty subtree. While that sounds like a reasonable thing to do, it worries me to invent new approaches in an area as well-studied as btrees. The problem is less that it's too complex, and more that it's different. Future developers looking at this code will expect it to be either very simple or very standard, because it's just a posting list. 2. It relies on searches to only start from the very leftmost page (correct?). If someone wants to optimize the intersection of two posting trees later, they might break it if they don't understand the assumptions in this patch. 3. This adds a heap allocation, and copies the contents of the page, and then releases the pin and lock. GinStepRight is careful to avoid doing that (it locks the target before releasing the page containing the pointer). I don't see a problem in your patch, but again, we are breaking an assumption that future developers might make. Your patch solves a real problem (a 90-second stall is clearly not good) and I don't want to dismiss that. But I'd like to consider some alternatives that may not have these downsides. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
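[Editorial note] Point 3 above refers to the usual "lock coupling" rule: acquire the target page before releasing the page that points to it. A toy illustration using plain pthreads (not GIN's buffer/pin machinery) might look like this:

#include <pthread.h>

typedef struct DemoPage
{
    pthread_rwlock_t   lock;
    struct DemoPage   *right;       /* right sibling */
} DemoPage;

/*
 * Step right without ever being caught holding no lock at all.  Assumes the
 * caller already holds current->lock and that current->right is non-NULL.
 */
static DemoPage *
demo_step_right(DemoPage *current)
{
    DemoPage *next = current->right;

    pthread_rwlock_rdlock(&next->lock);     /* grab the target first... */
    pthread_rwlock_unlock(&current->lock);  /* ...then release the source */
    return next;
}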
Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree
On Sat, Jan 21, 2017 at 4:25 AM, Andrew Borodin <boro...@octonica.com> wrote: > Hello Jeff! > >>Review of the code itself: > How do you think, is there anything else to improve in that patch or > we can mark it as 'Ready for committer'? > > I've checked the concurrency algorithm with original work of Lehman > and Yao on B-link-tree. For me everything looks fine (safe, deadlock > free and not increased probability of livelock due to > LockBufferForCleanup call). Hi, I looked at the patch some more. I have some reservations about the complexity and novelty of the approach. I don't think I've seen an approach quite like this one before. It seems like the main reason you are using this approach is because the btree structure was too simple (no left links); is that right? My gut is telling me we should either leave it simple, or use a real concurrent btree implementation. One idea I had that might be simpler is to use a two-stage page delete. The first stage would remove the link from the parent and mark the page deleted, but leave the right link intact and prevent recycling. The second stage would follow the chain of right links along each level, removing the right links to deleted pages and freeing the page to be recycled. Another idea is to partition the posting tree so that the root page actually points to K posting trees. Scans would need to merge them, but it would allow inserts to progress in one tree while the other is being vacuumed. I think this would give good concurrency even for K=2. I just had this idea now, so I didn't think it through very well. What do you think? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
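[Editorial note] A rough, self-contained sketch of what the *second* stage of the two-stage delete idea might look like. This is a toy model only: pages on one level chained by right links, with stage one (removing the parent's downlink and marking the page deleted) assumed to have already run, and all locking and WAL ignored:

#include <stdbool.h>
#include <stddef.h>

typedef struct DemoPage
{
    struct DemoPage *right;     /* right sibling on this level */
    bool             deleted;   /* marked deleted by stage one */
} DemoPage;

/* Hypothetical hook; a real tree would put the page on a free list. */
static void recycle_page(DemoPage *page) { (void) page; }

static void
stage_two_sweep_level(DemoPage *leftmost)
{
    DemoPage *cur = leftmost;

    while (cur != NULL)
    {
        DemoPage *next = cur->right;

        if (next != NULL && next->deleted)
        {
            cur->right = next->right;   /* nobody can reach 'next' any more */
            recycle_page(next);
            /* stay on 'cur'; its new right sibling may also be deleted */
        }
        else
            cur = next;
    }
}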
[HACKERS] Review: GIN non-intrusive vacuum of posting tree
https://www.postgresql.org/message-id/CAJEAwVE4UAmm8fr%2BNW8XTnKV6M--ACoNhL3ES8yoKL2sKhbaiw%40mail.gmail.com Let me re-summarize what's been done here to make sure I understand: Each key in GIN has its own posting tree, which is itself a btree, holding all of the tuples that contain that key. This posting tree is really just a set of tuples, and searches always scan an entire posting tree (because all the tuples contain the key, so all are a potential match). Currently, the cleanup process requires blocking all reads and writes in the posting list. I assume this is only a problem when a key is common enough that its posting list is quite large (how large?). This patch makes a couple changes to improve concurrency: 1. Vacuum leaves first, without removing any pages. Instead, just keep track of pages which are completely empty (more generally, subtrees that contain empty pages). 2. Free the empty pages in those subtrees. That requires blocking reads and writes to that subtree, but not the entire posting list. Blocking writes is easy: acquiring a cleanup lock on the root of any subtree always blocks any writes to that subtree, because the write keeps a pin on all pages in that path. Blocking reads is accomplished by locking the page to be deleted, then locking/unlocking the page one left of that (which may be outside the subtree?). Maybe a basic question, but why is the posting list a btree at all, and not, say a doubly-linked list? Review of the code itself: * Nice and simple, with a net line count of only +45. * Remove commented-out code. * In the README, "leaf-to-right" should be left-to-right; and "takinf" should be "taking". * When you say the leftmost page is never deleted; do you mean the leftmost of the entire posting tree, or leftmost of the subtree that you are removing pages from? * In order to keep out concurrent reads, you need to lock/unlock the left page while holding exclusive lock on the page being deleted, but I didn't see how that happens exactly in the code. Where does that happen? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
On Fri, May 20, 2016 at 1:41 PM, David G. Johnston <david.g.johns...@gmail.com> wrote: > How does the relatively new FILTER clause play into this, if at all? My interpretation of the standard is that FILTER is not allowable for a window function, and IGNORE|RESPECT NULLS is not allowable for an ordinary aggregate. So if we support IGNORE|RESPECT NULLS for anything other than a window function, we have to come up with our own semantics. > We already have "STRICT" for deciding whether a function processes nulls. > Wouldn't this need to exist on the "CREATE AGGREGATE" STRICT defines behavior at DDL time. I was suggesting that we might want a DDL-time flag to indicate whether a function can make use of the query-time IGNORE|RESPECT NULLS option. In other words, most functions wouldn't know what to do with IGNORE|RESPECT NULLS, but perhaps some would if we allowed them the option. Perhaps I didn't understand your point? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls
Old thread link: http://www.postgresql.org/message-id/CA+=vxna5_n1q5q5okxc0aqnndbo2ru6gvw+86wk+onsunjd...@mail.gmail.com On Thu, Apr 14, 2016 at 1:29 PM, Stephen Frost <sfr...@snowman.net> wrote: > Jeff > > (Reviving an old thread for 2014...) > > Would you have time to work on this for 9.7..? I came across a > real-world use case for exactly this capability and was sorely > disappointed to discover we didn't support it even though there had been > discussion for years on it, which quite a few interested parties. > > I'll take a look at reviewing your last version of the patch but wanted > to get an idea of if you still had interest and might be able to help. > > Thanks! > > Stephen There are actually quite a few issues remaining here. First, I think the syntax is still implemented in a bad way. Right now it's part of the OVER clause, and the IGNORE NULLS gets put into the frame options. It doesn't match the way the spec defines the grammar, and I don't see how it really makes sense that it's a part of the frame options or the window object at all. I believe that it should be a part of the FuncCall, and end up in the FuncExpr, so the flag can be read when the function is called without exposing frame options (which we don't want to do). I think the reason it was done as a part of the window object is so that it could save state between calls for the purpose of optimization, but I think that can be done using fn_extra. This change should remove a lot of the complexity about trying to share window definitions, etc. Second, we need a way to tell which functions support IGNORE NULLS and which do not. The grammar as implemented in the patch seems to allow it for any function with an OVER clause (which can include ordinary aggregates). Then the parse analysis hard-codes that only LEAD and LAG accept IGNORE NULLS, and throws an error otherwise. Neither of these things seem right. I think we need a need catalog support to say whether a function honors IGNORE|RESPECT NULLS or not, which means we also need support in CREATE FUNCTION. I think the execution is pretty good, except that (a) we need to keep the state in fn_extra rather than the winstate; and (b) we should get rid of the bitmaps and just do a naive scan unless we really think non-constant offsets will be important. We can always optimize more later. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
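[Editorial note] As a point of reference for the execution discussion above, the "naive scan" for lag(value, offset) IGNORE NULLS can be modelled in a few lines. A plain row array stands in for the window-function API here; caching state in fn_extra and non-constant offsets are deliberately omitted:

#include <stdbool.h>

typedef struct DemoRow
{
    bool    isnull;
    int     value;
} DemoRow;

/* Returns false when no qualifying row exists (the result would be null). */
static bool
demo_lag_ignore_nulls(const DemoRow *rows, int current, int offset, int *result)
{
    int     seen = 0;
    int     pos;

    for (pos = current - 1; pos >= 0; pos--)
    {
        if (rows[pos].isnull)
            continue;               /* IGNORE NULLS: the row doesn't count */
        if (++seen == offset)
        {
            *result = rows[pos].value;
            return true;
        }
    }
    return false;
}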
Re: [HACKERS] PATCH: numeric timestamp in log_line_prefix
On Mon, 2015-09-07 at 18:28 -0300, Alvaro Herrera wrote: > I noticed %t, but I don't think we care since the precision is so poor. > Making m and n work in unison seems enough. I think it would be > reasonably simple to handle %t in the same way, but I'm not sure we > care. OK. > I think the extra ugliness is warranted, since it's not THAT much > additional ugliness, and not doing it could be considered a regression; > apparently strftime can be slower even than snprintf, so doing it twice > per log message might be excessive overhead. Patch attached. Please take a quick look. Regards, Jeff Davis *** a/src/backend/utils/error/elog.c --- b/src/backend/utils/error/elog.c *** *** 143,152 static int errordata_stack_depth = -1; /* index of topmost active frame */ static int recursion_depth = 0; /* to detect actual recursion */ ! /* buffers for formatted timestamps that might be used by both ! * log_line_prefix and csv logs. */ #define FORMATTED_TS_LEN 128 static char formatted_start_time[FORMATTED_TS_LEN]; static char formatted_log_time[FORMATTED_TS_LEN]; --- 143,156 static int recursion_depth = 0; /* to detect actual recursion */ ! /* ! * Saved timeval and buffers for formatted timestamps that might be used by ! * both log_line_prefix and csv logs. */ + static struct timeval saved_timeval; + static boolsaved_timeval_set = false; + #define FORMATTED_TS_LEN 128 static char formatted_start_time[FORMATTED_TS_LEN]; static char formatted_log_time[FORMATTED_TS_LEN]; *** *** 2195,2206 write_console(const char *line, int len) static void setup_formatted_log_time(void) { - struct timeval tv; pg_time_t stamp_time; char msbuf[8]; ! gettimeofday(, NULL); ! stamp_time = (pg_time_t) tv.tv_sec; /* * Note: we expect that guc.c will ensure that log_timezone is set up (at --- 2199,2214 static void setup_formatted_log_time(void) { pg_time_t stamp_time; char msbuf[8]; ! if (!saved_timeval_set) ! { ! gettimeofday(_timeval, NULL); ! saved_timeval_set = true; ! } ! ! stamp_time = (pg_time_t) saved_timeval.tv_sec; /* * Note: we expect that guc.c will ensure that log_timezone is set up (at *** *** 2213,2219 setup_formatted_log_time(void) pg_localtime(_time, log_timezone)); /* 'paste' milliseconds into place... */ ! sprintf(msbuf, ".%03d", (int) (tv.tv_usec / 1000)); memcpy(formatted_log_time + 19, msbuf, 4); } --- 2221,2227 pg_localtime(_time, log_timezone)); /* 'paste' milliseconds into place... */ ! sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_usec / 1000)); memcpy(formatted_log_time + 19, msbuf, 4); } *** *** 2440,2450 log_line_prefix(StringInfo buf, ErrorData *edata) break; case 'n': { - struct timeval tv; char strfbuf[128]; ! gettimeofday(, NULL); ! sprintf(strfbuf, "%ld.%03d", tv.tv_sec, (int)(tv.tv_usec / 1000)); if (padding != 0) appendStringInfo(buf, "%*s", padding, strfbuf); --- 2448,2463 break; case 'n': { char strfbuf[128]; ! if (!saved_timeval_set) ! { ! gettimeofday(_timeval, NULL); ! saved_timeval_set = true; ! } ! ! sprintf(strfbuf, "%ld.%03d", saved_timeval.tv_sec, ! (int)(saved_timeval.tv_usec / 1000)); if (padding != 0) appendStringInfo(buf, "%*s", padding, strfbuf); *** *** 2825,2830 send_message_to_server_log(ErrorData *edata) --- 2838,2844 initStringInfo(); + saved_timeval_set = false; formatted_log_time[0] = '\0'; log_line_prefix(, edata); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: numeric timestamp in log_line_prefix
On Mon, 2015-09-07 at 17:47 -0300, Alvaro Herrera wrote: > Jeff Davis wrote: > > On Sun, 2015-03-22 at 19:47 +0100, Andres Freund wrote: > > > On 2015-03-22 00:47:12 +0100, Tomas Vondra wrote: > > > > from time to time I need to correlate PostgreSQL logs to other logs, > > > > containing numeric timestamps - a prime example of that is pgbench. With > > > > %t and %m that's not quite trivial, because of timezones etc. > > > > > > I have a hard time seing this is sufficient cause for adding more format > > > codes. They're not free runtime and documentation wise. -0.5 from me. > > > > I am about to commit this patch (the one that adds a single extra > > option). Although nothing is free, the cost seems very low, and at least > > three people have expressed interest in this patch. > > Did you see my message at > http://www.postgresql.org/message-id/20150907192719.GS2912@alvherre.pgsql ? I guess we were looking at this at the same time, and I missed your message and committed it. I will investigate now. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: numeric timestamp in log_line_prefix
> I wonder about this separate gettimeofday() call. We already have > formatted_log_time which is used for CSV logs and freeform log lines > (stderr/syslog); if we introduce a separate gettimeofday() call here, > and the user has %n in freeform log and CSV logging is active, the > timings will diverge occasionally. > > Maybe I'm worrying over nothing, because really what use case is there > for having the two log formats enabled at the same time? Yet somebody > went some lengths to ensure they are consistent; I think we should do > likewise here. We now have three time-related options[1]: t, m, and n; and they each acquire the time independently. Are you suggesting that we make all three consistent, or only m and n? The cleanest fix would be for the global variable to only hold the timeval, and then format it once for the CSV log (always 'm' format) and once for the regular log ('m', 'n', or 't'). If the regular log uses 'm', that would be some wasted cycles formatting it the same way twice. Is it worth a little extra ugliness to cache both the timeval and the formatted string? Regards, Jeff Davis [1] As of minutes ago, after I missed your message by a few minutes. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
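[Editorial note] A minimal sketch of the "cache both the timeval and the formatted string" idea being weighed above, written against plain libc rather than the actual elog.c code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static struct timeval saved_tv;
static bool           saved_tv_set = false;
static char           formatted_ts[64];

/* Called once at the start of each log message. */
static void
reset_log_timestamp(void)
{
    saved_tv_set = false;
    formatted_ts[0] = '\0';
}

/* All log destinations that need a formatted time share this result. */
static const char *
get_formatted_timestamp(void)
{
    if (!saved_tv_set)
    {
        gettimeofday(&saved_tv, NULL);
        saved_tv_set = true;
    }
    if (formatted_ts[0] == '\0')
    {
        struct tm *tm = localtime(&saved_tv.tv_sec);

        strftime(formatted_ts, sizeof(formatted_ts), "%Y-%m-%d %H:%M:%S", tm);
        /* paste milliseconds on, as the real code does */
        snprintf(formatted_ts + strlen(formatted_ts), 8, ".%03d",
                 (int) (saved_tv.tv_usec / 1000));
    }
    return formatted_ts;
}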
Re: [HACKERS] PATCH: numeric timestamp in log_line_prefix
On Sun, 2015-03-22 at 19:47 +0100, Andres Freund wrote: > On 2015-03-22 00:47:12 +0100, Tomas Vondra wrote: > > from time to time I need to correlate PostgreSQL logs to other logs, > > containing numeric timestamps - a prime example of that is pgbench. With > > %t and %m that's not quite trivial, because of timezones etc. > > I have a hard time seing this is sufficient cause for adding more format > codes. They're not free runtime and documentation wise. -0.5 from me. I am about to commit this patch (the one that adds a single extra option). Although nothing is free, the cost seems very low, and at least three people have expressed interest in this patch. What tips the balance is that we expose the unix epoch in the pgbench logs, as Tomas points out. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Accounting v11
On Fri, 2015-07-17 at 15:52 +1200, David Rowley wrote: > Should we mark the patch as returned with feedback in the commitfest app then? I believe the memory accounting patch has been rejected. Instead, the work will be done in the HashAgg patch. Thank you for the review! Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Accounting v11
On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote: > tuplesort.c does its own accounting, and TBH that seems like the right thing to do here, too. > The difficulty is, I think, that some transition functions use an internal data type for the transition state, which might not be a single palloc'd chunk. > But since we can't spill those aggregates to disk *anyway*, that doesn't really matter. So would it be acceptable to just ignore the memory consumed by internal, or come up with some heuristic? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Accounting v11
On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote: > I think a heuristic would be more suited here and ignoring memory consumption for internal types means that we are not making the memory accounting useful for a set of usecases. OK, I will drop this patch and proceed with the HashAgg patch, with a heuristic for internal types. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Memory Accounting v11
On Thu, 2015-07-09 at 14:41 +1200, David Rowley wrote: Are you going to implement this? or are you happy with the single level context tracking is the right thing? I'm going to mark the patch as waiting on author for now. Attached a version of the patch that does multi-level tracking, v12. It does this is a simpler way, just like an earlier version of the patch: it simply traverses up the chain and adds to each parent in a total_allocated field. The idea you mentioned is a possible optimization of this idea, but it only makes sense if I'm able to detect a difference between v11 (single-level) and v12 (multi-level). I tried Robert's test[1] again and I didn't see a difference on my workstation (technically, v12 came out the fastest, which means I'm just seeing noise anyway), so I can't evaluate whether your idea will improve things. After talking with a few people at PGCon, small noisy differences in CPU timings can appear for almost any tweak to the code, and aren't necessarily cause for major concern. Regards, Jeff Davis [1] pgbench -i -s 300, then do the following 3 times each for master, v11, and v12, and take the median of logged traces: start server; set trace_sort=on; reindex index pgbench_accounts_pkey; stop server *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 242,247 typedef struct AllocChunkData --- 242,249 #define AllocChunkGetPointer(chk) \ ((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ)) + static void update_alloc(MemoryContext context, int64 size); + /* * These functions implement the MemoryContext API for AllocSet contexts. */ *** *** 500,505 AllocSetContextCreate(MemoryContext parent, --- 502,510 errdetail(Failed while creating memory context \%s\., name))); } + + update_alloc((MemoryContext) set, blksize); + block-aset = set; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; *** *** 590,595 AllocSetReset(MemoryContext context) --- 595,602 else { /* Normal case, release the block */ + update_alloc(context, -(block-endptr - ((char*) block))); + #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block-freeptr - ((char *) block)); #endif *** *** 635,640 AllocSetDelete(MemoryContext context) --- 642,654 #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block-freeptr - ((char *) block)); #endif + + /* + * Parent is already unlinked from this context, so we can't use + * update_alloc(). The caller should have already subtracted from the + * parents' total_allocated. 
+ */ + free(block); block = next; } *** *** 672,677 AllocSetAlloc(MemoryContext context, Size size) --- 686,694 block = (AllocBlock) malloc(blksize); if (block == NULL) return NULL; + + update_alloc(context, blksize); + block-aset = set; block-freeptr = block-endptr = ((char *) block) + blksize; *** *** 861,866 AllocSetAlloc(MemoryContext context, Size size) --- 878,885 if (block == NULL) return NULL; + update_alloc(context, blksize); + block-aset = set; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; *** *** 964,969 AllocSetFree(MemoryContext context, void *pointer) --- 983,991 set-blocks = block-next; else prevblock-next = block-next; + + update_alloc(context, -(block-endptr - ((char*) block))); + #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block-freeptr - ((char *) block)); #endif *** *** 1077,1082 AllocSetRealloc(MemoryContext context, void *pointer, Size size) --- 1099,1105 AllocBlock prevblock = NULL; Size chksize; Size blksize; + Size oldblksize; while (block != NULL) { *** *** 1094,1102 AllocSetRealloc(MemoryContext context, void *pointer, Size size) --- 1117,1130 /* Do the realloc */ chksize = MAXALIGN(size); blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; + oldblksize = block-endptr - ((char *)block); + block = (AllocBlock) realloc(block, blksize); if (block == NULL) return NULL; + + update_alloc(context, blksize - oldblksize); + block-freeptr = block-endptr = ((char *) block) + blksize; /* Update pointers since block has likely been moved */ *** *** 1361,1363 AllocSetCheck(MemoryContext context) --- 1389,1407 } #endif /* MEMORY_CONTEXT_CHECKING */ + + /* + * Update self_allocated and total_allocated for the context. Size can be + * positive or negative. + */ + void + update_alloc(MemoryContext context, int64 size) + { + MemoryContext parent; + + context-self_allocated += size; + + /* update total_allocated for self and all parents */ + for (parent = context; parent != NULL
Re: [HACKERS] Memory Accounting v11
On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote: > of performance decrease anywhere. I'm just getting too much variation in the test results to get any sort of idea. That was my experience as well. Thank you for taking a look. > My main question here is: How sure are you that none of your intended use cases will need to call MemoryContextMemAllocated with recurse = true? > I'd imagine if you ever have to use MemoryContextMemAllocated(ctx, true) then we'll all be sitting around with our stopwatches out to see if the call incurs too much of a performance penalty. I don't expect HashAgg to be using many memory contexts. It did with array_agg, but that's been changed, and was a bad pattern anyway (one context per group is quite wasteful). If it turns out to be slow, we can call it less frequently (e.g. extrapolate growth rate to determine when to check). > Shouldn't you be setting oldblksize after the if? I'd imagine the realloc() failure is rare, but it just seems better to do it just before it's required and only when required. I don't think that's safe. Realloc can return an entirely new pointer, so the endptr would be pointing outside of the allocation. I'd have to hang on to the original pointer before the realloc, so I'd need an extra variable anyway. > Here there's a double assignment of child. Will fix. > I might be mistaken here, but can't you just set context->mem_allocated = 0; after that loop? Or maybe it would be an improvement to only do the decrement if MEMORY_CONTEXT_CHECKING is defined, then Assert() that mem_allocated == 0 after the loop... OK. Why not just always Assert that? > One other thing that I don't remember seeing discussed would be to just store a List in each context which would store all of the mem_allocated's that need to be updated during each allocation and deallocation of memory. The list would be setup once when the context is created. When a child context is setup we'd just loop up the parent context chain and list_concat() all of the parent's lists onto the child's lists. Would this method add too much overhead when a context is deleted? Has this been explored already? That's a good idea, as it would be faster than recursing. Also, the number of parents can't change, so we can just make an array, and that would be quite fast to update. Unless I'm missing something, this sounds like a nice solution. It would require more space in MemoryContextData, but that might not be a problem. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
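[Editorial note] The array variant suggested at the end of that message could look roughly like this. It is a simplified model, not the real MemoryContextData; the point is only that the per-allocation update becomes a flat loop over a fixed ancestor array, while querying any context's total stays O(1):

#include <stdint.h>
#include <stdlib.h>

typedef struct AcctContext
{
    int64_t               total_allocated;  /* own blocks plus all descendants */
    int                   nancestors;
    struct AcctContext  **ancestors;        /* parent, grandparent, ..., root */
} AcctContext;

/* Built once at creation time; a context's chain of parents never changes. */
static void
acct_init(AcctContext *ctx, AcctContext *parent)
{
    int     i;

    ctx->total_allocated = 0;
    ctx->nancestors = (parent == NULL) ? 0 : parent->nancestors + 1;
    ctx->ancestors = malloc(sizeof(AcctContext *) * (ctx->nancestors + 1));
    for (i = 0; i < ctx->nancestors; i++)
        ctx->ancestors[i] = (i == 0) ? parent : parent->ancestors[i - 1];
}

/* Called for each block malloc/free; size is negative when freeing. */
static void
acct_update(AcctContext *ctx, int64_t size)
{
    int     i;

    ctx->total_allocated += size;
    for (i = 0; i < ctx->nancestors; i++)
        ctx->ancestors[i]->total_allocated += size;
}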
Re: [HACKERS] Parallel Seq Scan
On Tue, 2015-07-07 at 09:27 +0530, Amit Kapila wrote: > I am not sure how many blocks difference could be considered okay for deviation? In my testing (a long time ago) deviations of tens of blocks didn't show a problem. However, an assumption of the sync scan work was that the CPU is processing faster than the IO system; whereas the parallel scan patch assumes that the IO system is faster than a single core. So perhaps the features are incompatible after all. Only testing will say for sure. Then again, syncscans are designed in such a way that they are unlikely to hurt in any situation. Even if the scans diverge (or never converge in the first place), it shouldn't be worse than starting at block zero every time. I'd prefer to leave syncscans intact for parallel scans unless you find a reasonable situation where they perform worse. This shouldn't add any complexity to the patch (if it does, let me know). Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Mon, 2015-07-06 at 10:37 +0530, Amit Kapila wrote: Or the other way to look at it could be separate out fields which are required for parallel scan which is done currently by forming a separate structure ParallelHeapScanDescData. I was suggesting that you separate out both the normal scan fields and the partial scan fields, that way we're sure that rs_nblocks is not accessed during a parallel scan. Or, you could try wrapping the parts of heapam.c that are affected by parallelism into new static functions. The reason why partial scan can't be mixed with sync scan is that in parallel scan, it performs the scan of heap by synchronizing blocks (each parallel worker scans a block and then asks for a next block to scan) among parallel workers. Now if we try to make sync scans work along with it, the synchronization among parallel workers will go for a toss. It might not be impossible to make that work in some way, but not sure if it is important enough for sync scans to work along with parallel scan. I haven't tested it, but I think it would still be helpful. The block accesses are still in order even during a partial scan, so why wouldn't it help? You might be concerned about the reporting of a block location, which would become more noisy with increased parallelism. But in my original testing, sync scans weren't very sensitive to slight deviations, because of caching effects. tqueue.c is mainly designed to pass tuples between parallel workers and currently it is used in Funnel operator to gather the tuples generated by all the parallel workers. I think we can use it for any other operator which needs tuple communication among parallel workers. Some specifics of the Funnel operator seem to be a part of tqueue, which doesn't make sense to me. For instance, reading from the set of queues in a round-robin fashion is part of the Funnel algorithm, and doesn't seem suitable for a generic tuple communication mechanism (that would never allow order-sensitive reading, for example). Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Fri, 2015-07-03 at 17:35 +0530, Amit Kapila wrote: Attached, find the rebased version of patch. Comments: * The heapam.c changes seem a little ad-hoc. Conceptually, which portions should be affected by parallelism? How do we know we didn't miss something? * Why is initscan getting the number of blocks from the structure? Is it just to avoid an extra syscall, or is there a correctness issue there? Is initscan expecting that heap_parallelscan_initialize is always called first (if parallel)? Please add a comment explaining above. * What's the difference between scan-rs_nblocks and scan-rs_parallel-phs_nblocks? Same for rs_rd-rd_id and phs_relid. * It might be good to separate out some fields which differ between the normal heap scan and the parallel heap scan. Perhaps put rs_ctup, rs_cblock, and rs_cbuf into a separate structure, which is always NULL during a parallel scan. That way we don't accidentally use a non-parallel field when doing a parallel scan. * Is there a reason that partial scans can't work with syncscan? It looks like you're not choosing the starting block in the same way, so it always starts at zero and never does syncscan. If we don't want to mix syncscan and partial scan, that's fine, but it should be more explicit. I'm trying to understand where tqueue.c fits in. It seems very closely tied to the Funnel operator, because any change to the way Funnel works would almost certainly require changes in tqueue.c. But tqueue is a generic name for the file, so something seems off. Either we should explicitly make it the supporting routines for the Funnel operator, or we should try to generalize it a little. I still have quite a bit to look at, but this is a start. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Wed, 2015-07-01 at 11:07 +0530, Amit Kapila wrote: > For what you are asking to change name for? There are still some places, at least in the comments, that call it a parallel sequential scan. > a. Infrastructure for parallel execution, like some of the stuff in execparallel.c, heapam.c, tqueue.c, etc and all other generic (non-nodes specific) code. Did you consider passing tuples through the tqueue by reference rather than copying? The page should be pinned by the worker process, but perhaps that's a bad assumption to make? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
[Jumping in without catching up on entire thread. Please let me know if these questions have already been covered.] 1. Can you change the name to something like ParallelHeapScan? Parallel Sequential is a contradiction. (I know this is bikeshedding and I won't protest further if you keep the name.) 2. Where is the speedup coming from? How much of it is CPU and IO overlapping (i.e. not leaving disk or CPU idle while the other is working), and how much from the CPU parallelism? I know this is difficult to answer rigorously, but it would be nice to have some breakdown even if for a specific machine. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Refactor to split nodeAgg.c?
I was going to rebase my HashAgg patch, and got some conflicts related to the grouping sets patch. I could probably sort them out, but I think that may be the tipping point where we want to break up nodeAgg.c into nodeSortedAgg.c and nodeHashAgg.c, and probably a common file as well. This would also (I hope) be convenient for Simon and David Rowley, who have been hacking on aggregates in general. Anyone see a reason I shouldn't give this a try? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Memory Accounting v11
This patch tracks memory usage (at the block level) for all memory contexts. Individual palloc()s aren't tracked; only the new blocks allocated to the memory context with malloc(). It also adds a new function, MemoryContextMemAllocated() which can either retrieve the total allocated for the context, or it can recurse and sum up the allocations for all subcontexts as well. This is intended to be used by HashAgg in an upcoming patch that will bound the memory and spill to disk. Previous discussion here: http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop Previous concerns: * There was a slowdown reported of around 1-3% (depending on the exact version of the patch) on an IBM power machine when doing an index rebuild. The results were fairly noisy for me, but it seemed to hold up. See http://www.postgresql.org/message-id/CA +Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajp3cro9db...@mail.gmail.com * Adding a struct field to MemoryContextData may be bad for the CPU caching behavior, and may be the cause of the above slowdown. * Previous versions of the patch updated the parent contexts' allocations as well, which avoided the need to recurse when querying the total allocation. That seemed to penalize cases that didn't need to track the allocation. We discussed trying to opt-in to this behavior, but it seemed more awkward than helpful. Now, the patch only updates the allocation of the current context, and querying means recursing through the child contexts. * There was a concern that, if MemoryContextMemAllocated needs to recurse to the child contexts, it will be too slow for HashAgg of array_agg, which creates a child context per group. That was solved with http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1 My general answer to the performance concerns is that they aren't a reason to block this patch, unless someone has a suggestion about how to fix them. Adding one field to a struct and a few arithmetic operations for each malloc() or realloc() seems reasonable to me. The current state, where HashAgg just blows up the memory, is just not reasonable, and we need to track the memory to fix that problem. Others have also mentioned that we might want to use this mechanism to track memory for other operators, like Sort or HashJoin, which might be simpler and more accurate. 
Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 500,505 AllocSetContextCreate(MemoryContext parent, --- 500,508 errdetail(Failed while creating memory context \%s\., name))); } + + ((MemoryContext) set)-mem_allocated += blksize; + block-aset = set; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; *** *** 590,595 AllocSetReset(MemoryContext context) --- 593,600 else { /* Normal case, release the block */ + context-mem_allocated -= block-endptr - ((char*) block); + #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block-freeptr - ((char *) block)); #endif *** *** 632,637 AllocSetDelete(MemoryContext context) --- 637,644 { AllocBlock next = block-next; + context-mem_allocated -= block-endptr - ((char *) block); + #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block-freeptr - ((char *) block)); #endif *** *** 672,677 AllocSetAlloc(MemoryContext context, Size size) --- 679,687 block = (AllocBlock) malloc(blksize); if (block == NULL) return NULL; + + context-mem_allocated += blksize; + block-aset = set; block-freeptr = block-endptr = ((char *) block) + blksize; *** *** 861,866 AllocSetAlloc(MemoryContext context, Size size) --- 871,878 if (block == NULL) return NULL; + context-mem_allocated += blksize; + block-aset = set; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; *** *** 964,969 AllocSetFree(MemoryContext context, void *pointer) --- 976,984 set-blocks = block-next; else prevblock-next = block-next; + + context-mem_allocated -= block-endptr - ((char*) block); + #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block-freeptr - ((char *) block)); #endif *** *** 1077,1082 AllocSetRealloc(MemoryContext context, void *pointer, Size size) --- 1092,1098 AllocBlock prevblock = NULL; Size chksize; Size blksize; + Size oldblksize; while (block != NULL) { *** *** 1094,1102 AllocSetRealloc(MemoryContext context, void *pointer, Size size) --- 1110,1123 /* Do the realloc */ chksize = MAXALIGN(size); blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; + oldblksize = block-endptr - ((char *)block); + block = (AllocBlock) realloc(block, blksize
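[Editorial note] For readers skimming the patch, a simplified model of the interface described above -- each context counts only the blocks malloc'd for itself, and the query optionally recurses over its children -- might look like this. Toy types only, not the real MemoryContextData layout:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoContext
{
    int64_t             mem_allocated;  /* bytes malloc'd for this context's blocks */
    struct DemoContext *firstchild;     /* head of child list */
    struct DemoContext *nextchild;      /* next sibling */
} DemoContext;

/* Analogous in spirit to MemoryContextMemAllocated(context, recurse). */
static int64_t
demo_mem_allocated(const DemoContext *context, bool recurse)
{
    int64_t total = context->mem_allocated;

    if (recurse)
    {
        const DemoContext *child;

        for (child = context->firstchild; child != NULL; child = child->nextchild)
            total += demo_mem_allocated(child, true);
    }
    return total;
}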
Re: [HACKERS] Memory Accounting v11
On Sun, 2015-06-14 at 16:21 -0400, Tom Lane wrote: > Meh. HashAgg could track its memory usage without loading the entire system with a penalty. When I tried doing it outside of the MemoryContexts, it seemed to get awkward quickly. I'm open to suggestion if you can point me in the right direction. Maybe I can peek at the sizes of chunks holding state values and group keys, and combine that with the hash table size-estimating code? > Moreover, this is about fourth or fifth down the list of the implementation problems with spilling hash aggregation to disk. It would be good to see credible solutions for the bigger issues before we buy into adding overhead for a mechanism with no uses in core. I had several iterations of a full implementation of the spill-to-disk HashAgg patch[1]. Tomas Vondra has some constructive review comments, but all of them seemed solvable. What do you see as a major unsolved issue? If I recall, you were concerned about things like array_agg, where an individual state could get larger than work_mem. That's a valid concern, but it's not the problem I was trying to solve. Regards, Jeff Davis [1] http://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type
On Thu, Apr 23, 2015 at 1:49 AM, David G. Johnston david.g.johns...@gmail.com wrote: Reading and writing all this I'm convinced you have gotten the idea in your mind an expectation of equivalency and consistency where there really is little or none from an overall design perspective. And none insofar as would merit trying to force the two example queries you provide to behave identically. There are a number of things about SQL that one either simply lives with or goes through mind contortions to understand the, possibly imperfect, reasoning behind. This is one of those things: and while it's been fun to do those contortions in the end I am only a little bit better off than when I simply accepted the fact the unknown and untyped were similar but different (even if I hadn't considered giving them different names). You make some good points, but I disagree for two reasons: 1. This is a postgres issue, not a general SQL issue, so we can't simply say SQL is weird (like we can with a lot of the strange NULL semantics). 2. Postgres has determined that the unknown column reference b in the query SELECT a=b FROM (SELECT ''::text, ' ') x(a,b) can and should be coerced to text. So postgres has already made that decision. It's just that when it tries to coerce it, it fails. Even if you believe that literals and column references are not equivalent, I don't see any justification at all for this behavior -- postgres should not decide to coerce it in the first place if it's going to fail. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type
On Wed, 2015-04-22 at 20:35 -0700, David G. Johnston wrote: But the fact that column b has the data type unknown is only a warning - not an error. I get an error: postgres=# SELECT ' '::text = 'a'; ?column? -- f (1 row) postgres=# SELECT a=b FROM (SELECT ''::text, ' ') x(a,b); ERROR: failed to find conversion function from unknown to text So that means the column reference b is treated differently than the literal. Here I don't mean a reference to an actual column of a real table, just an identifier (b) that parses as a columnref. Creating the table gives you a warning (not an error), but I think that was a poor example for me to choose, and not important to my point. This seems to be a case of the common problem (or, at least recently mentioned) where type conversion only deals with data and not context. http://www.postgresql.org/message-id/CADx9qBmVPQvSH3 +2cH4cwwPmphW1mE18e=wumlfuc-qz-t7...@mail.gmail.com I think that is a different problem. That's a runtime type conversion error (execution time), and I'm talking about something happening at parse analysis time. but this too works - which is why the implicit cast concept above fails (I'm leaving it since the thought process may help in understanding): SELECT 1 = '1'; From which I infer that an unknown literal is allowed to be fed directly into a type's input function to facilitate a direct coercion. Yes, I believe that's what's happening. When we use an unknown literal, it's acting more like a value constructor and will pass it to the type input function. When it's a columnref, even if unknown, it tries to cast it and fails. But that is very confusing. In the example at the top of this email, it seems like the second query should be equivalent to the first, or even that postgres should be able to rewrite the second into the first. But the second query fails where the first succeeds. At this point...backward compatibility? Backwards compatibility of what queries? I guess the ones that return unknowns to the client or create tables with unknown columns? create table a(u) as select '1'; WARNING: column u has type unknown DETAIL: Proceeding with relation creation anyway. Related question: was there ever a time when the above failed instead of just supplying a warning? Not that I recall. My gut reaction is if you feel strongly enough to add some additional documentation or warnings/hints/details related to this topic they probably would get put in; but disallowing unknown as first-class type is likely to fail to pass a cost-benefit evaluation. I'm not proposing that we eliminate unknown. I just think columnrefs and literals should behave consistently. If we really don't want unknown columnrefs, it seems like we could at least throw a better error. If we were starting from scratch, I'd also not return unknown to the client, but we have to worry about the backwards compatibility. Distinguishing between untyped literals and unknown type literals seems promising concept to aid in understanding the difference in the face of not being able (or wanting) to actually change the behavior. Not sure I understand that proposal, can you elaborate? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type
Moving thread to -hackers. On Wed, Apr 8, 2015 at 11:18 PM, Jeff Davis pg...@j-davis.com wrote: That example was just for illustration. My other example didn't require creating a table at all: SELECT a=b FROM (SELECT ''::text, ' ') x(a,b); it's fine with me if we want that to fail, but I don't think we're failing in the right place, or with the right error message. I'm not clear on what rules we're applying such that the above query should fail, but: SELECT ''::text=' '; should succeed. Unknown literals are OK, but unknown column references are not? If that's the rule, can we catch that earlier, and throw an error like 'column reference b has unknown type'? Is the behavior of unknown literals vs. unknown column references documented anywhere? I tried looking here: http://www.postgresql.org/docs/devel/static/typeconv.html, but it doesn't seem to make the distinction between how unknown literals vs. unknown column references are handled. My understanding until now has been that unknown types are a placeholder while still inferring the types. But that leaves a few questions: 1. Why do we care whether the unknown is a literal or a column reference? 2. Unknown column references seem basically useless, because we will almost always encounter the failure to find conversion error, so why do we allow them? 3. If unknowns are just a placeholder, why do we return them to the client? What do we expect the client to do with it? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Invent a memory context reset/delete callback mechanism.
On Fri, 2015-02-27 at 22:17 +, Tom Lane wrote: > In passing, per discussion, rearrange some boolean fields in struct MemoryContextData so as to avoid wasted padding space. For safety, this requires making allowInCritSection's existence unconditional; but I think that's a better approach than what was there anyway. I notice that this uses the bytes in MemoryContextData that I was hoping to use for the memory accounting patch (for memory-bounded HashAgg). Leaving no space for even a 32-bit value (without AllocSetContext growing beyond the magical 192-byte number Andres mentioned) removes most of the options for memory accounting. The only things I can think of are: 1. Move a few less-frequently-accessed fields out to an auxiliary structure (Tomas proposed something similar); or 2. Walk the memory contexts, but use some kind of estimation/extrapolation so that it doesn't need to be done for every input tuple of HashAgg. Thoughts? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
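[Editorial note] As an aside for readers unfamiliar with the padding issue, here is a toy example of why field ordering matters. This is not the real MemoryContextData, and the sizes assume a typical 64-bit ABI:

#include <stdbool.h>
#include <stdio.h>

typedef struct Padded
{
    bool   flag_a;
    void  *ptr_a;       /* typically 7 bytes of padding precede this */
    bool   flag_b;
    void  *ptr_b;       /* and another 7 bytes here */
} Padded;               /* commonly 32 bytes on 64-bit platforms */

typedef struct Packed
{
    void  *ptr_a;
    void  *ptr_b;
    bool   flag_a;      /* the bools share one padded tail... */
    bool   flag_b;
} Packed;               /* commonly 24 bytes on 64-bit platforms */

int
main(void)
{
    printf("padded=%zu packed=%zu\n", sizeof(Padded), sizeof(Packed));
    return 0;
}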
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Wed, 2015-01-07 at 20:07 +0100, Tomas Vondra wrote: > So I started digging in the code and I noticed this: > hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true); > which is IMHO obviously wrong, because that accounts only for the hashtable itself. > It might be correct for aggregates with state passed by value, but it's incorrect for state passed by reference (e.g. Numeric, arrays etc.), because initialize_aggregates does this: > oldContext = MemoryContextSwitchTo(aggstate->aggcontext); > pergroupstate->transValue = datumCopy(peraggstate->initValue, peraggstate->transtypeByVal, peraggstate->transtypeLen); > MemoryContextSwitchTo(oldContext); > and it's also wrong for all the user-defined aggregates that have no access to hashcontext at all, and only get reference to aggcontext using AggCheckCallContext(). array_agg() is a prime example. Actually, I believe the first one is right, and the second one is wrong. If we allocate pass-by-ref transition states in aggcontext, that defeats the purpose of the patch, because we can't release the memory when we start a new batch (because we can't reset aggcontext). Instead, the transition states should be copied into hashcontext; we should count the memory usage of hashcontext; AggCheckCallContext should return hashcontext; and after each batch we should reset hashcontext. It might be worth refactoring a bit to call it the groupcontext or something instead, and use it for both HashAgg and GroupAgg. That would reduce the need for conditionals. > [ ... problems with array_agg subcontexts ... ] > and it runs indefinitely (I gave up after a few minutes). I believe this renders the proposed memory accounting approach dead. ... > The array_agg() patch I submitted to this CF would fix this particular query, as it removes the child contexts (so there's no need for recursion in MemoryContextMemAllocated), but it does nothing to the user-defined aggregates out there. And it's not committed yet. Now that your patch is committed, I don't think I'd call the memory accounting patch I proposed dead yet. We decided that creating one context per group is essentially a bug, so we don't need to optimize for that case. Your approach may be better, though. Thank you for reviewing. I'll update my patches and resubmit for the next CF. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
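[Editorial note] To make the proposed fix concrete, here is a sketch in the style of the snippet quoted above (not the actual nodeAgg.c code, and not compilable on its own): keep per-group transition values in the per-batch hash context so they can be released wholesale.

/* Sketch only: copy the initial transition state into the per-batch context. */
oldContext = MemoryContextSwitchTo(aggstate->hashcontext);
pergroupstate->transValue = datumCopy(peraggstate->initValue,
                                      peraggstate->transtypeByVal,
                                      peraggstate->transtypeLen);
MemoryContextSwitchTo(oldContext);

/* ... process the rest of the batch ... */

/* Between batches, every group's state goes away at once. */
MemoryContextReset(aggstate->hashcontext);

With AggCheckCallContext() also returning hashcontext, user-defined transition functions would allocate there too, so the per-batch reset would cover their state as well.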
[HACKERS] Variable renaming in AllocSetContextCreate (will commit soon, no functional impact)
This is just to make the variable names in this function consistent with the rest of the file (and less confusing). No functional impact, so I'll commit soon unless someone objects. Previously, this was part of the memory accounting patch, but that hasn't made it in yet. Might as well commit the cleanup at least. Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 438,451 AllocSetContextCreate(MemoryContext parent, Size initBlockSize, Size maxBlockSize) { ! AllocSet context; /* Do the type-independent part of context creation */ ! context = (AllocSet) MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name); /* * Make sure alloc parameters are reasonable, and save them. --- 438,454 Size initBlockSize, Size maxBlockSize) { ! AllocSet set; ! MemoryContext context; /* Do the type-independent part of context creation */ ! context = MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name); ! ! set = (AllocSet) context; /* * Make sure alloc parameters are reasonable, and save them. *** *** 459,467 AllocSetContextCreate(MemoryContext parent, if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! context-initBlockSize = initBlockSize; ! context-maxBlockSize = maxBlockSize; ! context-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be --- 462,470 if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! set-initBlockSize = initBlockSize; ! set-maxBlockSize = maxBlockSize; ! set-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be *** *** 477,486 AllocSetContextCreate(MemoryContext parent, * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! context-allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (context-allocChunkLimit + ALLOC_CHUNKHDRSZ) (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION)) ! context-allocChunkLimit = 1; /* * Grab always-allocated space, if requested --- 480,489 * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! set-allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (set-allocChunkLimit + ALLOC_CHUNKHDRSZ) (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION)) ! set-allocChunkLimit = 1; /* * Grab always-allocated space, if requested *** *** 500,519 AllocSetContextCreate(MemoryContext parent, errdetail(Failed while creating memory context \%s\., name))); } ! block-aset = context; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; ! block-next = context-blocks; ! context-blocks = block; /* Mark block as not to be released at reset time */ ! context-keeper = block; /* Mark unallocated space NOACCESS; leave the block header alone. */ VALGRIND_MAKE_MEM_NOACCESS(block-freeptr, blksize - ALLOC_BLOCKHDRSZ); } ! return (MemoryContext) context; } /* --- 503,522 errdetail(Failed while creating memory context \%s\., name))); } ! block-aset = set; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; ! block-next = set-blocks; ! 
set-blocks = block; /* Mark block as not to be released at reset time */ ! set-keeper = block; /* Mark unallocated space NOACCESS; leave the block header alone. */ VALGRIND_MAKE_MEM_NOACCESS(block-freeptr, blksize - ALLOC_BLOCKHDRSZ); } ! return context; } /* -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Variable renaming in AllocSetContextCreate (will commit soon, no functional impact)
On Sun, 2015-02-22 at 00:07 -0500, Tom Lane wrote: If you want to have just *one* variable but change its name and type, I'd be ok with that. Thank you for taking a quick look. Committed as a simple rename from context to set. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Sun, 2015-02-22 at 03:14 +0100, Tomas Vondra wrote: SELECT COUNT(x) FROM ( SELECT a, array_agg(i) AS x FROM test GROUP BY 1 ) foo; That's actually a bogus test -- array_agg is never executed. Really? How could that happen when the result of array_agg() is passed to the COUNT()? Also, how could that allocate huge amounts of memory and get killed by OOM, which happens easily with this query? Oops, I misread that as COUNT(*). COUNT(x) will force array_agg() to be executed. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Wed, 2015-01-28 at 23:25 +0100, Tomas Vondra wrote: 3) moves the assert into the 'if (release)' branch You missed one, but I got it. 4) includes the comments proposed by Ali Akbar in his reviews Warnings at makeArrayResult/makeMdArrayResult about freeing memory with private subcontexts. I also edited the comments substantially. Regarding the performance impact of decreasing the size of the preallocated array from 64 to just 8 elements, I tried this. CREATE TABLE test AS SELECT mod(i,10) a, i FROM generate_series(1,64*10) s(i); SELECT a, array_agg(i) AS x FROM test GROUP BY 1; or actually (to minimize transfer overhead): SELECT COUNT(x) FROM ( SELECT a, array_agg(i) AS x FROM test GROUP BY 1 ) foo; That's actually a bogus test -- array_agg is never executed. See the test script I used (attached). Many small groups show a significant improvement with your patch (median 167ms unpatched, 125ms patched), and the one large group is not measurably different (median 58ms for both). The test script uses a small dataset of 100K tuples, because 1M tuples will run out of memory for small groups without the patch. Committed. Regards, Jeff Davis array_agg_test.sql Description: application/sql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
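A self-contained version of the benchmark described above can be pieced together from the queries quoted in this thread (the attached array_agg_test.sql remains the authoritative script; the data size here is simply what the quoted CREATE TABLE produces):

    -- toy data set: mod(i,10) yields a handful of small groups
    CREATE TABLE test AS
        SELECT mod(i, 10) AS a, i
        FROM generate_series(1, 64*10) s(i);

    -- COUNT(x) forces every array to actually be built, while keeping
    -- the result set (and client transfer overhead) tiny
    SELECT COUNT(x)
    FROM (SELECT a, array_agg(i) AS x FROM test GROUP BY 1) foo;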
Re: [HACKERS] 9.6 Feature help requested: Inclusion Constraints
On Sat, 2015-02-07 at 16:08 -0800, Jeff Davis wrote: I believe Inclusion Constraints will be important for postgres. I forgot to credit Darren Duncan with the name of this feature: http://www.postgresql.org/message-id/4f8bb9b0.5090...@darrenduncan.net Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] 9.6 Feature help requested: Inclusion Constraints
I believe Inclusion Constraints will be important for postgres. Unfortunately, my time has been scarce lately, so without help it may miss 9.6 as well. If someone is interested in working on this with me to deliver a good submission in a couple months, please let me know. The idea is to make a generalized form of foreign keys, just like exclusion constraints are a generalized form of unique constraints. === Syntax: === (see attachment for more specifics) INCLUDE [ USING index_method ] ( exclude_element WITH operator [, ... ] ) REFERENCES reftable [ ( refcolumn [, ... ] ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] index_parameters [ WHERE ( predicate ) ] Notice the WHERE clause, which is missing from FKs. == Example #1 == -- e.g. Super Bowl create table event ( event_id int8, title text, eduring tstzrange, exclude using gist (event_id with =, eduring with &&) ); -- e.g. Catch on one yard line create table event_highlight ( event_id int8, description text, hduring tstzrange, exclude using gist (event_id with =, hduring with &&), include using gist (event_id with =, hduring with <@) references event (event_id, eduring) ); Ensure that the hduring is always contained (<@) in some eduring with the same event_id. == Example #2 == create table asset_ownership ( asset_id INT8, owner_id INT8, during tstzrange, exclude using gist (asset_id with =, during with &&), include using gist (asset_id with =, during with -|-) references asset_possession (asset_id, during) where (not lower_inf(during)) ); Ensure that assets in a business are always in the possession of some responsible party. Here we require that every possession period is adjacent (-|-) to some other possession period for the same asset, unless there is no lower bound. Limitations: I don't see a good way to make cascading updates work, as Peter pointed out in a previous discussion: http://www.postgresql.org/message-id/1288113924.22800.4.ca...@vanquo.pezone.net And cascading deletes would be a little strange, as well. I think it's better to just support ON UPDATE/DELETE NO ACTION. If someone sees a use case here that is not too specific, we can always try to add support for something like that later. We could support ON UPDATE/DELETE RESTRICT, too, but that can be accomplished by making it IMMEDIATE (unless I'm missing some subtlety). === Challenges: === I think the biggest technical challenge will be to account for the case where a referencing tuple has multiple matches on the referenced side. This is similar to the challenge of supporting MATCH PARTIAL with FKs, which is, I assume, why we don't support that yet. It *seems* solvable, but I haven't looked in detail yet. Another approach would be to try to enforce the idea that there is only one match on the referenced side, using an exclusion constraint. I rejected this approach because: * Notice in example #1 that the exclusion constraint on the referenced side uses the overlaps operator (&&), while the inclusion constraint uses the contained by operator (<@). Postgres currently has no way to know that those two operators have a special relationship that ensures only one match on the referenced side. I don't even know the name for that relationship, so asking users to declare such a relationship seems a little much. * It seems like it might eliminate some legitimate use cases. * We could get MATCH PARTIAL support as a result of this work.
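Before getting to the work items, a hypothetical usage sketch may help make the intended semantics of example #1 concrete (this is proposed syntax only; nothing below runs today, and the timestamps are made up):

    -- a highlight whose period is contained in the event's period: accepted
    insert into event values
        (1, 'Super Bowl', '[2015-02-01 15:30, 2015-02-01 19:30)');
    insert into event_highlight values
        (1, 'Catch on one yard line', '[2015-02-01 19:00, 2015-02-01 19:05)');

    -- a highlight not contained (<@) in any eduring for event_id = 1:
    -- the INCLUDE ... REFERENCES constraint would reject this row
    insert into event_highlight values
        (1, 'Postgame interview', '[2015-02-01 20:00, 2015-02-01 20:10)');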
Work to be done: The rest of the work seems similar in scope to Exclusion Constraints (0cb65564): - parse analysis work - catalog work - dump/reload support - compare performance of a trivial inclusion constraint to a FK - ensure that deadlocks are not too common - add tests Any takers? Regards, Jeff Davis *** a/doc/src/sgml/ref/create_table.sgml --- b/doc/src/sgml/ref/create_table.sgml *** *** 63,69 CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI PRIMARY KEY ( replaceable class=PARAMETERcolumn_name/replaceable [, ... ] ) replaceable class=PARAMETERindex_parameters/replaceable | EXCLUDE [ USING replaceable class=parameterindex_method/replaceable ] ( replaceable class=parameterexclude_element/replaceable WITH replaceable class=parameteroperator/replaceable [, ... ] ) replaceable class=parameterindex_parameters/replaceable [ WHERE ( replaceable class=parameterpredicate/replaceable ) ] | FOREIGN KEY ( replaceable class=PARAMETERcolumn_name/replaceable [, ... ] ) REFERENCES replaceable class=PARAMETERreftable/replaceable [ ( replaceable class=PARAMETERrefcolumn/replaceable [, ... ] ) ] ! [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] [ ON DELETE replaceable class=parameteraction/replaceable ] [ ON UPDATE replaceable class
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Tue, 2015-01-20 at 23:37 +0100, Tomas Vondra wrote: Tom's message where he points that out is here: http://www.postgresql.org/message-id/20707.1396372...@sss.pgh.pa.us That message also says: I think a patch that stood a chance of getting committed would need to detect whether the aggregate was being called in simple or grouped contexts, and apply different behaviors in the two cases. I take that as an objection to any patch which does not distinguish between the grouped and ungrouped aggregate cases, which includes your patch. I don't agree with that objection (or perhaps I don't understand it); but given the strong words above, I need to get some kind of response before I can consider committing your patch. I generally agree that having two API 'facets' with different behavior is slightly awkward and assymetric, but I wouldn't call that ugly. Right, your words are more precise (and polite). My apologies. I actually modified both APIs initially, but I think Ali is right that not breaking the existing API (and keeping the original behavior in that case) is better. We can break it any time we want in the future, but it's impossible to unbreak it ;-) We can't break the old API, and I'm not suggesting that we do. I was hoping to find some alternative. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Tue, Jan 20, 2015 at 6:44 AM, Tom Lane t...@sss.pgh.pa.us wrote: Jeff Davis pg...@j-davis.com writes: Tom (tgl), Is my reasoning above acceptable? Uh, sorry, I've not been paying any attention to this thread for awhile. What's the remaining questions at issue? This patch is trying to improve the array_agg case where there are many arrays being constructed simultaneously, such as in HashAgg. You strongly suggested that a commitable patch would differentiate between the grouped and ungrouped cases (or perhaps you meant differentiate between HashAgg and sorted aggregation?). Tomas's current patch does not do so; it does two main things: 1. Uses a common memory context for arrays being constructed by array_agg in any context (ungrouped, sorted agg, and HashAgg) 2. Reduces the initial array allocation to 8 elements from 64, but preserves doubling behavior I don't see either as a big problem, but perhaps there are some downsides in some cases. I think a worst case would be a slowdown in the sorted agg case where every group has 64 elements, so I'll try to see if that's a real problem or not. If you saw a bigger problem, please let me know; and if not, I'll proceed with the review. There are also some other concerns I have about the API ugliness, which I believe is the reason there's so much discussion about making the comments better. The reason the API is ugly is for backwards compatibility, so there's no perfect solution. Your opinion is welcome here too, but I mainly need to see if your objection above has been addressed. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
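For reference, the growth policy under discussion amounts to something like the following inside accumArrayResult() (a simplified sketch; the field names follow ArrayBuildState, but the constant name is purely illustrative):

    #define ARRAY_ELEMS_INITIAL 8          /* the patch lowers this from 64 */

    if (astate->nelems >= astate->alen)
    {
        astate->alen *= 2;                 /* doubling behavior is preserved */
        astate->dvalues = (Datum *)
            repalloc(astate->dvalues, astate->alen * sizeof(Datum));
        astate->dnulls = (bool *)
            repalloc(astate->dnulls, astate->alen * sizeof(bool));
    }

So a group that eventually holds 64 elements only pays three extra rounds of doubling when starting from 8, which is why the smaller initial size should be cheap even for the few-large-groups case.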
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Wed, Jan 14, 2015 at 1:52 PM, Tomas Vondra tomas.von...@2ndquadrant.com wrote: Attached is v8 patch, with a few comments added: 1) before initArrayResult() - explaining when it's better to use a single memory context, and when it's more efficient to use a separate memory context for each array build state 2) before makeArrayResult() - explaining that it won't free memory when allocated in a single memory context (and that a pfree() has to be used if necessary) 3) before makeMdArrayResult() - explaining that it's illegal to use release=true unless using a subcontext I understand there is history to this API, and we need to be compatible, but the end result is awkward. I'm wondering whether it would be better to just have a new set of functions like accumArrayResultElem, etc., and only allow astate=NULL in the original accumArrayResult(). That cure might be worse than the disease though. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
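One possible shape for such a new set of functions, with the original left untouched, is a thin compatibility wrapper (the name accumArrayResultEx and its extra flag are illustrative only, not part of any posted patch):

    /* old entry point keeps its historical behavior: a private subcontext
     * per build state, so existing callers are completely unaffected */
    ArrayBuildState *
    accumArrayResult(ArrayBuildState *astate,
                     Datum dvalue, bool disnull,
                     Oid element_type,
                     MemoryContext rcontext)
    {
        return accumArrayResultEx(astate, dvalue, disnull, element_type,
                                  rcontext, true);    /* use a subcontext */
    }

Callers that build many arrays at once (e.g. array_agg under HashAgg) would call the new entry point with false and share rcontext directly, avoiding one memory context per group.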
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Sun, Dec 28, 2014 at 11:53 PM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2014-04-01 at 13:08 -0400, Tom Lane wrote: I think a patch that stood a chance of getting committed would need to detect whether the aggregate was being called in simple or grouped contexts, and apply different behaviors in the two cases. The simple context doesn't seem like a big problem even if we change things as Tomas suggests: IMNSHO these are the issues we really should fix - by lowering the initial element count (64->4) and using a single memory context. In the simple context, there's only one context regardless, so the only cost I see is from reducing the initial allocation from 64 to some lower number. But if we're doubling each time, it won't take long to get there; and because it's the simple context, we only need to do it once. Tom (tgl), Is my reasoning above acceptable? The current patch, which I am evaluating for commit, does away with per-group memory contexts (it uses a common context for all groups), and reduces the initial array allocation from 64 to 8 (but preserves doubling behavior). Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compression of full-page-writes
On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote: Speeding up the CRC calculation obviously won't help with the WAL volume per se, ie. you still generate the same amount of WAL that needs to be shipped in replication. But then again, if all you want to do is to reduce the volume, you could just compress the whole WAL stream. Was this point addressed? How much benefit is there to compressing the data before it goes into the WAL stream versus after? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Tue, 2014-12-23 at 01:16 -0800, Jeff Davis wrote: New patch attached (rebased, as well). I also see your other message about adding regression testing. I'm hesitant to slow down the tests for everyone to run through this code path though. Should I add regression tests, and then remove them later after we're more comfortable that it works? Attached are some tests I ran. First, generate the data sets with hashagg_test_data.sql. Then, do (I used work_mem at default of 4MB): set enable_hashagg=false; \o /tmp/sort.out \i /tmp/hashagg_test.sql \o set enable_hashagg=true; \o /tmp/hash.out \i /tmp/hashagg_test.sql and then diff'd the output to make sure the results are the same (except the plans, of course). The script loads the results into a temp table, then sorts it before outputting, to make the test order-independent. I didn't just add an ORDER BY, because that would change the plan and it would never use hashagg. I think that has fairly good coverage of the hashagg code. I used 3 different input data sets, byval and byref types (for group key and args), and a group aggregate query as well as DISTINCT. Let me know if I missed something. I also did some performance comparisons between disk-based sort+group and disk-based hashagg. The results are quite favorable for hashagg given the data sets I provided. Simply create the data using hashagg_test_data.sql (if not already done), set the work_mem to the value you want to test, and run hashagg_test_perf.sql. It uses EXPLAIN ANALYZE for the timings.
  singleton: 10M groups of 1
  even: 1M groups of 10
  skew: wildly different group sizes; see data script
  q1: group aggregate query
  q2: distinct query
The total memory requirements for the test to run without going to disk range from about 100MB (for even) to about 1GB (for singleton). Regardless of work_mem, these all fit in memory on my machine, so they aren't *really* going to disk. Also note that, because of how the memory blocks are allocated, and that hashagg waits until memory is exceeded, hashagg might use about double work_mem when work_mem is small (the effect is not important at higher values).

work_mem='1MB':        sort+group (s)   hashagg (s)
  singleton q1               12              10
  singleton q2                8               7
  even q1                    14               7
  even q2                    10               5
  skew q1                    22               6
  skew q2                    16               4

work_mem='4MB':        sort+group (s)   hashagg (s)
  singleton q1               12              11
  singleton q2                8               6
  even q1                    12               7
  even q2                     9               5
  skew q1                    19               6
  skew q2                    13               3

work_mem='16MB':       sort+group (s)   hashagg (s)
  singleton q1               12              11
  singleton q2                8               7
  even q1                    14               7
  even q2                    10               5
  skew q1                    15               6
  skew q2                    12               4

work_mem='64MB':       sort+group (s)   hashagg (s)
  singleton q1               13              12
  singleton q2                9               8
  even q1                    14               8
  even q2                    10               5
  skew q1                    17               6
  skew q2                    13               4

work_mem='256MB':      sort+group (s)   hashagg (s)
  singleton q1               12              12
  singleton q2                9               8
  even q1                    14               7
  even q2                    11               4
  skew q1                    16               6
  skew q2                    13               4

work_mem='512MB':      sort+group (s)   hashagg (s)
  singleton q1               12              12
  singleton q2                9               7
  even q1                    14               7
  even q2                    10               4
  skew q1                    16               6
  skew q2                    12               4

work_mem='2GB':        sort+group (s)   hashagg (s)
  singleton q1                9              12
  singleton q2                6               6
  even q1                     8               7
  even q2                     6               4
  skew q1                     7               6
  skew q2                     5               4

These numbers are great news for disk-based hashagg. It seems to be the same or better than sort+group in nearly all cases (again, this example doesn't actually go to disk, so those numbers may come out differently). Also, the numbers are remarkably stable for varying work_mem for both plans. That means that it doesn't cost much to keep a lower work_mem as long as your system has plenty of memory. Do others have similar numbers? I'm quite surprised at how little work_mem seems to matter for these plans (HashJoin might
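The perf script itself is attached; the shape of each timed query is roughly the following (hypothetical table and query forms that merely match the legend above -- they are not copied from hashagg_test_perf.sql):

    SET work_mem = '4MB';
    SET enable_hashagg = true;     -- false for the sort+group runs

    EXPLAIN ANALYZE                -- q1: grouped aggregate
    SELECT a, count(i) FROM even GROUP BY a;

    EXPLAIN ANALYZE                -- q2: DISTINCT
    SELECT DISTINCT a FROM even;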
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Thu, 2014-12-11 at 02:46 -0800, Jeff Davis wrote: On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote: This patch is requires the Memory Accounting patch, or something similar to track memory usage. The attached patch enables hashagg to spill to disk, which means that hashagg will contain itself to work_mem even if the planner makes a bad misestimate of the cardinality. New patch attached. All open items are complete, though the patch may have a few rough edges. This thread got moved over here: http://www.postgresql.org/message-id/1419326161.24895.13.camel@jeff-desktop Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Sun, 2014-12-28 at 12:37 -0800, Jeff Davis wrote: I feel like I made a mistake -- can someone please do a sanity check on my numbers? I forgot to randomize the inputs, which doesn't matter much for hashagg but does matter for sort. New data script attached. The results are even *better* for disk-based hashagg than the previous numbers suggest. Here are some new numbers:

work_mem='1MB':        sort+group (s)   hashagg (s)
  singleton q1               21              10
  singleton q2               12               8
  even q1                    20               7
  even q2                    13               5
  skew q1                    22               6
  skew q2                    16               4

work_mem='4MB':        sort+group (s)   hashagg (s)
  singleton q1               17              10
  singleton q2               11               6
  even q1                    16               7
  even q2                    11               5
  skew q1                    19               6
  skew q2                    13               4

work_mem='16MB':       sort+group (s)   hashagg (s)
  singleton q1               16              11
  singleton q2               11               7
  even q1                    15               8
  even q2                    12               6
  skew q1                    15               6
  skew q2                    12               4

work_mem='64MB':       sort+group (s)   hashagg (s)
  singleton q1               18              12
  singleton q2               13               8
  even q1                    17              10
  even q2                    13               6
  skew q1                    17               6
  skew q2                    14               4

work_mem='256MB':      sort+group (s)   hashagg (s)
  singleton q1               18              12
  singleton q2               14               7
  even q1                    16               9
  even q2                    14               5
  skew q1                    18               6
  skew q2                    13               4

work_mem='512MB':      sort+group (s)   hashagg (s)
  singleton q1               18              12
  singleton q2               14               7
  even q1                    17               9
  even q2                    14               5
  skew q1                    17               6
  skew q2                    13               4

work_mem='2GB':        sort+group (s)   hashagg (s)
  singleton q1               11              11
  singleton q2                7               6
  even q1                    10               9
  even q2                     7               5
  skew q1                     7               6
  skew q2                     4               4

Regards, Jeff Davis hashagg_test_data.sql Description: application/sql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Sun, 2014-12-21 at 13:00 -0500, Tom Lane wrote: Tomas Vondra t...@fuzzy.cz writes: i.e. either destroy the whole context if possible, and just free the memory when using a shared memory context. But I'm afraid this would penalize the shared memory context, because that's intended for cases where all the build states coexist in parallel and then at some point are all converted into a result and thrown away. Adding pfree() calls is no improvement here, and just wastes cycles. FWIW, I quite dislike the terminology shared memory context, because it sounds too much like it means a context in shared memory. I see that the patch itself doesn't use that phrase, which is good, but can we come up with some other phrase for talking about it? Common memory context? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Tue, 2014-12-16 at 00:27 +0100, Tomas Vondra wrote: plperl.c: In function 'array_to_datum_internal': plperl.c:1196: error: too few arguments to function 'accumArrayResult' plperl.c: In function 'plperl_array_to_datum': plperl.c:1223: error: too few arguments to function 'initArrayResult' Cheers, Thanks, attached is a version that fixes this. Just jumping into this patch now. Do we think this is worth changing the signature of functions in array.h, which might be used from a lot of third-party code? We might want to provide new functions to avoid a breaking change. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg
On Tue, 2014-04-01 at 13:08 -0400, Tom Lane wrote: I think a patch that stood a chance of getting committed would need to detect whether the aggregate was being called in simple or grouped contexts, and apply different behaviors in the two cases. The simple context doesn't seem like a big problem even if we change things as Tomas suggests: IMNSHO these are the issues we really should fix - by lowering the initial element count (64->4) and using a single memory context. In the simple context, there's only one context regardless, so the only cost I see is from reducing the initial allocation from 64 to some lower number. But if we're doubling each time, it won't take long to get there; and because it's the simple context, we only need to do it once. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
It seems that these two patches are being reviewed together. Should I just combine them into one? My understanding was that some wanted to review the memory accounting patch separately. On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote: That's the only conflict, and after fixing it it compiles OK. However, I got a segfault on the very first query I tried :-( If lookup_hash_entry doesn't find the group, and there's not enough memory to create it, then it returns NULL; but the caller wasn't checking for NULL. My apologies for such a trivial mistake, I was doing most of my testing using DISTINCT. My fix here was done quickly, so I'll take a closer look later to make sure I didn't miss something else. New patch attached (rebased, as well). I also see your other message about adding regression testing. I'm hesitant to slow down the tests for everyone to run through this code path though. Should I add regression tests, and then remove them later after we're more comfortable that it works? Regards Jeff Davis *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 3045,3050 include_dir 'conf.d' --- 3045,3065 /listitem /varlistentry + varlistentry id=guc-enable-hashagg-disk xreflabel=enable_hashagg_disk + termvarnameenable_hashagg_disk/varname (typeboolean/type) + indexterm +primaryvarnameenable_hashagg_disk/ configuration parameter/primary + /indexterm + /term + listitem +para + Enables or disables the query planner's use of hashed aggregation plan + types when the planner expects the hash table size to exceed + varnamework_mem/varname. The default is literalon/. +/para + /listitem + /varlistentry + varlistentry id=guc-enable-hashjoin xreflabel=enable_hashjoin termvarnameenable_hashjoin/varname (typeboolean/type) indexterm *** a/src/backend/commands/explain.c --- b/src/backend/commands/explain.c *** *** 86,91 static void show_sort_group_keys(PlanState *planstate, const char *qlabel, --- 86,92 List *ancestors, ExplainState *es); static void show_sort_info(SortState *sortstate, ExplainState *es); static void show_hash_info(HashState *hashstate, ExplainState *es); + static void show_hashagg_info(AggState *hashstate, ExplainState *es); static void show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es); static void show_instrumentation_count(const char *qlabel, int which, *** *** 1422,1427 ExplainNode(PlanState *planstate, List *ancestors, --- 1423,1429 case T_Agg: show_agg_keys((AggState *) planstate, ancestors, es); show_upper_qual(plan-qual, Filter, planstate, ancestors, es); + show_hashagg_info((AggState *) planstate, es); if (plan-qual) show_instrumentation_count(Rows Removed by Filter, 1, planstate, es); *** *** 1912,1917 show_sort_info(SortState *sortstate, ExplainState *es) --- 1914,1955 } /* + * Show information on hash aggregate buckets and batches + */ + static void + show_hashagg_info(AggState *aggstate, ExplainState *es) + { + Agg *agg = (Agg *)aggstate-ss.ps.plan; + + Assert(IsA(aggstate, AggState)); + + if (agg-aggstrategy != AGG_HASHED) + return; + + if (!aggstate-hash_init_state) + { + long memPeakKb = (aggstate-hash_mem_peak + 1023) / 1024; + long diskKb = (aggstate-hash_disk + 1023) / 1024; + + if (es-format == EXPLAIN_FORMAT_TEXT) + { + appendStringInfoSpaces(es-str, es-indent * 2); + appendStringInfo( + es-str, + Batches: %d Memory Usage: %ldkB Disk Usage:%ldkB\n, + aggstate-hash_num_batches, memPeakKb, diskKb); + } + else + { + ExplainPropertyLong(HashAgg Batches, + aggstate-hash_num_batches, es); + ExplainPropertyLong(Peak Memory Usage, memPeakKb, 
es); + ExplainPropertyLong(Disk Usage, diskKb, es); + } + } + } + + /* * Show information on hash buckets/batches. */ static void *** a/src/backend/executor/execGrouping.c --- b/src/backend/executor/execGrouping.c *** *** 310,316 BuildTupleHashTable(int numCols, AttrNumber *keyColIdx, hash_ctl.hcxt = tablecxt; hashtable-hashtab = hash_create(TupleHashTable, nbuckets, hash_ctl, ! HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT); return hashtable; } --- 310,317 hash_ctl.hcxt = tablecxt; hashtable-hashtab = hash_create(TupleHashTable, nbuckets, hash_ctl, ! HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | ! HASH_CONTEXT | HASH_NOCHILDCXT); return hashtable; } *** *** 331,336 TupleHashEntry --- 332,386 LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot, bool *isnew) { + uint32 hashvalue; + + hashvalue = TupleHashEntryHash(hashtable, slot); + return LookupTupleHashEntryHash(hashtable, slot
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote: This patch is requires the Memory Accounting patch, or something similar to track memory usage. The attached patch enables hashagg to spill to disk, which means that hashagg will contain itself to work_mem even if the planner makes a bad misestimate of the cardinality. New patch attached. All open items are complete, though the patch may have a few rough edges. Summary of changes: * rebased on top of latest memory accounting patch http://www.postgresql.org/message-id/1417497257.5584.5.camel@jeff-desktop * added a flag to hash_create to prevent it from creating an extra level of memory context - without this, the memory accounting would have a measurable impact on performance * cost model for the disk usage * intelligently choose the number of partitions for each pass of the data * explain support * in build_hash_table(), be more intelligent about the value of nbuckets to pass to BuildTupleHashTable() - BuildTupleHashTable tries to choose a value to keep the table in work_mem, but it isn't very accurate. * some very rudimentary testing (sanity checks, really) shows good results Summary of previous discussion (my summary; I may have missed some points): Tom Lane requested that the patch also handle the case where transition values grow (e.g. array_agg) beyond work_mem. I feel this patch provides a lot of benefit as it is, and trying to handle that case would be a lot more work (we need a way to write the transition values out to disk at a minimum, and perhaps combine them with other transition values). I also don't think my patch would interfere with a fix there in the future. Tomas Vondra suggested an alternative design that more closely resembles HashJoin: instead of filling up the hash table and then spilling any new groups, the idea would be to split the current data into two partitions, keep one in the hash table, and spill the other (see ExecHashIncreaseNumBatches()). This has the advantage that it's very fast to identify whether the tuple is part of the in-memory batch or not; and we can avoid even looking in the memory hashtable if not. The batch-splitting approach has a major downside, however: you are likely to evict a skew value from the in-memory batch, which will result in all subsequent tuples with that skew value going to disk. My approach never evicts from the in-memory table until we actually finalize the groups, so the skew values are likely to be completely processed in the first pass. So, the attached patch implements my original approach, which I still feel is the best solution. Regards, Jeff Davis *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 3017,3022 include_dir 'conf.d' --- 3017,3037 /listitem /varlistentry + varlistentry id=guc-enable-hashagg-disk xreflabel=enable_hashagg_disk + termvarnameenable_hashagg_disk/varname (typeboolean/type) + indexterm +primaryvarnameenable_hashagg_disk/ configuration parameter/primary + /indexterm + /term + listitem +para + Enables or disables the query planner's use of hashed aggregation plan + types when the planner expects the hash table size to exceed + varnamework_mem/varname. The default is literalon/. 
+/para + /listitem + /varlistentry + varlistentry id=guc-enable-hashjoin xreflabel=enable_hashjoin termvarnameenable_hashjoin/varname (typeboolean/type) indexterm *** a/src/backend/commands/explain.c --- b/src/backend/commands/explain.c *** *** 86,91 static void show_sort_group_keys(PlanState *planstate, const char *qlabel, --- 86,92 List *ancestors, ExplainState *es); static void show_sort_info(SortState *sortstate, ExplainState *es); static void show_hash_info(HashState *hashstate, ExplainState *es); + static void show_hashagg_info(AggState *hashstate, ExplainState *es); static void show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es); static void show_instrumentation_count(const char *qlabel, int which, *** *** 1423,1428 ExplainNode(PlanState *planstate, List *ancestors, --- 1424,1430 case T_Agg: show_agg_keys((AggState *) planstate, ancestors, es); show_upper_qual(plan-qual, Filter, planstate, ancestors, es); + show_hashagg_info((AggState *) planstate, es); if (plan-qual) show_instrumentation_count(Rows Removed by Filter, 1, planstate, es); *** *** 1913,1918 show_sort_info(SortState *sortstate, ExplainState *es) --- 1915,1956 } /* + * Show information on hash aggregate buckets and batches + */ + static void + show_hashagg_info(AggState *aggstate, ExplainState *es) + { + Agg *agg = (Agg *)aggstate-ss.ps.plan; + + Assert(IsA(aggstate, AggState)); + + if (agg-aggstrategy != AGG_HASHED) + return; + + if (!aggstate
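In pseudocode, the spill-new-groups strategy argued for above works roughly like this for one pass over the input (a sketch of the idea only; the helper names are illustrative, not the patch's actual functions):

    foreach (tuple in input)
    {
        entry = lookup_hash_entry(hashtable, tuple);     /* no insert here */
        if (entry != NULL)
            advance_aggregates(entry, tuple);            /* group is in memory */
        else if (hash_table_memory() < work_mem)
            advance_aggregates(insert_hash_entry(hashtable, tuple), tuple);
        else
            spill_tuple_to_partition(tuple);             /* regrouped next pass */
    }

Because frequent (skew) values tend to appear early, they are usually resident in the hash table for the whole pass and never spill, which is the advantage claimed over the batch-splitting approach.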
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote: On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis pg...@j-davis.com wrote: I can also just move isReset there, and keep mem_allocated as a uint64. That way, if I find later that I want to track the aggregated value for the child contexts as well, I can split it into two uint32s. I'll hold off any any such optimizations until I see some numbers from HashAgg though. I took a quick look at memory-accounting-v8.patch. Is there some reason why mem_allocated is a uint64? All other things being equal, I'd follow the example of tuplesort.c's MemoryContextAllocHuge() API, which (following bugfix commit 79e0f87a1) uses int64 variables to track available memory and so on. No reason. New version attached; that's the only change. Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 438,451 AllocSetContextCreate(MemoryContext parent, Size initBlockSize, Size maxBlockSize) { ! AllocSet context; /* Do the type-independent part of context creation */ ! context = (AllocSet) MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name); /* * Make sure alloc parameters are reasonable, and save them. --- 438,454 Size initBlockSize, Size maxBlockSize) { ! AllocSet set; ! MemoryContext context; /* Do the type-independent part of context creation */ ! context = MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name); ! ! set = (AllocSet) context; /* * Make sure alloc parameters are reasonable, and save them. *** *** 459,467 AllocSetContextCreate(MemoryContext parent, if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! context-initBlockSize = initBlockSize; ! context-maxBlockSize = maxBlockSize; ! context-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be --- 462,470 if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! set-initBlockSize = initBlockSize; ! set-maxBlockSize = maxBlockSize; ! set-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be *** *** 477,486 AllocSetContextCreate(MemoryContext parent, * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! context-allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (context-allocChunkLimit + ALLOC_CHUNKHDRSZ) (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION)) ! context-allocChunkLimit = 1; /* * Grab always-allocated space, if requested --- 480,489 * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! set-allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (set-allocChunkLimit + ALLOC_CHUNKHDRSZ) (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION)) ! set-allocChunkLimit = 1; /* * Grab always-allocated space, if requested *** *** 500,519 AllocSetContextCreate(MemoryContext parent, errdetail(Failed while creating memory context \%s\., name))); } ! block-aset = context; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; ! block-next = context-blocks; ! context-blocks = block; /* Mark block as not to be released at reset time */ ! 
context-keeper = block; /* Mark unallocated space NOACCESS; leave the block header alone. */ VALGRIND_MAKE_MEM_NOACCESS(block-freeptr, blksize - ALLOC_BLOCKHDRSZ); } ! return (MemoryContext) context; } /* --- 503,525 errdetail(Failed while creating memory context \%s\., name))); } ! ! context-mem_allocated += blksize; ! ! block-aset = set; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; ! block-next = set-blocks; ! set-blocks = block; /* Mark block as not to be released at reset time */ ! set-keeper = block; /* Mark unallocated space NOACCESS; leave the block header alone. */ VALGRIND_MAKE_MEM_NOACCESS(block-freeptr, blksize - ALLOC_BLOCKHDRSZ); } ! return context; } /* *** *** 590,595 AllocSetReset(MemoryContext context) --- 596,603 else { /* Normal case, release the block */ + context-mem_allocated
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Wed, 2014-11-19 at 11:03 -0500, Robert Haas wrote: But your proposal does not solve this problem: 1. Backend A-1 acquires AccessExclusiveLock on relation R. 2. Backend A-2 waits for AccessShareLock on relation R. The good news is that the deadlock detector should realize that since A-1 and A-2 are in the same group, this is a deadlock. And it can abort either A-1 or A-2, which will eventually abort them both. Right. It can even give a nice error hint to tell the user how to avoid the problem. The bad news, to borrow a phrase from Peter Geoghegan, is that it's an unprincipled deadlock; a user confronted with the news that her parallel scan has self-deadlocked will be justifiably dismayed. You seem to be raising this as a show-stopping problem, and I'm not convinced that it is. (a) There are no parallel operators yet, so this is speculative. (b) Out of the parallel operators you imagine (e.g. Scan), we don't expect anything other than AccessShare locks in the normal case. (c) The failure mode is not so bad; it's just an error and the user can probably work around it. You could argue that we know we're headed for this problem, and therefore we should solve it now. I disagree. You are assuming that sharing exclusive heavyweight locks among a group will be a fundamental part of everything postgres does with parallelism; but not every design requires it. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Thu, 2014-11-20 at 11:22 -0500, Robert Haas wrote: 2. Propagate pre-existing locks from the user backend to all the workers. I initially proposed #1, but now I think #2 solves more of the problems for less code. OK. The primary concern with that is unintended consequences. But it's reasonable for you to ask for something more concrete. I will think on this more. A few things I'm thinking about now: * What do you mean by pre-existing? Locks existing prior to what event? (I don't think that's exactly what you meant.) * What's the conceptual difference between granting locks that would otherwise conflict with another process in the group (which is what this proposal is about) and having exactly the same set of locks? Is there any? * Let's say you have processes A1 and A2 in one group, and B. A1 and B both have an AccessShare lock, and A2 tries to acquire an exclusive lock. B is waiting on A2. That's still a deadlock, right? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Mon, 2014-11-17 at 18:04 +0100, Andres Freund wrote: That's quite possibly one culprit for the slowdown. Right now one AllocSetContext struct fits precisely into three cachelines. After your change it doesn't anymore. Thank you! I made the same mistake as Tomas: I saw that MemoryContextData went to 64 bytes and thought that should be fine while ignoring AllocSetContext. I did some tests in the past between a version of my patch that made MemoryContextData 72 bytes and one that made it 64 bytes, but I may have neglected to test against the original 56 byte version. I'm still not able to test to see if this is the real problem, but it seems best to eliminate it anyway. Consider not counting memory in bytes but blocks and adding it directly after the NodeTag. That'd leave the size the same as right now. I can also just move isReset there, and keep mem_allocated as a uint64. That way, if I find later that I want to track the aggregated value for the child contexts as well, I can split it into two uint32s. I'll hold off any any such optimizations until I see some numbers from HashAgg though. Attached new version. Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 438,451 AllocSetContextCreate(MemoryContext parent, Size initBlockSize, Size maxBlockSize) { ! AllocSet context; /* Do the type-independent part of context creation */ ! context = (AllocSet) MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name); /* * Make sure alloc parameters are reasonable, and save them. --- 438,454 Size initBlockSize, Size maxBlockSize) { ! AllocSet set; ! MemoryContext context; /* Do the type-independent part of context creation */ ! context = MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name); ! ! set = (AllocSet) context; /* * Make sure alloc parameters are reasonable, and save them. *** *** 459,467 AllocSetContextCreate(MemoryContext parent, if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! context-initBlockSize = initBlockSize; ! context-maxBlockSize = maxBlockSize; ! context-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be --- 462,470 if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! set-initBlockSize = initBlockSize; ! set-maxBlockSize = maxBlockSize; ! set-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be *** *** 477,486 AllocSetContextCreate(MemoryContext parent, * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! context-allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (context-allocChunkLimit + ALLOC_CHUNKHDRSZ) (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION)) ! context-allocChunkLimit = 1; /* * Grab always-allocated space, if requested --- 480,489 * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! set-allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (set-allocChunkLimit + ALLOC_CHUNKHDRSZ) (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION)) ! 
set-allocChunkLimit = 1; /* * Grab always-allocated space, if requested *** *** 500,519 AllocSetContextCreate(MemoryContext parent, errdetail(Failed while creating memory context \%s\., name))); } ! block-aset = context; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; ! block-next = context-blocks; ! context-blocks = block; /* Mark block as not to be released at reset time */ ! context-keeper = block; /* Mark unallocated space NOACCESS; leave the block header alone. */ VALGRIND_MAKE_MEM_NOACCESS(block-freeptr, blksize - ALLOC_BLOCKHDRSZ); } ! return (MemoryContext) context; } /* --- 503,525 errdetail(Failed while creating memory context \%s\., name))); } ! ! context-mem_allocated += blksize; ! ! block-aset = set; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; ! block-next = set-blocks; ! set-blocks = block; /* Mark block as not to be released at reset time */ ! set-keeper = block; /* Mark unallocated space
Re: [HACKERS] group locking: incomplete patch, just for discussion
Great discussion. Robert, I think you make a lot of very valid points, but philosophically I am in pretty strong agreement with Andres here. On Fri, 2014-10-31 at 14:29 -0400, Robert Haas wrote: So just refusing parallelism as soon as anything has taken an access exclusive lock doesn't sound too bad to me. That restriction seems onerous to me; for example, it would prevent a parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL. Those seems like worthy targets for parallelism, and maybe even early targets, since I expect a lot of the real complexity here to be in the query planner, and you can avoid most of that complexity when planning DDL. If two backends both have an exclusive lock on the relation for a join operation, that implies that they need to do their own synchronization, because obviously the lock manager is not doing it for them. In turn, that implies that we need parallel versions of these operations; going to each one and parallelizing it. I don't think that's a good long-term strategy... it changes our simple executor (tuples in, tuples out) to something quite different (each node needs to know a lot about whatever is underneath it in order to divide the data up). process and both requests would have been granted. Parallelism is generally not about making different things happen; it's about spreading the stuff that would happen anyway across multiple backends, This point would be much stronger for declarative queries. When the user issues LOCK TABLE, they are getting their hands dirty with the implementation to some degree. If they really want to serialize, they can serialize around the lock of an object that they don't want to also read (in parallel). You're right that this would be surprising to users, but hopefully we can find a way to disable parallelism in that case, or at least issue a WARNING that the user is probably doing the wrong thing. cluster pg_collate... the only solution is to move the parallel workers ahead of the AccessExclusiveLock request, but the deadlock detector won't do that unless it knows about the relationship among the parallel workers. As discussed later in this thread, I also consider failure a solution ;-) So we don't need to move anything. If we have to tell our users that exclusive-locking a catalog can cause deadlocks for parallel query, I think they will find that reasonable. The same admin could turn parallelism down to one for non-superusers before they start doing that, and restore it later. patch that does parallel anything has got to worry about what undetected deadlocks might result, and then argue about which of them are obscure enough that we shouldn't care and which are not. That doesn't sound like a fun way to spend my time. We're in agreement that deadlocks can't go undetected. But, really, the first case is the one I'm more worried about: a bunch of parallel backends all grab the same locks, but slightly separated in time. A strong locker joins the queue midway through and we're hosed. I agree the situation sounds bad in the abstract, but none of the examples seem very compelling. Who locks random catalogs exclusively? And if they do, why is it the lock manager's job to try to reconcile the resulting mess without canceling someone? (OK, don't answer that. It obviously is the lock manager's job to some degree, but we can only demand so much.) Even if you think that all of these particular issues are somehow avoidable or that they don't matter, I think it would be unwise to bet on them being the only issues. 
The lock manager looks like it's old, ill-maintained code, but that's in no small part because it solved the problem it aims at so thoroughly that few people have felt the need to modify it very much in the last decade. I think that's part of why there's so much skepticism on this thread - we don't think about the problems that the lock manager solves as being important problems not because they truly aren't important, but because they are so thoroughly solved. Then every once in a while we complain about the performance. Guilty as charged. I won't forever oppose changes to the lock manager, but I resist making them a precondition of parallelism (aside from making deadlock detection work, of course). Users taking out exclusive locks recklessly can result in deadlocks today, and will result in more when we do parallelism (though perhaps in more surprising situations). I remember when doing Exclusion Constraints, I went to a lot of effort to make deadlocks impossible, and was quite proud. When Tom saw it, he told me not to bother, and to do it the simple way instead, because deadlocks can happen even with UNIQUE constraints (which I didn't even know). We should use the same strategy here. When we see deadlocks becoming a problem for any reasonable workload, we make a series of tweaks (perhaps some to the lock manager itself) to reduce them. Regards, Jeff
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Thu, Nov 13, 2014 at 11:26 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Nov 13, 2014 at 3:38 AM, Jeff Davis pg...@j-davis.com wrote: If two backends both have an exclusive lock on the relation for a join operation, that implies that they need to do their own synchronization, because obviously the lock manager is not doing it for them. This doesn't make sense to me. Why would they need to synchronize access to a relation in order to join it? Unfortunate typo: that was supposed to be joint operation, just meaning that they are working together for something (e.g. CLUSTER, VACUUM FULL as you suggest). Sorry for the confusion. I meant that the backends need to divide up the work somehow. And if each operator needs to divide up the work before operating, that means we need to change every operator. Regards, Jeff Davis
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Wed, 2014-11-12 at 14:16 -0500, Robert Haas wrote: Detected deadlocks are fine. Improving the deadlock detector is the heart of what needs to be done here. OK, great. As you say, the lock requests we're talking about will rarely wait, so deadlocks won't be frequent. The issue is making sure that, if they do happen, we get a better behavior than "your parallel query hangs forever; good luck figuring out why." Right. We can still use this patch's notion of a lock group in the deadlock detector, but we don't need it to actually affect the way a lock is granted. That should eliminate concerns about subtle bugs. Later, after we understand how this is actually used, and if we see deadlock problems, we can look for ways to solve/mitigate them. This seems to be what Andres was saying, here: http://www.postgresql.org/message-id/20141031130727.gf13...@awork2.anarazel.de So I'll follow up in that thread, because it's an interesting discussion. More generally, I think there's some misunderstanding about the overall goal of the parallelism infrastructure that I'm trying to create. ... But my goal is in some ways the opposite: I'm trying to make it possible to run as much existing PostgreSQL backend code as possible inside a parallel worker without any modification. Thank you for clarifying, I think this is a good approach. Back to the patch: If I understand correctly, the _primary_ goal of this patch is to make it safe to take out heavyweight locks in worker processes, even if the deadlock involves LWLocks/latches synchronizing among the processes within a lock group. For example, say processes A1 and A2 are in the same lock group, and B is in a different lock group. A1 is holding heavyweight lock H1 and waiting on a LW lock L1; A2 is holding L1 and waiting on heavyweight lock H2; and B is holding H2 and waiting on H1. The current deadlock detector would see a dependency graph like: A2 -> B -> A1 But with lock groups, it would see: (A1 A2) -> B -> (A1 A2) which is a cycle, and can be detected regardless of the synchronization method used between A1 and A2. There are some details to work out to avoid false positives, of course. Is that about right? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
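One way to express that graph collapse is to map every backend to one representative node per lock group before the cycle search runs (a hypothetical sketch; a lockGroupLeader field like this does not exist in the tree at this point):

    /* all members of a lock group count as a single node in the waits-for graph */
    static PGPROC *
    deadlock_graph_node(PGPROC *proc)
    {
        return (proc->lockGroupLeader != NULL) ? proc->lockGroupLeader : proc;
    }

With that mapping, the chain A2 -> B -> A1 above becomes A -> B -> A, which the existing cycle search would already report.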
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Wed, 2014-10-15 at 00:03 -0400, Robert Haas wrote: For parallelism, I think we need a concept of group locking. That is, suppose we have a user backend and N worker backends collaborating to execute some query. For the sake of argument, let's say it's a parallel CLUSTER, requiring a full table lock. We need all of the backends to be able to lock the table even though at least one of them holds AccessExclusiveLock. This suggests that the backends should all be members of a locking group, and that locks within the same locking group should be regarded as mutually non-conflicting. Trying to catch up on this thread, please excuse me if these questions are already covered. You mention the possibility of undetected deadlocks, which is surely unacceptable. But why not improve the deadlock detector? How bad are _detected_ deadlocks? A lot of the concern was around catalog accesses, but those lock requests would rarely wait anyway. I also wonder if group locking is generally the wrong approach to parallelism. Parallel scan/sort should work by assigning workers to chunks of data, and then merging the results. In principle the workers don't need to touch the same data at all, so why are we trying so hard to get them to all take the same locks? The reason, I assume, is that a full table is the lockable object, but we are imagining the segment files as the chunks. But maybe there's a way to address this more directly with an extra field in the lock tag, and perhaps some catalog knowledge? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote: What about removing the callback per se and just keeping the argument, as it were. That is, a MemoryContext has a field of type size_t * that is normally NULL, but if set to non-NULL, then we increment the pointed-to value for pallocs and decrement for pfrees. I like that idea; patch attached. Two differences: * At block level, not chunk level. * I use a uint64, because a size_t is only guaranteed to hold one allocation, and we need to sum up many allocations. When unused, it does still appear to have a little overhead in Robert's test on his power machine. It seems to be between a 0.5-1.0% regression. There's not much extra code on that path, just a branch, pointer dereference, and an addition; so I don't think it will get much cheaper than it is. This regression seems harder to reproduce on my workstation, or perhaps it's just noisier. One thought in either case is that we don't have to touch the API for MemoryContextCreate: rather, the feature can be enabled in a separate call after making the context. That seems fairly awkward to me because the pointer needs to be inherited from the parent context when not specified. I left the extra API call in. The inheritance is awkward anyway, though. If you create a tracked context as a child of an already-tracked context, allocations in the newer one won't count against the original. I don't see a way around that without introducing even more performance problems. If this patch is unacceptable, my only remaining idea is to add the memory only for the current context with no inheritance (thereby eliminating the branch, also). That will require HashAgg to traverse all the child contexts to check whether the memory limit has been exceeded. As Tomas pointed out, that could be a lot of work in the case of array_agg with many groups. Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 438,451 AllocSetContextCreate(MemoryContext parent, Size initBlockSize, Size maxBlockSize) { ! AllocSet context; /* Do the type-independent part of context creation */ ! context = (AllocSet) MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name); /* * Make sure alloc parameters are reasonable, and save them. --- 438,482 Size initBlockSize, Size maxBlockSize) { ! /* inherit 'track_mem' from parent context, if available */ ! uint64 *track_mem = (parent == NULL) ? NULL : parent-track_mem; ! ! return AllocSetContextCreateTracked(parent, name, minContextSize, ! initBlockSize, maxBlockSize, ! track_mem); ! } ! ! /* ! * AllocSetContextCreateTracked ! * Create a new AllocSet context, tracking total memory usage. ! * ! * parent: parent context, or NULL if top-level context ! * name: name of context (for debugging --- string will be copied) ! * minContextSize: minimum context size ! * initBlockSize: initial allocation block size ! * maxBlockSize: maximum allocation block size ! * track_mem: caller-supplied variable to use for memory tracking ! */ ! MemoryContext ! AllocSetContextCreateTracked(MemoryContext parent, ! const char *name, ! Size minContextSize, ! Size initBlockSize, ! Size maxBlockSize, ! uint64 *track_mem) ! { ! AllocSet set; ! MemoryContext context; /* Do the type-independent part of context creation */ ! context = MemoryContextCreate(T_AllocSetContext, ! sizeof(AllocSetContext), ! AllocSetMethods, ! parent, ! name, ! track_mem); ! ! set = (AllocSet) context; /* * Make sure alloc parameters are reasonable, and save them. 
*** *** 459,467 AllocSetContextCreate(MemoryContext parent, if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! context-initBlockSize = initBlockSize; ! context-maxBlockSize = maxBlockSize; ! context-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be --- 490,498 if (maxBlockSize initBlockSize) maxBlockSize = initBlockSize; Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */ ! set-initBlockSize = initBlockSize; ! set-maxBlockSize = maxBlockSize; ! set-nextBlockSize = initBlockSize; /* * Compute the allocation chunk size limit for this context. It can't be *** *** 477,486 AllocSetContextCreate(MemoryContext parent, * and actually-allocated sizes of any chunk must be on the same side of * the limit, else we get confused about whether the chunk is big. */ ! context-allocChunkLimit = ALLOC_CHUNK_LIMIT; ! while ((Size) (context-allocChunkLimit + ALLOC_CHUNKHDRSZ
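For readers skimming the flattened diff above, here is a minimal standalone model of the counter-pointer idea (not the patch itself; ToyContext and the helper names are invented): each context may carry a pointer to a shared uint64 counter, inherited from its parent, and block-level allocation/free adjust that single counter so no context-tree walk is ever needed.

#include <stdint.h>
#include <stdlib.h>

typedef struct ToyContext
{
    struct ToyContext *parent;
    uint64_t   *track_mem;      /* shared counter, or NULL if untracked */
} ToyContext;

/* Child inherits the parent's counter unless the caller supplies its own. */
static ToyContext *
toy_context_create(ToyContext *parent, uint64_t *track_mem)
{
    ToyContext *cxt = malloc(sizeof(ToyContext));

    if (track_mem == NULL && parent != NULL)
        track_mem = parent->track_mem;
    cxt->parent = parent;
    cxt->track_mem = track_mem;
    return cxt;
}

/* Called once per block allocation: a branch, a dereference, an addition. */
static void
toy_block_accounting(ToyContext *cxt, int64_t delta)
{
    if (cxt->track_mem)
        *cxt->track_mem += delta;   /* negative delta on block free */
}

Note how this model reproduces the caveat mentioned in the message: if a child is created with its own counter under an already-tracked parent, its allocations no longer roll up into the parent's counter.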
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Fri, 2014-08-22 at 12:34 -0400, Robert Haas wrote: Still doesn't look good here. On the same PPC64 machine I've been using for testing: I have a new approach to the patch which is to call a callback at each block allocation and child contexts inherit the callback from their parents. The callback could be defined to simply dereference and increment its argument, which would mean block allocations anywhere in the hierarchy are incrementing the same variable, avoiding the need to traverse the hierarchy. I've made some progress I think (thanks to both Robert and Tomas), but I have a bit more microoptimization and testing to do. I'll mark it returned with feedback for now, though if I find the time I'll do more testing to see if the performance concerns are fully addressed. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
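A tiny sketch of what the callback variant described above could look like, with hypothetical names (the later message replaces the callback with a bare counter pointer, but the shape is similar):

#include <stdint.h>

/* Hypothetical per-block accounting hook; arg is supplied at context creation. */
typedef void (*block_alloc_hook) (void *arg, int64_t delta);

typedef struct HookedContext
{
    struct HookedContext *parent;
    block_alloc_hook hook;      /* inherited from parent when not overridden */
    void       *hook_arg;
} HookedContext;

/* The callback described above: simply bump a counter through its argument. */
static void
count_block_bytes(void *arg, int64_t delta)
{
    *(uint64_t *) arg += delta;
}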
Re: [HACKERS] Proposal for CSN based snapshots
On Fri, 2014-06-13 at 13:24 +0300, Heikki Linnakangas wrote: Attached is a new patch. It now keeps the current pg_clog unchanged, but adds a new pg_csnlog besides it. pg_csnlog is more similar to pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and segments older than GlobalXmin can be truncated.

It appears that this patch weakens the idea of hint bits. Even if HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the snapshot. That's fast if the xid is less than snap->xmin, but otherwise it needs to do a fetch from the csnlog, which is exactly what the hint bits are designed to avoid. And we can't get around this, because the whole point of this patch is to remove the xip array from the snapshot.

If the transaction was committed a long time ago, then we could set PD_ALL_VISIBLE and the VM bit, and a scan wouldn't even look at the hint bit. If it was committed recently, then it's probably greater than the recentXmin. I think there's still room for a hint bit to technically be useful, but it seems quite narrow unless I'm missing something (and a narrowly-useful hint bit doesn't seem to be useful at all).

I'm not complaining, and I hope this is not a showstopper for this patch, but I think it's worth discussing.

Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
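To make the point concrete, here is a minimal sketch (hypothetical names, not the patch's code) of why the hint bit no longer short-circuits the check: knowing that xmin committed is not enough, the snapshot needs to know when it committed, and that answer lives in the csnlog.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;
typedef uint64_t ToyCSN;

typedef struct ToySnapshot
{
    ToyXid      xmin;           /* all xids below this are settled */
    ToyCSN      snapshot_csn;   /* commits at or below this are visible */
} ToySnapshot;

/* Assumed helper standing in for an SLRU fetch; returns 0 if not committed. */
extern ToyCSN csnlog_get_commit_csn(ToyXid xid);

static bool
xmin_visible(ToyXid xmin, bool hinted_committed, ToySnapshot *snap)
{
    ToyCSN      csn;

    if (xmin < snap->xmin)
        return true;            /* fast path: no csnlog lookup needed */

    /*
     * HEAP_XMIN_COMMITTED tells us the transaction committed at some point,
     * but not whether it committed before this snapshot; answering that
     * still requires a csnlog fetch.
     */
    if (!hinted_committed)
        return false;           /* toy model: treat unhinted as not visible */

    csn = csnlog_get_commit_csn(xmin);
    return csn != 0 && csn <= snap->snapshot_csn;
}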
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote: My current thinking is that access to the csnlog will need to be made faster. Currently, it's just another SLRU, but I'm sure we can do better than that. Or add a backend-private cache on top of it; that might already alleviate it enough...

Aren't those the same ideas that have been proposed for eliminating hint bits? If this patch makes taking snapshots much faster, then perhaps it's enough incentive to follow through on one of those ideas. I am certainly open to removing hint bits (which was probably clear from earlier discussions) if there are significant wins elsewhere and the penalty is not too great.

To continue the discussion on this patch, let's assume that we make TransactionIdGetCommitLSN sufficiently fast. If that's the case, can't we make HeapTupleSatisfiesMVCC look more like:

    LsnMin = TransactionIdGetCommitLSN(xmin);
    if (!COMMITLSN_IS_COMMITTED(LsnMin))
        LsnMin = BIG_LSN;

    LsnMax = TransactionIdGetCommitLSN(xmax);
    if (!COMMITLSN_IS_COMMITTED(LsnMax))
        LsnMax = BIG_LSN;

    return (snapshot->snapshotlsn >= LsnMin &&
            snapshot->snapshotlsn < LsnMax);

There would need to be some handling for locked tuples, or tuples related to the current transaction, of course. But I still think it would turn out simpler; perhaps by enough to save a few cycles.

Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
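For clarity, a self-contained version of that sketch with the assumed pieces spelled out (BIG_LSN, COMMITLSN_IS_COMMITTED and the toy signature of TransactionIdGetCommitLSN are assumptions for illustration, not the patch's definitions):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;
typedef uint64_t ToyLSN;

#define BIG_LSN                     UINT64_MAX  /* "not committed yet" sentinel */
#define COMMITLSN_IS_COMMITTED(lsn) ((lsn) != 0)

typedef struct ToySnapshot
{
    ToyLSN      snapshotlsn;    /* commits at or before this LSN are visible */
} ToySnapshot;

/* Assumed fast; returns 0 for in-progress, aborted, or invalid xids. */
extern ToyLSN TransactionIdGetCommitLSN(ToyXid xid);

/*
 * Visible iff xmin committed at or before the snapshot and xmax did not.
 * An uncommitted (or absent) xmax maps to BIG_LSN, i.e. "deleted infinitely
 * far in the future", so the tuple is still visible to this snapshot.
 */
static bool
toy_satisfies_mvcc(ToyXid xmin, ToyXid xmax, ToySnapshot *snapshot)
{
    ToyLSN      LsnMin = TransactionIdGetCommitLSN(xmin);
    ToyLSN      LsnMax = TransactionIdGetCommitLSN(xmax);

    if (!COMMITLSN_IS_COMMITTED(LsnMin))
        LsnMin = BIG_LSN;
    if (!COMMITLSN_IS_COMMITTED(LsnMax))
        LsnMax = BIG_LSN;

    return snapshot->snapshotlsn >= LsnMin &&
           snapshot->snapshotlsn < LsnMax;
}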
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, 2014-08-26 at 19:25 +0100, Greg Stark wrote: I don't immediately see how to make that practical. One thought would be to have a list of xids in the page header with their corresponding csn -- which starts to sound a lot like Oralce's Interested Transaction List. But I don't see how to make that work for the hundreds of possible xids on the page. I feel like that's moving in the wrong direction. That's still causing a lot of modifications to a data page when the data is not changing, and that's bad for a lot of cases that I'm interested in (checksums are one example). We are mixing two kinds of data: user data and visibility information. Each is changed under different circumstances and has different characteristics, and I'm beginning to think they shouldn't be mixed at all. What if we just devised a structure specially designed to hold visibility information, put all of the visibility information there, and data pages would only change where there is a real, user-initiated I/U/D. Vacuum could still clear out dead tuples from the data area, but it would do the rest of its work on the visibility structure. It could even be a clever structure that could compress away large static areas until they become active again. Maybe this wouldn't work for all tables, but could be an option for big tables with low update rates. The worst case for visibility resolution is you have a narrow table that has random access DDL happening all the time, each update is a short transaction and there are a very high rate of such transactions spread out uniformly over a very large table. That means any given page has over 200 rows with random xids spread over a very large range of xids. That's not necessarily a bad case, unless the CLOG/CSNLOG lookup is a significant fraction of the effort to update a tuple. That would be a bad case if you introduce scans into the equation as well, but that's not a problem if the all-visible bit is set. Currently the invariant hint bits give us is that each xid needs to be looked up in the clog only a more or less fixed number of times, in that scenario only once since the table is very large and the transactions short lived. A backend-local cache might accomplish that, as well (would still need to do a lookup, but no locks or contention). There would be some challenges around invalidation (for xid wraparound) and pre-warming the cache (so establishing a lot of connections doesn't cause a lot of CLOG access). Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
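A minimal sketch of the kind of backend-local cache meant here (sizes and names are illustrative): a small direct-mapped table keyed by xid. It assumes callers only ask about transactions that have already finished (in-progress statuses must never be cached), and, as noted above, it has to be wiped at xid wraparound.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t ToyXid;

#define XID_CACHE_SIZE 1024     /* power of two, purely illustrative */

typedef struct XidCacheEntry
{
    ToyXid      xid;            /* 0 means "empty slot" (xid 0 is invalid anyway) */
    bool        committed;
} XidCacheEntry;

static XidCacheEntry xid_cache[XID_CACHE_SIZE];

/* The expensive shared-memory lookup we are trying to avoid repeating. */
extern bool clog_xid_did_commit(ToyXid xid);

/* Consult the local cache first; fall back to the clog on a miss. */
static bool
xid_did_commit_cached(ToyXid xid)
{
    XidCacheEntry *e = &xid_cache[xid & (XID_CACHE_SIZE - 1)];

    if (e->xid == xid)
        return e->committed;    /* hit: no locks, no shared memory traffic */

    e->xid = xid;
    e->committed = clog_xid_did_commit(xid);
    return e->committed;
}

/* At xid wraparound (or other invalidation events) the cache must be reset. */
static void
xid_cache_reset(void)
{
    memset(xid_cache, 0, sizeof(xid_cache));
}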
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Tue, 2014-08-26 at 12:39 +0300, Heikki Linnakangas wrote: I think this is enough for this commitfest - we have consensus on the design. For the next one, please address those open items, and resubmit. Agreed, return with feedback. I need to get the accounting patch in first, which needs to address some performance issues, but there's a chance of wrapping those up quickly. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, 2014-08-26 at 21:00 +0100, Greg Stark wrote: Well fundamentally the reason the visibility information is with the user data is so that we don't need to do two i/os to access the data. The whole point of hint bits is to guarantee that after some amount of time you can read data directly out of the heap page and use it without doing any additional I/O. If the data is that static, then the visibility information would be highly compressible, and surely in shared_buffers already. (Yes, it would need to be pinned, which has a cost.) Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote: Actually, that gets optimized to a constant in the planner: Oops, thank you (and Tom). your patch seems to be about 2x-3x as slow as unpatched master. So this needs some optimization. A couple of ideas: I didn't see anywhere near that kind of regression. On unpatched master, with your test case, I saw it stabilize to about 680ms. With similar-escape-1, I saw about 775ms (15% regression). Are those at all close to your numbers? Is there a chance you used an unoptimized build for one of them, or left asserts enabled? 1. If the escape string is in fact a single-byte character, you can proceed with the loop just as it is today, without the pg_mblen calls. 2. Since pg_mblen() will always return an integer between 1-6, it would probably be faster to replace the memcpy() and memcmp() calls with simple for-loops iterating byte-by-byte. In very brief testing, with the 1. change above, the performance with this patch is back to what it's without the patch. See attached. The particular patch has a mistake: the first branch is always taken because pg_mblen() won't return 0. It's also fairly ugly to set mblen in the test for the branch that doesn't use it. Attached a patch implementing the same idea though: only use the multibyte path if *both* the escape char and the current character from the pattern are multibyte. I also changed the comment to more clearly state the behavior upon which we're relying. I hope what I said is accurate. Regards, Jeff Davis *** a/src/backend/utils/adt/regexp.c --- b/src/backend/utils/adt/regexp.c *** *** 688,698 similar_escape(PG_FUNCTION_ARGS) elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else if (elen != 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg(invalid escape string), ! errhint(Escape string must be empty or one character.))); } /*-- --- 688,703 elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ ! else ! { ! int escape_mblen = pg_mbstrlen_with_len(e, elen); ! ! if (escape_mblen 1) ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), ! errmsg(invalid escape string), ! errhint(Escape string must be empty or one character.))); ! } } /*-- *** *** 724,729 similar_escape(PG_FUNCTION_ARGS) --- 729,782 { char pchar = *p; + /* + * If both the escape character and the current character from the + * pattern are multi-byte, we need to take the slow path. + * + * But if one of them is single-byte, we can process the the pattern + * one byte at a time, ignoring multi-byte characters. (This works + * because all server-encodings have the property that a valid + * multi-byte character representation cannot contain the + * representation of a valid single-byte character.) + */ + + if (elen 1) + { + int mblen = pg_mblen(p); + if (mblen 1) + { + /* slow, multi-byte path */ + if (afterescape) + { + *r++ = '\\'; + memcpy(r, p, mblen); + r += mblen; + afterescape = false; + } + else if (e elen == mblen memcmp(e, p, mblen) == 0) + { + /* SQL99 escape character; do not send to output */ + afterescape = true; + } + else + { + /* + * We know it's a multi-byte character, so we don't need + * to do all the comparisons to single-byte characters + * that we do below. 
+ */ + memcpy(r, p, mblen); + r += mblen; + } + + p += mblen; + plen -= mblen; + + continue; + } + } + + /* fast path */ if (afterescape) { if (pchar == '' !incharclass) /* for SUBSTRING patterns */ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
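To illustrate the dispatch the patch relies on, independent of the SIMILAR TO rewriting details, here is a standalone model (names invented) that simply strips a possibly multi-byte escape character out of a pattern. Only when both the escape character and the current pattern character are multi-byte does it take the slow memcmp path; otherwise it works byte at a time, which is safe because in server encodings no byte of a multi-byte character can equal a valid single-byte character.

#include <stdbool.h>
#include <string.h>

/* Stand-in for pg_mblen(): byte length of the character starting at p. */
extern int  toy_mblen(const char *p);

static void
toy_strip_escape(const char *p, int plen,   /* pattern */
                 const char *e, int elen,   /* escape char, 0 bytes = none */
                 char *out)                 /* assumed large enough */
{
    bool        afterescape = false;

    while (plen > 0)
    {
        int         mblen = (elen > 1) ? toy_mblen(p) : 1;

        if (mblen > 1)
        {
            /* slow path: compare whole multi-byte characters */
            if (!afterescape && elen == mblen && memcmp(e, p, mblen) == 0)
                afterescape = true;     /* escape char itself: not copied */
            else
            {
                memcpy(out, p, mblen);
                out += mblen;
                afterescape = false;
            }
        }
        else
        {
            /* fast path: single-byte processing, as before the patch */
            if (!afterescape && elen == 1 && *p == *e)
                afterescape = true;
            else
            {
                *out++ = *p;
                afterescape = false;
            }
        }
        p += mblen;
        plen -= mblen;
    }
    *out = '\0';
}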
Re: [HACKERS] Proposal for CSN based snapshots
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote: Yeah. This patch in the current state is likely much much slower than unpatched master, except in extreme cases where you have thousands of connections and short transactions so that without the patch, you spend most of the time acquiring snapshots.

What else are you looking to accomplish with this patch during this 'fest? Bug finding? Design review? Performance testing?

I haven't had a good track record with my performance testing recently, so I'm unlikely to be much help there. It seems a bit early for bug hunting, unless you think it will raise possible design issues. I think there's already at least one design issue to consider, which is whether we care about CLOG/CSNLOG access for hinted records where the xid > snapshot->xmin (that is, accesses that previously would have looked at xip). Would more discussion help here or do we need to wait for performance numbers?

Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
On Tue, 2014-08-26 at 22:13 -0700, Jeff Davis wrote: Attached a patch implementing the same idea though: only use the multibyte path if *both* the escape char and the current character from the pattern are multibyte. Forgot to mention: with this patch, the test completes in about 720ms, so just between unpatched and the v1 patch. I feel like we're spending too much time optimizing this patch, but we got this far, and it doesn't really complicate the code, so we might as well finish it. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Wed, 2014-08-20 at 14:32 -0400, Robert Haas wrote: Well, I think you're certainly right that a hash table lookup is more expensive than modulo on a 32-bit integer; so much is obvious. But if the load factor is not too large, I think that it's not a *lot* more expensive, so it could be worth it if it gives us other advantages. As I see it, the advantage of Jeff's approach is that it doesn't really matter whether our estimates are accurate or not. We don't have to decide at the beginning how many batches to do, and then possibly end up using too much or too little memory per batch if we're wrong; we can let the amount of memory actually used during execution determine the number of batches. That seems good. Of course, a hash join can increase the number of batches on the fly, but only by doubling it, so you might go from 4 batches to 8 when 5 would really have been enough. And a hash join also can't *reduce* the number of batches on the fly, which might matter a lot. Getting the number of batches right avoids I/O, which is a lot more expensive than CPU. My approach uses partition counts that are powers-of-two also, so I don't think that's a big differentiator. In principle my algorithm could adapt to other partition counts, but I'm not sure how big of an advantage there is. I think the big benefit of my approach is that it doesn't needlessly evict groups from the hashtable. Consider input like 0, 1, 0, 2, ..., 0, N. For large N, if you evict group 0, you have to write out about N tuples; but if you leave it in, you only have to write out about N/2 tuples. The hashjoin approach doesn't give you any control over eviction, so you only have about 1/P chance of saving the skew group (where P is the ultimate number of partitions). With my approach, we'd always keep the skew group in memory (unless we're very unlucky, and the hash table fills up before we even see the skew value). Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
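A sketch of the insertion decision being described, with hypothetical helper names: a group that is already resident in the hash table keeps being advanced in place even after memory fills up, so the skew group's tuples (the repeated "0" in the example) are never spilled; only tuples belonging to new groups are written to a partition file.

#include <stdbool.h>
#include <stdint.h>

typedef struct ToyTuple ToyTuple;

/* Assumed helpers standing in for the real executor machinery. */
extern void *hashtable_find(uint32_t hash, const ToyTuple *tup);    /* NULL if absent */
extern void *hashtable_insert(uint32_t hash, const ToyTuple *tup);  /* new group state */
extern bool  hashtable_is_full(void);                               /* work_mem reached */
extern void  advance_transition(void *group_state, const ToyTuple *tup);
extern void  spill_tuple_to_partition(uint32_t hash, const ToyTuple *tup);

static void
toy_hashagg_insert(uint32_t hash, const ToyTuple *tup)
{
    void       *group = hashtable_find(hash, tup);

    if (group == NULL)
    {
        if (hashtable_is_full())
        {
            /* new group, no memory left: spill the raw tuple, evict nothing */
            spill_tuple_to_partition(hash, tup);
            return;
        }
        group = hashtable_insert(hash, tup);
    }

    /* group already resident (e.g. the skew value "0"): aggregate in place */
    advance_transition(group, tup);
}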
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote: The use-case for this is tracking a chosen subtree of contexts - e.g. aggcontext and below, so I'd expect the tracked subtrees to be relatively shallow. Am I right? Right. My fear is that by removing the inheritance bit, we'll hurt cases with a lot of child contexts. For example, array_agg currently creates a separate context for each group - what happens if you have 100k groups and do MemoryContextGetAllocated? I guess iterating over 100k groups is not free. Good point. We don't want to make checking the allocated space into an expensive operation, because then we will be forced to call it less frequently, which implies that we'd be sloppier about actually fitting in work_mem. Wouldn't the solution with inheritance and propagating the accounting info to the parent actually better? Or maybe allowing both, having two flags when creating a context instead of one? That seems overly complicated. We only have one driving use case, so two orthogonal options sounds excessive. Perhaps one option if we can't solve the performance problem and we need to isolate the changes to hashagg. Also, do we really need to track allocated bytes - couldn't we track kilobytes or something and use smaller data types to get below the 64B? Good idea. I attached a patch that uses two uint32 fields so that it doesn't increase the size of MemoryContextData, and it tracks memory usage for all contexts. I was unable to detect any performance regression vs. master, but on my machine the results are noisy. It would be easier to resolve the performance concern if I could reliably get the results Robert is getting. I think I was able to reproduce the regression with the old patch, but the results were still noisy. Regards, Jeff Davis *** a/src/backend/utils/mmgr/aset.c --- b/src/backend/utils/mmgr/aset.c *** *** 242,247 typedef struct AllocChunkData --- 242,249 #define AllocChunkGetPointer(chk) \ ((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ)) + static void update_allocation(MemoryContext context, size_t size); + /* * These functions implement the MemoryContext API for AllocSet contexts. */ *** *** 250,256 static void AllocSetFree(MemoryContext context, void *pointer); static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size); static void AllocSetInit(MemoryContext context); static void AllocSetReset(MemoryContext context); ! static void AllocSetDelete(MemoryContext context); static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer); static bool AllocSetIsEmpty(MemoryContext context); static void AllocSetStats(MemoryContext context, int level); --- 252,258 static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size); static void AllocSetInit(MemoryContext context); static void AllocSetReset(MemoryContext context); ! 
static void AllocSetDelete(MemoryContext context, MemoryContext parent); static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer); static bool AllocSetIsEmpty(MemoryContext context); static void AllocSetStats(MemoryContext context, int level); *** *** 500,505 AllocSetContextCreate(MemoryContext parent, --- 502,510 errdetail(Failed while creating memory context \%s\., name))); } + + update_allocation((MemoryContext) context, blksize); + block-aset = context; block-freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ; block-endptr = ((char *) block) + blksize; *** *** 590,595 AllocSetReset(MemoryContext context) --- 595,601 else { /* Normal case, release the block */ + update_allocation(context, -(block-endptr - ((char*) block))); #ifdef CLOBBER_FREED_MEMORY wipe_mem(block, block-freeptr - ((char *) block)); #endif *** *** 611,617 AllocSetReset(MemoryContext context) * But note we are not responsible for deleting the context node itself. */ static void ! AllocSetDelete(MemoryContext context) { AllocSet set = (AllocSet) context; AllocBlock block = set-blocks; --- 617,623 * But note we are not responsible for deleting the context node itself. */ static void ! AllocSetDelete(MemoryContext context, MemoryContext parent) { AllocSet set = (AllocSet) context; AllocBlock block = set-blocks; *** *** 623,628 AllocSetDelete(MemoryContext context) --- 629,644 AllocSetCheck(context); #endif + /* + * Parent is already unlinked from context, so can't use + * update_allocation(). + */ + while (parent != NULL) + { + parent-total_allocated -= context-total_allocated; + parent = parent-parent; + } + /* Make it look empty, just in case... */ MemSetAligned(set-freelist, 0, sizeof(set-freelist)); set-blocks = NULL; *** *** 678,683 AllocSetAlloc(MemoryContext context, Size size
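A simplified model of the accounting shown in the flattened diff above (the patch squeezes the counters into two 32-bit fields to keep MemoryContextData small; this sketch uses a single 64-bit total for clarity, and the names are illustrative):

#include <stdint.h>

typedef struct ToyContext
{
    struct ToyContext *parent;
    int64_t     total_allocated;    /* this context plus all descendants */
} ToyContext;

/*
 * Called with a positive delta on block allocation and a negative delta on
 * block release.  Pushing the delta up the parent chain means a single read
 * of total_allocated answers "how much is this subtree using?" without
 * walking what may be many child contexts (e.g. array_agg's per-group
 * contexts).
 */
static void
update_allocation(ToyContext *cxt, int64_t delta)
{
    ToyContext *c;

    for (c = cxt; c != NULL; c = c->parent)
        c->total_allocated += delta;
}

This is also why the AllocSetDelete hunk above walks the parent chain by hand: at that point the context is already unlinked, but its contribution still has to be subtracted from every ancestor.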
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Fri, 2014-08-15 at 13:53 -0400, Robert Haas wrote: I think that's right, and I rather like your (Jeff's) approach. It's definitely true that we could do better if we have a mechanism for serializing and deserializing group states, but (1) I think an awful lot of cases would get an awful lot better even just with the approach proposed here and (2) I doubt we would make the serialization/deserialization interfaces mandatory, so even if we had that we'd probably want a fallback strategy anyway. Thank you for taking a look. To solve the problem for array_agg, that would open up two potentially lengthy discussions: 1. Trying to support non-serialized representations (like ArrayBuildState for array_agg) as a real type rather than using internal. 2. What changes should we make to the aggregate API? As long as we're changing/extending it, should we go the whole way and support partial aggregation[1] (particularly useful for parallelism)? Both of those discussions are worth having, and perhaps they can happen in parallel as I wrap up this patch. I'll see whether I can get consensus that my approach is (potentially) commit-worthy, and your statement that it (potentially) solves a real problem is a big help. Regards, Jeff Davis [1] http://blogs.msdn.com/b/craigfr/archive/2008/01/18/partial-aggregation.aspx -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Thu, 2014-08-14 at 12:52 -0400, Robert Haas wrote: It appears to me that the performance characteristics for this version are not significantly different from version 1. I have not looked at the code. While trying to reproduce your results, I noticed what might be around a 1% regression just from adding the 3 fields to MemoryContextData. If I cut it down to adding just one field, the regression disappears. The results are fairly noisy, so I could be chasing the wrong thing. But one reason to believe it is that I pushed the size of MemoryContextData above 64, which sounds like it might be an important threshold. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg
On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote: But maybe the inheritance really is not necessary - maybe it would be enough to track this per-context, and then just walk through the contexts and collect this. Because my observation is that most of the time is actually spent in walking through blocks and freelists. That makes a lot of sense to me. Another approach is to pass a flag to hash_create that tells it not to create a subcontext. Then we don't need to traverse anything; we already know which context we want to look at. Perhaps I was being too clever with the idea of tracking space for an entire hierarchy. Also, as I pointed out in my reply to Robert, adding too many fields to MemoryContextData may be the cause of the regression. Your idea requires only one field, which doesn't show the same regression in my tests. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Memory-bounded HashAgg
I think the hash-join like approach is reasonable, but I also think you're going to run into a lot of challenges that make it more complex for HashAgg. For instance, let's say you have the query: SELECT x, array_agg(y) FROM foo GROUP BY x; Say the transition state is an array (for the sake of simplicity), so the hash table has something like: 1000 = {7, 8, 9} 1001 = {12, 13, 14} You run out of memory and need to split the hash table, so you scan the hash table and find that group 1001 needs to be written to disk. So you serialize the key and array and write them out. Then the next tuple you get is (1001, 19). What do you do? Create a new group 1001 = {19} (how do you combine it later with the first one)? Or try to fetch the existing group 1001 from disk and advance it (horrible random I/O)? On Wed, 2014-08-13 at 12:31 +0200, Tomas Vondra wrote: My understanding of the batching algorithm (and I may be wrong on this one) is that once you choose the number of batches, it's pretty much fixed. Is that the case? It's only fixed for that one work item (iteration). A different K can be selected if memory is exhausted again. But you're right: this is a little less flexible than HashJoin. But what will happen in case of significant cardinality underestimate? I.e. what will happen if you decide to use 16 batches, and then find out 256 would be more appropriate? I believe you'll end up with batches 16x the size you'd want, most likely exceeding work_mem. Yes, except that work_mem would never be exceeded. If the partitions are 16X work_mem, then each would be added as another work_item, and hopefully it would choose better the next time. One thing I like about my simple approach is that it returns a good number of groups after each pass, and then those are completely finished (returned to the operator above, even). That's impossible with HashJoin because the hashing all needs to be done before the probe phase begins. The hash-join approach returns ~1/N groups after each pass, so I fail to see how this is better? You can't return any tuples until you begin the probe phase, and that doesn't happen until you've hashed the entire inner side (which involves splitting and other work). With my patch, it will return some tuples after the first scan. Perhaps I'm splitting hairs here, but the idea of finalizing some groups as early as possible seems appealing. Aha! And the new batches are 'private' to the work item, making it a bit recursive, right? Is there any reason not to just double the number of batches globally? I didn't quite follow this proposal. It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then allowing each work item to create it's own set of additional partitions effectively renders the HASH_DISK_MAX_PARTITIONS futile. It's the number of active partitions that matter, because that's what causes the random I/O. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
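A sketch of the work-item loop being described (names invented; the real patch's bookkeeping is more involved): each pass aggregates what fits in work_mem, emits those groups, and requeues whatever spilled as independent work items, each of which can pick a fresh partition count.

#include <stdlib.h>

/* Hypothetical spill partition: a temp file holding raw input tuples. */
typedef struct ToyPartition ToyPartition;

typedef struct ToyWorkItem
{
    ToyPartition *input;                /* tuples still to be aggregated */
    struct ToyWorkItem *next;
} ToyWorkItem;

#define MAX_PARTITIONS 32               /* cf. HASH_DISK_MAX_PARTITIONS; bounds random I/O */

/* Re-estimated per work item; assumed to return a value <= MAX_PARTITIONS. */
extern int  choose_num_partitions(ToyPartition *input);

/*
 * Aggregate as many groups as fit in work_mem, emit them, and write the
 * tuples of groups that didn't fit into spill[0..npartitions-1] (NULL
 * where nothing spilled).
 */
extern void aggregate_and_spill(ToyPartition *input, int npartitions,
                                ToyPartition **spill);

static void
toy_hashagg_run(ToyWorkItem *worklist)
{
    while (worklist != NULL)
    {
        ToyWorkItem *item = worklist;
        ToyPartition *spill[MAX_PARTITIONS] = {0};
        int         npartitions = choose_num_partitions(item->input);
        int         i;

        worklist = item->next;
        aggregate_and_spill(item->input, npartitions, spill);

        /*
         * Each spilled partition becomes an independent work item.  If the
         * cardinality estimate was off, the next pass simply chooses a
         * better partition count; work_mem is never exceeded either way.
         */
        for (i = 0; i < npartitions; i++)
        {
            if (spill[i] != NULL)
            {
                ToyWorkItem *w = malloc(sizeof(ToyWorkItem));

                w->input = spill[i];
                w->next = worklist;
                worklist = w;
            }
        }
        free(item);
    }
}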
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Thu, 2014-08-14 at 10:06 -0400, Tom Lane wrote: If you're following the HashJoin model, then what you do is the same thing it does: you write the input tuple back out to the pending batch file for the hash partition that now contains key 1001, whence it will be processed when you get to that partition. I don't see that there's any special case here. HashJoin only deals with tuples. With HashAgg, you have to deal with a mix of tuples and partially-computed aggregate state values. Not impossible, but it is a little more awkward than HashJoin. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 9.5: Memory-bounded HashAgg
On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote: Either it belongs to the current batch (and either it's in the hash table, or you add it there), or it's not - in that case write it to a temp file. I think the part you left out is that you need two files per batch: one for the dumped-out partially-computed state values, and one for the tuples. In other words, you haven't really discussed the step where you reunite the tuples with that partially-computed state. For sure, it's not for free - it may write to quite a few files. Is it more expensive than what you propose? I'm not sure about that. With your batching scheme, you'll end up with lower number of large batches, and you'll need to read and split them, possibly repeatedly. The batching scheme from hashjoin minimizes this. My approach only has fewer batches if it elects to have fewer batches, which might happen for two reasons: 1. A cardinality misestimate. This certainly could happen, but we do have useful numbers to work from (we know the number of tuples and distincts that we've read so far), so it's far from a blind guess. 2. We're concerned about the random I/O from way too many partitions. I fail to see how this is different from your approach? How can you output any tuples before processing the whole inner relation? Right, the only thing I avoid is scanning the hash table and dumping out the groups. This isn't a major distinction, more like my approach does a little less work before returning tuples, and I'm not even sure I can defend that, so I'll retract this point. Your approach is to do multi-level batching, and I was thinking whether it'd be possible to use the same approach (single level). But in retrospect it probably does not make much sense, because the multi-level batching is one of the points of the proposed approach. Now that I think about it, many of the points we discussed could actually work with either approach: * In my approach, if I need more partitions, I could create more in much the same way as HashJoin to keep it single-level (as you suggest above). * In your approach, if there are too many partitions, you could avoid random I/O by intentionally putting tuples from multiple partitions in a single file and moving them while reading. * If given a way to write out the partially-computed states, I could evict some groups from the hash table to keep an array_agg() bounded. Our approaches only differ on one fundamental trade-off that I see: (A) My approach requires a hash lookup of an already-computed hash for every incoming tuple, not only the ones going into the hash table. (B) Your approach requires scanning the hash table and dumping out the states every time the hash table fills up, which therefore requires a way to dump out the partial states. You could probably win the argument by pointing out that (A) is O(N) and (B) is O(log2(N)). But I suspect that cost (A) is very low. Unfortunately, it would take some effort to test your approach because we'd actually need a way to write out the partially-computed state, and the algorithm itself seems a little more complex. So I'm not really sure how to proceed. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
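To spell out what cost (A) amounts to per tuple, a tiny sketch under assumed helper names: the group hash is computed once regardless, the extra work is a single probe with that already-computed hash, and the same value selects the spill partition, so nothing is hashed twice and no resident state is evicted.

#include <stdbool.h>
#include <stdint.h>

extern uint32_t hash_group_key(const void *tuple);      /* computed anyway */
extern bool     hashtable_probe(uint32_t hash, const void *tuple); /* true if resident */
extern void     advance_in_place(uint32_t hash, const void *tuple);
extern void     spill_to_partition(uint32_t partition, const void *tuple);

static void
toy_route_tuple(const void *tuple, uint32_t npartitions)
{
    uint32_t    hash = hash_group_key(tuple);

    if (hashtable_probe(hash, tuple))
        advance_in_place(hash, tuple);          /* cost (A): one extra probe */
    else
        spill_to_partition(hash % npartitions, tuple);  /* reuse the same hash */
}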