Re: [HACKERS] Range Merge Join v1

2017-09-17 Thread Jeff Davis
On Thu, Aug 31, 2017 at 1:52 AM, Jeff Davis <pg...@j-davis.com> wrote:
> Updated patch attached. Changelog:
>
> * Rebased
> * Changed MJCompare to return an enum as suggested, but it has 4
> possible values rather than 3.
> * Added support for joining on contains or contained by (@> or <@) and
> updated tests.

The current patch does not work well with multiple keys, and I believe
it's important to solve that, because doing so will make the feature
usable for multi-dimensional spatial joins.

The problem is this: the algorithm for a single key demands that the
input is sorted by (lower(r) NULLS FIRST, upper(r) NULLS LAST). That's
easy, because that's also the sort order for the range operator class,
so everything just works.

For multiple keys, the order of the input is:
   lower(r1) NULLS FIRST, lower(r2) NULLS FIRST,
   upper(r1) NULLS LAST, upper(r2) NULLS LAST

But that can't match up with any opclass, because an opclass can only
order one attribute at a time. In this case, the lower bound of r2 is
more significant than the upper bound of r1.
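
To make that concrete, here is a sketch with made-up tables (t1/t2 and
their range columns r1, r2 are illustrative only):

CREATE TABLE t1 (r1 int4range, r2 int4range);
CREATE TABLE t2 (r1 int4range, r2 int4range);

-- a two-key range join:
SELECT * FROM t1 JOIN t2 ON t1.r1 && t2.r1 AND t1.r2 && t2.r2;

-- each input would have to arrive sorted like this, interleaving bounds
-- from both columns, which no single opclass can express:
SELECT * FROM t1
ORDER BY lower(r1) NULLS FIRST, lower(r2) NULLS FIRST,
         upper(r1) NULLS LAST,  upper(r2) NULLS LAST;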

It's easy enough to adapt the execution to make this work as long as
the input is properly sorted. The challenge is making the optimizer
choose the sort keys properly.

I have tried a few approaches. The current approach that I'm using is:
* have a new, special range operator family with no operator classes
* in check_mergejoinable(), detect the && operator and set the
mergeopfamilies to contain only the special operator family
* in try_mergejoin_path(), convert the pathkeys that use that
operator family into pathkeys over a "lower" expression on the same
EC, and then append additional pathkeys for the "upper" expressions
to the end of the pathkey list

Any comments or alternative suggestions welcome. This will probably
take a few days at least, so I put the patch in "waiting on author"
state.

Regards,
 Jeff Davis




Re: [HACKERS] Range Merge Join v1

2017-08-31 Thread Jeff Davis
On Fri, Aug 25, 2017 at 10:19 AM, Alexander Kuzmenkov
<a.kuzmen...@postgrespro.ru> wrote:
> Hi Jeff,

Hi,

Thank you for the review and suggestions!

> * At the moment, "mergejoinable clause" and "equality clause" mean the same
> thing to the planner, and those clauses are used both to create equivalence
> classes and to perform merge joins. If we start performing merge joins for
> different kinds of clauses, such as comparison or range intersection, it
> makes sense to separate these two meanings. I tried this in my patch and it
> didn't require many changes. I use RestrictInfo.equivopfamilies to build
> equivalence classes, and RestrictInfo.mergeopfamilies for mergejoinable
> clauses.

I like the idea. I will look at it in more detail, and I can either change
my patch or piggyback on yours.

> * Semantically, MJCompare() is a function that determines whether you should
> emit a join result or advance one of the pointers. This meaning is not
> explicit in the code and is not well encapsulated: the function returns an
> int and a 'compareNoMatch' flag, and the calling code combines them in various
> ways to derive the final result. This meaning can be made explicit by making
> MJCompare return enum values {Join, NextInner, NextOuter}, and putting
> inside it all the logic that decides whether we join or advance.
> ExecMergeJoin looks intimidating already, and I think this change would help
> readability. Again, you can find an illustration in my patch.

I actually tried doing something similar in my patch, but there are
four cases to consider in EXEC_MJ_SKIP_TEST:

1. Overlaps: mark and then join the tuples
2. left-of: SKIPOUTER_ADVANCE
3. right-of: SKIPINNER_ADVANCE
4. None of the above: mark and NEXTINNER

However, you are right that the current code is ugly, and I should
refactor it into 4 explicit cases. I don't think I will call them by the
same names as the join states, though, because there's not a direct
mapping.
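
For reference, those four cases correspond to the range operators below
(shown with literal ranges purely as an illustration; the executor of
course works on sorted tuples, not constants):

-- 1. overlaps: mark, then join the tuples
SELECT int4range(1,5) && int4range(3,8);   -- true
-- 2. outer strictly left of inner: SKIPOUTER_ADVANCE
SELECT int4range(1,5) << int4range(6,9);   -- true
-- 3. outer strictly right of inner: SKIPINNER_ADVANCE
SELECT int4range(6,9) >> int4range(1,5);   -- true
-- 4. none of the above: mark and NEXTINNER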

Updated patch attached. Changelog:

* Rebased
* Changed MJCompare to return an enum as suggested, but it has 4
possible values rather than 3.
* Added support for joining on contains or contained by (@> or <@) and
updated tests.

Regards,
Jeff Davis
diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml
index 9557c16..84578a7 100644
*** a/doc/src/sgml/rangetypes.sgml
--- b/doc/src/sgml/rangetypes.sgml
***
*** 522,525  INSERT 0 1
--- 522,595 
  

   
+  
+   Range Join
+ 
+   
+ range type
+ join
+   
+ 
+   
+ A Range Join is a join where one of the join conditions is the range
+ overlaps operator (see ). In other
+ words, rather than returning rows where corresponding fields are equal, it
+ returns rows where the corresponding fields overlap.
+   
+   
+ For instance, to find users who were active on a website at the same time
+ as they were using a mobile app, you might issue a query like:
+ 
+ CREATE TABLE web_session(
+ web_session_id text primary key,
+ username text,
+ during tstzrange);
+ CREATE TABLE app_session(
+ app_session_id text primary key,
+ username text,
+ during tstzrange);
+ INSERT INTO web_session VALUES
+ ('0cc175b9c0f1b6a8', 'user1', '[2017-02-01 09:30, 2017-02-01 10:45)'),
+ ('92eb5ffee6ae2fec', 'user2', '[2017-02-01 09:30, 2017-02-01 10:45)'),
+ ('4a8a08f09d37b737', 'user3', '[2017-02-01 09:30, 2017-02-01 10:45)');
+ INSERT INTO app_session VALUES
+ ('8277e0910d750195', 'user1', '[2017-02-01 10:30, 2017-02-01 11:45)'),
+ ('b448797616e091ad', 'user3', '[2017-02-01 09:00, 2017-02-01 11:00)'),
+ ('95649038408b5f33', 'user4', '[2017-02-01 09:30, 2017-02-01 10:45)');
+ 
+ SELECT w.username,
+w.during * a.during AS both_during
+ FROM  web_session w, app_session a
+ WHERE w.username = a.username
+ AND w.during && a.during;
+  username | both_during 
+ --+-
+  user1| ["2017-02-01 10:30:00-08","2017-02-01 10:45:00-08")
+  user3| ["2017-02-01 09:30:00-08","2017-02-01 10:45:00-08")
+ (2 rows)
+ 
+   
+   
+ In addition to the general execution strategies available, Postgres also
+ has special support for a variant of Merge Join called Range Merge Join:
+ 
+  EXPLAIN (costs off) SELECT w.username,
+w.during * a.during AS both_during
+ FROM  web_session w, app_session a
+ WHERE w.username = a.username
+ AND w.during && a.during;
+   QUERY PLAN  
+ --
+  Range Merge Join
+Merge Cond: ((w.during && a.during) AND (w.username = a.username))
+->  Sort
+  Sort Key: w.during, w.username
+  ->  Seq Scan on web_session w

Re: [HACKERS] Range Merge Join v1

2017-06-02 Thread Jeff Davis
On Tue, May 30, 2017 at 11:44 PM, Andrew Borodin <boro...@octonica.com> wrote:
> Hi, Jeff!

Hi!

> Sorry for being late. Actually, I had several unsuccessful attempts to
> find something wrong with the patch.
> Here's my review.
>
> in pathkey.c
>
> ecs = (EquivalenceClass **) palloc(nClauses * sizeof(EquivalenceClass *));
> scores = (int *) palloc(nClauses * sizeof(int));
> range_ecs = palloc(nClauses * sizeof(bool));
>
> Third assignment has no cast.

Will fix.

> And I have few questions:
> 1. Are there any types, which could benefit from Range Merge and are
> not covered by this patch?

I thought about this for a while, and the only things I can think of
are range joins that don't explicitly use range types.

> 2. Can Range Merge handle merge of different ranges? Like int4range()
> && int8range() ?

Right now, there aren't even casts between range types. I think the
best way to handle that at this point would be to add casts among the
numeric ranges. There may be advantages to supporting any two ranges
where the contained types are part of the same opfamily, but it seems
a little early to add that complication.
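
For anyone following along, the current behavior is easy to check (both
statements fail; the comments paraphrase the errors rather than quoting
them):

SELECT int4range(1,5)::int8range;         -- fails: no cast between range types
SELECT int4range(1,5) && int8range(3,8);  -- fails: no cross-type && operator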

> My perf test script from the previous message was broken, here's fixed
> one in the attachment.
>
> This patch implements the feature, contains new tests, passes the old
> tests, and is documented and spec compliant. I do not see any reason
> not to mark it "Ready for committer".

Great!

I think there are a couple more things that could be done if we want
to. Let me know if you think these things should be done now, or if
they should be a separate patch later when the need arises:

* Support for r1 @> r2 joins (join on "contains" rather than "overlaps").
* Better integration with the catalog so that users could add their
own types that support range merge join.

Thank you for the review.

Regards,
 Jeff Davis




Re: [HACKERS] Hash Functions

2017-06-01 Thread Jeff Davis
On Thu, Jun 1, 2017 at 11:25 AM, Andres Freund <and...@anarazel.de> wrote:
> Secondly, I think that's to a significant degree caused by
> the fact that in practice people way more often partition on types like
> int4/int8/date/timestamp/uuid rather than text - there's rarely good
> reasons to do the latter.

Once we support more pushdowns to partitions, the only question is:
what are your join keys and what are your grouping keys?

Text is absolutely a normal join key or group key. Consider joins on a
user ID or grouping by a model number.

Regards,
 Jeff Davis




Re: [HACKERS] Hash Functions

2017-06-01 Thread Jeff Davis
On Thu, Jun 1, 2017 at 10:59 AM, Robert Haas <robertmh...@gmail.com> wrote:
> 1. Are the new problems worse than the old ones?
>
> 2. What could we do about it?

Exactly the right questions.

1. For range partitioning, I think it's "yes, a little". As you point
out, there are already some weird edge cases -- the main way range
partitioning would make the problem worse is simply by having more
users.

But for hash partitioning I think the problems will become more
substantial. Different encodings, endian issues, etc. will be a
headache for someone, and potentially a bad day if they are urgently
trying to restore on a new machine. We should remember that not
everyone is a full-time postgres DBA, and users might reasonably think
that the default options to pg_dump[all] will give them a portable
dump.

2. I basically see two approaches to solve the problem:
  (a) Tom suggested at PGCon that we could have a GUC that
automatically causes inserts to the partition to be re-routed through
the parent. We could discuss whether to always route through the
parent, or do a recheck on the partition constrains and only reroute
tuples that will fail it. If the user gets into trouble, the worst
that would happen is a helpful error message telling them to set the
GUC. I like this idea.
  (b) I had suggested before that we could make the default text dump
(and the default output from pg_restore, for consistency) route
through the parent. Advanced users would dump with -Fc, and pg_restore
would support an option to do partition-wise loading. To me, this is
simpler, but users might forget to use (or not know about) the
pg_restore option and then it would load more slowly. Also, the ship
is sailing on range partitioning, so we might prefer option (a) just
to avoid making any changes.

I am fine with either option.

> 2. Add an option like --dump-partition-data-with-parent.  I'm not sure
> who originally proposed this, but it seems that everybody likes it.
> What we disagree about is the degree to which it's sufficient.  Jeff
> Davis thinks it doesn't go far enough: what if you have an old
> plain-format dump that you don't want to hand-edit, and the source
> database is no longer available?  Most people involved in the
> unconference discussion of partitioning at PGCon seemed to feel that
> wasn't really something we should worry about too much.  I had been
> taking that position also, more or less because I don't see that there
> are better alternatives.

If the suggestions above are unacceptable, and we don't come up with
anything better, then of course we have to move on. I am worrying now
primarily because now is the best time to worry; I don't expect any
horrible outcome.

> 3. Implement portable hash functions (Jeff Davis or me, not sure
> which).  Andres scoffed at this idea, but I still think it might have
> legs.

I think it reduces the problem, which has value, but it's hard to make
it rock-solid.

> make fast.  Those two things also solve different parts of the
> problem; one is insulating the user from a difference in hardware
> architecture, while the other is insulating the user from a difference
> in user-selected settings.  I think that the first of those things is
> more important than the second, because it's easier to change your
> settings than it is to change your hardware.

Good point.

Regards,
 Jeff Davis




[HACKERS] Hash Functions

2017-05-19 Thread Jeff Davis
On Thursday, May 18, 2017, Robert Haas <robertmh...@gmail.com> wrote:
> My experience with this area has led
> me to give up on the idea of complete uniformity as impractical, and
> instead look at it from the perspective of "what do we absolutely have
> to ban in order for this to be sane?".

I could agree to something like that. Let's explore some of the challenges
there and potential solutions:

1. Dump/reload of hash partitioned data.

Falling back to restore-through-the-root seems like a reasonable answer
here. Moving to a different encoding is not an edge case, but it's not
common either, so a performance penalty seems acceptable. I'm not
immediately sure how we'd implement this in pg_dump/restore, so I'd feel a
little more comfortable if I saw a sketch.

2. Having a lot of hash partitions would be cumbersome

The user would need to create and manage each partition, and try to do
global operations in a sane way. The normal case would probably involve
scripts to do things like add an index to all partitions, or a column. Many
partitions would also just pollute the namespace unless you remember to put
them in a separate schema (yes, it's easy, but most people will still
forget). Some syntax sugar would go a long way here.

3. The user would need to specify details they really don't care about for
each partition.

Things like "modulus 16, remainder 0", "modulus 16, remainder 1" are
tedious boilerplate. And if the user makes a mistake, then 1/16 of inserts
start failing. Probably would be caught during testing, but not exactly a
good user experience. I'm not thrilled about this, considering that all the
user really wants is 16 partitions, but it's not the end of the world.
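
To make the boilerplate concrete, the per-partition DDL under discussion
looks roughly like this (using the modulus/remainder syntax from the
proposed patch):

CREATE TABLE events (user_id int, payload text) PARTITION BY HASH (user_id);
CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events
    FOR VALUES WITH (MODULUS 16, REMAINDER 1);
-- ...fourteen more of these; get one remainder wrong and 1/16 of inserts fail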

4. Detach is a foot-gun

If you detach a partition, random inserts will start failing. Not thrilled
about this, but a hapless user would accept most of the blame if they
stumbled over it. Another way of saying this is that with hash partitioning
you really need the whole set of partitions for the table to be online at all. But we can't
really enforce that, because it would limit some of the flexibility that
you have in mind.


Stepping back, your approach might be closer to the general postgres
philosophy of allowing the user to assemble from spare parts first, then a
few releases later we offer some pre-built subassemblies, and a few
releases later we make the typical cases work out of the box. I'm fine with
it as long as we don't paint ourselves into a corner.

Of course we still have work to do on the hash functions. We should solve
at least the most glaring portability problems, and try to harmonize the
hash opfamilies. If you agree, I can put together a patch or two.

Regards,
Jeff Davis


Re: [HACKERS] Hash Functions

2017-05-18 Thread Jeff Davis
On Wed, May 17, 2017 at 11:35 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I think the question is whether we are going to make a distinction between
> logical partitions (where the data division rule makes some sense to the
> user) and physical partitions (where it needn't).  I think it might be
> perfectly reasonable for those to behave differently.

Agreed. To summarize my perspective:

* hash partitioning offers a nice way to divide the data for later
processing by parallel query
* range partitioning is good for partition elimination
(constraint_exclusion) and separating hot/cold data (e.g. partitioning
on date)
* both offer some maintenance benefits (e.g. reindex one partition at
a time), though range partitioning seems like it offers better
flexibility here in some cases

I lean toward separating the concepts, but Robert is making some
reasonable arguments and I could be convinced.

Regards,
Jeff Davis




Re: [HACKERS] Hash Functions

2017-05-17 Thread Jeff Davis
On Wed, May 17, 2017 at 12:10 PM, Robert Haas <robertmh...@gmail.com> wrote:
> 1. To handle dump-and-reload the way we partitioning does today, hash
> functions would need to be portable across encodings.
> 2. That's impractically difficult.
> 3. So let's always load data through the top-parent.
> 4. But that could fail due to e.g. a UNIQUE index on an individual
> child, so let's try to prohibit all of the things that could be done
> to an individual partition that could cause a reload failure.
> 5. And then for good measure let's hide the existence of the partitions, too.

That is one thread of logic, but the discussion also
highlighted some of the consequences of treating hash partitions like
range/list partitions.

For instance, it makes little sense to have individual check
constraints, indexes, permissions, etc. on a hash-partitioned table.
It doesn't mean that we should necessarily forbid them, but it should
make us question whether combining range and hash partitions is really
the right design.

Regards,
 Jeff Davis




Re: [HACKERS] Hash Functions

2017-05-16 Thread Jeff Davis
On Tuesday, May 16, 2017, Robert Haas <robertmh...@gmail.com> wrote:
> I don't really find this a very practical design.  If the table
> partitions are spread across different relfilenodes, then those
> relfilenodes have to have separate pg_class entries and separate
> indexes, and those indexes also need to have separate pg_class
> entries.  Otherwise, nothing works.  And if they do have separate
> pg_class entries, then the partitions have to have their own names,
> and likewise for their indexes, and a dump-and-reload has to preserve
> those names.  If it doesn't, and those objects get new system-assigned
> names after the dump-and-reload, then dump restoration can fail when a
> system-assigned name collides with an existing name that is first
> mentioned later in the dump.

Why can't hash partitions be stored in tables the same way as we do TOAST?
That should take care of the naming problem.

> If Java has portable hash functions, why can't we?

Java standardizes on a particular Unicode encoding (UTF-16). Are you
suggesting that we do the same? Or is there another solution that I am
missing?

Regards,
   Jeff Davis


Re: [HACKERS] Hash Functions

2017-05-16 Thread Jeff Davis
On Mon, May 15, 2017 at 1:04 PM, David Fetter <da...@fetter.org> wrote:
> As the discussion has devolved here, it appears that there are, at
> least conceptually, two fundamentally different classes of partition:
> public, which is to say meaningful to DB clients, and "private", used
> for optimizations, but otherwise opaque to DB clients.
>
> Mashing those two cases together appears to cause more problems than
> it solves.

I concur at this point. I originally thought hash functions might be
made portable, but I think Tom and Andres showed that to be too
problematic -- the issue with different encodings is the real killer.

But I also believe hash partitioning is important and we shouldn't
give up on it yet.

That means we need to have a concept of hash partitions that's
different from range/list partitioning. The terminology
"public"/"private" does not seem appropriate. Logical/physical or
external/internal might be better.

With hash partitioning:
* User only specifies number of partitions of the parent table; does
not specify individual partition properties (modulus, etc.)
* Dump/reload goes through the parent table (though we may provide
options so pg_dump/restore can optimize this)
* We could provide syntax to adjust the number of partitions, which
would be expensive but still useful sometimes.
* All DDL should be on the parent table, including check constraints,
FKs, unique constraints, exclusion constraints, indexes, etc.
  - Unique and exclusion constraints would only be permitted if the
keys are a superset of the partition keys (see the sketch after this list).
  - FKs would only be permitted if the two tables' partition schemes
match and the keys are members of the same hash opfamily (this could
be relaxed slightly, but it gets a little confusing if so)
* No attach/detach of partitions
* All partitions have the same permissions
* Individual partitions would only be individually-addressable for
maintenance (like reindex and vacuum), but not for arbitrary queries
  - perhaps also COPY for bulk loading/dumping, in case we get clients
smart enough to do their own hashing.
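
To illustrate the unique-constraint rule, using DDL in the style of the
hash partitioning patch (the comments describe the intended behavior from
the list above, not any particular release):

CREATE TABLE t (id int, payload text) PARTITION BY HASH (id);
ALTER TABLE t ADD UNIQUE (id);        -- allowed: includes the partition key
ALTER TABLE t ADD UNIQUE (payload);   -- rejected: does not include the partition key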

The only real downside is that it could surprise users -- why can I
add a CHECK constraint on my range-partitioned table but not the
hash-partitioned one? We should try to document this so users don't
find that out too far along. As long as they aren't surprised, I think
users will understand why these aren't quite the same concepts.

Regards,
 Jeff Davis




Re: [HACKERS] Hash Functions

2017-05-15 Thread Jeff Davis
On Sun, May 14, 2017 at 6:22 PM, Robert Haas <robertmh...@gmail.com> wrote:
> You'd have to prohibit a heck of a lot more than that in order for
> this to work 100% reliably.  You'd have to prohibit CHECK constraints,
> triggers, rules, RLS policies, and UNIQUE indexes, at the least.  You
> might be able to convince me that some of those things are useless
> when applied to partitions, but wanting a CHECK constraint that
> applies to only one partition seems pretty reasonable (e.g. CHECK that
> records for older years are all in the 'inactive' state, or whatever).
> I think getting this to work 100% reliably in all cases would require
> an impractically large hammer.

The more I think about it the more I think hash partitions are
"semi-logical". A check constraint on a single partition of a
range-partitioned table makes sense, but it doesn't make sense on a
single partition of a hash-partitioned table.

I think hash partitioning is mainly useful for parallel query (so the
optimizer needs to know about it) and maintenance commands (as you
listed in another email). But those don't need hash portability.

FKs are a little more interesting, but I actually think they still
work regardless of hash portability. If the two sides are in the same
hash opfamily, they should hash the same and it's fine. And if they
aren't, the FK has no hope of working regardless of hash portability.

This would mean we need to reload through the root as Andres and
others suggested, and disable a lot of logical partitioning
capabilities. I'd be a little worried about what we do with
attaching/detaching, though.

Regards,
Jeff Davis




Re: [HACKERS] Hash Functions

2017-05-15 Thread Jeff Davis
On Sun, May 14, 2017 at 8:00 PM, Bruce Momjian <br...@momjian.us> wrote:
> Do we even know that floats are precise enough to determine the
> partition?  For example, if you have 6.1, is it possible for
> that to be 5.999 on some systems?  Are IEEE systems all the same for
> these values?  I would say we should disallow any approximate date type
> for partitioning completely.

I'm inclined in this direction, as well. Hash partitioning is mostly
useful for things that are likely to be join keys or group keys, and
floats aren't. Same for complex user-defined types.

The real problem here is what Tom pointed out: that we would have
trouble hashing strings in an encoding-insensitive way. Strings are
useful as join/group keys, so it would be painful to not support them.

Perhaps there's some kind of compromise, like a pg_dump mode that
reloads through the root. Or maybe hash partitions are really a
"semi-logical" partitioning that the optimizer understands, but where
things like per-partition check constraints don't make sense.

Regards,
    Jeff Davis






Re: [HACKERS] Hash Functions

2017-05-13 Thread Jeff Davis
On Fri, May 12, 2017 at 12:38 PM, Robert Haas <robertmh...@gmail.com> wrote:
> That is a good question.  I think it basically amounts to this
> question: is hash partitioning useful, and if so, for what?

Two words: parallel query. To get parallelism, one of the best
approaches is dividing the data, then doing as much work as possible
before combining it again. If you have hash partitions on some key,
then you can do partition-wise joins or partition-wise aggregation on
that key in parallel with no synchronization/communication overhead
(until the final result).

You've taken postgres pretty far in this direction already; hash
partitioning will take it one step further by allowing more pushdowns
and lower sync/communication costs.

Some of these things can be done with range partitioning, too, but see
my other message here:

https://www.postgresql.org/message-id/CAMp0ubfNMSGRvZh7N7TRzHHN5tz0ZeFP13Aq3sv6b0H37fdcPg%40mail.gmail.com

Regards,
    Jeff Davis




Re: [HACKERS] Hash Functions

2017-05-13 Thread Jeff Davis
On Fri, May 12, 2017 at 11:45 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Forget hash partitioning.  There's no law saying that that's a good
> idea and we have to have it.  With a different set of constraints,
> maybe we could do it, but I think the existing design decisions have
> basically locked it out --- and I doubt that hash partitioning is so
> valuable that we should jettison other desirable properties to get it.

A lot of the optimizations that can make use of hash partitioning
could also make use of range partitioning. But let me defend hash
partitioning:

* hash partitioning requires fewer decisions by the user
* naturally balances data and workload among partitions in most cases
* easy to match with a degree of parallelism

But with range partitioning, you can have situations where different
tables have different distributions of data. If you partition to
balance the data between partitions in both cases, then that makes
partition-wise join a lot harder because the boundaries don't line up.
If you make the boundaries line up to do partition-wise join, the
partitions might have wildly different amounts of data in them. Either
way, it makes parallelism harder.

Even without considering joins, range partitioning could force you to
make a choice between balancing the data and balancing the workload.
If you are partitioning based on date, then a lot of the workload will
be on more recent partitions. That's desirable sometimes (e.g. for
vacuum) but not always desirable for parallelism.

Hash partitioning doesn't have these issues and goes very nicely with
parallel query.

Regards,
Jeff Davis




Re: [HACKERS] Hash Functions

2017-05-13 Thread Jeff Davis
On Fri, May 12, 2017 at 10:34 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Maintaining such a property for float8 (and the types that depend on it)
> might be possible if you believe that nobody ever uses anything but IEEE
> floats, but we've never allowed that as a hard assumption before.

This is not such a big practical problem (for me at least) because
hashing of floats is of dubious value.

> Even architecture dependence isn't the whole scope of the problem.
> Consider for example dumping a LATIN1-encoded database and trying
> to reload it into a UTF8-encoded database.  People will certainly
> expect that to be possible, and do you want to guarantee that the
> hash of a text value is encoding-independent?

That is a major problem. In an ideal world, we could make that work
with something like ucol_getSortKey(), but we don't require ICU, and
we can't mix getSortKey() with strxfrm(), or even strxfrm() results
from different platforms.

I don't think it's correct to hash the code points, either, because
strings may be considered equal in a locale even if the code points
aren't identical. But I don't think postgres lives up to that standard
currently.

But hash partitioning is too valuable to give up on entirely. I think
we should consider supporting a limited subset of types for now with
something not based on the hash am.

Regards,
Jeff Davis




[HACKERS] Hash Functions

2017-05-11 Thread Jeff Davis
https://www.postgresql.org/message-id/camp0ubeo3fzzefie1vmc1ajkkrpxlnzqooaseu6o-c+...@mail.gmail.com

In that thread, I pointed out some important considerations for the
hash functions themselves. This is a follow-up, after I looked more
carefully.

1. The hash functions as they exist today aren't portable -- they can
return different results on different machines. That means using these
functions for hash partitioning would yield different contents for the
same partition on different architectures (and that's bad, considering
they are logical partitions and not some internal detail).

The core hashing algorithm is based on 32-bit chunks, so it's only
portable if the input is an int32 (or array of int32s). That's not
true for varchar (etc.) or numeric (which is based on an array of
int16s). There is a hack for int8, but see issue #2 below.

We could try to mark portable vs. unportable hash functions, but it
seems quite valuable to be able to logically partition on varchar, so
I think we should have some portable answer there. Another annoyance
is that we would have to assume container types are unportable, or
make them inherit the portability of the underlying type's hash
function.

We could revert 26043592 (copied Tom because that was his patch) to
make hash_any() go back to being portable -- do we know what that
speedup actually was? Maybe the benefit is smaller on newer
processors? Another option is to try to do some combination of
byteswapping and word-at-a-time, which might be better than
byte-at-a-time if the byteswapping is done with a native instruction.

2. hashint8() is a little dubious. If the input is positive, it XORs
the high bits; if negative, it XORs the complement of the high bits.
That works for compatibility with hashint2/4, but it does not cause
good hash properties[1]. I prefer that we either (a) upcast int2/4 to
int8 before hashing and then hash the whole 64 bits; or (b) if the
value is within range, downcast to int4, otherwise hash the whole 64
bits.

3. We should downcast numeric to int4/8 before hashing if it's in
range, so that it can be a part of the same opfamily as int2/4/8.

4. text/varchar should use hashbpchar() so that they can be part of
the same opfamily. Trailing blanks seem unlikely to be significant for
most real-world data anyway.

5. For salts[2], I don't think it's too hard to support them in an
optional way. We just allow the function to be a two-argument function
with a default. Places that care about specifying the salt may do so
if the function has pronargs==2, otherwise it just gets the default
value. If we have salts, I don't think having 64-bit hashes is very
important. If we run out of bits, we can just salt the hash function
differently and get more hash bits. This is not urgent and I believe
we should just implement salts when and if some algorithm needs them.
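
A minimal sketch of what the optional-salt form could look like at the SQL
level (the function name and the XOR-the-input trick are purely
illustrative; a real implementation would feed the salt into the hash
state):

CREATE FUNCTION salted_hashint8(val int8, salt int8 DEFAULT 0) RETURNS int4
AS $$ SELECT hashint8(val # salt) $$ LANGUAGE sql IMMUTABLE;

SELECT salted_hashint8(42);          -- callers that don't care get the default
SELECT salted_hashint8(42, 12345);   -- a hashing level picking its own "hash function"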

Regards,
Jeff Davis

[1] You can see a kind of mirroring in the hash outputs, indicating bad mixing:

postgres=# select hashint8((2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((-2^32-100)::int8);
 hashint8
----------
 41320869
(1 row)

postgres=# select hashint8((2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)

postgres=# select hashint8((-2^32-101)::int8);
  hashint8
-------------
 -1214482326
(1 row)


[2] Not sure I'm using the term "salt" properly. I really just mean a
way to affect the initial state, which I think is good enough for our
purposes.




Re: [HACKERS] [POC] hash partitioning

2017-05-03 Thread Jeff Davis
On Tue, May 2, 2017 at 7:01 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, May 2, 2017 at 9:01 PM, Jeff Davis <pg...@j-davis.com> wrote:
>> 1. Consider a partition-wise join of two hash-partitioned tables. If
>> that's a hash join, and we just use the hash opclass, we immediately
>> lose some useful bits of the hash function. Same for hash aggregation
>> where the grouping key is the partition key.
>
> Hmm, that could be a problem in some cases.  I think there's probably
> much less of a problem if the modulus isn't a power of two?

That's true, but it's awkward to describe that to users. And I think
most people would be inclined to use a power-of-two number of
partitions, perhaps coming from other systems.

>> To fix this, I think we need to include a salt in the hash API. Each
>> level of hashing can choose a random salt.
>
> Do you mean that we'd salt partitioning hashing differently from
> grouping hashing which would be salted different from aggregation
> hashing which, I suppose, would be salted differently from hash index
> hashing?

Yes. The way I think about it is that choosing a new random salt is an
easy way to get a new hash function.

> Or do you mean that you'd have to specify a salt when
> creating a hash-partitioned table, and make sure it's the same across
> all compatibly partitioned tables you might want to hash-join?  That
> latter sounds unappealing.

I don't see a reason to expose the salt to users. If we found a reason
in the future, we could, but it would create all of the problems you
are thinking about.

>> 2. Consider a partition-wise join where the join keys are varchar(10)
>> and char(10). We can't do that join if we just use the existing hash
>> strategy, because 'foo' = 'foo   ' should match, but those values
>> have different hashes when using the standard hash opclass.

...

> You're basically describing what a hash opfamily already does, except
> that we don't have a single opfamily that covers both varchar(10) and
> char(10), nor do we have one that covers both int and numeric.  We
> have one that covers int2, int4, and int8, though.  If somebody wanted
> to make the ones you're suggesting, there's nothing preventing it,
> although I'm not sure exactly how we'd encourage people to start using
> the new one and deprecating the old one.  We don't seem to have a good
> infrastructure for that.

OK. I will propose new hash opfamilies for varchar/bpchar/text,
int2/4/8/numeric, and timestamptz/date.

One approach is to promote the narrower type to the wider type, and
then hash. The problem is that would substantially slow down the
hashing of integers, so then we'd need to use one hash opfamily for
partitioning and one for hashjoin, and it gets messy.

The other approach is to check if the wider type is within the domain
of the narrower type, and if so, *demote* the value and then hash. For
instance, '4.2'::numeric would hash the same as it does today, but
'4'::numeric would hash as an int2. I prefer this approach, and int8
already does something resembling it.
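
A rough SQL-level sketch of the demotion idea (the function name is
hypothetical, it demotes only to int4 for brevity, and a real version
would live in C inside the opfamily):

CREATE FUNCTION hash_numeric_compat(n numeric) RETURNS int4 AS $$
  SELECT CASE WHEN n = trunc(n) AND n BETWEEN -2147483648 AND 2147483647
              THEN hashint4(n::int4)   -- integral and in range: hash as an integer
              ELSE hash_numeric(n)     -- otherwise keep today's numeric hashing
         END
$$ LANGUAGE sql IMMUTABLE;

SELECT hash_numeric_compat(4::numeric)   = hashint4(4);        -- true: they agree
SELECT hash_numeric_compat(4.2::numeric) = hash_numeric(4.2);  -- true: unchanged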

For timestamptz/date, it's not nearly as important.

>> My opinion is that we should work on this hashing infrastructure
>> first, and then support the DDL. If we get the hash functions right,
>> that frees us up to create better plans, with better push-downs, which
>> will be good for parallel query.
>
> I am opposed to linking the fate of this patch to multiple
> independent, possibly large, possibly difficult, possibly
> controversial enhancements to the hashing mechanism.

It's a little early in the v11 cycle to be having this argument.
Really what I'm saying is that a small effort now may save us a lot of
headache later.

Regards,
 Jeff Davis




Re: [HACKERS] [POC] hash partitioning

2017-05-02 Thread Jeff Davis
On Tue, Feb 28, 2017 at 6:33 AM, Yugo Nagata <nag...@sraoss.co.jp> wrote:
> In this patch, the user can specify a hash function with USING. However,
> we might need default hash functions which are useful and
> proper for hash partitioning.

I suggest that we consider the hash functions more carefully. This is
(effectively) an on-disk format so it can't be changed easily later.

1. Consider a partition-wise join of two hash-partitioned tables. If
that's a hash join, and we just use the hash opclass, we immediately
lose some useful bits of the hash function. Same for hash aggregation
where the grouping key is the partition key.

To fix this, I think we need to include a salt in the hash API. Each
level of hashing can choose a random salt.

2. Consider a partition-wise join where the join keys are varchar(10)
and char(10). We can't do that join if we just use the existing hash
strategy, because 'foo' = 'foo   ' should match, but those values
have different hashes when using the standard hash opclass.
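
One way to see the hazard with the stock functions (runnable as-is):

SELECT 'foo'::char(10) = 'foo   '::char(10);  -- true: trailing blanks are insignificant
SELECT hashtext('foo'), hashtext('foo   ');   -- the text opclass hashes the raw strings, blanks included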

To fix this, we need to be smarter about normalizing values at a
logical level before hashing. We can take this to varying degrees,
perhaps even normalizing an integer to a numeric before hashing so
that you can do a cross-type join on int=numeric.
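
Today the integer and numeric hash functions are unrelated, so equal
values on the two sides land in unrelated buckets:

SELECT 4::int = 4::numeric;                    -- true
SELECT hashint4(4), hash_numeric(4::numeric);  -- computed by unrelated functions, so no reason to agree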

Furthermore, we need catalog metadata to indicate which hash functions
are suitable for which cross-type comparisons. Or, to put it the other
way, which typecasts preserve the partitioning.

3. We might want to use a hash function that is a little slower but
more resistant to collisions. We may even want to use a 64-bit
hash.


My opinion is that we should work on this hashing infrastructure
first, and then support the DDL. If we get the hash functions right,
that frees us up to create better plans, with better push-downs, which
will be good for parallel query.

Regards,
 Jeff Davis




Re: [HACKERS] Range Merge Join v1

2017-04-20 Thread Jeff Davis
On Thu, Apr 20, 2017 at 2:05 AM, Rafia Sabih
<rafia.sa...@enterprisedb.com> wrote:
> Looks like an interesting idea; however, in an attempt to test this patch I
> found the following error when compiling:
> selfuncs.c: In function ‘mergejoinscansel’:
> selfuncs.c:2901:12: error: ‘op_strategy’ undeclared (first use in this
> function)
>_strategy,
>

Looks like filterdiff destroyed my patch from git. Attaching unified
version against master 3820c63d.

Thanks!
 Jeff Davis
diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml
index 9557c16..84578a7 100644
--- a/doc/src/sgml/rangetypes.sgml
+++ b/doc/src/sgml/rangetypes.sgml
@@ -522,4 +522,74 @@ INSERT 0 1
 
   
  
+ 
+  Range Join
+
+  
+range type
+join
+  
+
+  
+A Range Join is a join where one of the join conditions is the range
+overlaps operator (see ). In other
+words, rather than returning rows where corresponding fields are equal, it
+returns rows where the corresponding fields overlap.
+  
+  
+For instance, to find users who were active on a website at the same time
+as they were using a mobile app, you might issue a query like:
+
+CREATE TABLE web_session(
+web_session_id text primary key,
+username text,
+during tstzrange);
+CREATE TABLE app_session(
+app_session_id text primary key,
+username text,
+during tstzrange);
+INSERT INTO web_session VALUES
+('0cc175b9c0f1b6a8', 'user1', '[2017-02-01 09:30, 2017-02-01 10:45)'),
+('92eb5ffee6ae2fec', 'user2', '[2017-02-01 09:30, 2017-02-01 10:45)'),
+('4a8a08f09d37b737', 'user3', '[2017-02-01 09:30, 2017-02-01 10:45)');
+INSERT INTO app_session VALUES
+('8277e0910d750195', 'user1', '[2017-02-01 10:30, 2017-02-01 11:45)'),
+('b448797616e091ad', 'user3', '[2017-02-01 09:00, 2017-02-01 11:00)'),
+('95649038408b5f33', 'user4', '[2017-02-01 09:30, 2017-02-01 10:45)');
+
+SELECT w.username,
+   w.during * a.during AS both_during
+FROM  web_session w, app_session a
+WHERE w.username = a.username
+AND w.during && a.during;
+ username | both_during 
+--+-
+ user1| ["2017-02-01 10:30:00-08","2017-02-01 10:45:00-08")
+ user3| ["2017-02-01 09:30:00-08","2017-02-01 10:45:00-08")
+(2 rows)
+
+  
+  
+In addition to the general execution strategies available, Postgres also
+has special support for a variant of Merge Join called Range Merge Join:
+
+ EXPLAIN (costs off) SELECT w.username,
+   w.during * a.during AS both_during
+FROM  web_session w, app_session a
+WHERE w.username = a.username
+AND w.during && a.during;
+  QUERY PLAN  
+--
+ Range Merge Join
+   Merge Cond: ((w.during && a.during) AND (w.username = a.username))
+   ->  Sort
+ Sort Key: w.during, w.username
+ ->  Seq Scan on web_session w
+   ->  Sort
+ Sort Key: a.during, a.username
+ ->  Seq Scan on app_session a
+(8 rows)
+
+  
+ 
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9359d0a..1010668 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -909,7 +909,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			pname = sname = "Nested Loop";
 			break;
 		case T_MergeJoin:
-			pname = "Merge";	/* "Join" gets added by jointype switch */
+			if (((MergeJoin*)plan)->mergeRangeJoin)
+pname = "Range Merge";	/* "Join" gets added by jointype switch */
+			else
+pname = "Merge";	/* "Join" gets added by jointype switch */
+
 			sname = "Merge Join";
 			break;
 		case T_HashJoin:
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 8c483bf..bd312e6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -89,14 +89,67 @@
  *		proceed to another state.  This state is stored in the node's
  *		execution state information and is preserved across calls to
  *		ExecMergeJoin. -cim 10/31/89
+ *
+ *		RANGE MERGE JOIN
+ *
+ *		Range Merge Join is a generalization of merge join. When a join
+ *		predicate involves the range overlaps operator (&&), a merge join
+ *		becomes a range join.
+ *
+ *		The input ranges must be ordered by (lower-bound, upper-bound), which
+ *		is the ordinary range_ops operator class. MJCompare must not simply
+ *		use the operator class's comparison semantics though; instead it
+ *		follows these rules:
+ *
+ *		  - return 0 if the outer range overlaps the inner range
+ *		  - return <0 if the outer range is strictly left-of the inner range
+ *		  - return >0 if the outer ran

Re: [HACKERS] Merge join for GiST

2017-04-13 Thread Jeff Davis
On Wed, Apr 12, 2017 at 10:44 PM, Andrew Borodin <boro...@octonica.com> wrote:
>> How do you think we should proceed? Which projects do you think should
>> eventually be in core, versus which are fine as extensions?
>
> Some points in favor of Range joins via nbtree:

My patch doesn't require indexes; it can sort the input (and the 5X
improvement that I got included the effort to sort). In fact, I expect
that using sort will be more common than maintaining btree indexes on a
range column.

> 1. It's more efficient than B-tree over GiST
> 2. It is already in a patch form
>
> Point against:
> 1. The particular implementation is somewhat of a leaky abstraction. Inside
> the planner, you check not for the capabilities of the operator and type, but
> for a specific op and type. But I do not know how to fix this.

It's not a specific type, it's the "anyrange" type, so you can define
new range types to take advantage of range merge join.

I can drive the planner changes from the catalog rather than
hard-coded OIDs if we think range merge join is a good solution.

> So, here is my opinion: if we can, inside the planner, understand that the
> join condition involves range specifics (for all available ranges) and
> do a Range Merge Join, then this is the preferred solution.

Yes, I can do that.

> Yes, Spatial Join, when available, will provide roughly the same scan
> performance. But B-trees are more well known to users new to
> PostgreSQL, and B-trees are faster.

I don't quite follow. I don't think any of these proposals uses btree,
right? Range merge join doesn't need any index, your proposal uses
gist, and PgSphere's crossmatch uses gist.

Regards,
Jeff Davis




Re: [HACKERS] Merge join for GiST

2017-04-12 Thread Jeff Davis
On Tue, Apr 11, 2017 at 8:35 AM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> On Tue, Apr 11, 2017 at 5:46 PM, Jeff Davis <pg...@j-davis.com> wrote:
>> Do you have a sense of how this might compare with range merge join?
>
>
> If you have GiST indexes over ranges for both sides of the join, then this
> method could be used for a range join.  Hence, it could be compared with range
> merge join.
> However, the particular implementation in pgsphere uses hardcoded datatypes and
> operations.
> Thus, for range join we need either a generalized version of a GiST-based join
> or a special implementation for ranges.

Alexander, Andrew,

How do you think we should proceed? Which projects do you think should
eventually be in core, versus which are fine as extensions?

Regards,
 Jeff Davis




Re: [HACKERS] Merge join for GiST

2017-04-11 Thread Jeff Davis
On Tue, Apr 11, 2017 at 2:17 AM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> FYI, I've implemented this algorithm for pgsphere.  See following branch.
> https://github.com/akorotkov/pgsphere/tree/experimental
> It's implemented as crossmatch() function which takes as arguments names of
> two indexes over spoint and maximum distance (it checks not overlapping but
> proximity of points).  This function returns setof pairs of TIDs.
>
> Later, Dmitry Ivanov made it a custom scan node.
> https://github.com/akorotkov/pgsphere/tree/crossmatch_cnode
>
> You also can find some experimental evaluation here:
> http://www.adass2016.inaf.it/images/presentations/10_Korotkov.pdf

Do you have a sense of how this might compare with range merge join?

Regards,
 Jeff Davis




Re: [HACKERS] Range Merge Join v1

2017-04-11 Thread Jeff Davis
On Tue, Apr 11, 2017 at 12:17 AM, Jeff Davis <pg...@j-davis.com> wrote:
> Version 2 attached. Fixed a few issues, expanded tests, added docs.

It looks like the CF app only listed my perf test script. Re-attaching
rangejoin-v2.patch so that it appears in the CF app. Identical to
other rangejoin-v2.patch.

Regards,
 Jeff Davis
diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml
index 9557c16..84578a7 100644
*** a/doc/src/sgml/rangetypes.sgml
--- b/doc/src/sgml/rangetypes.sgml
***
*** 522,525  INSERT 0 1
--- 522,595 
  

   
+  
+   Range Join
+ 
+   
+ range type
+ join
+   
+ 
+   
+ A Range Join is a join where one of the join conditions is the range
+ overlaps operator (see ). In other
+ words, rather than returning rows where corresponding fields are equal, it
+ returns rows where the corresponding fields overlap.
+   
+   
+ For instance, to find users who were active on a website at the same time
+ as they were using a mobile app, you might issue a query like:
+ 
+ CREATE TABLE web_session(
+ web_session_id text primary key,
+ username text,
+ during tstzrange);
+ CREATE TABLE app_session(
+ app_session_id text primary key,
+ username text,
+ during tstzrange);
+ INSERT INTO web_session VALUES
+ ('0cc175b9c0f1b6a8', 'user1', '[2017-02-01 09:30, 2017-02-01 10:45)'),
+ ('92eb5ffee6ae2fec', 'user2', '[2017-02-01 09:30, 2017-02-01 10:45)'),
+ ('4a8a08f09d37b737', 'user3', '[2017-02-01 09:30, 2017-02-01 10:45)');
+ INSERT INTO app_session VALUES
+ ('8277e0910d750195', 'user1', '[2017-02-01 10:30, 2017-02-01 11:45)'),
+ ('b448797616e091ad', 'user3', '[2017-02-01 09:00, 2017-02-01 11:00)'),
+ ('95649038408b5f33', 'user4', '[2017-02-01 09:30, 2017-02-01 10:45)');
+ 
+ SELECT w.username,
+w.during * a.during AS both_during
+ FROM  web_session w, app_session a
+ WHERE w.username = a.username
+ AND w.during && a.during;
+  username | both_during 
+ --+-
+  user1| ["2017-02-01 10:30:00-08","2017-02-01 10:45:00-08")
+  user3| ["2017-02-01 09:30:00-08","2017-02-01 10:45:00-08")
+ (2 rows)
+ 
+   
+   
+ In addition to the general execution strategies available, Postgres also
+ has special support for a variant of Merge Join called Range Merge Join:
+ 
+  EXPLAIN (costs off) SELECT w.username,
+w.during * a.during AS both_during
+ FROM  web_session w, app_session a
+ WHERE w.username = a.username
+ AND w.during && a.during;
+   QUERY PLAN  
+ --
+  Range Merge Join
+Merge Cond: ((w.during && a.during) AND (w.username = a.username))
+->  Sort
+  Sort Key: w.during, w.username
+  ->  Seq Scan on web_session w
+->  Sort
+  Sort Key: a.during, a.username
+  ->  Seq Scan on app_session a
+ (8 rows)
+ 
+   
+  
  
diff --git a/doc/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9359d0a..1010668 100644
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 909,915  ExplainNode(PlanState *planstate, List *ancestors,
  			pname = sname = "Nested Loop";
  			break;
  		case T_MergeJoin:
! 			pname = "Merge";	/* "Join" gets added by jointype switch */
  			sname = "Merge Join";
  			break;
  		case T_HashJoin:
--- 909,919 
  			pname = sname = "Nested Loop";
  			break;
  		case T_MergeJoin:
! 			if (((MergeJoin*)plan)->mergeRangeJoin)
! pname = "Range Merge";	/* "Join" gets added by jointype switch */
! 			else
! pname = "Merge";	/* "Join" gets added by jointype switch */
! 
  			sname = "Merge Join";
  			break;
  		case T_HashJoin:
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 8c483bf..bd312e6 100644
*** a/src/backend/executor/nodeMergejoin.c
--- b/src/backend/executor/nodeMergejoin.c
***
*** 89,102 
--- 89,155 
   *		proceed to another state.  This state is stored in the node's
   *		execution state information and is preserved across calls to
   *		ExecMergeJoin. -cim 10/31/89
+  *
+  *		RANGE MERGE JOIN
+  *
+  *		Range Merge Join is a generalization of merge join. When a join
+  *		predicate involves the range overlaps operator (&&), a merge join
+  *		becomes a range join.
+  *
+  *		The input ranges must be ordered by (lower-bound, upper-bound), which
+  *		is the ordinary range_ops operator class. MJCompare must not simply
+  *		use the operator class's comparison semantics though; instead it
+  *		follows these rules:
+  *
+  *		  - return 0 if the outer range overlaps the inner range
+  *		  - return <0 if the outer range is

Re: [HACKERS] Range Merge Join v1

2017-04-11 Thread Jeff Davis
Version 2 attached. Fixed a few issues, expanded tests, added docs.

A simple performance test (script attached) shows about a 5X
improvement when comparing against a nested loop with an inner
index-only scan over a gist index.

Even better, this doesn't require an index, so it will work even if
the input relations are subqueries.

Regards,
Jeff Davis

On Thu, Apr 6, 2017 at 1:43 AM, Jeff Davis <pg...@j-davis.com> wrote:
> 
> Example:
> 
>
> Find different people using the same website at the same time:
>
> create table session(sessionid text, username text, during tstzrange);
> SELECT s1.username, s2.username, s1.during * s2.during
>   FROM session s1, session s2
>   WHERE s1.during && s2.during AND s1.username < s2.username
>
>
> =
> Brief summary of previous discussion:
> =
>
> http://www.postgresql.org/message-id/1334554850.10878.52.camel@jdavis
>
> - can indexes solve it already (parameterized paths, etc.)?
> - planner complexity (track path keys, mergejoinable equality, etc.)
> - spatial join algorithm choice
> - compare range merge join against existing indexes to see if any
>   indexing approach is viable
>
> ==
> Externals:
> ==
>
> No new syntax or other externals. Just improved execution strategy for
> joining ranges using the overlaps operator (&&).
>
> New syntax is possible, but I don't see a strong reason to support
> special range join syntax at this time.
>
> =
> Optimizer Design
> =
>
> I took the path of least resistance, so if I am doing something wrong
> please let me know. I hardwired it to look for the range overlaps
> operator when checking if it could do a merge join, and then add
> range_ops to restrictinfo->mergeopfamilies.
>
> Then, I have to suppress adding it as an equivalence class, because
> overlaps is not equality. It still adds single-member ECs to
> restrictinfo->right_ec and left_ec, but (I think) that's OK.
>
> Also, I have to prevent generating paths for right/full join, because
> the range join algorithm can't (currently) handle those.
>
> =
> Costing
> =
>
> Needs more consideration. Seems to do reasonable things in the few
> examples I tried.
>
> =
> Executor Design
> =
>
> See detailed comments in nodeMergejoin.c
>
> =
> Performance
> =
>
> Seems much better than index nest loop join when the join is not
> selective. I will post detailed numbers soon.
>
> If no index is available, range merge join is the only reasonable way
> to execute a range join. For instance, if the inner is not a leaf in
> the plan.
>
> =
> Alternatives:
> =
>
> It was suggested that I approach it as a general spatial-join
> problem. That has merit, but I rejected it for now because the
> algorithm that I think has the most promise is range-oriented
> anyway. By that I mean we would need to extract all of the dimensions
> into ranges first, and then perform the join. With that in mind, it
> makes sense to implement range joins first; and then later provide the
> APIs to get at the spatial dimensions so that we can implement other
> spatial joins as range joins.
>
> Another suggestion was to base it on hash join, but I never understood
> that proposal and it didn't seem to map very well to a spatial join.
>
> Yet another suggestion was to use some kind of temporary index. Some
> brief experiments I did indicated that it would be fairly slow (though
> more investigation might be useful here). Also, it doesn't provide any
> alternative to the nestloop-with-inner-index we already offer at the
> leaf level today.
>
> Regards,
>  Jeff Davis


perf.sql
Description: application/sql
diff --git a/doc/src/sgml/rangetypes.sgml b/doc/src/sgml/rangetypes.sgml
index 9557c16..84578a7 100644
*** a/doc/src/sgml/rangetypes.sgml
--- b/doc/src/sgml/rangetypes.sgml
***
*** 522,525  INSERT 0 1
--- 522,595 
  

   
+  
+   Range Join
+ 
+   
+ range type
+ join
+   
+ 
+   
+ A Range Join is a join where one of the join conditions is the range
+ overlaps operator (see ). In other
+ words, rather than returning rows where corresponding fields are equal, it
+ returns rows where the corresponding fields overlap.
+   
+   
+ For instance, to find users who were active on a website at the same time
+ as they were using a mobile app, you might issue a query like:
+ 
+ CREATE TABLE web_session(
+ web_session_id text primary key,
+ username text,
+ during tstzrange);
+ CREATE TABLE app_session(
+ app_ses

[HACKERS] Range Merge Join v1

2017-04-06 Thread Jeff Davis

Example:


Find different people using the same website at the same time:

create table session(sessionid text, username text, during tstzrange);
SELECT s1.username, s2.username, s1.during * s2.during
  FROM session s1, session s2
  WHERE s1.during && s2.during AND s1.username < s2.username
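
Purely as an illustration of the execution strategy (a standalone sketch, not
code from the patch; it assumes closed integer intervals already sorted by
lower bound and ignores NULLs and empty ranges), the core of a range merge
join looks like this:

#include <stdio.h>

typedef struct { int lo, hi; } Range;   /* closed interval [lo, hi] */

static int
overlaps(Range a, Range b)
{
    return a.lo <= b.hi && b.lo <= a.hi;
}

static void
range_merge_join(const Range *outer, int nouter,
                 const Range *inner, int ninner)
{
    int start = 0;              /* first inner that might still match */

    for (int i = 0; i < nouter; i++)
    {
        /* this inner ends before the current (and every later) outer begins */
        while (start < ninner && inner[start].hi < outer[i].lo)
            start++;

        /* once an inner starts after this outer ends, no later inner matches */
        for (int j = start; j < ninner && inner[j].lo <= outer[i].hi; j++)
            if (overlaps(outer[i], inner[j]))
                printf("outer [%d,%d] overlaps inner [%d,%d]\n",
                       outer[i].lo, outer[i].hi, inner[j].lo, inner[j].hi);
    }
}

int
main(void)
{
    /* both sides sorted by lower bound */
    Range outer[] = {{1, 4}, {3, 9}, {10, 12}};
    Range inner[] = {{2, 3}, {5, 6}, {11, 20}};

    range_merge_join(outer, 3, inner, 3);
    return 0;
}

The property the sort order buys is that an inner range which ends before the
current outer begins can be discarded for good, so one forward pass over each
sorted input is enough.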


=
Brief summary of previous discussion:
=

http://www.postgresql.org/message-id/1334554850.10878.52.camel@jdavis

- can indexes solve it already (parameterized paths, etc.)?
- planner complexity (track path keys, mergejoinable equality, etc.)
- spatial join algorithm choice
- compare range merge join against existing indexes to see if any
  indexing approach is viable

==
Externals:
==

No new syntax or other externals. Just improved execution strategy for
joining ranges using the overlaps operator (&&).

New syntax is possible, but I don't see a strong reason to support
special range join syntax at this time.

=
Optimizer Design
=

I took the path of least resistance, so if I am doing something wrong
please let me know. I hardwired it to look for the range overlaps
operator when checking if it could do a merge join, and then add
range_ops to restrictinfo->mergeopfamilies.

Then, I have to suppress adding it as an equivalence class, because
overlaps is not equality. It still adds single-member ECs to
restrictinfo->right_ec and left_ec, but (I think) that's OK.

Also, I have to prevent generating paths for right/full join, because
the range join algorithm can't (currently) handle those.

=
Costing
=

Needs more consideration. Seems to do reasonable things in the few
examples I tried.

=
Executor Design
=

See detailed comments in nodeMergejoin.c

=
Performance
=

Seems much better than index nest loop join when the join is not
selective. I will post detailed numbers soon.

If no index is available, range merge join is the only reasonable way
to execute a range join. For instance, if the inner is not a leaf in
the plan.

=
Alternatives:
=

It was suggested that I approach it as a general spatial-join
problem. That has merit, but I rejected it for now because the
algorithm that I think has the most promise is range-oriented
anyway. By that I mean we would need to extract all of the dimensions
into ranges first, and then perform the join. With that in mind, it
makes sense to implement range joins first; and then later provide the
APIs to get at the spatial dimensions so that we can implement other
spatial joins as range joins.

Another suggestion was to base it on hash join, but I never understood
that proposal and it didn't seem to map very well to a spatial join.

Yet another suggestion was to use some kind of temporary index. Some
brief experiments I did indicated that it would be fairly slow (though
more investigation might be useful here). Also, it doesn't provide any
alternative to the nestloop-with-inner-index we already offer at the
leaf level today.

Regards,
 Jeff Davis
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a18ab43..1110c1e 100644
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 909,915  ExplainNode(PlanState *planstate, List *ancestors,
  			pname = sname = "Nested Loop";
  			break;
  		case T_MergeJoin:
! 			pname = "Merge";	/* "Join" gets added by jointype switch */
  			sname = "Merge Join";
  			break;
  		case T_HashJoin:
--- 909,919 
  			pname = sname = "Nested Loop";
  			break;
  		case T_MergeJoin:
! 			if (((MergeJoin*)plan)->mergeRangeJoin)
! 				pname = "Range Merge";	/* "Join" gets added by jointype switch */
! 			else
! 				pname = "Merge";	/* "Join" gets added by jointype switch */
! 
  			sname = "Merge Join";
  			break;
  		case T_HashJoin:
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 62784af..f3375c3 100644
*** a/src/backend/executor/nodeMergejoin.c
--- b/src/backend/executor/nodeMergejoin.c
***
*** 89,102 
--- 89,151 
   *		proceed to another state.  This state is stored in the node's
   *		execution state information and is preserved across calls to
   *		ExecMergeJoin. -cim 10/31/89
+  *
+  *		RANGE MERGE JOIN
+  *
+  *		Range Merge Join is a generalization of merge join. When a join
+  *		predicate involves the range overlaps operator (&&), a merge join
+  *		becomes a range join.
+  *
+  *		The input ranges must be ordered by (lower-bound, upper-bound), which
+  *		is the ordinary range_ops operator class. MJCompare must not simply
+  *		use the operator class's comparison semantics though; instead it
+  *		follows these rules:
+  *
+  *		  - return 0 if the outer range overlaps the inner range
+  *		  - retu

[HACKERS] New expression evaluator and indirect jumps

2017-04-01 Thread Jeff Davis
Andres,

Thank you for your great work on the expression evaluator:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755

I was looking at the dispatch code, and it goes to significant effort
(using computed goto) to generate many indirect jumps instead of just
one. The reasoning is that CPUs can do better branch prediction when
it can predict separately for each of the indirect jumps.

But the paper here: https://hal.inria.fr/hal-01100647/document claims
that it's not really needed on newer CPUs because they are better at
branch prediction. I skimmed it, and if I understand correctly, modern
branch predictors use some history, so it can predict based on the
instructions executed before it got to the indirect jump.

I tried looking through the discussion on this list, but most seemed
to revolve around which compilers generated the assembly we wanted
rather than how much it actually improved performance. Can someone
please point me to the numbers? Do they refute the conclusions in the
paper, or are we concerned about a wider range of processors?
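
For readers who haven't seen the technique, here is a self-contained toy
interpreter (not the actual evaluator code) showing the two dispatch styles.
The switch funnels every opcode through one shared indirect jump; the
computed-goto version (a GCC/Clang extension) gives each opcode body its own
jump site, which is what the branch-prediction argument is about:

#include <stdio.h>

typedef enum { OP_INC, OP_DEC, OP_DONE } Op;

/* Switch-based dispatch: one shared indirect jump (via the switch). */
static int
run_switch(const Op *prog)
{
    int acc = 0;

    for (;;)
    {
        switch (*prog++)
        {
            case OP_INC: acc++; break;
            case OP_DEC: acc--; break;
            case OP_DONE: return acc;
        }
    }
}

/* Computed-goto dispatch: each opcode body ends with its own "goto *". */
static int
run_goto(const Op *prog)
{
    static const void *dispatch[] = { &&do_inc, &&do_dec, &&do_done };
    int acc = 0;

#define NEXT() goto *dispatch[*prog++]
    NEXT();
do_inc: acc++; NEXT();
do_dec: acc--; NEXT();
do_done: return acc;
#undef NEXT
}

int
main(void)
{
    Op prog[] = { OP_INC, OP_INC, OP_DEC, OP_INC, OP_DONE };

    printf("switch: %d, goto: %d\n", run_switch(prog), run_goto(prog));
    return 0;
}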

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-02-04 Thread Jeff Davis
On Sun, Jan 22, 2017 at 10:32 PM, Jeff Davis <pg...@j-davis.com> wrote:
> On Sat, Jan 21, 2017 at 4:25 AM, Andrew Borodin <boro...@octonica.com> wrote:
> One idea I had that might be simpler is to use a two-stage page
> delete. The first stage would remove the link from the parent and mark
> the page deleted, but leave the right link intact and prevent
> recycling. The second stage would follow the chain of right links
> along each level, removing the right links to deleted pages and
> freeing the page to be recycled.

Do you think this approach is viable as a simplification?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-24 Thread Jeff Davis
On Tue, Jan 24, 2017 at 3:18 AM, Andrew Borodin <boro...@octonica.com> wrote:
> Technically, approach of locking a subtree is not novel. Lehman and
> Yao focused on "that any process for manipulating the tree uses only a
> small (constant) number of locks at any time." We are locking unknown
> and possibly large amount of pages.

By the way, can you show me where the Lehman and Yao paper addresses
page recycling?

It says that one approach is to allow fewer than K entries on a leaf
node; presumably as few as zero. But it doesn't seem to show how to
remove all references to the page and recycle it in a new place in the
tree.

Regards,
 Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-23 Thread Jeff Davis
On Mon, Jan 23, 2017 at 1:55 AM, Andrew Borodin <boro...@octonica.com> wrote:
> Proposed patch, on a contrary, simplifies code. There is more code
> deleted than added. It does not affect WAL, changes nothing in page
> layout.

I appreciate that it is fewer lines of code. My hesitation comes from
a couple areas:

1. I haven't seen the approach before of finding the parent of any
empty subtree. While that sounds like a reasonable thing to do, it
worries me to invent new approaches in an area as well-studied as
btrees. The problem is less that it's too complex, and more that it's
different. Future developers looking at this code will expect it to be
either very simple or very standard, because it's just a posting list.

2. It relies on searches to only start from the very leftmost page
(correct?). If someone wants to optimize the intersection of two
posting trees later, they might break it if they don't understand the
assumptions in this patch.

3. This adds a heap allocation, and copies the contents of the page,
and then releases the pin and lock. GinStepRight is careful to avoid
doing that (it locks the target before releasing the page containing
the pointer). I don't see a problem in your patch, but again, we are
breaking an assumption that future developers might make.

Your patch solves a real problem (a 90-second stall is clearly not
good) and I don't want to dismiss that. But I'd like to consider some
alternatives that may not have these downsides.

Regards,
 Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-22 Thread Jeff Davis
On Sat, Jan 21, 2017 at 4:25 AM, Andrew Borodin <boro...@octonica.com> wrote:
> Hello Jeff!
>
>>Review of the code itself:
> How do you think, is there anything else to improve in that patch or
> we can mark it as 'Ready for committer'?
>
> I've checked the concurrency algorithm with original work of Lehman
> and Yao on B-link-tree. For me everything looks fine (safe, deadlock
> free and not increased probability of livelock due to
> LockBufferForCleanup call).

Hi,

I looked at the patch some more.

I have some reservations about the complexity and novelty of the
approach. I don't think I've seen an approach quite like this one
before. It seems like the main reason you are using this approach is
because the btree structure was too simple (no left links); is that
right? My gut is telling me we should either leave it simple, or use a
real concurrent btree implementation.

One idea I had that might be simpler is to use a two-stage page
delete. The first stage would remove the link from the parent and mark
the page deleted, but leave the right link intact and prevent
recycling. The second stage would follow the chain of right links
along each level, removing the right links to deleted pages and
freeing the page to be recycled.
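
To make the shape of that idea concrete, here is a toy, self-contained sketch
(it has nothing to do with the real GIN code, which also has to deal with
pins, locks and WAL): stage one only marks the page, stage two later walks the
right-link chain of a level, unhooks the marked pages and recycles them.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy model of one level of a posting tree: pages linked by right links. */
typedef struct Page
{
    int          blkno;
    bool         deleted;    /* set by stage one, once unlinked from parent */
    struct Page *right;
} Page;

/* Stage one: the page is already unlinked from its parent; just mark it.
 * The right link stays intact so right-moving scans keep working. */
static void
stage_one_mark(Page *page)
{
    page->deleted = true;
}

/* Stage two: walk the level, drop right links to deleted pages and recycle
 * them.  The real thing would have to delay the free until no scan can still
 * be following the old right link. */
static void
stage_two_sweep(Page *leftmost)
{
    Page *cur = leftmost;

    while (cur->right != NULL)
    {
        if (cur->right->deleted)
        {
            Page *victim = cur->right;

            cur->right = victim->right;          /* unhook from the chain */
            printf("recycling page %d\n", victim->blkno);
            free(victim);
        }
        else
            cur = cur->right;
    }
}

int
main(void)
{
    Page *p3 = malloc(sizeof(Page)); *p3 = (Page){3, false, NULL};
    Page *p2 = malloc(sizeof(Page)); *p2 = (Page){2, false, p3};
    Page *p1 = malloc(sizeof(Page)); *p1 = (Page){1, false, p2};

    stage_one_mark(p2);      /* vacuum found page 2 empty */
    stage_two_sweep(p1);     /* a later pass frees it */

    printf("level is now: %d -> %d\n", p1->blkno, p1->right->blkno);
    free(p1->right);
    free(p1);
    return 0;
}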

Another idea is to partition the posting tree so that the root page
actually points to K posting trees. Scans would need to merge them,
but it would allow inserts to progress in one tree while the other is
being vacuumed. I think this would give good concurrency even for K=2.
I just had this idea now, so I didn't think it through very well.

What do you think?

Regards,
 Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-18 Thread Jeff Davis
https://www.postgresql.org/message-id/CAJEAwVE4UAmm8fr%2BNW8XTnKV6M--ACoNhL3ES8yoKL2sKhbaiw%40mail.gmail.com

Let me re-summarize what's been done here to make sure I understand:

Each key in GIN has its own posting tree, which is itself a btree,
holding all of the tuples that contain that key. This posting tree is
really just a set of tuples, and searches always scan an entire
posting tree (because all the tuples contain the key, so all are a
potential match).

Currently, the cleanup process requires blocking all reads and writes
in the posting list. I assume this is only a problem when a key is
common enough that its posting list is quite large (how large?).

This patch makes a couple changes to improve concurrency:

1. Vacuum leaves first, without removing any pages. Instead, just keep
track of pages which are completely empty (more generally, subtrees
that contain empty pages).

2. Free the empty pages in those subtrees. That requires blocking
reads and writes to that subtree, but not the entire posting list.
Blocking writes is easy: acquiring a cleanup lock on the root of any
subtree always blocks any writes to that subtree, because the write
keeps a pin on all pages in that path. Blocking reads is accomplished
by locking the page to be deleted, then locking/unlocking the page one
left of that (which may be outside the subtree?).


Maybe a basic question, but why is the posting list a btree at all,
and not, say a doubly-linked list?

Review of the code itself:

* Nice and simple, with a net line count of only +45.
* Remove commented-out code.
* In the README, "leaf-to-right" should be left-to-right; and "takinf"
should be "taking".
* When you say the leftmost page is never deleted; do you mean the
leftmost of the entire posting tree, or leftmost of the subtree that
you are removing pages from?
* In order to keep out concurrent reads, you need to lock/unlock the
left page while holding exclusive lock on the page being deleted, but
I didn't see how that happens exactly in the code. Where does that
happen?

Regards,
 Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls

2016-05-23 Thread Jeff Davis
On Fri, May 20, 2016 at 1:41 PM, David G. Johnston
<david.g.johns...@gmail.com> wrote:
> How does the relatively new FILTER clause play into this, if at all?

My interpretation of the standard is that FILTER is not allowable for
a window function, and IGNORE|RESPECT NULLS is not allowable for an
ordinary aggregate.

So if we support IGNORE|RESPECT NULLS for anything other than a window
function, we have to come up with our own semantics.

> We already have "STRICT" for deciding whether a function processes nulls.
> Wouldn't this need to exist on the "CREATE AGGREGATE"

STRICT defines behavior at DDL time. I was suggesting that we might
want a DDL-time flag to indicate whether a function can make use of
the query-time IGNORE|RESPECT NULLS option. In other words, most
functions wouldn't know what to do with IGNORE|RESPECT NULLS, but
perhaps some would if we allowed them the option.

Perhaps I didn't understand your point?

Regards,
  Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls

2016-05-20 Thread Jeff Davis
Old thread link:
http://www.postgresql.org/message-id/CA+=vxna5_n1q5q5okxc0aqnndbo2ru6gvw+86wk+onsunjd...@mail.gmail.com

On Thu, Apr 14, 2016 at 1:29 PM, Stephen Frost <sfr...@snowman.net> wrote:
> Jeff
>
> (Reviving an old thread for 2014...)
>
> Would you have time to work on this for 9.7..?  I came across a
> real-world use case for exactly this capability and was sorely
> disappointed to discover we didn't support it even though there had been
> discussion for years on it, which quite a few interested parties.
>
> I'll take a look at reviewing your last version of the patch but wanted
> to get an idea of if you still had interest and might be able to help.
>
> Thanks!
>
> Stephen

There are actually quite a few issues remaining here.

First, I think the syntax is still implemented in a bad way. Right now
it's part of the OVER clause, and the IGNORE NULLS gets put into the
frame options. It doesn't match the way the spec defines the grammar,
and I don't see how it really makes sense that it's a part of the
frame options or the window object at all. I believe that it should be
a part of the FuncCall, and end up in the FuncExpr, so the flag can be
read when the function is called without exposing frame options (which
we don't want to do). I think the reason it was done as a part of the
window object is so that it could save state between calls for the
purpose of optimization, but I think that can be done using fn_extra.
This change should remove a lot of the complexity about trying to
share window definitions, etc.

Second, we need a way to tell which functions support IGNORE NULLS and
which do not. The grammar as implemented in the patch seems to allow
it for any function with an OVER clause (which can include ordinary
aggregates). Then the parse analysis hard-codes that only LEAD and LAG
accept IGNORE NULLS, and throws an error otherwise. Neither of these
things seem right. I think we need a need catalog support to say
whether a function honors IGNORE|RESPECT NULLS or not, which means we
also need support in CREATE FUNCTION.

I think the execution is pretty good, except that (a) we need to keep
the state in fn_extra rather than the winstate; and (b) we should get
rid of the bitmaps and just do a naive scan unless we really think
non-constant offsets will be important. We can always optimize more
later.
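
As a deliberately naive, standalone illustration of the semantics behind
option (b) (this is not the window-function API code), IGNORE NULLS for LAG
amounts to walking backwards and counting only the non-null rows:

#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    int  value;
    bool isnull;
} Cell;

/*
 * lag(value, offset) IGNORE NULLS for the row at 'pos': walk backwards,
 * skipping null rows, until 'offset' non-null rows have been seen.
 * Returns false if the partition runs out.
 */
static bool
lag_ignore_nulls(const Cell *part, int pos, int offset, int *result)
{
    int seen = 0;

    for (int i = pos - 1; i >= 0; i--)
    {
        if (part[i].isnull)
            continue;
        if (++seen == offset)
        {
            *result = part[i].value;
            return true;
        }
    }
    return false;               /* becomes NULL (or the default) in SQL */
}

int
main(void)
{
    Cell part[] = {{10, false}, {0, true}, {30, false}, {0, true}, {50, false}};
    int  out;

    for (int row = 0; row < 5; row++)
    {
        if (lag_ignore_nulls(part, row, 1, &out))
            printf("row %d: lag ignore nulls = %d\n", row, out);
        else
            printf("row %d: lag ignore nulls = NULL\n", row);
    }
    return 0;
}

Remembering the last position found (in fn_extra, as suggested above) would
avoid restarting this scan from scratch on every row.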

Regards,
 Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: numeric timestamp in log_line_prefix

2015-09-07 Thread Jeff Davis
On Mon, 2015-09-07 at 18:28 -0300, Alvaro Herrera wrote:
> I noticed %t, but I don't think we care since the precision is so poor.
> Making m and n work in unison seems enough.  I think it would be
> reasonably simple to handle %t in the same way, but I'm not sure we
> care.

OK.

> I think the extra ugliness is warranted, since it's not THAT much
> additional ugliness, and not doing it could be considered a regression;
> apparently strftime can be slower even than snprintf, so doing it twice
> per log message might be excessive overhead.

Patch attached. Please take a quick look.

Regards,
Jeff Davis



*** a/src/backend/utils/error/elog.c
--- b/src/backend/utils/error/elog.c
***
*** 143,152  static int	errordata_stack_depth = -1; /* index of topmost active frame */
  
  static int	recursion_depth = 0;	/* to detect actual recursion */
  
! /* buffers for formatted timestamps that might be used by both
!  * log_line_prefix and csv logs.
   */
  
  #define FORMATTED_TS_LEN 128
  static char formatted_start_time[FORMATTED_TS_LEN];
  static char formatted_log_time[FORMATTED_TS_LEN];
--- 143,156 
  
  static int	recursion_depth = 0;	/* to detect actual recursion */
  
! /*
!  * Saved timeval and buffers for formatted timestamps that might be used by
!  * both log_line_prefix and csv logs.
   */
  
+ static struct timeval	saved_timeval;
+ static bool		saved_timeval_set = false;
+ 
  #define FORMATTED_TS_LEN 128
  static char formatted_start_time[FORMATTED_TS_LEN];
  static char formatted_log_time[FORMATTED_TS_LEN];
***
*** 2195,2206  write_console(const char *line, int len)
  static void
  setup_formatted_log_time(void)
  {
- 	struct timeval tv;
  	pg_time_t	stamp_time;
  	char		msbuf[8];
  
! 	gettimeofday(&tv, NULL);
! 	stamp_time = (pg_time_t) tv.tv_sec;
  
  	/*
  	 * Note: we expect that guc.c will ensure that log_timezone is set up (at
--- 2199,2214 
  static void
  setup_formatted_log_time(void)
  {
  	pg_time_t	stamp_time;
  	char		msbuf[8];
  
! 	if (!saved_timeval_set)
! 	{
! 		gettimeofday(&saved_timeval, NULL);
! 		saved_timeval_set = true;
! 	}
! 
! 	stamp_time = (pg_time_t) saved_timeval.tv_sec;
  
  	/*
  	 * Note: we expect that guc.c will ensure that log_timezone is set up (at
***
*** 2213,2219  setup_formatted_log_time(void)
  pg_localtime(&stamp_time, log_timezone));
  
  	/* 'paste' milliseconds into place... */
! 	sprintf(msbuf, ".%03d", (int) (tv.tv_usec / 1000));
  	memcpy(formatted_log_time + 19, msbuf, 4);
  }
  
--- 2221,2227 
  pg_localtime(&stamp_time, log_timezone));
  
  	/* 'paste' milliseconds into place... */
! 	sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_usec / 1000));
  	memcpy(formatted_log_time + 19, msbuf, 4);
  }
  
***
*** 2440,2450  log_line_prefix(StringInfo buf, ErrorData *edata)
  break;
  			case 'n':
  {
- 	struct	timeval tv;
  	char	strfbuf[128];
  
! 	gettimeofday(&tv, NULL);
! 	sprintf(strfbuf, "%ld.%03d", tv.tv_sec, (int)(tv.tv_usec / 1000));
  
  	if (padding != 0)
  		appendStringInfo(buf, "%*s", padding, strfbuf);
--- 2448,2463 
  break;
  			case 'n':
  {
  	char	strfbuf[128];
  
! 	if (!saved_timeval_set)
! 	{
! 		gettimeofday(&saved_timeval, NULL);
! 		saved_timeval_set = true;
! 	}
! 
! 	sprintf(strfbuf, "%ld.%03d", saved_timeval.tv_sec,
! 			(int)(saved_timeval.tv_usec / 1000));
  
  	if (padding != 0)
  		appendStringInfo(buf, "%*s", padding, strfbuf);
***
*** 2825,2830  send_message_to_server_log(ErrorData *edata)
--- 2838,2844 
  
  	initStringInfo(&buf);
  
+ 	saved_timeval_set = false;
  	formatted_log_time[0] = '\0';
  
  	log_line_prefix(&buf, edata);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: numeric timestamp in log_line_prefix

2015-09-07 Thread Jeff Davis
On Mon, 2015-09-07 at 17:47 -0300, Alvaro Herrera wrote:
> Jeff Davis wrote:
> > On Sun, 2015-03-22 at 19:47 +0100, Andres Freund wrote:
> > > On 2015-03-22 00:47:12 +0100, Tomas Vondra wrote:
> > > > from time to time I need to correlate PostgreSQL logs to other logs,
> > > > containing numeric timestamps - a prime example of that is pgbench. With
> > > > %t and %m that's not quite trivial, because of timezones etc.
> > > 
> > > I have a hard time seing this is sufficient cause for adding more format
> > > codes. They're not free runtime and documentation wise. -0.5 from me.
> > 
> > I am about to commit this patch (the one that adds a single extra
> > option). Although nothing is free, the cost seems very low, and at least
> > three people have expressed interest in this patch.
> 
> Did you see my message at 
> http://www.postgresql.org/message-id/20150907192719.GS2912@alvherre.pgsql ?

I guess we were looking at this at the same time, and I missed your
message and committed it. I will investigate now.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: numeric timestamp in log_line_prefix

2015-09-07 Thread Jeff Davis
> I wonder about this separate gettimeofday() call.  We already have
> formatted_log_time which is used for CSV logs and freeform log lines
> (stderr/syslog); if we introduce a separate gettimeofday() call here,
> and the user has %n in freeform log and CSV logging is active, the
> timings will diverge occasionally.
> 
> Maybe I'm worrying over nothing, because really what use case is there
> for having the two log formats enabled at the same time?  Yet somebody
> went some lengths to ensure they are consistent; I think we should do
> likewise here.

We now have three time-related options[1]: t, m, and n; and they each
acquire the time independently. Are you suggesting that we make all
three consistent, or only m and n?

The cleanest fix would be for the global variable to only hold the
timeval, and then format it once for the CSV log (always 'm' format) and
once for the regular log ('m', 'n', or 't'). If the regular log uses
'm', that would be some wasted cycles formatting it the same way twice.
Is it worth a little extra ugliness to cache both the timeval and the
formatted string?
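
For what it's worth, caching both pieces could look roughly like this
self-contained sketch (plain libc, not the elog.c code): keep the timeval the
message was stamped with, plus the string formatted from it, and only re-run
strftime when the saved timeval has actually changed:

#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static struct timeval saved_tv;
static int            saved_tv_set = 0;
static char           formatted[64];
static struct timeval formatted_for;   /* timeval 'formatted' was built from */

/* Stamp the log message once; the numeric and formatted outputs reuse it. */
static void
stamp_message(void)
{
    gettimeofday(&saved_tv, NULL);
    saved_tv_set = 1;
}

static const char *
formatted_time(void)
{
    if (!saved_tv_set)
        stamp_message();

    /* re-run strftime only when the saved timeval changed */
    if (saved_tv.tv_sec != formatted_for.tv_sec ||
        saved_tv.tv_usec != formatted_for.tv_usec)
    {
        time_t sec = saved_tv.tv_sec;
        char   ms[8];

        strftime(formatted, sizeof(formatted), "%Y-%m-%d %H:%M:%S",
                 localtime(&sec));
        snprintf(ms, sizeof(ms), ".%03d", (int) (saved_tv.tv_usec / 1000));
        strncat(formatted, ms, sizeof(formatted) - strlen(formatted) - 1);
        formatted_for = saved_tv;
    }
    return formatted;
}

int
main(void)
{
    stamp_message();
    printf("%%n style: %ld.%03d\n",
           (long) saved_tv.tv_sec, (int) (saved_tv.tv_usec / 1000));
    printf("%%m style: %s\n", formatted_time());
    printf("%%m again, cached: %s\n", formatted_time());
    return 0;
}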

Regards,
Jeff Davis

[1] As of minutes ago, after I missed your message by a few minutes.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: numeric timestamp in log_line_prefix

2015-09-07 Thread Jeff Davis
On Sun, 2015-03-22 at 19:47 +0100, Andres Freund wrote:
> On 2015-03-22 00:47:12 +0100, Tomas Vondra wrote:
> > from time to time I need to correlate PostgreSQL logs to other logs,
> > containing numeric timestamps - a prime example of that is pgbench. With
> > %t and %m that's not quite trivial, because of timezones etc.
> 
> I have a hard time seing this is sufficient cause for adding more format
> codes. They're not free runtime and documentation wise. -0.5 from me.

I am about to commit this patch (the one that adds a single extra
option). Although nothing is free, the cost seems very low, and at least
three people have expressed interest in this patch.

What tips the balance is that we expose the unix epoch in the pgbench
logs, as Tomas points out.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory Accounting v11

2015-07-17 Thread Jeff Davis
On Fri, 2015-07-17 at 15:52 +1200, David Rowley wrote:

 Should we mark the patch as returned with feedback in the commitfest
 app then?

I believe the memory accounting patch has been rejected. Instead, the
work will be done in the HashAgg patch.

Thank you for the review!

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Jeff Davis
On Tue, 2015-07-14 at 16:19 -0400, Robert Haas wrote:
 tuplesort.c does its own accounting, and TBH that seems like the right
 thing to do here, too.  The difficulty is, I think, that some
 transition functions use an internal data type for the transition
 state, which might not be a single palloc'd chunk.  But since we can't
 spill those aggregates to disk *anyway*, that doesn't really matter.

So would it be acceptable to just ignore the memory consumed by
internal, or come up with some heuristic?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory Accounting v11

2015-07-15 Thread Jeff Davis
On Wed, 2015-07-15 at 12:59 +0530, Atri Sharma wrote:

 
 I think a heuristic would be more suited here and ignoring memory
 consumption for internal types means that we are not making the memory
 accounting useful for a set of usecases. 
 
OK, I will drop this patch and proceed with the HashAgg patch, with a
heuristic for internal types.

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory Accounting v11

2015-07-11 Thread Jeff Davis
On Thu, 2015-07-09 at 14:41 +1200, David Rowley wrote:

 Are you going to implement this? or are you happy with the single
 level context tracking is the right thing?
 I'm going to mark the patch as waiting on author for now.

Attached a version of the patch that does multi-level tracking, v12. It
does this is a simpler way, just like an earlier version of the patch:
it simply traverses up the chain and adds to each parent in a
total_allocated field.

The idea you mentioned is a possible optimization of this idea, but it
only makes sense if I'm able to detect a difference between v11
(single-level) and v12 (multi-level). I tried Robert's test[1] again and
I didn't see a difference on my workstation (technically, v12 came out
the fastest, which means I'm just seeing noise anyway), so I can't
evaluate whether your idea will improve things.

After talking with a few people at PGCon, small noisy differences in CPU
timings can appear for almost any tweak to the code, and aren't
necessarily cause for major concern.

Regards,
Jeff Davis

[1] pgbench -i -s 300, then do the following 3 times each for master,
v11, and v12, and take the median of logged traces:

start server; set trace_sort=on; reindex index pgbench_accounts_pkey;
stop server

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_alloc(MemoryContext context, int64 size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 502,510 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
+ 
+ 		update_alloc((MemoryContext) set, blksize);
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 595,602 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_alloc(context, -(block->endptr - ((char*) block)));
+ 
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 635,640  AllocSetDelete(MemoryContext context)
--- 642,654 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
+ 
+ 		/*
+ 		 * Parent is already unlinked from this context, so we can't use
+ 		 * update_alloc(). The caller should have already subtracted from the
+ 		 * parents' total_allocated.
+ 		 */
+ 
  		free(block);
  		block = next;
  	}
***
*** 672,677  AllocSetAlloc(MemoryContext context, Size size)
--- 686,694 
  		block = (AllocBlock) malloc(blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		update_alloc(context, blksize);
+ 
  		block->aset = set;
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
***
*** 861,866  AllocSetAlloc(MemoryContext context, Size size)
--- 878,885 
  		if (block == NULL)
  			return NULL;
  
+ 		update_alloc(context, blksize);
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 964,969  AllocSetFree(MemoryContext context, void *pointer)
--- 983,991 
  			set->blocks = block->next;
  		else
  			prevblock->next = block->next;
+ 
+ 		update_alloc(context, -(block->endptr - ((char*) block)));
+ 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 1077,1082  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1099,1105 
  		AllocBlock	prevblock = NULL;
  		Size		chksize;
  		Size		blksize;
+ 		Size		oldblksize;
  
  		while (block != NULL)
  		{
***
*** 1094,1102  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1117,1130 
  		/* Do the realloc */
  		chksize = MAXALIGN(size);
  		blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ 		oldblksize = block->endptr - ((char *)block);
+ 
  		block = (AllocBlock) realloc(block, blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		update_alloc(context, blksize - oldblksize);
+ 
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
  		/* Update pointers since block has likely been moved */
***
*** 1361,1363  AllocSetCheck(MemoryContext context)
--- 1389,1407 
  }
  
  #endif   /* MEMORY_CONTEXT_CHECKING */
+ 
+ /*
+  * Update self_allocated and total_allocated for the context. Size can be
+  * positive or negative.
+  */
+ void
+ update_alloc(MemoryContext context, int64 size)
+ {
+ 	MemoryContext	parent;
+ 
+ 	context->self_allocated += size;
+ 
+ 	/* update total_allocated for self and all parents */
+ 	for (parent = context; parent != NULL

Re: [HACKERS] Memory Accounting v11

2015-07-07 Thread Jeff Davis
On Tue, 2015-07-07 at 16:49 +1200, David Rowley wrote:

 of performance decrease anywhere. I'm just getting too much variation
 in the test results to get any sort of idea.
 
That was my experience as well. Thank you for taking a look.

 My main question here is: How sure are you that none of your intended
 use cases will need to call MemoryContextMemAllocated with recurse =
 true? I'd imagine if you ever have to
 use MemoryContextMemAllocated(ctx, true) then we'll all be sitting
 around with our stopwatches out to see if the call incurs too much of
 a performance penalty.

I don't expect HashAgg to be using many memory contexts. It did with
array_agg, but that's been changed, and was a bad pattern anyway (one
context per group is quite wasteful).

If it turns out to be slow, we can call it less frequently (e.g.
extrapolate growth rate to determine when to check).

 
 Shouldn't you be setting oldblksize after the if? I'd imagine the
 realloc() failure is rare, but it just seems better to do it just
 before it's required and only when required.

I don't think that's safe. Realloc can return an entirely new pointer,
so the endptr would be pointing outside of the allocation. I'd have to
hang on to the original pointer before the realloc, so I'd need an extra
variable anyway.


 
 Here there's a double assignment of child.

Will fix.

 
 I might be mistaken here, but can't you just set context->mem_allocated
 = 0; after that loop? 
 Or maybe it would be an improvement to only do the decrement
 if MEMORY_CONTEXT_CHECKING is defined, then Assert() that
 mem_allocated == 0 after the loop...
 
OK. Why not just always Assert that?
 
 One other thing that I don't remember seeing discussed would be to
 just store a List in each context which would store all of the
 mem_allocated's that need to be updated during each allocation and
 deallocation of memory. The list would be setup once when the context
 is created. When a child context is setup we'd just loop up the parent
 context chain and list_concat() all of the parent's lists onto the
 child's lists. Would this method add too much overhead when a context
 is deleted? Has this been explored already?
 
That's a good idea, as it would be faster than recursing. Also, the
number of parents can't change, so we can just make an array, and that
would be quite fast to update. Unless I'm missing something, this sounds
like a nice solution. It would require more space in MemoryContextData,
but that might not be a problem.
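
A rough standalone sketch of that idea (just the shape of it, not PostgreSQL
code): each context carries a flat array of its ancestors, built once at
creation, so every block-level allocation updates a fixed set of counters and
reading a total is O(1):

#include <stdio.h>
#include <stdlib.h>

typedef struct Ctx
{
    long         self_allocated;    /* blocks malloc'd for this context */
    long         total_allocated;   /* self plus all descendants */
    struct Ctx **ancestors;         /* every parent up to the root */
    int          nancestors;
} Ctx;

static Ctx *
ctx_create(Ctx *parent)
{
    Ctx *ctx = calloc(1, sizeof(Ctx));

    ctx->nancestors = parent ? parent->nancestors + 1 : 0;
    ctx->ancestors = malloc(sizeof(Ctx *) * ctx->nancestors);
    for (int i = 0; i < ctx->nancestors - 1; i++)
        ctx->ancestors[i] = parent->ancestors[i];
    if (parent)
        ctx->ancestors[ctx->nancestors - 1] = parent;
    return ctx;
}

/* Called on every block malloc/free; size may be negative. */
static void
ctx_update_alloc(Ctx *ctx, long size)
{
    ctx->self_allocated += size;
    ctx->total_allocated += size;
    for (int i = 0; i < ctx->nancestors; i++)
        ctx->ancestors[i]->total_allocated += size;
}

int
main(void)
{
    Ctx *top = ctx_create(NULL);
    Ctx *agg = ctx_create(top);
    Ctx *group = ctx_create(agg);

    ctx_update_alloc(group, 8192);
    ctx_update_alloc(agg, 1024);

    printf("top total: %ld, agg total: %ld, group total: %ld\n",
           top->total_allocated, agg->total_allocated, group->total_allocated);

    free(group->ancestors); free(group);
    free(agg->ancestors); free(agg);
    free(top->ancestors); free(top);
    return 0;
}

The trade-off mentioned above is the extra array in each context; in exchange,
a MemoryContextMemAllocated-style query no longer has to recurse over the
children at all.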

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-07-07 Thread Jeff Davis
On Tue, 2015-07-07 at 09:27 +0530, Amit Kapila wrote:

 I am not sure how many blocks difference could be considered okay for
 deviation?

In my testing (a long time ago) deviations of tens of blocks didn't show
a problem.

However, an assumption of the sync scan work was that the CPU is
processing faster than the IO system; whereas the parallel scan patch
assumes that the IO system is faster than a single core. So perhaps the
features are incompatible after all. Only testing will say for sure.

Then again, syncscans are designed in such a way that they are unlikely
to hurt in any situation. Even if the scans diverge (or never converge
in the first place), it shouldn't be worse than starting at block zero
every time.

I'd prefer to leave syncscans intact for parallel scans unless you find
a reasonable situation where they perform worse. This shouldn't add any
complexity to the patch (if it does, let me know).

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-07-06 Thread Jeff Davis
On Mon, 2015-07-06 at 10:37 +0530, Amit Kapila wrote:

 Or the other way to look at it could be separate out fields which are
 required for parallel scan which is done currently by forming a
 separate structure ParallelHeapScanDescData.
 
I was suggesting that you separate out both the normal scan fields and
the partial scan fields, that way we're sure that rs_nblocks is not
accessed during a parallel scan.

Or, you could try wrapping the parts of heapam.c that are affected by
parallelism into new static functions.

 The reason why partial scan can't be mixed with sync scan is that in
 parallel
 scan, it performs the scan of heap by synchronizing blocks (each
 parallel worker
 scans a block and then asks for a next block to scan) among parallel
 workers.
 Now if we try to make sync scans work along with it, the
 synchronization among
 parallel workers will go for a toss.  It might not be impossible to
 make that
 work in some way, but not sure if it is important enough for sync
 scans to work
 along with parallel scan.

I haven't tested it, but I think it would still be helpful. The block
accesses are still in order even during a partial scan, so why wouldn't
it help?

You might be concerned about the reporting of a block location, which
would become more noisy with increased parallelism. But in my original
testing, sync scans weren't very sensitive to slight deviations, because
of caching effects.

 tqueue.c is mainly designed to pass tuples between parallel workers
 and currently it is used in Funnel operator to gather the tuples
 generated
 by all the parallel workers.  I think we can use it for any other
 operator
 which needs tuple communication among parallel workers.

Some specifics of the Funnel operator seem to be a part of tqueue, which
doesn't make sense to me. For instance, reading from the set of queues
in a round-robin fashion is part of the Funnel algorithm, and doesn't
seem suitable for a generic tuple communication mechanism (that would
never allow order-sensitive reading, for example).
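
To illustrate why that reading pattern is Funnel-specific rather than generic,
here is a toy, self-contained round-robin gather loop (not the real
shm_mq/tqueue code); each worker's output is ordered, but the interleaved
result is not, which is why order-sensitive reading can't be layered on top:

#include <stdio.h>

#define NWORKERS  3
#define MAXTUPLES 3

typedef struct
{
    int tuples[MAXTUPLES];   /* pretend these are tuples from one worker */
    int ntuples;
    int pos;                 /* how far the leader has read */
} Queue;

/* Round-robin gather: take one tuple from each non-exhausted queue in turn. */
static void
gather(Queue *queues, int nqueues)
{
    int exhausted = 0;
    int next = 0;

    while (exhausted < nqueues)
    {
        Queue *q = &queues[next];

        if (q->pos < q->ntuples)
        {
            printf("leader got tuple %d from worker %d\n",
                   q->tuples[q->pos++], next);
            if (q->pos == q->ntuples)
                exhausted++;
        }
        next = (next + 1) % nqueues;
    }
}

int
main(void)
{
    /* each worker's output is ordered ... */
    Queue queues[NWORKERS] = {
        {{1, 2, 3}, 3, 0},
        {{4, 5, 6}, 3, 0},
        {{7, 8, 9}, 3, 0},
    };

    /* ... but the gathered stream comes out 1 4 7 2 5 8 3 6 9 */
    gather(queues, NWORKERS);
    return 0;
}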

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-07-05 Thread Jeff Davis
On Fri, 2015-07-03 at 17:35 +0530, Amit Kapila wrote:

 Attached, find the rebased version of patch.
 

Comments:


* The heapam.c changes seem a little ad-hoc. Conceptually, which
portions should be affected by parallelism? How do we know we didn't
miss something?
* Why is initscan getting the number of blocks from the structure? Is it
just to avoid an extra syscall, or is there a correctness issue there?
Is initscan expecting that heap_parallelscan_initialize is always called
first (if parallel)? Please add a comment explaining above.
* What's the difference between scan->rs_nblocks and
scan->rs_parallel->phs_nblocks? Same for rs_rd->rd_id and phs_relid.
* It might be good to separate out some fields which differ between the
normal heap scan and the parallel heap scan. Perhaps put rs_ctup,
rs_cblock, and rs_cbuf into a separate structure, which is always NULL
during a parallel scan. That way we don't accidentally use a
non-parallel field when doing a parallel scan.
* Is there a reason that partial scans can't work with syncscan? It
looks like you're not choosing the starting block in the same way, so it
always starts at zero and never does syncscan. If we don't want to mix
syncscan and partial scan, that's fine, but it should be more explicit.

I'm trying to understand where tqueue.c fits in. It seems very closely
tied to the Funnel operator, because any change to the way Funnel works
would almost certainly require changes in tqueue.c. But tqueue is a
generic name for the file, so something seems off. Either we should
explicitly make it the supporting routines for the Funnel operator, or
we should try to generalize it a little.

I still have quite a bit to look at, but this is a start.

Regards,
Jeff Davis






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-07-01 Thread Jeff Davis
On Wed, 2015-07-01 at 11:07 +0530, Amit Kapila wrote:

 For what you are asking to change name for?

There are still some places, at least in the comments, that call it a
parallel sequential scan.


 a. Infrastructure for parallel execution, like some of the stuff in
 execparallel.c, heapam.c,tqueue.c, etc and all other generic
 (non-nodes specific) code.

Did you consider passing tuples through the tqueue by reference rather
than copying? The page should be pinned by the worker process, but
perhaps that's a bad assumption to make?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-06-29 Thread Jeff Davis
[Jumping in without catching up on entire thread. Please let me know
if these questions have already been covered.]

1. Can you change the name to something like ParallelHeapScan?
Parallel Sequential is a contradiction. (I know this is bikeshedding
and I won't protest further if you keep the name.)

2. Where is the speedup coming from? How much of it is CPU and IO
overlapping (i.e. not leaving disk or CPU idle while the other is
working), and how much from the CPU parallelism? I know this is
difficult to answer rigorously, but it would be nice to have some
breakdown even if for a specific machine.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Refactor to split nodeAgg.c?

2015-06-29 Thread Jeff Davis
I was going to rebase my HashAgg patch, and got some conflicts related
to the grouping sets patch. I could probably sort them out, but I think
that may be the tipping point where we want to break up nodeAgg.c into
nodeSortedAgg.c and nodeHashAgg.c, and probably a common file as well.

This would also (I hope) be convenient for Simon and David Rowley, who
have been hacking on aggregates in general.

Anyone see a reason I shouldn't give this a try?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Memory Accounting v11

2015-06-14 Thread Jeff Davis
This patch tracks memory usage (at the block level) for all memory
contexts. Individual palloc()s aren't tracked; only the new blocks
allocated to the memory context with malloc().

It also adds a new function, MemoryContextMemAllocated() which can
either retrieve the total allocated for the context, or it can recurse
and sum up the allocations for all subcontexts as well.
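
As a toy, standalone sketch of what the recursive form of that query looks
like (the field names only loosely mirror the real context struct):

#include <stdio.h>

typedef struct Ctx
{
    long        mem_allocated;   /* bytes malloc'd for this context's blocks */
    struct Ctx *firstchild;
    struct Ctx *nextchild;       /* sibling link */
} Ctx;

static long
ctx_mem_allocated(const Ctx *ctx, int recurse)
{
    long total = ctx->mem_allocated;

    if (recurse)
        for (const Ctx *child = ctx->firstchild; child; child = child->nextchild)
            total += ctx_mem_allocated(child, recurse);
    return total;
}

int
main(void)
{
    Ctx grandchild = {4096, NULL, NULL};
    Ctx child_a = {8192, &grandchild, NULL};
    Ctx child_b = {1024, NULL, NULL};
    Ctx top = {16384, &child_a, NULL};

    child_a.nextchild = &child_b;

    printf("top only: %ld, whole tree: %ld\n",
           ctx_mem_allocated(&top, 0), ctx_mem_allocated(&top, 1));
    return 0;
}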

This is intended to be used by HashAgg in an upcoming patch that will
bound the memory and spill to disk.

Previous discussion here:

http://www.postgresql.org/message-id/1407012053.15301.53.camel@jeff-desktop

Previous concerns:

* There was a slowdown reported of around 1-3% (depending on the exact
version of the patch) on an IBM power machine when doing an index
rebuild. The results were fairly noisy for me, but it seemed to hold up.
See http://www.postgresql.org/message-id/CA
+Tgmobnu7XEn1gRdXnFo37P79bF=qLt46=37ajp3cro9db...@mail.gmail.com 
* Adding a struct field to MemoryContextData may be bad for the CPU
caching behavior, and may be the cause of the above slowdown.
* Previous versions of the patch updated the parent contexts'
allocations as well, which avoided the need to recurse when querying the
total allocation. That seemed to penalize cases that didn't need to
track the allocation. We discussed trying to opt-in to this behavior,
but it seemed more awkward than helpful. Now, the patch only updates the
allocation of the current context, and querying means recursing through
the child contexts.
* There was a concern that, if MemoryContextMemAllocated needs to
recurse to the child contexts, it will be too slow for HashAgg of
array_agg, which creates a child context per group. That was solved with
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

My general answer to the performance concerns is that they aren't a
reason to block this patch, unless someone has a suggestion about how to
fix them. Adding one field to a struct and a few arithmetic operations
for each malloc() or realloc() seems reasonable to me.

The current state, where HashAgg just blows up the memory, is just not
reasonable, and we need to track the memory to fix that problem. Others
have also mentioned that we might want to use this mechanism to track
memory for other operators, like Sort or HashJoin, which might be
simpler and more accurate.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 500,508 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
+ 
+ 		((MemoryContext) set)->mem_allocated += blksize;
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 593,600 
  		else
  		{
  			/* Normal case, release the block */
+ 			context->mem_allocated -= block->endptr - ((char*) block);
+ 
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 632,637  AllocSetDelete(MemoryContext context)
--- 637,644 
  	{
  		AllocBlock	next = block->next;
  
+ 		context->mem_allocated -= block->endptr - ((char *) block);
+ 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 672,677  AllocSetAlloc(MemoryContext context, Size size)
--- 679,687 
  		block = (AllocBlock) malloc(blksize);
  		if (block == NULL)
  			return NULL;
+ 
+ 		context->mem_allocated += blksize;
+ 
  		block->aset = set;
  		block->freeptr = block->endptr = ((char *) block) + blksize;
  
***
*** 861,866  AllocSetAlloc(MemoryContext context, Size size)
--- 871,878 
  		if (block == NULL)
  			return NULL;
  
+ 		context->mem_allocated += blksize;
+ 
  		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 964,969  AllocSetFree(MemoryContext context, void *pointer)
--- 976,984 
  			set->blocks = block->next;
  		else
  			prevblock->next = block->next;
+ 
+ 		context->mem_allocated -= block->endptr - ((char*) block);
+ 
  #ifdef CLOBBER_FREED_MEMORY
  		wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 1077,1082  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1092,1098 
  		AllocBlock	prevblock = NULL;
  		Size		chksize;
  		Size		blksize;
+ 		Size		oldblksize;
  
  		while (block != NULL)
  		{
***
*** 1094,1102  AllocSetRealloc(MemoryContext context, void *pointer, Size size)
--- 1110,1123 
  		/* Do the realloc */
  		chksize = MAXALIGN(size);
  		blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+ 		oldblksize = block->endptr - ((char *)block);
+ 
  		block = (AllocBlock) realloc(block, blksize

Re: [HACKERS] Memory Accounting v11

2015-06-14 Thread Jeff Davis
On Sun, 2015-06-14 at 16:21 -0400, Tom Lane wrote:
 Meh.  HashAgg could track its memory usage without loading the entire
 system with a penalty.

When I tried doing it outside of the MemoryContexts, it seemed to get
awkward quickly. I'm open to suggestion if you can point me in the right
direction. Maybe I can peek at the sizes of chunks holding state values
and group keys, and combine that with the hash table size-estimating
code?

   Moreover, this is about fourth or fifth down the
 list of the implementation problems with spilling hash aggregation to
 disk.  It would be good to see credible solutions for the bigger issues
 before we buy into adding overhead for a mechanism with no uses in core.

I had several iterations of a full implementation of the spill-to-disk
HashAgg patch[1]. Tomas Vondra has some constructive review comments,
but all of them seemed solvable. What do you see as a major unsolved
issue?

If I recall, you were concerned about things like array_agg, where an
individual state could get larger than work_mem. That's a valid concern,
but it's not the problem I was trying to solve.

Regards,
Jeff Davis

[1]
http://www.postgresql.org/message-id/1407706010.6623.16.camel@jeff-desktop



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type

2015-04-23 Thread Jeff Davis
On Thu, Apr 23, 2015 at 1:49 AM, David G. Johnston
david.g.johns...@gmail.com wrote:
 Reading and writing all this I'm convinced you have gotten the idea in your
 mind an expectation of equivalency and consistency where there really is
 little or none from an overall design perspective.  And none insofar as
 would merit trying to force the two example queries you provide to behave
 identically.  There are a number of things about SQL that one either simply
 lives with or goes through mind contortions to understand the, possibly
 imperfect, reasoning behind.  This is one of those things: and while it's
 been fun to do those contortions in the end I am only a little bit better
 off than when I simply accepted the fact the unknown and untyped were
 similar but different (even if I hadn't considered giving them different
 names).

You make some good points, but I disagree for two reasons:

1. This is a postgres issue, not a general SQL issue, so we can't
simply say SQL is weird (like we can with a lot of the strange NULL
semantics).
2. Postgres has determined that the unknown column reference b in
the query SELECT a=b FROM (SELECT ''::text, '  ') x(a,b) can and
should be coerced to text. So postgres has already made that decision.
It's just that when it tries to coerce it, it fails. Even if you
believe that literals and column references are not equivalent, I
don't see any justification at all for this behavior -- postgres
should not decide to coerce it in the first place if it's going to
fail.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type

2015-04-23 Thread Jeff Davis
On Wed, 2015-04-22 at 20:35 -0700, David G. Johnston wrote:

 But the fact that column b has the data type unknown is only a
 warning - not an error.
 
I get an error:

postgres=# SELECT '  '::text = 'a';
 ?column? 
--
 f
(1 row)

postgres=# SELECT a=b FROM (SELECT ''::text, '  ') x(a,b);
ERROR:  failed to find conversion function from unknown to text

So that means the column reference b is treated differently than the
literal. Here I don't mean a reference to an actual column of a real
table, just an identifier (b) that parses as a columnref.

Creating the table gives you a warning (not an error), but I think that
was a poor example for me to choose, and not important to my point.
 
 This seems to be a case of the common problem (or, at least recently
 mentioned) where type conversion only deals with data and not context.
 
 
 http://www.postgresql.org/message-id/CADx9qBmVPQvSH3
 +2cH4cwwPmphW1mE18e=wumlfuc-qz-t7...@mail.gmail.com
 
 
I think that is a different problem. That's a runtime type conversion
error (execution time), and I'm talking about something happening at
parse analysis time.

 
 but this too works - which is why the implicit cast concept above
 fails (I'm leaving it since the thought process may help in
 understanding):
 
 
 SELECT 1 = '1';
 
 
 From which I infer that an unknown literal is allowed to be fed
 directly into a type's input function to facilitate a direct coercion.

Yes, I believe that's what's happening. When we use an unknown literal,
it's acting more like a value constructor and will pass it to the type
input function. When it's a columnref, even if unknown, it tries to cast
it and fails.

But that is very confusing. In the example at the top of this email, it
seems like the second query should be equivalent to the first, or even
that postgres should be able to rewrite the second into the first. But
the second query fails where the first succeeds.


 At this point...backward compatibility?

Backwards compatibility of what queries? I guess the ones that return
unknowns to the client or create tables with unknown columns?

 create table a(u) as select '1';
 
 
 WARNING: column u has type unknown
 DETAIL:  Proceeding with relation creation anyway.
 
 
 Related question: was there ever a time when the above failed instead
 of just supplying a warning?

Not that I recall.



 ​My gut reaction is if you feel strongly enough to add some additional
 documentation or warnings/hints/details related to this topic they
 probably would get put in; but disallowing unknown as first-class
 type is likely to fail to pass a cost-benefit evaluation.

I'm not proposing that we eliminate unknown. I just think columnrefs and
literals should behave consistently. If we really don't want unknown
columnrefs, it seems like we could at least throw a better error.

If we were starting from scratch, I'd also not return unknown to the
client, but we have to worry about the backwards compatibility.

 Distinguishing between untyped literals and unknown type literals
 seems promising concept to aid in understanding the difference in the
 face of not being able (or wanting) to actually change the behavior.

Not sure I understand that proposal, can you elaborate?

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] Failure to coerce unknown type to specific type

2015-04-22 Thread Jeff Davis
Moving thread to -hackers.

On Wed, Apr 8, 2015 at 11:18 PM, Jeff Davis pg...@j-davis.com wrote:
 That example was just for illustration. My other example didn't require
 creating a table at all:

   SELECT a=b FROM (SELECT ''::text, '  ') x(a,b);

 it's fine with me if we want that to fail, but I don't think we're
 failing in the right place, or with the right error message.

 I'm not clear on what rules we're applying such that the above query
 should fail, but:

   SELECT ''::text='  ';

 should succeed. Unknown literals are OK, but unknown column references
 are not? If that's the rule, can we catch that earlier, and throw an
 error like 'column reference b has unknown type'?

Is the behavior of unknown literals vs. unknown column references
documented anywhere? I tried looking here:
http://www.postgresql.org/docs/devel/static/typeconv.html, but it
doesn't seem to make the distinction between how unknown literals vs.
unknown column references are handled.

My understanding until now has been that unknown types are a
placeholder while still inferring the types. But that leaves a few
questions:

1. Why do we care whether the unknown is a literal or a column reference?
2. Unknown column references seem basically useless, because we will
almost always encounter the failure to find conversion error, so why
do we allow them?
3. If unknowns are just a placeholder, why do we return them to the
client? What do we expect the client to do with it?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Invent a memory context reset/delete callback mechanism.

2015-02-27 Thread Jeff Davis
On Fri, 2015-02-27 at 22:17 +, Tom Lane wrote:
 In passing, per discussion, rearrange some boolean fields in struct
 MemoryContextData so as to avoid wasted padding space.  For safety,
 this requires making allowInCritSection's existence unconditional;
 but I think that's a better approach than what was there anyway.

I notice that this uses the bytes in MemoryContextData that I was hoping
to use for the memory accounting patch (for memory-bounded HashAgg).

Leaving no space for even a 32-bit value (without AllocSetContext
growing beyond the magical 192-byte number Andres mentioned) removes
most of the options for memory accounting. The only things I can think
of are:

  1. Move a few less-frequently-accessed fields out to an auxiliary
structure (Tomas proposed something similar); or
  2. Walk the memory contexts, but use some kind of
estimation/extrapolation so that it doesn't need to be done for every
input tuple of HashAgg.
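
Option 2 could look roughly like the following standalone sketch (hypothetical
numbers and names, not tied to the real HashAgg code): measure for real only
every so often, and extrapolate from the recent per-group growth in between:

#include <stdio.h>

#define CHECK_INTERVAL 1000          /* groups between real measurements */

/* Stand-in for walking the memory contexts; in reality this is the
 * expensive step we want to avoid doing per input tuple. */
static long
measure_memory(long ngroups)
{
    return 1024 + ngroups * 96;      /* pretend each group costs ~96 bytes */
}

int
main(void)
{
    long   limit = 4 * 1024 * 1024;  /* work_mem-ish budget, in bytes */
    long   ngroups = 0;
    long   last_measured = measure_memory(0);
    long   groups_at_measure = 0;
    double bytes_per_group = 0;

    for (;;)
    {
        ngroups++;                   /* pretend every tuple starts a group */

        if (ngroups - groups_at_measure >= CHECK_INTERVAL)
        {
            long now = measure_memory(ngroups);

            bytes_per_group = (double) (now - last_measured) / CHECK_INTERVAL;
            last_measured = now;
            groups_at_measure = ngroups;
        }

        /* cheap extrapolated estimate; no context walk needed here */
        double estimate = last_measured +
            bytes_per_group * (ngroups - groups_at_measure);

        if (estimate >= limit)
        {
            printf("would start spilling after ~%ld groups\n", ngroups);
            break;
        }
    }
    return 0;
}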

Thoughts?

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2015-02-22 Thread Jeff Davis

On Wed, 2015-01-07 at 20:07 +0100, Tomas Vondra wrote:
 So I started digging in the code and I noticed this:
 
 hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
 
 which is IMHO obviously wrong, because that accounts only for the
 hashtable itself. It might be correct for aggregates with state passed
 by value, but it's incorrect for state passed by reference (e.g.
 Numeric, arrays etc.), because initialize_aggregates does this:
 
 oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
 pergroupstate->transValue = datumCopy(peraggstate->initValue,
 peraggstate->transtypeByVal,
 peraggstate->transtypeLen);
 MemoryContextSwitchTo(oldContext);
 
 and it's also wrong for all the user-defined aggregates that have no
 access to hashcontext at all, and only get a reference to aggcontext using
 AggCheckCallContext(). array_agg() is a prime example.

Actually, I believe the first one is right, and the second one is wrong.
If we allocate pass-by-ref transition states in aggcontext, that defeats
the purpose of the patch, because we can't release the memory when we
start a new batch (because we can't reset aggcontext).

Instead, the transition states should be copied into hashcontext; we
should count the memory usage of hashcontext; AggCheckCallContext should
return hashcontext; and after each batch we should reset hashcontext.

It might be worth refactoring a bit to call it the groupcontext or
something instead, and use it for both HashAgg and GroupAgg. That would
reduce the need for conditionals.



[ ... problems with array_agg subcontexts ... ]

 and it runs indefinitely (I gave up after a few minutes). I believe this
 renders the proposed memory accounting approach dead.

...

 The array_agg() patch I submitted to this CF would fix this particular
 query, as it removes the child contexts (so there's no need for
 recursion in MemoryContextMemAllocated), but it does nothing to the
 user-defined aggregates out there. And it's not committed yet.

Now that your patch is committed, I don't think I'd call the memory
accounting patch I proposed dead yet. We decided that creating one
context per group is essentially a bug, so we don't need to optimize for
that case.

Your approach may be better, though.

Thank you for reviewing. I'll update my patches and resubmit for the
next CF.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Variable renaming in AllocSetContextCreate (will commit soon, no functional impact)

2015-02-21 Thread Jeff Davis
This is just to make the variable names in this function consistent with
the rest of the file (and less confusing). No functional impact, so I'll
commit soon unless someone objects.

Previously, this was part of the memory accounting patch, but that
hasn't made it in yet. Might as well commit the cleanup at least.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 &AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,454 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   &AllocSetMethods,
!   parent,
!   name);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context->initBlockSize = initBlockSize;
! 	context->maxBlockSize = maxBlockSize;
! 	context->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 462,470 
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set->initBlockSize = initBlockSize;
! 	set->maxBlockSize = maxBlockSize;
! 	set->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	context->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		context->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
--- 480,489 
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	set->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (set->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		set->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
***
*** 500,519  AllocSetContextCreate(MemoryContext parent,
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 		block->aset = context;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = context->blocks;
! 		context->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		context->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return (MemoryContext) context;
  }
  
  /*
--- 503,522 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = set->blocks;
! 		set->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		set->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return context;
  }
  
  /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Variable renaming in AllocSetContextCreate (will commit soon, no functional impact)

2015-02-21 Thread Jeff Davis
On Sun, 2015-02-22 at 00:07 -0500, Tom Lane wrote:
 If you want to have just *one* variable but change its name and type,
 I'd be ok with that.

Thank you for taking a quick look. Committed as a simple rename from
context to set.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2015-02-21 Thread Jeff Davis
On Sun, 2015-02-22 at 03:14 +0100, Tomas Vondra wrote:
SELECT COUNT(x) FROM (
   SELECT a, array_agg(i) AS x FROM test GROUP BY 1
) foo;
  
  That's actually a bogus test -- array_agg is never executed.
 
 Really? How could that happen when the result of array_agg() is passed
 to the COUNT()? Also, how could that allocate huge amounts of memory and
 get killed by OOM, which happens easily with this query?

Oops, I misread that as COUNT(*). Count(x) will force array_agg() to
be executed.
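
For anyone skimming, the difference is just this (same test table as in
the quoted text):

  -- x is never referenced, so array_agg() may be skipped entirely:
  SELECT COUNT(*) FROM (SELECT a, array_agg(i) AS x FROM test GROUP BY 1) foo;

  -- COUNT(x) references the aggregate's result, so array_agg() must run:
  SELECT COUNT(x) FROM (SELECT a, array_agg(i) AS x FROM test GROUP BY 1) foo;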

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2015-02-21 Thread Jeff Davis
On Wed, 2015-01-28 at 23:25 +0100, Tomas Vondra wrote:
 3) moves the assert into the 'if (release)' branch

You missed one, but I got it.

 4) includes the comments proposed by Ali Akbar in his reviews
 
Warnings at makeArrayResult/makeMdArrayResult about freeing memory
with private subcontexts.

I also edited the comments substantially.

 Regarding the performance impact of decreasing the size of the
 preallocated array from 64 to just 8 elements, I tried this.
 
   CREATE TABLE test AS
   SELECT mod(i,10) a, i FROM generate_series(1,64*10) s(i);
 
   SELECT a, array_agg(i) AS x FROM test GROUP BY 1;
 
 or actually (to minimize transfer overhead):
 
   SELECT COUNT(x) FROM (
  SELECT a, array_agg(i) AS x FROM test GROUP BY 1
   ) foo;

That's actually a bogus test -- array_agg is never executed. See the
test script I used (attached). Many small groups has a significant
improvement with your patch (median 167ms unpatched, 125ms patched), and
the one large group is not measurably different (median 58ms for both).

The test script uses a small dataset of 100K tuples, because 1M tuples
will run out of memory for small groups without the patch.
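
Roughly, the two shapes I tested look like this (my shorthand here; the
exact definitions are in the attached array_agg_test.sql):

  -- many small groups: 100K tuples, 10K groups of 10
  CREATE TABLE small_groups AS
    SELECT i % 10000 AS a, i FROM generate_series(1, 100000) s(i);

  -- one large group: 100K tuples, a single group
  CREATE TABLE one_group AS
    SELECT 1 AS a, i FROM generate_series(1, 100000) s(i);

  SELECT COUNT(x) FROM (SELECT a, array_agg(i) AS x FROM small_groups GROUP BY 1) foo;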

Committed.

Regards,
Jeff Davis


array_agg_test.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.6 Feature help requested: Inclusion Constraints

2015-02-09 Thread Jeff Davis
On Sat, 2015-02-07 at 16:08 -0800, Jeff Davis wrote:
 I believe Inclusion Constraints will be important for postgres.

I forgot to credit Darren Duncan with the name of this feature:

http://www.postgresql.org/message-id/4f8bb9b0.5090...@darrenduncan.net

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] 9.6 Feature help requested: Inclusion Constraints

2015-02-07 Thread Jeff Davis
I believe Inclusion Constraints will be important for postgres.
Unfortunately, my time has been scarce lately, so without help it may
miss 9.6 as well. If someone is interested in working on this with me to
deliver a good submission in a couple months, please let me know.

The idea is to make a generalized form of foreign keys, just like
exclusion constraints are a generalized form of unique constraints.
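
As a reminder of what the analogy looks like on the exclusion side, this
already works today (btree_gist supplies the b-tree equality operator
for GiST on the plain column):

  create extension if not exists btree_gist;

  -- "no two overlapping bookings for the same room": a generalized UNIQUE
  create table room_booking (
    room int8,
    during tstzrange,
    exclude using gist (room with =, during with &&)
  );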

===
Syntax:
===

(see attachment for more specifics)

  INCLUDE [ USING index_method ]
( exclude_element WITH operator [, ... ] )
REFERENCES reftable [ ( refcolumn [, ... ] ) ]
[ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
index_parameters [ WHERE ( predicate ) ]

Notice the WHERE clause, which is missing from FKs.

==
Example #1
==

  -- e.g. Super Bowl
  create table event (
event_id int8,
title text,
eduring tstzrange,
exclude using gist (event_id with =, eduring with &&)
  );

  -- e.g. Catch on one yard line
  create table event_highlight (
event_id int8,
description text,
hduring tstzrange,
exclude using gist (event_id with =, hduring with &&),
include using gist (event_id with =, hduring with <@)
  references event (event_id, eduring)
  );

Ensure that the hduring is always contained (<@) in some eduring with
the same event_id.
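
Hypothetical usage, assuming the INCLUDE syntax above were implemented
(it is not; this is only to illustrate the intended semantics):

  insert into event values
    (1, 'Super Bowl', '[2015-02-01 15:30, 2015-02-01 20:00)');

  -- contained (<@) in the referenced event's eduring: accepted
  insert into event_highlight values
    (1, 'Catch on one yard line', '[2015-02-01 19:00, 2015-02-01 19:01)');

  -- not contained in any eduring for event_id 1: rejected by the
  -- inclusion constraint
  insert into event_highlight values
    (1, 'Post-game interview', '[2015-02-01 20:30, 2015-02-01 20:45)');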

==
Example #2
==

  create table asset_ownership (
asset_id INT8,
owner_id INT8,
during tstzrange,
exclude using gist (asset_id with =, during with &&),
include using gist (asset_id with =, during with -|-)
  references asset_possession (asset_id, during)
  where (not lower_inf(during))
  );

Ensure that assets in a business are always in the possession of some
responsible party. Here we require that every possession period is
adjacent (-|-) to some other possession period for the same asset,
unless there is no lower bound.


Limitations:


I don't see a good way to make cascading updates work, as Peter pointed
out in a previous discussion:

http://www.postgresql.org/message-id/1288113924.22800.4.ca...@vanquo.pezone.net

And cascading deletes would be a little strange, as well. I think it's
better to just only support ON UPDATE/DELETE NO ACTION. If someone sees
a use case here that is not too specific, we can always try to add
support for something like that later.

We could support ON UPDATE/DELETE RESTRICT, too, but that can be
accomplished by making it IMMEDIATE (unless I'm missing some subtlety).

===
Challenges:
===

I think the biggest technical challenge will be to account for the case
where the a referencing tuple has multiple matches on the referenced
side. This is similar to the challenge of supporting MATCH PARTIAL with
FKs, which is, I assume, why we don't support that yet.

It *seems* solvable, but I haven't looked in detail yet.

Another approach would be to try to enforce the idea that there is only
one match on the referenced side, using an exclusion constraint. I
rejected this approach because:

  * Notice in example #1 that the exclusion constraint on the referenced
side uses the overlaps operator (&&), while the inclusion constraint
uses the contained by operator (<@). Postgres currently has no way to
know that those two operators have a special relationship that ensures
only one match on the referenced side. I don't even know the name for
that relationship, so asking users to declare such a relationship seems
a little much.
  * It seems like it might eliminate some legitimate use cases.
  * We could get MATCH PARTIAL support as a result of this work.


Work to be done:


The rest of the work seems similar in scope to Exclusion Constraints
(0cb65564):
  - parse analysis work
  - catalog work
  - dump/reload support
  - compare performance of a trivial inclusion constraint to a FK
  - ensure that deadlocks are not too common
  - add tests

Any takers?

Regards,
Jeff Davis

*** a/doc/src/sgml/ref/create_table.sgml
--- b/doc/src/sgml/ref/create_table.sgml
***
*** 63,69  CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
PRIMARY KEY ( replaceable class=PARAMETERcolumn_name/replaceable [, ... ] ) replaceable class=PARAMETERindex_parameters/replaceable |
EXCLUDE [ USING replaceable class=parameterindex_method/replaceable ] ( replaceable class=parameterexclude_element/replaceable WITH replaceable class=parameteroperator/replaceable [, ... ] ) replaceable class=parameterindex_parameters/replaceable [ WHERE ( replaceable class=parameterpredicate/replaceable ) ] |
FOREIGN KEY ( replaceable class=PARAMETERcolumn_name/replaceable [, ... ] ) REFERENCES replaceable class=PARAMETERreftable/replaceable [ ( replaceable class=PARAMETERrefcolumn/replaceable [, ... ] ) ]
! [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] [ ON DELETE replaceable class=parameteraction/replaceable ] [ ON UPDATE replaceable class

Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2015-01-21 Thread Jeff Davis
On Tue, 2015-01-20 at 23:37 +0100, Tomas Vondra wrote:
 Tom's message where he points that out is here:
 http://www.postgresql.org/message-id/20707.1396372...@sss.pgh.pa.us

That message also says:

I think a patch that stood a chance of getting committed would need to
detect whether the aggregate was being called in simple or grouped
contexts, and apply different behaviors in the two cases.

I take that as an objection to any patch which does not distinguish
between the grouped and ungrouped aggregate cases, which includes your
patch.

I don't agree with that objection (or perhaps I don't understand it);
but given the strong words above, I need to get some kind of response
before I can consider committing your patch.

 I generally agree that having two API 'facets' with different behavior
 is slightly awkward and assymetric, but I wouldn't call that ugly.

Right, your words are more precise (and polite). My apologies.

 I
 actually modified both APIs initially, but I think Ali is right that not
 breaking the existing API (and keeping the original behavior in that
 case) is better. We can break it any time we want in the future, but
 it's impossible to unbreak it ;-)

We can't break the old API, and I'm not suggesting that we do. I was
hoping to find some alternative.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2015-01-20 Thread Jeff Davis
On Tue, Jan 20, 2015 at 6:44 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Davis pg...@j-davis.com writes:
 Tom (tgl),
 Is my reasoning above acceptable?

 Uh, sorry, I've not been paying any attention to this thread for awhile.
 What's the remaining questions at issue?

This patch is trying to improve the array_agg case where there are
many arrays being constructed simultaneously, such as in HashAgg. You
strongly suggested that a commitable patch would differentiate between
the grouped and ungrouped cases (or perhaps you meant differentiate
between HashAgg and sorted aggregation?). Tomas's current patch does
not do so; it does two main things:

   1. Uses a common memory context for arrays being constructed by
array_agg in any context (ungrouped, sorted agg, and HashAgg)
   2. Reduces the initial array allocation to 8 elements from 64, but
preserves doubling behavior

I don't see either as a big problem, but perhaps there are some
downsides in some cases. I think a worst case would be a slowdown in
the sorted agg case where every group has 64 elements, so I'll try to
see if that's a real problem or not. If you saw a bigger problem,
please let me know; and if not, I'll proceed with the review.
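
Concretely, the worst case I have in mind is something like this (a
sketch; the table name and sizes are my own choice):

  -- 100K groups of exactly 64 elements each
  CREATE TABLE groups64 AS
    SELECT i / 64 AS a, i FROM generate_series(0, 64 * 100000 - 1) s(i);

  SET enable_hashagg = off;  -- force the sorted (GroupAggregate) path
  SELECT COUNT(x) FROM (SELECT a, array_agg(i) AS x FROM groups64 GROUP BY 1) foo;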

There are also some other concerns I have about the API ugliness,
which I believe is the reason there's so much discussion about making
the comments better. The reason the API is ugly is for backwards
compatibility, so there's no perfect solution. Your opinion is welcome
here too, but I mainly need to see if your objection above has been
addressed.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2015-01-19 Thread Jeff Davis
On Wed, Jan 14, 2015 at 1:52 PM, Tomas Vondra
tomas.von...@2ndquadrant.com wrote:
 Attached is v8 patch, with a few comments added:

 1) before initArrayResult() - explaining when it's better to use a
single memory context, and when it's more efficient to use a
separate memory context for each array build state

 2) before makeArrayResult() - explaining that it won't free memory
when allocated in a single memory context (and that a pfree()
has to be used if necessary)

 3) before makeMdArrayResult() - explaining that it's illegal to use
release=true unless using a subcontext


I understand there is history to this API, and we need to be
compatible, but the end result is awkward.

I'm wondering whether it would be better to just have a new set of
functions like accumArrayResultElem, etc., and only allow astate=NULL
in the original accumArrayResult(). That cure might be worse than the
disease though.

Regards,
 Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2015-01-19 Thread Jeff Davis
On Sun, Dec 28, 2014 at 11:53 PM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2014-04-01 at 13:08 -0400, Tom Lane wrote:
 I think a patch that stood a chance of getting committed would need to
 detect whether the aggregate was being called in simple or grouped
 contexts, and apply different behaviors in the two cases.

 The simple context doesn't seem like a big problem even if we change
 things as Tomas suggests:

 IMNSHO these are the issues we really should fix - by lowering the
initial element count (64->4) and using a single memory context.

 In the simple context, there's only one context regardless, so the only
 cost I see is from reducing the initial allocation from 64 to some lower
 number. But if we're doubling each time, it won't take long to get
 there; and because it's the simple context, we only need to do it once.

Tom (tgl),

Is my reasoning above acceptable?

The current patch, which I am evaluating for commit, does away with
per-group memory contexts (it uses a common context for all groups), and
reduces the initial array allocation from 64 to 8 (but preserves
doubling behavior).

Regards,
Jeff Davis






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2014-12-30 Thread Jeff Davis
On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
 Speeding up the CRC calculation obviously won't help with the WAL volume 
 per se, ie. you still generate the same amount of WAL that needs to be 
 shipped in replication. But then again, if all you want to do is to 
 reduce the volume, you could just compress the whole WAL stream.

Was this point addressed? How much benefit is there to compressing the
data before it goes into the WAL stream versus after?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-28 Thread Jeff Davis
On Tue, 2014-12-23 at 01:16 -0800, Jeff Davis wrote:
 New patch attached (rebased, as well).
 
 I also see your other message about adding regression testing. I'm
 hesitant to slow down the tests for everyone to run through this code
 path though. Should I add regression tests, and then remove them later
 after we're more comfortable that it works?

Attached are some tests I ran. First, generate the data sets with
hashagg_test_data.sql. Then, do (I used work_mem at default of 4MB):

  set enable_hashagg=false;
  \o /tmp/sort.out
  \i /tmp/hashagg_test.sql
  \o
  set enable_hashagg=true;
  \o /tmp/hash.out
  \i /tmp/hashagg_test.sql

and then diff'd the output to make sure the results are the same (except
the plans, of course). The script loads the results into a temp table,
then sorts it before outputting, to make the test order-independent. I
didn't just add an ORDER BY, because that would change the plan and it
would never use hashagg.
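
In other words, each query in the script is effectively run like this (a
simplified sketch, not the script itself; "even" is one of the data sets
described below):

  CREATE TEMP TABLE r AS SELECT a, count(i) AS n FROM even GROUP BY a;  -- plan under test, unaffected by ordering
  SELECT * FROM r ORDER BY a, n;  -- sorted afterward, so the diff is order-independent
  DROP TABLE r;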

I think that has fairly good coverage of the hashagg code. I used 3
different input data sets, byval and byref types (for group key and
args), and a group aggregate query as well as DISTINCT. Let me know if I
missed something.

I also did some performance comparisons between disk-based sort+group
and disk-based hashagg. The results are quite favorable for hashagg
given the data sets I provided. Simply create the data using
hashagg_test_data.sql (if not already done), set the work_mem to the
value you want to test, and run hashagg_test_perf.sql. It uses EXPLAIN
ANALYZE for the timings.

singleton: 10M groups of 1
even: 1M groups of 10
skew: wildly different group sizes; see data script

q1: group aggregate query
q2: distinct query
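
Roughly, the data sets and queries have the following shape (my
paraphrase; the exact definitions are in hashagg_test_data.sql and
hashagg_test_perf.sql, and skew is only in the script):

  CREATE TABLE singleton AS  -- 10M groups of 1
    SELECT i AS a, i FROM generate_series(1, 10000000) s(i);
  CREATE TABLE even AS       -- 1M groups of 10
    SELECT i % 1000000 AS a, i FROM generate_series(1, 10000000) s(i);

  EXPLAIN ANALYZE SELECT a, count(i) FROM even GROUP BY a;  -- q1: group aggregate
  EXPLAIN ANALYZE SELECT DISTINCT a FROM even;              -- q2: distinct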

The total memory requirements for the test to run without going to disk
ranges from about 100MB (for even) to about 1GB (for singleton).
Regardless of work_mem, these all fit in memory on my machine, so they
aren't *really* going to disk. Also note that, because of how the memory
blocks are allocated, and because hashagg waits until memory is exceeded,
hashagg might use about double work_mem when work_mem is small (the
effect is not important at higher values).

work_mem='1MB':
  sort+group (s)hashagg (s)
   singleton q1   12 10
   singleton q28  7
   even q114  7
   even q210  5
   skew q122  6
   skew q216  4

work_mem='4MB':
  sort+group (s)hashagg (s)
   singleton q1   12 11
   singleton q28  6
   even q112  7
   even q2 9  5
   skew q119  6
   skew q213  3

work_mem='16MB':
  sort+group (s)hashagg (s)
   singleton q1   12 11
   singleton q28  7
   even q114  7
   even q210  5
   skew q115  6
   skew q212  4

work_mem='64MB':
  sort+group (s)hashagg (s)
   singleton q1   13 12
   singleton q29  8
   even q114  8
   even q210  5
   skew q117  6
   skew q213  4

work_mem='256MB':
  sort+group (s)hashagg (s)
   singleton q1   12 12
   singleton q29  8
   even q114  7
   even q211  4
   skew q116  6
   skew q213  4

work_mem='512MB':
  sort+group (s)hashagg (s)
   singleton q1   12 12
   singleton q29  7
   even q114  7
   even q210  4
   skew q116  6
   skew q212  4

work_mem='2GB':
  sort+group (s)hashagg (s)
   singleton q19 12
   singleton q26  6
   even q1 8  7
   even q2 6  4
   skew q1 7  6
   skew q2 5  4


These numbers are great news for disk-based hashagg. It seems to be the
same or better than sort+group in nearly all cases (again, this example
doesn't actually go to disk, so those numbers may come out differently).
Also, the numbers are remarkably stable for varying work_mem for both
plans. That means that it doesn't cost much to keep a lower work_mem as
long as your system has plenty of memory.

Do others have similar numbers? I'm quite surprised at how little
work_mem seems to matter for these plans (HashJoin might

Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-12-28 Thread Jeff Davis
On Thu, 2014-12-11 at 02:46 -0800, Jeff Davis wrote:
 On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote:
  This patch is requires the Memory Accounting patch, or something similar
  to track memory usage.
  
  The attached patch enables hashagg to spill to disk, which means that
  hashagg will contain itself to work_mem even if the planner makes a
  bad misestimate of the cardinality.
 
 New patch attached. All open items are complete, though the patch may
 have a few rough edges.
 

This thread got moved over here:

http://www.postgresql.org/message-id/1419326161.24895.13.camel@jeff-desktop

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-28 Thread Jeff Davis
On Sun, 2014-12-28 at 12:37 -0800, Jeff Davis wrote:
 I feel like I made a mistake -- can someone please do a
 sanity check on my numbers?

I forgot to randomize the inputs, which doesn't matter much for hashagg
but does matter for sort. New data script attached. The results are even
*better* for disk-based hashagg than the previous numbers suggest. Here
are some new numbers:

work_mem='1MB':
  sort+group (s)hashagg (s)
   singleton q1   21 10
   singleton q2   12  8
   even q120  7
   even q213  5
   skew q122  6
   skew q216  4

work_mem='4MB':
  sort+group (s)hashagg (s)
   singleton q1   17 10
   singleton q2   11  6
   even q116  7
   even q211  5
   skew q119  6
   skew q213  4

work_mem='16MB':
  sort+group (s)hashagg (s)
   singleton q1   16 11
   singleton q2   11  7
   even q115  8
   even q212  6
   skew q115  6
   skew q212  4

work_mem='64MB':
  sort+group (s)hashagg (s)
   singleton q1   18 12
   singleton q2   13  8
   even q117 10
   even q213  6
   skew q117  6
   skew q214  4

work_mem='256MB':
  sort+group (s)hashagg (s)
   singleton q1   18 12
   singleton q2   14  7
   even q116  9
   even q214  5
   skew q118  6
   skew q213  4

work_mem='512MB':
  sort+group (s)hashagg (s)
   singleton q1   18 12
   singleton q2   14  7
   even q117  9
   even q214  5
   skew q117  6
   skew q213  4

work_mem='2GB':
  sort+group (s)hashagg (s)
   singleton q1   11 11
   singleton q27  6
   even q110  9
   even q2 7  5
   skew q1 7  6
   skew q2 4  4


Regards,
Jeff Davis



hashagg_test_data.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-12-28 Thread Jeff Davis
On Sun, 2014-12-21 at 13:00 -0500, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
  i.e. either destroy the whole context if possible, and just free the
  memory when using a shared memory context. But I'm afraid this would
  penalize the shared memory context, because that's intended for cases
  where all the build states coexist in parallel and then at some point
  are all converted into a result and thrown away. Adding pfree() calls is
  no improvement here, and just wastes cycles.
 
 FWIW, I quite dislike the terminology shared memory context, because
 it sounds too much like it means a context in shared memory.  I see
 that the patch itself doesn't use that phrase, which is good, but can
 we come up with some other phrase for talking about it?
 

Common memory context?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-12-28 Thread Jeff Davis
On Tue, 2014-12-16 at 00:27 +0100, Tomas Vondra wrote:
  plperl.c: In function 'array_to_datum_internal':
  plperl.c:1196: error: too few arguments to function 'accumArrayResult'
  plperl.c: In function 'plperl_array_to_datum':
  plperl.c:1223: error: too few arguments to function 'initArrayResult'
  
  Cheers,
 
 Thanks, attached is a version that fixes this.

Just jumping into this patch now. Do we think this is worth changing the
signature of functions in array.h, which might be used from a lot of
third-party code? We might want to provide new functions to avoid a
breaking change.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-12-28 Thread Jeff Davis
On Tue, 2014-04-01 at 13:08 -0400, Tom Lane wrote:
 I think a patch that stood a chance of getting committed would need to
 detect whether the aggregate was being called in simple or grouped
 contexts, and apply different behaviors in the two cases.

The simple context doesn't seem like a big problem even if we change
things as Tomas suggests:

IMNSHO these are the issues we really should fix - by lowering the
initial element count (64->4) and using a single memory context.

In the simple context, there's only one context regardless, so the only
cost I see is from reducing the initial allocation from 64 to some lower
number. But if we're doubling each time, it won't take long to get
there; and because it's the simple context, we only need to do it once.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-23 Thread Jeff Davis
It seems that these two patches are being reviewed together. Should I
just combine them into one? My understanding was that some wanted to
review the memory accounting patch separately.

On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
 That's the only conflict, and after fixing it it compiles OK. However, I
 got a segfault on the very first query I tried :-(

If lookup_hash_entry doesn't find the group, and there's not enough
memory to create it, then it returns NULL; but the caller wasn't
checking for NULL. My apologies for such a trivial mistake, I was doing
most of my testing using DISTINCT. My fix here was done quickly, so I'll
take a closer look later to make sure I didn't miss something else.

New patch attached (rebased, as well).

I also see your other message about adding regression testing. I'm
hesitant to slow down the tests for everyone to run through this code
path though. Should I add regression tests, and then remove them later
after we're more comfortable that it works?

Regards
Jeff Davis

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 3045,3050  include_dir 'conf.d'
--- 3045,3065 
/listitem
   /varlistentry
  
+  <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+   <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+   <indexterm>
+    <primary><varname>enable_hashagg_disk</> configuration parameter</primary>
+   </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Enables or disables the query planner's use of hashed aggregation plan
+     types when the planner expects the hash table size to exceed
+     <varname>work_mem</varname>. The default is <literal>on</>.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   varlistentry id=guc-enable-hashjoin xreflabel=enable_hashjoin
termvarnameenable_hashjoin/varname (typeboolean/type)
indexterm
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 86,91  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
--- 86,92 
  	 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
+ static void show_hashagg_info(AggState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  	ExplainState *es);
  static void show_instrumentation_count(const char *qlabel, int which,
***
*** 1422,1427  ExplainNode(PlanState *planstate, List *ancestors,
--- 1423,1429 
  		case T_Agg:
  			show_agg_keys((AggState *) planstate, ancestors, es);
  			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ 			show_hashagg_info((AggState *) planstate, es);
  			if (plan->qual)
  				show_instrumentation_count("Rows Removed by Filter", 1,
  		   planstate, es);
***
*** 1912,1917  show_sort_info(SortState *sortstate, ExplainState *es)
--- 1914,1955 
  }
  
  /*
+  * Show information on hash aggregate buckets and batches
+  */
+ static void
+ show_hashagg_info(AggState *aggstate, ExplainState *es)
+ {
+ 	Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ 
+ 	Assert(IsA(aggstate, AggState));
+ 
+ 	if (agg->aggstrategy != AGG_HASHED)
+ 		return;
+ 
+ 	if (!aggstate->hash_init_state)
+ 	{
+ 		long	memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+ 		long	diskKb	  = (aggstate->hash_disk + 1023) / 1024;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(
+ 				es->str,
+ 				"Batches: %d  Memory Usage: %ldkB  Disk Usage:%ldkB\n",
+ 				aggstate->hash_num_batches, memPeakKb, diskKb);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong("HashAgg Batches",
+ 								aggstate->hash_num_batches, es);
+ 			ExplainPropertyLong("Peak Memory Usage", memPeakKb, es);
+ 			ExplainPropertyLong("Disk Usage", diskKb, es);
+ 		}
+ 	}
+ }
+ 
+ /*
   * Show information on hash buckets/batches.
   */
  static void
*** a/src/backend/executor/execGrouping.c
--- b/src/backend/executor/execGrouping.c
***
*** 310,316  BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
  	hash_ctl.hcxt = tablecxt;
  	hashtable->hashtab = hash_create("TupleHashTable", nbuckets,
  	 &hash_ctl,
! 	HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
  
  	return hashtable;
  }
--- 310,317 
  	hash_ctl.hcxt = tablecxt;
  	hashtable->hashtab = hash_create("TupleHashTable", nbuckets,
  	 &hash_ctl,
! 	 HASH_ELEM | HASH_FUNCTION | HASH_COMPARE |
! 	 HASH_CONTEXT | HASH_NOCHILDCXT);
  
  	return hashtable;
  }
***
*** 331,336  TupleHashEntry
--- 332,386 
  LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
  	 bool *isnew)
  {
+ 	uint32 hashvalue;
+ 
+ 	hashvalue = TupleHashEntryHash(hashtable, slot);
+ 	return LookupTupleHashEntryHash(hashtable, slot

Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-12-11 Thread Jeff Davis
On Sun, 2014-08-10 at 14:26 -0700, Jeff Davis wrote:
 This patch is requires the Memory Accounting patch, or something similar
 to track memory usage.
 
 The attached patch enables hashagg to spill to disk, which means that
 hashagg will contain itself to work_mem even if the planner makes a
 bad misestimate of the cardinality.

New patch attached. All open items are complete, though the patch may
have a few rough edges.

Summary of changes:

 * rebased on top of latest memory accounting patch
http://www.postgresql.org/message-id/1417497257.5584.5.camel@jeff-desktop
 * added a flag to hash_create to prevent it from creating an extra
level of memory context
   - without this, the memory accounting would have a measurable impact
on performance
 * cost model for the disk usage
 * intelligently choose the number of partitions for each pass of the
data
 * explain support
 * in build_hash_table(), be more intelligent about the value of
nbuckets to pass to BuildTupleHashTable()
   - BuildTupleHashTable tries to choose a value to keep the table in
work_mem, but it isn't very accurate.
 * some very rudimentary testing (sanity checks, really) shows good
results
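
For anyone trying the patch, usage looks roughly like this
(enable_hashagg_disk and the extra EXPLAIN fields are added by this
patch and do not exist otherwise; big_table is just a placeholder):

  SET work_mem = '4MB';
  SET enable_hashagg_disk = on;  -- the default for the new GUC
  EXPLAIN ANALYZE SELECT a, count(*) FROM big_table GROUP BY a;
  -- for AGG_HASHED plans the patched EXPLAIN adds a line of the form:
  --   Batches: N  Memory Usage: NkB  Disk Usage: NkB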

Summary of previous discussion (my summary; I may have missed some
points):

Tom Lane requested that the patch also handle the case where transition
values grow (e.g. array_agg) beyond work_mem. I feel this patch provides
a lot of benefit as it is, and trying to handle that case would be a lot
more work (we need a way to write the transition values out to disk at a
minimum, and perhaps combine them with other transition values). I also
don't think my patch would interfere with a fix there in the future.

Tomas Vondra suggested an alternative design that more closely resembles
HashJoin: instead of filling up the hash table and then spilling any new
groups, the idea would be to split the current data into two partitions,
keep one in the hash table, and spill the other (see
ExecHashIncreaseNumBatches()). This has the advantage that it's very
fast to identify whether the tuple is part of the in-memory batch or
not; and we can avoid even looking in the memory hashtable if not.

The batch-splitting approach has a major downside, however: you are
likely to evict a skew value from the in-memory batch, which will result
in all subsequent tuples with that skew value going to disk. My approach
never evicts from the in-memory table until we actually finalize the
groups, so the skew values are likely to be completely processed in the
first pass.

So, the attached patch implements my original approach, which I still
feel is the best solution.

Regards,
Jeff Davis

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 3017,3022  include_dir 'conf.d'
--- 3017,3037 
/listitem
   /varlistentry
  
+  <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+   <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+   <indexterm>
+    <primary><varname>enable_hashagg_disk</> configuration parameter</primary>
+   </indexterm>
+   </term>
+   <listitem>
+    <para>
+     Enables or disables the query planner's use of hashed aggregation plan
+     types when the planner expects the hash table size to exceed
+     <varname>work_mem</varname>. The default is <literal>on</>.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   varlistentry id=guc-enable-hashjoin xreflabel=enable_hashjoin
termvarnameenable_hashjoin/varname (typeboolean/type)
indexterm
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 86,91  static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
--- 86,92 
  	 List *ancestors, ExplainState *es);
  static void show_sort_info(SortState *sortstate, ExplainState *es);
  static void show_hash_info(HashState *hashstate, ExplainState *es);
+ static void show_hashagg_info(AggState *hashstate, ExplainState *es);
  static void show_tidbitmap_info(BitmapHeapScanState *planstate,
  	ExplainState *es);
  static void show_instrumentation_count(const char *qlabel, int which,
***
*** 1423,1428  ExplainNode(PlanState *planstate, List *ancestors,
--- 1424,1430 
  		case T_Agg:
  			show_agg_keys((AggState *) planstate, ancestors, es);
  			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+ 			show_hashagg_info((AggState *) planstate, es);
  			if (plan->qual)
  				show_instrumentation_count("Rows Removed by Filter", 1,
  		   planstate, es);
***
*** 1913,1918  show_sort_info(SortState *sortstate, ExplainState *es)
--- 1915,1956 
  }
  
  /*
+  * Show information on hash aggregate buckets and batches
+  */
+ static void
+ show_hashagg_info(AggState *aggstate, ExplainState *es)
+ {
+ 	Agg *agg = (Agg *)aggstate->ss.ps.plan;
+ 
+ 	Assert(IsA(aggstate, AggState));
+ 
+ 	if (agg->aggstrategy != AGG_HASHED)
+ 		return;
+ 
+ 	if (!aggstate

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-01 Thread Jeff Davis
On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
 On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis pg...@j-davis.com wrote:
  I can also just move isReset there, and keep mem_allocated as a uint64.
  That way, if I find later that I want to track the aggregated value for
  the child contexts as well, I can split it into two uint32s. I'll hold
  off any any such optimizations until I see some numbers from HashAgg
  though.
 
 I took a quick look at memory-accounting-v8.patch.
 
 Is there some reason why mem_allocated is a uint64? All other things
 being equal, I'd follow the example of tuplesort.c's
 MemoryContextAllocHuge() API, which (following bugfix commit
 79e0f87a1) uses int64 variables to track available memory and so on.

No reason. New version attached; that's the only change.

Regards,
Jeff Davis


*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 &AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,454 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   &AllocSetMethods,
!   parent,
!   name);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context->initBlockSize = initBlockSize;
! 	context->maxBlockSize = maxBlockSize;
! 	context->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 462,470 
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set->initBlockSize = initBlockSize;
! 	set->maxBlockSize = maxBlockSize;
! 	set->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	context->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		context->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
--- 480,489 
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	set->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (set->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		set->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
***
*** 500,519  AllocSetContextCreate(MemoryContext parent,
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 		block->aset = context;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = context->blocks;
! 		context->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		context->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return (MemoryContext) context;
  }
  
  /*
--- 503,525 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 
! 		context->mem_allocated += blksize;
! 
! 		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = set->blocks;
! 		set->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		set->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return context;
  }
  
  /*
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 596,603 
  		else
  		{
  			/* Normal case, release the block */
+ 			context->mem_allocated

Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-20 Thread Jeff Davis
On Wed, 2014-11-19 at 11:03 -0500, Robert Haas wrote:
 But your proposal does not solve this problem:
 
 1. Backend A-1 acquires AccessExclusiveLock on relation R.
 2. Backend A-2 waits for AccessShareLock on relation R.
 
 The good news is that the deadlock detector should realize that since
 A-1 and A-2 are in the same group, this is a deadlock.  And it can
 abort either A-1 or A-2, which will eventually abort them both.

Right. It can even give a nice error hint to tell the user how to avoid
the problem.

   The
 bad news, to borrow a phrase from Peter Geoghegan, is that it's an
 unprincipled deadlock; a user confronted with the news that her
 parallel scan has self-deadlocked will be justifiably dismayed.

You seem to be raising this as a show-stopping problem, and I'm not
convinced that it is.

(a) There are no parallel operators yet, so this is speculative.
(b) Out of the parallel operators you imagine (e.g. Scan), we don't
expect anything other than AccessShare locks in the normal case.
(c) The failure mode is not so bad; it's just an error and the user can
probably work around it.

You could argue that we know we're headed for this problem, and
therefore we should solve it now. I disagree. You are assuming that
sharing exclusive heavyweight locks among a group will be a fundamental
part of everything postgres does with parallelism; but not every design
requires it.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-20 Thread Jeff Davis
On Thu, 2014-11-20 at 11:22 -0500, Robert Haas wrote:
 2. Propagate pre-existing locks from the user backend to all the workers.
 
 I initially proposed #1, but now I think #2 solves more of the
 problems for less code.

OK. The primary concern with that is unintended consequences. But it's
reasonable for you to ask for something more concrete. I will think on
this more.

A few things I'm thinking about now:

 * What do you mean by pre-existing? Locks existing prior to what
event? (I don't think that's exactly what you meant.)
 * What's the conceptual difference between granting locks that would
otherwise conflict with another process in the group (which is what this
proposal is about) and having exactly the same set of locks? Is there
any?
 * Let's say you have processes A1 and A2 in one group, and B. A1 and B
both have an AccessShare lock, and A2 tries to acquire an exclusive
lock. B is waiting on A2. That's still a deadlock, right?

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-11-17 Thread Jeff Davis
On Mon, 2014-11-17 at 18:04 +0100, Andres Freund wrote:
 That's quite possibly one culprit for the slowdown. Right now one
 AllocSetContext struct fits precisely into three cachelines. After your
 change it doesn't anymore.

Thank you! I made the same mistake as Tomas: I saw that
MemoryContextData went to 64 bytes and thought that should be fine
while ignoring AllocSetContext.

I did some tests in the past between a version of my patch that made
MemoryContextData 72 bytes and one that made it 64 bytes, but I may have
neglected to test against the original 56 byte version.

I'm still not able to test to see if this is the real problem, but it
seems best to eliminate it anyway.

 Consider not counting memory in bytes but blocks and adding it directly
 after the NodeTag. That'd leave the size the same as right now.

I can also just move isReset there, and keep mem_allocated as a uint64.
That way, if I find later that I want to track the aggregated value for
the child contexts as well, I can split it into two uint32s. I'll hold
off on any such optimizations until I see some numbers from HashAgg
though.

Attached new version.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 &AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,454 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   &AllocSetMethods,
!   parent,
!   name);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context->initBlockSize = initBlockSize;
! 	context->maxBlockSize = maxBlockSize;
! 	context->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 462,470 
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set->initBlockSize = initBlockSize;
! 	set->maxBlockSize = maxBlockSize;
! 	set->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	context->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		context->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
--- 480,489 
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	set->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (set->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		set->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
***
*** 500,519  AllocSetContextCreate(MemoryContext parent,
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 		block->aset = context;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = context->blocks;
! 		context->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		context->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return (MemoryContext) context;
  }
  
  /*
--- 503,525 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 
! 		context->mem_allocated += blksize;
! 
! 		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = set->blocks;
! 		set->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		set->keeper = block;
  
  		/* Mark unallocated space

Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-13 Thread Jeff Davis
Great discussion.

Robert, I think you make a lot of very valid points, but philosophically
I am in pretty strong agreement with Andres here.

On Fri, 2014-10-31 at 14:29 -0400, Robert Haas wrote:
  So just refusing parallelism as soon as anything has taken an access
  exclusive lock doesn't sound too bad to me.
 
 That restriction seems onerous to me; for example, it would prevent a
 parallel sort for CLUSTER or a parallel index rebuild for VACUUM FULL.
 Those seems like worthy targets for parallelism, and maybe even early
 targets, since I expect a lot of the real complexity here to be in the
 query planner, and you can avoid most of that complexity when planning
 DDL.

If two backends both have an exclusive lock on the relation for a join
operation, that implies that they need to do their own synchronization,
because obviously the lock manager is not doing it for them.

In turn, that implies that we need parallel versions of these
operations; going to each one and parallelizing it. I don't think that's
a good long-term strategy... it changes our simple executor (tuples in,
tuples out) to something quite different (each node needs to know a lot
about whatever is underneath it in order to divide the data up).

 process and both requests would have been granted.  Parallelism is
 generally not about making different things happen; it's about
 spreading the stuff that would happen anyway across multiple backends,

This point would be much stronger for declarative queries. When the user
issues LOCK TABLE, they are getting their hands dirty with the
implementation to some degree. If they really want to serialize, they
can serialize around the lock of an object that they don't want to also
read (in parallel).

You're right that this would be surprising to users, but hopefully we
can find a way to disable parallelism in that case, or at least issue a
WARNING that the user is probably doing the wrong thing.

 cluster pg_collate... the only solution is to
 move the parallel workers ahead of the AccessExclusiveLock request,
 but the deadlock detector won't do that unless it knows about the
 relationship among the parallel workers.

As discussed later in this thread, I also consider failure a
solution ;-) So we don't need to move anything.

If we have to tell our users that exclusive-locking a catalog can cause
deadlocks for parallel query, I think they will find that reasonable.
The same admin could turn parallelism down to one for non-superusers
before they start doing that, and restore it later.


 patch that does parallel anything has got to worry about what
 undetected deadlocks might result, and then argue about which of them
 are obscure enough that we shouldn't care and which are not.  That
 doesn't sound like a fun way to spend my time.

We're in agreement that deadlocks can't go undetected.

 But, really, the first case is the one I'm more worried about: a bunch
 of parallel backends all grab the same locks, but slightly separated
 in time.  A strong locker joins the queue midway through and we're
 hosed.

I agree the situation sounds bad in the abstract, but none of the
examples seem very compelling. Who locks random catalogs exclusively?
And if they do, why is it the lock manager's job to try to reconcile the
resulting mess without canceling someone? (OK, don't answer that. It
obviously is the lock manager's job to some degree, but we can only
demand so much.)

 Even if you think that all of these particular issues are somehow
 avoidable or that they don't matter, I think it would be unwise to bet
 on them being the only issues.  The lock manager looks like it's old,
 ill-maintained code, but that's in no small part because it solved the
 problem it aims at so thoroughly that few people have felt the need to
 modify it very much in the last decade.  I think that's part of why
 there's so much skepticism on this thread - we don't think about the
 problems that the lock manager solves as being important problems not
 because they truly aren't important, but because they are so
 thoroughly solved.  Then every once in a while we complain about the
 performance.

Guilty as charged.

I won't forever oppose changes to the lock manager, but I resist making
them a precondition of parallelism (aside from making deadlock detection
work, of course). Users taking out exclusive locks recklessly can result
in deadlocks today, and will result in more when we do parallelism
(though perhaps in more surprising situations).

I remember when doing Exclusion Constraints, I went to a lot of effort
to make deadlocks impossible, and was quite proud. When Tom saw it, he
told me not to bother, and to do it the simple way instead, because
deadlocks can happen even with UNIQUE constraints (which I didn't even
know).

We should use the same strategy here. When we see deadlocks becoming a
problem for any reasonable workload, we make a series of tweaks (perhaps
some to the lock manager itself) to reduce them.

Regards,
Jeff

Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-13 Thread Jeff Davis
On Thu, Nov 13, 2014 at 11:26 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Nov 13, 2014 at 3:38 AM, Jeff Davis pg...@j-davis.com wrote:
  If two backends both have an exclusive lock on the relation for a join
  operation, that implies that they need to do their own synchronization,
  because obviously the lock manager is not doing it for them.

 This doesn't make sense to me.  Why would they need to synchronize
 access to a relation in order to join it?


Unfortunate typo: that was supposed to be "joint operation", just meaning
that they are working together for something (e.g. CLUSTER, VACUUM FULL as
you suggest). Sorry for the confusion.

I meant that the backends need to divide up the work somehow. And if each
operator needs to divide up the work before operating, that means we need
to change every operator.

Regards,
 Jeff Davis


Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-12 Thread Jeff Davis
On Wed, 2014-11-12 at 14:16 -0500, Robert Haas wrote:
 Detected deadlocks are fine.  Improving the deadlock detector is the
 heart of what needs to be done here.

OK, great.

 As you say, the lock requests
 we're talking about will rarely wait, so deadlocks won't be frequent.
 The issue is making sure that, if they do happen, we get a better
 behavior than "your parallel query hangs forever; good luck figuring
 out why".

Right. We can still use this patch's notion of a lock group in the
deadlock detector, but we don't need it to actually affect the way a
lock is granted. That should eliminate concerns about subtle bugs.

Later, after we understand how this is actually used, and if we see
deadlock problems, we can look for ways to solve/mitigate them.

This seems to be what Andres was saying, here:

http://www.postgresql.org/message-id/20141031130727.gf13...@awork2.anarazel.de

So I'll follow up in that thread, because it's an interesting
discussion.

 More generally, I think there's some misunderstanding about the
 overall goal of the parallelism infrastructure that I'm trying to
 create.  ...  But my goal is in some ways
 the opposite: I'm trying to make it possible to run as much existing
 PostgreSQL backend code as possible inside a parallel worker without
 any modification.

Thank you for clarifying, I think this is a good approach.

Back to the patch:

If I understand correctly, the _primary_ goal of this patch is to make
it safe to take out heavyweight locks in worker processes, even if the
deadlock involves LWLocks/latches synchronizing among the processes
within a lock group.

For example, say processes A1 and A2 are in the same lock group, and B
is in a different lock group. A1 is holding heavyweight lock H1 and
waiting on a LW lock L1; A2 is holding L1 and waiting on heavyweight
lock H2; and B is holding H2 and waiting on H1.

The current deadlock detector would see a dependency graph like:

   A2 -> B -> A1

But with lock groups, it would see:

   (A1 A2) -> B -> (A1 A2)

which is a cycle, and can be detected regardless of the synchronization
method used between A1 and A2. There are some details to work out to
avoid false positives, of course.

Is that about right?
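
(Purely to illustrate what I mean, here is a toy, self-contained sketch of
cycle detection over a group-collapsed waits-for graph. It is hypothetical
code, not the real deadlock detector, but it shows the shape of the check:)

/* Toy sketch: collapse lock-group members into one node, then DFS for a
 * cycle in the waits-for graph. */
#include <stdio.h>
#include <stdbool.h>

#define NPROCS  3
#define NGROUPS 2

/* waits_for[i][j]: process i waits for a heavyweight lock held by j */
static bool waits_for[NPROCS][NPROCS];
/* lock group of each process: A1, A2 in group 0; B in group 1 */
static int  group_of[NPROCS] = {0, 0, 1};

/* is there any waits-for edge from a member of g1 to a member of g2? */
static bool
group_edge(int g1, int g2)
{
    for (int i = 0; i < NPROCS; i++)
        for (int j = 0; j < NPROCS; j++)
            if (group_of[i] == g1 && group_of[j] == g2 && waits_for[i][j])
                return true;
    return false;
}

/* can we get from group g back to group 'target'? */
static bool
reaches(int g, int target, bool visited[NGROUPS])
{
    if (visited[g])
        return false;
    visited[g] = true;
    for (int h = 0; h < NGROUPS; h++)
        if (group_edge(g, h) && (h == target || reaches(h, target, visited)))
            return true;
    return false;
}

int
main(void)
{
    bool visited[NGROUPS] = {false, false};

    /* A2 (proc 1) waits on H2 held by B (proc 2);
     * B waits on H1 held by A1 (proc 0).
     * The A1-waits-on-A2 LWLock wait is internal to the group and
     * is not represented as an edge at all. */
    waits_for[1][2] = true;
    waits_for[2][0] = true;

    printf("deadlock: %s\n", reaches(0, 0, visited) ? "detected" : "none");
    return 0;
}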

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-11-11 Thread Jeff Davis
On Wed, 2014-10-15 at 00:03 -0400, Robert Haas wrote:
 For parallelism, I think we need a concept of group locking.  That is,
 suppose we have a user backend and N worker backends collaborating to
 execute some query.  For the sake of argument, let's say it's a
 parallel CLUSTER, requiring a full table lock.  We need all of the
 backends to be able to lock the table even though at least one of them
 holds AccessExclusiveLock.  This suggests that the backends should all
 be members of a locking group, and that locks within the same locking
 group should be regarded as mutually non-conflicting.

Trying to catch up on this thread, please excuse me if these questions
are already covered.

You mention the possibility of undetected deadlocks, which is surely
unacceptable. But why not improve the deadlock detector? How bad are
_detected_ deadlocks? A lot of the concern was around catalog accesses,
but those lock requests would rarely wait anyway.

I also wonder if group locking is generally the wrong approach to
parallelism. Parallel scan/sort should work by assigning workers to
chunks of data, and then merging the results. In principle the workers
don't need to touch the same data at all, so why are we trying so hard
to get them to all take the same locks?

The reason, I assume, is that a full table is the lockable object, but
we are imagining the segment files as the chunks. But maybe there's a
way to address this more directly with an extra field in the lock tag,
and perhaps some catalog knowledge?
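
For example, something vaguely like this (a purely hypothetical struct,
not the real LOCKTAG, just to show the shape of the idea):

/* Illustrative only: a lock tag with a chunk field, so workers could
 * lock disjoint chunks of the same relation without conflicting. */
#include <stdio.h>
#include <stdint.h>

typedef struct ToyChunkLockTag
{
    uint32_t dbid;      /* database */
    uint32_t relid;     /* relation */
    uint32_t chunkno;   /* chunk of the relation; UINT32_MAX = whole table */
} ToyChunkLockTag;

int
main(void)
{
    ToyChunkLockTag w1 = { 1, 16384, 0 };   /* worker 1 locks chunk 0 */
    ToyChunkLockTag w2 = { 1, 16384, 1 };   /* worker 2 locks chunk 1 */

    /* different chunknos, so the tags do not conflict */
    printf("conflict: %s\n",
           (w1.relid == w2.relid && w1.chunkno == w2.chunkno) ? "yes" : "no");
    return 0;
}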

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-10-15 Thread Jeff Davis
On Fri, 2014-08-29 at 10:12 -0400, Tom Lane wrote:
 What about removing the callback per se and just keeping the argument,
 as it were.  That is, a MemoryContext has a field of type size_t *
 that is normally NULL, but if set to non-NULL, then we increment the
 pointed-to value for pallocs and decrement for pfrees.

I like that idea; patch attached. Two differences:
  * At block level, not chunk level.
  * I use a uint64, because a size_t is only guaranteed to hold one
allocation, and we need to sum up many allocations.

When unused, it does still appear to have a little overhead in Robert's
test on his power machine. It seems to be between a 0.5-1.0% regression.
There's not much extra code on that path, just a branch, pointer
dereference, and an addition; so I don't think it will get much cheaper
than it is.

This regression seems harder to reproduce on my workstation, or perhaps
it's just noisier.

 One thought in either case is that we don't have to touch the API for
 MemoryContextCreate: rather, the feature can be enabled in a separate
 call after making the context.

That seems fairly awkward to me because the pointer needs to be
inherited from the parent context when not specified. I left the extra
API call in.

The inheritance is awkward anyway, though. If you create a tracked
context as a child of an already-tracked context, allocations in the
newer one won't count against the original. I don't see a way around
that without introducing even more performance problems.

If this patch is unacceptable, my only remaining idea is to add the
memory only for the current context with no inheritance (thereby
eliminating the branch, also). That will require HashAgg to traverse all
the child contexts to check whether the memory limit has been exceeded.
As Tomas pointed out, that could be a lot of work in the case of
array_agg with many groups.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 &AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,482 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	/* inherit 'track_mem' from parent context, if available */
! 	uint64 *track_mem = (parent == NULL) ? NULL : parent->track_mem;
! 
! 	return AllocSetContextCreateTracked(parent, name, minContextSize,
! 		initBlockSize, maxBlockSize,
! 		track_mem);
! }
! 
! /*
!  * AllocSetContextCreateTracked
!  *		Create a new AllocSet context, tracking total memory usage.
!  *
!  * parent: parent context, or NULL if top-level context
!  * name: name of context (for debugging --- string will be copied)
!  * minContextSize: minimum context size
!  * initBlockSize: initial allocation block size
!  * maxBlockSize: maximum allocation block size
!  * track_mem: caller-supplied variable to use for memory tracking
!  */
! MemoryContext
! AllocSetContextCreateTracked(MemoryContext parent,
! 			 const char *name,
! 			 Size minContextSize,
! 			 Size initBlockSize,
! 			 Size maxBlockSize,
! 			 uint64 *track_mem)
! {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   &AllocSetMethods,
!   parent,
!   name,
!   track_mem);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context->initBlockSize = initBlockSize;
! 	context->maxBlockSize = maxBlockSize;
! 	context->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 490,498 
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set->initBlockSize = initBlockSize;
! 	set->maxBlockSize = maxBlockSize;
! 	set->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	context->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context->allocChunkLimit + ALLOC_CHUNKHDRSZ

Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-29 Thread Jeff Davis
On Fri, 2014-08-22 at 12:34 -0400, Robert Haas wrote:
 Still doesn't look good here.  On the same PPC64 machine I've been
 using for testing:

I have a new approach to the patch which is to call a callback at each
block allocation and child contexts inherit the callback from their
parents. The callback could be defined to simply dereference and
increment its argument, which would mean block allocations anywhere in
the hierarchy are incrementing the same variable, avoiding the need to
traverse the hierarchy.
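
Roughly like this (a toy, standalone sketch with invented names, not the
actual patch):

/* Toy sketch of the callback idea: every context carries a callback and
 * argument inherited from its parent; block allocations invoke it. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef void (*alloc_callback) (void *arg, size_t blksize);

typedef struct ToyContext
{
    struct ToyContext *parent;
    alloc_callback     cb;      /* inherited from parent unless overridden */
    void              *cb_arg;
} ToyContext;

static void
add_to_total(void *arg, size_t blksize)
{
    *(uint64_t *) arg += blksize;       /* dereference and increment */
}

static ToyContext *
toy_context_create(ToyContext *parent)
{
    ToyContext *cxt = malloc(sizeof(ToyContext));

    cxt->parent = parent;
    cxt->cb = parent ? parent->cb : NULL;
    cxt->cb_arg = parent ? parent->cb_arg : NULL;
    return cxt;
}

static void
toy_alloc_block(ToyContext *cxt, size_t blksize)
{
    /* ... allocate the block itself ... */
    if (cxt->cb)
        cxt->cb(cxt->cb_arg, blksize);
}

int
main(void)
{
    uint64_t    total = 0;
    ToyContext *top = toy_context_create(NULL);

    top->cb = add_to_total;
    top->cb_arg = &total;

    ToyContext *child = toy_context_create(top);    /* inherits the callback */

    toy_alloc_block(top, 8192);
    toy_alloc_block(child, 16384);
    printf("total allocated: %llu\n", (unsigned long long) total);

    free(child);
    free(top);
    return 0;
}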

I've made some progress I think (thanks to both Robert and Tomas), but I
have a bit more microoptimization and testing to do. I'll mark it
returned with feedback for now, though if I find the time I'll do more
testing to see if the performance concerns are fully addressed.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Fri, 2014-06-13 at 13:24 +0300, Heikki Linnakangas wrote:
 Attached is a new patch. It now keeps the current pg_clog unchanged, but 
 adds a new pg_csnlog besides it. pg_csnlog is more similar to 
 pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup, and 
 segments older than GlobalXmin can be truncated.

It appears that this patch weakens the idea of hint bits. Even if
HEAP_XMIN_COMMITTED is set, it still needs to find out if it's in the
snapshot.

That's fast if the xid is less than snap->xmin, but otherwise it needs
to do a fetch from the csnlog, which is exactly what the hint bits are
designed to avoid. And we can't get around this, because the whole point
of this patch is to remove the xip array from the snapshot.

If the transaction was committed a long time ago, then we could set
PD_ALL_VISIBLE and the VM bit, and a scan wouldn't even look at the hint
bit. If it was committed recently, then it's probably greater than the
recentXmin. I think there's still room for a hint bit to technically be
useful, but it seems quite narrow unless I'm missing something (and a
narrowly-useful hint bit doesn't seem to be useful at all).

I'm not complaining, and I hope this is not a showstopper for this
patch, but I think it's worth discussing.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:
 My current thinking is that access to the csnlog will need to be made 
 faster. Currently, it's just another SLRU, but I'm sure we can do better 
 than that. Or add a backend-private cache on top of it; that might 
 already alleviate it enough..

Aren't those the same ideas that have been proposed for eliminating hint
bits?

If this patch makes taking snapshots much faster, then perhaps it's
enough incentive to follow through on one of those ideas. I am certainly
open to removing hint bits (which was probably clear from earlier
discussions) if there are significant wins elsewhere and the penalty is
not too great.

To continue the discussion on this patch, let's assume that we make
TransactionIdGetCommitLSN sufficiently fast. If that's the case, can't
we make HeapTupleSatisfiesMVCC look more like:

  LsnMin = TransactionIdGetCommitLSN(xmin);
  if (!COMMITLSN_IS_COMMITTED(LsnMin))
 LsnMin = BIG_LSN;

  LsnMax = TransactionIdGetCommitLSN(xmax);
  if (!COMMITLSN_IS_COMMITTED(LsnMax))
     LsnMax = BIG_LSN;

  return (snapshot->snapshotlsn >= LsnMin &&
          snapshot->snapshotlsn < LsnMax);

There would need to be some handling for locked tuples, or tuples
related to the current transaction, of course. But I still think it
would turn out simpler; perhaps by enough to save a few cycles.

Regards,
Jeff Davis






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 19:25 +0100, Greg Stark wrote:
 I don't immediately see how to make that practical. One thought would
 be to have a list of xids in the page header with their corresponding
 csn -- which starts to sound a lot like Oralce's Interested
 Transaction List. But I don't see how to make that work for the
 hundreds of possible xids on the page.

I feel like that's moving in the wrong direction. That's still causing a
lot of modifications to a data page when the data is not changing, and
that's bad for a lot of cases that I'm interested in (checksums are one
example).

We are mixing two kinds of data: user data and visibility information.
Each is changed under different circumstances and has different
characteristics, and I'm beginning to think they shouldn't be mixed at
all.

What if we just devised a structure specially designed to hold
visibility information, put all of the visibility information there, and
data pages would only change where there is a real, user-initiated
I/U/D. Vacuum could still clear out dead tuples from the data area, but
it would do the rest of its work on the visibility structure. It could
even be a clever structure that could compress away large static areas
until they become active again.

Maybe this wouldn't work for all tables, but could be an option for big
tables with low update rates.
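
Just to sketch the shape of the idea (entirely hypothetical; none of these
structures exist anywhere): a run-length encoding keyed by tuple number
could collapse a large static range to a single entry.

/* Hypothetical sketch: run-length-encoded visibility information.
 * A long-static range collapses to one entry. */
#include <stdio.h>
#include <stdint.h>

typedef struct VisRun
{
    uint64_t first_tuple;   /* first tuple number in the run */
    uint64_t ntuples;       /* length of the run */
    uint64_t commit_csn;    /* CSN at which the whole run became visible;
                             * 0 means "consult per-tuple info" */
} VisRun;

/* 1,000,000 static tuples followed by a small, recently modified tail */
static const VisRun runs[] = {
    { 0,       1000000, 42 },   /* frozen, all visible since CSN 42 */
    { 1000000, 128,     0  },   /* active area, needs per-tuple lookup */
};

static uint64_t
run_lookup(uint64_t tupno)
{
    for (size_t i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        if (tupno >= runs[i].first_tuple &&
            tupno < runs[i].first_tuple + runs[i].ntuples)
            return runs[i].commit_csn;
    return 0;
}

int
main(void)
{
    printf("tuple 12345   -> csn %llu\n", (unsigned long long) run_lookup(12345));
    printf("tuple 1000042 -> csn %llu (per-tuple lookup needed)\n",
           (unsigned long long) run_lookup(1000042));
    return 0;
}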

 The worst case for visibility resolution is you have a narrow table
 that has random access DDL happening all the time, each update is a
 short transaction and there are a very high rate of such transactions
 spread out uniformly over a very large table. That means any given
 page has over 200 rows with random xids spread over a very large range
 of xids.

That's not necessarily a bad case, unless the CLOG/CSNLOG lookup is a
significant fraction of the effort to update a tuple.

That would be a bad case if you introduce scans into the equation as
well, but that's not a problem if the all-visible bit is set.

 Currently the invariant hint bits give us is that each xid needs to be
 looked up in the clog only a more or less fixed number of times, in
 that scenario only once since the table is very large and the
 transactions short lived.

A backend-local cache might accomplish that, as well (would still need
to do a lookup, but no locks or contention). There would be some
challenges around invalidation (for xid wraparound) and pre-warming the
cache (so establishing a lot of connections doesn't cause a lot of CLOG
access).
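
For what it's worth, the kind of cache I have in mind could be as simple
as a small direct-mapped table (an illustrative sketch only, with a
stand-in for the real CLOG lookup):

/* Illustrative only: a tiny direct-mapped backend-local cache of
 * xid -> committed/aborted, consulted before the shared CLOG/CSNLOG. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define XIDCACHE_SIZE 1024      /* must be a power of two */

typedef struct XidCacheEntry
{
    uint32_t xid;        /* 0 = empty slot (xid 0 never appears) */
    bool     committed;
} XidCacheEntry;

static XidCacheEntry xid_cache[XIDCACHE_SIZE];

/* stand-in for the real (shared, contended) CLOG lookup */
static bool clog_lookup(uint32_t xid) { return (xid % 2) == 0; }

static bool
xid_is_committed(uint32_t xid)
{
    XidCacheEntry *e = &xid_cache[xid & (XIDCACHE_SIZE - 1)];

    if (e->xid != xid)
    {
        /* miss: do the expensive lookup once, then remember the answer */
        e->xid = xid;
        e->committed = clog_lookup(xid);
    }
    return e->committed;
}

int
main(void)
{
    printf("xid 100: %s\n", xid_is_committed(100) ? "committed" : "aborted");
    printf("xid 100 again (cached): %s\n",
           xid_is_committed(100) ? "committed" : "aborted");
    return 0;
}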

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 12:39 +0300, Heikki Linnakangas wrote:
 I think this is enough for this commitfest - we have consensus on the 
 design. For the next one, please address those open items, and resubmit.

Agreed, return with feedback.

I need to get the accounting patch in first, which needs to address some
performance issues, but there's a chance of wrapping those up quickly.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 21:00 +0100, Greg Stark wrote:
 Well fundamentally the reason the visibility information is with the
 user data is so that we don't need to do two i/os to access the data.
 The whole point of hint bits is to guarantee that after some amount of
 time you can read data directly out of the heap page and use it
 without doing any additional I/O.

If the data is that static, then the visibility information would be
highly compressible, and surely in shared_buffers already.

(Yes, it would need to be pinned, which has a cost.)

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-26 Thread Jeff Davis
On Mon, 2014-08-25 at 17:41 +0300, Heikki Linnakangas wrote:
 Actually, that gets optimized to a constant in the planner:

Oops, thank you (and Tom).

 your patch seems to be about 2x-3x as slow as unpatched master. So this 
 needs some optimization. A couple of ideas:

I didn't see anywhere near that kind of regression. On unpatched master,
with your test case, I saw it stabilize to about 680ms. With
similar-escape-1, I saw about 775ms (15% regression). Are those at all
close to your numbers? Is there a chance you used an unoptimized build
for one of them, or left asserts enabled?

 1. If the escape string is in fact a single-byte character, you can 
 proceed with the loop just as it is today, without the pg_mblen calls.
 
 2. Since pg_mblen() will always return an integer between 1-6, it would 
 probably be faster to replace the memcpy() and memcmp() calls with 
 simple for-loops iterating byte-by-byte.
 
 In very brief testing, with the 1. change above, the performance with 
 this patch is back to what it's without the patch. See attached.

The particular patch has a mistake: the first branch is always taken
because pg_mblen() won't return 0. It's also fairly ugly to set mblen in
the test for the branch that doesn't use it.

Attached a patch implementing the same idea though: only use the
multibyte path if *both* the escape char and the current character from
the pattern are multibyte.

I also changed the comment to more clearly state the behavior upon which
we're relying. I hope what I said is accurate.

Regards,
Jeff Davis

*** a/src/backend/utils/adt/regexp.c
--- b/src/backend/utils/adt/regexp.c
***
*** 688,698  similar_escape(PG_FUNCTION_ARGS)
  		elen = VARSIZE_ANY_EXHDR(esc_text);
  		if (elen == 0)
  			e = NULL;			/* no escape character */
! 		else if (elen != 1)
! 			ereport(ERROR,
! 	(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! 	 errmsg("invalid escape string"),
!   errhint("Escape string must be empty or one character.")));
  	}
  
  	/*--
--- 688,703 
  		elen = VARSIZE_ANY_EXHDR(esc_text);
  		if (elen == 0)
  			e = NULL;			/* no escape character */
! 		else
! 		{
! 			int			escape_mblen = pg_mbstrlen_with_len(e, elen);
! 
! 			if (escape_mblen > 1)
! ereport(ERROR,
! 		(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
! 		 errmsg("invalid escape string"),
! 		 errhint("Escape string must be empty or one character.")));
! 		}
  	}
  
  	/*--
***
*** 724,729  similar_escape(PG_FUNCTION_ARGS)
--- 729,782 
  	{
  		char		pchar = *p;
  
+ 		/*
+ 		 * If both the escape character and the current character from the
+ 		 * pattern are multi-byte, we need to take the slow path.
+ 		 *
+ 		 * But if one of them is single-byte, we can process the pattern
+ 		 * one byte at a time, ignoring multi-byte characters.  (This works
+ 		 * because all server-encodings have the property that a valid
+ 		 * multi-byte character representation cannot contain the
+ 		 * representation of a valid single-byte character.)
+ 		 */
+ 
+ 		if (elen > 1)
+ 		{
+ 			int mblen = pg_mblen(p);
+ 			if (mblen > 1)
+ 			{
+ /* slow, multi-byte path */
+ if (afterescape)
+ {
+ 	*r++ = '\\';
+ 	memcpy(r, p, mblen);
+ 	r += mblen;
+ 	afterescape = false;
+ }
+ else if (e && elen == mblen && memcmp(e, p, mblen) == 0)
+ {
+ 	/* SQL99 escape character; do not send to output */
+ 	afterescape = true;
+ }
+ else
+ {
+ 	/*
+ 	 * We know it's a multi-byte character, so we don't need
+ 	 * to do all the comparisons to single-byte characters
+ 	 * that we do below.
+ 	 */
+ 	memcpy(r, p, mblen);
+ 	r += mblen;
+ }
+ 
+ p += mblen;
+ plen -= mblen;
+ 
+ continue;
+ 			}
+ 		}
+ 
+ 		/* fast path */
  		if (afterescape)
  		{
  			if (pchar == '"' && !incharclass)	/* for SUBSTRING patterns */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 13:45 +0300, Heikki Linnakangas wrote:
 Yeah. This patch in the current state is likely much much slower than 
 unpatched master, except in extreme cases where you have thousands of 
 connections and short transactions so that without the patch, you spend 
 most of the time acquiring snapshots.

What else are you looking to accomplish with this patch during this
'fest? Bug finding? Design review? Performance testing?

I haven't had a good track record with my performance testing recently,
so I'm unlikely to be much help there.

It seems a bit early for bug hunting, unless you think it will raise
possible design issues.

I think there's already at least one design issue to consider, which is
whether we care about CLOG/CSNLOG access for hinted records where the
xid  snapshot-xmin (that is, accesses that previously would have
looked at xip). Would more discussion help here or do we need to wait
for performance numbers?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

2014-08-26 Thread Jeff Davis
On Tue, 2014-08-26 at 22:13 -0700, Jeff Davis wrote:
 Attached a patch implementing the same idea though: only use the
 multibyte path if *both* the escape char and the current character from
 the pattern are multibyte.

Forgot to mention: with this patch, the test completes in about 720ms,
so just between unpatched and the v1 patch.

I feel like we're spending too much time optimizing this patch, but we
got this far, and it doesn't really complicate the code, so we might as
well finish it.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-21 Thread Jeff Davis
On Wed, 2014-08-20 at 14:32 -0400, Robert Haas wrote:
 Well, I think you're certainly right that a hash table lookup is more
 expensive than modulo on a 32-bit integer; so much is obvious.  But if
 the load factor is not too large, I think that it's not a *lot* more
 expensive, so it could be worth it if it gives us other advantages.
 As I see it, the advantage of Jeff's approach is that it doesn't
 really matter whether our estimates are accurate or not.  We don't
 have to decide at the beginning how many batches to do, and then
 possibly end up using too much or too little memory per batch if we're
 wrong; we can let the amount of memory actually used during execution
 determine the number of batches.  That seems good.  Of course, a hash
 join can increase the number of batches on the fly, but only by
 doubling it, so you might go from 4 batches to 8 when 5 would really
 have been enough.  And a hash join also can't *reduce* the number of
 batches on the fly, which might matter a lot.  Getting the number of
 batches right avoids I/O, which is a lot more expensive than CPU.

My approach uses partition counts that are powers-of-two also, so I
don't think that's a big differentiator. In principle my algorithm could
adapt to other partition counts, but I'm not sure how big of an
advantage there is.

I think the big benefit of my approach is that it doesn't needlessly
evict groups from the hashtable. Consider input like 0, 1, 0, 2, ..., 0,
N. For large N, if you evict group 0, you have to write out about N
tuples; but if you leave it in, you only have to write out about N/2
tuples. The hashjoin approach doesn't give you any control over
eviction, so you only have about 1/P chance of saving the skew group
(where P is the ultimate number of partitions). With my approach, we'd
always keep the skew group in memory (unless we're very unlucky, and the
hash table fills up before we even see the skew value).

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-20 Thread Jeff Davis
On Tue, 2014-08-19 at 12:54 +0200, Tomas Vondra wrote:
 The use-case for this is tracking a chosen subtree of contexts - e.g.
 aggcontext and below, so I'd expect the tracked subtrees to be relatively
 shallow. Am I right?

Right.

 My fear is that by removing the inheritance bit, we'll hurt cases with a
 lot of child contexts. For example, array_agg currently creates a separate
 context for each group - what happens if you have 100k groups and do
 MemoryContextGetAllocated? I guess iterating over 100k groups is not free.

Good point. We don't want to make checking the allocated space into an
expensive operation, because then we will be forced to call it less
frequently, which implies that we'd be sloppier about actually fitting
in work_mem.

 Wouldn't the solution with inheritance and propagating the accounting info
 to the parent actually better? Or maybe allowing both, having two flags
 when creating a context instead of one?

That seems overly complicated. We only have one driving use case, so two
orthogonal options sounds excessive. Perhaps one option if we can't
solve the performance problem and we need to isolate the changes to
hashagg.

 Also, do we really need to track allocated bytes - couldn't we track
 kilobytes or something and use smaller data types to get below the 64B?

Good idea.

I attached a patch that uses two uint32 fields so that it doesn't
increase the size of MemoryContextData, and it tracks memory usage for
all contexts. I was unable to detect any performance regression vs.
master, but on my machine the results are noisy.

It would be easier to resolve the performance concern if I could
reliably get the results Robert is getting. I think I was able to
reproduce the regression with the old patch, but the results were still
noisy.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_allocation(MemoryContext context, size_t size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 250,256  static void AllocSetFree(MemoryContext context, void *pointer);
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
  static void AllocSetStats(MemoryContext context, int level);
--- 252,258 
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context, MemoryContext parent);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
  static void AllocSetStats(MemoryContext context, int level);
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 502,510 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
+ 
+ 		update_allocation((MemoryContext) context, blksize);
+ 
  		block->aset = context;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 595,601 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_allocation(context, -(block->endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 611,617  AllocSetReset(MemoryContext context)
   * But note we are not responsible for deleting the context node itself.
   */
  static void
! AllocSetDelete(MemoryContext context)
  {
  	AllocSet	set = (AllocSet) context;
  	AllocBlock	block = set->blocks;
--- 617,623 
   * But note we are not responsible for deleting the context node itself.
   */
  static void
! AllocSetDelete(MemoryContext context, MemoryContext parent)
  {
  	AllocSet	set = (AllocSet) context;
  	AllocBlock	block = set->blocks;
***
*** 623,628  AllocSetDelete(MemoryContext context)
--- 629,644 
  	AllocSetCheck(context);
  #endif
  
+ 	/*
+ 	 * Parent is already unlinked from context, so can't use
+ 	 * update_allocation().
+ 	 */
+ 	while (parent != NULL)
+ 	{
+ 		parent->total_allocated -= context->total_allocated;
+ 		parent = parent->parent;
+ 	}
+ 
  	/* Make it look empty, just in case... */
  	MemSetAligned(set->freelist, 0, sizeof(set->freelist));
  	set->blocks = NULL;
***
*** 678,683  AllocSetAlloc(MemoryContext context, Size size

Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-19 Thread Jeff Davis
On Fri, 2014-08-15 at 13:53 -0400, Robert Haas wrote:
 I think that's right, and I rather like your (Jeff's) approach.  It's
 definitely true that we could do better if we have a mechanism for
 serializing and deserializing group states, but (1) I think an awful
 lot of cases would get an awful lot better even just with the approach
 proposed here and (2) I doubt we would make the
 serialization/deserialization interfaces mandatory, so even if we had
 that we'd probably want a fallback strategy anyway.

Thank you for taking a look.

To solve the problem for array_agg, that would open up two potentially
lengthy discussions:

1. Trying to support non-serialized representations (like
ArrayBuildState for array_agg) as a real type rather than using
internal.

2. What changes should we make to the aggregate API? As long as we're
changing/extending it, should we go the whole way and support partial
aggregation[1] (particularly useful for parallelism)?

Both of those discussions are worth having, and perhaps they can happen
in parallel as I wrap up this patch.

I'll see whether I can get consensus that my approach is (potentially)
commit-worthy, and your statement that it (potentially) solves a real
problem is a big help.

Regards,
Jeff Davis

[1]
http://blogs.msdn.com/b/craigfr/archive/2008/01/18/partial-aggregation.aspx




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-19 Thread Jeff Davis
On Thu, 2014-08-14 at 12:52 -0400, Robert Haas wrote:
 It appears to me that the performance characteristics for this version
 are not significantly different from version 1.  I have not looked at
 the code.

While trying to reproduce your results, I noticed what might be around a
1% regression just from adding the 3 fields to MemoryContextData. If I
cut it down to adding just one field, the regression disappears.

The results are fairly noisy, so I could be chasing the wrong thing. But
one reason to believe it is that I pushed the size of MemoryContextData
above 64 bytes, which sounds like it might be an important threshold.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-19 Thread Jeff Davis
On Sat, 2014-08-16 at 23:09 +0200, Tomas Vondra wrote:
 But maybe the inheritance really is not necessary - maybe it would be
 enough to track this per-context, and then just walk through the
 contexts and collect this. Because my observation is that most of the
 time is actually spent in walking through blocks and freelists.

That makes a lot of sense to me.
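
Something like this, I think (a toy sketch with invented names; each
context tracks only its own allocation, and the caller walks the subtree
to collect the total):

/* Toy sketch of "track per context, walk the tree to collect". */
#include <stdio.h>
#include <stdint.h>

typedef struct ToyMemContext
{
    const char           *name;
    uint64_t              self_allocated;   /* blocks allocated here only */
    struct ToyMemContext *firstchild;
    struct ToyMemContext *nextsibling;
} ToyMemContext;

static uint64_t
subtree_allocated(const ToyMemContext *cxt)
{
    uint64_t total = cxt->self_allocated;

    for (const ToyMemContext *c = cxt->firstchild; c != NULL; c = c->nextsibling)
        total += subtree_allocated(c);
    return total;
}

int
main(void)
{
    ToyMemContext grandchild = { "per-group",  1024, NULL,        NULL };
    ToyMemContext child      = { "hash table", 8192, &grandchild, NULL };
    ToyMemContext aggcontext = { "aggcontext", 4096, &child,      NULL };

    printf("%s subtree: %llu bytes\n",
           aggcontext.name, (unsigned long long) subtree_allocated(&aggcontext));
    return 0;
}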

Another approach is to pass a flag to hash_create that tells it not to
create a subcontext. Then we don't need to traverse anything; we already
know which context we want to look at. Perhaps I was being too clever
with the idea of tracking space for an entire hierarchy.

Also, as I pointed out in my reply to Robert, adding too many fields to
MemoryContextData may be the cause of the regression. Your idea requires
only one field, which doesn't show the same regression in my tests.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-14 Thread Jeff Davis
I think the hash-join like approach is reasonable, but I also think
you're going to run into a lot of challenges that make it more complex
for HashAgg. For instance, let's say you have the query:

  SELECT x, array_agg(y) FROM foo GROUP BY x;

Say the transition state is an array (for the sake of simplicity), so
the hash table has something like:

  1000 => {7,   8,  9}
  1001 => {12, 13, 14}

You run out of memory and need to split the hash table, so you scan the
hash table and find that group 1001 needs to be written to disk. So you
serialize the key and array and write them out.

Then the next tuple you get is (1001, 19). What do you do? Create a new
group 1001 => {19} (how do you combine it later with the first one)? Or
try to fetch the existing group 1001 from disk and advance it (horrible
random I/O)?



On Wed, 2014-08-13 at 12:31 +0200, Tomas Vondra wrote:
 My understanding of the batching algorithm (and I may be wrong on this
 one) is that once you choose the number of batches, it's pretty much
 fixed. Is that the case?

It's only fixed for that one work item (iteration). A different K can
be selected if memory is exhausted again. But you're right: this is a
little less flexible than HashJoin.

 But what will happen in case of significant cardinality underestimate?
 I.e. what will happen if you decide to use 16 batches, and then find
 out 256 would be more appropriate? I believe you'll end up with batches
 16x the size you'd want, most likely exceeding work_mem.

Yes, except that work_mem would never be exceeded. If the partitions are
16X work_mem, then each would be added as another work_item, and
hopefully it would choose better the next time.

  One thing I like about my simple approach is that it returns a good
  number of groups after each pass, and then those are completely finished
  (returned to the operator above, even). That's impossible with HashJoin
  because the hashing all needs to be done before the probe phase begins.
 
 The hash-join approach returns ~1/N groups after each pass, so I fail to
 see how this is better?

You can't return any tuples until you begin the probe phase, and that
doesn't happen until you've hashed the entire inner side (which involves
splitting and other work). With my patch, it will return some tuples
after the first scan. Perhaps I'm splitting hairs here, but the idea of
finalizing some groups as early as possible seems appealing.

 Aha! And the new batches are 'private' to the work item, making it a bit
 recursive, right? Is there any reason not to just double the number of
 batches globally?

I didn't quite follow this proposal.

 It also seems to me that using HASH_DISK_MAX_PARTITIONS, and then allowing
 each work item to create it's own set of additional partitions effectively
 renders the HASH_DISK_MAX_PARTITIONS futile.

It's the number of active partitions that matters, because that's what
causes the random I/O.

Regards,
Jeff Davis





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-14 Thread Jeff Davis
On Thu, 2014-08-14 at 10:06 -0400, Tom Lane wrote:
 If you're following the HashJoin model, then what you do is the same thing
 it does: you write the input tuple back out to the pending batch file for
 the hash partition that now contains key 1001, whence it will be processed
 when you get to that partition.  I don't see that there's any special case
 here.

HashJoin only deals with tuples. With HashAgg, you have to deal with a
mix of tuples and partially-computed aggregate state values. Not
impossible, but it is a little more awkward than HashJoin.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-14 Thread Jeff Davis
On Thu, 2014-08-14 at 16:17 +0200, Tomas Vondra wrote:
 Either it belongs to the current batch (and either it's in the hash
 table, or you add it there), or it's not - in that case write it to a
 temp file.

I think the part you left out is that you need two files per batch: one
for the dumped-out partially-computed state values, and one for the
tuples.

In other words, you haven't really discussed the step where you reunite
the tuples with that partially-computed state.
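
Concretely, I am picturing something with this shape per batch
(hypothetical names, just to illustrate the two files and the reuniting
step):

/* Hypothetical shape of a spilled HashAgg batch under that scheme:
 * two temp files per batch, reunited when the batch is processed. */
#include <stdio.h>

typedef struct SpilledBatch
{
    FILE *state_file;   /* serialized partially-computed aggregate states */
    FILE *tuple_file;   /* raw input tuples that hashed to this batch */
} SpilledBatch;

static void
process_batch(SpilledBatch *batch)
{
    /* 1. rebuild the hash table from state_file (one entry per dumped group) */
    /* 2. read tuple_file and advance the matching transition states          */
    /* 3. emit the finalized groups                                           */
    (void) batch;
}

int
main(void)
{
    SpilledBatch batch = { tmpfile(), tmpfile() };

    process_batch(&batch);
    if (batch.state_file)
        fclose(batch.state_file);
    if (batch.tuple_file)
        fclose(batch.tuple_file);
    return 0;
}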

 For sure, it's not for free - it may write to quite a few files. Is it
 more expensive than what you propose? I'm not sure about that. With
 your batching scheme, you'll end up with lower number of large batches,
 and you'll need to read and split them, possibly repeatedly. The
 batching scheme from hashjoin minimizes this.

My approach only has fewer batches if it elects to have fewer batches,
which might happen for two reasons:
 1. A cardinality misestimate. This certainly could happen, but we do
have useful numbers to work from (we know the number of tuples and
distincts that we've read so far), so it's far from a blind guess. 
 2. We're concerned about the random I/O from way too many partitions.

 I fail to see how this is different from your approach? How can you
 output any tuples before processing the whole inner relation?

Right, the only thing I avoid is scanning the hash table and dumping out
the groups.

This isn't a major distinction, more like my approach does a little
less work before returning tuples, and I'm not even sure I can defend
that, so I'll retract this point.

 Your approach is to do multi-level batching, and I was thinking whether
 it'd be possible to use the same approach (single level). But in
 retrospect it probably does not make much sense, because the multi-level
 batching is one of the points of the proposed approach.

Now that I think about it, many of the points we discussed could
actually work with either approach:
  * In my approach, if I need more partitions, I could create more in
much the same way as HashJoin to keep it single-level (as you suggest
above).
  * In your approach, if there are too many partitions, you could avoid
random I/O by intentionally putting tuples from multiple partitions in a
single file and moving them while reading.
  * If given a way to write out the partially-computed states, I could
evict some groups from the hash table to keep an array_agg() bounded.

Our approaches only differ on one fundamental trade-off that I see:
  (A) My approach requires a hash lookup of an already-computed hash for
every incoming tuple, not only the ones going into the hash table.
  (B) Your approach requires scanning the hash table and dumping out the
states every time the hash table fills up, which therefore requires a
way to dump out the partial states.

You could probably win the argument by pointing out that (A) is O(N) and
(B) is O(log2(N)). But I suspect that cost (A) is very low.

Unfortunately, it would take some effort to test your approach because
we'd actually need a way to write out the partially-computed state, and
the algorithm itself seems a little more complex. So I'm not really sure
how to proceed.

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

