[HACKERS] Problems with approach #2 to value locking (INSERT ... ON CONFLICT UPDATE/IGNORE patch)

2015-01-01 Thread Peter Geoghegan
I've been working on fixing the bugs that Jeff Janes found [1] with
approach #2 to value locking [2]. Approach #1 was unaffected.

I'm starting this new thread to discuss these issues. Let's confine
discussion of semantics and syntax to the main thread, since that is
what has mostly been discussed there.

Jeff produced a stress-testing tool which tests concurrent deletions
(and not just concurrent updates); his test case deletes tuples when a
counter value goes under zero, which happens occasionally. It also
independently does bookkeeping for all values in a range of values
that are randomly incremented or decremented, and verifies that
everything is as it should be. This revealed problems with #2 around
concurrent "super deletion" of "unkept" promise heap tuples. As Jeff
pointed out, all-NULL tuples were appearing as visible, or spurious
duplicates appeared. For example:

postgres=# select xmin, xmax, ctid, * from upsert_race_test where index = 1287;
  xmin  |  xmax  |   ctid    | index  | count
--------+--------+-----------+--------+-------
      0 | 205438 | (260,27)  | [null] | [null]
 176066 |      0 | (258,158) |   1287 |      1
(2 rows)

I should point out that one of the problems I found here was a garden
variety bug, which has been fixed (the conflict flag in ExecInsert()
was not being reset correctly, I believe). However, the other
remaining problems could only be fixed with further changes to the
routines in tqual.c, which I'm not happy about, and require further
discussion here. I attach a differential patch which applies on top of
the most recent revision of #2, which is v1.8.vallock2.tar.gz [3].

Note that I've been extending Jeff's stress testing tool to make it
highlight problems earlier, and to make it provide more and better
instrumentation (e.g. printing of XIDs to correlate with pg_xlogdump
output). A good trick I added was to have an INSERT...ON CONFLICT
UPDATE predicate of "WHERE TARGET.index = EXCLUDED.index", which
serves as a kind of assertion when used within the Perl script (the
Perl script checks RETURNING-projected tuples, and expects to always
insert or update - that WHERE clause should always pass, obviously).
Jeff's tool is available here (with my revisions):
https://github.com/petergeoghegan/jjanes_upsert

I've been running with fsync off, which Jeff felt increased the
likelihood of revealing the bug. It isn't necessary, though, and FWIW
it was easy for me to recreate this problem once I understood it,
using only my 4-core laptop. I found it important to build without
assertions and at optimization level -O2, though.

The attached patch fixes Jeff's test case, as far as it goes. But, as
I mentioned, it's not intended as a real fix. For one thing, I have
made multiple changes to HeapTupleSatisfies*() routines, but haven't
fully considered the consequences for, say, ri_triggers.c.
HeapTupleSatisfiesSelf() and HeapTupleSatisfiesHistoricMVCC() were not
modified. I've modified HeapTupleSatisfiesVacuum() to see
InvalidTransactionId (not invalid xid infomask bits) as visible for
garbage collection (alongside HeapTupleSatisfiesDirty() and
HeapTupleSatisfiesMVCC(), which I changed first), and I wouldn't be
surprised if that created new problems in other areas. In short, this
patch is just for illustrative purposes.

To see the difference this patch makes, I suggest commenting out the
new code, and running the following test after a fresh initdb:

$ perl count_upsert.pl 4 10

I think that the actual nature of the problem I fixed was that
concurrent upserters felt entitled to update a promise tuple that they
could at least initially see; the tuple was then "super deleted", but
they still concluded that it was okay to update (the tuple wasn't
actually duplicated insofar as HeapTupleSatisfiesDirty() was then able
to ascertain, due to other concurrent activity). It's pretty
complicated, in any case. This is just a band-aid.

Thoughts? Does anyone have any thoughts on my diagnosis of the problem?

[1] 
http://www.postgresql.org/message-id/CAMkU=1w8e9Q6ZX3U85RtsyCMpdYWFrvVAO3=unevtqiruzn...@mail.gmail.com
[2] 
https://wiki.postgresql.org/wiki/Value_locking#.232._.22Promise.22_heap_tuples_.28Heikki_Linnakangas.29
[3] 
http://www.postgresql.org/message-id/cam3swzrg_htrol-6_wfe6_d_ucuyc28jfapsfh_tra76gkk...@mail.gmail.com
-- 
Peter Geoghegan
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4e5be1a..9bb13df 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -313,8 +313,8 @@ ExecInsert(TupleTableSlot *slot,
 		 * deleting the already-inserted tuple and retrying, but that's fairly
 		 * expensive, so we try to avoid it.
 		 */
-		conflict = false;
 vlock:
+		conflict = false;
 		ItemPointerSetInvalid(&conflictTid);
 
 		/*
@@ -354,7 +354,8 @@ vlock:
 			 * which is appropriate only for non-promise tuples) to wait for us
 			 * to decide if we're going to go ahead w

Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2015-01-01 Thread Dilip kumar
On 31 December 2014 18:36, Amit Kapila Wrote,

>The patch looks good to me.  I have done couple of
>cosmetic changes (spelling mistakes, improve some comments,
>etc.), check the same once and if you are okay, we can move
>ahead.

Thanks for the review and the changes; the changes look fine to me.

Regards,
Dilip



Re: [HACKERS] orangutan seizes up during isolation-check

2015-01-01 Thread Noah Misch
On Wed, Dec 31, 2014 at 01:56:08PM -0500, Noah Misch wrote:
> On Wed, Dec 31, 2014 at 12:32:37AM -0500, Robert Haas wrote:
> > On Sun, Dec 28, 2014 at 4:58 PM, Noah Misch  wrote:
> > > I wondered whether to downgrade FATAL to LOG in back branches.
> > > Introducing a new reason to block startup is disruptive for a minor
> > > release, but having the postmaster deadlock at an unpredictable later
> > > time is even more disruptive.  I am inclined to halt startup that way
> > > in all branches.
> > 
> > Jeepers.  I'd rather not do that.  From your report, this problem has
> > been around for years.  Yet, as far as I know, it's bothering very few
> > real users, some of whom might be far more bothered by the postmaster
> > suddenly failing to start.  I'm fine with a FATAL in master, but I'd
> > vote against doing anything that might prevent startup in the
> > back-branches without more compelling justification.
> 
> Clusters hosted on OS X fall into these categories:
> 
> 1) Unaffected configuration.  This includes everyone setting a valid messages
>locale via LANG, LC_ALL or LC_MESSAGES.
> 2) Affected configuration.  Through luck and light use, the cluster would not
>experience the crashes/hangs.
> 3) Cluster would experience the crashes/hangs.
> 
> DBAs in (3) want the FATAL at startup, but those in (2) want a LOG message
> instead.  DBAs in (1) don't care.  Since intermittent postmaster hangs are far
> worse than startup failure, if (2) and (3) have similar population, FATAL is
> the better bet.  If (2) is sufficiently more populous than (3), then the many
> small pricks from startup failure do add up to hurt more than the occasional
> postmaster hang.  Who knows how that calculation plays out.

The first attached patch, for all branches, adds LOG-level messages and an
assertion.  So cassert builds will fail hard, while others won't.  The second
patch, for master only, changes the startup-time message to FATAL.  If we
decide to use FATAL in all branches, I would just squash them into one.
diff --git a/configure b/configure
index 7594401..f7dedc0 100755
--- a/configure
+++ b/configure
@@ -11366,7 +11366,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
+for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 0dc3f18..76c6405 100644
--- a/configure.in
+++ b/configure.in
@@ -1265,7 +1265,7 @@ PGAC_FUNC_GETTIMEOFDAY_1ARG
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5106f52..6b0190c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -87,6 +87,10 @@
 #include 
 #endif
 
+#ifdef HAVE_PTHREAD_IS_THREADED_NP
+#include <pthread.h>
+#endif
+
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "bootstrap/bootstrap.h"
@@ -1200,6 +1204,24 @@ PostmasterMain(int argc, char *argv[])
 */
RemovePgTempFiles();
 
+#ifdef HAVE_PTHREAD_IS_THREADED_NP
+
+	/*
+	 * On Darwin, libintl replaces setlocale() with a version that calls
+	 * CFLocaleCopyCurrent() when its second argument is "" and every relevant
+	 * environment variable is unset or empty.  CFLocaleCopyCurrent() makes
+	 * the process multithreaded.  The postmaster calls sigprocmask() and
+	 * calls fork() without an immediate exec(), both of which have undefined
+	 * behavior in a multithreaded program.  A multithreaded postmaster is the
+	 * normal case on Windows, which offers neither fork() nor sigprocmask().
+	 */
+	if (pthread_is_threaded_np() != 0)
+		ereport(LOG,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("postmaster became multithreaded during startup"),

Re: [HACKERS] orangutan seizes up during isolation-check

2015-01-01 Thread Noah Misch
On Wed, Dec 31, 2014 at 01:55:23PM -0500, Dave Cramer wrote:
> So at this point removing the  --enable-nls from my config will solve the
> build problem.
> 
> Everyone knows there is an issue so there is no point in continuing to have
> it fail.

We hope all packagers will build with --enable-nls, so OS X buildfarm coverage
thereof will be valuable.  Let me see what I can pull together over the next
several weeks toward getting it green.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Compression of full-page-writes

2015-01-01 Thread Amit Kapila
On Fri, Jan 2, 2015 at 1:59 AM, Bruce Momjian  wrote:
>
> On Thu, Jan  1, 2015 at 10:40:53AM +0530, Amit Kapila wrote:
> > Good question,  I think there might be some impact due to that, but in
> > general for page level compression still there will be much more to
> > compress.
> >
> > In general, I think this idea has merit with respect to compressible data,
> > and to save for the cases where it will not perform well, there is an
> > on/off switch for this feature and in future if PostgreSQL has some better
> > compression method, we can consider the same as well.  One thing
> > that we need to think is whether users can decide with ease when to
> > enable this global switch.
>
> Yes, that is the crux of my concern.  I am worried about someone who
> assumes compression == good, and then enables it.  If we can't clearly
> know when it is good, it is even harder for users to know.

I think it might have been better if this switch were a relation-level
switch, as whether the data is compressible can depend on the schema
and data in individual tables, but I think your concern is genuine.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Compression of full-page-writes

2015-01-01 Thread Amit Kapila
On Thu, Jan 1, 2015 at 12:03 PM, Michael Paquier wrote:
> On Thu, Jan 1, 2015 at 2:10 PM, Amit Kapila wrote:
> > On Thu, Jan 1, 2015 at 2:39 AM, Bruce Momjian wrote:
> >> > So why are you bringing it up? That's not an argument for anything,
> >> > except not doing it in such a simplistic way.
> >>
> >> I still don't understand the value of adding WAL compression, given the
> >> high CPU usage and minimal performance improvement.  The only big
> >> advantage is WAL storage, but again, why not just compress the WAL file
> >> when archiving.
> When doing some tests with pgbench for a fixed number of transactions,
> I also noticed a reduction in replay time as well; see for example some
> results here:
> http://www.postgresql.org/message-id/CAB7nPqRv6RaSx7hTnp=g3dyqou++fel0uioyqpllbdbhayb...@mail.gmail.com
>
> >> I thought we used to see huge performance benefits from WAL
> >> compression, but not any more?
> >
> > I think there can be performance benefit for the cases when the data
> > is compressible, but it would be loss otherwise.  The main thing is
> > that the current compression algorithm (pg_lz) used is not so
> > favorable for non-compressible data.
> Yes definitely. Switching to a different algorithm would be the next
> step forward. We have been discussing mainly about lz4 that has a
> friendly license; I think that it would be worth studying other things
> as well once we have all the infrastructure in place.
>
> >> Has the UPDATE WAL compression removed that benefit?
> >
> > Good question,  I think there might be some impact due to that, but in
> > general for page level compression still there will be much more to
> > compress.
> That may be a good thing to put a number on. We could try to patch a
> build with a revert of a3115f0d and measure the difference in WAL size
> that it creates. Thoughts?
>

You can do that, but what inference do you want to draw from it?
I think there can be some improvement in performance as well as
compression depending on the tests (if your tests involve a lot of
updates, then you might see somewhat better results), however the
results will be more or less on similar lines.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] orangutan seizes up during isolation-check

2015-01-01 Thread Dave Cramer
So at this point removing the  --enable-nls from my config will solve the
build problem.

Everyone knows there is an issue so there is no point in continuing to have
it fail.

On 31 December 2014 at 13:52, Noah Misch  wrote:

> On Sun, Dec 28, 2014 at 07:20:04PM -0500, Andrew Dunstan wrote:
> > On 12/28/2014 04:58 PM, Noah Misch wrote:
> > > The gettext maintainer was open to implementing the
> > > setlocale_native_forked() technique in gettext, though the last visible
> > > progress was in October.  In any event, PostgreSQL builds will see
> > > older gettext for several years.  If setlocale-darwin-fork-v1.patch is
> > > not wanted, I suggest making the postmaster check during startup
> > > whether it has become multithreaded.  If multithreaded:
> > >
> > >   FATAL: postmaster became multithreaded during startup
> > >   HINT: Set the LC_ALL environment variable to a valid locale.
> > >
> > > I would like to go ahead and commit setlocale-main-harden-v1.patch,
> > > which is a good thing to have regardless of what happens with gettext.
> >
> > I'm OK with this, but on its own it won't fix orangutan's problems,
> > will it?
>
> Right; setlocale-main-harden-v1.patch fixes a bug not affecting orangutan
> at all.  None of the above will make orangutan turn green.  Checking
> multithreading during startup would merely let it fail cleanly.
>


Re: [HACKERS] add modulo (%) operator to pgbench

2015-01-01 Thread Alvaro Herrera
David Rowley wrote:

> At the moment I feel the patch is a bit half done. I really think that
> since the gaussian and exponential stuff was added in commit ed802e7d, that
> this should now be changed so that we have functions like random(),
> erandom() and grandom() and the way to use this becomes:
> 
> \set name random(1,10)
> \set name erandom(1,10)
> 
> And we completely pull out the new \\setrandom additions added in that
> commit.

Sounds good to me.  The current syntax is rather messy IMV, and
presumably a function-based syntax might be better.

> We'd likely keep \\setrandom name 1 10 for backwards compatibility.

Yep.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Publish autovacuum informations

2015-01-01 Thread Noah Misch
On Wed, Dec 31, 2014 at 12:46:17PM -0500, Tom Lane wrote:
> Robert Haas  writes:
> > On Mon, Dec 29, 2014 at 11:03 AM, Tom Lane  wrote:
> >> Either one of those approaches would cripple our freedom to change those
> >> data structures; which we've done repeatedly in the past and will surely
> >> want to do again.  So I'm pretty much -1 on exposing them.
> 
> > We could instead add a view of this information to core --
> > pg_stat_autovacuum, or whatever.
> 
> > But to be honest, I'm more in favor of Guillaume's proposal.  I will
> > repeat my recent assertion that we -- you in particular -- are too
> > reluctant to expose internal data structures to authors of C
> > extensions, and that this is developer-hostile.
> 
> Well, the core question there is whether we have a policy of not breaking
> extension-visible APIs.

No, we have no policy restricting backend C API changes in major releases.
Though this message is old enough to enroll in first grade, I know of no
policy decision supplanting it:
http://www.postgresql.org/message-id/8706.1230569...@sss.pgh.pa.us

> While we will very often do things like adding
> parameters to existing functions, I think we've tended to refrain from
> making wholesale semantic revisions to exposed data structures.

True.  I especially look to avoid changes that will cause extensions to build
and run, yet silently misbehave at runtime.  For example, had I reviewed the
pg_policy patch, I would have examined whether an unmodified 9.4 extension
might let a user bypass relation policy.  I oppose most header reorganization,
which breaks builds in exchange for insubstantial benefits.  I don't wish to
extend that anywhere near to the point of saying, "Your C function can't use
struct foo, because exposing struct foo in a header file would imply freezing
it."  Desire for backend API stability should not drive us to reject new
functionality.

> I'd be all right with putting the data structure declarations in a file
> named something like autovacuum_private.h, especially if it carried an
> annotation that "if you depend on this, don't be surprised if we break
> your code in future".

Such an annotation would be no more true than it is for the majority of header
files.  If including it makes you feel better, I don't object.

nm




Re: [HACKERS] Additional role attributes && superuser review

2015-01-01 Thread Magnus Hagander
On Wed, Dec 31, 2014 at 4:23 PM, Stephen Frost  wrote:

> * Magnus Hagander (mag...@hagander.net) wrote:
> > On Wed, Dec 31, 2014 at 3:08 PM, Stephen Frost wrote:
> > > * Magnus Hagander (mag...@hagander.net) wrote:
> > > > I think having it do exactly what pg_dump needs, and not things like
> > > > execute functions etc, would be the thing people want for a 'DUMP'
> > > > privilege.
> > >
> > > What if we want pg_dump in 9.6 to have an option to execute xlog_pause
> > > and xlog_resume for you?  You wouldn't be able to run that against a
> > > 9.5 database (or at least, that option wouldn't work).
> >
> > It would if you added an explicit grant for it, which would have to be
> > documented.
>
> Huh?  An explicit grant for xlog_pause/xlog_resume won't work as we
> check role attributes rights inside the function..
>

Correct, of course. I was confusing myself.


> > > We've discussed having a role attribute for COPY-from-filesystem, but
> > > pg_dump doesn't use that ever, it only uses COPY TO STDOUT.  I'm not
> > > a fan of making a COPY_TO_STDOUT-vs-SELECT distinction just for this..
> >
> > Yeah, it's probably going overboard with it, since AFAICT the only thing
> > that would actually be affected is RULEs on SELECT, which I bet most
> > people don't use on their tables.
>
> Well, we could make SELECT not work, but if you've got COPY then you can
> still get all the data, so, yeah, not much different.  I seriously doubt
> many people are using rules..
>

Yeah.


> > > > We could/should also throw a WARNING if DUMP is granted to a role
> > > > without BYPASSRLS in case row level security is enabled in the
> > > > system, I think.  But that's more of an implementation detail.
> > >
> > > That's a bit ugly and RLS could be added to a relation after the DUMP
> > > privilege is granted.
> >
> > Yes, it's not going to be all-covering, but it can still be a useful
> > hint/warning in the cases where it *does* that. We obviously still need
> > pg_dump to give the error in both scenarios.
>
> I'm not against doing it, personally, but I suspect others won't like it
> (or at least, that's been the case in the past with other things..).
>

Heh, let's defer to a third party then :)


> > > Ok, I see the point you're making that we could make this into a
> > > capability which isn't something which can be expressed through our
> > > existing GRANT system.  That strikes me as a solution trying to find a
> > > problem though.  There's no need to invent such an oddity for this
> > > particular use-case, I don't think.
> >
> > Maybe not, but we should make sure we don't paint ourselves into a corner
> > where we cannot do it in the future either.
>
> Agreed.  Do you see a risk here of that?
>

Not really, anymore, i think :)


> > > > For regular permissions, we could just pre-populate the system with
> > > > predefined roles and use regular GRANTs to those roles, instead of
> > > > relying on role attributes, which might in that case make it even
> > > > more clear?
> > >
> > > The reason for this approach is to address exactly the nightmare that
> > > is trying to maintain those regular permissions across all the objects
> > > in the system.  Today, people simply give the role trying to do the
> > > pg_dump superuser, which is the best option we have.  Saying 'grant
> > > SELECT on all the tables and USAGE on all the schemas' isn't suggested
> > > because it's a huge pain.  This role attribute provides a middle
> > > ground where the pg_dump'ing role isn't a superuser, but you don't
> > > have to ensure usage and select are granted to it for every relation.
> >
> > No, what I'm saying is we could have a *predefined role* that allows
> > "select on all the tables and usage on all the schemas". And you are
> > unable to actually remove that. It's not stored on the object, so you
> > cannot REVOKE the permission on the *object*. Since it's not stored on
> > the object it will also automatically apply to all new objects,
> > regardless of what you've done with DEFAULT PRIVILEGES etc.
> >
> > But you can grant users and other roles access to this role, and it's
> > dealt with like other roles from the "who has it" perspective, instead
> > of being special-cased.
> >
> > Instead of "ALTER USER foo WITH DUMP" (or whatever), you'd just do a
> > "GRANT pgdump TO foo" (which would then also work through things like
> > group membership as well)
>
> There's a definite backwards-compatibility concern with this, of course,
> but I see where you're coming from.  This only really applies with this
> particular pg_dump-related role-attribute discussion though, right?
>

Yes, I believe so. Because it's so similar to the regular permissions,
whereas something like "being able to take a base backup" is more of a
boolean - it doesn't apply to actual objects in the database cluster, it's
more global.


> This role wouldn't be able to be logged in with, I presume?


Definitely not.


> Would it
> show up when you run \du?  What 

Re: [HACKERS] Compression of full-page-writes

2015-01-01 Thread Bruce Momjian
On Thu, Jan  1, 2015 at 10:40:53AM +0530, Amit Kapila wrote:
> Good question,  I think there might be some impact due to that, but in
> general for page level compression still there will be much more to
> compress. 
> 
> In general, I think this idea has merit with respect to compressible data,
> and to save for the cases where it will not perform well, there is an
> on/off switch for this feature and in future if PostgreSQL has some better
> compression method, we can consider the same as well.  One thing
> that we need to think is whether users can decide with ease when to
> enable this global switch.

Yes, that is the crux of my concern.  I am worried about someone who
assumes compression == good, and then enables it.  If we can't clearly
know when it is good, it is even harder for users to know.  If we think
it isn't generally useful until a new compression algorithm is used,
perhaps we need to wait until we implement this.


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2015-01-01 Thread Bruce Momjian
On Thu, Jan  1, 2015 at 09:04:48PM +0100, Andres Freund wrote:
> On January 1, 2015 8:49:06 PM CET, Robert Haas wrote:
> > On Thu, Jan 1, 2015 at 11:59 AM, Andres Freund wrote:
> > > The problem is that just aligning the main allocation to some boundary
> > > doesn't mean the hot part of the allocation is properly aligned.
> > > shmem.c in fact can't really do much about that - so fully moving the
> > > responsibility seems more likely to ensure that future code thinks
> > > about alignment.
> >
> > That's true, but if you don't align the beginnings of the allocations,
> > then it's a lot more complicated for the code to properly align stuff
> > within the allocation.  It's got to insert a variable amount of
> > padding based on the alignment it happens to get.
>
> Hm? Allocate +PG_CACHELINE_SIZE and do var = CACHELINEALIGN(var).

Yeah, I am afraid we have to do that anyway --- you can see it in my
patch too.  I guess if you had the shmem allocation aligned on 64-byte
boundaries and required all allocations to be a multiple of 64, that
would work too.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2015-01-01 Thread Andres Freund
On January 1, 2015 8:49:06 PM CET, Robert Haas wrote:
> On Thu, Jan 1, 2015 at 11:59 AM, Andres Freund wrote:
> > The problem is that just aligning the main allocation to some boundary
> > doesn't mean the hot part of the allocation is properly aligned. shmem.c
> > in fact can't really do much about that - so fully moving the
> > responsibility seems more likely to ensure that future code thinks about
> > alignment.
>
> That's true, but if you don't align the beginnings of the allocations,
> then it's a lot more complicated for the code to properly align stuff
> within the allocation.  It's got to insert a variable amount of
> padding based on the alignment it happens to get.

Hm? Allocate +PG_CACHELINE_SIZE and do var = CACHELINEALIGN(var).


-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2015-01-01 Thread Robert Haas
On Thu, Jan 1, 2015 at 11:59 AM, Andres Freund  wrote:
> The problem is that just aligning the main allocation to some boundary
> doesn't mean the hot part of the allocation is properly aligned. shmem.c
> in fact can't really do much about that - so fully moving the
> responsibility seems more likely to ensure that future code thinks about
> alignment.

That's true, but if you don't align the beginnings of the allocations,
then it's a lot more complicated for the code to properly align stuff
within the allocation.  It's got to insert a variable amount of
padding based on the alignment it happens to get.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2015-01-01 Thread Bruce Momjian
On Thu, Jan  1, 2015 at 05:59:25PM +0100, Andres Freund wrote:
> > That seems like a strange approach.  I think it's pretty sensible to
> > try to ensure that allocated blocks of shared memory have decent
> > alignment, and we don't have enough of them for aligning on 64-byte
> > boundaries (or even 128-byte boundaries, perhaps) to eat up any
> > meaningful amount of memory.  The BUFFERALIGN() stuff, like much else
> > about the way we manage shared memory, has also made its way into the
> > dynamic-shared-memory code.  So if we do adjust the alignment that we
> > guarantee for the main shared memory segment, we should perhaps adjust
> > DSM to match.  But I guess I don't understand why you'd want to do it
> > that way.
> 
> The problem is that just aligning the main allocation to some boundary
> doesn't mean the hot part of the allocation is properly aligned. shmem.c
> in fact can't really do much about that - so fully moving the
> responsibility seems more likely to ensure that future code thinks about
> alignment.

Yes, there is shared memory allocation alignment and object alignment. 
Since there are only about 50 cases of these, a worst-case change to
force 64-byte alignment would only cost 3.2k of shared memory.

It might make sense to make them all 64-byte aligned to reduce CPU cache
contention, but we have to have actual performance numbers to prove
that.  My two patches allow individual object alignment to be tested.  I
have not been able to see any performance difference (<1%) with:

$ pgbench --initialize --scale 100 pgbench
$ pgbench --protocol prepared --client 32 --jobs 16 --time=100 --select-only pgbench

on my dual-socket 16 vcore server.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Parallel Seq Scan

2015-01-01 Thread Robert Haas
On Thu, Jan 1, 2015 at 12:00 PM, Fabrízio de Royes Mello
 wrote:
> Can we check the number of free bgworkers slots to set the max workers?

The real solution here is that this patch can't throw an error if it's
unable to obtain the desired number of background workers.  It needs
to be able to smoothly degrade to a smaller number of background
workers, or none at all.  I think a lot of this work will fall out
quite naturally if this patch is reworked to use the parallel
mode/parallel context stuff, the latest version of which includes an
example of how to set up a parallel scan in such a manner that it can
run with any number of workers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Seq Scan

2015-01-01 Thread Fabrízio de Royes Mello
> I think one thing we could do to minimize the chance of such an
> error is set the number of parallel workers used for the plan equal
> to max_worker_processes if parallel_seqscan_degree is greater
> than max_worker_processes.  Even if we do this, such an error can
> still occur if a user has registered a bgworker before we could
> start parallel plan execution.
>
>
Can we check the number of free bgworkers slots to set the max workers?

Regards,

Fabrízio Mello



-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2015-01-01 Thread Andres Freund
On 2015-01-01 11:55:03 -0500, Robert Haas wrote:
> On Mon, Dec 29, 2014 at 6:48 PM, Andres Freund  wrote:
> >> Andres reported the above 2x pgbench difference in February, but no
> >> action was taken as everyone felt there needed to be more performance
> >> testing, but it never happened:
> >
> > FWIW, I have no idea what exactly should be tested there. Right now I
> > think what we should do is to remove the BUFFERALIGN from shmem.c and
> > instead add explicit alignment code in a couple callers
> > (BufferDescriptors/Blocks, proc.c stuff).
> 
> That seems like a strange approach.  I think it's pretty sensible to
> try to ensure that allocated blocks of shared memory have decent
> alignment, and we don't have enough of them for aligning on 64-byte
> boundaries (or even 128-byte boundaries, perhaps) to eat up any
> meaningful amount of memory.  The BUFFERALIGN() stuff, like much else
> about the way we manage shared memory, has also made its way into the
> dynamic-shared-memory code.  So if we do adjust the alignment that we
> guarantee for the main shared memory segment, we should perhaps adjust
> DSM to match.  But I guess I don't understand why you'd want to do it
> that way.

The problem is that just aligning the main allocation to some boundary
doesn't mean the hot part of the allocation is properly aligned. shmem.c
in fact can't really do much about that - so fully moving the
responsibility seems more likely to ensure that future code thinks about
alignment.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2015-01-01 Thread Robert Haas
On Mon, Dec 29, 2014 at 6:48 PM, Andres Freund  wrote:
>> Andres reported the above 2x pgbench difference in February, but no
>> action was taken as everyone felt there needed to be more performance
>> testing, but it never happened:
>
> FWIW, I have no idea what exactly should be tested there. Right now I
> think what we should do is to remove the BUFFERALIGN from shmem.c and
> instead add explicit alignment code in a couple callers
> (BufferDescriptors/Blocks, proc.c stuff).

That seems like a strange approach.  I think it's pretty sensible to
try to ensure that allocated blocks of shared memory have decent
alignment, and we don't have enough of them for aligning on 64-byte
boundaries (or even 128-byte boundaries, perhaps) to eat up any
meaningful amount of memory.  The BUFFERALIGN() stuff, like much else
about the way we manage shared memory, has also made its way into the
dynamic-shared-memory code.  So if we do adjust the alignment that we
guarantee for the main shared memory segment, we should perhaps adjust
DSM to match.  But I guess I don't understand why you'd want to do it
that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] add modulo (%) operator to pgbench

2015-01-01 Thread Fabien COELHO


Hello David,


> At the moment I feel the patch is a bit half done. I really think that
> since the gaussian and exponential stuff was added in commit ed802e7d,
> this should now be changed so that we have functions like random(),
> erandom() and grandom(), and the way to use this becomes:
>
> \set name random(1,10)
> \set name erandom(1,10)
>
> And we completely pull out the new \setrandom additions added in that
> commit. We'd likely keep \setrandom name 1 10 for backwards compatibility.
>
> Does anyone else feel strongly about fixing this now, while we have the
> chance?


I thought about adding functions (possibly random, very probably abs and
some hash), but I felt that would be better left for a second round.


The other point is that although uniform random is fine, the gaussian and
exponential variants require an additional floating-point argument, which
means handling some typing.


The current patch is "just" about handling operators as before, although
in a much nicer and more extensible way, so I would suggest leaving
Robert's patch more or less as it is; people will be able to propose new
extensions later on.


--
Fabien.




[HACKERS] Transactions involving multiple postgres foreign servers

2015-01-01 Thread Ashutosh Bapat
Hi All,
While looking at the patch for supporting inheritance on foreign tables, I
noticed that if a transaction makes changes on more than one foreign
server, the current implementation in postgres_fdw doesn't ensure that
either all of those servers roll back or all of them commit their changes;
IOW there is a possibility that some of them commit their changes while
others roll back theirs.

PFA a patch which uses 2PC to solve this problem. In pgfdw_xact_callback(),
at the XACT_EVENT_PRE_COMMIT event it prepares the transaction on all the
foreign PostgreSQL servers, and at the XACT_EVENT_COMMIT or
XACT_EVENT_ABORT event it commits or aborts those prepared transactions
respectively.

The logic to craft the prepared-transaction ids is rudimentary and I am
open to suggestions for it. I had the following goals in mind while
crafting the transaction ids:
1. Minimize the chance of crafting a transaction id which would conflict
with a concurrent transaction id on that foreign server.
2. Because of a limitation described later, the DBA/user should be able to
identify the server which originated a remote transaction.
More can be found in the comments above the function
pgfdw_get_prep_xact_id() in the patch.

Limitations
---
1. After a transaction has been prepared on a foreign server, if the
connection to that server is lost before the transaction is rolled back or
committed there, the transaction remains in the prepared state forever.
Manual intervention would be needed to clean up such a transaction (hence
goal 2 above). Automating this would require significant changes to the
transaction manager, so it is left out of this patch, which I thought was
better for now. If required, I can work on that part in this patch itself.

2. 2PC is needed only when more than one foreign server is involved in a
transaction. Transactions touching a single foreign server are handled
well right now. So, ideally, the code should detect whether more than one
foreign server is involved in the transaction and only then use 2PC. But I
couldn't find a way to detect that without changing the transaction
manager.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 116be7d..9492f14 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -42,20 +42,21 @@ typedef struct ConnCacheKey
 } ConnCacheKey;
 
 typedef struct ConnCacheEntry
 {
 	ConnCacheKey key;			/* hash key (must be first) */
 	PGconn	   *conn;			/* connection to foreign server, or NULL */
 	int			xact_depth;		/* 0 = no xact open, 1 = main xact open, 2 =
  * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	char		*prep_xact_name;	/* Name of prepared transaction on this connection */
 } ConnCacheEntry;
 
 /*
  * Connection cache (initialized on first use)
  */
 static HTAB *ConnectionHash = NULL;
 
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
@@ -135,20 +136,21 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	 * Find or create cached entry for requested connection.
 	 */
 	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
 	if (!found)
 	{
 		/* initialize new hashtable entry (key is already filled in) */
 		entry->conn = NULL;
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->prep_xact_name = NULL;
 	}
 
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
 	 * connection is actually used.
 	 */
 
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
@@ -156,20 +158,21 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	 * will be left in a valid empty state.)
 	 */
 	if (entry->conn == NULL)
 	{
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
 		entry->conn = connect_pg_server(server, user);
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
 			 entry->conn, server->servername);
+		entry->prep_xact_name = NULL;
 	}
 
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
 	begin_remote_xact(entry);
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
@@ -507,20 +510,55 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 		if (clear)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 	if (clear)
 		PQclear(res);
 }
 
 /*
+ * pgfdw_get_prep_xact_id
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not

Re: [HACKERS] Parallel Seq Scan

2015-01-01 Thread Amit Kapila
On Wed, Dec 31, 2014 at 9:46 PM, Thom Brown  wrote:
>
> Another issue (FYI, pgbench2 initialised with: pgbench -i -s 100 -F 10
> pgbench2):
>
>  ➤ psql://thom@[local]:5488/pgbench2
>
> # explain (analyse, buffers, verbose) select distinct bid from
> pgbench_accounts;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> Time: 14897.991 ms
>
> 2014-12-31 15:21:57 GMT [9164]: [260-1] user=,db=,client= LOG:  server
> process (PID 10869) was terminated by signal 9: Killed
> 2014-12-31 15:21:57 GMT [9164]: [261-1] user=,db=,client= DETAIL:  Failed
> process was running: explain (analyse, buffers, verbose) select distinct
> bid from pgbench_accounts;
> 2014-12-31 15:21:57 GMT [9164]: [262-1] user=,db=,client= LOG:
> terminating any other active server processes
>
> Running it again, I get the same issue.  This is with
> parallel_seqscan_degree set to 8, and the crash occurs with 4 and 2 too.
>
> This doesn't happen if I set the pgbench scale to 50.  I suspect this is
> an OOM issue.  My laptop has 16GB RAM, the table is around 13GB at scale
> 100, and I don't have swap enabled.  But I'm concerned it crashes the
> whole instance.
>

Isn't this a backend crash due to OOM? After that, the server will
restart automatically.

> I also notice that requesting BUFFERS in a parallel EXPLAIN output
> yields no such information.

Yeah, and the reason is that all the BUFFERS-related work is done by the
worker backends; the master backend doesn't read any pages itself, so it
cannot accumulate this information.

> Is that not possible to report?

It is not impossible to report such information; we can develop some way
to share it between the master backend and the workers. I think we can do
this, if required, once the patch is more stabilized.


Thanks for looking into patch and reporting the issues.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] add modulo (%) operator to pgbench

2015-01-01 Thread David Rowley
On 1 January 2015 at 21:23, Fabien COELHO  wrote:

>> I was about to go and look at this, but I had a problem when attempting
>> to compile with MSVC.
>
> Thanks! Here is a v4 which includes your fix.


Thank you.

I've had a quick look at the patch as I'm quite interested in seeing some
improvements in this area.

At the moment I feel the patch is a bit half done. I really think that
since the gaussian and exponential stuff was added in commit ed802e7d, that
this should now be changed so that we have functions like random(),
erandom() and grandom() and the way to use this becomes:

\set name random(1,10)
\set name erandom(1,10)

And we completely pull out the new \setrandom additions added in that
commit. We'd likely keep \setrandom name 1 10 for backwards compatibility.

Does anyone else feel strongly about fixing this now, while we have the
chance?

Regards

David Rowley


Re: [HACKERS] add modulo (%) operator to pgbench

2015-01-01 Thread Fabien COELHO



> I was about to go and look at this, but I had a problem when attempting
> to compile with MSVC.


Thanks! Here is a v4 which includes your fix.

--
Fabien.

diff --git a/contrib/pgbench/.gitignore b/contrib/pgbench/.gitignore
index 489a2d6..aae819e 100644
--- a/contrib/pgbench/.gitignore
+++ b/contrib/pgbench/.gitignore
@@ -1 +1,3 @@
+/exprparse.c
+/exprscan.c
 /pgbench
diff --git a/contrib/pgbench/Makefile b/contrib/pgbench/Makefile
index b8e2fc8..2d4033b 100644
--- a/contrib/pgbench/Makefile
+++ b/contrib/pgbench/Makefile
@@ -4,7 +4,9 @@ PGFILEDESC = "pgbench - a simple program for running benchmark tests"
 PGAPPICON = win32
 
 PROGRAM = pgbench
-OBJS	= pgbench.o $(WIN32RES)
+OBJS	= pgbench.o exprparse.o $(WIN32RES)
+
+EXTRA_CLEAN	= exprparse.c exprscan.c
 
 PG_CPPFLAGS = -I$(libpq_srcdir)
 PG_LIBS = $(libpq_pgport) $(PTHREAD_LIBS)
@@ -18,8 +20,22 @@ subdir = contrib/pgbench
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
+
+distprep: exprparse.c exprscan.c
 endif
 
 ifneq ($(PORTNAME), win32)
 override CFLAGS += $(PTHREAD_CFLAGS)
 endif
+
+# There is no correct way to write a rule that generates two files.
+# Rules with two targets don't have that meaning, they are merely
+# shorthand for two otherwise separate rules.  To be safe for parallel
+# make, we must chain the dependencies like this.  The semicolon is
+# important, otherwise make will choose the built-in rule for
+# gram.y=>gram.c.
+
+exprparse.h: exprparse.c ;
+
+# exprscan is compiled as part of exprparse
+exprparse.o: exprscan.c
diff --git a/contrib/pgbench/exprparse.y b/contrib/pgbench/exprparse.y
new file mode 100644
index 000..243c6b9
--- /dev/null
+++ b/contrib/pgbench/exprparse.y
@@ -0,0 +1,96 @@
+%{
+/*-
+ *
+ * exprparse.y
+ *	  bison grammar for a simple expression syntax
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-
+ */
+
+#include "postgres_fe.h"
+
+#include "pgbench.h"
+
+PgBenchExpr *expr_parse_result;
+
+static PgBenchExpr *make_integer_constant(int64 ival);
+static PgBenchExpr *make_variable(char *varname);
+static PgBenchExpr *make_op(char operator, PgBenchExpr *lexpr,
+		PgBenchExpr *rexpr);
+
+%}
+
+%expect 0
+%name-prefix="expr_yy"
+
+%union
+{
+	int64		ival;
+	char	   *str;
+	PgBenchExpr *expr;
+}
+
+%type  expr
+%type  INTEGER
+%type  VARIABLE
+%token INTEGER VARIABLE
+%token CHAR_ERROR /* never used, will raise a syntax error */
+
+%left	'+' '-'
+%left	'*' '/' '%'
+%right	UMINUS
+
+%%
+
+result: expr{ expr_parse_result = $1; }
+
+expr: '(' expr ')'			{ $$ = $2; }
+	| '+' expr %prec UMINUS	{ $$ = $2; }
+	| '-' expr %prec UMINUS	{ $$ = make_op('-', make_integer_constant(0), $2); }
+	| expr '+' expr			{ $$ = make_op('+', $1, $3); }
+	| expr '-' expr			{ $$ = make_op('-', $1, $3); }
+	| expr '*' expr			{ $$ = make_op('*', $1, $3); }
+	| expr '/' expr			{ $$ = make_op('/', $1, $3); }
+	| expr '%' expr			{ $$ = make_op('%', $1, $3); }
+	| INTEGER				{ $$ = make_integer_constant($1); }
+	| VARIABLE				{ $$ = make_variable($1); }
+	;
+
+%%
+
+static PgBenchExpr *
+make_integer_constant(int64 ival)
+{
+	PgBenchExpr *expr = pg_malloc(sizeof(PgBenchExpr));
+
+	expr->etype = ENODE_INTEGER_CONSTANT;
+	expr->u.integer_constant.ival = ival;
+	return expr;
+}
+
+static PgBenchExpr *
+make_variable(char *varname)
+{
+	PgBenchExpr *expr = pg_malloc(sizeof(PgBenchExpr));
+
+	expr->etype = ENODE_VARIABLE;
+	expr->u.variable.varname = varname;
+	return expr;
+}
+
+static PgBenchExpr *
+make_op(char operator, PgBenchExpr *lexpr, PgBenchExpr *rexpr)
+{
+	PgBenchExpr *expr = pg_malloc(sizeof(PgBenchExpr));
+
+	expr->etype = ENODE_OPERATOR;
+	expr->u.operator.operator = operator;
+	expr->u.operator.lexpr = lexpr;
+	expr->u.operator.rexpr = rexpr;
+	return expr;
+}
+
+#include "exprscan.c"
diff --git a/contrib/pgbench/exprscan.l b/contrib/pgbench/exprscan.l
new file mode 100644
index 000..4c9229c
--- /dev/null
+++ b/contrib/pgbench/exprscan.l
@@ -0,0 +1,105 @@
+%{
+/*-
+ *
+ * exprscan.l
+ *	  a lexical scanner for a simple expression syntax
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *-
+ */
+
+/* line and column number for error reporting */
+static int	yyline = 0, yycol = 0;
+
+/* Handles to the buffer that the lexer uses internally */
+static YY_BUFFER_STATE scanbufhandle;
+static char *scanbuf;
+static int	scanbuflen;
+
+/* flex 2.5.4 doesn't bother with a decl for this */
+int expr_yylex(void);
+
+%}
+
+%option 8bit
+%option never-interactive
+%option node