Re: [HACKERS] IPv6 link-local addresses and init data type

2016-06-28 Thread Markus Wanner
Haribabu,

On 07.06.2016 07:19, Haribabu Kommi wrote:
>> I have not looked at the spec, but I wouldn't be surprised if there
>> were an upper limit on the length of valid scope names.

Yeah, I didn't find any upper limit, either.

> I am not able to find any link that suggests the maximum length of the
> scope id. From [1], I came to know how the scope id is formed in
> different operating systems.
> 
> windows - fe80::3%1
> unix systems - fe80::3%eth0

eth0 may well be interface number 1 on some machine, therefore being
equivalent. However, as discussed before, Postgres cannot know, so it
shouldn't bother.

> I added another character array of 256 members into inet_struct as a
> last member to store the zone id.

I haven't looked at the patch in detail, but zeroing or memcpy'ing those
256 bytes seems like overkill to me. I'd recommend limiting this to
allocating and moving around only as many bytes as needed for the scope id.
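
For illustration, a minimal sketch of what I mean (this is not the actual
inet_struct layout; the field names are made up): keep an explicit length
byte and treat the zone id as a variable-length tail, so that only the
bytes actually used need to be zeroed, copied or compared:

  #include <stddef.h>

  typedef struct demo_inet
  {
      unsigned char family;     /* AF_INET or AF_INET6 */
      unsigned char bits;       /* netmask length */
      unsigned char ipaddr[16]; /* up to 128 bits of address */
      unsigned char zone_len;   /* zone-id bytes actually used */
      char          zone[];     /* variable-length tail, not NUL-terminated */
  } demo_inet;

  /* number of bytes that actually need to be stored or memcmp'ed */
  static size_t
  demo_inet_size(const demo_inet *a)
  {
      return offsetof(demo_inet, zone) + a->zone_len;
  }

Two values would then compare equal iff their first demo_inet_size() bytes
agree.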

> Currently all the printable characters are treated as zone id's. I will
> restrict this to only alphabets and numbers.

I fear alphanumeric only is too restrictive. RFC 4007 only specifies
that the zone id "must not conflict with the delimiter character" and
leaves everything beyond that to the implementation (which seems too
loose; printable characters sound about right to me...).
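
To make that concrete, a tiny validation sketch (my own illustration, not
anything from the patch): accept any printable character except the '%'
delimiter itself and '/', which introduces the netmask:

  #include <ctype.h>
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  zone_id_is_valid(const char *zone, size_t len)
  {
      if (len == 0)
          return false;
      for (size_t i = 0; i < len; i++)
      {
          unsigned char c = (unsigned char) zone[i];

          /* reject non-printable bytes and the two delimiters */
          if (!isprint(c) || c == '%' || c == '/')
              return false;
      }
      return true;
  }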

> i will add the zone support for everything and send the patch.

What's currently missing?

> How about the following case, Do we treat them as same or different?
> 
> select 'fe80::%eth1'::inet = 'fe80::%ETH1'::inet;

Let's be consistent in not interpreting the scope id in any way, meaning
those would be different. (After all, interface names seem to be case
sensitive - on my variant of Linux at the very least - i.e. ETH1 cannot
be found, while eth1 can be.)

> fe80::%2/64 is only treated as a valid address, but not the other way
> around as fe80::/64%2.
> Do we need to throw an error in this case, or just ignore it?

I didn't find any evidence for the second case being invalid; nor for it
being valid.

Note, however, that RFC 4007 only gives recommendations for textual
representations (with a "should" preference for the former).

It explicitly "does not specify how the format for non-global addresses
should be combined with the preferred format for literal IPv6 addresses".

Also note that RFC 2732 (Format for Literal IPv6 Addresses in URL's)
doesn't have the '%' sign in the set of reserved characters (not even in
the "unwise" one).

I'm starting to question if it's really wise to add the scope id to the
INET6 type...

> [2] - 
> http://stackoverflow.com/questions/24932172/what-length-can-a-network-interface-name-have

Note that the scope id doesn't necessarily have to be a network
interface name. Concluding there's a maximum of 256 bytes just because
that's the maximum for a network interface name doesn't seem correct.
However, I agree that's a reasonable limit for a scope id of the inet6
data type.

Kind Regards

Markus Wanner





Re: [HACKERS] IPv6 link-local addresses and init data type

2016-06-03 Thread Markus Wanner
On 06/03/2016 06:55 PM, Tom Lane wrote:
> More importantly,
> on what basis do you conclude that the inet type will only be asked to
> store link-local addresses that are currently valid on the local machine?
> It is not very hard to think of applications where that wouldn't be the
> case.

That's a good point.

> I think a better plan is just to store the zone string verbatim.  It is
> not our job to check its validity or try to determine which spellings
> are equivalent.

Agreed.

That leaves me wondering if it's really worth extending INET, though.
TEXT would be just fine to store a textual scope id. And it makes it
utterly clear that there's no magic involved.

Kind Regards

Markus Wanner





Re: [HACKERS] IPv6 link-local addresses and init data type

2016-06-03 Thread Markus Wanner
On 06/03/2016 12:14 AM, Tom Lane wrote:
> Markus Wanner <mar...@bluegap.ch> writes:
>> I'm even wondering if 'fe80::1%1'::inet = 'fe80::1%2'::inet shouldn't
>> simply yield true. After all, it's the same (non-global) address.
> 
> Surely not?  If the zone_ids didn't mean anything, why would the concept
> even exist?  ISTM that what you've got there is two different addresses
> neither of which is reachable from outside the given machine --- but
> within that machine, they are different.

You are right, both of these addresses actually are considered different
from 'fe80::1', i.e. I cannot ping6 fe80::1, but need to explicitly
specify an interface - either via an additional argument or with the
percent notation.

That's interesting news to me, as this means that only global IPv6
addresses can be stored in 128 bits. For non-global ones that's not
sufficient and 'fe80::1' isn't even considered a valid address by
itself. Checking the headers and manpages, sockaddr_in6 sports a
uint32_t sin6_scope_id field (as per RFC 2553).
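
A minimal sketch of what that looks like in C ("eth0" is just an example
interface name): the scope has to be supplied next to the 128-bit address
for the kernel to know where to send the packet:

  #include <arpa/inet.h>
  #include <net/if.h>
  #include <netinet/in.h>
  #include <string.h>

  /* fill a sockaddr_in6 for a link-local address; returns 0 on success */
  static int
  fill_link_local(struct sockaddr_in6 *sa, const char *addr, const char *ifname)
  {
      memset(sa, 0, sizeof(*sa));
      sa->sin6_family = AF_INET6;
      if (inet_pton(AF_INET6, addr, &sa->sin6_addr) != 1)
          return -1;                              /* not a valid IPv6 address */
      sa->sin6_scope_id = if_nametoindex(ifname); /* 0 means "no such interface" */
      return (sa->sin6_scope_id != 0) ? 0 : -1;
  }

Something like fill_link_local(&sa, "fe80::1", "eth0") is roughly what
specifying the interface to ping6 boils down to before connect()/sendto().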

Then, there also is the embedded form which uses the 2nd 16-bit word of
the IPv6 address to store the scope id, see for example "8.1.1.3 Scope
Index" in [0]. However, that document also clearly states: "When you
specify scoped address to the command line, NEVER write the embedded
form (such as ff02:1::1 or fe80:2::fedc)".

Given that interfaces are addressed by index internally, I'm pretty sure
the two representations are equivalent (i.e. if eth3 currently happens
to be the 7th interface, then 'fe80::1%eth3' resolves to 'fe80::1%7').

Considering that Postgres is not unlikely to write INET values to
permanent storage, those had better survive a reboot. And while
I have some doubts about persistence of interface names, those clearly
have a higher chance of surviving a reboot compared to interface
indices. Therefore, I'd advocate resolving interface indices (if given)
to interface names using if_indextoname(3) and let INET types store only
names.
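
As a sketch, the normalization I have in mind (illustration only, not part
of the patch): if the zone id is purely numeric, try to resolve it to an
interface name on the local machine, otherwise keep it verbatim. This only
works for indices that are valid on the local host, of course:

  #include <net/if.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* write the zone id to store into "out" (of size outlen) and return it */
  static const char *
  normalize_zone(const char *zone, char *out, size_t outlen)
  {
      char *end;
      unsigned long idx = strtoul(zone, &end, 10);

      if (*zone != '\0' && *end == '\0')          /* all digits: an index */
      {
          char name[IF_NAMESIZE];

          if (if_indextoname((unsigned int) idx, name) != NULL)
          {
              snprintf(out, outlen, "%s", name);  /* store the name instead */
              return out;
          }
      }
      snprintf(out, outlen, "%s", zone);          /* keep verbatim */
      return out;
  }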

Assuming such an implementation, the following would hold (on the
example system mentioned above):

  'fe80::1%1'::inet != 'fe80::1%2'::inet

but:

  'fe80::1%7'::inet = 'fe80::1%eth3'::inet

I also tend to think we should deny scope ids for global addresses (as
ping6 does), but allow non-global addresses ('fe80::1') to be stored
without an interface name, as pre-existing applications may rely on that
working (unlike ping6).

Does that sound like a more reasonable plan?

Kind Regards

Markus Wanner


[0]: FreeBSD dev handbook, Section 8.1.1.3: Scope Index:
https://www.freebsd.org/doc/en/books/developers-handbook/ipv6.html





Re: [HACKERS] IPv6 link-local addresses and init data type

2016-06-02 Thread Markus Wanner
On 31.05.2016 12:40, Andreas Karlsson wrote:
> On 05/31/2016 04:06 AM, Tom Lane wrote:
>> Unless there's a semantic difference between fe80::1%2/64 and
>> fe80::1/64%2, this doesn't seem like a big deal to me.
> 
> As far as I can tell only fe80::1%2/64 is valid, but I am not 100% sure.

RFC 4007, Section 11.7 states: "In this combination, it is
important to place the zone index portion before the prefix length when
we consider parsing the format by a name-to-address library function
[11].  That is, we can first separate the address with the zone index
from the prefix length, and just pass the former to the library function."

However, in the sense of being liberal in what you accept, 'fe80::/64%2'
should probably work as well.
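
A parsing sketch along the lines of section 11.7 (illustration only, not
the actual inet input routine): split off the prefix length at '/' first,
then the zone id at '%'; to be liberal, also look for a zone id trailing
the prefix length:

  #include <string.h>

  /* splits "input" in place into address, optional zone and optional masklen */
  static void
  split_inet_input(char *input, char **addr, char **zone, char **masklen)
  {
      char *slash, *pct;

      *addr = input;
      *zone = *masklen = NULL;

      slash = strchr(input, '/');
      if (slash)
      {
          *slash = '\0';
          *masklen = slash + 1;
      }

      pct = strchr(input, '%');          /* preferred: zone before '/' */
      if (pct == NULL && *masklen)
          pct = strchr(*masklen, '%');   /* liberal: zone after the prefix */
      if (pct)
      {
          *pct = '\0';
          *zone = pct + 1;
      }
  }

With that, both "fe80::1%eth0/64" and "fe80::/64%2" come apart into
address, zone id and prefix length.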

Given that a zone_id is a) highly system dependent and b) only ever
meaningful for non-global addresses, I'm wondering what the use case for
storing them is.

I'm even wondering if 'fe80::1%1'::inet = 'fe80::1%2'::inet shouldn't
simply yield true. After all, it's the same (non-global) address.

Kind Regards

Markus Wanner





Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-23 Thread Markus Wanner
On 07/16/2013 09:14 PM, I wrote:
 But okay, you're saying we *have* and *want* a guarantee that even a
 superuser cannot execute arbitrary native code via libpq (at least in
 default installs w/o extensions).

I stand corrected and have to change my position, again. For the record:

We do not have such a guarantee. Nor does it seem reasonable to want
one. On a default install, it is entirely possible for the superuser to
run arbitrary code via libpq alone.

There are various ways to do it, but the simplest one I was shown is:
 - upload a DSO from the client into a large object
 - SELECT lo_export() that LO to a file on the server
 - LOAD it
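
For completeness, the same three steps as seen from the client side via
libpq (a sketch only: the connection string, file names, and the absence
of error handling are all placeholders):

  #include <stdio.h>
  #include <libpq-fe.h>

  int
  main(void)
  {
      PGconn *conn = PQconnectdb("dbname=postgres");  /* superuser connection */
      char    sql[128];
      Oid     lobj;

      if (PQstatus(conn) != CONNECTION_OK)
          return 1;

      /* 1. upload the DSO from the client into a large object */
      PQclear(PQexec(conn, "BEGIN"));
      lobj = lo_import(conn, "./payload.so");         /* client-side file */
      PQclear(PQexec(conn, "COMMIT"));

      /* 2. write it out on the *server's* file system */
      snprintf(sql, sizeof(sql),
               "SELECT lo_export(%u, '/tmp/payload.so')", lobj);
      PQclear(PQexec(conn, sql));

      /* 3. have the server dlopen() it */
      PQclear(PQexec(conn, "LOAD '/tmp/payload.so'"));

      PQfinish(conn);
      return 0;
  }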

There are a couple other options, so even if we let LOAD perform
permission checks (as I proposed before in this thread), the superuser
can still fiddle with function definitions. To the point that it doesn't
seem reasonable to try to protect against that.

Thus, the argument against the original proposal based on security
grounds is moot. Put another way: There already are a couple of
backdoors a superuser can use. By default. Or with plpgsql removed.

Thanks to Dimitri and Andres for patiently explaining and providing
examples.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-22 Thread Markus Wanner
On 07/22/2013 12:11 AM, Hannu Krosing wrote:
 Dropping this barrier by installing an untrusted PL (or equally insecure
 extensions), an attacker with superuser rights can trivially gain
 root.
 Could you elaborate ?
 
 This is equivalent to claiming that any linux user can trivially gain root.

Uh. Sorry, you're of course right, the attacker can only gain postgres
rights in that case. Thanks for correcting.

The point still holds. It's another layer that an attacker would have to
overcome.

 You already mentioned untrusted PL languages, and I don't see any
 difference in between offering PL/pythonu and PL/C on security grounds,
 really.
 I agree. However, this also means that any kind of solution it offers is
 not a good one for the security conscious sysadmin.
 This is usually the case with a security conscious sysadmin - they very
 seldom want to install anything.

Exactly. That's why I'm favoring solutions that don't require any
extension and keep the guarantee of preventing arbitrary native code.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-22 Thread Markus Wanner
Dimitri,

On 07/22/2013 08:44 PM, Dimitri Fontaine wrote:
 That's the trade-off we currently need to make to be able to ship with
 the current security protections we're talking about.

Anything wrong with shipping postgis-1.5.so and postgis-2.0.so, as we do
for Debian?

 Ok, here's the full worked out example:
 
   ~# select lo_import('/path/to/local/on/the/client/adminpack.so');
    lo_import 
   -----------
        82153
   (1 row)
   
   ~# select lo_export(82153, '/tmp/foo.so');
    lo_export 
   -----------
            1
   (1 row)
   
   ~# LOAD '/tmp/foo.so';
   LOAD
   
   ~# CREATE FUNCTION pg_catalog.pg_file_write(text, text, bool)
        RETURNS bigint AS '/tmp/foo.so', 'pg_file_write'
        LANGUAGE C VOLATILE STRICT;
   CREATE FUNCTION

Good point.

Note that these lo_import() and lo_export() methods - unlike the client
counterparts - read and write to the server's file system.

So, your example is incomplete in that SELECT lo_import() doesn't load
data from the client into the database. But there are many other ways to
do that.

None the less, your point holds: lo_export() is a way for a superuser to
write arbitrary files on the server's file-system via libpq.

 Or to put it another way: Trusted PLs exist for a good reason. And some
 people just value security a lot and want that separation of roles.
 
 Yeah, security is important, but it needs to be addressing real threats
 to have any value whatsoever. We'not talking about adding new big holes
 as extensions containing modules are using LANGUAGE C in their script
 and thus are already restricted to superuser. That part would not
 change.

I agree, it's not a big hole. But with security, every additional layer
counts. I think the next step is to clarify whether or not we want to
provide that guarantee.

 As someone mentioned previously, $PGDATA may well be mounted noexec, so
 that seems to be a bad choice.
 
 I don't understand. You want good default security policy in place, and
 you're saying that using PGDATA allows for a really easy policy to be
 implemented by people who don't want the feature. It seems to me to be
 exactly what you want, why would that be a bad choice then?

I'm just saying that mounting $PGDATA with noexec makes $PGDATA unusable
to load libraries from; i.e. pg_dlopen() just outright fails.

And noexec is another protective layer. Building a solution that
requires $PGDATA to be mounted with exec permissions means that very
solution won't work with that protective layer. Which is a bad thing.

 Alternatively, we could solve that problem the other way around: Rather
 than template-ify the DSO, we could instead turn the objects created by
 the SQL scripts into something that's more linked to the script.
 Something that would reload as soon as the file on disk changes.
 
 I think that's the wrong way to do it, because you generally want more
 control over those two steps (preparing the template then instanciating
 it) and you're proposing to completely lose the control and have them be
 a single step instead.

Doesn't versioning take care of that? I.e. if you have version 1.0
created in your databases and then install 1.1 system-wide, that
shouldn't immediately instantiate.

 Use case: shipping a new plproxy bunch of functions to 256 nodes. You
 want to push the templates everywhere then just do the extension update
 as fast as possible so that it appears to be about 'in sync'.
 
 Pro-tip: you can use a plproxy function that runs the alter extension
 update step using RUN ON ALL facility.

Contrary use case: you actually *want* to *replace* the installed
version, as it's a plain security fix that's backwards compatible. A
plain 'apt-get upgrade' would do the job.

Anyways, I think this is bike-shedding: The question of what mental
model fits best doesn't matter too much, here.

I'm concerned about being consistent with whatever mental model we
choose and about an implementation that would work with whatever
security standards we want to adhere to.

 (Note how this would make out-of-line extensions a lot closer to the
 in-line variant your recent patch adds? With the dependency between
 template and instantiation?)
 
 Those dependencies are almost gone now except for being able to deal
 with DROP TEMPLATE FOR EXTENSION and guaranteeing we can actually
 pg_restore our pg_dump script.

I appreciate the bug fixes. However, there still is at least one
dependency. And that might be a good thing. Depending on what mental
model you follow.

 I've been changing the implementation to be what you proposed because I
 think it's making more sense (thanks!), and regression tests are
 reflecting that. My github branch is up-to-date with the last changes
 and I'm soon to send an updated patch that will be really as ready for
 commit as possible without a commiter review.

Good, thanks.

 An attacker having access to a libpq connection with superuser rights
 cannot currently run arbitrary native code. He can try a DOS by
 
 I think 

Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-21 Thread Markus Wanner
Salut Dimitri,

On 07/20/2013 01:23 AM, Dimitri Fontaine wrote:
 Markus Wanner <mar...@bluegap.ch> writes:
   - per-installation (not even per-cluster) DSO availability

 If you install PostGIS 1.5 on a system, then it's just impossible to
 bring another cluster (of the same PostgreSQL major version), let
 On Debian, that should be well possible. Certainly installing Postgres
 9.1 w/ postgis-1.5 in parallel to Postgres 9.2 w/ postgis-2.0 is. I
 designed it to be.

 I think I'm misunderstanding the problem statement, here.
 
 (of the same PostgreSQL major version)

Not sure what the issue is, here, but I agree that should be possible.

 Can CREATE EXTENSION check if the standbys have the extension installed?
 And refuse creation, if they don't?
 
 No, because we don't register standbies so we don't know who they are,
 and also because some things that we see connected and using the
 replication protocol could well be pg_basebackup or pg_receivexlog.

Can the standby check? In any case, these seem to be problems we can
solve without affecting security.

 Also, it's possible that the standby is only there for High Availability
 purposes and runs no user query.

Requiring the sysadmin to install the extensions there, too, seems
justified to me. Sounds like good advice, anyways.

 I'm sure you are aware that even without this clear separation of roles,
 the guarantee means we provide an additional level of security against
 attackers.
 
 Given lo_import() to upload a file from the client to the server then
 LOAD with the absolute path where the file ended up imported (or any
 untrusted PL really), this argument carries no sensible weight in my
 opinion.

lo_import() won't write a file for LOAD to load. An untrusted PL (or any
other extension allowing the superuser to do that) is currently required
to do that.

Or to put it another way: Trusted PLs exist for a good reason. And some
people just value security a lot and want that separation of roles.

 None the less, the safe by default has served us well, I think.
 
 That's true. We need to consider carefully the proposal at hand though.
 
 It's all about allowing the backend to automatically load a file that it
 finds within its own $PGDATA so that we can have per-cluster and
 per-database modules (DSO files).

As someone mentioned previously, $PGDATA may well be mounted noexec, so
that seems to be a bad choice.

 The only difference with before is the location where the file is read
 from, and the main security danger comes from the fact that we used to
 only consider root-writable places and now propose to consider postgres
 bootstrap user writable places.

FWIW, I only proposed to let postgres check write permissions on
libraries it loads. IIUC we don't currently do that, yet. And Postgres
happily loads a world-writable library, ATM.
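
A sketch of the kind of check I mean (not existing behaviour; the exact
policy is of course debatable): stat() the library before loading it and
warn or refuse if it is group- or world-writable, or owned by neither
root nor the server user:

  #include <stdbool.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/stat.h>

  static bool
  library_permissions_ok(const char *path, uid_t server_uid)
  {
      struct stat st;

      if (stat(path, &st) != 0)
          return false;                          /* cannot even stat it */
      if (st.st_mode & (S_IWGRP | S_IWOTH))
      {
          fprintf(stderr, "warning: \"%s\" is group- or world-writable\n", path);
          return false;
      }
      if (st.st_uid != 0 && st.st_uid != server_uid)
      {
          fprintf(stderr, "warning: \"%s\" is owned by an untrusted user\n", path);
          return false;
      }
      return true;
  }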

 Having the modules in different places in the system when it's a
 template and when it's instanciated allows us to solve a problem I
 forgot to list:
 
   - upgrading an extension at the OS level
 
 Once you've done that, any new backend will load the newer module
 (DSO file), so you have to be real quick if installing an hot fix in
 production and the SQL definition must be changed to match the new
 module version…

I agree, that's a problem.

Alternatively, we could solve that problem the other way around: Rather
than template-ify the DSO, we could instead turn the objects created by
the SQL scripts into something that's more linked to the script.
Something that would reload as soon as the file on disk changes.

(Note how this would make out-of-line extensions a lot closer to the
in-line variant your recent patch adds? With the dependency between
template and instantiation?)

 With the ability to instanciate the module in a per-cluster
 per-database directory within $PGDATA the new version of the DSO module
 would only put in place and loaded at ALTER EXTENSION UPDATE time.
 
 I'm still ok with allowing to fix those problems only when a security
 option that defaults to 'false' has been switched to 'true', by the way,
 so that it's an opt-in,

Okay, good.

For the issues you raised, I'd clearly prefer fixes that maintain
current security standards, though.

 but I will admit having a hard time swallowing
 the threat model we're talking about…

An attacker having access to a libpq connection with superuser rights
cannot currently run arbitrary native code. He can try a DoS by
exhausting system resources, but that's pretty far from being
invisible. Or he can delete valuable data. Maybe other nasty things. But
he should not be able to gain root access and cover his tracks.

Dropping this barrier by installing an untrusted PL (or equally insecure
extensions), an attacker with superuser rights can trivially gain
root.

Of course, an attacker shouldn't gain superuser rights in the first
place. But if he did, you better stop him right there with yet another
fence.

 It's extensions that undermine

Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-17 Thread Markus Wanner
On 07/17/2013 08:52 PM, Dimitri Fontaine wrote:
 the next step of this discussion should be about talking about the
 problems we have and figuring out *if* we want to solve them, now that
 we managed to explicitely say what we want as a project.
 
   - per-installation (not even per-cluster) DSO availability
 
 If you install PostGIS 1.5 on a system, then it's just impossible to
 bring another cluster (of the same PostgreSQL major version), let
 alone database, with PostGIS 2.x, even for migration assessment
 purposes. The By Design™ part is really hard to explain even to
 security concious users.

On Debian, that should be well possible. Certainly installing Postgres
9.1 w/ postgis-1.5 in parallel to Postgres 9.2 w/ postgis-2.0 is. I
designed it to be.

On distributions that do not allow parallel installation of multiple
Postgres major versions, it's certainly not the extension's fault.

I think I'm misunderstanding the problem statement, here.

   - hot standby and modules (aka DSO)
 
 As soon as you use some functions in 'language C' you need to
 carefully watch your external dependencies ahead of time. If you do
 CREATE EXTENSION hstore;, create an hstore column and a GiST index
 on it, then query the table on the standby… no luck. You would tell
 me that it's easy enough to do and that it's part of the job, so see
 next point.

Agreed, that's an area where Postgres could do better. I'd argue this
should be possible without relaxing the security guarantees provided,
though. Because there likely are people wanting both.

Can CREATE EXTENSION check if the standbys have the extension installed?
And refuse creation, if they don't?

   - sysadmin vs dba, or PostgreSQL meets the Cloud
 
 The current model of operations is intended for places where you
 have separate roles: the sysadmin cares about the OS setup and will
 provide with system packages (compiled extensions and the like), and
 DBA are never root on the OS. They can CREATE EXTENSION and maybe
 use the 'postgres' system account, but that's about it.

I'm sure you are aware that even without this clear separation of roles,
the guarantee means we provide an additional level of security against
attackers.

 Given the current raise of the Cloud environements and the devops
 teams, my understanding is that this model is no longer the only
 one. Depending on who you talk to, in some circles it's not even a
 relevant model anymore: user actions should not require the
 intervention of a sysadmin before hand.
 
 While I appreciate that many companies still want the old behavior
 that used to be the only relevant model of operations, I think we
 should also provide for the new one as it's pervasive enough now for
 us to stop ignoring it with our I know better smiling face.

I'd even think it's a minority who actually uses the guarantee we're
talking about. Mostly because of the many and widespread untrusted PLs
(which undermine the guarantee). And thus even before the rise of the cloud.

None the less, the safe by default has served us well, I think.

 Now it should be possible to solve at least some of those items while
 still keeping the restriction above, or with an opt-in mechanism to
 enable the works by default, but you have to solve the security
 implications yourself behaviour. A new GUC should do it, boolean,
 defaults false:
 
   runs_in_the_cloud_with_no_security_concerns = false

[ I usually associate cloud with (increased) security concerns, but
  that's an entirely different story. ]

 I don't think the relaxed behaviour we're talking about is currently
 possible to develop as an extension, by the way.

It's extensions that undermine the guarantee, at the moment. But yeah,
it depends a lot on what kind of relaxed behavior you have in mind.

 Andres made two contrib-free suggestions: with COPY TO BINARY, you get a
 
 Well, what about using lo_import()?

That only reads from the file-system. You probably meant lo_export(),
which is writing. Although not on the server's, but only on the (libpq)
client's file-system. No threat to the server.

 Yes it's dangerous. It's also solving real world problems that I see no
 other way to solve apart from bypassing the need to ever load a DSO
 file, that is embedding a retargetable C compiler in the backend.

If the sysadmin wants to prevent the postgres process from executing
arbitrary native code, any kind of embedded compiler likely is equally
unwelcome.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-16 Thread Markus Wanner
On 07/16/2013 01:27 AM, Robert Haas wrote:
 Andres points out that you can install adminpack to obtain
 local filesystem access, and that is true.  But the system
 administrator can also refuse to allow adminpack, and/or use selinux
 or other mechanisms to prevent the postgres binary from writing a file
 with execute permissions.

I think execute permissions (on the FS) are irrelevant. It's about
loading a shared library. The noexec mount option can prevent that, though.

But okay, you're saying we *have* and *want* a guarantee that even a
superuser cannot execute arbitrary native code via libpq (at least in
default installs w/o extensions).

Andres made two contrib-free suggestions: with COPY TO BINARY, you get a
header prepended, which I think is sufficient to prevent a dlopen() or
LoadLibrary(). Text and CSV formats of COPY escape their output, so it's
hard to write \000 or other control bytes. ESCAPE and DELIMITER also
have pretty restrictive requirements. So COPY doesn't seem quite good
enough to write a valid DSO.
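
To illustrate why, on ELF platforms at least: a shared object has to start
with the 4-byte ELF magic at offset 0, whereas a binary-format COPY file
starts with the documented 11-byte signature "PGCOPY\n\377\r\n\0", so the
two can never coincide. A trivial check (sketch):

  #include <stdbool.h>
  #include <stdio.h>
  #include <string.h>

  static bool
  looks_like_elf(const char *path)
  {
      static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };
      unsigned char buf[4];
      FILE *f = fopen(path, "rb");
      bool result = false;

      if (f == NULL)
          return false;
      if (fread(buf, 1, sizeof(buf), f) == sizeof(buf))
          result = (memcmp(buf, elf_magic, sizeof(buf)) == 0);
      fclose(f);
      return result;
  }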

His second suggestion was tuplesort tapes. tuplesort.c says: "We require
the first unsigned int of a stored tuple to be the total size on-tape
of the tuple". That's kind of a header as well. Writing a proper DSO
certainly does not sound trivial, either.

From a security perspective, I wouldn't want to rely on that guarantee.
Postgres writes too many files to be sure none of those can be abused to
write a loadable DSO, IMO.

Mounting $PGDATA 'noexec' and allowing the postgres user to write only
to such noexec mounts sounds like a good layer. It's independent, though
- it can be used whether or not the above guarantee holds.

 Things aren't quite so bad if we write the bits to a file first and
 then dynamically load the file.  That way at least noexec or similar
 can provide protection.  But it still seems like a pretty dangerous
 direction.

I agree now. Thanks for elaborating.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-15 Thread Markus Wanner
Robert,

thanks for answering. I think you responded to the (or some) idea in
general, as I didn't see any relation to the quoted part. (Otherwise
this is a hint that I've not been clear enough.)

On 07/15/2013 05:49 AM, Robert Haas wrote (in a slightly different order):
 There is a lot of
 (well-founded, IMHO) resistance to the idea of letting users install C
 code via an SQL connection.

Nowhere did I propose anything close to that. I'm in full support of the
resistance. Pitchforks and torches ready to rumble. :-)

 The security problems with this model have been discussed in every
 email thread about the extension template feature.

I read through a lot of these discussions, recently, but mostly found
misunderstanding and failure to distinguish between available extension
(template) and created extension (instantiation). I think this is a new
proposal, which I didn't ever see discussed. It does not have much, if
anything, to do with Dimitri's patch, except for it having made the
point obvious to me, as it continues to mix the two mental models.

I don't see much of a difference security wise between loading the DSO
at extension creation time vs. loading it at every backend start. More
details below.

My major gripe with the current way extensions are handled is that SQL
scripts are treated as a template, whereas the DSO is only pointed
to. Changing the SQL scripts in your extension directory has no effect
to the installed extension. Modifying the DSO file certainly has. That's
the inconsistency that I'd like to get rid of.


A security analysis of the proposal follows.

Granted, the internalization of the DSO is somewhat critical to
implement. Running as a non-root user, Postgres has less power than root
and certainly cannot protect the internalized DSO from modification by
root. It can, however, store the DSO well protected from other users
(e.g. in a directory within $PGDATA).

Relying on the external DSO only exactly once can provide additional
safety. Consider an extensions directory that's accidentally
world-writable. As it stands, an attacker can modify the DSO and gain
control as soon as a new connection loads it. With my proposal, the
attacker would have to wait until CREATE EXTENSION. A point in time
where the newly created extension is more likely to be tested and
cross-checked.

I'm arguing both of these issues are very minor and negligible, security
wise. Barring other issues, I conclude there's no security impact from this
proposal.

Arguably, internalizing the DSO (together with the SQL) protects way
better from accidental modifications (i.e. removing the DSO by
un-installing the extension template via the distribution's packaging
system while some database still has the extension instantiated). That
clearly is a usability and not a security issue, though.

If nothing else (and beneficial in either case): Postgres should
probably check the permissions on the extension directory and at least
emit a warning, if it's world-writable. Or refuse to create (or even
load) an extension, right away.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-15 Thread Markus Wanner
On 07/15/2013 12:05 PM, Andres Freund wrote:
 A superuser can execute native code as postgres user. That's it.

Oh, I thought Robert meant postgres users, i.e. non-superusers.

I hereby withdraw my pitchforks: I'm certainly not opposing to simplify
the life of superusers, who already have the power.

 Now, I don't think Markus proposal is a good idea on *other* grounds
 though... but that's for another email.

Separation of concerns, huh? Good thing, thanks :-)

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-15 Thread Markus Wanner
On 07/15/2013 12:19 PM, Andres Freund wrote:
 On 2013-07-15 12:12:36 +0200, Markus Wanner wrote:
 Granted, the internalization of the DSO is somewhat critical to
 implement. Running as a non-root user, Postgres has less power than that
 and can certainly not protect the internalized DSO from modification by
 root. It can, however, store the DSO well protected from other users
 (e.g. in a directory within $PGDATA).
 
 There's heaps of problems with that though. Think noexec mounts or selinux.

Good point.

Note, however, that internalize doesn't necessarily mean exactly that
approach. It could be yet another directory, even system wide, which any
distribution should well be able to prepare.

I intentionally left the internalizing somewhat undefined. It could
even be more equivalent to what is done with the SQL stuff, i.e. some
system catalog. But that just poses other implementation issues.
(Loading a DSO from memory doesn't sound very portable to me.)

 Relying on the external DSO only exactly once can provide additional
 safety.
 
 Not necessarily. Think of a distribution provided upgrade with a
 security fix in an extension.

Ugh.. only to figure out the patched DSO is incompatible to the old
version of the SQL scripts? And therefore having to re-create the
extension, anyways? That only highlights why this difference in
treatment of SQL and DSO is troublesome.

 On a machine with dozens of clusters. Now
 you need to make sure each and every one of them has the new .so.

Agreed.

So this sounds like the other approach to unification may be more
useful: linking the SQL scripts better and making them (re-)load from the
extensions directory, just like the DSOs.

 I think protecting against the case where such directories are writable
 is a rather bad argument.

I agree. That's why I'm neglecting the security implications and stated
there are none.

Regards

Markus Wanner




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-15 Thread Markus Wanner
Robert,

On 07/15/2013 12:12 PM, Markus Wanner wrote:
 On 07/15/2013 05:49 AM, Robert Haas wrote (in a slightly different order):
 There is a lot of
 (well-founded, IMHO) resistance to the idea of letting users install C
 code via an SQL connection.
 
 Nowhere did I propose anything close to that. I'm in full support of the
 resistance. Pitchforks and torches ready to rumble. :-)

To clarify: In the above agreement, I had the Postgres level ordinary
users in mind, who certainly shouldn't ever be able to upload and run
arbitrary native code.

I'm not quite as enthusiastic if you meant the system level user that
happens to have superuser rights on Postgres. As Andres pointed out,
there are some more holes to plug, if you want to lock down a superuser
account. [1]

I'm not quite clear what kind of user you meant in your statement above.

Regards

Markus Wanner


[1]: I see the theoretical benefits of providing yet another layer of
security. Closing the existing holes would require restricting LOAD, but
that seems to suffice (given the sysadmin makes sure he doesn't
accidentally provide any of the harmful extensions).

How about the (optional?) ability to restrict LOAD to directories and
files that are only root writable? Is that enough for a root to disallow
the untrusted Postgres superuser to run arbitrary native code? Or does
that still have other holes? How many real use cases does it break? How
many does it fix?




Re: [HACKERS] Review: extension template

2013-07-10 Thread Markus Wanner
Peter,

On 07/09/2013 11:04 PM, Peter Eisentraut wrote:
 I think there is an intrinsic conflict here.  You have things inside the
 database and outside.  When they depend on each other, it gets tricky.
 Extensions were invented to cope with that.  They do the job, more or
 less.

I agree. And to extend upon that, I think it's important to distinguish
between the created extension and the available one, i.e. the template.
Only the template lives outside. The created extension itself is firmly
sitting in the database, possibly with multiple dependencies from other
objects. It does not depend on anything outside of the database
(assuming the absence of a DSO of the extension, which does not follow
that template concept).

And yes, we decided the objects that are part of the extension should
not get dumped with pg_dump. Nobody argues to change that. Note,
however, that this very decision is what raises the intrinsic conflict
for pg_restore, because CREATE EXTENSION in the dump depends on the
outside extension. If anything, Dimitri's patch solves that.

 Now you want to take the same mechanism and apply it entirely
 inside the database.  But that wasn't the point of extensions!  That's
 how you get definitional issues like, should extensions be dumped or not.

IMO the point of extensions is to extend Postgres (with code that's not
part of core). Whether their templates (SQL sources, if you want) are
stored on the file system (outside) or within the database is irrelevant
to the concept.

Think of it that way: Take one of those FUSE-Postgres-FS things [1],
which uses Postgres as the backend for a file system and allows you to
store arbitrary files. Mount that to the extensions directory of your
Postgres instance and make your extension templates available there
(i.e. copy them there). CREATE EXTENSION would just work, reading the
templates for the extension to create from itself, via that fuse
wrapper. (If the FUSE wrapper itself was using an extension, you'd get
into an interesting chicken and egg problem, but even that would be
resolvable, because the installed extension doesn't depend on the
template it was created from.)

Think of Dimitri's patch as a simpler and more elegant way to achieve
the very same thing. (Well, modulo our disagreement about the dependency
between extension and templates.) And as opposed to the file system or
fuse approach, you'd even gain transactional safety, consistency (i.e. a
constraint can enforce a full version exists as the basis for an upgrade
script), etc... But who am I to tell you the benefits of storing data in
a database?

Of course, you then also want to be able to backup your templates (not
the extensions) stored in the database. Just like you keep a backup of
your file-system templates. Either by simply making a copy, or maybe by
keeping an RPM or DEB package of it available. Thus, of course,
templates for extensions need to be dumped as well.

 I don't believe the above use case.  (Even if I did, it's marginal.)
 You should always be able to arrange things so that an upgrade of an
 inside-the-database-package is possible before or after pg_restore.

Dimitri's scenario assumes an old and a new version of an extension as
well as an old and a new Postgres major version. Where the old extension
is not compatible with the new Postgres major version. Which certainly
seems like a plausible scenario to me (postgis-2.0 is not compatible
with Postgres-9.3, for example - granted, it carries a DSO, so it's not
really a good example).

Given how extensions work, to upgrade to the new Postgres major version
*and* to the required new version of the extension, you don't ever need
to upgrade an inside-the-database-package. Instead, you need to:

createdb -> provide templates -> CREATE EXTENSION -> restore data

Now, CREATE EXTENSION and restoring your data has effectively been
merged for the user, as pg_dump emits proper CREATE EXTENSION commands.
Providing templates so far meant installing an RPM or DEB. Or copying
the files manually.

But in fact, how and where you provide templates for the extension is
irrelevant to that order. And the possibility to merge the second step
into the 'restore data' step certainly sounds appealing to me.

 pg_dump and pg_restore are interfaces between the database and the
 outside. They should have nothing to do with upgrading things that live
 entirely inside the database.

I don't get your point here. In my view, libpq is intended to modify the
things that live inside the database, including extensions (i.e. ALTER
EXTENSION ADD FUNCTION). Or what kind of things that live entirely
inside the database do you have in mind?

 There would be value to inside-the-database package management, but it
 should be a separate concept.

Anything that's incompatible with extensions is not gonna fly. There are
too many of them available, already. We need to ease management of
those, not come up with yet another concept.

Regards

Markus Wanner


[1]: for example

Re: [HACKERS] Millisecond-precision connect_timeout for libpq

2013-07-09 Thread Markus Wanner
On 07/08/2013 08:31 PM, ivan babrou wrote:
 Seriously, I don't get why running 150 poolers is easier.

Did you consider running pgbouncer on the database servers?

Regards

Markus Wanner




Re: [HACKERS] Millisecond-precision connect_timeout for libpq

2013-07-09 Thread Markus Wanner
On 07/09/2013 09:15 AM, ivan babrou wrote:
 Database server lost network — boom, 2 seconds delay. What's the point then?

Oh, I see. Good point. It could still improve connection time during
normal operation, though.

None the less, I now agree with you: we recommend a pooler, which may be
capable of millisecond timeouts, but arguably is vastly more complex
than the proposed patch. And it even brings its own set of gotchas (lots
of connections). I guess I don't quite buy the complexity argument, yet.

Sure, gettimeofday() is subject to clock adjustments. But so is time().
And if you're setting timeouts that low, you probably know what you're
doing (or at least care about latency a lot). Or is gettimeofday() still
considerably slower on certain architectures or in certain scenarios?
Where's the complexity?

Regards

Markus Wanner




Re: [HACKERS] Millisecond-precision connect_timeout for libpq

2013-07-09 Thread Markus Wanner
Ivan,

On 07/05/2013 07:28 PM, ivan babrou wrote:
 - /*
 -  * Rounding could cause connection to fail; need at least 2 secs
 -  */

You removed this above comment... please check why it's there. The
relevant revision seems to be:

###
commit 2908a838ac2cf8cdccaa115249f8399eef8a731e
Author: Tom Lane <t...@sss.pgh.pa.us>
Date:   Thu Oct 24 23:35:55 2002 +

Code review for connection timeout patch.  Avoid unportable assumption
that tv_sec is signed; return a useful error message on timeout failure;
honor PGCONNECT_TIMEOUT environment variable in PQsetdbLogin; make code
obey documentation statement that timeout=0 means no timeout.
###

 - if (timeout < 2)
 - timeout = 2;
 - /* calculate the finish time based on start + timeout */
 - finish_time = time(NULL) + timeout;
 + gettimeofday(&finish_time, NULL);
 + finish_time.tv_usec += (int) timeout_usec;

I vaguely recall tv_usec only being required to hold values up to
1000000 by some standard. A signed 32 bit value would qualify, but only
hold up to a good half hour worth of microseconds. That doesn't quite
seem enough to calculate finish_time the way you are proposing to do it.

 + finish_time.tv_sec  += finish_time.tv_usec / 1000000;
 + finish_time.tv_usec  = finish_time.tv_usec % 1000000;
   }
   }
  
 @@ -1073,15 +1074,15 @@ pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time)
   input_fd.events |= POLLOUT;
  
   /* Compute appropriate timeout interval */
 - if (end_time == ((time_t) -1))
 + if (end_time == NULL)
   timeout_ms = -1;
   else
   {
 - time_t  now = time(NULL);
 + struct timeval now;
 + gettimeofday(&now, NULL);
  
 - if (end_time > now)
 - timeout_ms = (end_time - now) * 1000;
 - else
 + timeout_ms = (end_time->tv_sec - now.tv_sec) * 1000 + (end_time->tv_usec - now.tv_usec) / 1000;

I think that's incorrect on a platform where tv_sec and/or tv_usec is
unsigned. (And the cited commit above indicates there are such platforms.)
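
For what it's worth, a sketch of how that arithmetic could be done safely
(illustration only, not the patch): keep everything in signed 64-bit
microseconds, so neither the tv_usec range nor unsigned tv_sec/tv_usec on
some platforms bites:

  #include <stdint.h>
  #include <sys/time.h>

  /* start from "now" and add the timeout, normalizing tv_usec to < 1000000 */
  static void
  compute_finish_time(struct timeval *finish, int64_t timeout_usec)
  {
      gettimeofday(finish, NULL);
      timeout_usec += finish->tv_usec;            /* may exceed 1000000 */
      finish->tv_sec += timeout_usec / 1000000;
      finish->tv_usec = timeout_usec % 1000000;
  }

  /* milliseconds left until the finish time, clamped at zero */
  static int
  remaining_timeout_ms(const struct timeval *finish)
  {
      struct timeval now;
      int64_t diff_usec;

      gettimeofday(&now, NULL);
      diff_usec = ((int64_t) finish->tv_sec - (int64_t) now.tv_sec) * 1000000
                + ((int64_t) finish->tv_usec - (int64_t) now.tv_usec);
      return (diff_usec > 0) ? (int) (diff_usec / 1000) : 0;
  }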


On 07/09/2013 02:25 PM, ivan babrou wrote:
 There's no complexity here :)

Not so fast, cowboy...  :-)

Regards

Markus Wanner




Re: [HACKERS] Review: extension template

2013-07-09 Thread Markus Wanner
Dimitri,

leaving the template vs. link model aside for a moment, here are some
other issues I ran into. All under the assumption that we want the link
model.

On 07/08/2013 11:49 AM, Dimitri Fontaine wrote:
 Please find attached to this mail version 9 of the Extension Templates
 patch with fixes for the review up to now.

First of all, I figured that creation of a template of a newer version
is prohibited in case an extension exists:

 db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $foo$ SELECT 1; 
 $foo$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# CREATE EXTENSION foo;
 CREATE EXTENSION
 db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.1' AS $foo$ SELECT 2; 
 $foo$;
 ERROR:  extension foo already exists

Why is that?

I then came to think of the upgrade scripts... what do we link against
if an extension has been created from some full version and then one or
more upgrade scripts got applied?

Currently, creation of additional upgrade scripts is not blocked. Which
is a good thing, IMO. I don't like the block on newer full versions.
However, the upgrade doesn't seem to change the dependency, so you can
still delete the update script after the update. Consider this:

 db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $$ $$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# CREATE EXTENSION foo;
 CREATE EXTENSION
 db1=# CREATE TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1' AS $$ $$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# ALTER EXTENSION foo UPDATE TO '0.1';
 ALTER EXTENSION
 db1=# SELECT * FROM pg_extension;
  extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
 ---------+----------+--------------+----------------+------------+-----------+--------------
  plpgsql |       10 |           11 | f              | 1.0        |           |
  foo     |       10 |         2200 | f              | 0.1        |           |
 (2 rows)
 
 db1=# DROP TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1';
 DROP TEMPLATE FOR EXTENSION
 db1=# SELECT * FROM pg_extension;
  extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition
 ---------+----------+--------------+----------------+------------+-----------+--------------
  plpgsql |       10 |           11 | f              | 1.0        |           |
  foo     |       10 |         2200 | f              | 0.1        |           |
 (2 rows)
 

In this state, extension foo as of version '0.1' is installed, but
running this through dump & restore, you'll only get back '0.0'.

Interestingly, the following works (in the sense that the DROP of the
upgrade script is prohibited):

 db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $$ $$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# CREATE TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1' AS $$ $$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# CREATE EXTENSION foo VERSION '0.1';
 CREATE EXTENSION
 db1=# DROP TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1';
 ERROR:  cannot drop update template for extension foo because other objects 
 depend on it
 DETAIL:  extension foo depends on control template for extension foo
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.

However, in that case, you are free to drop the full version:

 db1=# DROP TEMPLATE FOR EXTENSION foo VERSION '0.0';
 DROP TEMPLATE FOR EXTENSION

This certainly creates a bad state that leads to an error, when run
through dump & restore.


Maybe this one...

 db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $$ $$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# CREATE TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1' AS $$ $$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# DROP TEMPLATE FOR EXTENSION foo VERSION '0.0';
 DROP TEMPLATE FOR EXTENSION

... should already err out here ...

 db1=# CREATE EXTENSION foo;
 ERROR:  Extension foo is not available from 
 /tmp/pginst/usr/local/pgsql/share/extension nor as a template
 db1=# CREATE EXTENSION foo VERSION '0.1';
 ERROR:  Extension foo is not available from 
 /tmp/pginst/usr/local/pgsql/share/extension nor as a template

... and not only here.

I.e. the TO version should probably have a dependency on the FROM
version (that might even be useful in the template model).


Another thing that surprised me is the inability to have an upgrade
script *and* a full version (for the same extension target version).
Even if that's intended behavior, the error message could be improved:

 db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.0' AS $$ $$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# CREATE TEMPLATE FOR EXTENSION foo FROM '0.0' TO '0.1' AS $$ $$;
 CREATE TEMPLATE FOR EXTENSION
 db1=# CREATE TEMPLATE FOR EXTENSION foo VERSION '0.1' AS $$ $$;
 ERROR:  duplicate key value violates unique constraint 
 pg_extension_control_name_version_index
 DETAIL:  Key (ctlname, ctlversion)=(foo, 0.1) already exists.


Regards

Markus Wanner



Re: [HACKERS] Review: extension template

2013-07-09 Thread Markus Wanner
Salut Dimitri,

On 07/09/2013 12:40 PM, Dimitri Fontaine wrote:
 Markus Wanner <mar...@bluegap.ch> writes:
 Or how do you think would pg_restore fail, if you followed the mental
 model of the template?
 
   # create template for extension foo version 'x' as '';
   # create extension foo;
   # alter template for extension foo rename to bar;
 
   $ pg_dump | psql
 
 And now it's impossible to CREATE EXTENSION foo, because there's no
 source to install it from available. I think we should actively prevent
 that scenario to happen in the field (current patch prevents it).

I see. You value that property a lot.

However, I don't think the plain ability to create an extension is quite
enough to ensure a consistent restore. You'd also have to ensure
you recreate the very *same* contents of the extension that you dumped.

Your patch doesn't do that. It seems to stop enforcing consistency at
some arbitrary point in between. For example, you can ALTER a function
that's part of the extension. Or ALTER TEMPLATE FOR EXTENSION while an
extension of that version is installed.

 Usually what you do when you want to change an extension is that you
 provide an upgrade script then run it with ALTER EXTENSION UPDATE.

As a developer, I often happen to work on one and the same version, but
test multiple modifications. This ability to change an extension
on-the-fly seems pretty valuable to me.

 The current file system based extensions allow you to maintain
 separately the files foo--1.0.sql and foo--1.2.sql, and you don't need
 to copy a current version of the whole extension away before hacking the
 new version.

 The Extension Template facility in the current patch allows you to do
 much the same

Sure. I'm aware of that ability and appreciate it.

 I see that you've spent extra time and effort to better understand any
 community consensus that might exist around this patch series, and I
 want to say thank you for that!

Thank you for your patience in pursuing this and improving user
experience with extensions. I really appreciate this work.

 Basically what I'm saying in this too long email is that I need other
 contributors to voice-in.

I fully agree. I don't want to hold this patch back any further, if it's
just me.

 I really recommend to rename the feature (and the
 commands), in that case, though. We may rather call the existing
 file-system thingie an extension template, instead, as it becomes a
 good differentiator to what you're proposing.
 
 Any proposals?

Inline extension is a bit contradictory. Maybe managed extension
describes best what you're aiming for.

In a similar vein, out-of-line extension sounds redundant to me, so
I'd rather characterize the file-system thingie as extension templates
wherever a clear distinction between the two is needed.

 How about ALTER EXTENSION ADD (or DROP)? With the link on the
 template, you'd have to prohibit that ALTER as well, based on the
 exact same grounds as the RENAME: The installed extension would
 otherwise differ from the template it is linked to.
 
 You're supposed to be using that from within the template scripts
 themselves. The main use case is the upgrade scripts from unpackaged.

Agreed. The documentation says it's mainly useful in extension update
scripts.

 I could see foreclosing that danger by enforcing that creating_extension
 is true in those commands.

For managed extensions only, right? It's not been prohibited for
extensions so far.

 See how this creates an animal pretty different from the current
 extensions? And IMO something that's needlessly restricted in many ways.
 
 Well really I'm not convinced. If you use ALTER EXTENSION ADD against an
 extension that you did install from the file system, then you don't know
 what will happen after a dump and restore cycle, because you didn't
 alter the files to match what you did, presumably.

Yeah. The user learned to work according to the template model. Maybe
that was not the best model to start with. And I certainly understand
your desire to ensure a consistent dump & restore cycle. However...

 If you do the same thing against an extension that you did install from
 a catalog template, you just managed to open yourself to the same
 hazards… 

... I think the current patch basically just moves the potential
hazards. Maybe these moved hazards are less dramatic and justify the
move. Let's recap:

In either case (template model & link model) the patch:

 a) guarantees to restore the template scripts and settings of
all managed extensions

With the link model (and v9 of your patch, which includes the RENAME
fix, and pretending there are no other bugs):

 b) it guarantees *some* revision of an extension version (identified
by name and version) that has been created at dump time can be
restored from a template from the dump

If you'd additionally restrict ALTER EXTENSION ADD/DROP as well as ALTER
TEMPLATE FOR EXTENSION AS ..., you'd also get:

 c) it guarantees the set of objects created

Re: [HACKERS] Review: extension template

2013-07-08 Thread Markus Wanner
On 06/10/2013 09:43 PM, Hannu Krosing wrote:
 On 07/08/2013 09:26 AM, Heikki Linnakangas wrote:
 The concept was useful, but not something we want
 to call an extension, because the distinguishing feature of an
 extension is that it lives outside the database and is *not* included
 in pg_dump.
 Either MODULE or PACKAGE would be good name candidates.
 
 Still, getting this functionality in seems more important than exact
 naming, though naming them right would be nice.

Remember that we already have quite a lot of extensions. And PGXN. Are
we really so wedded to the idea of extensions living outside of the
database that we need to come up with something different and incompatible?

Or do you envision modules or packages to be compatible with extensions?
Just putting another label on it so we can still claim extensions are
strictly external to the database? Sorry, I don't get the idea, there.

From a user's perspective, I want extensions, modules, or packages to be
managed somehow. Including upgrades, migrations (i.e. dump & restore)
and removal. The approach of letting the distributors handle that
packaging clearly has its limitations. What's so terribly wrong with
Postgres itself providing better tools to manage those?

Inventing yet another type of extension, module or package (compatible
or not) doesn't help, but increases confusion even further. Or how do
you explain to an author of an existing extension, whether or not he
should convert his extension to a module (if you want those to be
incompatible)?

If it's the same thing, just with different loading mechanisms, please
keep calling it the same: an extension. (And maintain compatibility
between the different ways to load it.)

I fully agree with the fundamental direction of Dimitri's patch. I think
Postgres needs to better manage its extensions itself. Including dump
and restore cycles. However, I think the implementation isn't optimal,
yet. I pointed out a few usability issues and gave reasons why
"template" is a misnomer (with the proposed implementation). "Extension"
is not.

(I still think "template" would be a good mental model. See my other
thread...
http://archives.postgresql.org/message-id/51d72c1d.7010...@bluegap.ch)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-08 Thread Markus Wanner
On 07/08/2013 10:20 AM, Dimitri Fontaine wrote:
 Bypassing the file system entirely in order to install an extension. As
 soon as I figure out how to, including C-coded extensions.

Do I understand correctly that you want to keep the extensions (or their
templates) out of the dump and require the user to upload it via libpq
prior to the restore, instead of him having to install them via .deb or .rpm?

This would explain why you keep the CREATE TEMPLATE FOR EXTENSION as a
separate step from CREATE EXTENSION. And why you, too, insist on wanting
templates, and not just a way to create an extension via libpq.

However, why don't you follow the template model more closely? Why
should the user be unable to create a template, if there already exists
an extension of the same name? That's an unneeded and disturbing
limitation, IMO.

My wish: Please drop the pg_depend link between template and extension
and make the templates shared across databases. So I also have to
install the template only once per cluster. Keep calling them templates,
then. (However, mind that file-system extension templates are templates
as well. In-line vs. out-of-line templates, if you want.)

I think you could then safely allow an upgrade of an extension that has
been created from an out-of-line template by an upgrade script that
lives in-line. And vice-versa. Just as an example. It all gets nicer and
cleaner, if the in-line thing better matches the out-of-line one, IMO.

An extension should look and behave exactly the same, independent of
what kind of template it has been created from. And as we obviously
cannot add a pg_depend link to a file on the file system, we better
don't do that for the in-line variant, either, to maintain the symmetry.

 The main feature of the extensions system is its ability to have a clean
 pg_restore process even when you use some extensions. That has been the
 only goal of the whole feature development.

Great! Very much appreciated.

 Let me stress that the most important value in that behavior is to be
 able to pg_restore using a newer version of the extension, the one that
 works with the target major version. When upgrading from 9.2 to 9.3 if
 you depend on keywords that now are reserved you need to install the
 newer version of the extension at pg_restore time.
 
 The main features I'm interested into beside a clean pg_restore are
 UPDATE scripts for extensions and dependency management, even if that
 still needs improvements. Those improvements will be relevant for both
 ways to make extensions available for your system.

+1

 We can not use the name module for anything else, IMNSHO.

Agreed.

 The main goal here is not to have the extension live inside the database
 but rather to be able to bypass using the server's filesystem in order
 to be able to CREATE EXTENSION foo; and then to still have pg_restore do
 the right thing on its own.

Note that with the current, out-of-line approach, the *extension*
already lives inside the database. It's just the *template*, that
doesn't. (Modulo DSO, but the patch doesn't handle those either, yet. So
we're still kind of excluding those.)

Allowing for templates to live inside the database as well is a good
thing, IMO.

 If you want to scratch the new catalogs part, then just say that it's
 expected to be really complex to pg_restore a database using extensions,
 back to exactly how it was before 9.1: create the new database, create
 the extensions your dump depends on in that new database, now pg_restore
 your backup manually filtering away the extensions' objects or ignoring
 the errors when pg_restore tries to duplicate functions you already
 installed in the previous step. No fun.

Definitely not. Nobody wants to go back there. (And as Heikki pointed
out, if you absolutely want to, you can even punish yourself that way.)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-08 Thread Markus Wanner
Hello Dimitri,

On 07/08/2013 11:49 AM, Dimitri Fontaine wrote:
 Please find attached to this mail version 9 of the Extension Templates
 patch with fixes for the review up to now.

Thanks, cool.

 Markus Wanner mar...@bluegap.ch writes:
 I still think that we shouldn't allow creating a template for an
 extension that is already installed, though. Do you have any suggestions
 for a better error message?

 If we go for the template model, I beg to differ. In that mind-set, you
 should be free to create (or delete) any kind of template without
 affecting pre-existing extensions.
 
 Then what happens at pg_restore time? the CREATE EXTENSION in the backup
 script will suddenly install the other extension's that happen to have
 the same name? I think erroring out is the only safe way here.

Extensions are commonly identified by name (installed ones as well as
available ones, i.e. templates).

Thus I think if a user renames a template, he might have good reasons to
do so. He likely *wants* it to be a template for a different extension.
Likewise when (re-)creating a template with the very same name of a
pre-existing, installed extension.

Maybe the user just wanted to make a backup of the template prior to
modifying it. If he then gives the new template the same name as the old
one had, it very likely is similar, compatible or otherwise intended to
replace the former one.

The file-system templates work exactly that way (modulo DSOs). If you
create an extension, then modify (or rename and re-create under the same
name) the template on disk, then dump and restore, you end up with the
new version of it. That's how it worked so far. It's simple to
understand and use.

 It's all about pg_restore really, but it's easy to forget about that and
 get started into other views of the world. I'll try to be better at not
 following those tracks and just hammer my answers with pg_restore now.

That's unlikely to be of much help. It's not like pg_restore would stop
to work. It would just work differently. More like the file-system
templates. More like the users already knows and (likely) expects it.
And more like the template model that you advertise.

Or how do you think would pg_restore fail, if you followed the mental
model of the template?

 In any case, I'm arguing this template renaming behavior (and the
 subsequent error) are the wrong thing to do, anyways. Because an
 extension being linked to a parent of a different name is weird and IMO
 not an acceptable state.
 
 Yes, you're right on spot here. So I've amended the patch to implement
 the following behavior (and have added a new regression test):
 
   # CREATE TEMPLATE FOR EXTENSION foo VERSION 'v' AS '';
   # CREATE EXTENSION foo;
   # ALTER TEMPLATE FOR EXTENSION foo RENAME TO bar;
   ERROR:  55006: template for extension foo is in use
   DETAIL:  extension foo already exists
   LOCATION:  AlterExtensionTemplateRename, template.c:1040
   STATEMENT:  ALTER TEMPLATE FOR EXTENSION foo RENAME TO bar;

Okay, good, this prevents the strange state.

However, this also means you restrict the user even further... How can
he save a copy of an existing template, before (re-)creating it with
CREATE TEMPLATE FOR EXTENSION?

On the file system, a simple cp or mv is sufficient before
(re)installing the package from your distribution, for example.

 I bet that's because people have different mental models in mind. But I
 probably stressed that point enough by now...
 
 FWIW, I do agree.

Good. Why do you continue to propose the link model?

 But my understanding is that the community consensus
 is not going that way.

What way? And what community consensus?

Re-reading some of the past discussions, I didn't find anybody voting
for a dependency between the template and the created extension. And at
least Tom pretty clearly had the template model in mind, when he wrote
[1]: "We don't want it to look like manipulating a template has anything
to do with altering an extension of the same name (which might or might
not even be installed)." or [2]: "But conflating this functionality
[i.e. extension templates] with installed extensions is just going to
create headaches."

The closest I found was Robert Haas mentioning [3], that [he doesn't]
see a problem having more than one kind of extensions. However, please
mind the context. He doesn't really sound enthusiastic, either.

I'm puzzled about some of your words in that thread. In the very message
Robert responded to, you wrote [4]: "Having more than one way to ship an
extension is good, having two different animals with two different
incompatible behaviors named the same thing is bad."

With the link-model, you are now proposing to create exactly that. Two
different kinds of extensions that are not compatible with each other.
One that is independent and one that depends on the template it got
created from.

 Specifically, I request to either follow the template model more closely
 (accompanied by a separate patch to adjust binary, out-of-line

Re: [HACKERS] Review: extension template

2013-07-07 Thread Markus Wanner
 a fair remark that we want to get there eventually, and
 that like the superuser only limitation, that's something for a future
 patch and not this one.

I agree to defer a specific implementation. I disagree in that we should
*not* commit something that follows a mixture of mental models
underneath and sets in stone a strange API we later need to be
compatible with.

Specifically, I request to either follow the template model more closely
(accompanied by a separate patch to adjust binary, out-of-line
templates) or follow the link model more closely. The current naming
doesn't match any of the two, so renaming seems inevitable.

 Yeah, we might want to find some better naming for the on-file extension
 templates. We can't just call them extension templates though because
 that is the new beast implemented in that patch and it needs to be part
 of your pg_dump scripts, while the main feature of the file based kind
 of templates is that they are *NOT* part of your dump.

That's one issue where the mixture of concepts shines through, yeah. I
fear simply (re)naming things is not quite enough, at this point.
Especially because the thing on the file-system is closer to being a
template than the system catalog based variant.

[ Maybe we could even go with template extensions being the
file-system ones vs. inline extensions being the system catalog
ones...  In that case, they would follow different mental models. Which
might even be a reasonable solution. However, we need to be aware of
that difference and document it properly, so that users have a chance to
grasp the nifty difference. I'd rather prefer to eliminate such nifty
differences, though. ]

 It's just something I forgot to cleanup when completing the feature set.
 Cleaned up in my git branch.

Great!

 src/backend/commands/event_trigger.c, definition of
 event_trigger_support: several unnecessary whitespaces added. These make
 it hard(er) than necessary to review the patch.
 
 Here with the following command I see no problem, please advise:
 
 git diff --check --color-words postgres/master.. -- 
 src/backend/commands/event_trigger.c

<rant>That's why I personally don't trust git.</rant> Please look at the
patch you submitted.

 Markus Wanner mar...@bluegap.ch writes:
 Dimitri, do you agree to resolve with returned with feedback for this
 CF? Or how would you like to proceed?
 
 I'd like to proceed, it's the 3rd development cycle that I'm working on
 this idea

Okay, I'm certainly not opposed to that and welcome further development.
I didn't change the status, so it should still be waiting on author,
which seems reasonable to me, ATM.

 (used to be called Inline Extensions, I also had a selective
 pg_dump patch to allow dumping some extension scripts, etc). I really
 wish we would be able to sort it out completely in this very CF and
 that's why I'm not proposing any other patch this time.

I understand it's tedious work, sometimes. Please look at this as
incremental progress. I would not have been able to reason about mental
models without your patch. Thanks for that, it was a necessary and good
step from my point of view. Please keep up the good work!

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-07 Thread Markus Wanner
On 07/06/2013 10:30 PM, Dimitri Fontaine wrote:
 I still think that we shouldn't allow creating a template for an
 extension that is already installed, though. Do you have any suggestions
 for a better error message?

Oh, I just realize that pg_extension_{template,control,uptmpl} are not
SHARED catalogs, but you need to install the template per-database and
then need to enable it - per-database *again*. Why is that?

If you want to just upload extensions to a database via libpq, that
should be a single step (maybe rather just CREATE EXTENSION ... AS ...)
If you want templates, those should be applicable to all databases, no?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-07 Thread Markus Wanner
On 07/07/2013 02:55 PM, Markus Wanner wrote:
 If you want to just upload extensions to a database via libpq..

Let's rephrase this in a (hopefully) more constructive way: I get the
impression you are trying to satisfy many different needs. Way more than
you need to scratch your own itch. To the point that I had trouble
figuring out what exactly the goal of your patch is.

My advice would be: Be brave! Dare to put down any request for
templates (including mine) and go for the simplest possible
implementation that allows you to create extensions via libpq. (Provided
that really is the itch you want to scratch. I'm still not quite sure I
got that right.)

As it stands, I get the impression the patch is trying to sell me a
feature that it doesn't really provide. If you stripped the template
stuff, including the CREATE TEMPLATE FOR EXTENSION command, but just
sold me this patch as a simple way to create an extension via libpq,
i.e. something closer to CREATE EXTENSION AS .. , I would immediately
buy that. Currently, while allowing an upload, it seems far from simple,
but adds quite a bit of unwanted complexity. If all I want is to upload
code for an extension via libpq, I don't want to deal with nifty
distinctions between templates and extensions.

Just my opinion, though. Maybe I'm still missing something.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-07 Thread Markus Wanner
Hello Dimitri,

On 07/07/2013 09:51 PM, Dimitri Fontaine wrote:
 Markus Wanner mar...@bluegap.ch writes:
 Oh, I just realize that pg_extension_{template,control,uptmpl} are not
 SHARED catalogs, but you need to install the template per-database and
 then need to enable it - per-database *again*. Why is that?
 
 Because the current model is not serving us well enough, with a single
 module version per major version of PostgreSQL. Meaning for all the
 clusters on the server, and all the databases in them.

That's not an excuse for letting the user duplicate the effort of
installing templates for every version of his extension in every database.

 We want to be able to have postgis 1.5 and 2.x installed in two
 different databases in the same cluster, don't we?

The extension, yes. The template versions, no. There's utterly no point
in having different 2.0 versions of the same extension in different
databases.

 Well the current patch we still can't because of the dynamically shared
 object side of things, but that's not a good reason to impose the same
 limitation on to the template idea.

Without a dynamically shared object, you can well have different
versions of an extension at work in different databases already.

 After playing around with several ideas around that in the past two
 development cycles, the community consensus clearly is that extensions
 are *NEVER* going to be part of your dump scripts.

Sounds strange to me. If you want Postgres to manage extensions, it
needs the ability to backup and restore them. Otherwise, it doesn't
really ... well ... manage them.

 Now when using templates you have no other source to install the
 extensions from at pg_restore time, given what I just said.
 
 The whole goal of the template idea is to offer a way to dump and
 restore the data you need for CREATE EXTENSION to just work at restore
 time, even when you sent the data over the wire.

Which in turn violates the above cited community consensus, then. You
lost me here. What's your point? I thought the goal of your patch was
the ability to upload an extension via libpq.

How does that address my concerns about usability and understandability
of how these things work? Why the strange dependencies between templates
and extensions? Or the ability to rename the template, but not the
extension - while still having the later depend on the former?

These things are what I'm opposing to. And I don't believe it
necessarily needs to be exactly that way for dump and restore to work.
Quite the opposite, in fact. Simpler design usually means simpler backup
and restore procedures.

 Current extension are managed on the file system, the contract is that
 it is the user's job to maintain and ship that, externally to PostgreSQL
 responsibilities. All that PostgreSQL knows about is to issue the CREATE
 EXTENSION command at pg_restore time.
 
 With templates or in-line extensions, the contract is that the user asks
 PostgreSQL to manage its extensions in the same way it does for the
 other objects on the system.

Understood.

 The design we found to address that is
 called Extension Templates and is implemented in the current patch.

I placed my concerns with the proposed implementation. It's certainly
not the only way how Postgres can manage its extensions. And I still
hope we can come up with something that's simpler to use and easier to
understand.

However, I'm not a committer nor have I written code for this. I did my
review and proposed two possible (opposite) directions for clean up and
simplification of the design. I would now like others to chime in.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Review: extension template

2013-07-05 Thread Markus Wanner
 be created
alongside a pre-existing system catalog installation. Postgres has no
way to prevent that (other than ignoring the files, maybe, but...). It
seems the file system variant takes precedence over anything
pre-existing in the system catalogs. This should be documented.

The documentation for ALTER TEMPLATE FOR EXTENSION says: "Currently, the
only supported functionality is to change the template's default
version." I don't understand what that's supposed to mean. You can
perfectly well rename and alter the template's code.

I didn't look too deep into the code, but it seems Jaime and Hitoshi
raised some valid points.

Assorted very minor nit-picks:

In catalog/objectaddress.c, get_object_address_unqualified(): the case
OBJECT_TABLESPACE is being moved down. That looks like an
unintended change. Please revert.

src/backend/commands/event_trigger.c, definition of
event_trigger_support: several unnecessary whitespaces added. These make
it hard(er) than necessary to review the patch.

Regards

Markus Wanner


[1]: CF: Extension Templates
https://commitfest.postgresql.org/action/patch_view?id=1032

[2]: Dimitri's github branch 'tmpl4':
https://github.com/dimitri/postgres/tree/tmpl4


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Proposal: template-ify (binary) extensions

2013-07-05 Thread Markus Wanner
Hi,

let me elaborate on an idea I had to streamline extension templates. As
I wrote in my recent review [1], I didn't have the mental model of a
template in mind for extensions, so far.

However, writing the review, I came to think of it. I certainly agree
that extensions which do not carry a binary executable can be considered
templates. The SQL scripts can be deleted after creation of the
extension in the database, because the objects created with them are an
instantiation in the database itself. They do not depend on the template.

That's not the case for compiled, executable code that usually lives in
a shared library on the filesystem. There, the instantiation of the
template merely consists of a link to the library on the file-system.
If you remove that file, you break the extension.

One way to resolve that - and that seems to be the direction Dimitri's
patch is going - is to make the extension depend on its template. (And
thereby breaking the mental model of a template, IMO. In the spirit of
that mental model, one could argue that code for stored procedures
shouldn't be duplicated, but instead just reference its ancestor.)

The other possible resolution is to make all extensions fit the template
model, i.e. make extensions independent of their templates.

This obviously requires Postgres to internalize the compiled binary code
of the library upon instantiation. (Storing the binary blobs by hash in
a shared catalog could easily prevent unwanted duplication between
databases.)
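
Just to illustrate that last idea (this is not proposed syntax, merely a
sketch in plain SQL of the data model I have in mind; all names are made
up), such a shared store could look roughly like:

  -- hypothetical shared catalog holding library payloads, deduplicated
  -- by content hash
  CREATE TABLE pg_extension_library (
      libhash bytea PRIMARY KEY,  -- content hash, used for deduplication
      libarch text  NOT NULL,     -- CPU architecture / ABI the blob targets
      libbody bytea NOT NULL      -- the compiled shared object itself
  );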

Advantages:
 - Consistent mental model: template
 - Closes the gap between how SQL scripts and binary parts of
   an extension are handled. (Or between binary and non-binary
   extensions.)
 - Compatibility and co-existence between file-system and
   system catalog based extension templates is simplified.
 - Independence of extensions from library files shipped by
   3rd party extensions (those would only ship a template, not
   the extension per se).
 - Paves the way to uploading extensions that carry a binary,
   executable payload via libpq.

Challenges:
 - CPU Architecture must be taken into account for backups. Restoring
   a backup on a different architecture would still require the
   (external) binary code for that specific architecture, because we
   cannot possibly store binaries for all architectures.
 - A bit of a mind shift for how binary extensions work.


Regards

Markus Wanner



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: extension template

2013-07-05 Thread Markus Wanner
On 07/05/2013 09:05 PM, Markus Wanner wrote:
 The patch has already
 been marked as 'returned with feedback', and I can support that
 resolution (for this CF).

Oops.. I just realize it's only set to waiting on author, now. I guess
I confused the two states. Please excuse my glitch.

Dimitri, do you agree to resolve with returned with feedback for this
CF? Or how would you like to proceed?

Josh, thanks for linking the review in the CF.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Markus Wanner
On 06/27/2013 11:12 AM, Nicolas Barbier wrote:
 Imagine that there are a lot of indexes, e.g., 50. Although a lookup
 (walking one index) is equally fast, an insertion must update all 50
 indexes. When each index requires one extra I/O (because each index is
 one level taller), that is 50 extra I/Os. In the partitioned case,
 each index would require the normal smaller amount of I/Os. Choosing
 which partition to use must only be done once: The result “counts” for
 all indexes that are to be updated.

I think you're underestimating the cost of partitioning. After all, the
lookup of what index to update for a given partition is a a lookup in
pg_index via pg_index_indrelid_index - a btree index.

Additionally, the depth of an index doesn't directly translate to the
number of I/O writes per insert (or delete). I'd rather expect the avg.
number of I/O writes per insert into a b-tree to be reasonably close to
one - depending mostly on the number of keys per page, not depth.

 Additionally: Imagine that the data can be partitioned along some
 column that makes sense for performance reasons (e.g., some “date”
 where most accesses are concentrated on rows with more recent dates).
 The other indexes will probably not have such a performance
 distribution. Using those other indexes (both for look-ups and
 updates) in the non-partitioned case, will therefore pull a huge
 portion of each index into cache (because of the “random distribution”
 of the non-date data). In the partitioned case, more cache can be
 spent on the indexes that correspond to the “hot partitions.”

That's a valid point, yes. I'd call this index partitioning. And with
partial indices, Postgres already has something that gets pretty close,
I think. Though, I don't consider this to be related to how the tuples
of the relation are laid out on disk.
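
To spell out what I mean (table and column names are made up), a partial
index that only covers the hot, recent rows stays small and is much
easier to keep cached:

  -- only rows in the hot date range are indexed; queries outside that
  -- range simply won't use this index
  CREATE INDEX orders_recent_customer_idx
      ON orders (customer_id)
      WHERE order_date >= DATE '2013-01-01';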

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Markus Wanner
On 06/27/2013 06:35 PM, Nicolas Barbier wrote:
 I am assuming that this (comparatively very small and super-hot) index
 is cached all the time, while for the other indexes (that are
 supposedly super-huge) only the top part stays cached.
 
 I am mostly just trying to find out where Yuri’s “partitioning is
 needed for super-huge tables” experience might come from, and noting
 that Heikki’s argument might not be 100% valid.

I think the OP made that clear by stating that his index has relatively
low selectivity. That seems to be a case that Postgres doesn't handle
very well.

 I think that the
 “PostgreSQL-answer” to this problem is to somehow cluster the data on
 the “hotness column” (so that all hot rows are close to one another,
 thereby improving the efficiency of the caching of relation blocks) +
 partial indexes for the hot rows (as first mentioned by Claudio; to
 improve the efficiency of the caching of index blocks).

Agreed, sounds like a sane strategy.

 My reasoning was: To determine which index block to update (typically
 one in both the partitioned and non-partitioned cases), one needs to
 walk the index first, which supposedly causes one additional (read)
 I/O in the non-partitioned case on average, because there is one extra
 level and the lower part of the index is not cached (because of the
 size of the index). I think that pokes a hole in Heikki’s argument of
 “it really doesn’t matter, partitioning == using one big table with
 big non-partial indexes.”

Heikki's argument holds for the general case, where you cannot assume a
well defined hot partition. In that case, the lowest levels of all the
b-trees of the partitions don't fit in the cache, either. A single index
performs better in that case, because it has lower overhead.

I take your point that in case you *can* define a hot partition and
apply partitioning, the hot(test) index(es) are more likely to be cached
and thus require less disk I/O.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Markus Wanner
On 06/27/2013 11:13 PM, Jeff Janes wrote:
 Wouldn't any IO system being used on a high-end system be fairly good
 about making this work through interleaved read-ahead algorithms?

To some extent, certainly. It cannot possibly get better than a fully
sequential load, though.

 That sounds like it would be much more susceptible to lock contention,
 and harder to get bug-free, than dividing into bigger chunks, like whole
 1 gig segments.  

Maybe, yes. Splitting a known amount of work into equal pieces sounds
like a pretty easy parallelization strategy. In case you don't know the
total amount of work or the size of each piece in advance, it gets a bit
harder. Choosing chunks that turn out to be too big certainly hurts.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/25/2013 11:52 PM, Kevin Grittner wrote:
 At least until we have parallel
 query execution.  At *that* point this all changes.

Can you elaborate on that, please? I currently have a hard time
imagining how partitions can help performance in that case, either. At
least compared to modern RAID and read-ahead capabilities.

After all, RAID can be thought of as hash partitioning with a very weird
hash function. Or maybe rather range partitioning on an internal key.

Put another way: ideally, the system should take care of optimally
distributing data across its physical storage itself. If you need to do
partitioning manually for performance reasons, that's actually a
deficiency of it, not a feature.

I certainly agree that manageability may be a perfectly valid reason to
partition your data. Maybe there even exist other good reasons. I don't
think performance optimization is one. (It's more like giving the system
a hint. And we all dislike hints, don't we? *ducks*)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 04:10 PM, Yuri Levinsky wrote:
 You typically don't want to use a b-tree index when you select
 more than ~1-2% of your data.

Agreed. Indices on columns with very low selectivity don't perform well.
(Postgres knows that and uses a sequential scan based on selectivity
estimates. Being able to eliminate entire partitions from such a seq
scan would certainly be beneficial, yes.)

In the Postgres world, though, I think CLUSTERing might be the better
approach to solve that problem. (Note: this has nothing to do with
distributed systems in this case.) I'm not sure what the current status
of auto clustering or optimized scans on such a permanently clustered
table is, though.
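
(What I have in mind, with made-up table and column names:

  CREATE INDEX big_table_status_idx ON big_table (status);
  CLUSTER big_table USING big_table_status_idx;

Afterwards, rows with the same status sit physically close together, so
a scan for one particular status value touches far fewer, mostly
contiguous pages.)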

The minmax indices proposed for 9.4 might be another feature worth
looking at.

Both of these approaches may eventually provide a more general and more
automatic way to speed up scans on large portions of a relation, IMO.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 04:01 PM, k...@rice.edu wrote:
 I think he is referring to the fact that with parallel query execution,
 multiple partitions can be processed simultaneously instead of serially
 as they are now with the resulting speed increase.

Processing simultaneously is the purpose of parallel query execution,
yes. But I see no reason for that not to work equally well for
unpartitioned tables.

Disk I/O is already pretty well optimized and parallelized, I think.
Trying to parallelize a seq scan on the Postgres side is likely to yield
far inferior performance.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-26 Thread Markus Wanner
On 06/25/2013 08:26 PM, Andres Freund wrote:
 It's not about the reviewers being less. It's a comparison of
 effort. The effort for a casual review simply isn't comparable with the
 effort spent on developing a nontrivial patch.

Remember: "Debugging is twice as hard as writing the code in the first
place. ..." (Brian Kernighan)

IMO, the kind of reviews we need are of almost debug level difficulty.
(To the point where the reviewer becomes a co-author or even takes over
and submits a completely revamped patch instead.)

I agree that the casual review is several levels below that, so your
point holds. I doubt we need more reviews of that kind, though.

Thus, I'm in the AAB camp. The remaining question being: What's the
criterion for becoming a co-author (and thus getting mentioned in the
release notes)?

If at all, we should honor quality work with a prize. Maybe a prize
for the best reviewer per CF? Maybe even based on votes from the general
public on who's been the best, so reviews gain attention that way...
"Click here to vote for my review." ... Maybe a crazy idea.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-26 Thread Markus Wanner
On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
 We could also allow a large query to search a single table in parallel.
 A seqscan would be easy to divide into N equally-sized parts that can be
 scanned in parallel. It's more difficult for index scans, but even then
 it might be possible at least in some limited cases.

So far reading sequentially is still faster than hopping between
different locations. Purely from the I/O perspective, that is.

For queries where the single CPU core turns into a bottle-neck and which
we want to parallelize, we should ideally still do a normal, fully
sequential scan and only fan out after the scan and distribute the
incoming pages (or even tuples) to the multiple cores to process.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change authentication error message (patch)

2013-06-20 Thread Markus Wanner
On 06/20/2013 12:51 AM, Jeff Janes wrote:
 I think we need to keep the first "password".  Password authentication
 is a single thing, it is the authentication method attempted.  It is the
 password method (which includes MD5) which failed, as opposed to the
 LDAP method or the Peer method or one of the other methods.

That's against the rule of not revealing any more knowledge than a
potential attacker already has, no? For that reason, I'd rather go with
just "authentication failed".

 Without this level of explicitness, it might be hard to figure out which
 row in pg_hba.conf was the one that PostgreSQL glommed onto to use for
 authentication.

As argued before, that should go into the logs for diagnosis by the
sysadmin, but should not be revealed to an attacker.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change authentication error message (patch)

2013-06-20 Thread Markus Wanner
On 06/20/2013 12:27 PM, Marko Tiikkaja wrote:
 My understanding is that the attacker would already have that
 information since the server would have sent an
 AuthenticationMD5Password message to get to the error in the first
 place.  And we still reveal the authentication method to the frontend in
 all other cases (peer authentication failed, for example).

Oh, right, I wasn't aware of that. Never mind, then.

+1 for keeping it mentioning password authentication explicitly.

However, thinking about this a bit more: Other authentication methods
may also provide password (or even account) expiration times. And may
fail to authenticate a user for entirely different reasons.

Given that, I wonder if "password expired" is such a special case worth
mentioning in case of the password auth method. If we go down that
path, don't we also have to include "auth server unreachable" as a
possible cause for authentication failure for methods that use an
external server?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic background workers

2013-06-20 Thread Markus Wanner
Robert,

On 06/14/2013 11:00 PM, Robert Haas wrote:
 Parallel query, or any subset of that project such as parallel sort,
 will require a way to start background workers on demand.

thanks for continuing this, very much appreciated. Postgres-R and thus
TransLattice successfully use a similar approach for years, now.

I only had a quick glance over the patch, yet. Some comments on the design:

 This requires some communication channel from ordinary workers to the
 postmaster, because it is the postmaster that must ultimately start
 the newly-registered workers.  However, that communication channel has
 to be designed pretty carefully, lest a shared memory corruption take
 out the postmaster and lead to inadvertent failure to restart after a
 crash.  Here's how I implemented that: there's an array in shared
 memory of a size equal to max_worker_processes.  This array is
 separate from the backend-private list of workers maintained by the
 postmaster, but the two are kept in sync.  When a new background
 worker registration is added to the shared data structure, the backend
 adding it uses the existing pmsignal mechanism to kick the postmaster,
 which then scans the array for new registrations.

That sounds like a good simplification. Even if it's an O(n) operation,
the n in question here has relatively low practical limits. It's
unlikely to be much of a concern, I guess.

Back then, I solved it by having a fork request slot. After starting,
the bgworker then had to clear that slot and register with a coordinator
process (i.e. the original requestor), so that one learned its fork
request was successful. At some point I expanded that to multiple
request slots to better handle multiple concurrent fork requests.
However, it was difficult to get right and requires more IPC than your
approach.

On the pro side: The shared memory area used by the postmaster was very
small in size and read-only to the postmaster. These were my main goals,
which I'm not sure were the best ones, now that I read your concept.

 I have attempted to
 make the code that transfers the shared_memory state into the
 postmaster's private state as paranoid as humanly possible.  The
 precautions taken are documented in the comments.  Conversely, when a
 background worker flagged as BGW_NEVER_RESTART is considered for
 restart (and we decide against it), the corresponding slot in the
 shared memory array is marked as no longer in use, allowing it to be
 reused for a new registration.

Sounds like the postmaster is writing to shared memory. Not sure why
I've been trying so hard to avoid that, though. After all, it can hardly
hurt itself *writing* to shared memory.

 Since the postmaster cannot take locks, synchronization between the
 postmaster and other backends using the shared memory segment has to
 be lockless.  This mechanism is also documented in the comments.  An
 lwlock is used to prevent two backends that are both registering a new
 worker at about the same time from stomping on each other, but the
 postmaster need not care about that lwlock.
 
 This patch also extends worker_spi as a demonstration of the new
 interface.  With this patch, you can CREATE EXTENSION worker_spi and
 then call worker_spi_launch(int4) to launch a new background worker,
 or combine it with generate_series() to launch a bunch at once.  Then
 you can kill them off with pg_terminate_backend() and start some new
 ones.  That, in my humble opinion, is pretty cool.

It definitely is. Thanks again.
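
For anyone who wants to try it, that boils down to a session like the
following (assuming the patch is applied and worker_spi is built and
installed; the pid passed to pg_terminate_backend() is just a
placeholder):

  CREATE EXTENSION worker_spi;

  -- launch one background worker ...
  SELECT worker_spi_launch(1);

  -- ... or a bunch at once
  SELECT worker_spi_launch(i) FROM generate_series(2, 4) AS i;

  -- and kill one off again, given its pid
  SELECT pg_terminate_backend(12345);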

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dynamic background workers

2013-06-20 Thread Markus Wanner
On 06/20/2013 04:41 PM, Robert Haas wrote:
 The constant factor is also very small.  Generally, I would expect
 num_worker_processes ~ # CPUs

That assumption might hold for parallel querying, yes. In case of
Postgres-R, it doesn't. In the worst case, i.e. with a 100% write load,
a cluster of n nodes, each with m backends performing transactions, all
of them replicated to all other (n-1) nodes, you end up with ((n-1) * m)
bgworkers. Which is pretty likely to be way above the # CPUs on any
single node.

I can imagine other extensions or integral features like autonomous
transactions that might possibly want many more bgworkers as well.

 and scanning a 32, 64, or even 128
 element array is not a terribly time-consuming operation.

I'd extend that to say scanning an array with a few thousand elements is
not terribly time-consuming, either. IMO the simplicity is worth it,
ATM. It's all relative to your definition of ... eh ... terribly.

.oO( ... premature optimization ... all evil ... )

 We might
 need to re-think this when systems with 4096 processors become
 commonplace, but considering how many other things would also need to
 be fixed to work well in that universe, I'm not too concerned about it
 just yet.

Agreed.

 One thing I think we probably want to explore in the future, for both
 worker backends and regular backends, is pre-forking.  We could avoid
 some of the latency associated with starting up a new backend or
 opening a new connection in that way.  However, there are quite a few
 details to be thought through there, so I'm not eager to pursue that
 just yet.  Once we have enough infrastructure to implement meaningful
 parallelism, we can benchmark it and find out where the bottlenecks
 are, and which solutions actually help most.

Do you mean pre-forking and connecting to a specific database? Or really
just the forking?

 I do think we need a mechanism to allow the backend that requested the
 bgworker to know whether or not the bgworker got started, and whether
 it unexpectedly died.  Once we get to the point of calling user code
 within the bgworker process, it can use any number of existing
 mechanisms to make sure that it won't die without notifying the
 backend that started it (short of a PANIC, in which case it won't
 matter anyway).  But we need a way to report failures that happen
 before that point.  I have some ideas about that, but decided to leave
 them for future passes.  The remit of this patch is just to make it
 possible to dynamically register bgworkers.  Allowing a bgworker to be
 tied to the session that requested it via some sort of feedback loop
 is a separate project - which I intend to tackle before CF2, assuming
 this gets committed (and so far nobody is objecting to that).

Okay, sounds good. Given my background, I considered that a solved
problem. Thanks for pointing it out.

 Sounds like the postmaster is writing to shared memory. Not sure why
 I've been trying so hard to avoid that, though. After all, it can hardly
 hurt itself *writing* to shared memory.
 
 I think there's ample room for paranoia about postmaster interaction
 with shared memory, but all it's doing is setting a flag, which is no
 different from what CheckPostmasterSignal() already does.

Sounds good to me.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change authentication error message (patch)

2013-06-19 Thread Markus Wanner
This probably is nit-picking, but it interests me in terms of how the
language is used and understood.

On 06/19/2013 08:55 PM, Joshua D. Drake wrote:
 I believe it actually can. The error message that is returned for a bad
 password, bad user or expired password is all the same. Which is why I
 put the username in there.

Sure, the authentication can fail for all these reasons. What I stumbled
over was the formulation of a failed username. If an engine fails, it
might literally fall apart. The username itself - even if it doesn't
pass authentication - is not falling apart in the same sense. But does
the username (or the password) fail if authentication with it (in
combination with password and account expiration time) is not possible?
After all, it might still be a valid and complete username for another
cluster or another service.

You can probably say "that username failed" when you actually mean it
failed to authenticate together with the provided password. Or how do
English native speakers perceive this?

 Authentication failed or password has expired for user \"%s\"
 
 Authentication failed covers any combination of a username/password
 being wrong and obviously password expired covers the other.

Works for me. Considering the password to be the thing that expires
(rather than the account) is probably more accurate as well.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Change authentication error message (patch)

2013-06-18 Thread Markus Wanner
On 06/16/2013 06:02 PM, Joshua D. Drake wrote:
 Instead of pushing extra info to the logs I decided that we could
 without giving away extra details per policy. I wrote the error message
 in a way that tells the most obvious problems, without admitting to any
 of them. Please see attached:

+1 for solving this with a bit of word-smithing.

However, the proposed wording doesn't sound like a full sentence to my
ears, because a password or username cannot fail per-se.

How about:
password authentication failed or account expired for user \"%s\"

It's a bit longer, but sounds more like a full sentence, no?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2013-06-11 Thread Markus Wanner
Ants,

the more I think about this, the more I start to like it.

On 06/07/2013 02:50 PM, Ants Aasma wrote:
 On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner mar...@bluegap.ch wrote:
 Agreed. Postgres-R uses a CommitOrderId, which is very similar in
 concept, for example.
 
 Do you think having this snapshot scheme would be helpful for Postgres-R?

Yeah, it could help to reduce patch size, after a rewrite to use such a CSN.

 Or why do you need to tell apart aborted from in-progress transactions
 by CSN?
 
 I need to detect aborted transactions so they can be discared during
 the eviction process, otherwise the sparse array will fill up. They
 could also be filtered out by cross-referencing uncommitted slots with
 the procarray. Having the abort case do some additional work to make
 xid assignment cheaper looks like a good tradeoff.

I see.

 Sparse buffer needs to be at least big enough to fit CSN slots for the
 xids of all active transactions and non-overflowed subtransactions. At
 the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
 at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
 or 203KB total in the default configuration.

 A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
 the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
 
 8 byte alignment for CSNs is needed for atomic if not something else.

Oh, right, atomic writes.

 I think the size could be cut in half by using a base value for CSNs
 if we assume that no xid is active for longer than 2B transactions as
 is currently the case. I didn't want to include the complication in
 the first iteration, so I didn't verify if that would have any
 gotchas.

In Postgres-R, I effectively used a 32-bit order id which wraps around.

In this case, I guess adjusting the base value will get tricky. Wrapping
could probably be used as well, instead.

 The number of times each cache line can be invalidated is
 bounded by 8.

Hm.. good point.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal for CSN based snapshots

2013-06-07 Thread Markus Wanner
.

Is that really necessary? After all, an aborted transaction behaves
pretty much the same as a transaction in progress WRT visibility: it's
simply not visible.

Or why do you need to tell apart aborted from in-progress transactions
by CSN?

 Sizing the CSN maps
 ---
 
 CSN maps need to be sized to accommodate the number of backends.
 
 Dense array size should be picked so that most xids commit before
 being evicted from the dense map and sparse array will contain slots
 necessary for either long running transactions or for long snapshots
 not yet converted to XID based snapshots. I did a few quick
 simulations to measure the dynamics. If we assume a power law
 distribution of transaction lengths and snapshots for the full length
 of transactions with no think time, then 16 slots per backend is
 enough to make the probability of eviction before commit less than
 1e-6 and being needed at eviction due to a snapshot about 1:1. In
 the real world very long transactions are more likely than predicted
 model, but at least the normal variation is mostly buffered by this
 size. 16 slots = 128bytes per backend ends up at a 12.5KB buffer for
 the default value of 100 backends, or 125KB for 1000 backends.

Sounds reasonable to me.

 Sparse buffer needs to be at least big enough to fit CSN slots for the
 xids of all active transactions and non-overflowed subtransactions. At
 the current level PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
 at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
 or 203KB total in the default configuration.

A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.
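
(Double-checking that arithmetic in psql, for fun:

  SELECT pg_size_pretty((16 * (64 + 1) * 100)::bigint);  -- 102 kB

which matches the ~101.6KB per buffer quoted above.)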

 Performance discussion
 --
 
 Taking a snapshot is extremely cheap in this scheme, I think the cost
 will be mostly for publishing the csnmin and rechecking validity after
 it. Taking snapshots in the shadow of another snapshot (often the case
 for the proposed MVCC catalog access) will be even cheaper as we don't
 have to advertise the new snapshot. The delayed XID based snapshot
 construction should be unlikely, but even if it isn't the costs are
 similar to GetSnapshotData, but we don't need to hold a lock.
 
 Visibility checking will also be simpler as for the cases where the
 snapshot is covered by the dense array it only requires two memory
 lookups and comparisons.

Keep in mind, though, that both of these lookups are into shared memory.
Especially the dense ring buffer may well turn into a point of
contention. Or at least the cache lines holding the most recent XIDs
within that ring buffer.

Whereas currently, the snapshot's xip array resides in process-local
memory. (Granted, often enough, the proc array also is a point of
contention.)

 At this point I don't see any major issues with this approach. If the
 ensuing discussion doesn't find any major showstoppers then I will
 start to work on a patch bit-by-bit.

Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
in that format.  :-)

 we have a small one in the family.

Congratulations on that one.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: logical changeset generation v3 - comparison to Postgres-R change set format

2013-01-14 Thread Markus Wanner
On 01/13/2013 09:04 PM, Hannu Krosing wrote:
 I'd just start with what send() and recv() on each type produces
 now using GCC on 64bit Intel and move towards adjusting others
 to match. For a period anything else would still be allowed, but
 be non-standard

Intel being little endian seems like a bad choice to me, given that
send/recv kind of implies network byte ordering. I'd rather not tie this
to any particular processor architecture at all (at least not solely on
the ground that it's the most common one at the time).

I have no strong opinion on sameness of NULLs and could also imagine
that to throw some kind of invalid operation error. Based on the ground
that neither is a value and it's unclear what send() method to use at all.

FWIW, trying to determine the length of a sent NULL gives an interesting
result that I don't currently understand.

 psql (9.2.2)
 Type "help" for help.
 
 postgres=# SELECT length(int4send(NULL));
  length 
 --------
         
 (1 row)
 
 postgres=# SELECT length(float4send(NULL));
  length 
 --------
         
 (1 row)
 
 postgres=# SELECT length(textsend(NULL));
  length 
 --------
         
 (1 row)
 
 postgres=# SELECT length(textsend(NULL) || '\000'::bytea);
  length 
 --------
         
 (1 row)


Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: logical changeset generation v3 - comparison to Postgres-R change set format

2013-01-13 Thread Markus Wanner
On 01/13/2013 12:30 PM, Hannu Krosing wrote:
 On 01/13/2013 10:49 AM, Hannu Krosing wrote:
 Does this hint that PostgreSQL also needs a sameness operator
 ("is" or "===" in some languages)?
 
 How do people feel about adding a real sameness operator ?

We'd need to define what sameness means. If this goes toward exact
match in binary representation, this gets a thumbs-up from me.

As a first step in that direction, I'd see adjusting send() and recv()
functions to use a portable binary format. A sameness operator could
then be implemented by simply comparing two values' send() outputs.
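
To illustrate the difference between plain equality and such a sameness
check, using functions that already exist today (float8send() is the
binary-send function for double precision):

  SELECT '0'::float8 = '-0'::float8 AS equal,               -- true
         float8send('0'::float8)
           = float8send('-0'::float8) AS same_binary;       -- false

Zero and minus zero compare as equal, but their binary representations
differ in the sign bit, so the send()-based check tells them apart.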

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-04 Thread Markus Wanner
On 12/03/2012 10:35 PM, Alvaro Herrera wrote:
 So here's version 8.  This fixes a couple of bugs and most notably
 creates a separate PGPROC list for bgworkers, so that they don't
 interfere with client-connected backends.

Nice, thanks. I'll try to review this again, shortly.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-03 Thread Markus Wanner
On 12/03/2012 03:28 PM, Alvaro Herrera wrote:
 I'm not really that interested in it currently ... and there are enough
 other patches to review.  I would like bgworkers to mature a bit more
 and get some heavy real world testing before we change autovacuum.

Understood.

 Just like av_launcher does it now: set a flag in shared memory and
 signal the postmaster (PMSIGNAL_START_AUTOVAC_WORKER).
 
 I'm not sure how this works.  What process is in charge of setting such
 a flag?

The only process that currently starts background workers ... ehm ...
autovacuum workers is the autovacuum launcher. It uses the above
Postmaster Signal in autovacuum.c:do_start_autovacuum_worker() to have
the postmaster launch bg/autovac workers on demand.

(And yes, this kind of is an exception to the rule that the postmaster
must not rely on shared memory. However, these are just flags we write
atomically, see pmsignal.[ch])

I'd like to extend that, so that other processes can request to start
(pre-registered) background workers more dynamically. I'll wait for an
update of your patch, though.

 This is actually a very silly bug: it turns out that guc.c obtains the
 count of workers before workers have actually registered.  So this
 necessitates some reshuffling.

Okay, thanks.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-03 Thread Markus Wanner
On 12/03/2012 04:27 PM, Simon Riggs wrote:
 My understanding was that the patch keep autovac workers and
 background workers separate at this point.

I agree to that, for this patch. However, that might not keep me from
trying to (re-)submit some of the bgworker patches in my queue. :-)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-03 Thread Markus Wanner
On 12/03/2012 04:44 PM, Alvaro Herrera wrote:
 Simon Riggs wrote:
 Is there anything to be gained *now* from merging those two concepts?
 I saw that as refactoring that can occur once we are happy it should
 take place, but isn't necessary.
 
 IMO it's a net loss in robustness of the autovac implementation.

Agreed.

That's only one aspect of it, though. From the other point of view, it
would be a proof of stability for the bgworker implementation if
autovacuum worked on top of it.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-12-03 Thread Markus Wanner
Alvaro,

in general, please keep in mind that there are two aspects that I tend
to mix and confuse: one is what's implemented and working for
Postgres-R. The other is how I envision (parts of) it being merged
back into Postgres itself. Sorry if that causes confusion.

On 12/03/2012 04:43 PM, Alvaro Herrera wrote:
 Oh, I understand that.  My question was more about what mechanism are
 you envisioning for new bgworkers.  What process would be in charge of
 sending such signals?  Would it be a persistent bgworker which would
 decide to start up other bgworkers based on external conditions such as
 timing?  And how would postmaster know which bgworker to start -- would
 it use the bgworker's name to find it in the registered workers list?

Well, in the Postgres-R case, I've extended the autovacuum launcher to a
more general purpose job scheduler, named coordinator. Lacking your
bgworker patch, it is the only process that is able to launch background
workers.

Your work now allows extensions to register background workers. If I
merge the two concepts, I can easily imagine allowing other (bgworker)
processes to launch bgworkers.

Another thing I can imagine is turning that coordinator into something
that can schedule and trigger jobs registered by extensions (or even
user defined ones). Think: cron daemon for SQL jobs in the background.
(After all, that's pretty close to what the autovacuum launcher does,
already.)

 The trouble with the above rough sketch is how does the coordinating
 bgworker communicate with postmaster.

I know. I've gone through that pain.

In Postgres-R, I've solved this with imessages (which was the primary
reason for rejection of the bgworker patches back in 2010).

The postmaster only needs to start *a* background worker process. That
process itself then has to figure out (by checking its imessage queue)
what job it needs to perform. Or if you think in terms of bgworker
types: what type of bgworker it needs to be.

 Autovacuum has a very, um,
 peculiar mechanism to make this work: avlauncher sends a signal (which
 has a hardcoded-in-core signal number) and postmaster knows to start a
 generic avworker; previously avlauncher has set things up in shared
 memory, and the generic avworker knows where to look to morph into the
 specific bgworker required.  So postmaster never really looks at shared
 memory other than the signal number (which is the only use of shmem in
 postmaster that's acceptable, because it's been carefully coded to be
 robust).

In Postgres-R, I've extended exactly that mechanism to work for other
jobs than autovacuum.

 This doesn't work for generic modules because we don't have a
 hardcoded signal number; if two modules wanted to start up generic
 bgworkers, how would postmaster know which worker-main function to call?

You've just described above how this works for autovacuum: the
postmaster doesn't *need* to know. Leave it to the bgworker to determine
what kind of task it needs to perform.
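
In other words, the worker side can look roughly like this (the shared
memory area, its locking and all names here are made up for
illustration):

#include "postgres.h"

/*
 * A generic worker morphs into whatever job the launcher has prepared
 * for it in shared memory, analogous to how an avworker looks up its
 * assignment after having been forked by the postmaster.
 */
typedef enum WorkerJobKind
{
	WORKER_JOB_VACUUM,
	WORKER_JOB_APPLY_REMOTE_XACT
} WorkerJobKind;

static void
generic_worker_main(void)
{
	/* read the job description the launcher left in shared memory */
	WorkerJobKind job = WORKER_JOB_VACUUM;	/* placeholder lookup */

	switch (job)
	{
		case WORKER_JOB_VACUUM:
			/* morph into an autovacuum-style worker */
			break;
		case WORKER_JOB_APPLY_REMOTE_XACT:
			/* morph into a replication apply worker, Postgres-R style */
			break;
	}
}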

 As posted, the feature is already useful and it'd be good to have it
 committed soon so that others can experiment with whatever sample
 bgworkers they see fit.  That will give us more insight on other API
 refinements we might need.

I completely agree. I didn't ever intend to provide an alternative patch
or hold you back. (Except for the extra daemon issue, where we disagree,
but that's not a reason to keep this feature back). So please, go ahead
and commit this feature (once the issues I've mentioned in the review
are fixed).

Please consider all of these plans or ideas in here (or in Postgres-R)
as extending on your work, rather than competing against.

 I'm going to disappear on paternity leave, most likely later this week,
 or early next week; I would like to commit this patch before that.  When
 I'm back we can discuss other improvements.  That will give us several
 weeks before the end of the 9.3 development period to get these in.  Of
 course, other committers are welcome to improve the code in my absence.

Okay, thanks for sharing that. I'd certainly appreciate your inputs on
further refinements for bgworkers and/or autovacuum.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 11:09 AM, Dimitri Fontaine wrote:
 Markus Wanner mar...@bluegap.ch writes:
 However, as you say, maybe we need more coding examples.

 Maybe a minimally usable extra daemon example? Showing how to avoid
 common pitfalls? Use cases, anybody? :-)
 
 What about the PGQ ticker, pgqd?
 
   https://github.com/markokr/skytools/tree/master/sql/ticker
   https://github.com/markokr/skytools/blob/master/sql/ticker/pgqd.c

AFAICS pgqd currently uses libpq, so I think it would rather turn into
what I call a background worker, with a connection to Postgres shared
memory. I perfectly well see use cases (plural!) for those.

What I'm questioning is the use for what I rather call extra daemons,
that is, additional processes started by the postmaster without a
connection to Postgres shared memory (and thus without a database
connection).

I was asking for a minimal example of such an extra daemon, similar to
worker_spi, showing how to properly write such a thing. Ideally showing
how to avoid common pitfalls.
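
To make that concrete, I'm thinking of something along these lines,
loosely modeled on worker_spi and the API of the patch at hand; the
field names beyond bgw_flags and bgw_restart_time as well as the signal
setup are assumptions on my part and may not match the final API:

#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"

static volatile bool got_sigterm = false;

static void
extra_daemon_main(void *main_arg)
{
	/* install a SIGTERM handler here that sets got_sigterm */

	while (!got_sigterm)
	{
		/* no shared memory, hence no latch to wait on: plain sleep */
		pg_usleep(10 * 1000000L);	/* 10 seconds */
		elog(LOG, "extra daemon: still alive");
	}

	proc_exit(0);
}

void
_PG_init(void)
{
	BackgroundWorker worker;

	worker.bgw_name = "minimal extra daemon";
	worker.bgw_flags = 0;		/* neither shmem access nor a DB connection */
	worker.bgw_restart_time = 60;	/* seconds */
	worker.bgw_main = extra_daemon_main;

	RegisterBackgroundWorker(&worker);
}

Even such a trivial loop has to get signal handling, restarts and
logging right - exactly the kind of pitfalls I'd like to see documented.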

 Maybe it would be easier to have a version of GNU mcron as an extension,
 with the abitity to fire PostgreSQL stored procedures directly?  (That
 way the cron specific parts of the logic are already implemented)
 
   http://www.gnu.org/software/mcron/

Again, that's something that would eventually require a database
connection. Or at least access to Postgres shared memory to eventually
start the required background jobs. (Something Alvaro's patch doesn't
implement, yet. But which exactly matches what the coordinator and
bgworkers in Postgres-R do.)

For ordinary extra daemons, I'm worried about things like an OOM killer
deciding to kill the postmaster, it being their parent. Or (io)niceness
settings. Or Linux cgroups. IMO all of these things just get harder to
administrate for extra daemons, when they move under the hat of the
postmaster.

Without access to shared memory, the only thing an extra daemon would
gain by moving under postmaster is the knowledge of the start, stop
and restart (crash) events of the database. And that in a very
inflexible way: the extra process is forced to start, stop and restart
together with the database to learn about these events.

Using a normal client connection arguably is a better way to learn about
crash events - it doesn't have the above limitation. And the start and
stop events, well, the DBA or sysadmin is under control of these,
already. We can possibly improve on that, yes. But extra daemons are not
the way to go, IMO.

 Another idea would be to have a pgbouncer extension. We would still need
 of course to have pgbouncer as a separate component so that client
 connection can outlive a postmaster crash, but that would still be very
 useful as a first step into admission control. Let's not talk about the
 feedback loop and per-cluster resource usage monitoring yet, but I guess
 that you can see the drift.

Sorry, I don't. Especially not with something like pgbouncer, because I
definitely *want* to control and administrate that separately.

So I guess this is a vote to disallow `worker.bgw_flags = 0` on the
basis that a process which doesn't need to access Postgres
shared memory better does whatever it does outside of Postgres. For the
benefit of the stability of Postgres and for ease of administration of
the two.

Or, maybe, rather drop the BGWORKER_SHMEM_ACCESS flag and imply that
every registered process wants to have access to Postgres shared memory.
Then document the gotchas and requirements of living under the
Postmaster, as I've requested before. (If you want a foot gun, you can
still write an extension that doesn't need to access Postgres shared
memory, but hey.. we can't help those who desperately try to shoot
their foot).

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 01:40 PM, Dimitri Fontaine wrote:
 I totally missed the need to connect to shared memory to be able to
 connect to a database and query it. Can't we just link the bgworker
 code to the client libpq library, just as plproxy is doing I believe?

Well, sure you can use libpq. But so can external processes. So that's
no benefit of running under the postmaster.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 01:59 PM, Andres Freund wrote:
 On 2012-11-30 09:57:20 -0300, Alvaro Herrera wrote:
 One of the uses for bgworkers that don't have shmem connection is to
 have them use libpq connections instead.  I don't really see the point
 of forcing everyone to use backend connections when libpq connections
 are enough.

Requiring a libpq connection is a good indication for *not* wanting the
process to run under the postmaster, IMO.

 In particular, they are easier to port from existing code;
 and they make it easier to share code with systems that still have to
 support older PG versions.
 
 They also can get away with a lot more crazy stuff without corrupting
 the database.

Exactly. That's a good reason to *not* tie that to the postmaster, then.
Please keep as much of the potentially dangerous stuff separate (and
advice developers to do so as well, instead of offering them a foot
gun). So that our postmaster can do its job. And do it reliably, without
trying to be a general purpose start/stop daemon. There are better and
well established tools for that.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
Alvaro,

On 11/30/2012 02:44 PM, Alvaro Herrera wrote:
 So it
 makes it easier to have processes that need to run alongside postmaster.

That's where we are in respectful disagreement, then. As I don't think
that's easier, overall, but in my eye, this looks like a foot gun.

As long as things like pgbouncer, pgqd, etc.. keep to be available as
separate daemons, I'm happy, though.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-30 Thread Markus Wanner
On 11/30/2012 03:16 PM, Kohei KaiGai wrote:
 This feature does not enforce them to implement with this new framework.
 If they can perform as separate daemons, it is fine enough.

I'm not clear on what exactly you envision, but if a process needs
access to shared buffers, it sounds like it should be a bgworker. I
don't quite understand why that process also wants a libpq connection,
but that's certainly doable.

 But it is not all the cases where we want background workers being tied
 with postmaster's duration.

Not wanting a process to be tied to postmaster's duration is a strong
indication that it better not be a bgworker, I think.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Review: Extra Daemons / bgworker

2012-11-28 Thread Markus Wanner
.)


# Tests

No additional automated tests are included - hard to see how that could
be tested automatically, given the lack of a proper test harness.


I hope this review provides useful input.

Regards

Markus Wanner


[1]: That was during commit fest 2010-09. Back then, neither extensions,
latches nor the walsender existed. The patches had a dependency on
imessages, which was not accepted. Other than that, it's a working,
pooled bgworker implementation on top of which autovacuum runs just
fine. It lacks extensibility and the extra daemons feature, though. For
those who missed it:

https://commitfest.postgresql.org/action/patch_view?id=339
http://archives.postgresql.org/message-id/4c3c789c.1040...@bluegap.ch
http://git.postgres-r.org/?p=bgworker;a=summary


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: Extra Daemons / bgworker

2012-11-28 Thread Markus Wanner
On 11/28/2012 03:51 PM, Alvaro Herrera wrote:
 I remember your patchset.  I didn't look at it for this round, for no
 particular reason.  I did look at KaiGai's submission from two
 commitfests ago, and also at a patch from Simon which AFAIK was never
 published openly.  Simon's patch merged the autovacuum code to make
 autovac workers behave like bgworkers as handled by his patch, just like
 you suggest.  To me it didn't look all that good; I didn't have the guts
 for that, hence the separation.  If later bgworkers are found to work
 well enough, we can port autovac workers to use this framework, but
 for now it seems to me that the conservative thing is to leave autovac
 untouched.

Hm.. interesting to see how the same idea grows into different patches
and gets refined until we end up with something considered committable.

Do you remember anything in particular that didn't look good? Would you
help reviewing a patch on top of bgworker-7 that changed autovacuum into
a bgworker?

 (As an example, we introduced ilist some commits back and changed some
 uses to it; but there are still many places where we're using SHM_QUEUE,
 or List, or open-coded lists, which we could port to the new
 infrastructure, but there's no pressing need to do it.)

Well, I usually like cleaning things up earlier rather than later (my
desk clearly being an exception to that rule, though). But yeah, new
code usually brings new bugs with it.

 Sorry about that -- forgot to push to github.  bgworker-7 corresponds to
 commit 0a49a540b which I have just pushed to github.

Thanks.

 Maybe one thing to do in this area would be to ensure that there is a
 certain number of PGPROC elements reserved specifically for bgworkers;
 kind of like autovacuum workers have.  That would avoid having regular
 clients exhausting slots for bgworkers, and vice versa.

Yeah, I think that's mandatory, anyways, see below.

 How are you envisioning that the requests would occur?

Just like av_launcher does it now: set a flag in shared memory and
signal the postmaster (PMSIGNAL_START_AUTOVAC_WORKER).

(That's why I'm so puzzled: it looks like it's pretty much all there,
already. I even remember a discussion about that mechanism probably not
being fast enough to spawn bgworkers. And a proposal to add multiple
such flags, so an avlauncher-like daemon could ask for multiple
bgworkers to be started in parallel. I've even measured the serial
bgworker fork rate back then, IIRC it was in the hundreds of forks per
second...)

 # Minor technical issues or questions

 In assign_maxconnections, et al, GetNumRegisteredBackgroundWorkers() is
 used in relation to MAX_BACKENDS or to calculate MaxBackends. That seems
 to leak the bgworkers that registered with neither
 BGWORKER_SHMEM_ACCESS nor BGWORKER_BACKEND_DATABASE_CONNECTION set. Or
 is there any reason to discount such extra daemon processes?
 
 No, I purposefully let those out, because those don't need a PGPROC.  In
 fact that seems to me to be the whole point of non-shmem-connected
 workers: you can have as many as you like and they won't cause a
 backend-side impact.  You can use such a worker to connect via libpq to
 the server, for example.

Hm.. well, in that case, the shmem-connected ones are *not* counted. If
I create and run an extension that registers 100 shmem-connected
bgworkers, I cannot connect to a system with max_connections=100
anymore, because bgworkers then occupy all of the connections, already.

Please add the registered shmem-connected bgworkers to the
max_connections limit. I think it's counter intuitive to have autovacuum
workers reserved separately, but extension's bgworkers eat (client)
connections. Or put another way: max_connections should always be the
max number of *client* connections the DBA wants to allow.

(Or, if that's in some way complicated, please give the DBA an
additional GUC akin to max_background_workers. That can be merged with
the current max_autovacuum_workers, once/if we rebase autovaccum onto
bgworkers).

 The additional contrib modules auth_counter and worker_spi are missing
 from the contrib/Makefile. If that's intentional, they should probably
 be listed under Missing.

 The auth_counter module leaves worker.bgw_restart_time uninitialized.

 Being an example, it might make sense for auth_counter to provide a
 signal that just calls SetLatch() to interrupt WaitLatch() and make
 auth_counter emit its log line upon request, i.e. show how to use SIGUSR1.
 
 KaiGai proposed that we remove auth_counter.  I don't see why not; I
 mean, worker_spi is already doing most of what auth_counter is doing, so
 why not?

Agreed.

 However, as you say, maybe we need more coding examples.

Maybe a minimally usable extra daemon example? Showing how to avoid
common pitfalls? Use cases, anybody? :-)

 Ah, that's just a bug, of course.

I see. Glad my review found it.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [pgsql-cluster-hackers] Question: Can i cut NON-HOT chain Pointers if there are no concurrent updates?

2012-11-23 Thread Markus Wanner
Henning,

On 11/23/2012 03:17 PM, Henning Mälzer wrote:
 Can somebody help me?

Sure, but you might get better answers on the -hackers mailing list. I'm
redirecting there. The cluster-hackers one is pretty low volume and has
few subscribers, I think.

 Question:
 What would be the loss if i cut NON-HOT chain Pointers, meaning i set 
 t_ctid=t_self in the case where it points to a tuple on another page?

READ COMMITTED would stop working correctly in the face of concurrent
transactions, I guess. See the fine manual:

http://www.postgresql.org/docs/devel/static/transaction-iso.html#XACT-READ-COMMITTED

The problem essentially boils down to: READ COMMITTED transactions need
to learn about tuples *newer* than what their snapshot would see.

 I am working on a project based on postgres (PostgreSQL) 8.5devel with the 
 code from several master theses before me.

Care to elaborate a bit? Can (part of) that code be released under an
open license?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-17 Thread Markus Wanner
On 11/16/2012 03:05 PM, Andres Freund wrote:
 I'd like to provide a comparison of the proposed change set format to
 the one used in Postgres-R.
 
 Uh, sorry to interrupt you right here, but thats not the proposed
 format ;)

Understood. Sorry, I didn't mean to imply that. It's pretty obvious to
me that this is more of a human readable format and that others,
including binary formats, can be implemented. I apologize for the bad
wording of "a proposed format", which doesn't make that clear.

 The Postgres-R approach is independent of WAL and its format, whereas
 the approach proposed here clearly is not. Either way, there is a
 certain overhead - however minimal it is - which the former adds to the
 transaction processing itself, while the latter postpones it to a
 separate XLogReader process. If there's any noticeable difference, it
 might reduce latency in case of asynchronous replication, but can only
 increase latency in the synchronous case. As far as I understood Andres,
 it was easier to collect the additional meta data from within the
 separate process.
 
 There also is the point that if you do the processing inside heap_* you
 need to make sure the replication targeted data is safely received &
 fsynced away, in our case thats not necessary as WAL already provides
 crash safety, so should the replication connection break you can simply
 start from the location last confirmed as being safely sent.

In the case of Postgres-R, the safely received part isn't really
handled at the change set level at all. And regarding the fsync
guarantee: you can well use the WAL to provide that, without basing
change set generation on in. In that regard, Postgres-R is probably the
more general approach: you can run Postgres-R with WAL turned off
entirely - which may well make sense if you take into account the vast
amount of cloud resources available, which don't have a BBWC. Instead of
WAL, you can add more nodes at more different locations. And no, you
don't want your database to ever go down, anyway  :-)

 In summary, I'd say that Postgres-R is an approach specifically
 targeting and optimized for multi-master replication between Postgres
 nodes, whereas the proposed patches are kept more general.
 
 One major aim definitely was optionally be able to replicate into just
 about any target system, so yes, I certainly agree.

I'm glad I got that correct ;-)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-17 Thread Markus Wanner
On 11/16/2012 03:14 PM, Andres Freund wrote:
 Whats the data type of the COID in -R?

It's short for CommitOrderId, a 32bit global transaction identifier,
which wraps around very much like TransactionIds do. (In that sense,
it's global, but unique only for a certain amount of time).

 In the patchset the output plugin has enough data to get the old xid and
 the new xid in the case of updates (not in the case of deletes, but
 thats a small bug and should be fixable with a single line of code), and
 it has enough information to extract the primary key without problems.

It's the xmin of the old tuple that Postgres-R needs to get the COID.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-17 Thread Markus Wanner
On 11/17/2012 02:30 PM, Hannu Krosing wrote:
 Is it possible to replicate UPDATEs and DELETEs without a primary key in
 PostgreSQL-R

No. There must be some way to logically identify the tuple. Note,
though, that theoretically any (unconditional) unique key would suffice.
In practice, that usually doesn't matter, as you rarely have one or more
unique keys without a primary.

Also note that the underlying index is useful for remote application of
change sets (except perhaps for very small tables).

In some cases, for example for n:m linking tables, you need to add a
uniqueness key that spans all columns (as opposed to a simple index on
one of the columns that's usually required, anyway). I hope for
index-only scans eventually mitigating this issue.

Alternatively, I've been thinking about the ability to add a hidden
column, which can then be used as a PRIMARY KEY without breaking legacy
applications that rely on SELECT * not returning that primary key.

Are there other reasons to want tables without primary keys that I'm
missing?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-17 Thread Markus Wanner
Hannu,

On 11/17/2012 03:40 PM, Hannu Krosing wrote:
 On 11/17/2012 03:00 PM, Markus Wanner wrote:
 On 11/17/2012 02:30 PM, Hannu Krosing wrote:
 Is it possible to replicate UPDATEs and DELETEs without a primary key in
 PostgreSQL-R
 No. There must be some way to logically identify the tuple.
 It can be done as selecting on _all_ attributes and updating/deleting
 just the first matching row
 
 create cursor ...
 select from t ... where t.* = ()
 fetch one ...
 delete where current of ...

That doesn't sound like it could possibly work for Postgres-R. At least
not when there can be multiple rows with all the same attributes, i.e.
without a unique key constraint over all columns.

Otherwise, some nodes could detect two concurrent UPDATES as a conflict,
while other nodes select different rows and don't handle it as a conflict.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical changeset generation v3 - comparison to Postgres-R change set format

2012-11-16 Thread Markus Wanner
Andres,

On 11/15/2012 01:27 AM, Andres Freund wrote:
 In response to this you will soon find the 14 patches that currently
 implement $subject.

Congratulations on that piece of work.


I'd like to provide a comparison of the proposed change set format to
the one used in Postgres-R. I hope for this comparison to shed some
light on the similarities and differences of the two projects. As the
author of Postgres-R, I'm obviously biased, but I try to be as neutral
as I can.


Let's start with the representation: I so far considered the Postgres-R
change set format to be an implementation detail and I don't intend it
to be readable by humans or third party tools. It's thus binary only and
doesn't offer a textual representation. The approach presented here
seems to target different formats for different audiences, including
binary representations. More general, less specific.


Next, contents: this proposal is more verbose. In the textual
representation shown, it provides (usually redundant) information about
attribute names and types. Postgres-R doesn't ever transmit attribute
name or type information for INSERT, UPDATE or DELETE operations.
Instead, it relies on attribute numbers and pg_attribute being in some
known consistent state.

Let's compare by example:

 table replication_example: INSERT: id[int4]:1 somedata[int4]:1 
 text[varchar]:1
 table replication_example: UPDATE: id[int4]:1 somedata[int4]:-1 
 text[varchar]:1
 table replication_example: DELETE (pkey): id[int4]:1

In Postgres-R, the change sets for these same operations would carry the
following information, in a binary representation:

 table replication_example: INSERT: VALUES (1, 1, '1')
 table replication_example: UPDATE: PKEY(1) COID(77) MODS('nyn') VALUES(-1)
 table replication_example: DELETE: PKEY(1) COID(78)

You may have noticed that there's an additional COID field. This is an
identifier for the transaction that last changed this tuple. Together
with the primary key, it effectively identifies the exact version of a
tuple (during its lifetime, for example before vs after an UPDATE). This
in turn is used by Postgres-R to detect conflicts.
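
Spelled out as a data structure - purely for illustration, this is not
the actual Postgres-R code - an UPDATE change set carries roughly:

#include "postgres.h"
#include "nodes/bitmapset.h"

typedef uint32 CommitOrderId;	/* wraps around, much like TransactionId */

/*
 * Illustration of an UPDATE change set: the primary key identifies the
 * tuple, the COID pins down the exact version of it, the modified-
 * attributes bitmap corresponds to MODS('nyn') above, and only the new
 * values of the modified columns are shipped.
 */
typedef struct ChangeSetUpdate
{
	Oid				relid;			/* target relation */
	CommitOrderId	coid;			/* last writer of the old tuple version */
	Bitmapset	   *modified_attrs;	/* which attributes changed */
	Datum		   *pkey_values;	/* values of the primary key columns */
	Datum		   *new_values;		/* new values, modified columns only */
} ChangeSetUpdate;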

It may be possible to add that to the proposed format as well, for it to
be able to implement a Postgres-R-like algorithm.


To finish off this comparison, let's take a look at how and where the
change sets are generated: in Postgres-R the change set stream is
constructed directly from the heap modification routines, i.e. in
heapam.c's heap_{insert,update,delete}() methods, whereas the patches
proposed here parse the WAL to reconstruct the modifications and add the
required meta information.

To me, going via the WAL first sounded like a step that unnecessarily
complicates matters. I recently talked to Andres and brought that up.
Here's my current view of things:

The Postgres-R approach is independent of WAL and its format, whereas
the approach proposed here clearly is not. Either way, there is a
certain overhead - however minimal it is - which the former adds to the
transaction processing itself, while the later postpones it to a
separate XLogReader process. If there's any noticeable difference, it
might reduce latency in case of asynchronous replication, but can only
increase latency in the synchronous case. As far as I understood Andres,
it was easier to collect the additional meta data from within the
separate process.


In summary, I'd say that Postgres-R is an approach specifically
targeting and optimized for multi-master replication between Postgres
nodes, whereas the proposed patches are kept more general.

I hope you found this to be an insightful and fair comparison.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-13 Thread Markus Wanner
On 11/13/2012 01:22 AM, Greg Smith wrote:
 Once you accept that eventually there need to be online conversion
 tools, there needs to be some easy way to distinguish which pages have
 been processed for several potential implementations.

Agreed. What I'm saying is that this identification doesn't need to be
as fine grained as a per-page bit. A single horizon or border is
enough, given an ordering of relations (for example by OID) and an
ordering of pages in the relations (obvious).
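
For illustration, that bookkeeping could be as small as this (names
invented, nothing of it exists in any patch; it assumes conversion
proceeds in OID order and from block zero upwards):

#include "postgres.h"
#include "storage/block.h"

/* the conversion "horizon", e.g. stored in pg_control */
typedef struct ChecksumProgress
{
	Oid			chk_next_relid;	/* relation currently being converted */
	BlockNumber	chk_next_block;	/* first block not yet checksummed */
} ChecksumProgress;

/* a page behind the horizon is expected to carry a valid checksum */
static bool
page_should_have_checksum(Oid relid, BlockNumber blkno,
						  const ChecksumProgress *p)
{
	if (relid < p->chk_next_relid)
		return true;
	if (relid == p->chk_next_relid && blkno < p->chk_next_block)
		return true;
	return false;
}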

 All of the table-based checksum enabling ideas ...

This is not really one - it doesn't allow per-table switching. It's just
meant to be a more compact way of representing which pages have been
checksummed and which not.

 I'm thinking of this in some ways like the way creation of a new (but
 not yet valid) foreign key works.  Once that's active, new activity is
 immediately protected moving forward.  And eventually there's this
 cleanup step needed, one that you can inch forward over a few days.

I understand that. However, I question if users really care. If a
corruption is detected, the clever DBA tells his trainee to immediately
check the file- and disk subsystem - no matter whether the corruption
was on old or new data.

You have a point in that pages with newer data are often more likely
to be re-read and thus get checked, whereas the checksums written
to pages with old data might not be re-read any time soon. Starting to
write checksums from the end of the relation could mitigate this to some
extent, though.

Also keep in mind the "quietly corrupted after checked once, but still
in the middle of checking a relation" case. Thus a single bit doesn't
really give us the guarantee you ask for. Sure, we can add more than one
bit. And yeah, if done properly, adding more bits exponentially reduces
the likelihood of a corruption inadvertently turning off checksumming
for a page.

All that said, I'm not opposed to using a few bits of the page header. I
wanted to outline an alternative that I think is viable and less intrusive.

 This is why I think any good solution to this problem needs to
 incorporate restartable conversion.

I fully agree to that.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
Jeff,

On 11/10/2012 12:08 AM, Jeff Davis wrote:
 The bit indicating that a checksum is present may be lost due to
 corruption.

Hm.. I see.

Sorry if that has been discussed before, but can't we do without that
bit at all? It adds a checksum switch to each page, where we just agreed
we don't even want a per-database switch.

Can we simply write a progress indicator to pg_control or someplace
saying that all pages up to X of relation Y are supposed to have valid
checksums?

That would mean having to re-calculate the checksums on pages that got
dirtied before VACUUM came along to migrate them to having a checksum,
but that seems acceptable. VACUUM could even detect that case and
wouldn't have to re-write it with the same contents.

I realize this doesn't support Jesper's use case of wanting to have the
checksums only for newly dirtied pages. However, I'd argue that
prolonging the migration to spread the load would allow even big shops
to go through this without much of an impact on performance.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
On 11/12/2012 05:55 AM, Greg Smith wrote:
 Adding an initdb option to start out with everything checksummed seems
 an uncontroversial good first thing to have available.

+1

So the following discussion really is for a future patch extending on
that initial checkpoint support.

 One of the really common cases I was expecting here is that conversions
 are done by kicking off a slow background VACUUM CHECKSUM job that might
 run in pieces.  I was thinking of an approach like this:
 
 -Initialize a last_checked_block value for each table
 -Loop:
 --Grab the next block after the last checked one
 --When on the last block of the relation, grab an exclusive lock to
 protect against race conditions with extension
 --If it's marked as checksummed and the checksum matches, skip it
 ---Otherwise, add a checksum and write it out
 --When that succeeds, update last_checked_block
 --If that was the last block, save some state saying the whole table is
 checkedsummed

Perfect, thanks. That's the rough idea I had in mind as well, written
out in detail and catching the extension case.

 With that logic, there is at least a forward moving pointer that removes
 the uncertainty around whether pages have been updated or not.  It will
 keep going usefully if interrupted too.  One obvious this way this can
 fail is if:
 
 1) A late page in the relation is updated and a checksummed page written
 2) The page is corrupted such that the "is this checksummed?" bits are
 not consistent anymore, along with other damage to it
 3) The conversion process gets to this page eventually
 4) The corruption of (2) isn't detected

IMO this just outlines how limited the use of the "is this checksummed"
bit in the page itself is. It just doesn't catch all cases. Is it worth
having that bit at all, given your block-wise approach above?

It really only serves to catch corruptions to *newly* dirtied pages
*during* the migration phase that *keep* that single bit set. Everything
else is covered by the last_checked_block variable. Sounds narrow enough
to be negligible. Then again, it's just a single bit per page...

 The only guarantee I see that we can give for online upgrades is that
 after a VACUUM CHECKSUM sweep is done, and every page is known to both
 have a valid checksum on it and have its checksum bits set, *then* any
 page that doesn't have both set bits and a matching checksum is garbage.

From that point in time on, we'd theoretically better use that bit as an
additional checksum bit rather than requiring it to be set at all times.
Really just theoretically, I'm certainly not advocating a 33 bit
checksum :-)

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
On 11/12/2012 10:44 AM, Craig Ringer wrote:
 That'll make it hard for VACUUM, hint-bit setting, etc to
 opportunistically checksum pages whenever they're doing a page write anyway.

It *is* a hard problem, yes. And the single bit doesn't really solve it.
So I'm arguing against opportunistically checksumming in general. Who
needs that anyway?

 Is it absurd to suggest using another bitmap, like the FSM or visibility
 map, to store information on page checksumming while checksumming is
 enabled but incomplete?

Not absurd. But arguably inefficient, because that bitmap may well
become a bottleneck itself. Plus there's the problem of making sure
those pages are safe against corruptions, so you'd need to checksum the
checksum bitmap... doesn't sound like a nice solution to me.

This has certainly been discussed before.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-12 Thread Markus Wanner
Jeff,

On 11/12/2012 06:52 PM, Jeff Davis wrote:
 OK, so here's my proposal for a first patch (changes from Simon's
 patch):
 
   * Add a flag to the postgres executable indicating that it should use
 checksums on everything. This would only be valid if bootstrap mode is
 also specified.
   * Add a multi-state checksums flag in pg_control, that would have
 three states: OFF, ENABLING, and ON. It would only be set to ON during
 bootstrap, and in this first patch, it would not be possible to set
 ENABLING.
   * Remove GUC and use this checksums flag everywhere.
   * Use the TLI field rather than the version field of the page header.
   * Incorporate page number into checksum calculation (already done).
   
 Does this satisfy the requirements for a first step? Does it interfere
 with potential future work?

As described before in this thread, I think we might be able to do
without the "has checksum" bit, as yet another simplification. But I
don't object to adding it, either.

 It's slightly better than that. It's more like: we can tell you if any
 of your data gets corrupted after transaction X. If old data is
 corrupted before transaction X, then there's nothing we can do. But if
 it's corrupted after transaction X (even if it's old data), the
 checksums should catch it.

I (mis?)read that as Greg referring to the intermediate (enabling)
state, where pages with old data may or may not have a checksum, yet. So
I think it was an argument against staying in that state any longer than
necessary.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
Jeff,

On 11/09/2012 02:01 AM, Jeff Davis wrote:
 For the sake of simplicity (implementation as well as usability), it
 seems like there is agreement that checksums should be enabled or
 disabled for the entire instance, not per-table.

Agreed. I've quickly thought about making it a per-database setting, but
how about shared system catalogs... Let's keep it simple and have a
single per-cluster instance switch for now.

 I don't think a GUC entirely makes sense (in its current form, anyway).
 We basically care about 3 states:
   1. Off: checksums are not written, nor are they verified. Pages that
 are newly dirtied have the checksum information in the header cleared.
   2. Enabling: checksums are written for every dirty page, but only
 verified for pages where the checksum is present (as determined by
 information in the page header).
   3. On: checksums are written for every dirty page, and verified for
 every page that's read. If a page does not have a checksum, it's
 corrupt.

Sounds sane, yes.

 And the next question is what commands to add to change state. Ideas:
 
CHECKSUMS ENABLE; -- set state to Enabling
CHECKSUMS DISABLE; -- set state to Off

Yet another SQL command doesn't feel like the right thing for such a
switch. Quick googling revealed that CHECKSUM is a system function in MS
SQL and MySQL knows a CHECKSUM TABLE command. And you never know what
the committee is coming up with next.

Apart from that, I'd like something more descriptive than just
"checksums". Block checksums? Heap checksums? Data checksums?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
On 11/09/2012 06:18 AM, Jesper Krogh wrote:
 I would definately stuff our system in state = 2 in your
 description if it was available.

Hm.. that's an interesting statement.

What's probably worst when switching from OFF to ON is the VACUUM run
that needs to touch every page (provided you haven't ever turned
checksumming on before). Maybe you want to save that step and still get
the additional safety for newly dirtied pages, right?

A use case worth supporting?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-09 Thread Markus Wanner
On 11/09/2012 07:53 PM, Jeff Davis wrote:
 One problem is telling which pages are protected and which aren't. We
 can have a couple bits in the header indicating that a checksum is
 present, but it's a little disappointing to have only a few bits
 protecting a 16-bit checksum.

Given your description of option 2 I was under the impression that each
page already has a bit indicating whether or not the page is protected
by a checksum. Why do you need more bits than that?

 Also, I think that people will want to have a way to protect their old
 data somehow.

Well, given that specific set of users is not willing to go through a
rewrite of each and every page of its database, it's hard to see how we
can protect their old data better.

However, we certainly need to provide the option to go through the
rewrite for other users, who are well willing to bite that bullet.

From a user's perspective, the trade-off seems to be: if you want your
old data to be covered by checksums, you need to go through such an
expensive VACUUM run that touches every page in your database.

If you don't want to or cannot do that, you can still turn on
checksumming for newly written pages. You won't get full protection and
it's hard to tell what data is protected and what not, but it's still
better than no checksumming at all. Especially for huge databases, that
might be a reasonable compromise.

One could even argue that this just leads to a prolonged migration and,
with time, the remaining VACUUM step becomes less and less frightening.

Do you see any real foot-guns or other show-stoppers for permanently
allowing that in-between-state?

Or do we have other viable options that prolong the migration and thus
spread the load better over time?

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical to physical page mapping

2012-10-30 Thread Markus Wanner
On 10/29/2012 12:05 PM, Robert Haas wrote:
 OK, I'll stop babbling now...

Not perceived as babbling here. Thanks for that nice round-up of options
and ideas around the torn page problem.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-17 Thread Markus Wanner
Simon,

On 10/16/2012 02:36 PM, Simon Riggs wrote:
 Where else would you put the hook? The hook's location as described
 won't change whether you decide you want 1, 2 or 3.

You assume we want an API that supports all three options. In that case,
yes, the hooks need to be very general.

Given that option 3 got by far the most support, I question whether we
need such a highly general API. I envision an API that keeps the
bookkeeping and cache lookup functionality within Postgres. So we have a
single, combined-effort, known working implementation for that.

What remains to be done within the plugin effectively is the consensus
problem: it all boils down to the question of which node gets the next
chunk of N sequence numbers. Where N can be 1 (default CACHE setting in
Postgres) or any higher number for better performance (reduces the total
communication overhead by a factor of N - or at least pretty close to
that, if you take into account lost chunks due to node failures).

A plugin providing that has to offer a method to request for a global
ordering and would have to trigger a callback upon reaching consensus
with other nodes on who gets the next chunk of sequence numbers. That
works for all N >= 1. And properly implements option 3 (but doesn't
allow implementations of options 1 or 2, which I claim we don't need,
anyway).
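
To give an idea of the shape of such an API - all names made up, nothing
like this exists anywhere yet:

#include "postgres.h"

/*
 * Hypothetical plugin interface: Postgres keeps the bookkeeping and the
 * per-node cache; the plugin only solves the consensus problem of which
 * node gets the next chunk of values.
 */
typedef void (*chunk_granted_hook) (Oid seqrelid,
									int64 chunk_start,
									int32 chunk_size);

typedef struct GlobalSequencePlugin
{
	/* ask the cluster, asynchronously, for the next chunk of values */
	void		(*request_chunk) (Oid seqrelid, int32 chunk_size);

	/* called by the plugin once this node has been granted a chunk */
	chunk_granted_hook chunk_granted;
} GlobalSequencePlugin;

nextval() would then serve values from the node-local cache and only go
through request_chunk() once that cache runs dry.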

 Implementations will be similar, differing mostly in the topology and
 transport layer

I understand that different users have different needs WRT transport
layers - moving the hooks as outlined above still allows flexibility in
that regard.

What different topologies do you have in mind? I'd broadly categorize
this all as multi-master. Do you need finer grained differentiation? Or
do you somehow include slaves (i.e. read-only transactions) in this process?

As you yourself are saying, implementations will only differ in that
way, let's keep the common code the same. And not require plugins to
duplicate that. (This also allows us to use the system catalogs for book
keeping, as another benefit).

 which means its not going to be possible to provide
 such a thing initially without slowing it down to the point we don't
 actually get it at all.

Sorry, I don't quite understand what you are trying to say, here.

Overall, thanks for bringing this up. I'm glad to see something
happening in this area, after all.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-17 Thread Markus Wanner
Tom,

On 10/16/2012 06:15 PM, Tom Lane wrote:
 I challenge you to find anything in the SQL standard that suggests that
 sequences have any nonlocal behavior.  If anything, what you propose
 violates the standard, it doesn't make us follow it more closely.

If you look at a distributed database as a transparent equivalent of a
single-node system, I'd say the SQL standard applies to the entire
distributed system. From that point of view, I'd rather argue that any
local-only behavior violates the standard.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-17 Thread Markus Wanner
Simon,

On 10/17/2012 10:34 AM, Simon Riggs wrote:
 IMHO an API is required for give me the next allocation of numbers,
 essentially a bulk equivalent of nextval().

Agreed. That pretty exactly matches what I described (and what's
implemented in Postgres-R). The API then only needs to be called every N
invocations of nextval(), because otherwise nextval() can simply return
a cached number previously allocated in a single step, eliminating a lot
of the communication overhead.

You realize an API at that level doesn't allow for an implementation of
options 1 and 2? (Which I'm convinced we don't need, so that's fine with
me).

 Anything lower level is going to depend upon implementation details
 that I don't think we should expose.

Exactly. Just like we shouldn't expose other implementation details,
like writing to system catalogs or WAL.

 I'm sure there will be much commonality between 2 similar
 implementations, just as there is similar code in each index type. But
 maintaining modularity is important and ahead of us actually seeing 2
 implementations, trying to prejudge that is going to slow us all down
 and likely screw us up.

Agreed. Let me add that modularity only serves a purpose if the
boundaries between the modules are chosen wisely. It sounds like we are
on the same page, though.

To back this up: IMHO an API for setval() is required to invalidate all
nodes' caches and re-set an initial value, as a starting point for the
next bulk of numbers that nextval() will return.

currval() doesn't need to be changed or hooked at all, because it's a
read-only operation.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Markus Wanner
On 10/16/2012 12:47 AM, Josh Berkus wrote:
 I'd also love to hear from the PostgresXC folks on whether this solution
 works for them.  Postgres-R too.

In Postgres-R, option 3) is implemented. Though, by default sequences
work just like on a single machine, giving you monotonically increasing
sequence values - independent from the node you call nextval() from. IMO
that's the user's expectation. (And yes, this has a performance penalty.
But no, there's no compromise in availability).

It is implemented very much like the per-backend cache we already have
in vanilla Postgres, but taken to the per-node level. This gives the
user a nice compromise between strongly ordered and entirely random
values, allowing fine-tuning the trade off between performance and
laziness in the ordering (think of CACHE 10 vs. CACHE 1).

 If it works for all three of those
 tools, it's liable to work for any potential new tool.

In Postgres-R, this per-node cache uses additional attributes in the
pg_sequence system catalog to store state of this cache. This is
something I'm sure is not feasible to do from within a plugin.

Why does a Global Sequences API necessarily hook at the nextval() and
setval() level? That sounds like it yields an awkward amount of
duplicate work. Reading this thread, so far it looks like we agree that
option 3) is the most feasible optimization (the strict ordering being
the un-optimized starting point). Do we really need an API that allows
for implementations of options 1) and 2)?

What I'd appreciate more is a common implementation for option 3) with
an API to plug in different solutions to the underlying consensus problem.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WAL_DEBUG logs spurious data

2012-10-11 Thread Markus Wanner
Hi,

I stumbled across a minor issue in xlog.c:1030: the WAL_DEBUG code block
there passes rdata->data to the rm_desc() method. However, that's only
the first XLogRecData struct, not the entire XLog record. So the
rm_desc() method effectively reports spurious data for any subsequent part.

Take a commit record with subxacts as an example: during XLogInsert,
Postgres reports the following:

LOG:  INSERT @ 0/16F3270: prev 0/16F3234; xid 688; len 16: Transaction -
commit: 2012-10-11 09:31:17.790368-07; subxacts: 3214563816

Note that the xid in subxacts is way off. During recovery from WAL, the
record is logged correctly:

LOG:  REDO @ 0/16F3270; LSN 0/16F329C: prev 0/16F3234; xid 688; len 16:
Transaction - commit: 2012-10-11 09:31:17.790368-07; subxacts: 689

Attached is a possible fix.

Regards

Markus Wanner
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***
*** 1033,1039  begin:;
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
! 		StringInfoData buf;
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
--- 1033,1056 
  #ifdef WAL_DEBUG
  	if (XLOG_DEBUG)
  	{
! 		StringInfoData	buf;
! 		char		   *full_rec = palloc(write_len), *ins_ptr;
! 
! 		/*
! 		 * We need to assemble the entire record once, to be able to dump it out
! 		 * properly.
! 		 */
! 		rdt = rdata;
! 		ins_ptr = full_rec;
! 		while (rdt)
! 		{
! 			if (rdt->data != NULL)
! 			{
! 				memcpy(ins_ptr, rdt->data, rdt->len);
! 				ins_ptr += rdt->len;
! 			}
! 			rdt = rdt->next;
! 		}
  
  		initStringInfo(&buf);
  		appendStringInfo(&buf, "INSERT @ %X/%X: ",
***
*** 1042,1051  begin:;
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rechdr->xl_rmid].rm_desc(&buf, rechdr->xl_info, rdata->data);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
  	}
  #endif
  
--- 1059,1069 
  		if (rdata->data != NULL)
  		{
  			appendStringInfo(&buf, " - ");
! 			RmgrTable[rechdr->xl_rmid].rm_desc(&buf, rechdr->xl_info, full_rec);
  		}
  		elog(LOG, "%s", buf.data);
  		pfree(buf.data);
+ 		pfree(full_rec);
  	}
  #endif
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL_DEBUG logs spurious data

2012-10-11 Thread Markus Wanner
Tom,

On 10/11/2012 03:11 PM, Tom Lane wrote:
 The original design intention was that rm_desc should not attempt to
 print any such data, but obviously some folks didn't get the word.

FWIW: in case the source code contains comments explaining that
intention, I certainly missed them so far.

Regards

Markus


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL_DEBUG logs spurious data

2012-10-11 Thread Markus Wanner
On 10/11/2012 04:06 PM, Tom Lane wrote:
 Yeah, if we decide to stick with the limitation, some documentation
 would be called for.  I remember having run into this and having removed
 functionality from an rm_desc function rather than question the premise.
 But maybe the extra functionality is worth the cycles.

Well, I've been interested in getting it correct, and didn't care about
performance.

But I can certainly imagine someone enabling it on a production system
to get more debug info. In which case performance would matter more.
However, WAL_DEBUG being a #define, I bet only very few admins do that.
So I tend towards sacrificing performance for better debug info in the
WAL_DEBUG case. (Especially given that you can still disable it via GUC).

Regards

Markus


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BlockNumber initialized to InvalidBuffer?

2012-07-13 Thread Markus Wanner
On 07/11/2012 05:45 AM, Tom Lane wrote:
 I'm also inclined to think that the while (stack) coding of the rest
 of it is wrong, misleading, or both, on precisely the same grounds: if
 that loop ever did fall out at the test, the function would have failed
 to honor its contract.  The only correct exit points are the returns
 in the middle.

I came to the same conclusion, yes. Looks like the additional asserts in
the attached patch all hold true.

As another minor improvement, it doesn't seem necessary to repeatedly
set the rootBlkno.

Regards

Markus Wanner
#
# old_revision [d90761edf47c6543d4686a76baa0b4e2a7ed113b]
#
# patch src/backend/access/gin/ginbtree.c
#  from [2d3e63387737b4034fc25ca3cb128d9ac57f4f01]
#to [62efe71b8e3429fc1bf2d2742f8d2fa69f2f4de6]
#

*** src/backend/access/gin/ginbtree.c	2d3e63387737b4034fc25ca3cb128d9ac57f4f01
--- src/backend/access/gin/ginbtree.c	62efe71b8e3429fc1bf2d2742f8d2fa69f2f4de6
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 276,294 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno = InvalidBuffer;
  	Page		page,
  rpage,
  lpage;
  
  	/* remember root BlockNumber */
! 	while (parent)
  	{
  		rootBlkno = parent->blkno;
  		parent = parent->parent;
! 	}
  
! 	while (stack)
  	{
  		XLogRecData *rdata;
  		BlockNumber savedRightLink;
--- 276,298 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno;
  	Page		page,
  rpage,
  lpage;
  
  	/* remember root BlockNumber */
! 	Assert(stack != NULL);
! 	parent = stack;
! 	do
  	{
  		rootBlkno = parent->blkno;
  		parent = parent->parent;
! 	} while (parent);
  
! 	Assert(BlockNumberIsValid(rootBlkno));
! 
! 	for (;;)
  	{
  		XLogRecData *rdata;
  		BlockNumber savedRightLink;
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 469,474 
--- 473,479 
  
  		UnlockReleaseBuffer(stack->buffer);
  		pfree(stack);
+ 		Assert(parent != NULL);		/* parent == NULL case is handled above */
  		stack = parent;
  	}
  }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BlockNumber initialized to InvalidBuffer?

2012-07-13 Thread Markus Wanner
On 07/13/2012 11:33 AM, Markus Wanner wrote:
> As another minor improvement, it doesn't seem necessary to repeatedly
> set the rootBlkno.

Sorry, my mail program delivered an older version of the patch, which
didn't reflect that change. Here's what I intended to send.

Regards

Markus Wanner
#
# Minor cleanup of ginInsertValue with additional asserts and
# a tighter loop for finding the root in the GinBtreeStack.
#

*** src/backend/access/gin/ginbtree.c	2d3e63387737b4034fc25ca3cb128d9ac57f4f01
--- src/backend/access/gin/ginbtree.c	f6dc88ae5716275e62a6f6715aa7204abd430089
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 276,294 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno = InvalidBuffer;
  	Page		page,
  rpage,
  lpage;
  
! 	/* remember root BlockNumber */
! 	while (parent)
! 	{
! 		rootBlkno = parent->blkno;
  		parent = parent->parent;
- 	}
  
! 	while (stack)
  	{
  		XLogRecData *rdata;
  		BlockNumber savedRightLink;
--- 276,296 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno;
  	Page		page,
  rpage,
  lpage;
  
! 	/* extract root BlockNumber from stack */
! 	Assert(stack != NULL);
! 	parent = stack;
! 	while (parent->parent)
  		parent = parent->parent;
  
! 	rootBlkno = parent->blkno;
! 	Assert(BlockNumberIsValid(rootBlkno));
! 
! 	for (;;)
  	{
  		XLogRecData *rdata;
  		BlockNumber savedRightLink;
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 469,474 
--- 471,477 
  
  		UnlockReleaseBuffer(stack->buffer);
  		pfree(stack);
+ 		Assert(parent != NULL);		/* parent == NULL case is handled above */
  		stack = parent;
  	}
  }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] BlockNumber initialized to InvalidBuffer?

2012-07-10 Thread Markus Wanner
Hackers,

I stumbled across an initialization of a BlockNumber with InvalidBuffer,
which seems strange to me, as the values for invalid of the two types
are different, see attached patch.

In case the 'stack' argument passed to that function is not NULL, the
variable in question gets overridden immediately, in which case it
certainly doesn't matter. I don't know nor did I check whether or not it
can ever be NULL. So this might not be a real issue at all.
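
For reference, a minimal sketch of the two definitions in question, as far
as I recall them from buf.h and block.h (treat the exact values as an
assumption and double-check against your tree):

#include <stdint.h>

typedef uint32_t BlockNumber;	/* simplified stand-in for the real typedef */

#define InvalidBuffer		0
#define InvalidBlockNumber	((BlockNumber) 0xFFFFFFFF)

/*
 * InvalidBuffer happens to be a perfectly valid BlockNumber (block zero),
 * which is why initializing a BlockNumber with it looks suspicious.
 */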

Regards

Markus Wanner
# InvalidBlockNumber is -1 (or rather 0xFFFFFFFF), while
# the currently used InvalidBuffer is 0, which is a valid
# BlockNumber.

*** src/backend/access/gin/ginbtree.c	2d3e63387737b4034fc25ca3cb128d9ac57f4f01
--- src/backend/access/gin/ginbtree.c	67351e1b6541b25ab3c8e8dc7a57487c2422e124
*** ginInsertValue(GinBtree btree, GinBtreeS
*** 276,282 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno = InvalidBuffer;
  	Page		page,
  rpage,
  lpage;
--- 276,282 
  ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  {
  	GinBtreeStack *parent = stack;
! 	BlockNumber rootBlkno = InvalidBlockNumber;
  	Page		page,
  rpage,
  lpage;


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pruning the TODO list

2012-06-30 Thread Markus Wanner
Hi,

On 06/22/2012 05:38 PM, Andrew Dunstan wrote:
> I think the real problem with the TODO list is that some people see it
> as some sort of official roadmap, and it really isn't. Neither is there
> anything else that is.

To me, it looks like TODO is just a misnomer. The file should be named
TODISCUSS, IDEAS, or something. But the current file name implies consensus.

Wouldn't renaming solve that kind of misunderstanding? (..in the vein of
"address(ing) real problems in the simplest way"..)

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Robert,

On 08/25/2011 03:24 PM, Robert Haas wrote:
> My hope (and it might turn out that I'm an optimist) is that even with
> a reasonably small buffer it will be very rare for a backend to
> experience a wraparound condition.

It certainly seems less likely than with the ring-buffer for imessages, yes.

Note, however, that for imessages, I've also had the policy in place
that a backend *must* consume its message before sending any.  And that
I took great care for all receivers to consume their messages as early
as possible.  None the less, I kept incrementing the buffer size (to
multiple megabytes) to make this work.  Maybe I'm overcautious because
of that experience.

> - a handful of XIDs at most - because, on the average, transactions
> are going to commit in *approximately* increasing XID order

This assumption quickly turns false, if you happen to have just one
long-running transaction, I think.  Or in general, if transaction
duration varies a lot.

> So the backend taking a snapshot only needs
> to be able to copy ~64 bytes of information from the ring buffer
> before other backends write ~27k of data into that buffer, likely
> requiring hundreds of other commits.

You said earlier that only the latest snapshot is required.  It takes
only a single commit for such a snapshot to not be the latest anymore.

Instead, if you keep around older snapshots for some time - as your
description here implies - readers are free to copy from those older
snapshots while other backends are able to make progress concurrently
(writers or readers of other snapshots).

However, that either requires keeping track of readers of a certain
snapshot (reference counting) or - as I understand your description -
you simply invalidate all concurrent readers upon wrap-around, or something.

> That seems vanishingly unlikely;

Agreed.

> Now, as the size of the snapshot gets bigger, things will eventually
> become less good.

Also keep configurations with increased max_connections in mind.  With
that, not only do the snapshots get bigger, but more processes have to
share CPU time, on average making memcpy slower for a single process.

> Of course even if wraparound turns out not to be a problem there are
> other things that could scuttle this whole approach, but I think the
> idea has enough potential to be worth testing.  If the whole thing
> crashes and burns I hope I'll at least learn enough along the way to
> design something better...

That's always a good motivation.  In that sense: happy hacking!

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Robert,

On 08/25/2011 04:48 PM, Robert Haas wrote:
> What's a typical message size for imessages?

Most message types in Postgres-R are just a couple bytes in size.
Others, especially change sets, can be up to 8k.

However, I think you'll have an easier job guaranteeing that backends
consume their portions of the ring-buffer in time.  Plus wrap-around
isn't that much of a problem in your case.  (I couldn't drop imessage,
but had to let senders wait).

> Well, one long-running transaction that only has a single XID is not
> really a problem: the snapshot is still small.  But one very old
> transaction that also happens to have a large number of
> subtransactions all of which have XIDs assigned might be a good way to
> stress the system.

Ah, right, that's why it's a list of transactions in progress and not a
list of completed transactions in SnapshotData... good.

> Each reader decides which data he needs to copy from the buffer, and
> then copies it, and then checks whether any of it got overwritten
> before the copy was completed.  So there's a lively possibility that
> the snapshot that was current when the reader began copying it will no
> longer be current by the time he finishes copying it, because a commit
> has intervened.  That's OK: it just means that, effectively, the
> snapshot is taken at the moment the start and stop pointers are read,
> and won't take into account any commits that happen later, which is
> exactly what a snapshot is supposed to do anyway.

Agreed, that makes sense.  Thanks for explaining.

> There is a hopefully quite small possibility that by the time the
> reader finishes copying it so much new data will have been written to
> the buffer that it will have wrapped around and clobbered the portion
> the reader was interested in.  That needs to be rare.

Yeah.
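
Just to check my understanding, here's a toy sketch of that
copy-then-validate read (my own hypothetical names and layout, not your
actual proposal; it ignores memory-ordering details and the split copy at
the wrap point):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE 65536

typedef struct
{
	_Atomic unsigned long oldest;	/* lowest byte position still valid */
	char		data[RING_SIZE];
} SnapRing;

/*
 * Copy "len" bytes of snapshot data starting at absolute position "start".
 * Returns false if the writer reclaimed that region while we were copying,
 * in which case the caller has to retry against a newer snapshot.
 */
static bool
ring_read(SnapRing *ring, unsigned long start, size_t len, char *dest)
{
	memcpy(dest, ring->data + start % RING_SIZE, len);
	return atomic_load(&ring->oldest) <= start;
}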

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-25 Thread Markus Wanner
Tom,

On 08/25/2011 04:59 PM, Tom Lane wrote:
> That's a good point.  If the ring buffer size creates a constraint on
> the maximum number of sub-XIDs per transaction, you're going to need a
> fallback path of some sort.

I think Robert envisions the same fallback path we already have:
subxids.overflowed.
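
For the archives, a rough sketch of the per-backend subxid cache that
fallback builds on (field names roughly follow PGPROC's XidCache; the
constant is from memory, so consider it an assumption):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define PGPROC_MAX_CACHED_SUBXIDS	64	/* assumed value; see proc.h */

typedef struct
{
	bool		overflowed;	/* cache incomplete: consult pg_subtrans */
	int			nxids;		/* number of cached subtransaction XIDs */
	TransactionId xids[PGPROC_MAX_CACHED_SUBXIDS];
} SubXidCacheSketch;

/*
 * When a transaction assigns more subtransaction XIDs than fit into xids[],
 * overflowed is set and snapshot readers must fall back to pg_subtrans
 * instead of relying on the cached list -- the same escape hatch would
 * work for a ring-buffer based snapshot.
 */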

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Hello Dimitri,

On 08/23/2011 06:39 PM, Dimitri Fontaine wrote:
> I'm far from familiar with the detailed concepts here, but allow me to
> comment.  I have two open questions:
>
>  - is it possible to use a distributed algorithm to produce XIDs,
>    something like Vector Clocks?
>
>    Then each backend is able to create a snapshot (well, and XID) on its
>    own, and any backend is still able to compare its snapshot to any
>    other snapshot (well, XID)

Creation of snapshots and XID assignment are not as related as you imply
here.  Keep in mind that a read-only transaction has a snapshot, but no
XID.  (Not sure if it's possible for a transaction to have an XID, but
no snapshot.  If it only touches system catalogs with SnapshotNow,
maybe?  Don't think we support that, ATM).

>  - is it possible to cache the production of the next snapshots so that
>    generating an XID only means getting the next in a pre-computed
>    vector?

The way I look at it, what Robert proposed can be thought of as "cache
the production of the next snapshot", with a bit of a stretch of what a
cache is, perhaps.  I'd rather call it "early snapshot creation", maybe
"look-ahead" something.

ATM backends all scan ProcArray to generate their snapshot.  Instead,
what Robert proposes would - sometimes, somewhat - move that work from
snapshot creation time to commit time.

As Tom points out, the difficulty lies in the question of when it's
worth doing that:  if you have lots of commits in a row, and no
transaction ever uses the (pre-generated) snapshots of the point in time
in between, then those were wasted.  OTOH, if there are just very few
COMMITs spread across lots of writes, the read-only backends will
re-create the same snapshots, over and over again.  Seems wasteful as
well (as GetSnapshotData popping up high on profiles confirms somewhat).

Hope to have cleared up things a bit.

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Robert, Jim,

thanks for thinking out loud about dynamic allocation of shared memory.
 Very much appreciated.

On 08/23/2011 01:22 AM, Robert Haas wrote:
> With respect to a general-purpose shared memory allocator, I think
> that there are cases where that would be useful to have, but I don't
> think there are as many of them as many people seem to think.  I
> wouldn't choose to implement this using a general-purpose allocator
> even if we had it, both because it's undesirable to allow this or any
> subsystem to consume an arbitrary amount of memory (nor can it fail...
> especially in the abort path) and because a ring buffer is almost
> certainly faster than a general-purpose allocator.

I'm in respectful disagreement regarding the ring-buffer approach and
think that dynamic allocation can actually be more efficient if done
properly, because there don't need to be head and tail pointers, which
might turn into a point of contention.

As a side note: I've been there with imessages.  Those were first
organized as a ring-buffer.  The major problem with that approach was
that imessages were consumed with varying delay.  In case an imessage was
left there for a longer amount of time, it blocked creation of new
imessages, because the ring-buffer cycled around once and its head
arrived back at the unconsumed imessage.

IIUC (which might not be the case) the same issue applies for snapshots.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-08-24 Thread Markus Wanner
Robert,

On 08/25/2011 04:59 AM, Robert Haas wrote:
 True; although there are some other complications.  With a
 sufficiently sophisticated allocator you can avoid mutex contention
 when allocating chunks, but then you have to store a pointer to the
 chunk somewhere or other, and that then requires some kind of
 synchronization.

Hm.. right.

> One difference with snapshots is that only the latest snapshot is of
> any interest.

Theoretically, yes.  But as far as I understood, you proposed the
backends copy that snapshot to local memory.  And copying takes some
amount of time, possibly being interrupted by other backends which add
newer snapshots...  Or do you envision the copying to restart whenever a
new snapshot arrives?

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POSIX question

2011-06-26 Thread Markus Wanner
Radek,

On 06/20/2011 03:27 PM, Radosław Smogura wrote:
> When playing with mmap I done some approach how to deal with growing
> files, so...

Your approach seems to require a SysV alloc (for nattach) as well as
POSIX shmem and/or mmap.  Adding requirements for these syscalls
certainly needs to give a good benefit for Postgres, as they presumably
pose portability issues.

>  3. a. Lock header when adding chunks (1st chunk is header) (we don't
>  want concurrent chunk allocation)

Sure we don't?  There are at least a dozen memory allocators for
multi-threaded applications, all trying to optimize for concurrency.
The programmer of a multi-threaded application doesn't need to care much
about concurrent allocations.  He can allocate (and free) quite a lot of
tiny chunks concurrently from shared memory.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Exporting closePGconn from libpq

2011-05-16 Thread Markus Wanner
Leon,

On 05/14/2011 05:23 PM, Leon Smith wrote:
 A minor issue has come up in creating low-level bindings to libpq for
 safe garbage-collected languages,  namely that PQfinish is the only
 (AFAICT) way to close a connection but also de-allocates the memory
 used to represent the database connection.It would be preferable
 to call PQfinish to free the memory in a finalizer,  but appilcations
 need a way to disconnect from the database at a predictable and
 deterministic point in time,  whereas leaving a bit of memory around
 until the GC finally gets to it is relatively harmless.

It's harmless, but I think it's also useless.  Or why do you want to
keep that (libpq-private) bit of memory around beyond PQfinish()?

I'm not sure what language or VM you have in mind, but your description
sounds like you are writing a wrapper by definition.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] stored procedures - use cases?

2011-05-12 Thread Markus Wanner
Hi,

On 05/10/2011 02:55 PM, Robert Haas wrote:
> On Mon, May 9, 2011 at 11:58 PM, Pavel Stehule pavel.steh...@gmail.com
> wrote:
>> no - you are little bit confused :). CALL and function execution
>> shares nothing. There is significant differences between function and
>> procedure. Function is called only from executor - from some plan, and
>> you have to know a structure of result before run. The execution of
>> CALL is much simple - you just execute code - without plan and waiting
>> for any result - if there is.

I think the distinction between function and procedure is misleading
here.  Some envision stored *procedures* to be able to return values,
result sets and possibly even *multiple* result sets.

> The main features seem to be (1) explicit transaction control
> and/or execution of commands like VACUUM that can't be invoked from
> within a transaction,

I think that's the main point of stored procedures.

> (2) autonomous transactions

To me autonomous transactions seem orthogonal.  Those can be used to
implement (1) above, but might have other uses for regular transactions
as well.

(The point I'm taking home here is that you might want to control not
only one concurrent transaction, but several from a stored procedure.
 So far, I assumed only one.)

> and (3) returning
> multiple result sets.  But I don't think anybody would be desperately
> unhappy if it magically became possible to do those things from
> regular functions, unlikely as that may seem.

That point definitely is on my wish-list for UDFs already.  I didn't
think of this as having to do with stored procedures, either.

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] full_page_writes makes no difference?

2011-05-04 Thread Markus Wanner
Hi,

On 05/04/2011 03:46 AM, Tian Luo wrote:
> No matter I turn on or turn off the full_page_writes, I always
> observe 8192-byte writes of log data for simple write operations
> (write/update).

How did you measure that?  A single transaction doing a single write, I
guess.  Ever tried multiple transactions with a simple write operation
each and checking how much WAL that spits out per transaction?

As I understand it, dirty blocks are written to disk as soon as
feasible.  After all, that helps crash recovery.  With a basically idle
system, as soon as feasible might be pretty soon.  However, put your
(disk sub-) system under load and as soon as feasible might take awhile.

> But according to the document, when this is off, it could speed up
> operations but may cause problems during recovery. So, I guess this is
> because it writes less when the option is turned off. However, this
> contradicts my observations

I think you didn't trigger the savings.  It's about writing full pages
on the first write to a block after a checkpoint.  Did you monitor
checkpoint times of Postgres in your tests?
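
In other words, a simplified sketch of the rule (placeholder names, not
the exact code path in xlog.c):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtrSketch;	/* stand-in for XLogRecPtr */

/*
 * A full copy of the data page is attached to the WAL record only for the
 * first modification of a page after a checkpoint started, and only while
 * full_page_writes is enabled.  Later changes to the same page (until the
 * next checkpoint) log just the per-record data.
 */
static bool
needs_full_page_image(bool full_page_writes,
					  XLogRecPtrSketch page_lsn,	/* LSN stamped on the page */
					  XLogRecPtrSketch redo_ptr)	/* REDO ptr of last checkpoint */
{
	return full_page_writes && page_lsn <= redo_ptr;
}

So to actually see a difference, the test would have to dirty the same
pages repeatedly between checkpoints and compare the WAL volume with the
setting on and off.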

> If I am not missing anything, I find that the writes of log data go
> through function XLogWrite in source file
> backend/access/transam/xlog.c.
>
> In this file, log data are written with the following code:
>
> from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
> nbytes = npages * (Size) XLOG_BLCKSZ;
> if (write(openLogFile, from, nbytes) != nbytes)
> {
>  ...
> }
>
> So, nbytes should always be multiples of XLOG_BLCKSZ, which in the
> default case, is 8192.

That observation seems correct.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - asynchronous functions

2011-04-27 Thread Markus Wanner
On 04/26/2011 11:17 PM, Robert Haas wrote:
> IIRC, we kind of got stuck on the prerequisite wamalloc patch, and that sunk
> the whole thing.  :-(

Right, that prerequisite was the largest stumbling block.  As I
certainly mentioned back then, it should be possible to get rid of the
imessages dependency (and thus wamalloc).  So whoever really wants to
implement asynchronous functions (or autonomous transactions) is more
than welcome to try that.

Please keep in mind that you'd need an alternative communication path.
Not only for the bgworker infrastructure itself, but for communication
between the requesting backend and the bgworker (except for
fire-and-forget jobs like autovacuum, of course.  OTOH even those could
benefit from communicating back their state to the coordinator.. eh..
autovacuum launcher).

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal - asynchronous functions

2011-04-26 Thread Markus Wanner
Robert,

On 04/26/2011 02:25 PM, Robert Haas wrote:
> We've talked about a number of features that could benefit from some
> kind of worker process facility (e.g. logical replication, parallel
> query).  So far no one has stepped forward to build such a facility,
> and I think without that this can't even get off the ground.

Remember the bgworker patches extracted from Postgres-R?

[ Interestingly enough, one of the complaints I heard back then (not
necessarily from you) was that there's no user for bgworkers, yet.
Smells a lot like a chicken and egg problem to me. ]

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-23 Thread Markus Wanner
On 03/22/2011 09:33 PM, Robert Haas wrote:
> We might have a version of synchronous replication that works this way
> some day, but it's not the version we're shipping with 9.1.  The slave
> acknowledges the WAL records when they hit the disk (i.e. fsync) not
> when they are applied; WAL apply can lag arbitrarily.  The point is to
> guarantee clients that the WAL is on disk somewhere and that it will
> be replayed in the event of a failover.  Despite the fact that this
> doesn't work as you're describing, it's a useful feature in its own
> right.

In that sense, our approach may be more synchronous than most others,
because after the ACK is sent from the slave, the slave still needs to
apply the transaction data from WAL before it gets visible, while the
master needs to wait for the ACK to arrive at its side, before making it
visible there.

Ideally, these two latencies (disk seek and network induced) are just
about equal.  But of course, there's no such guarantee.  So whenever one
of the two is off by an order of magnitude or two (by use case or due to
a temporary overload), either the master or the slave may lag behind the
other machine.

What pleases me is that the guarantee from the slave is somewhat similar
to Postgres-R's: with its ACK, the receiving node doesn't guarantee the
transaction *is* applied locally, it just guarantees that it *will* be
able to do so sometime in the future.  Kind of a mind twister, though...

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-23 Thread Markus Wanner
On 03/23/2011 12:52 PM, Robert Haas wrote:
> Yes.  What this won't do is let you build a big load-balancing network
> (at least not without great caution about what you assume).

This sounds too strong to me.  Session-aware load balancing is pretty
common these days.  It's the default mode of PgBouncer, for example.
Not much caution required there, IMO.  Or what pitfalls did you have in
mind?

> What it
> will do is make it really, really hard to lose committed transactions.
> Both good things, but different.

..you can still get both at the same time.  At least as long as you are
happy with session-aware load balancing.  And who really needs finer
grained balancing?

(Note that no matter how fine-grained you balance, you are still bound
to a (single core of a) single node.  That changes with distributed
querying, and things really start to get interesting there... but we are
far from that, yet).

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

