Re: [PATCHES] Preliminary GSSAPI Patches

2007-06-22 Thread Magnus Hagander
Magnus Hagander wrote:
 Be curious to see what you've done, but if you're actively changing  
 things I'll let them settle.
 
 I've got a bit more cleanup to do, but I'm almost there.
 
 Much of it is just cleanup. I've changed the structs around to be more in
 line with the other code around it, and such. Refactored some of the code
 to cut down on duplicate code. Added some stuff to make it work on Windows
 (still just with MIT Kerberos and not native, though). Fixed two (I think it
 was) small memory leaks.
 
 Protocol-wise, it no longer piggybacks in the AuthenticationOk message -
 instead we send an extra continue message followed right away by an
 AuthenticationOk one (a sketch follows below this quote).
 
 Oh, and I've added autoconf. Not complete yet, but getting there :-)
 
 I'll post the updated patch shortly :-)
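The protocol change described above works out to roughly the following on the
backend side (a sketch only, not the patch text: the helper name is made up,
the request codes 8 and 0 correspond to AuthenticationGSSContinue and
AuthenticationOk in the v3 frontend/backend protocol, and the framing uses the
existing pqformat routines):

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "libpq/pqformat.h"
    #include <gssapi/gssapi.h>

    /*
     * After gss_accept_sec_context() succeeds, send any remaining output
     * token as its own authentication-request message, then the normal
     * AuthenticationOk, instead of piggybacking the token on the latter.
     */
    static void
    send_gss_continue_then_ok(gss_buffer_desc gbuf)
    {
        StringInfoData mbuf;

        if (gbuf.length > 0)
        {
            pq_beginmessage(&mbuf, 'R');
            pq_sendint(&mbuf, 8, 4);        /* AuthenticationGSSContinue */
            pq_sendbytes(&mbuf, (char *) gbuf.value, (int) gbuf.length);
            pq_endmessage(&mbuf);
        }

        pq_beginmessage(&mbuf, 'R');
        pq_sendint(&mbuf, 0, 4);            /* AuthenticationOk */
        pq_endmessage(&mbuf);
    }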

Ok. Here's the version I have right now, sans autoconf (which I broke in
my attempts to make it work with mingw).

I have one major question remaining:
We enable the setting of the service name in the server configuration
file, but we never use that variable anywhere. We do, however, use the
service name on the client, in order to pick the correct key (and
turning this off makes GSSAPI no longer work).

If this is correct, we should not enable that parameter on the server.
If it's not correct, we should be using it somewhere.

Is this perhaps a leftover from the old gssapi-encryption code? In that
case we would need that parameter on the server in order to enable
encryption, but could remove it for now, until we have that. (Since the
parameter is around for krb5 anyway, removing it just means #ifdefing it
back out, of course, not actually removing it.)

(Still working on the documentation part)

//Magnus


diff -cr pgsql.orig/src/backend/libpq/auth.c pgsql/src/backend/libpq/auth.c
*** pgsql.orig/src/backend/libpq/auth.c	2007-02-08 05:52:18.0 +0100
--- pgsql/src/backend/libpq/auth.c	2007-06-22 12:39:25.0 +0200
***************
*** 295,300 ****
--- 295,539 ----
  }
  #endif   /* KRB5 */
  
+ #ifdef ENABLE_GSS
+ /*
+  * GSSAPI authentication system
+  *
+  */
+ 
+ #include <gssapi/gssapi.h>
+ 
+ #ifdef WIN32
+ /*
+  * MIT Kerberos GSSAPI DLL doesn't properly export the symbols
+  * that contain the OIDs required. Redefine here, values copied
+  * from src/athena/auth/krb5/src/lib/gssapi/generic/gssapi_generic.c
+  */
+ static const gss_OID_desc GSS_C_NT_USER_NAME_desc =
+  {10, (void *)"\x2a\x86\x48\x86\xf7\x12\x01\x02\x01\x02"};
+ static GSS_DLLIMP gss_OID GSS_C_NT_USER_NAME = &GSS_C_NT_USER_NAME_desc;
+ #endif
+ 
+ 
+ static void
+ pg_GSS_error(int severity, char *text, OM_uint32 maj_stat, OM_uint32 min_stat)
+ {
+ 	gss_buffer_desc	gmsg;
+ 	OM_uint32		lmaj_s, lmin_s, msg_ctx;
+ 	char			localmsg1[128],
+ 					localmsg2[128];
+ 
+ 	/* Fetch major status message */
+ 	msg_ctx = 0;
+ 	lmaj_s = gss_display_status(&lmin_s, maj_stat, GSS_C_GSS_CODE,
+ 			GSS_C_NO_OID, &msg_ctx, &gmsg);
+ 	strlcpy(localmsg1, gmsg.value, sizeof(localmsg1));
+ 	gss_release_buffer(&lmin_s, &gmsg);
+ 
+ 	if (msg_ctx)
+ 		/* More than one message available.
+ 		 * XXX: Should we loop and read all messages?
+ 		 * (same below)
+ 		 */
+ 		ereport(WARNING,
+ 				(errmsg_internal("incomplete GSS error report")));
+ 
+ 	/* Fetch mechanism minor status message */
+ 	msg_ctx = 0;
+ 	lmaj_s = gss_display_status(&lmin_s, min_stat, GSS_C_MECH_CODE,
+ 			GSS_C_NO_OID, &msg_ctx, &gmsg);
+ 	strlcpy(localmsg2, gmsg.value, sizeof(localmsg2));
+ 	gss_release_buffer(&lmin_s, &gmsg);
+ 
+ 	if (msg_ctx)
+ 		ereport(WARNING,
+ 				(errmsg_internal("incomplete GSS minor error report")));
+ 
+ 	/* errmsg_internal, since translation of the first part must be
+ 	 * done before calling this function anyway. */
+ 	ereport(severity,
+ 			(errmsg_internal("%s:%s\n%s", text, localmsg1, localmsg2)));
+ }
+ 
+ static int
+ pg_GSS_recvauth(Port *port)
+ {
+ 	OM_uint32		maj_stat, min_stat, lmin_s, gflags;
+ 	char		   *kt_path;
+ 	int				mtype;
+ 	int				n_eq;
+ 	StringInfoData	buf;
+ 	gss_buffer_desc	gbuf;
+ 	gss_name_t		gnbuf;
+ 
+ 	if (pg_krb_server_keyfile && strlen(pg_krb_server_keyfile) > 0)
+ 	{
+ 		/*
+ 		 * Set default Kerberos keytab file for the Krb5 mechanism.
+ 		 *
+ 		 * setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0);
+ 		 *		except setenv() not always available.
+ 		 */
+ 		if (!getenv("KRB5_KTNAME"))
+ 		{
+ 			kt_path = palloc(PATH_MAX + 13);
+ 			snprintf(kt_path, PATH_MAX + 13,
+ 					 "KRB5_KTNAME=%s", pg_krb_server_keyfile);
+ 			putenv(kt_path);
+ 		}
+ 	}
+ 
+ 	port->gss->ctx = GSS_C_NO_CONTEXT;
+ 	port->gss->cred = GSS_C_NO_CREDENTIAL;
+ 
+ 	/*
+ 	 * Loop through GSSAPI message exchange. This exchange can consist
+ 	 * of multiple messages sent in both directions. First message is always
+ 	 * from the client. All messages from client to server are password
+ 	 * packets (type 'p').
+ 	 */
+ 	do 
+ 	{
+ 		mtype = pq_getbyte();
+ 		if (mtype != 'p')
+ 		{
+ 			/* Only log error 

Re: [PATCHES] Transaction Guarantee, updated version

2007-06-22 Thread Simon Riggs
On Thu, 2007-06-21 at 16:05 -0400, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Completed all of the agreed changes for TG:
 
 I've just realized that there's a fatal problem with this design.
 We've now got tqual.c setting page LSN when it holds only share lock
 on the buffer.  That will absolutely not work, eg two backends might
 concurrently set different values and end up with garbage (since it's
 unlikely that LSN store is atomic).
 
 Can we fix it to be a read test instead of a write test, that is, if
 we know WAL has been flushed through the target LSN, it's safe to set
 the hint bit, else not?

Yes, that's roughly how it worked in v1.

The various possible strategies for handling hint bits were:
1. flush WAL to current insert pointer - very heavy perf hit
2. defer setting the hint bits at all if the transaction was recently
written. To do this we need a cache of recently written deferred commit
Xids.
3. flush WAL up to the LSN of the Xid, if it is a deferred commit. To do
this we need a cache of recently written deferred commit Xids.

Any more?

v1 implemented (2) on the basis that it was a small timing hole that was
best handled by doing as little as possible, though we could easily
implement (3) instead. But you need the deferred commit cache either
way, AFAICS.

So the proposal is: maintain a cache of unflushed deferred-commit xacts
and their corresponding LSNs. When you need to set a commit hint bit you
look up the required LSN for the xact in the cache, then flush WAL up to
that LSN. If the xact isn't in the cache, do nothing, because it is
either an already-flushed deferred commit or a normal commit (therefore
already flushed too). Don't update the page's LSN at all. More code, but
it works correctly and efficiently.

This is pretty much putting back most of v1, though.
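A minimal sketch of that cache lookup, purely to illustrate the proposal (the
structure, size, and names are illustrative rather than taken from any patch;
XLogFlush() and TransactionIdEquals() are the existing backend facilities):

    #include "postgres.h"
    #include "access/transam.h"
    #include "access/xlog.h"

    typedef struct DeferredCommit
    {
        TransactionId xid;
        XLogRecPtr    lsn;          /* LSN of the xact's commit record */
    } DeferredCommit;

    #define DFC_SIZE 64
    static DeferredCommit dfc[DFC_SIZE];    /* would live in shared memory */

    /* Call before setting a "committed" hint bit for xid. */
    static void
    flush_before_hint_bit(TransactionId xid)
    {
        int i;

        for (i = 0; i < DFC_SIZE; i++)
        {
            if (TransactionIdEquals(dfc[i].xid, xid))
            {
                XLogFlush(dfc[i].lsn);  /* make the deferred commit durable */
                return;
            }
        }
        /* Not in cache: already flushed, or a normal (synchronous) commit. */
    }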

 In general, I think a transaction abort should not need to flush
 anything, since the default assumption is that it crashed anyway.

It currently does that though and I haven't sought to change that.

 Hence for instance recording a transaction abort needn't advance
 the LSN of the clog page.  (You seem to have it flushing through
 the last xlog record written by the backend, which is exactly what
 it doesn't need to do.)  By extension, it should be OK to set INVALID
 (aborted) hint bits in a tuple header without any concerns about
 flushing.

Sure.

 Also, I'm sort of wondering if we really need a separate walwriter
 process; that code seems awfully duplicative.  Is there a reason
 not to have the bgwriter include this functionality?

The bgwriter is doing something else already. Fsyncing the WAL takes
significant time and that would detract from the main role of bgwriter.

 In lesser news:
 
 The caching logic in TransactionGetCommitLSN is obviously broken.

OK, but I won't check it because of what you've said at the top of the
mail. No need to fix that if the basic premise needs changing anyway.

 Is there really a use-case for adding a pgstat counter for guaranteed
 transactions?  That adds pgstat overhead, and bloats the patch
 noticeably, and I don't entirely see the value of it.

I'm easy on that: it's easier to take code out than to add it in. It
certainly helped to debug things.

 There's some padding junk inserted in XLogCtlData, which as far as I
 recall was never discussed, and is certainly not an integral part of the
 delayed-commit feature.  If you want that you should submit and defend
 it separately.

OK

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Bruce Momjian
Tom Lane wrote:
 And checkpoint_rate really needs to be named checkpoint_min_rate, if
 it's going to be a minimum.  However, I question whether we need it at
 all, because as the code stands, with the default BgWriterDelay you
 would have to increase checkpoint_rate to 4x its proposed default before
 writes_per_nap moves off its minimum of 1.  This says to me that the
 system's tested behavior has been so insensitive to checkpoint_rate
 that we probably need not expose such a parameter at all; just hardwire
 the minimum writes_per_nap at 1.

I agree on starting with as simple a GUC as possible and expanding it as
there is need in the field.  99% of people are never going to test this
value as much as you are.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 (BTW, the patch seems
 a bit schizoid about whether checkpoint_rate is int or float.)

 Yeah, I've gone back and forth on the data type. I wanted it to be a 
 float, but guc code doesn't let you specify a float in KB, so I switched 
 it to int.

I seriously question trying to claim that it's blocks at all, seeing
that the *actual* units are pages per unit time.  Pretending that it's
a memory unit does more to obscure the truth than document it.

 What's bugging me about this is that we are either going to be writing
 at checkpoint_rate if ahead of schedule, or max possible rate if behind;
 that's not smoothing to me, that's introducing some pretty bursty
 behavior.

 That sounds a lot more complex. The burstiness at that level shouldn't 
 matter much. The OS is still going to cache the writes, and should even 
 out the bursts.

With the changes you're proposing here, the burstiness would be quite
severe.  OTOH, if writes_per_nap is always 1, then bufmgr is going to
recheck the delay situation after every page, so what you have actually
tested is as granular as it could get.

 And checkpoint_rate really needs to be named checkpoint_min_rate, if
 it's going to be a minimum.  However, I question whether we need it at
 all,

 Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, 
   using the non-broken formula above, we get:

 (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.

 So I think that's fine.

Fine?  That's 12x the value you have actually tested.  That's enough
of a change to invalidate all your performance testing IMHO.  I still
think you've not demonstrated a need to expose this parameter.

regards, tom lane
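For reference, the rate conversion being argued about works out roughly as
follows (a stand-alone sketch of one reading of this exchange, not the patch
itself; the names checkpoint_min_rate, bgwriter_delay and writes_per_nap
follow the discussion above):

    #include <stdio.h>

    #define BLCKSZ 8192                 /* PostgreSQL page size */

    /* KB/s minimum rate -> buffer writes per bgwriter nap, floored at 1. */
    static int
    writes_per_nap(int checkpoint_min_rate_kb, int bgwriter_delay_ms)
    {
        double pages_per_sec = (checkpoint_min_rate_kb * 1024.0) / BLCKSZ;
        int    n = (int) (pages_per_sec * bgwriter_delay_ms / 1000.0);

        return (n < 1) ? 1 : n;
    }

    int
    main(void)
    {
        /* the example from the thread: 512 KB/s at a 200 ms delay gives 12 */
        printf("%d\n", writes_per_nap(512, 200));
        return 0;
    }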



Re: [PATCHES] Transaction Guarantee, updated version

2007-06-22 Thread Simon Riggs
On Fri, 2007-06-22 at 10:22 -0400, Tom Lane wrote:

 4. If WAL has not been flushed far enough to be certain that the target
 transaction's commit is synced, then defer setting the hint bit.  This
 does not require any new per-transaction state, we can just measure it
 on the basis of the lsn associated with the clog page that the
 transaction's state is in.

Seems like a great idea, and avoids that annoying DFC cache.

I'll do that.
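A sketch of what that check could look like at the point where a hint bit is
set (illustrative only: TransactionGetCommitLSN() is the lookup function
mentioned elsewhere in this thread, which under idea (4) would return the LSN
of the clog page holding the xact's status, and XLogNeedsFlush() is the
existing xlog.c test):

    #include "postgres.h"
    #include "access/htup.h"
    #include "access/transam.h"
    #include "access/xlog.h"

    /*
     * Only set the "xmin committed" hint bit once the WAL covering the
     * transaction's commit has been flushed; otherwise skip it for now
     * and let a later visitor set it once WAL has caught up.
     */
    static void
    maybe_set_committed_hint(HeapTupleHeader tuple, TransactionId xid)
    {
        XLogRecPtr  commitLSN = TransactionGetCommitLSN(xid);

        if (!XLogNeedsFlush(commitLSN))
            tuple->t_infomask |= HEAP_XMIN_COMMITTED;
    }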


-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:

Tom Lane wrote:

(BTW, the patch seems
a bit schizoid about whether checkpoint_rate is int or float.)


Yeah, I've gone back and forth on the data type. I wanted it to be a 
float, but guc code doesn't let you specify a float in KB, so I switched 
it to int.


I seriously question trying to claim that it's blocks at all, seeing
that the *actual* units are pages per unit time.  Pretending that it's
a memory unit does more to obscure the truth than document it.


Hmm. What I wanted to avoid is that the I/O rate you get then depends on 
your bgwriter_delay, so if you change that you need to change 
checkpoint_min_rate as well.


Now we already have that issue with bgwriter_all/lru_maxpages, and I 
don't like it there either. If you think it's better to let the user 
define it directly as pages/bgwriter_delay, fine.



And checkpoint_rate really needs to be named checkpoint_min_rate, if
it's going to be a minimum.  However, I question whether we need it at
all,


Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, 
  using the non-broken formula above, we get:



(512*1024/8192) * 200 / 1000 = 12.8, truncated to 12.



So I think that's fine.


Fine?  That's 12x the value you have actually tested.  That's enough
of a change to invalidate all your performance testing IMHO.  


I'll reschedule the tests to be sure, after we settle on how we want to 
control this feature.



I still think you've not demonstrated a need to expose this parameter.


Greg Smith wanted to explicitly control the I/O rate, and let the 
checkpoint duration vary. I personally think that fixing the checkpoint 
duration is better because it's easier to tune.


But if we only do that, you might end up with ridiculously long 
checkpoints when there's not many dirty pages. If we want to avoid that, 
we need some way of telling what's a safe minimum rate to write at, 
because that can vary greatly depending on your hardware.


But maybe we don't care about prolonging checkpoints, and don't really 
need any GUCs at all. We could then just hardcode writes_per_nap to some 
low value, and target duration close to 1.0. You would have a checkpoint 
running practically all the time, and you would use 
checkpoint_timeout/checkpoint_segments to control how long it takes. I'm 
a bit worried about jumping to such a completely different regime, 
though. For example, pg_start_backup needs to create a new checkpoint, 
so it would need to wait on average 1.5 * checkpoint_timeout/segments, 
and recovery would need to process on average 1.5x as much WAL as before. 
Though with LDC, you should get away with shorter checkpoint intervals 
than before, because the checkpoints aren't as invasive.


If we do that, we should remove bgwriter_all_* settings. They wouldn't 
do much because we would have checkpoint running all the time, writing 
out dirty pages.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I still think you've not demonstrated a need to expose this parameter.

 Greg Smith wanted to explicitly control the I/O rate, and let the 
 checkpoint duration vary. I personally think that fixing the checkpoint 
 duration is better because it's easier to tune.

 But if we only do that, you might end up with ridiculously long 
 checkpoints when there's not many dirty pages. If we want to avoid that, 
 we need some way of telling what's a safe minimum rate to write at, 
 because that can vary greatly depending on your hardware.

As long as the minimum rate is at least 1/bgdelay, I don't think this is
a big problem.

 But maybe we don't care about prolonging checkpoints, and don't really 
 need any GUCs at all. We could then just hardcode writes_per_nap to some 
 low value, and target duration close to 1.0. You would have a checkpoint 
 running practically all the time, and you would use 
 checkpoint_timeout/checkpoint_segments to control how long it takes. I'm 
 a bit worried about jumping to such a completely different regime, 
 though. For example, pg_start_backup needs to create a new checkpoint, 
 so it would need to wait on average 1.5 * checkpoint_timeout/segments, 

Maybe I misread the patch, but I thought that if someone requested an
immediate checkpoint, the checkpoint-in-progress would effectively flip
to immediate mode.  So that could be handled by offering an immediate vs
extended checkpoint option in pg_start_backup.  I'm not sure it's a
problem though, since as previously noted you probably want
pg_start_backup to be noninvasive.  Also, one could do a manual
CHECKPOINT command then immediately pg_start_backup if one wanted
as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?)

 and recovery would need to process on average 1.5x as much WAL as before. 
 Though with LDC, you should get away with shorter checkpoint intervals 
 than before, because the checkpoints aren't as invasive.

No, you still want a pretty long checkpoint interval, because of the
increase in WAL traffic due to more page images being dumped when the
interval is short.

 If we do that, we should remove bgwriter_all_* settings. They wouldn't 
 do much because we would have checkpoint running all the time, writing 
 out dirty pages.

Yeah, I'm not sure that we've thought through the interactions with the
existing bgwriter behavior.

regards, tom lane



Re: [PATCHES] Preliminary GSSAPI Patches

2007-06-22 Thread Magnus Hagander
Stephen Frost wrote:
 * Magnus Hagander ([EMAIL PROTECTED]) wrote:
 We enable the setting of the service name in the server configuration
 file, but we never use that variable anywhere. We do, however, use the
 service name on the client, in order to pick the correct key (and
 turning this off makes GSSAPI no longer work).

 If this is correct, we should not enable that parameter on the server.
 If it's not correct, we should be using it somewhere.
 
 Uh, shouldn't you be acquiring the server credentials before accepting
 the context?  That'd be done using gss_acquire_cred(), which takes the
 service name (in gss_name_t structure) as an argument.  That would then
 be passed in to gss_accept_sec_context() instead of using
 GSS_C_NO_CREDENTIAL (in port->gss->cred). 

That's the direction I was thinking in. I just wanted to have it
confirmed. Henry, what's your take on this?

 I'm kind of surprised it's
 working without that and rather curious as to what it's doing under the
 hood to make that happen. :/

Most likely it's just checking the keytab to find a principal with the
same name as the one presented from the client. Since one is present, it
loads it up automatically, and verifies against it.

//Magnus
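For reference, the gss_acquire_cred() route Stephen describes would look
roughly like this (a sketch only, with most error handling elided; the
servicename argument and the example principal are placeholders, and the
returned credential would replace GSS_C_NO_CREDENTIAL in the
gss_accept_sec_context() call):

    #include <string.h>
    #include <gssapi/gssapi.h>

    static gss_cred_id_t
    acquire_server_creds(const char *servicename)  /* e.g. "postgres@db.example.com" */
    {
        gss_buffer_desc namebuf;
        gss_name_t      server_name;
        gss_cred_id_t   server_creds;
        OM_uint32       maj_stat, min_stat;

        namebuf.value = (void *) servicename;
        namebuf.length = strlen(servicename);

        maj_stat = gss_import_name(&min_stat, &namebuf,
                                   GSS_C_NT_HOSTBASED_SERVICE, &server_name);
        if (maj_stat != GSS_S_COMPLETE)
            return GSS_C_NO_CREDENTIAL;

        /* Acquire acceptor credentials for the named service from the keytab. */
        maj_stat = gss_acquire_cred(&min_stat, server_name, GSS_C_INDEFINITE,
                                    GSS_C_NO_OID_SET, GSS_C_ACCEPT,
                                    &server_creds, NULL, NULL);
        gss_release_name(&min_stat, &server_name);

        return (maj_stat == GSS_S_COMPLETE) ? server_creds : GSS_C_NO_CREDENTIAL;
    }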




Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Tom Lane wrote:

Maybe I misread the patch, but I thought that if someone requested an
immediate checkpoint, the checkpoint-in-progress would effectively flip
to immediate mode.  So that could be handled by offering an immediate vs
extended checkpoint option in pg_start_backup.  I'm not sure it's a
problem though, since as previously noted you probably want
pg_start_backup to be noninvasive.  Also, one could do a manual
CHECKPOINT command then immediately pg_start_backup if one wanted
as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?)


Yeah, that's possible.

and recovery would need to process on average 1.5 as much WAL as before. 
Though with LDC, you should get away with shorter checkpoint intervals 
than before, because the checkpoints aren't as invasive.


No, you still want a pretty long checkpoint interval, because of the
increase in WAL traffic due to more page images being dumped when the
interval is short.

If we do that, we should remove bgwriter_all_* settings. They wouldn't 
do much because we would have checkpoint running all the time, writing 
out dirty pages.


Yeah, I'm not sure that we've thought through the interactions with the
existing bgwriter behavior.


I searched the archives a bit for the discussions when the current 
bgwriter settings were born, and found this thread:


http://archives.postgresql.org/pgsql-hackers/2004-12/msg00784.php

The idea of Load Distributed Checkpoints certainly isn't new :).

Ok, if we approach this from the idea that there will be *no* GUC 
variables at all to control this, and we remove the bgwriter_all_* 
settings as well, does anyone see a reason why that would be bad? Here 
are the ones mentioned so far:


1. we need to keep 2x as much WAL segments around as before.

2. pg_start_backup will need to wait for a long time.

3. Recovery will take longer, because the redo ptr of the last completed 
checkpoint will lag further behind.


1. and 3. can be alleviated by using a smaller 
checkpoint_timeout/segments, though as you pointed out that leads to 
higher WAL traffic. 2. is not a big deal, and we can add an 'immediate' 
parameter to pg_start_backup if necessary.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Greg Smith

On Fri, 22 Jun 2007, Tom Lane wrote:


Yeah, I'm not sure that we've thought through the interactions with the
existing bgwriter behavior.


The entire background writer mess needs a rewrite, and the best way to 
handle that is going to shift considerably with LDC applied.


As the person who was complaining about corner cases I'm not in a position 
to talk more explicitly about, I can at least summarize my opinion of how 
I feel everyone should be thinking about this patch and you can take what 
you want from that.  In the context of getting an 8.3 release finalized, I 
think you should be building a LDC implementation that accomplishes the 
main goal of spreading the checkpoints out, which is clearly working, and 
erring on the side of more knobs in cases where it's unclear whether they're 
needed or not.  It's clear what my position is on which non-obvious knobs I 
think are important for some people.


Which is worse:  putting in a tuning setting that it's discovered 
everybody just uses the same value for, or discovering after release that 
there's a use-case for that setting and by hard-coding it you've made the 
DBAs for a class of applications you didn't think about miserable?  In the 
cases where there's good evidence so far of the right setting, just make 
that the default, and the only harm is GUC bloat.


Nothing should be done that changes the existing behavior if the LDC 
feature is turned off, so anything more obtrusive to the background writer 
is right out.  Make reducing the knobs, optimizing the default behavior, 
and rewriting the background writer to better fit into its new context a 
major goal of 8.4.  I know I've got a whole notebook full of stuff on that 
topic I've been ignoring as not to distract you guys from getting 8.3 
done.


That's the low risk plan, and the design/beta/release period here is short 
enough that I think going too experimental beyond that is a bad idea.  To 
pick an example, when I read this idea from Heikki:


You would have a checkpoint running practically all the time, and you 
would use checkpoint_timeout/checkpoint_segments to control how long it 
takes... If we do that, we should remove bgwriter_all_* settings


Whether or not I think this is an awesome idea, the very idea of a change 
that big at this point gives me the willies.  Just off the top of my head, 
there's a whole class of issues involving recycling xlog segments this 
would introduce whose implications I would be really unhappy with.  Did 
anyone else ever notice that when a new xlog segment is created, the write 
to clear it out doesn't happen via direct I/O like the rest of the xlog 
writes do?  That write goes through the regular buffered path instead. 
The checkpoint logging patch I submitted logs when this happens 
specifically because that particular issue messed with some operations and 
I found it important to be aware of.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Ok, if we approach this from the idea that there will be *no* GUC 
 variables at all to control this, and we remove the bgwriter_all_* 
 settings as well, does anyone see a reason why that would be bad? Here's 
 the ones mentioned this far:

 1. we need to keep 2x as much WAL segments around as before.

No, only 50% more.  We've always kept WAL back to the
checkpoint-before-last, so the peak usage has been about 2 checkpoint
distances (plus a bit), and now it'd tend to be about 3.

 2. pg_start_backup will need to wait for a long time.

 3. Recovery will take longer, because the distance last committed redo 
 ptr will lag behind more.

True, you'd have to replay 1.5 checkpoint intervals on average instead
of 0.5 (more or less, assuming checkpoints had been short).  I don't
think we're in the business of optimizing crash recovery time though.

It might be that getting rid of checkpoint_smoothing is an overreaction,
and that we should keep that parameter.  IIRC Greg had worried about
being able to turn this behavior off, so we'd still need at least a
bool, and we might as well expose the fraction instead.  (Still don't
like the name "smoothing" though.)  I agree with removing the non-LRU
part of the bgwriter's write logic though; that should simplify matters
a bit and cut down the overall I/O volume.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Tom Lane
Greg Smith [EMAIL PROTECTED] writes:
 As the person who was complaining about corner cases I'm not in a position 
 to talk more explicitly about, I can at least summarize my opinion of how 
 I feel everyone should be thinking about this patch and you can take what 
 you want from that.

Sorry, but we're not going to make decisions on the basis of evidence
you won't show us.  Right at the moment the best thing to do seems to
be to enable LDC with a low minimum write rate and a high target
duration, and remove the thereby-obsoleted all buffers scan of the
existing bgwriter logic.  If you've got specific evidence why any of
these things need to be parameterized, let's see it.  Personally I think
that we have a bad track record of exposing GUC variables as a
substitute for understanding performance issues at the start, and this
approach isn't doing any favors for DBAs.

 Whether or not I think this is an awesome idea, the very idea of a change 
 that big at this point gives me the willies.  Just off the top of my head, 
 there's a whole class of issues involving recycling xlog segments this 
 would introduce I would be really unhappy with the implications of.

Really?  Name one.  AFAICS this should not affect xlog recycling in the
slightest (other than that we have to bump up the target number of live
segments from 2x to 3x the checkpoint distance).

 Did anyone else ever notice that when a new xlog segment is created,
 the write to clear it out doesn't happen via direct I/O like the rest
 of the xlog writes do?

It's not supposed to matter, because that path isn't supposed to be
taken often.

regards, tom lane



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Greg Smith

On Fri, 22 Jun 2007, Tom Lane wrote:

Greg had worried about being able to turn this behavior off, so we'd 
still need at least a bool, and we might as well expose the fraction 
instead.  I agree with removing the non-LRU part of the bgwriter's write 
logic though


If you accept that being able to turn LDC off is important, people who 
aren't turning it on may still want the existing bgwriter_all_* settings 
so they can keep running things the way they are now.  It's certainly 
reasonable to skip that code path when doing things the LDC way.


True, you'd have to replay 1.5 checkpoint intervals on average instead 
of 0.5 (more or less, assuming checkpoints had been short).  I don't 
think we're in the business of optimizing crash recovery time though.


If you're not, I think you should be.  Keeping that replay interval time 
down was one of the reasons why the people I was working with were 
displeased with the implications of the very spread out style of some LDC 
tunings.  They were already unhappy with the implied recovery time of how 
high they had to set checkpoint_settings for good performance, and making 
it that much bigger aggrevates the issue.  Given a knob where the LDC can 
be spread out a bit but not across the entire interval, that makes it 
easier to control how much expansion there is relative to the current 
behavior.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Greg Smith wrote:
True, you'd have to replay 1.5 checkpoint intervals on average instead 
of 0.5 (more or less, assuming checkpoints had been short).  I don't 
think we're in the business of optimizing crash recovery time though.


If you're not, I think you should be.  Keeping that replay interval time 
down was one of the reasons why the people I was working with were 
displeased with the implications of the very spread out style of some 
LDC tunings.  They were already unhappy with the implied recovery time 
of how high they had to set checkpoint_settings for good performance, 
and making it that much bigger aggrevates the issue.  Given a knob where 
the LDC can be spread out a bit but not across the entire interval, that 
makes it easier to control how much expansion there is relative to the 
current behavior.


I agree on that one: we *should* optimize crash recovery time. It may 
not be the most important thing on earth, but it's a significant 
consideration for some systems.


However, I think shortening the checkpoint interval is a perfectly valid 
solution to that. It does lead to more full page writes, but in 8.3 more 
full page writes can actually make the recovery go faster, not slower, 
because we no longer read in the previous contents of the page when 
we restore it from a full page image. In any case, while people 
sometimes complain that we have a large WAL footprint, it's not usually 
a problem.


This is off-topic, but at PGCon in May, Itagaki-san and his colleagues, 
whose names I can't remember, pointed out to me very clearly that our 
recovery is *slow*. So slow, that in the benchmarks they were running, 
their warm stand-by slave couldn't keep up with the master generating 
the WAL, even though both are running on the same kind of hardware.


The reason is simple: There can be tens of backends doing I/O and 
generating WAL, but in recovery we serialize them. If you have decent 
I/O hardware that could handle for example 10 concurrent random I/Os, at 
recovery we'll be issuing them one at a time. That's a scalability 
issue, and doesn't show up on a laptop or a small server with a single disk.


That's one of the first things I'm planning to tackle when the 8.4 dev 
cycle opens. And I'm planning to look at recovery times in general; I've 
never even measured it before so who knows what comes up.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Load Distributed Checkpoints, take 3

2007-06-22 Thread Heikki Linnakangas

Greg Smith wrote:

On Fri, 22 Jun 2007, Tom Lane wrote:

Greg had worried about being able to turn this behavior off, so we'd 
still need at least a bool, and we might as well expose the fraction 
instead.  I agree with removing the non-LRU part of the bgwriter's 
write logic though


If you accept that being able to turn LDC off is important, people who 
aren't turning it on may still want the existing bgwriter_all_* settings 
so they can keep running things the way they are now.  It's certainly 
reasonable to skip that code path when doing things the LDC way.


Well, if you don't want anything to change, don't upgrade.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] Preliminary GSSAPI Patches

2007-06-22 Thread Henry B. Hotz


On Jun 22, 2007, at 9:56 AM, Magnus Hagander wrote:


Stephen Frost wrote:

* Magnus Hagander ([EMAIL PROTECTED]) wrote:
We enable the setting of the service name in the server configuration
file, but we never use that variable anywhere. We do, however, use the
service name on the client, in order to pick the correct key (and
turning this off makes GSSAPI no longer work).

If this is correct, we should not enable that parameter on the server.
If it's not correct, we should be using it somewhere.


Uh, shouldn't you be acquiring the server credentials before accepting
the context?  That'd be done using gss_acquire_cred(), which takes the
service name (in gss_name_t structure) as an argument.  That would then
be passed in to gss_accept_sec_context() instead of using
GSS_C_NO_CREDENTIAL (in port->gss->cred).


That's the direction I was thinking in. I just wanted to have it
confirmed. Henry, what's your take on this?


I'm kind of surprised it's working without that and rather curious as
to what it's doing under the hood to make that happen. :/


Most likely it's just checking the keytab to find a principal with the
same name as the one presented from the client. Since one is present, it
loads it up automatically, and verifies against it.


Bingo!

The server uses the keytab to decrypt the token provided by the  
client.  By using the GSS_C_NO_CREDENTIAL arg on the server anything  
put in the keytab is OK.  (The server doesn't need to authenticate  
itself to Kerberos, it just accepts authentication.  Mutual  
authentication is done using the same keys.)  The documentation needs  
to reflect that.


The real question is whether we *need* to check anything.  If the  
keytab is unique to PostgreSQL, as I think it should be, then I think  
the admin should put only the right stuff in the keytab, and no  
further checks are needed.
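For example, a PostgreSQL-only keytab could be provisioned along these lines
(the path, hostname and principal are illustrative; krb_server_keyfile is the
existing server setting that the patch above feeds into KRB5_KTNAME):

    # in kadmin: extract only the PostgreSQL service key into its own keytab
    ktadd -k /usr/local/pgsql/data/postgres.keytab postgres/db.example.com

    # postgresql.conf
    krb_server_keyfile = '/usr/local/pgsql/data/postgres.keytab'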


Now it *might* make sense to check that the credential that's
accepted actually has some correspondence to what ./configure said.
If we do do that, then we need to allow for the ways Microsoft mucks
with the case of the name.  (Kerberos is supposed to be case
sensitive, but Microsoft doesn't work that way.)  In particular I think we
may need both "postgres/server" and "POSTGRES/server" in the keytab
in order to support the to-be-written native Windows SSPI client at
the same time as the current Kerberos 5 and GSSAPI Unix clients.


Now what happens with non-Kerberos 5 mechanisms like SPKM and
SPNEGO?  ;-)  I'll have to see, even if it doesn't matter for Java, yet.


The opinions expressed in this message are mine,
not those of Caltech, JPL, NASA, or the US Government.
[EMAIL PROTECTED], or [EMAIL PROTECTED]





Re: [PATCHES] Preliminary GSSAPI Patches

2007-06-22 Thread Stephen Frost
* Henry B. Hotz ([EMAIL PROTECTED]) wrote:
 On Jun 22, 2007, at 9:56 AM, Magnus Hagander wrote:
 Most likely it's just checking the keytab to find a principal with the
 same name as the one presented from the client. Since one is  
 present, it
 loads it up automatically, and verifies against it.
 
 Bingo!
 
 The server uses the keytab to decrypt the token provided by the  
 client.  By using the GSS_C_NO_CREDENTIAL arg on the server anything  
 put in the keytab is OK.  (The server doesn't need to authenticate  
 itself to Kerberos, it just accepts authentication.  Mutual  
 authentication is done using the same keys.)  The documentation needs  
 to reflect that.

I agree there's some disconnect there between the documentation and the
apparent implementation but I'm not sure I'm in favor of changing the
documentation on this one.  Personally, I'd rather it return an error if
someone tries to use GSS_C_NO_CREDENTIAL when accepting a context than
to just be happy using anything in the keytab.

 The real question is whether we *need* to check anything.  If the  
 keytab is unique to PostgreSQL, as I think it should be, then I think  
 the admin should put only the right stuff in the keytab, and no  
 further checks are needed.

While I agree that in general it's best to have a keytab file for each
separate service, there's no guarantee of that being done, and using,
say, the 'host' service to allow authentication to PG when the PG
service is 'postgres' isn't right.

 Now it *might* make sense to check that the credential that's  
 accepted actually has some correspondence to what ./configure said.   

Or what's configured in the postgres.conf (as is done for Kerberos
now...).

 If we do do that, then we need to allow for the ways Microsoft mucks  
 with the case of the name.  (Kerberos is supposed to be case  
 sensitive, but Microsoft doesn't work that way.)  In particular I think we
 may need both "postgres/server" and "POSTGRES/server" in the keytab
 in order to support the to-be-written native Windows SSPI client at  
 the same time as the current Kerberos 5 and GSSAPI Unix clients.

Supporting multiple, specific, keys might be an interesting challenge,
but I'm not too keen on worrying about it right now regardless.  I'd
also much rather err on the side of overly paranoid than "if it works,
just let it in".  If someone ends up having to support both Windows SSPI
clients and unix Kerberos/GSSAPI clients it's entirely possible to
suggest they just make it POSTGRES and configure the clients
accordingly.

 Now what happens with non-Kerberos 5 mechanisms like SPKM and
 SPNEGO?  ;-)  I'll have to see, even if it doesn't matter for Java, yet.

Err, SPNEGO's not really a full mechanism itself, it's got sub-mechs of
NTLM and Kerberos and from what I've seen it does the same thing SSPI
does and assumes uppercase.  Again, it's ugly, but not unbearable.  Such
fun things are probably also why mod-auth-kerb for Apache lets you
configure the service name, like most other Kerberos servers...  I've
yet to run into one that will just use any key in the keytab it's given.

Thanks,

Stephen

