Re: [PATCHES] Preliminary GSSAPI Patches
Magnus Hagander wrote: Be curious to see what you've done, but if you're actively changing things I'll let them settle. I've got a bit more cleanup to do, but I'm almost there. Much of it is just cleanup. I've changed the structs around to be more in line with the other code around it, and such. Refactored some of the code to cut down on duplicated code. Added some stuff to make it work on Windows (still just with MIT Kerberos and not native, though). Fixed two (I think it was) small memory leaks. Protocol-wise, it no longer piggybacks on the AuthenticationOk message - instead we send an extra continue message followed right away by an AuthenticationOk one. Oh, and I've added autoconf. Not complete yet, but getting there :-) I'll post the updated patch shortly :-) Ok. Here's the version I have right now, sans autoconf (which I broke in my attempts to make it work with mingw). I have one major question remaining: We enable the setting of the service name in the server configuration file, but we never use that variable anywhere. We do, however, use the service name on the client, in order to pick the correct key (and turning this off makes GSSAPI no longer work). If this is correct, we should not enable that parameter on the server. If it's not correct, we should be using it somewhere. Is this perhaps a leftover from the old gssapi-encryption code? In that case we'd need to use that parameter on the server in order to enable encryption, but we can remove it for now, until we have that?
(Since the parameter is around for krb5 anyway, it's just a matter of #ifdefing it back out, of course, not actually removing it)

(Still working on the documentation part)

//Magnus

diff -cr pgsql.orig/src/backend/libpq/auth.c pgsql/src/backend/libpq/auth.c
*** pgsql.orig/src/backend/libpq/auth.c	2007-02-08 05:52:18.0 +0100
--- pgsql/src/backend/libpq/auth.c	2007-06-22 12:39:25.0 +0200
***************
*** 295,300 ****
--- 295,539 ----
  }
  #endif   /* KRB5 */

+ #ifdef ENABLE_GSS
+ /*
+  * GSSAPI authentication system
+  */
+
+ #include <gssapi/gssapi.h>
+
+ #ifdef WIN32
+ /*
+  * MIT Kerberos GSSAPI DLL doesn't properly export the symbols
+  * that contain the OIDs required. Redefine here, values copied
+  * from src/athena/auth/krb5/src/lib/gssapi/generic/gssapi_generic.c
+  */
+ static const gss_OID_desc GSS_C_NT_USER_NAME_desc =
+ {10, (void *) "\x2a\x86\x48\x86\xf7\x12\x01\x02\x01\x02"};
+ static GSS_DLLIMP gss_OID GSS_C_NT_USER_NAME = (gss_OID) &GSS_C_NT_USER_NAME_desc;
+ #endif
+
+
+ static void
+ pg_GSS_error(int severity, char *text, OM_uint32 maj_stat, OM_uint32 min_stat)
+ {
+ 	gss_buffer_desc gmsg;
+ 	OM_uint32	lmaj_s, lmin_s, msg_ctx;
+ 	char		localmsg1[128],
+ 				localmsg2[128];
+
+ 	/* Fetch major status message */
+ 	msg_ctx = 0;
+ 	lmaj_s = gss_display_status(&lmin_s, maj_stat, GSS_C_GSS_CODE,
+ 								GSS_C_NO_OID, &msg_ctx, &gmsg);
+ 	strlcpy(localmsg1, gmsg.value, sizeof(localmsg1));
+ 	gss_release_buffer(&lmin_s, &gmsg);
+
+ 	if (msg_ctx)
+ 		/*
+ 		 * More than one message available.
+ 		 * XXX: Should we loop and read all messages?
+ 		 * (same below)
+ 		 */
+ 		ereport(WARNING,
+ 				(errmsg_internal("incomplete GSS error report")));
+
+ 	/* Fetch mechanism minor status message */
+ 	msg_ctx = 0;
+ 	lmaj_s = gss_display_status(&lmin_s, min_stat, GSS_C_MECH_CODE,
+ 								GSS_C_NO_OID, &msg_ctx, &gmsg);
+ 	strlcpy(localmsg2, gmsg.value, sizeof(localmsg2));
+ 	gss_release_buffer(&lmin_s, &gmsg);
+
+ 	if (msg_ctx)
+ 		ereport(WARNING,
+ 				(errmsg_internal("incomplete GSS minor error report")));
+
+ 	/*
+ 	 * errmsg_internal, since translation of the first part must be
+ 	 * done before calling this function anyway.
+ 	 */
+ 	ereport(severity,
+ 			(errmsg_internal("%s:%s\n%s", text, localmsg1, localmsg2)));
+ }
+
+ static int
+ pg_GSS_recvauth(Port *port)
+ {
+ 	OM_uint32	maj_stat, min_stat, lmin_s, gflags;
+ 	char	   *kt_path;
+ 	int			mtype;
+ 	int			n_eq;
+ 	StringInfoData buf;
+ 	gss_buffer_desc gbuf;
+ 	gss_name_t	gnbuf;
+
+ 	if (pg_krb_server_keyfile && strlen(pg_krb_server_keyfile) > 0)
+ 	{
+ 		/*
+ 		 * Set default Kerberos keytab file for the Krb5 mechanism.
+ 		 *
+ 		 * setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0);
+ 		 *		except setenv() not always available.
+ 		 */
+ 		if (!getenv("KRB5_KTNAME"))
+ 		{
+ 			kt_path = palloc(PATH_MAX + 13);
+ 			snprintf(kt_path, PATH_MAX + 13,
+ 					 "KRB5_KTNAME=%s", pg_krb_server_keyfile);
+ 			putenv(kt_path);
+ 		}
+ 	}
+
+ 	port->gss->ctx = GSS_C_NO_CONTEXT;
+ 	port->gss->cred = GSS_C_NO_CREDENTIAL;
+
+ 	/*
+ 	 * Loop through GSSAPI message exchange. This exchange can consist
+ 	 * of multiple messages sent in both directions. First message is always
+ 	 * from the client. All messages from client to server are password
+ 	 * packets (type 'p').
+ 	 */
+ 	do
+ 	{
+ 		mtype = pq_getbyte();
+ 		if (mtype != 'p')
+ 		{
+ 			/* Only log error
Re: [PATCHES] Transaction Guarantee, updated version
On Thu, 2007-06-21 at 16:05 -0400, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Completed all of the agreed changes for TG: I've just realized that there's a fatal problem with this design. We've now got tqual.c setting page LSN when it holds only share lock on the buffer. That will absolutely not work, eg two backends might concurrently set different values and end up with garbage (since it's unlikely that LSN store is atomic). Can we fix it to be a read test instead of a write test, that is, if we know WAL has been flushed through the target LSN, it's safe to set the hint bit, else not? Yes, that's roughly how it worked in v1. The various possible strategies for handling hint bits were: 1. flush WAL to current insert pointer - very heavy perf hit 2. defer setting the hint bits at all, if the written transactions are recently written. To do this we need a cache of recently written deferred commit Xids. 3. flush WAL up to the LSN of the Xid, if it is a deferred commit. To do this we need a cache of recently written deferred commit Xids. Any more? v1 implemented (2) on the basis that it was a small timing hole that was best handled by doing as little as possible, though we could easily implement (3) instead. But you need the deferred commit cache either way, AFAICS. So proposal is: Maintain cache of unflushed deferred commit xacts and their corresponding LSNs. When you need to set a commit bit you check the required LSN for the xact from cache, then flush WAL up to that LSN. If xact doesn't exist, then do nothing because it is either an already flushed deferred commit or a normal commit (therefore already flushed too). Don't update the page's LSN at all. More code, but it works correctly and efficiently. This is pretty much putting back most of v1, though. In general, I think a transaction abort should not need to flush anything, since the default assumption is that it crashed anyway. It currently does that though and I haven't sought to change that. 
Hence for instance recording a transaction abort needn't advance the LSN of the clog page. (You seem to have it flushing through the last xlog record written by the backend, which is exactly what it doesn't need to do.) By extension, it should be OK to set INVALID (aborted) hint bits in a tuple header without any concerns about flushing. Sure. Also, I'm sort of wondering if we really need a separate walwriter process; that code seems awfully duplicative. Is there a reason not to have the bgwriter include this functionality? The bgwriter is doing something else already. Fsyncing the WAL takes significant time and that would detract from the main role of bgwriter. In lesser news: The caching logic in TransactionGetCommitLSN is obviously broken. OK, but won't check because of what you've said at top of mail. No need to fix that if the basic premise needs changing anyway. Is there really a use-case for adding a pgstat counter for guaranteed transactions? That adds pgstat overhead, and bloats the patch noticeably, and I don't entirely see the value of it. I'm easy on that: it's easier to take code out than add it in. It certainly helped to debug things. There's some padding junk inserted in XLogCtlData, which as far as I recall was never discussed, and is certainly not an integral part of the delayed-commit feature. If you want that you should submit and defend it separately. OK -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
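Simon's proposal above (keep a cache of unflushed deferred-commit xacts with their commit LSNs; before setting a commit hint bit, flush WAL up to the cached LSN, and do nothing extra if the xact isn't cached) can be sketched roughly as follows. This is a toy model, not the actual patch: the fixed-size array, names, and integer LSN type are all illustrative assumptions.

```c
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;	/* simplified; real code uses a struct */

/* Hypothetical cache entry: one slot per unflushed deferred commit */
typedef struct
{
	TransactionId xid;
	XLogRecPtr	commit_lsn;
} DeferredCommit;

#define DFC_CACHE_SIZE 8
static DeferredCommit dfc_cache[DFC_CACHE_SIZE];
static XLogRecPtr flushed_lsn = 0;	/* how far WAL has been fsync'd */

/* Stand-in for XLogFlush(): advance the flushed pointer */
static void
flush_wal_upto(XLogRecPtr lsn)
{
	if (lsn > flushed_lsn)
		flushed_lsn = lsn;
}

/*
 * Before setting a COMMITTED hint bit for xid: if the xid is still in the
 * deferred-commit cache, flush WAL through its commit LSN first.  If it is
 * absent, it was either an already-flushed deferred commit or a normal
 * (synchronous) commit, so nothing extra is needed.  Always returns 1:
 * after the possible flush, the hint bit is safe to set.
 */
static int
ok_to_set_commit_hint(TransactionId xid)
{
	int			i;

	for (i = 0; i < DFC_CACHE_SIZE; i++)
	{
		if (dfc_cache[i].xid == xid)
		{
			flush_wal_upto(dfc_cache[i].commit_lsn);
			dfc_cache[i].xid = 0;	/* entry no longer needed */
			break;
		}
	}
	return 1;
}
```

Note that, per Simon's description, the page LSN is never updated in this scheme; only the WAL flush point moves.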
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: And checkpoint_rate really needs to be named checkpoint_min_rate, if it's going to be a minimum. However, I question whether we need it at all, because as the code stands, with the default BgWriterDelay you would have to increase checkpoint_rate to 4x its proposed default before writes_per_nap moves off its minimum of 1. This says to me that the system's tested behavior has been so insensitive to checkpoint_rate that we probably need not expose such a parameter at all; just hardwire the minimum writes_per_nap at 1. I agree on starting with as simple a GUC as possible and expanding it as need arises in the field. 99% of people are never going to test this value as much as you are. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: (BTW, the patch seems a bit schizoid about whether checkpoint_rate is int or float.) Yeah, I've gone back and forth on the data type. I wanted it to be a float, but guc code doesn't let you specify a float in KB, so I switched it to int. I seriously question trying to claim that it's blocks at all, seeing that the *actual* units are pages per unit time. Pretending that it's a memory unit does more to obscure the truth than document it. What's bugging me about this is that we are either going to be writing at checkpoint_rate if ahead of schedule, or max possible rate if behind; that's not smoothing to me, that's introducing some pretty bursty behavior. That sounds a lot more complex. The burstiness at that level shouldn't matter much. The OS is still going to cache the writes, and should even out the bursts. With the changes you're proposing here, the burstiness would be quite severe. OTOH, if writes_per_nap is always 1, then bufmgr is going to recheck the delay situation after every page, so what you have actually tested is as granular as it could get. And checkpoint_rate really needs to be named checkpoint_min_rate, if it's going to be a minimum. However, I question whether we need it at all, Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, using the non-broken formula above, we get: (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. So I think that's fine. Fine? That's 12x the value you have actually tested. That's enough of a change to invalidate all your performance testing IMHO. I still think you've not demonstrated a need to expose this parameter. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
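The arithmetic in the thread ((512*1024/8192) * 200 / 1000 = 12.8, truncated to 12) can be checked with a small model of the conversion. BLCKSZ, the integer truncation, and the floor of 1 follow the discussion; the function name and exact placement are illustrative, not the patch's actual code.

```c
/*
 * Convert a minimum checkpoint write rate (KB/s) plus the bgwriter nap
 * length (ms) into buffer writes per nap, per the formula quoted in the
 * thread.  Integer arithmetic truncates, matching the "12.8 -> 12"
 * example; the result is floored at 1, as Tom suggests hardwiring.
 */
#define BLCKSZ 8192

static int
writes_per_nap(int min_rate_kb_per_s, int bgwriter_delay_ms)
{
	int			pages_per_second = (min_rate_kb_per_s * 1024) / BLCKSZ;
	int			n = pages_per_second * bgwriter_delay_ms / 1000;

	return (n < 1) ? 1 : n;
}
```

With bgwriter_delay = 200 ms and checkpoint_min_rate = 512 KB/s this yields 12; at low rates the floor of 1 dominates, which is Tom's point about the parameter's insensitivity.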
Re: [PATCHES] Transaction Guarantee, updated version
On Fri, 2007-06-22 at 10:22 -0400, Tom Lane wrote: 4. If WAL has not been flushed far enough to be certain that the target transaction's commit is synced, then defer setting the hint bit. This does not require any new per-transaction state, we can just measure it on the basis of the lsn associated with the clog page that the transaction's state is in. Seems like a great idea, and avoids that annoying DFC cache. I'll do that. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
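Tom's option 4 reduces the whole decision to one LSN comparison, with no per-transaction cache: track an LSN per clog page and set the hint bit only if WAL is already flushed past it. A minimal sketch with illustrative names (the real code would fetch both LSNs from shared state):

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* simplified; real code uses a struct */

/*
 * A hint bit may be set only if WAL has been flushed at least up to the
 * LSN associated with the clog page holding the transaction's commit
 * status.  Otherwise, simply defer setting the hint bit; it will become
 * settable once the WAL flush point advances.
 */
static int
hint_bit_settable(XLogRecPtr clog_page_lsn, XLogRecPtr wal_flushed_to)
{
	return wal_flushed_to >= clog_page_lsn;
}
```

This is what makes the deferred-commit cache unnecessary: the clog page LSN is a conservative upper bound on the commit LSN of every transaction on that page.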
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: (BTW, the patch seems a bit schizoid about whether checkpoint_rate is int or float.) Yeah, I've gone back and forth on the data type. I wanted it to be a float, but guc code doesn't let you specify a float in KB, so I switched it to int. I seriously question trying to claim that it's blocks at all, seeing that the *actual* units are pages per unit time. Pretending that it's a memory unit does more to obscure the truth than document it. Hmm. What I wanted to avoid is that the I/O rate you get then depends on your bgwriter_delay, so if you change that you need to change checkpoint_min_rate as well. Now we already have that issue with bgwriter_all/lru_maxpages, and I don't like it there either. If you think it's better to let the user define it directly as pages/bgwriter_delay, fine. And checkpoint_rate really needs to be named checkpoint_min_rate, if it's going to be a minimum. However, I question whether we need it at all, Hmm. With bgwriter_delay of 200 ms, and checkpoint_min_rate of 512 KB/s, using the non-broken formula above, we get: (512*1024/8192) * 200 / 1000 = 12.8, truncated to 12. So I think that's fine. Fine? That's 12x the value you have actually tested. That's enough of a change to invalidate all your performance testing IMHO. I'll reschedule the tests to be sure, after we settle on how we want to control this feature. I still think you've not demonstrated a need to expose this parameter. Greg Smith wanted to explicitly control the I/O rate, and let the checkpoint duration vary. I personally think that fixing the checkpoint duration is better because it's easier to tune. But if we only do that, you might end up with ridiculously long checkpoints when there's not many dirty pages. If we want to avoid that, we need some way of telling what's a safe minimum rate to write at, because that can vary greatly depending on your hardware.
But maybe we don't care about prolonging checkpoints, and don't really need any GUCs at all. We could then just hardcode writes_per_nap to some low value, and target duration close to 1.0. You would have a checkpoint running practically all the time, and you would use checkpoint_timeout/checkpoint_segments to control how long it takes. I'm a bit worried about jumping to such a completely different regime, though. For example, pg_start_backup needs to create a new checkpoint, so it would need to wait on average 1.5 * checkpoint_timeout/segments, and recovery would need to process on average 1.5 as much WAL as before. Though with LDC, you should get away with shorter checkpoint intervals than before, because the checkpoints aren't as invasive. If we do that, we should remove bgwriter_all_* settings. They wouldn't do much because we would have checkpoint running all the time, writing out dirty pages. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas [EMAIL PROTECTED] writes: Tom Lane wrote: I still think you've not demonstrated a need to expose this parameter. Greg Smith wanted to explicitly control the I/O rate, and let the checkpoint duration vary. I personally think that fixing the checkpoint duration is better because it's easier to tune. But if we only do that, you might end up with ridiculously long checkpoints when there's not many dirty pages. If we want to avoid that, we need some way of telling what's a safe minimum rate to write at, because that can vary greatly depending on your hardware. As long as the minimum rate is at least 1/bgdelay, I don't think this is a big problem. But maybe we don't care about prolonging checkpoints, and don't really need any GUCs at all. We could then just hardcode writes_per_nap to some low value, and target duration close to 1.0. You would have a checkpoint running practically all the time, and you would use checkpoint_timeout/checkpoint_segments to control how long it takes. I'm a bit worried about jumping to such a completely different regime, though. For example, pg_start_backup needs to create a new checkpoint, so it would need to wait on average 1.5 * checkpoint_timeout/segments, Maybe I misread the patch, but I thought that if someone requested an immediate checkpoint, the checkpoint-in-progress would effectively flip to immediate mode. So that could be handled by offering an immediate vs extended checkpoint option in pg_start_backup. I'm not sure it's a problem though, since as previously noted you probably want pg_start_backup to be noninvasive. Also, one could do a manual CHECKPOINT command then immediately pg_start_backup if one wanted as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?) and recovery would need to process on average 1.5 as much WAL as before. Though with LDC, you should get away with shorter checkpoint intervals than before, because the checkpoints aren't as invasive. 
No, you still want a pretty long checkpoint interval, because of the increase in WAL traffic due to more page images being dumped when the interval is short. If we do that, we should remove bgwriter_all_* settings. They wouldn't do much because we would have checkpoint running all the time, writing out dirty pages. Yeah, I'm not sure that we've thought through the interactions with the existing bgwriter behavior. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Preliminary GSSAPI Patches
Stephen Frost wrote: * Magnus Hagander ([EMAIL PROTECTED]) wrote: We enable the setting of the service name in the server configuration file, but we never use that variable anywhere. We do, however, use the service name on the client, in order to pick the correct key (and turning this off makes GSSAPI no longer work). If this is correct, we should not enable that parameter on the server. If it's not correct, we should be using it somewhere. Uh, shouldn't you be acquiring the server credentials before accepting the context? That'd be done using gss_acquire_cred(), which takes the service name (in gss_name_t structure) as an argument. That would then be passed in to gss_accept_sec_context() instead of using GSS_C_NO_CREDENTIAL (in port-gss-cred). That's the direction I was thinking in. I just wanted to have it confirmed. Henry, what's your take on this? I'm kind of surprised it's working without that and rather curious as to what it's doing under the hood to make that happen. :/ Most likely it's just checking the keytab to find a principal with the same name as the one presented from the client. Since one is present, it loads it up automatically, and verifies against it. //Magnus ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
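The flow Stephen describes can be sketched roughly as below. This is an untested sketch, not the patch: it needs a working Kerberos environment and keytab to run, `pg_gss_servicename` is a hypothetical stand-in for the server-side service name setting, and error handling is abbreviated to the existing pg_GSS_error() helper.

```c
/*
 * Sketch: acquire the server's own credential for the configured
 * service name, then hand it to gss_accept_sec_context() instead of
 * GSS_C_NO_CREDENTIAL.  "pg_gss_servicename" is hypothetical.
 */
gss_buffer_desc namebuf;
gss_name_t	server_name;
gss_cred_id_t server_cred;
OM_uint32	maj_stat, min_stat;

namebuf.value = pg_gss_servicename;		/* e.g. "postgres@server.example.com" */
namebuf.length = strlen(namebuf.value) + 1;

maj_stat = gss_import_name(&min_stat, &namebuf,
						   GSS_C_NT_HOSTBASED_SERVICE, &server_name);
if (maj_stat != GSS_S_COMPLETE)
	pg_GSS_error(ERROR, "importing service name", maj_stat, min_stat);

maj_stat = gss_acquire_cred(&min_stat, server_name, GSS_C_INDEFINITE,
							GSS_C_NO_OID_SET, GSS_C_ACCEPT,
							&server_cred, NULL, NULL);
if (maj_stat != GSS_S_COMPLETE)
	pg_GSS_error(ERROR, "acquiring credentials", maj_stat, min_stat);

/* then use server_cred in place of GSS_C_NO_CREDENTIAL when calling
 * gss_accept_sec_context() */
```

Acquiring an explicit credential would restrict acceptance to the named service principal, rather than (as Magnus surmises) accepting against any principal that happens to be in the keytab.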
Re: [PATCHES] Load Distributed Checkpoints, take 3
Tom Lane wrote: Maybe I misread the patch, but I thought that if someone requested an immediate checkpoint, the checkpoint-in-progress would effectively flip to immediate mode. So that could be handled by offering an immediate vs extended checkpoint option in pg_start_backup. I'm not sure it's a problem though, since as previously noted you probably want pg_start_backup to be noninvasive. Also, one could do a manual CHECKPOINT command then immediately pg_start_backup if one wanted as-fast-as-possible (CHECKPOINT requests immediate checkpoint, right?) Yeah, that's possible. and recovery would need to process on average 1.5 as much WAL as before. Though with LDC, you should get away with shorter checkpoint intervals than before, because the checkpoints aren't as invasive. No, you still want a pretty long checkpoint interval, because of the increase in WAL traffic due to more page images being dumped when the interval is short. If we do that, we should remove bgwriter_all_* settings. They wouldn't do much because we would have checkpoint running all the time, writing out dirty pages. Yeah, I'm not sure that we've thought through the interactions with the existing bgwriter behavior. I searched the archives a bit for the discussions when the current bgwriter settings were born, and found this thread: http://archives.postgresql.org/pgsql-hackers/2004-12/msg00784.php The idea of Load Distributed Checkpoints certainly isn't new :). Ok, if we approach this from the idea that there will be *no* GUC variables at all to control this, and we remove the bgwriter_all_* settings as well, does anyone see a reason why that would be bad? Here's the ones mentioned this far: 1. we need to keep 2x as much WAL segments around as before. 2. pg_start_backup will need to wait for a long time. 3. Recovery will take longer, because the redo ptr will lag further behind the last commit. 1. and 3.
can be alleviated by using a smaller checkpoint_timeout/segments though as you pointed out that leads to higher WAL traffic. 2. is not a big deal, and we can add an 'immediate' parameter to pg_start_backup if necessary. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 22 Jun 2007, Tom Lane wrote: Yeah, I'm not sure that we've thought through the interactions with the existing bgwriter behavior. The entire background writer mess needs a rewrite, and the best way to handle that is going to shift considerably with LDC applied. As the person who was complaining about corner cases I'm not in a position to talk more explicitly about, I can at least summarize my opinion of how I feel everyone should be thinking about this patch and you can take what you want from that. In the context of getting an 8.3 release finalized, I think you should be building a LDC implementation that accomplishes the main goal of spreading the checkpoints out, which is clearly working, and erring on the side of more knobs in cases where it's unsure if they're needed or not. It's clear what my position is on which non-obvious knobs I think are important for some people. Which is worse: putting in a tuning setting that it's discovered everybody just uses the same value for, or discovering after release that there's a use-case for that setting and by hard-coding it you've made the DBAs for a class of applications you didn't think about miserable? In the cases where there's good evidence so far of the right setting, just make that the default, and the only harm is GUC bloat. Nothing should be done that changes the existing behavior if the LDC feature is turned off, so anything more obtrusive to the background writer is right out. Make reducing the knobs, optimizing the default behavior, and rewriting the background writer to better fit into its new context a major goal of 8.4. I know I've got a whole notebook full of stuff on that topic I've been ignoring so as not to distract you guys from getting 8.3 done. That's the low risk plan, and the design/beta/release period here is short enough that I think going too experimental beyond that is a bad idea.
To pick an example, when I read this idea from Heikki: You would have a checkpoint running practically all the time, and you would use checkpoint_timeout/checkpoint_segments to control how long it takes... If we do that, we should remove bgwriter_all_* settings Whether or not I think this is an awesome idea, the very idea of a change that big at this point gives me the willies. Just off the top of my head, there's a whole class of issues involving recycling xlog segments this would introduce I would be really unhappy with the implications of. Did anyone else ever notice that when a new xlog segment is created, the write to clear it out doesn't happen via direct I/O like the rest of the xlog writes do? That write goes through the regular buffered path instead. The checkpoint logging patch I submitted logs when this happens specifically because that particular issue messed with some operations and I found it important to be aware of. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Load Distributed Checkpoints, take 3
Heikki Linnakangas [EMAIL PROTECTED] writes: Ok, if we approach this from the idea that there will be *no* GUC variables at all to control this, and we remove the bgwriter_all_* settings as well, does anyone see a reason why that would be bad? Here's the ones mentioned this far: 1. we need to keep 2x as much WAL segments around as before. No, only 50% more. We've always kept WAL back to the checkpoint-before-last, so the peak usage has been about 2 checkpoint distances (plus a bit), and now it'd tend to be about 3. 2. pg_start_backup will need to wait for a long time. 3. Recovery will take longer, because the distance last committed redo ptr will lag behind more. True, you'd have to replay 1.5 checkpoint intervals on average instead of 0.5 (more or less, assuming checkpoints had been short). I don't think we're in the business of optimizing crash recovery time though. It might be that getting rid of checkpoint_smoothing is an overreaction, and that we should keep that parameter. IIRC Greg had worried about being able to turn this behavior off, so we'd still need at least a bool, and we might as well expose the fraction instead. (Still don't like the name smoothing though.) I agree with removing the non-LRU part of the bgwriter's write logic though; that should simplify matters a bit and cut down the overall I/O volume. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
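Tom's 50%-more figure follows from counting checkpoint distances: WAL has always been kept back to the checkpoint-before-last (about 2 distances, plus a bit), and keeping a spread checkpoint in progress holds roughly one more distance live. A toy calculation (illustrative arithmetic only, not PostgreSQL code):

```c
/*
 * Peak live WAL, measured in checkpoint distances worth of segments.
 * WAL is retained back to the checkpoint-before-last (~2 distances);
 * a spread (LDC) checkpoint keeps roughly one more distance live.
 */
static int
peak_segments(int segments_per_distance, int spread_checkpoints)
{
	int			distances = spread_checkpoints ? 3 : 2;

	return distances * segments_per_distance;
}
```

So for a given checkpoint distance the peak goes from ~2x to ~3x, i.e. 50% more, not the doubling suggested earlier in the thread.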
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith [EMAIL PROTECTED] writes: As the person who was complaining about corner cases I'm not in a position to talk more explicitly about, I can at least summarize my opinion of how I feel everyone should be thinking about this patch and you can take what you want from that. Sorry, but we're not going to make decisions on the basis of evidence you won't show us. Right at the moment the best thing to do seems to be to enable LDC with a low minimum write rate and a high target duration, and remove the thereby-obsoleted all buffers scan of the existing bgwriter logic. If you've got specific evidence why any of these things need to be parameterized, let's see it. Personally I think that we have a bad track record of exposing GUC variables as a substitute for understanding performance issues at the start, and this approach isn't doing any favors for DBAs. Whether or not I think this is an awesome idea, the very idea of a change that big at this point gives me the willies. Just off the top of my head, there's a whole class of issues involving recycling xlog segments this would introduce I would be really unhappy with the implications of. Really? Name one. AFAICS this should not affect xlog recycling in the slightest (other than that we have to bump up the target number of live segments from 2x to 3x the checkpoint distance). Did anyone else ever notice that when a new xlog segment is created, the write to clear it out doesn't happen via direct I/O like the rest of the xlog writes do? It's not supposed to matter, because that path isn't supposed to be taken often. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
On Fri, 22 Jun 2007, Tom Lane wrote: Greg had worried about being able to turn this behavior off, so we'd still need at least a bool, and we might as well expose the fraction instead. I agree with removing the non-LRU part of the bgwriter's write logic though If you accept that being able to turn LDC off is important, people who aren't turning it on may still want the existing bgwriter_all_* settings so they can keep running things the way they are now. It's certainly reasonable to skip that code path when doing things the LDC way. True, you'd have to replay 1.5 checkpoint intervals on average instead of 0.5 (more or less, assuming checkpoints had been short). I don't think we're in the business of optimizing crash recovery time though. If you're not, I think you should be. Keeping that replay interval time down was one of the reasons why the people I was working with were displeased with the implications of the very spread out style of some LDC tunings. They were already unhappy with the implied recovery time of how high they had to set checkpoint_settings for good performance, and making it that much bigger aggravates the issue. Given a knob where the LDC can be spread out a bit but not across the entire interval, that makes it easier to control how much expansion there is relative to the current behavior. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith wrote: True, you'd have to replay 1.5 checkpoint intervals on average instead of 0.5 (more or less, assuming checkpoints had been short). I don't think we're in the business of optimizing crash recovery time though. If you're not, I think you should be. Keeping that replay interval time down was one of the reasons why the people I was working with were displeased with the implications of the very spread out style of some LDC tunings. They were already unhappy with the implied recovery time of how high they had to set checkpoint_settings for good performance, and making it that much bigger aggravates the issue. Given a knob where the LDC can be spread out a bit but not across the entire interval, that makes it easier to control how much expansion there is relative to the current behavior. I agree on that one: we *should* optimize crash recovery time. It may not be the most important thing on earth, but it's a significant consideration for some systems. However, I think shortening the checkpoint interval is a perfectly valid solution to that. It does lead to more full page writes, but in 8.3 more full page writes can actually make the recovery go faster, not slower, because we no longer read in the previous contents of the page when we restore it from a full page image. In any case, while people sometimes complain that we have a large WAL footprint, it's not usually a problem. This is off-topic, but at PGCon in May, Itagaki-san and his colleagues whose names I can't remember pointed out to me very clearly that our recovery is *slow*. So slow that in the benchmarks they were running, their warm stand-by slave couldn't keep up with the master generating the WAL, even though both are running on the same kind of hardware. The reason is simple: There can be tens of backends doing I/O and generating WAL, but in recovery we serialize them.
If you have decent I/O hardware that could handle for example 10 concurrent random I/Os, at recovery we'll be issuing them one at a time. That's a scalability issue, and doesn't show up on a laptop or a small server with a single disk. That's one of the first things I'm planning to tackle when the 8.4 dev cycle opens. And I'm planning to look at recovery times in general; I've never even measured it before so who knows what comes up. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
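Heikki's serialization point can be illustrated with toy arithmetic (purely a model, not PostgreSQL code): if recovery issues random reads strictly one at a time while the hardware could service, say, 10 concurrently, replay takes roughly 10x longer than it needs to. The function and its parameters are invented for illustration.

```c
/*
 * Toy model of replay time: n_ios random reads, each taking
 * io_latency_ms, issued queue_depth at a time.  queue_depth == 1
 * models today's serialized recovery; queue_depth == 10 models
 * keeping a 10-disk array busy.
 */
static int
replay_time_ms(int n_ios, int io_latency_ms, int queue_depth)
{
	int			batches = (n_ios + queue_depth - 1) / queue_depth;	/* ceil */

	return batches * io_latency_ms;
}
```

Under this model, 100 reads at 8 ms each take 800 ms serialized but only 80 ms at queue depth 10, which is why the effect doesn't show on a single-disk laptop.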
Re: [PATCHES] Load Distributed Checkpoints, take 3
Greg Smith wrote: On Fri, 22 Jun 2007, Tom Lane wrote: Greg had worried about being able to turn this behavior off, so we'd still need at least a bool, and we might as well expose the fraction instead. I agree with removing the non-LRU part of the bgwriter's write logic though If you accept that being able to turn LDC off is important, people who aren't turning it on may still want the existing bgwriter_all_* settings so they can keep running things the way they are now. It's certainly reasonable to skip that code path when doing things the LDC way. Well, if you don't want anything to change, don't upgrade. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] Preliminary GSSAPI Patches
On Jun 22, 2007, at 9:56 AM, Magnus Hagander wrote: Stephen Frost wrote: * Magnus Hagander ([EMAIL PROTECTED]) wrote: We enable the setting of the service name in the server configuration file, but we never use that variable anywhere. We do, however, use the service name on the client, in order to pick the correct key (and turning this off makes GSSAPI no longer work). If this is correct, we should not enable that parameter on the server. If it's not correct, we should be using it somewhere.

Uh, shouldn't you be acquiring the server credentials before accepting the context? That'd be done using gss_acquire_cred(), which takes the service name (in gss_name_t structure) as an argument. That would then be passed in to gss_accept_sec_context() instead of using GSS_C_NO_CREDENTIAL (in port-gss-cred).

That's the direction I was thinking in. I just wanted to have it confirmed. Henry, what's your take on this?

I'm kind of surprised it's working without that and rather curious as to what it's doing under the hood to make that happen. :/

Most likely it's just checking the keytab to find a principal with the same name as the one presented from the client. Since one is present, it loads it up automatically, and verifies against it.

Bingo! The server uses the keytab to decrypt the token provided by the client. By using the GSS_C_NO_CREDENTIAL arg on the server, anything put in the keytab is OK. (The server doesn't need to authenticate itself to Kerberos, it just accepts authentication. Mutual authentication is done using the same keys.) The documentation needs to reflect that. The real question is whether we *need* to check anything. If the keytab is unique to PostgreSQL, as I think it should be, then I think the admin should put only the right stuff in the keytab, and no further checks are needed. Now it *might* make sense to check that the credential that's accepted actually has some correspondence to what ./configure said.
If we do do that, then we need to allow for the ways Microsoft mucks with the case of the name. (Kerberos is supposed to be case sensitive, but Microsoft doesn't work that way.) In particular I think we may need both postgres/server and POSTGRES/server in the keytab in order to support the to-be-written native Windows SSPI client at the same time as the current Kerberos 5 and GSSAPI Unix clients.

Now what happens with non-Kerberos 5 mechanisms like SPKM and SPNEGO? ;-) I'll have to see, even if it doesn't matter for Java, yet.

The opinions expressed in this message are mine, not those of Caltech, JPL, NASA, or the US Government. [EMAIL PROTECTED], or [EMAIL PROTECTED]
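The alternative Stephen suggests upthread — acquiring a credential for the configured service name and passing it to gss_accept_sec_context() instead of GSS_C_NO_CREDENTIAL — would look roughly like this. This is a sketch against the RFC 2744 C bindings, not the actual patch: the function and variable names are invented, error reporting is elided, and it needs a real keytab and KDC to do anything.

```c
#include <gssapi/gssapi.h>
#include <string.h>

/*
 * Sketch: restrict the server to the configured service principal
 * instead of accepting whatever happens to be in the keytab.
 * Hypothetical helper; real code would report errors via
 * pg_GSS_error() rather than silently falling back.
 */
static gss_cred_id_t
acquire_server_cred(const char *service_name)	/* e.g. "postgres" */
{
	OM_uint32	maj, min;
	gss_buffer_desc namebuf;
	gss_name_t	server_name;
	gss_cred_id_t server_cred;

	namebuf.value = (void *) service_name;
	namebuf.length = strlen(service_name);

	/* Parse the configured name as a host-based service name */
	maj = gss_import_name(&min, &namebuf,
						  GSS_C_NT_HOSTBASED_SERVICE, &server_name);
	if (GSS_ERROR(maj))
		return GSS_C_NO_CREDENTIAL;

	/* Look the principal up in the keytab; ACCEPT-only usage */
	maj = gss_acquire_cred(&min, server_name, GSS_C_INDEFINITE,
						   GSS_C_NO_OID_SET, GSS_C_ACCEPT,
						   &server_cred, NULL, NULL);
	gss_release_name(&min, &server_name);
	if (GSS_ERROR(maj))
		return GSS_C_NO_CREDENTIAL;

	/* Caller passes server_cred to gss_accept_sec_context() */
	return server_cred;
}
```

With a specific credential in hand, gss_accept_sec_context() would reject clients presenting tickets for any other principal in the keytab, which is what would give the server-side service-name parameter a purpose.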
Re: [PATCHES] Preliminary GSSAPI Patches
* Henry B. Hotz ([EMAIL PROTECTED]) wrote: On Jun 22, 2007, at 9:56 AM, Magnus Hagander wrote: Most likely it's just checking the keytab to find a principal with the same name as the one presented from the client. Since one is present, it loads it up automatically, and verifies against it. Bingo! The server uses the keytab to decrypt the token provided by the client. By using the GSS_C_NO_CREDENTIAL arg on the server anything put in the keytab is OK. (The server doesn't need to authenticate itself to Kerberos, it just accepts authentication. Mutual authentication is done using the same keys.) The documentation needs to reflect that.

I agree there's some disconnect there between the documentation and the apparent implementation, but I'm not sure I'm in favor of changing the documentation on this one. Personally, I'd rather it return an error if someone tries to use GSS_C_NO_CREDENTIAL when accepting a context than to just be happy using anything in the keytab.

The real question is whether we *need* to check anything. If the keytab is unique to PostgreSQL, as I think it should be, then I think the admin should put only the right stuff in the keytab, and no further checks are needed.

While I agree that in general it's best to have a keytab file for each separate service, there's no guarantee of that being done, and using, say, the 'host' service to allow authentication to PG when the PG service is 'postgres' isn't right.

Now it *might* make sense to check that the credential that's accepted actually has some correspondence to what ./configure said.

Or what's configured in the postgres.conf (as is done for Kerberos now...).

If we do do that, then we need to allow for the ways Microsoft mucks with the case of the name. (Kerberos is supposed to be case sensitive, but Microsoft doesn't work that way.)
In particular I think we may need both postgres/server and POSTGRES/server in the keytab in order to support the to-be-written native Windows SSPI client at the same time as the current Kerberos 5 and GSSAPI Unix clients.

Supporting multiple, specific keys might be an interesting challenge, but I'm not too keen on worrying about it right now regardless. I'd also much rather err on the side of overly paranoid than "if it works, just let it in". If someone ends up having to support both Windows SSPI clients and Unix Kerberos/GSSAPI clients, it's entirely possible to suggest they just make it POSTGRES and configure the clients accordingly.

Now what happens with non-Kerberos 5 mechanisms like SPKM and SPNEGO? ;-) I'll have to see, even if it doesn't matter for Java, yet.

Err, SPNEGO's not really a full mechanism itself, it's got sub-mechs of NTLM and Kerberos, and from what I've seen it does the same thing SSPI does and assumes uppercase. Again, it's ugly, but not unbearable. Such fun things are probably also why mod-auth-kerb for Apache lets you configure the service name, like most other Kerberos servers... I've yet to run into one that will just use any key in the keytab it's given. Thanks, Stephen