Re: [HACKERS] pgbench vs. wait events
On 10/7/16 10:42 AM, Andres Freund wrote:
> Hi,
>
> On 2016-10-06 20:52:22 -0700, Alfred Perlstein wrote:
>> This contention on WAL reminds me of another scenario I've heard about that was similar. To fix things, the first person to block would be made responsible for writing out all buffers for anyone blocked behind "him".
>
> We pretty much do that already. But while that's happening, the other would-be-writers show up as blocking on the lock. We use kind of an odd locking model for the waiters (LWLockAcquireOrWait()), which waits for the lock to be released, but doesn't try to acquire it afterwards. Instead the WAL position is rechecked, and in many cases we'll be done afterwards, because enough has been written out.
>
> Greetings, Andres Freund

Are the batched writes all done before fsync is called? Are you sure that A only calls fsync after flushing all the buffers from B, C, and D? Or will it fsync twice? Is there instrumentation to show that? I know there's a tremendous level of skill involved in this code; I'm simply asking in case there are some tricks.

Another strategy that may work is intentionally waiting/buffering a few ms between flushes/fsyncs: for example, making sure that the number of flushes per second doesn't exceed some configurable amount, because each flush likely eats at least one I/O operation from the disk, and there is a maximum IOPS per disk, so you might as well buffer more if you're exceeding that IOPS budget. You'll trade some latency, but gain throughput by doing that. Does this make sense? Again, apologies if this has already been covered. Is there a whitepaper or blog post, or a clear way I can examine the algorithm wrt locks/buffering for flushing WAL logs?

-Alfred

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
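Alfred's rate-limiting idea can be put in rough numbers. The following is a back-of-envelope sketch added for illustration only: the function names and figures are invented, and PostgreSQL's own knobs in this area (commit_delay and commit_siblings) work differently from this model.

```python
import math

def min_batch_size(commits_per_sec, max_fsync_per_sec):
    """Smallest average number of commits that must share one flush
    if the disk can only sustain max_fsync_per_sec fsync calls."""
    return max(1, math.ceil(commits_per_sec / max_fsync_per_sec))

def worst_added_latency_ms(max_fsync_per_sec):
    """Upper bound on the extra commit latency caused by waiting
    for the next scheduled flush."""
    return 1000.0 / max_fsync_per_sec

# 5000 commits/s against a disk good for 200 fsync/s: about 25 commits
# must share each flush, at a cost of at most ~5 ms of added latency.
batch = min_batch_size(5000, 200)
latency = worst_added_latency_ms(200)
```

The point of the model is the trade Alfred describes: capping flushes per second bounds the IOPS spent on WAL, while the added latency is bounded by the flush interval.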
Re: [HACKERS] pgbench vs. wait events
Robert,

This contention on WAL reminds me of another scenario I've heard about that was similar. To fix things, the first person to block would be made responsible for writing out all buffers for anyone blocked behind "him".

For example, say you have many threads: A, B, C, D. While A is writing to WAL and holding the lock, B arrives and of course blocks; then C comes along and blocks as well, then D. Finally A finishes its write. Now you have two options for resolving this, either:

1) A drops its lock and B picks up the lock... B writes its buffer and then drops the lock. Then C gets the lock, writes its buffer, drops the lock, and finally D gets the lock, writes its buffer and then drops the lock.

2) A also writes out B's, C's, and D's buffers, then A drops the lock; B, C and D wake up, note that their respective buffers are written, and just return.

The second option greatly speeds up the system. (Just be careful to make sure A doesn't do "too much work", otherwise you can get a sort of livelock if too many threads are blocked behind it; generally, only issue one additional flush on behalf of other threads, and do not "loop until the queue is empty".)

I'm not sure if this is actually possible with the way WAL is implemented (or perhaps this strategy is already implemented), but if it isn't done already it's definitely worth doing, as it can speed things up enormously.

On 10/6/16 11:38 AM, Robert Haas wrote:

Hi, I decided to do some testing on hydra (IBM-provided community resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using the newly-enhanced wait event stuff to try to get an idea of what we're waiting for during pgbench. I did 30-minute pgbench runs with various configurations, but all had max_connections = 200, shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit = off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints = on.
During each run, I ran this psql script in another window and captured the output:

\t
select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid()
\watch 0.5

Then, I used a little shell-scripting to count up the number of times each wait event occurred in the output. First, I tried scale factor 3000 with 32 clients and got these results:

     1 LWLockTranche | buffer_mapping
     9 LWLockNamed   | CLogControlLock
    14 LWLockNamed   | ProcArrayLock
    16 Lock          | tuple
    25 LWLockNamed   | CheckpointerCommLock
    49 LWLockNamed   | WALBufMappingLock
   122 LWLockTranche | clog
   182 Lock          | transactionid
   287 LWLockNamed   | XidGenLock
  1300 Client        | ClientRead
  1375 LWLockTranche | buffer_content
  3990 Lock          | extend
 21014 LWLockNamed   | WALWriteLock
 28497               |
 58279 LWLockTranche | wal_insert

tps = 1150.803133 (including connections establishing)

What jumps out here, at least to me, is that there is furious contention on both the wal_insert locks and on WALWriteLock. Apparently, the system simply can't get WAL on disk fast enough to keep up with this workload. Relation extension locks and buffer_content locks are also pretty common, both ahead of ClientRead, a relatively uncommon wait event on this test. The load average on the system was only about 3 during this test, indicating that most processes are in fact spending most of their time off-CPU. The first thing I tried was switching to unlogged tables, which produces these results:

     1 BufferPin     | BufferPin
     1 LWLockTranche | lock_manager
     2 LWLockTranche | buffer_mapping
     8 LWLockNamed   | ProcArrayLock
     9 LWLockNamed   | CheckpointerCommLock
     9 LWLockNamed   | CLogControlLock
    11 LWLockTranche | buffer_content
    37 LWLockTranche | clog
   153 Lock          | tuple
   388 LWLockNamed   | XidGenLock
   827 Lock          | transactionid
  1267 Client        | ClientRead
 20631 Lock          | extend
 91767               |

tps = 1223.239416 (including connections establishing)

If you don't look at the TPS number, these results look like a vast improvement.
The overall amount of time spent not waiting for anything is now much higher, and the problematic locks have largely disappeared from the picture. However, the load average now shoots up to about 30, because most of the time that the backends are "not waiting for anything" they are in fact in kernel wait state D; that is, they're stuck doing I/O. This suggests that we might want to consider advertising a wait state when a backend is doing I/O, so we can measure this sort of thing.

Next, I tried lowering the scale factor to something that fits in shared buffers. Here are the results at scale factor 300:

    14 Lock          |
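The two options Alfred describes, together with the recheck behavior Andres mentions for LWLockAcquireOrWait(), can be sketched in code. Below is a toy thread-based model of option 2 that I am adding for illustration: a would-be writer that blocked behind the flush lock rechecks how far the log has been flushed once it gets through, and returns without doing any I/O if a "leader" already covered it. This is my own sketch, not PostgreSQL's actual WAL code (which, among other differences, does not acquire the lock on the recheck path).

```python
import threading

FLUSH_CALLS = []  # each entry: how many buffers one "fsync" covered

def write_to_disk(batch):
    FLUSH_CALLS.append(len(batch))  # stand-in for write() + fsync()

class GroupFlushLog:
    def __init__(self):
        self.mutex = threading.Lock()       # protects the fields below
        self.flush_lock = threading.Lock()  # the contended "WALWriteLock"
        self.pending = []                   # appended but not yet flushed
        self.inserted = 0                   # buffers appended so far
        self.flushed = 0                    # buffers written out so far

    def commit(self, buf):
        with self.mutex:
            self.pending.append(buf)
            self.inserted += 1
            my_pos = self.inserted
        # After waiting for the flush lock, recheck: a leader may have
        # flushed our buffer for us while we were blocked.
        with self.flush_lock:
            with self.mutex:
                if self.flushed >= my_pos:
                    return                  # covered by a leader; no I/O
                batch, self.pending = self.pending, []
                self.flushed += len(batch)
            write_to_disk(batch)            # one write for the whole batch

log = GroupFlushLog()
threads = [threading.Thread(target=log.commit, args=(i,)) for i in range(16)]
for t in threads: t.start()
for t in threads: t.join()
# Every buffer hits "disk" exactly once, in no more flushes than commits.
```

When several committers pile up behind the flush lock, the first one through writes the whole batch and the rest return without issuing a write, which is exactly the batching effect under discussion.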
Re: [HACKERS] Why we lost Uber as a user
On 8/3/16 3:29 AM, Greg Stark wrote:
> Honestly the take-away I see in the Uber story is that they apparently had nobody on staff that was on -hackers or apparently even -general, and tried to go it alone rather than involve experts from outside their company. As a result they misdiagnosed their problems based on prejudices, seeing what they expected to see rather than what the real problem was.

Agree strongly, but there are still lessons to be learned on the PostgreSQL side.

-Alfred
Re: [HACKERS] Why we lost Uber as a user
On 8/2/16 10:02 PM, Mark Kirkwood wrote:
> On 03/08/16 02:27, Robert Haas wrote:
>> Personally, I think that incremental surgery on our current heap format to try to fix this is not going to get very far. If you look at the history of this, 8.3 was a huge release for timely cleanup of dead tuples. There was also significant progress in 8.4 as a result of 5da9da71c44f27ba48fdad08ef263bf70e43e689. As far as I can recall, we then made no progress at all in 9.0 - 9.4. We made a very small improvement in 9.5 with 94028691609f8e148bd4ce72c46163f018832a5b, but that's pretty niche. In 9.6, we have "snapshot too old", which I'd argue is potentially a large improvement, but it was big and invasive and will no doubt pose code maintenance hazards in the years to come; also, many people won't be able to use it or won't realize that they should use it. I think it is likely that further incremental improvements here will be quite hard to find, and the amount of effort will be large relative to the amount of benefit. I think we need a new storage format where the bloat is cleanly separated from the data rather than intermingled with it; every other major RDBMS works that way. Perhaps this is a case of "the grass is greener on the other side of the fence", but I don't think so.
>
> Yeah, I think this is a good summary of the state of play. The only other new db development to use a non-overwriting design like ours that I know of was Jim Starkey's Falcon engine for (ironically) MySQL 6.0. Not sure if anyone is still progressing that at all now. I do wonder if Uber could have successfully tamed dead tuple bloat with aggressive per-table autovacuum settings (and if in fact they tried), but as I think Robert said earlier, it is pretty easy to come up with a high-update (or insert + delete) workload that makes for a pretty ugly bloat component even with really aggressive autovacuuming.
I also wonder if they could have used a "star schema", which to my understanding would mean multiple tables replacing the single table with multiple indices, to work around the write amplification problem in PostgreSQL.

Cheers,
Mark
Re: [HACKERS] Why we lost Uber as a user
On 8/4/16 2:00 AM, Torsten Zuehlsdorff wrote:

On 03.08.2016 21:05, Robert Haas wrote:

On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane wrote:

Robert Haas writes: I don't think they are saying that logical replication is more reliable than physical replication, nor do I believe that to be true. I think they are saying that if logical corruption happens, you can fix it by typing SQL statements to UPDATE, INSERT, or DELETE the affected rows, whereas if physical corruption happens, there's no equally clear path to recovery.

Well, that's not an entirely unreasonable point, but I dispute the implication that it makes recovery from corruption an easy thing to do. How are you going to know what SQL statements to issue? If the master database is changing 24x7, how are you going to keep up with that?

I think in many cases people fix their data using business logic. For example, suppose your database goes down and you have to run pg_resetxlog to get it back up. You dump-and-restore, as one does, and find that you can't rebuild one of your unique indexes because there are now two records with that same PK. Well, what you do is you look at them and judge which one has the correct data, often the one that looks more complete or the one with the newer timestamp. Or, maybe you need to merge them somehow. In my experience helping users through problems of this type, once you explain the problem to the user and tell them they have to square it on their end, the support call ends. The user may not always be entirely thrilled about having to, say, validate a problematic record against external sources of truth, but they usually know how to do it. Database bugs aren't the only way that databases become inaccurate. If the database that they use to keep track of land ownership in the jurisdiction where I live says that two different people own the same piece of property, somewhere there is a paper deed in a filing cabinet.
Fishing that out to understand what happened may not be fun, but a DBA can explain that problem to other people in the organization and those people can get it fixed. It's a problem, but it's fixable. On the other hand, if a heap tuple contains invalid infomask bits that cause an error every time you read the page (this actually happened to an EnterpriseDB customer!), the DBA can't tell other people how to fix it and can't fix it personally either. Instead, the DBA calls me.

After reading this statement the ZFS filesystem pops into my mind. It has protection built in against various problems (data degradation, current spikes, phantom writes, etc.). For me this raises two questions: 1) Would the usage of ZFS prevent such errors? My feeling would say yes, but I have no idea how an invalid infomask bit could occur. 2) Would it be possible to add such prevention to PostgreSQL? I know this could add massive overhead, but if it's optional this could be a fine thing.

PostgreSQL is very "ZFS-like" in its internals. The problem was a bug in PostgreSQL that caused it to just write data to the wrong place. Some vendors use ZFS under databases to provide very cool services such as backup snapshots, test snapshots and other such uses. I think Joyent is one such vendor, but I'm not 100% sure.

-Alfred
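The "judge which one has the correct data" step Robert describes can often be expressed as a single cleanup query run before rebuilding the index. A hedged sketch follows, using SQLite only so the example is self-contained: the table, columns, and the newest-timestamp-wins rule are all invented for illustration, and it assumes no ties on the timestamp.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER, updated_at TEXT)")
con.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, 100, "2016-08-01"),   # stale copy of id 1
     (1, 250, "2016-08-03"),   # newer copy of id 1 - the one to keep
     (2,  75, "2016-08-02")],
)

# Keep only the most recently updated copy of each id, then rebuild
# the unique index that the duplicate rows were blocking.
con.execute("""
    DELETE FROM accounts
    WHERE rowid NOT IN (
        SELECT rowid FROM accounts a
        WHERE a.updated_at = (SELECT MAX(b.updated_at) FROM accounts b
                              WHERE b.id = a.id)
    )
""")
con.execute("CREATE UNIQUE INDEX accounts_pkey ON accounts (id)")  # now succeeds

rows = con.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
```

In practice, as Robert notes, the rule for choosing the surviving row is business logic: it might be the newer timestamp, the more complete record, or a merge of both.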
Re: [HACKERS] Why we lost Uber as a user
> On Aug 3, 2016, at 3:29 AM, Greg Stark wrote:
>
> Honestly the take-away I see in the Uber story is that they apparently
> had nobody on staff that was on -hackers or apparently even -general
> and tried to go it alone rather than involve experts from outside
> their company. As a result they misdiagnosed their problems based on
> prejudices seeing what they expected to see rather than what the real
> problem was.

+1, very true. At the same time there are some lessons to be learned. At the very least, putting in big bold letters where to come for help is one.
Re: [HACKERS] Why we lost Uber as a user
On 8/2/16 2:14 PM, Tom Lane wrote:
> Stephen Frost writes:
>> With physical replication, there is the concern that a bug in *just* the physical (WAL) side of things could cause corruption.
>
> Right. But with logical replication, there's the same risk that the master's state could be fine but a replication bug creates corruption on the slave. Assuming that the logical replication works by issuing valid SQL commands to the slave, one could hope that this sort of "corruption" only extends to having valid data on the slave that fails to match the master. But that's still not a good state to be in. And to the extent that performance concerns lead the implementation to bypass some levels of the SQL engine, you can easily lose that guarantee too. In short, I think Uber's position that logical replication is somehow more reliable than physical is just wishful thinking. If anything, my money would be on the other way around: there's a lot less mechanism that can go wrong in physical replication. Which is not to say there aren't good reasons to use logical replication; I just do not believe that one.
>
> regards, tom lane

The reason it can be less catastrophic is that with logical replication you may futz up your data, but you are safe from corrupting your entire db. Meaning, if an update is missed or doubled, that may be addressed by a fixup SQL statement; however, if the replication causes a write to an entirely wrong place in the db file, then you need to "fsck" your db and hope that nothing super critical was blown away.

The impact across a cluster is potentially magnified by physical replication. For instance, let's say there is a bug in the master's write to disk. With physical replication, the bad writes go to the slaves as well, so any corruption experienced on the master will quickly reach the slaves and they too will be corrupted. With logical replication, that bug may be stopped at the replication layer, which acts as a barrier against the bad write reaching the slaves.
At that point you can resync the slave from the master. In the case of physical replication, by contrast, all your base are belong to zuul and you are in a very bad state. That said, with logical replication, who's to say that if the statement is replicated to a slave, the slave won't experience the same bug and also corrupt itself? We may be saying the same thing, but still there is something to be said for logical replication... Also, didn't they show that logical replication was faster for some use cases at Uber?

-Alfred
Re: [HACKERS] Why we lost Uber as a user
> On Aug 2, 2016, at 2:33 AM, Geoff Winkless <pgsqlad...@geoff.dj> wrote: > >> On 2 August 2016 at 08:11, Alfred Perlstein <alf...@freebsd.org> wrote: >>> On 7/2/16 4:39 AM, Geoff Winkless wrote: >>> I maintain that this is a nonsense argument. Especially since (as you >>> pointed out and as I missed first time around) the bug actually occurred at >>> different records on different slaves, so he invalidates his own point. > >> Seriously? > > No, I make a habit of spouting off random arguments to a list full of > people whose opinions I massively respect purely for kicks. What do > you think? > >> There's a valid point here, you're sending over commands at the block level, >> effectively "write to disk at this location" versus "update this record >> based on PK", obviously this has some drawbacks that are reason for concern. > > Writing values directly into file offsets is only problematic if > something else has failed that has caused the file to be an inexact > copy. If a different bug occurred that caused the primary key to be > corrupted on the slave (or indeed the master), PK-based updates would > exhibit similar propagation errors. > > To reiterate my point, uber's described problem came about because of > a bug. Every software has bugs at some point in its life, to pretend > otherwise is simply naive. I'm not trying to excuse the bug, or to > belittle the impact that such a bug has on data integrity or on uber > or indeed on the reputation of PostgreSQL. While I'm prepared to > accept (because I have a job that requires I spend time on things > other than digging through obscure reddits and mailing lists to > understand more fully the exact cause) that in _this particular > instance_ the bug was propagated because of the replication mechanism > (although I'm still dubious about that, as per my comment above), that > does _not_ preclude other bugs propagating in a statement-based > replication. 
> That's what I said is a nonsense argument, and no-one has
> yet explained in what way that's incorrect.
>
> Geoff

Geoff, you are quite technical; my feeling is that you will understand it, however it will need to be a self-learned lesson.

-Alfred
Re: [HACKERS] Why we lost Uber as a user
On 7/26/16 9:54 AM, Joshua D. Drake wrote:
> Hello, the following article is a very good look at some of our limitations and highlights some of the pains many of us have been working "around" since we started using the software.
>
> https://eng.uber.com/mysql-migration/
>
> Specifically:
> * Inefficient architecture for writes
> * Inefficient data replication
> * Issues with table corruption
> * Poor replica MVCC support
> * Difficulty upgrading to newer releases
>
> It is a very good read and I encourage our hackers to do so with an open mind. Sincerely, JD

It was a good read. Having based a high-performance web tracking service as well as a high-performance security appliance on PostgreSQL, I too have been bitten by these issues. I had a few questions that maybe the folks with core knowledge can answer:

1) Would it be possible to create a "star-like" schema to fix this problem? For example, let's say you have a table that is similar to Uber's: col0pk, col1, col2, col3, col4, col5. All cols are indexed. Assuming that updates happen to only one column at a time, why not figure out some way to encourage or automate the splitting of this table into multiple tables that present themselves as a single table? What I mean is that you would then wind up with the following tables:

table1: col0pk, col1
table2: col0pk, col2
table3: col0pk, col3
table4: col0pk, col4
table5: col0pk, col5

Now when you update "col5" on a row, you only have to update the index on table5:col5 and table5:col0pk, as opposed to beforehand where you would have to update more indices. In addition, I believe that vacuum would be somewhat mitigated as well in this case.

2) Why not have a look at how InnoDB does its storage; would it be possible to do this?

3) For the small-ish table that Uber mentioned, is there a way to "have it in memory" but provide some level of sync to disk so that it is consistent?

thanks!
-Alfred
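Question 1 above can be mocked up with plain SQL. A hedged sketch follows, using SQLite only to keep the example self-contained; the table and column names follow Alfred's hypothetical layout, and the sketch deliberately ignores the MVCC and index-maintenance machinery that makes the real problem hard. Each frequently updated column lives in its own table keyed by col0pk, and a view reassembles the original row shape, so an update to one column touches only its own table and indexes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One narrow table per column, each keyed (and indexed) by col0pk.
for i in (1, 2):
    con.execute(f"CREATE TABLE t{i} (col0pk INTEGER PRIMARY KEY, col{i} TEXT)")
# A view presents the split tables as the original single wide table.
con.execute(
    "CREATE VIEW wide AS SELECT t1.col0pk AS col0pk, t1.col1, t2.col2 "
    "FROM t1 JOIN t2 ON t1.col0pk = t2.col0pk"
)

con.execute("INSERT INTO t1 VALUES (1, 'a')")
con.execute("INSERT INTO t2 VALUES (1, 'b')")

# Updating col2 writes only to t2; t1 and its index are never touched.
con.execute("UPDATE t2 SET col2 = 'B' WHERE col0pk = 1")

row = con.execute("SELECT col0pk, col1, col2 FROM wide").fetchone()
```

The cost of this layout is that reading a full row becomes a multi-way join, which is why it only pays off when updates are narrow and frequent.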
Re: [HACKERS] Why we lost Uber as a user
On 7/28/16 7:08 AM, Merlin Moncure wrote:
> *) postgres may not be the ideal choice for those who want a thin and simple database

This is a huge market; addressing it will bring mindshare and more jobs, code and braintrust to PostgreSQL.

-Alfred
Re: [HACKERS] Why we lost Uber as a user
On 7/28/16 4:39 AM, Geoff Winkless wrote:
> On 28 Jul 2016 12:19, Vitaly Burovoy wrote:
>> On 7/28/16, Geoff Winkless wrote:
>>> On 27 July 2016 at 17:04, Bruce Momjian wrote:
>>>> Well, their big complaint about binary replication is that a bug can spread from a master to all slaves, which doesn't happen with statement level replication.
>>>
>>> I'm not sure that that makes sense to me. If there's a database bug that occurs when you run a statement on the master, it seems there's a decent chance that that same bug is going to occur when you run the same statement on the slave.
>>>
>>> Obviously it depends on the type of bug and how identical the slave is, but statement-level replication certainly doesn't preclude such a bug from propagating.
>>>
>>> Geoff
>>
>> Please, read the article first! The bug is about wrong visibility of tuples after applying WAL at slaves. For example, you can see two different records selecting from a table by a primary key (moreover, their PKs are the same, but other columns differ).
>
> I read the article. It affected slaves as well as the master. I quote: "because of the way replication works, this issue has the potential to spread into all of the databases in a replication hierarchy"
>
> I maintain that this is a nonsense argument. Especially since (as you pointed out and as I missed first time around) the bug actually occurred at different records on different slaves, so he invalidates his own point.
>
> Geoff

Seriously? There's a valid point here: you're sending over commands at the block level, effectively "write to disk at this location" versus "update this record based on PK"; obviously this has some drawbacks that are reason for concern. Does it validate the move on its own? No. Does it add to the reasons to move away? Yes, that much is obvious.
Please read this thread: https://www.reddit.com/r/programming/comments/4vms8x/why_we_lost_uber_as_a_user_postgresql_mailing_list/d5zx82n

Do I love PostgreSQL? Yes. Have I been bitten by things such as this? Yes. Should the community learn from these things and think of ways to avoid them? Absolutely!

-Alfred
[HACKERS] Question about durability and postgresql.
Hello,

We have a combination of 9.3 and 9.4 databases used for logging of data. We do not need a strong durability guarantee, meaning it is OK if on crash a minute or two of data is lost from our logs. (This is just stats for our internal tool.)

I am looking at this page: http://www.postgresql.org/docs/9.4/static/non-durability.html and it's not clear which settings I should change. What we do NOT want is to lose an entire table or corrupt the database. We do want to gain speed, though, by not making DATA writes durable. Which settings are appropriate for this use case? At a glance it looks like a combination of:

1) Turn off synchronous_commit;

and possibly:

2) Increase checkpoint_segments and checkpoint_timeout; this reduces the frequency of checkpoints, but increases the storage requirements of /pg_xlog.

3) Turn off full_page_writes; there is no need to guard against partial page writes.

The point here is to never get a corrupt database, but in case of crash we might lose a few minutes of the last transactions. Any suggestions please?

thank you,
-Alfred
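Putting the three items together, a minimal sketch of what that postgresql.conf might look like. The values are illustrative starting points, not tested recommendations, and item 3 deserves a caveat: disabling full_page_writes is only safe when the underlying filesystem prevents torn pages.

```ini
# Reduced-durability settings for a 9.3/9.4 stats/logging database.
# Goal: no corruption after a crash, but the last minute or two of
# commits may be lost.
synchronous_commit = off      # commits return before WAL reaches disk
checkpoint_segments = 64      # 9.x-era setting; fewer, larger checkpoints
checkpoint_timeout = 15min    # more time between checkpoints
full_page_writes = off        # ONLY if storage prevents torn page writes
fsync = on                    # leave on: turning it off risks corruption
```

Note that fsync stays on: it is the setting whose removal can actually corrupt the database after a crash, which is exactly the outcome the question rules out.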
Re: [HACKERS] Perfomance degradation 9.3 (vs 9.2) for FreeBSD
JFYI, we have 3 or 4 machines racked for the pgsql project in our DC. Tom informed me he would be lighting them up this week, time permitting.

Sent from my iPhone

On Apr 26, 2014, at 6:15 PM, Stephen Frost sfr...@snowman.net wrote:
> Jim,
>
> * Jim Nasby (j...@nasby.net) wrote:
>> On 4/22/14, 5:01 PM, Alfred Perlstein wrote:
>>> We also have colo space and power, etc. So this would be the whole deal. The cluster would be up for as long as needed. Are the machine specs sufficient? Any other things we should look for? CC'd Tom on this email.
>>
>> Did anyone respond to this off-list?
>
> Yes, I did follow-up with Tom. I'll do so again, as the discussion had died down.
>
>> Would these machines be more useful as dedicated performance test servers for the community or generic BenchFarm members?
>
> I don't believe they would be terribly useful as buildfarm systems; we could set up similar systems with VMs to just run the regression tests. Where I see these systems being particularly valuable would be as the start of our performance farm, and perhaps one of the systems as a PG infrastructure server.
>
> Thanks!
> Stephen
Re: [HACKERS] Perfomance degradation 9.3 (vs 9.2) for FreeBSD
On 4/22/14, 8:26 AM, Andrew Dunstan wrote:
> On 04/22/2014 01:36 AM, Joshua D. Drake wrote:
>> On 04/21/2014 06:19 PM, Andrew Dunstan wrote:
>>> If we never start we'll never get there. I can think of several organizations that might be approached to donate hardware.
>>
>> Like .Org? We have a hardware farm, a rack full of hardware and spindles. It isn't the most current but it is there.
>
> I'm going away tomorrow for a few days R&R. When I'm back next week I will set up a demo client running this module. If you can have a machine prepped for this purpose by then, so much the better; otherwise I will have to drag out a box I recently rescued and have been waiting for something to use it with. It's more important that it's stable (i.e. nothing else running on it) than that it's very powerful. It could be running Ubuntu or some Redhattish variant or, yes, even FreeBSD.
>
> cheers
> andrew

Hey folks, I just spoke with our director of netops, Tom Sparks, here at Norse, and we have a vested interest in PostgreSQL. We can throw together a cluster of 4 machines with specs approximately in the range of dual quad-core Westmere with ~64GB of RAM running FreeBSD 10 or 11. We can also do an Ubuntu install, or another Linux distro. Please let me know if this would be something that the project could make use of. We also have colo space and power, etc. So this would be the whole deal. The cluster would be up for as long as needed. Are the machine specs sufficient? Any other things we should look for? CC'd Tom on this email.

-Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14 4:10 AM, Andres Freund wrote: Hi, On 2014-04-20 11:24:38 +0200, Palle Girgensohn wrote: I see performance degradation with PostgreSQL 9.3 vs 9.2 on FreeBSD, and I'm wondering who to poke to mitigate the problem. In reference to this thread [1], who were the FreeBSD people that Francois mentioned? If mmap needs to perform well in the kernel, I'd like to know of someone with FreeBSD kernel knowledge who is interested in working on mmap performance. If mmap is indeed the culprit: I've just tested 9.2.8 vs 9.3.4, and I never isolated the mmap patch, although I believe Francois did just that, with similar results. If there are indeed such large regressions on FreeBSD we need to treat them as postgres regressions. It's nicer not to add config options for things that don't need it, but apparently that's not the case here. Imo this means we need to add a GUC to control whether anon mmap() or sysv shmem is to be used. In 9.3. Greetings, Andres Freund Andres, thank you. Speaking as a FreeBSD developer, that would be a good idea. -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14 8:45 AM, Andrew Dunstan wrote: On 04/21/2014 11:39 AM, Magnus Hagander wrote: On Mon, Apr 21, 2014 at 4:51 PM, Andres Freund and...@2ndquadrant.com mailto:and...@2ndquadrant.com wrote: On 2014-04-21 10:45:24 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com mailto:and...@2ndquadrant.com writes: If there are indeed such large regressions on FreeBSD we need to treat them as postgres regressions. It's nicer not to add config options for things that don't need it, but apparently that's not the case here. Imo this means we need to add a GUC to control whether anon mmap() or sysv shmem is to be used. In 9.3. I will resist this mightily. One of the main reasons to switch to mmap was so we would no longer have to explain about SysV shm configuration. It's still explained in the docs, and one of the dynshm implementations is based on sysv shmem. So I don't see this as a convincing reason. Regressing installed OSs by 15-20% just to save a couple of lines of docs and code seems rather unconvincing to me. There's also the fact that even if it's changed in FreeBSD, that might be something that takes years to trickle out to whatever stable release people are actually using. But do we really want a *guc* for it though? Isn't it enough (and in fact better) with a configure switch to pick the implementation when multiple are available, which could then be set by default, for example, by the FreeBSD ports build? That's a lot less overhead to keep dragging around... That seems to make more sense. I can't imagine why this would be a runtime parameter as opposed to build time. I am unsure of the true overhead of making this a runtime tunable, so pardon if I'm asking for a lot. From the perspective of both an OS developer and PostgreSQL user (I am both) it really makes more sense to have it be a runtime tunable, for the following reasons: As an OS developer, making this a runtime tunable allows us to much more easily do the testing (instead of needing two compiled versions).
From a sysadmin perspective it makes switching to/from a LOT easier in case the new mmap code exposes a stability or performance bug. -Alfred
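The runtime-selectable behavior Alfred is asking for boils down to choosing a shared-memory backend with a flag instead of a compile switch. Here is a rough analogy in Python, purely illustrative: PostgreSQL's real code is C and chooses between SysV shmem and anonymous mmap(); below, named POSIX shared memory stands in for the SysV path, and the backend name plays the role of the proposed GUC.

```python
import mmap
from multiprocessing import shared_memory

def alloc_shared(nbytes, backend):
    """Allocate a shared-memory region using a runtime-selected backend.

    backend == "mmap": anonymous mmap, analogous to the 9.3 default.
    backend == "shm":  named POSIX shared memory, standing in here for
                       the SysV shmem path (an approximation, not PG code).
    """
    if backend == "mmap":
        return mmap.mmap(-1, nbytes)  # anonymous, inheritable mapping
    elif backend == "shm":
        return shared_memory.SharedMemory(create=True, size=nbytes)
    raise ValueError(backend)

# A GUC-like runtime switch: benchmarking both paths needs no recompile.
region = alloc_shared(4096, "mmap")
region[:5] = b"hello"
assert bytes(region[:5]) == b"hello"
region.close()

seg = alloc_shared(4096, "shm")
seg.buf[:5] = b"hello"
assert bytes(seg.buf[:5]) == b"hello"
seg.close()
seg.unlink()
```

The point of the sketch is only the shape of the knob: with a runtime switch, the same binary can be benchmarked both ways, which is exactly the testing convenience argued for above.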
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14 8:58 AM, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2014-04-21 11:45:49 -0400, Andrew Dunstan wrote: That seems to make more sense. I can't imagine why this would be a runtime parameter as opposed to build time. Because that implies that packagers and porters need to make that decision. If it's a GUC, people can benchmark it and decide. As against that, the packager would be more likely to get it right (or even to know that there's an issue). Can the package builder not set the default for the runtime tunable? Honestly, we're about to select a db platform for another FreeBSD-based system we are building; I am strongly hoping that we can get back to sysvshm easily, otherwise we may have to select another store. -Alfred (who still remembers back when Tom had a login on our primary db to help us. :) )
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14 9:13 AM, Stephen Frost wrote: * Alfred Perlstein (alf...@freebsd.org) wrote: Can the package builder not set the default for the runtime tunable? Yeah, I was thinking about that also, but at least in this case it seems pretty clear that the 'right' answer is known at build time. Honestly we're about to select a db platform for another FreeBSD-based system we are building; I am strongly hoping that we can get back to sysvshm easily, otherwise we may have to select another store. Is there no hope of this getting fixed in FreeBSD..? PG wouldn't be the only application impacted by this, I'm sure... There is definitely hope; however, changes to the FreeBSD vm are taken as seriously as core changes to PostgreSQL's storage layer. In addition, the vm is somewhat in the same realm of complexity as PostgreSQL's storage layer, so a fix may not be coming in the next few days/weeks, but rather a month or two. I am not sure if an easy fix is available in FreeBSD, but we will see in short order. I need to do some research. I work with Adrian (the FreeBSD kernel dev mentioned earlier in the thread); I'll grab him today and discuss what the issue may be. -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14 9:24 AM, Andrew Dunstan wrote: On 04/21/2014 11:59 AM, Alfred Perlstein wrote: On 4/21/14 8:45 AM, Andrew Dunstan wrote: On 04/21/2014 11:39 AM, Magnus Hagander wrote: On Mon, Apr 21, 2014 at 4:51 PM, Andres Freund and...@2ndquadrant.com mailto:and...@2ndquadrant.com wrote: On 2014-04-21 10:45:24 -0400, Tom Lane wrote: Andres Freund and...@2ndquadrant.com mailto:and...@2ndquadrant.com writes: If there are indeed such large regressions on FreeBSD we need to treat them as postgres regressions. It's nicer not to add config options for things that don't need it, but apparently that's not the case here. Imo this means we need to add a GUC to control whether anon mmap() or sysv shmem is to be used. In 9.3. I will resist this mightily. One of the main reasons to switch to mmap was so we would no longer have to explain about SysV shm configuration. It's still explained in the docs, and one of the dynshm implementations is based on sysv shmem. So I don't see this as a convincing reason. Regressing installed OSs by 15-20% just to save a couple of lines of docs and code seems rather unconvincing to me. There's also the fact that even if it's changed in FreeBSD, that might be something that takes years to trickle out to whatever stable release people are actually using. But do we really want a *guc* for it though? Isn't it enough (and in fact better) with a configure switch to pick the implementation when multiple are available, which could then be set by default, for example, by the FreeBSD ports build? That's a lot less overhead to keep dragging around... That seems to make more sense. I can't imagine why this would be a runtime parameter as opposed to build time. I am unsure of the true overhead of making this a runtime tunable, so pardon if I'm asking for a lot.
From the perspective of both an OS developer and PostgreSQL user (I am both) it really makes more sense to have it be a runtime tunable, for the following reasons: As an OS developer, making this a runtime tunable allows us to much more easily do the testing (instead of needing two compiled versions). From a sysadmin perspective, it makes switching to/from a LOT easier in case the new mmap code exposes a stability or performance bug. 1. OS developers are not the target audience for GUCs. If the OS developers want to test and can't be bothered with building with a couple of different parameters then I'm not very impressed. 2. We should be trying to get rid of GUCs where possible, and only add them when we must. The more there are, the more we confuse users. If a packager can pick a default, surely they can pick build options too. Thank you for the lecture, Andrew! Really pleasant way to treat a user and a fan of the system. :) -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14 9:34 AM, Stephen Frost wrote: * Alfred Perlstein (alf...@freebsd.org) wrote: There is definitely hope; however, changes to the FreeBSD vm are taken as seriously as core changes to PostgreSQL's storage layer. In addition, the vm is somewhat in the same realm of complexity as PostgreSQL's storage layer, so a fix may not be coming in the next few days/weeks, but rather a month or two. I am not sure if an easy fix is available in FreeBSD, but we will see in short order. This has been known for over a year.. :( I know! I remember warning y'all about it back at pgcon last year. :) I need to do some research. I work with Adrian (the FreeBSD kernel dev mentioned earlier in the thread); I'll grab him today and discuss what the issue may be. Hopefully that'll get things moving in the right direction, finally.. Sure; to be fair, we are under the gun here for a product, and it may just mean that the end result of that conversation is MySQL. I'm hoping we can use PostgreSQL, as I've been a huge fan since 1999. I based my first successful project on it and had a LOT of help from the pgsql community, Tom, Bruce, and we even contracted Vadim for some work on incremental vacuums! -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14 9:38 AM, Andrew Dunstan wrote: On 04/21/2014 12:25 PM, Alfred Perlstein wrote: 1. OS developers are not the target audience for GUCs. If the OS developers want to test and can't be bothered with building with a couple of different parameters then I'm not very impressed. 2. We should be trying to get rid of GUCs where possible, and only add them when we must. The more there are, the more we confuse users. If a packager can pick a default, surely they can pick build options too. Thank you for the lecture, Andrew! Really pleasant way to treat a user and a fan of the system. :) I confess to being mightily confused. Sure, to clarify: Andrew, you just told someone who in a db stack sits both below (as a FreeBSD kernel dev of 15 years) and above (as a pgsql user of 15 years) your software what they really need. -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14, 9:51 AM, Andrew Dunstan wrote: On 04/21/2014 12:44 PM, Alfred Perlstein wrote: On 4/21/14 9:38 AM, Andrew Dunstan wrote: On 04/21/2014 12:25 PM, Alfred Perlstein wrote: 1. OS developers are not the target audience for GUCs. If the OS developers want to test and can't be bothered with building with a couple of different parameters then I'm not very impressed. 2. We should be trying to get rid of GUCs where possible, and only add them when we must. The more there are, the more we confuse users. If a packager can pick a default, surely they can pick build options too. Thank you for the lecture, Andrew! Really pleasant way to treat a user and a fan of the system. :) I confess to being mightily confused. Sure, to clarify: Andrew, you just told someone who in a db stack sits both below (as a FreeBSD kernel dev of 15 years) and above (as a pgsql user of 15 years) your software what they really need. I told you what *we* (i.e. the PostgreSQL community) need, IMNSHO (and speaking as a Postgres developer and consultant of 10 or so years standing). How high on the hierarchy of PostgreSQL's needs is making a single option a runtime tunable versus a compile-time one? I mean, seriously, you mean to stick on this one point when one of your users is asking you about this? That is pretty concerning to me. -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14, 9:51 AM, Andres Freund wrote: On 2014-04-21 09:42:06 -0700, Alfred Perlstein wrote: Sure, to be fair, we are under the gun here for a product, it may just mean that the end result of that conversation is mysql. Personally, arguments in that vein are removing just about any incentive I have to work on the problem. I was just explaining that we have a timeline over here, and while that may disincentivize you from providing what we need, that would be very unfair. In that I mean sometimes the reality of a situation can be inconvenient, and for that I do apologize. What I am seeing here is unfortunately a very strong departure from FreeBSD support by several of the developers in the community. In fact, over drinks at pgcon last year there were a TON of jokes making fun of FreeBSD users and developers, which I took in stride as professional joking with alcohol involved. I thought it was pretty funny. However, a year later, I realize that there appears to be a real problem with FreeBSD in the pgsql community. There are other Linux-centric dbs to pick from. If pgsql is just another Linux-centric DB then that is unfortunate, but something I can deal with. -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14, 9:52 AM, Alvaro Herrera wrote: Alfred Perlstein wrote: I am unsure of the true overhead of making this a runtime tunable so pardon if I'm asking for a lot. From the perspective of both an OS developer and postgresql user (I am both) it really makes more sense to have it a runtime tunable for the following reasons: From an OS developer making this a runtime allows us to much more easily do the testing (instead of needing two compiled versions). From a sysadmin perspective it makes switching to/from a LOT easier in case the new mmap code exposes a stability or performance bug. In this case, AFAICS the only overhead of a runtime option (what we call a GUC) is the added potential for user confusion, and the necessary documentation. If we instead go for a compile-time option, both items become smaller. In any case, I don't see that there's much need for a runtime option, really; you already know that the mmap code path is slower in FreeBSD. You only need to benchmark both options once the FreeBSD vm code itself is fixed, right? In fact, it might not even need to be a configure option; I would suggest a pg_config_manual.h setting instead, and perhaps tweaks to the src/template/freebsd file to enable it automatically on the broken FreeBSD releases. We could then, in the future, have the template itself turn the option off for the future FreeBSD release that fixes the problem. That is correct, until you're in prod and suddenly one option becomes unstable, or you want to try a quick kernel patch without rebooting. Look, this is an argument I've lost time and time again in open source software communities, the idea of a software option as opposed to compile time really seems to hit people the wrong way. I think from now on it just makes sense to sit back and let whatever happens happen. -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14, 11:14 AM, Stephen Frost wrote: Alfred, * Alfred Perlstein (alf...@freebsd.org) wrote: On 4/21/14, 9:51 AM, Andres Freund wrote: On 2014-04-21 09:42:06 -0700, Alfred Perlstein wrote: Sure, to be fair, we are under the gun here for a product, it may just mean that the end result of that conversation is mysql. Personally, arguments in that vein are removing just about any incentive I have to work on the problem. I was just explaining that we have a timeline over here, and while that may disincentivize you from providing what we need, that would be very unfair. I'm pretty sure Andres was referring to the part where there's a 'threat' to move to some other platform due to a modest performance degradation, as if it's the only factor involved in making a decision among the various RDBMS options. If that's really your deciding criterion instead of the myriad of other factors, I daresay you have your priorities mixed up. There are other Linux-centric dbs to pick from. If pgsql is just another Linux-centric DB then that is unfortunate, but something I can deal with. These attacks really aren't going to get you anywhere. We're talking about a specific performance issue that FreeBSD has and how much PG (surely not the only application impacted by this issue) should bend to address it, even though the FreeBSD folks were made aware of the issue over a year ago and have done nothing to address it. Moreover, you'd like to also define the way we deal with the issue as being to make it runtime configurable rather than a compile-time option, even though 90% of the users out there won't understand the difference nor would know how to correctly set it (and, in many cases, may end up making the wrong decision because it's the default for other platforms, unless we add more code to address this at initdb time).
Basically, it doesn't sound like you're terribly concerned with the majority of our user base, even on FreeBSD, and would prefer to try and browbeat us into doing what you've decided is the correct solution because it'd work better for you. I've been guilty of the same in the past, and it's not fun having to back off from a proposal when it's pointed out that there's a better option, particularly when it doesn't seem like the alternative is better for me, but that's just part of working in any large project. Stephen, please calm down on the hyperbole; seriously, picking another db is not an attack. I was simply asking for a feature that would make my life easier, as both an admin deploying postgresql and a kernel dev attempting to fix a problem. I'm one guy, probably the only guy right now asking. Honestly, the thought of needing to compile two versions of postgresql to do sysv vs mmap performance testing would take me more time than I would like to devote to the issue when my time is already limited. Again, it was an ask; you are free to do what you like, the same way you were free to ignore my advice at pgcon about mmap being less efficient. It does not make what I'm saying an attack. Just like when interviewing people, choosing a different candidate for a job is not an attack on the other candidates! -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14, 12:47 PM, Stephen Frost wrote: Asking for help to address the FreeBSD performance would have been much better received. Thanks, Stephen That is exactly what I did, I asked for a version of postgresql that was easy to switch at runtime between two behaviors. That would make it a LOT easier to run a few scripts and make sure I got the correct binary without having to munge PREFIX and a bunch of PATH and other tools to get my test harness to DTRT. -Alfred
Re: [HACKERS] Performance degradation 9.3 (vs 9.2) for FreeBSD
On 4/21/14, 2:23 PM, Stephen Frost wrote: Alfred, * Alfred Perlstein (alf...@freebsd.org) wrote: On 4/21/14, 12:47 PM, Stephen Frost wrote: Asking for help to address the FreeBSD performance would have been much better received. Thanks, Stephen That is exactly what I did, I asked for a version of postgresql that was easy to switch at runtime between two behaviors. That would make it a LOT easier to run a few scripts and make sure I got the correct binary without having to munge PREFIX and a bunch of PATH and other tools to get my test harness to DTRT. I'm sure one of the hackers would be happy to provide you with a patch to help you with your testing. That would be fine. That's quite a different thing from asking for a GUC to be provided and then supported over the next 5 years as part of the core release, which is what I believe we all thought you were asking for. I did not know that GUCs were not classified into experimental/non-experimental. The fact that a single GUC would need to be supported for 5 years is definitely something to consider. Now I understand the push back a little more. -Alfred
[HACKERS] PGCON meetup FreeNAS/FreeBSD: In Ottawa Tue Wed.
Hello PostgreSQL Hackers, I am now in Ottawa; last week we wrapped up the BSDCon, and I was hoping to chat with a few PostgreSQL developers in person about using PostgreSQL in FreeNAS and offering it as an extension to the platform as a plug-in technology. Unfortunately, due to time constraints I cannot attend the entire conference, and I am only in town until Wednesday at noon. I'm hoping there's a good time to talk to a few developers about PostgreSQL + FreeNAS before I have to depart back to the bay area. Some info on me: My name is Alfred Perlstein, I am a FreeBSD developer and FreeNAS project lead. I am the VP of Software Engineering at iXsystems. I have been a fan of PostgreSQL for many years. In the early 2000s we built a high-speed web tracking application on top of PostgreSQL and worked closely with the community to shake out performance issues and bugs, so closely that Tom Lane and Vadim Mikheev had logins on our box. Since that time I have tried to get PostgreSQL into as many places as possible. Some info on the topics I wanted to briefly discuss: 1) Using PostgreSQL as the config store for FreeNAS. We currently use SQLite; SQLite fits our needs until we get to the point of replication between HA (high availability) units. Then we are forced to manually sync data between configurations. A discussion on how we might do this better using PostgreSQL, while still maintaining our ease of config export (single file) and small footprint, would be interesting. 2) PostgreSQL plugin for FreeNAS. Flip a switch and suddenly your file server is also serving enterprise data. We currently have a plug-in architecture, but would like to discuss the possibility of a tighter integration so that PostgreSQL looks like a more cohesive addition to FreeNAS. 3) Statistic monitoring / EagleEye. In FreeBSD/FreeNAS I have developed a system called EagleEye.
EagleEye is a system where all mibs are easily exportable with timestamps in a common format (for now a YAML-modified CSV), which is then consumed by a utility that provides graphs. The entire point of EagleEye is to eventually upstream the modifications to future-proof statistics tracking in the FreeBSD and FreeNAS systems. I have spoken with some Illumos/ZFS developers and they are interested as well. I think that is all I have; please drop me a note if you'll have some time in Ottawa today, tomorrow, or early Wednesday. I'd love to discuss and buy some beers for the group. thank you, -Alfred Perlstein VP Software Engineering, iXsystems.
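The exact EagleEye export format (a "YAML-modified CSV") is not spelled out in the message, so as a rough illustration only, assume rows of timestamp, mib name, and value; a minimal exporter of that shape might look like this (all names here are hypothetical, not EagleEye code):

```python
import csv
import io
import time

def export_stats(samples, stream):
    """Write stat samples as timestamped CSV rows a grapher could consume.

    'samples' is an iterable of dicts like
    {"ts": 0, "name": "kern.openfiles", "value": 100}; if "ts" is absent,
    the current time is used. This is an approximation of the idea, not
    the real EagleEye format.
    """
    writer = csv.writer(stream)
    writer.writerow(["timestamp", "mib", "value"])
    for s in samples:
        writer.writerow([s.get("ts", time.time()), s["name"], s["value"]])

buf = io.StringIO()
export_stats([{"ts": 0, "name": "kern.openfiles", "value": 100}], buf)
print(buf.getvalue().splitlines()[1])  # → 0,kern.openfiles,100
```

A downstream utility can then plot each mib over time by grouping rows on the "mib" column, which matches the consume-and-graph pipeline described above.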
Re: [HACKERS] New Linux xfs/reiser file systems
* Bruce Momjian [EMAIL PROTECTED] [010502 14:01] wrote: I was talking to a Linux user yesterday, and he said that performance using the xfs file system is pretty bad. He believes it has to do with the fact that fsync() on log-based file systems requires more writes. With a standard BSD/ext2 file system, WAL writes can stay on the same cylinder to perform fsync. Is that true of log-based file systems? I know xfs and reiser are both log based. Do we need to be concerned about PostgreSQL performance on these file systems? I use BSD FFS with soft updates here, so it doesn't affect me. The problem with log-based filesystems is that they most likely do not know the consequences of a write, so an fsync on a file may require double writing, to both the log and the real portion of the disk. They can also exhibit the problem that an fsync may cause all pending writes to require scheduling, unless the log is constructed on the fly rather than incrementally. There was also the problem brought up recently that certain versions (maybe all?) of Linux perform fsync() in a very non-optimal manner; if the user is able to use the O_FSYNC option rather than fsync, he may see a performance increase. But his guess is probably nearly as good as mine. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]] http://www.egr.unlv.edu/~slumos/on-netbsd.html ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
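The two durability styles Alfred contrasts can be shown concretely. O_FSYNC is the BSD spelling; POSIX and Linux call it O_SYNC, which is what Python exposes. This sketch does not measure a journaling filesystem's double write; it only shows the difference between requesting durability as a separate syscall (fsync) versus per-write (O_SYNC):

```python
import os
import tempfile

def write_then_fsync(path, data):
    # Two syscalls per durable write: write(2), then fsync(2).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

def write_osync(path, data):
    # O_SYNC (O_FSYNC on BSD): each write(2) returns only after the data
    # has reached stable storage, so no separate fsync call is needed.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    a, b = os.path.join(d, "a"), os.path.join(d, "b")
    write_then_fsync(a, b"wal record")
    write_osync(b, b"wal record")
    assert open(a, "rb").read() == open(b, "rb").read() == b"wal record"
```

Either way the data ends up durable; the performance question raised above is which path the kernel and filesystem handle more efficiently, which is exactly what one would benchmark.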
Re: [HACKERS] New Linux xfs/reiser file systems
* Bruce Momjian [EMAIL PROTECTED] [010502 15:20] wrote: The problem with log based filesystems is that they most likely do not know the consequences of a write so an fsync on a file may require double writing to both the log and the real portion of the disk. They can also exhibit the problem that an fsync may cause all pending writes to require scheduling unless the log is constructed on the fly rather than incrementally. Yes, this double-writing is a problem. Suppose you have your WAL on a separate drive. You can fsync() WAL with zero head movement. With a log-based file system, you need two head movements, so you have gone from zero movements to two. It may be worse depending on how the filesystem actually does journalling. I wonder if an fsync() may cause ALL pending metadata to be updated (even metadata not related to the postgresql files). Do you know if reiser or xfs has this problem? -- -Alfred Perlstein - [[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
[HACKERS] COPY commands could use an enhancement.
It would be very helpful if the COPY command could be expanded to provide positional parameters. I noticed a while back that it doesn't, and it can really hurt someone when they happen to use pg_dump to move data from one database to another and they happened to create the fields in the tables in different orders. Basically: COPY webmaster FROM stdin; could become: COPY webmaster FIELDS id, name, ssn FROM stdin; this way when sourcing it would know where to place the fields. -- -Alfred Perlstein - [[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
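For what it's worth, later PostgreSQL releases grew essentially this feature in the form COPY tablename (col1, col2, ...) FROM stdin, with columns omitted from the list receiving their defaults. The requested mapping can be sketched outside the server; this Python helper is purely illustrative (not PostgreSQL code), showing positional values landing on the named columns while unnamed columns stay empty:

```python
def map_copy_row(table_columns, field_list, row_values):
    """Map positionally-supplied COPY values onto a table's columns.

    Columns named in field_list receive row values in order; columns not
    named are left empty (None), matching the missing-column behavior
    discussed in the thread. Purely an illustration of the proposal.
    """
    assigned = dict(zip(field_list, row_values))
    return {col: assigned.get(col) for col in table_columns}

# COPY webmaster FIELDS id, name, ssn FROM stdin;  (proposed syntax)
row = map_copy_row(
    table_columns=["id", "name", "ssn", "email", "created"],
    field_list=["id", "name", "ssn"],
    row_values=["7", "alfred", "000-00-0000"],
)
assert row == {"id": "7", "name": "alfred", "ssn": "000-00-0000",
               "email": None, "created": None}
```

The point is that the field list makes a dump order-independent: pg_dump data loads correctly even when the target table declared its columns in a different order.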
Re: [HACKERS] COPY commands could use an enhancement.
* Tom Lane [EMAIL PROTECTED] [010430 08:37] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: It would be very helpful if the COPY command could be expanded in order to provide positional parameters. I think it's a bad idea to try to expand COPY into a full-tilt data import/conversion utility, which is the direction that this sort of suggestion is headed in. COPY is designed as a simple, fast, reliable, low-overhead data transfer mechanism for backup and restore. The more warts we add to it, the less well it will serve that purpose. Honestly, it would be hard for COPY to serve people's needs any less than it does now; it really makes sense for it to be able to parse positional parameters, for both speed and correctness. Example: if we allow selective column import, what do we do with missing columns? What is done already: if you initiate a copy into a 5-column table using only 4 columns of copy data, the fifth is left empty. Must COPY now be able to handle insertion of default-value expressions? No, COPY should stay what it is: simple, but at the same time useful enough for bulk transfer without painful contortions and fear of modifying tables. -- -Alfred Perlstein - [[EMAIL PROTECTED]] Represent yourself, show up at BABUG http://www.babug.org/
Re: [HACKERS] Re: SAP-DB
* Bruce Momjian [EMAIL PROTECTED] [010429 10:44] wrote: I swore I'd never post to the hackers list again, but this is an amazing statement by Bruce. Boy, the robustness of the software is determined by the number of characters in the directory name? By the languages used? [Snip] My guess is that Bruce was implying that the code was obfuscated. It is a common trick for closed source to be open but not really. I don't think it was any sort of technology snobbery. Far be it from me to suggest an explanation to the words of others; that is just how I read it. I don't think they intentionally confused the code. The real problem I see is that it was very hard for me to find anything in the code. I would be interested to see if others can find stuff. I think this is a general problem in a lot of projects: you open up foo.c and say... what the heck is this... and after a few hours of studying the source you finally figure out it is something that does a minuscule part X of massive part Y, and by then you're too engrossed to write a little banner for the file or dir explaining what it's for, and you incorrectly assume that even if you did, it wouldn't help that user unless he went through the same painful steps that you did. Been there, done that.. er, actually, still there, mostly still doing that. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]] http://www.egr.unlv.edu/~slumos/on-netbsd.html
Re: [HACKERS] Thanks, naming conventions, and count()
* Bruce Momjian [EMAIL PROTECTED] [010429 20:14] wrote: Yes, I like that idea, but the problem is that it is hard to update just one table in the file. You sort of have to update the entire file each time a table changes. That is why I liked symlinks, because they are per-table, but you are right that the symlink creation could fail because the new table file was never created or something, leaving the symlink pointing to nothing. Not sure how to address this. Is there a way to update a flat file when a single table changes? Sort of, if that flat file is in the form of:

123456;tablename
33;another_table

i.e., each line is a fixed length. -- -Alfred Perlstein - [[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
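Alfred's fixed-length-line idea can be sketched in C. This is only an illustration of the technique, not anything from the thread; the `update_record` name and the 64-byte record size are my own assumptions. Because every record occupies the same number of bytes, a single table's entry can be overwritten in place with one pwrite() and no other record moves.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define REC_LEN 64  /* every line padded to this length, newline included */

/* Overwrite record number 'slot' in place.  Because every record is the
 * same length, nothing else in the file moves, so one table's entry can
 * be rewritten with a single pwrite() instead of rewriting the file. */
static int
update_record(int fd, int slot, long relfilenode, const char *name)
{
    char rec[REC_LEN + 1];
    int n = snprintf(rec, sizeof rec, "%ld;%s", relfilenode, name);

    if (n < 0 || n >= REC_LEN)
        return -1;                      /* record would not fit */
    memset(rec + n, ' ', REC_LEN - n);  /* space-pad to fixed width */
    rec[REC_LEN - 1] = '\n';
    return pwrite(fd, rec, REC_LEN, (off_t)slot * REC_LEN) == REC_LEN ? 0 : -1;
}
```

The trade-off is wasted padding bytes per record in exchange for O(1) updates; variable-length lines would force rewriting everything after the changed entry.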
Re: [HACKERS] Thanks, naming conventions, and count()
* Tom Lane [EMAIL PROTECTED] [010429 23:12] wrote: Bruce Momjian [EMAIL PROTECTED] writes: big problem is that there is no good way to make the symlinks reliable because in a crash, the symlink could point to a table creation that got rolled back or the renaming of a table that got rolled back. Yes. Have you already forgotten the very long discussion we had about this some months back? There is no way to provide a reliable symlink mapping without re-introducing all the same problems that we went to numeric filenames to avoid. Now if you want an *UNRELIABLE* symlink mapping, maybe we could talk about it ... but IMHO such a feature would be worse than useless. Murphy's law says that the symlinks would be right often enough to mislead dbadmins into trusting them, and wrong exactly when it would do the most damage to trust them. The same goes for other methods of unreliably exporting the name-to-number mapping, such as dumping it into a flat file. We do need to document how to get the mapping (ie, select relfilenode, relname from pg_class). But I really doubt that an automated method for exporting the mapping would be worth the cycles it would cost, even if it could be made reliable which it can't. Perhaps an external tool could rebuild the symlink state, to be run on an offline database. But I'm sure you have more important things to do. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
Re: [HACKERS] 7.1 vacuum
* Magnus Naeslund(f) [EMAIL PROTECTED] [010426 21:17] wrote: How does 7.1 work now with the vacuum and all? Does it go for indexes by default, even when I haven't run a vacuum at all? Does vacuum lock up postgres? It says the analyze part shouldn't, but how's that for all of the vacuum? On a 7.0.3 db we have here we are forced to run vacuum every hour to get an acceptable speed, and while doing that vacuum (5-10 minutes) it totally blocks our application that's mucking with the db. http://people.freebsd.org/~alfred/vacfix/ -- -Alfred Perlstein - [[EMAIL PROTECTED]] Instead of asking why a piece of software is using 1970s technology, start asking why software is ignoring 30 years of accumulated wisdom.
[HACKERS] Re: 7.1 vacuum
* mlw [EMAIL PROTECTED] [010427 05:50] wrote: Alfred Perlstein wrote: * Magnus Naeslund(f) [EMAIL PROTECTED] [010426 21:17] wrote: How does 7.1 work now with the vacuum and all? Does it go for indexes by default, even when I haven't run a vacuum at all? Does vacuum lock up postgres? It says the analyze part shouldn't, but how's that for all of the vacuum? On a 7.0.3 db we have here we are forced to run vacuum every hour to get an acceptable speed, and while doing that vacuum (5-10 minutes) it totally blocks our application that's mucking with the db. http://people.freebsd.org/~alfred/vacfix/ What's the deal with vacuum lazy in 7.1? I was looking forward to it. It was never clear whether or not you guys decided to put it in. If it is in as a feature, how does one use it? If it is a patch, how does one get it? If you actually download and read the enclosed READMEs it's pretty clear. If it is neither a patch nor an existing feature, has development stopped? I have no idea; I haven't been tracking postgresql all that much since leaving the place where we contracted that work. -- -Alfred Perlstein - [[EMAIL PROTECTED]] Represent yourself, show up at BABUG http://www.babug.org/
Re: [HACKERS] CVS tags for betas and release candidate
* The Hermit Hacker [EMAIL PROTECTED] [010327 04:53] wrote: On Mon, 26 Mar 2001, Matthias Juchem wrote: Hi there. I was just looking for the CVS tags for downloading the beta6 and the RC1 of 7.1 but there are only the following tags: REL_7_1_BETA2 REL_7_1_BETA3 REL_7_1 Aren't there tags for the versions I am looking for? Nope ... doing the tags didn't work as well as was hoped, so we've just been using date ranges instead ... release itself will be tag'd ... You know you can nuke tags, right? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
Re: [HACKERS] pgindent run?
* Bruce Momjian [EMAIL PROTECTED] [010321 21:14] wrote: The Hermit Hacker [EMAIL PROTECTED] writes: and most times, those have to be merged into the source tree due to extensive changes anyway ... maybe we should just get rid of the use of pgindent altogether? I think pgindent is a good thing; the style of different parts of the code would vary too much without it. I'm only unhappy about the risk issues of running it at this late stage of the release cycle. This is the usual discussion. Some like it, some don't like the risk, some don't like the timing. I don't think we ever came up with a better time than before RC, though I think we could do it a little earlier in beta if people were not holding patches during that period. It is the beta patching folks that we have the most control over. It seems that you guys are dead set on using this pgindent tool; this is cool, we'd probably use some indentation tool on the FreeBSD sources if there was one that met our code style(9) guidelines. With that said, it really scares the crud out of me to see those massive pgindent runs right before you guys do a release. It would make a lot more sense to force a pgindent run after applying each patch. This way you don't lose the history. You want to be upset with yourself, Bruce? Go into a directory and type: cvs annotate any file that's been pgindented cvs annotate is a really, really handy tool; unfortunately these indent runs remove this very useful tool as well as do a major job of obfuscating the code changes. It's not like you guys have a massive devel team with new people each week that have a steep committer learning curve ahead of them, so running pgindent as patches are applied should work. There's also the argument that a developer's pgindent may force a contributor to resolve conflicts; while this is true, it's also true that you guys expect diffs to be in context format, comments to be in English, function prototypes to be new style, etc, etc..
I think contributors can deal with this. Just my usual 20 cents. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
Re: [HACKERS] Fw: [vorbis-dev] ogg123: shared memory by mmap()
* Bruce Momjian [EMAIL PROTECTED] [010320 14:10] wrote: The patch below adds: - acinclude.m4: A new macro A_FUNC_SMMAP to check that sharing pages through mmap() works. This is taken from Joerg Schilling's star. - configure.in: A_FUNC_SMMAP - ogg123/buffer.c: If we have a working mmap(), use it to create a region of shared memory instead of using System V IPC. Works on BSD. Should also work on SVR4 and offspring (Solaris), and Linux. This is a really bad idea performance-wise. Solaris has a special code path for SYSV shared memory that doesn't require tons of swap tracking structures per-page/per-process. FreeBSD also has this optimization (it's off by default, but should work since FreeBSD 4.2 via the sysctl kern.ipc.shm_use_phys=1). Both OS's use a trick of making the pages non-pageable; this allows significant savings in kernel space required for each attached process, as well as the use of large pages which reduce the amount of TLB faults your processes will incur. That is interesting. BSDi has SysV shared memory as non-pageable, and I always thought of that as a bug. Seems you are saying that having it pageable has a significant performance penalty. Interesting. Yes, having it pageable is actually sort of bad. It doesn't allow you to do several important optimizations. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Final Call: RC1 about to go out the door ...
* Tom Lane [EMAIL PROTECTED] [010320 10:21] wrote: The Hermit Hacker [EMAIL PROTECTED] writes: Speak now, or forever hold your peace (where forever is the time between now and RC1 is packaged) ... I rather hope it's *NOT*. And still no LAZY vacuum. *sigh* -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Fw: [vorbis-dev] ogg123: shared memory by mmap()
WOOT WOOT! DANGER WILL ROBINSON! - Original Message - From: "Christian Weisgerber" [EMAIL PROTECTED] Newsgroups: list.vorbis.dev To: [EMAIL PROTECTED] Sent: Saturday, March 17, 2001 12:01 PM Subject: [vorbis-dev] ogg123: shared memory by mmap() The patch below adds: - acinclude.m4: A new macro A_FUNC_SMMAP to check that sharing pages through mmap() works. This is taken from Joerg Schilling's star. - configure.in: A_FUNC_SMMAP - ogg123/buffer.c: If we have a working mmap(), use it to create a region of shared memory instead of using System V IPC. Works on BSD. Should also work on SVR4 and offspring (Solaris), and Linux. This is a really bad idea performance-wise. Solaris has a special code path for SYSV shared memory that doesn't require tons of swap tracking structures per-page/per-process. FreeBSD also has this optimization (it's off by default, but should work since FreeBSD 4.2 via the sysctl kern.ipc.shm_use_phys=1). Both OS's use a trick of making the pages non-pageable; this allows significant savings in kernel space required for each attached process, as well as the use of large pages which reduce the amount of TLB faults your processes will incur. Anyhow, if you could make this a runtime option it wouldn't be so evil, but as a compile-time option, it's a really bad idea for Solaris and FreeBSD. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] ODBC/FreeBSD/LinuxEmulation/RPM?
* Larry Rosenman [EMAIL PROTECTED] [010319 10:35] wrote: Is there any way to get just the ODBC RPM to install withOUT installing the whole DB? I have a strange situation: StarOffice 5.2 (Linux) Running under FreeBSD Linux Emulation PG running NATIVE. I want the two to talk, using ODBC. How do I make this happen?

rpm2cpio pg_rpmfile.rpm > pg_rpmfile.cpio
cpio -i < pg_rpmfile.cpio
tar xzvf pg_rpmfile.tgz

-- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] ODBC/FreeBSD/LinuxEmulation/RPM?
* Alfred Perlstein [EMAIL PROTECTED] [010319 11:27] wrote: * Larry Rosenman [EMAIL PROTECTED] [010319 10:35] wrote: Is there any way to get just the ODBC RPM to install withOUT installing the whole DB? I have a strange situation: StarOffice 5.2 (Linux) Running under FreeBSD Linux Emulation PG running NATIVE. I want the two to talk, using ODBC. How do I make this happen?

rpm2cpio pg_rpmfile.rpm > pg_rpmfile.cpio
cpio -i < pg_rpmfile.cpio
tar xzvf pg_rpmfile.tgz

Sorry, I was just waking up when I wrote this... the idea is to extract the rpm then just grab the required ODBC files. Best of luck, -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* William K. Volkman [EMAIL PROTECTED] [010318 11:56] wrote: The Hermit Hacker wrote: But, with shared libraries, are you really pulling in a "whole thread-support library"? My understanding of shared libraries (altho it may be totally off) was that instead of pulling in a whole library, you pulled in the bits that you needed, pretty much as you needed them ... Just by making a thread call, libc changes personality to use thread-safe routines (i.e. add mutex locking). Use one thread feature, get the whole set...which may not be that bad. Actually it can be pretty bad. Locked bus cycles needed for mutex operations are very, very expensive, not something you want to do unless you really really need to do it. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Larry Rosenman [EMAIL PROTECTED] [010318 14:17] wrote: * Tom Lane [EMAIL PROTECTED] [010318 14:55]: Alfred Perlstein [EMAIL PROTECTED] writes: Just by making a thread call, libc changes personality to use thread-safe routines (i.e. add mutex locking). Use one thread feature, get the whole set...which may not be that bad. Actually it can be pretty bad. Locked bus cycles needed for mutex operations are very, very expensive, not something you want to do unless you really really need to do it. It'd be interesting to try to get some numbers about the actual cost of using a thread-aware libc, on platforms where there's a difference. Shouldn't be that hard to build a postgres executable with the proper library and run some benchmarks ... anyone care to try? I can get the code compiled, but don't have the skills to generate a test case worthy of anything. There's a 'make test' or something ('regression', maybe?) target that runs a suite of tests on the database; you could use that as a bench/timer, or you could also try mysql's "crashme" script. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Xu Yifeng [EMAIL PROTECTED] [010316 01:15] wrote: Hello Alfred, Friday, March 16, 2001, 3:21:09 PM, you wrote: AP * Xu Yifeng [EMAIL PROTECTED] [010315 22:25] wrote: Could anyone consider forking a syncer process to sync data to disk? Build a shared sync queue; when a daemon process wants to sync after write() is called, it just puts a sync request on the queue. This can release the process from being blocked on writing as soon as possible. Multiple sync requests for one file can be merged when the request is being inserted into the queue. AP I suggested this about a year ago. :) AP The problem is that you need that process to potentially open and close AP many files over and over. AP I still think it's somewhat of a good idea. I am not a DBMS guru. Hah, same here. :) Couldn't the syncer process cache opened files? Is there any problem I didn't consider? 1) IPC latency: the amount of time it takes to call fsync will increase by at least two context switches. 2) a working set (number of files needed to be fsync'd) that is larger than the amount of files you wish to keep open. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: Re[2]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Bruce Momjian [EMAIL PROTECTED] [010316 07:11] wrote: Could anyone consider forking a syncer process to sync data to disk? Build a shared sync queue; when a daemon process wants to sync after write() is called, it just puts a sync request on the queue. This can release the process from being blocked on writing as soon as possible. Multiple sync requests for one file can be merged when the request is being inserted into the queue. I suggested this about a year ago. :) The problem is that you need that process to potentially open and close many files over and over. I still think it's somewhat of a good idea. I like the idea too, but people want the transaction to return COMMIT only after data has been fsync'ed so I don't see a big win. This isn't simply handing off the sync to this other process; it requires an ack from the syncer before returning 'COMMIT'. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Tom Lane [EMAIL PROTECTED] [010316 08:16] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: Couldn't the syncer process cache opened files? Is there any problem I didn't consider? 1) IPC latency: the amount of time it takes to call fsync will increase by at least two context switches. 2) a working set (number of files needed to be fsync'd) that is larger than the amount of files you wish to keep open. These days we're really only interested in fsync'ing the current WAL log file, so working set doesn't seem like a problem anymore. However context-switch latency is likely to be a big problem. One thing we'd definitely need before considering this is to replace the existing spinlock mechanism with something more efficient. What sort of problems are you seeing with the spinlock code? Vadim has designed the WAL stuff in such a way that a separate writer/syncer process would be easy to add; in fact it's almost that way already, in that any backend can write or sync data that's been added to the queue by any other backend. The question is whether it'd actually buy anything to have another process. Good stuff to experiment with for 7.2. The delayed/coalesced fsync looked interesting. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Tom Lane [EMAIL PROTECTED] [010315 09:35] wrote: BTW, are there any platforms where O_DSYNC exists but has a different spelling? Yes, FreeBSD only has O_FSYNC; it doesn't have O_SYNC or O_DSYNC. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
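The spelling difference Alfred describes is exactly the kind of thing a compile-time probe handles. A minimal sketch, assuming nothing beyond standard fcntl.h (the `wal_sync_open_flag` name is mine, not PostgreSQL's):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Pick whichever synchronous-write open(2) flag this platform spells out.
 * POSIX names O_DSYNC and O_SYNC; FreeBSD of this era spells it O_FSYNC.
 * Returns 0 when none exists, meaning the caller must fall back to
 * calling fsync() after each write instead. */
static int
wal_sync_open_flag(void)
{
#if defined(O_DSYNC)
    return O_DSYNC;
#elif defined(O_FSYNC)
    return O_FSYNC;
#elif defined(O_SYNC)
    return O_SYNC;
#else
    return 0;
#endif
}
```

The returned value can be OR'd directly into the flags of the open() that creates the log file; when it is 0 the open degenerates to a normal one.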
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Tom Lane [EMAIL PROTECTED] [010315 11:07] wrote: "Mikheev, Vadim" [EMAIL PROTECTED] writes: ... I would either use fsync as default or don't deal with O_SYNC at all. But if O_DSYNC is defined and O_DSYNC != O_SYNC then we should use O_DSYNC by default. Hm. We could do that reasonably painlessly as a compile-time test in xlog.c, but I'm not clear on how it would play out as a GUC option. Peter, what do you think about configuration-dependent defaults for GUC variables? Sorry, what's a GUC? :) -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Tom Lane [EMAIL PROTECTED] [010315 11:45] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: And since we're sorta on the topic of IO, I noticed that it looks like (at least in 7.0.3) that vacuum and certain other routines read files in reverse order. Vacuum does that because it's trying to push tuples down from the end into free space in earlier blocks. I don't see much way around that (nor any good reason to think that it's a critical part of vacuum's performance anyway). Where else have you seen such behavior? Just vacuum, but the source is large, and I'm sort of lacking on database-foo so I guessed that it may be done elsewhere. You can optimize this out by implementing the read-behind yourselves, sorta like this:

struct sglist *
read(fd, len)
{
	if (fd.lastpos - fd.curpos <= THRESHOLD) {
		fd.curpos = fd.lastpos - THRESHOLD;
		len = THRESHOLD;
	}
	return (do_read(fd, len));
}

of course this is entirely wrong, but illustrates what would/could help. I would fix FreeBSD, but it's sort of a mess and beyond what I've got time to do ATM. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Mikheev, Vadim [EMAIL PROTECTED] [010315 13:52] wrote: I believe that we don't know enough yet to nail down a hard-wired decision. Vadim's idea of preferring O_DSYNC if it appears to be different from O_SYNC is a good first cut, but I think we'd better make it possible to override that, at least for testing purposes. So let's leave fsync as default and add option to open log files with O_DSYNC/O_SYNC. I have a weird and untested suggestion: How many files need to be fsync'd? If it's more than one, what might work is using mmap() to map the files in adjacent areas, then calling msync() on the entire range; this would allow you to batch-fsync the data. The only problem is that I'm not sure: 1) how portable msync() is. 2) if msync guarantees metadata consistency. Another benefit of mmap() is the 'zero copy' nature of it. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
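A minimal sketch of the mmap()/msync() flush Alfred floats, for a single pre-sized file (the `mmap_write_sync` name and the single-file simplification are mine). One msync() call covers every page dirtied in the range, which is the batching he is after; and, as he suspects, msync() only promises the data pages, not file metadata, which is why the file must already be at its final size:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write through a shared file mapping and flush the dirtied range with a
 * single msync().  Writes scattered across many pages of the file can be
 * pushed out by one call.  msync() covers data pages only, so the file
 * must be pre-sized (e.g. with ftruncate) before mapping. */
static int
mmap_write_sync(int fd, off_t filesize, off_t off, const void *buf, size_t len)
{
    char *base = mmap(NULL, filesize, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    int rc;

    if (base == MAP_FAILED)
        return -1;
    memcpy(base + off, buf, len);        /* 'zero copy' into the page cache */
    rc = msync(base, filesize, MS_SYNC); /* block until pages are written */
    munmap(base, filesize);
    return rc;
}
```

Mapping several log files adjacently, as the mail imagines, would additionally require the fixed-address trick discussed in the follow-up message.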
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Tom Lane [EMAIL PROTECTED] [010315 14:54] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: How many files need to be fsync'd? Only one. If it's more than one, what might work is using mmap() to map the files in adjacent areas, then calling msync() on the entire range; this would allow you to batch-fsync the data. Interesting thought, but mmap to a prespecified address is most definitely not portable, whether or not you want to assume that plain mmap is ... Yeah... :( Evil thought though (for reference):

mmap(anon memory) returns addr1
addr2 = addr1 + maplen
split addr1-addr2 on points A, B and C
mmap(file1 over addr1 to A)
mmap(file2 over A to B)
mmap(file3 over B to C)
mmap(file4 over C to addr2)

It _should_ work, but there are probably some corner cases where it doesn't. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Performance monitor signal handler
* Philip Warner [EMAIL PROTECTED] [010315 16:14] wrote: At 06:57 15/03/01 -0500, Jan Wieck wrote: And shared memory has all the interlocking problems we want to avoid. I suspect that if we keep per-backend data in a separate area, then we don't need locking since there is only one writer. It does not matter if a reader gets an inconsistent view, the same as if you drop a few UDP packets. No, this is completely different. Lost data is probably better than incorrect data. Either use locks or a copying mechanism. People will depend on the data returned making sense. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Performance monitor signal handler
* Philip Warner [EMAIL PROTECTED] [010315 16:46] wrote: At 16:17 15/03/01 -0800, Alfred Perlstein wrote: Lost data is probably better than incorrect data. Either use locks or a copying mechanism. People will depend on the data returned making sense. But with per-backend data, there is only ever *one* writer to a given set of counters. Everyone else is a reader. This doesn't prevent a reader from getting an inconsistent view. Think about a 64-bit counter on a 32-bit machine. If you charged per megabyte, wouldn't it upset you to have a small chance of losing 4 billion units of sale? (i.e., doing a read after an addition that wraps the low 32 bits but before the carry is done to the most significant 32 bits?) Ok, what if everything can be read atomically by itself? You're still busted the minute you need to export any sort of compound stat. If A, B and C need to add up to 100 you have a read race. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
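The torn-read hazard Alfred describes can be computed exactly. A 64-bit counter on a 32-bit machine lives in two words, so a reader can load the freshly wrapped low word together with a stale high word. The sketch below (names are mine) replays what such a reader would observe if it sampled after the low-word store but before the carry reaches the high word:

```c
#include <assert.h>
#include <stdint.h>

/* Value a reader sees if it samples a two-word counter after the low
 * 32 bits were updated but before the carry propagated to the high
 * 32 bits.  When the addition wraps the low word, the observed value
 * is 2^32 less than the truth -- the "4 billion units" in the mail. */
static uint64_t
torn_value(uint64_t before, uint32_t add)
{
    uint32_t lo = (uint32_t)before + add;    /* new low word (may wrap)   */
    uint32_t hi = (uint32_t)(before >> 32);  /* high word not updated yet */

    return ((uint64_t)hi << 32) | lo;
}
```

When no wrap occurs the torn read happens to be correct, which is exactly why the bug is rare enough to mislead: it only bites on the additions that carry.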
Re: [HACKERS] Performance monitor signal handler
* Philip Warner [EMAIL PROTECTED] [010315 17:08] wrote: At 16:55 15/03/01 -0800, Alfred Perlstein wrote: * Philip Warner [EMAIL PROTECTED] [010315 16:46] wrote: At 16:17 15/03/01 -0800, Alfred Perlstein wrote: Lost data is probably better than incorrect data. Either use locks or a copying mechanism. People will depend on the data returned making sense. But with per-backend data, there is only ever *one* writer to a given set of counters. Everyone else is a reader. This doesn't prevent a reader from getting an inconsistent view. Think about a 64-bit counter on a 32-bit machine. If you charged per megabyte, wouldn't it upset you to have a small chance of losing 4 billion units of sale? (i.e., doing a read after an addition that wraps the low 32 bits but before the carry is done to the most significant 32 bits?) I assume this means we cannot rely on the existence of any kind of interlocked add on 64-bit machines? Ok, what if everything can be read atomically by itself? You're still busted the minute you need to export any sort of compound stat. Which is why the backends should not do anything other than maintain the raw data. If there is atomic data that can cause inconsistency, then a dropped UDP packet will do the same. The UDP packet (a COPY) can contain a consistent snapshot of the data. If you have dependencies, you fit a consistent snapshot into a single packet. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: Re[2]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Xu Yifeng [EMAIL PROTECTED] [010315 22:25] wrote: Hello Tom, Friday, March 16, 2001, 6:54:22 AM, you wrote: TL Alfred Perlstein [EMAIL PROTECTED] writes: How many files need to be fsync'd? TL Only one. If it's more than one, what might work is using mmap() to map the files in adjacent areas, then calling msync() on the entire range; this would allow you to batch-fsync the data. TL Interesting thought, but mmap to a prespecified address is most TL definitely not portable, whether or not you want to assume that TL plain mmap is ... TL regards, tom lane Could anyone consider forking a syncer process to sync data to disk? Build a shared sync queue; when a daemon process wants to sync after write() is called, it just puts a sync request on the queue. This can release the process from being blocked on writing as soon as possible. Multiple sync requests for one file can be merged when the request is being inserted into the queue. I suggested this about a year ago. :) The problem is that you need that process to potentially open and close many files over and over. I still think it's somewhat of a good idea. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Performance monitor signal handler
* Philip Warner [EMAIL PROTECTED] [010312 18:56] wrote: At 13:34 12/03/01 -0800, Alfred Perlstein wrote: Is it possible to have a spinlock over it so that an external utility can take a snapshot of it with the spinlock held? I'd suggest that locking the stats area might be a bad idea; there is only one writer for each backend-specific chunk, and it won't matter a hell of a lot if a reader gets inconsistent views (since I assume they will be re-reading every second or so). All the stats area should contain would be a bunch of counters with timestamps, I think, and the cost of writing to it should be kept to an absolute minimum. just some ideas.. Unfortunately, based on prior discussions, Bruce seems quite opposed to a shared memory solution. Ok, here's another nifty idea. On receipt of the info signal, the backends collaborate to piece together a status file. The status file is given a temporary name. When complete, the status file is rename(2)'d over a well-known file. This ought to always give a consistent snapshot of the file to whoever opens it. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
Re: [HACKERS] Performance monitor signal handler
* Philip Warner [EMAIL PROTECTED] [010313 06:42] wrote: This ought to always give a consistent snapshot of the file to whoever opens it. I think Tom has previously stated that there are technical reasons not to do IO in signal handlers, and I have philosophical problems with performance monitors that ask 50 backends to do file IO. I really do think shared memory is TWTG. I wasn't really suggesting any of those courses of action; all I suggested was using rename(2) to give a separate application a consistent snapshot of the stats. Actually, what makes the most sense (although it may be a performance killer) is to have the backends update a system table that the external app can query. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
Re: [HACKERS] WAL SHM principles
* Matthew Kirkwood [EMAIL PROTECTED] [010313 13:12] wrote: On Tue, 13 Mar 2001, Ken Hirsch wrote: mlock() guarantees that the locked address space is in memory. This doesn't imply that updates are not written to the backing file. I've wondered about this myself. It _is_ true on Linux that mlock prevents writes to the backing store, I don't believe that this is true. The manpage offers no such promises, and the semantics are not useful. Afaik FreeBSD's Linux emulator:

revision 1.13
date: 2001/02/28 04:30:27; author: dillon; state: Exp; lines: +3 -1
Linux does not filesystem-sync file-backed writable mmap pages on a regular basis. Adjust our linux emulation to conform. This will cause more dirty pages to be left for the pagedaemon to deal with, but our new low-memory handling code can deal with it. The linux way appears to be a trend, and we may very well make MAP_NOSYNC the default for FreeBSD as well (once we have reasonable sequential write-behind heuristics for random faults). (will be MFC'd prior to 4.3 freeze)
Suggested by: Andrew Gallatin

Basically any mmap'd data doesn't seem to get sync()'d out on a regular basis. and this is used as a security feature for cryptography software. mlock() is used to prevent pages being swapped out. Its use for crypto software is essentially restricted to anon memory (allocated via brk() or mmap() of /dev/zero). What about userland device drivers that want to send parts of a disk-backed file to a driver's DMA routine? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
[HACKERS] Re: Performance monitor signal handler
* Thomas Swan [EMAIL PROTECTED] [010313 13:37] wrote: On receipt of the info signal, the backends collaborate to piece together a status file. The status file is given a temporary name. When complete, the status file is rename(2)'d over a well-known file. Reporting to files, particularly well-known ones, could lead to race conditions. All in all, I think you're better off passing messages through pipes or a similar communication method. I really liked the idea of a "server" that could parse/analyze data from multiple backends. My 2/100 worth... Take a few moments to think about the semantics of rename(2). Yes, you would still need synchronization between the backend processes to do this correctly, but not for any external app. The external app can just open the file; assuming it exists, it will always have a complete and consistent snapshot of whatever the backends agreed on. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
Re: [HACKERS] Performance monitor signal handler
* Bruce Momjian [EMAIL PROTECTED] [010312 12:12] wrote: I was going to implement the signal handler like we do with Cancel, where the signal sets a flag and we check the status of the flag in various _safe_ places. Can anyone think of a better way to get information out of a backend? Why not use a static area of the shared memory segment? Is it possible to have a spinlock over it so that an external utility can take a snapshot of it with the spinlock held? Also, this could work for other stuff as well: instead of overloading a lot of signal handlers, one could just periodically poll a region of the shared segment. Just some ideas... -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/
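The shared-memory-plus-spinlock idea amounts to something like the following sketch (this is an illustration of the proposal, not PostgreSQL's actual shmem layout; the struct and names are invented, and C11 atomics stand in for the real spinlock primitives): backends update counters under a spinlock, and an external monitor copies the region under the same lock instead of signalling anyone.

```c
#include <stdatomic.h>

/* Illustrative stats area living in the shared segment, guarded by a
 * spinlock so an external utility can take a consistent snapshot. */
typedef struct {
    atomic_flag lock;    /* spinlock protecting the counters below */
    long        queries; /* example counters */
    long        commits;
} shared_stats;

static void stats_lock(shared_stats *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;                /* spin; a real implementation would back off */
}

static void stats_unlock(shared_stats *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* A backend updates under the lock... */
void stats_bump_query(shared_stats *s)
{
    stats_lock(s);
    s->queries++;
    stats_unlock(s);
}

/* ...and an external monitor copies the counters under the same lock,
 * getting a consistent snapshot without sending any backend a signal. */
void stats_snapshot(shared_stats *s, long *queries, long *commits)
{
    stats_lock(s);
    *queries = s->queries;
    *commits = s->commits;
    stats_unlock(s);
}
```

The appeal over the signal-handler approach is that no backend has to do any work at report time; the cost is that a crashed process holding the spinlock can wedge readers, which is one reason to keep the critical sections tiny.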
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Tom Lane [EMAIL PROTECTED] [010306 10:10] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: I'm sure some sort of encoding of the PGDATA directory along with the pids stored in the shm segment... I thought about this too, but it strikes me as not very trustworthy. The problem is that there's no guarantee that the new postmaster will even notice the old shmem segment: it might select a different shmem key. (The 7.1 coding of shmem key selection makes this more likely than it used to be, but even under 7.0, it will certainly fail to work if I choose to start the new postmaster using a different port number than the old one had. The shmem key is driven primarily by port number, not data directory...) This seems like a mistake. I'm surprised you guys aren't just using some form of the FreeBSD ftok() algorithm for this: FTOK(3) FreeBSD Library Functions Manual ... The ftok() function attempts to create a unique key suitable for use with the msgget(3), semget(2) and shmget(2) functions given the path of an existing file and a user-selectable id. The specified path must specify an existing file that is accessible to the calling process or the call will fail. Also, note that links to files will return the same key, given the same id. BUGS The returned key is computed based on the device minor number and inode of the specified path in combination with the lower 8 bits of the given id. Thus it is quite possible for the routine to return duplicate keys. The "BUGS" section describes exactly what you guys are looking for: a somewhat reliable method of obtaining a system id. If that sounds evil, read below for an alternate suggestion. The interlock has to be tightly tied to the PGDATA directory, because what we're trying to protect is the files in and under that directory. It seems that something based on file(s) in that directory is the way to go.
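The ftok() suggestion is a one-liner in practice: derive the IPC key from the data directory itself, so the key follows PGDATA rather than the port number. A minimal sketch (the wrapper name and project id 'P' are made up for illustration):

```c
#include <sys/types.h>
#include <sys/ipc.h>

/* Derive a SysV IPC key from the data directory, as suggested above.
 * ftok() hashes the file's device/inode together with the low 8 bits
 * of the id, so the same PGDATA (or any link to it) yields the same
 * key, regardless of which port the postmaster listens on. */
key_t key_for_datadir(const char *pgdata)
{
    return ftok(pgdata, 'P'); /* returns (key_t)-1 if pgdata doesn't exist */
}
```

As the quoted BUGS section says, collisions are possible, so a real interlock would still have to cope with a key that happens to belong to someone else's segment.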
The best idea I've seen so far is Hiroshi's idea of having all the backends hold fcntl locks on the same file (probably postmaster.pid would do fine). Then the new postmaster can test whether any backends are still alive by trying to lock the old postmaster.pid file. Unfortunately, I read in the fcntl man page: Locks are not inherited by a child process in a fork(2) system call. This makes the idea much less attractive than I originally thought: a new backend would not automatically inherit a lock on the postmaster.pid file from the postmaster, but would have to open/lock it for itself. That means there's a window where the new backend exists but would be invisible to a hypothetical new postmaster. We could work around this with the following, very ugly protocol: 1. Postmaster normally maintains fcntl read lock on its postmaster.pid file. Each spawned backend immediately opens and read-locks postmaster.pid, too, and holds that file open until it dies. (Thus wasting a kernel FD per backend, which is one of the less attractive things about this.) If the backend is unable to obtain read lock on postmaster.pid, then it complains and dies. We must use read locks here so that all these processes can hold them separately. 2. If a newly started postmaster sees a pre-existing postmaster.pid file, it tries to obtain a *write* lock on that file. If it fails, conclude that an old postmaster or backend is still alive; complain and quit. If it succeeds, sit for say 1 second before deleting the file and creating a new one. (The delay here is to allow any just-started old backends to fail to acquire read lock and quit. A possible objection is that we have no way to guarantee 1 second is enough, though it ought to be plenty if the lock acquisition is just after the fork.) One thing that worries me a little bit is that this means an fcntl read-lock request will exist inside the kernel for each active backend. 
Does anyone know of any performance problems or hard kernel limits we might run into with large numbers of backends (lots and lots of fcntl locks)? At least the locks are on a file that we don't actually touch in the normal course of business. A small savings is that the backends don't actually need to open new FDs for the postmaster.pid file; they can use the one they inherit from the postmaster, even though they do need to lock it again. I'm not sure how much that saves inside the kernel, but at least something. There are also the usual set of concerns about portability of flock, though this time we're locking a plain file and not a socket, so it shouldn't be as much trouble as it was before. Comments? Does anyone see a better way to do it? Possibly... What about encoding the shm id in the pidfile? Then one can just ask how many processes are attached to that segment (if it doesn't exist, one can assume all backends have exited). You want the field 'shm_nattch'.
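The fcntl protocol described above boils down to two operations (a sketch under the stated assumptions; the helper names are invented, and the file would be postmaster.pid): every live process holds a shared read lock, and a newly starting postmaster probes with an exclusive write lock, which can only succeed once every read-lock holder has died — the kernel releases fcntl locks automatically on process exit, which is the whole point.

```c
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Backend/postmaster side: take a read (shared) lock on the whole file
 * and hold it for the life of the process. Returns 0 on success. */
int hold_read_lock(int fd)
{
    struct flock fl = {0};
    fl.l_type   = F_RDLCK;
    fl.l_whence = SEEK_SET; /* l_start = l_len = 0: lock the whole file */
    return fcntl(fd, F_SETLK, &fl);
}

/* New-postmaster side: return 1 if no old processes remain (the write
 * lock is obtainable), 0 if something still holds a read lock, and -1
 * on an unexpected error. */
int old_backends_gone(int fd)
{
    struct flock fl = {0};
    fl.l_type   = F_WRLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_SETLK, &fl) == 0) {
        fl.l_type = F_UNLCK;  /* release the probe lock again */
        fcntl(fd, F_SETLK, &fl);
        return 1;
    }
    return (errno == EACCES || errno == EAGAIN) ? 0 : -1;
}
```

The wrinkle Tom describes falls directly out of this sketch: since fcntl locks are not inherited across fork(), each freshly forked backend has a window before its own hold_read_lock() call during which the write-lock probe would wrongly succeed, hence the proposed one-second grace period.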
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Tom Lane [EMAIL PROTECTED] [010306 10:35] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: What about encoding the shm id in the pidfile? Then one can just ask how many processes are attached to that segment? (if it doesn't exist, one can assume all backends have exited) Hmm ... that might actually be a pretty good idea. A small problem is that the shm key isn't yet selected at the time we initially create the lockfile, but I can't think of any reason that we could not go back and append the key to the lockfile afterwards. you want the field 'shm_nattch' Are there any portability problems with relying on shm_nattch to be available? If not, I like this a lot... Well, it's available on FreeBSD and Solaris; I'm sure Redhat has some daemon that resets the value to 0 periodically just for kicks, so it might not be viable... :) Seriously, there's some dispute on the type of 'shm_nattch': under Solaris it's "shmatt_t" (unsigned long afaik), under FreeBSD it's 'short' (I should fix this. :)). But since you're really only testing for 0'ness it shouldn't really be a problem. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
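The shm_nattch check being proposed looks roughly like this (the wrapper is illustrative; the real code would read the shm id out of postmaster.pid first):

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Given a shm id recorded in the pidfile, report how many processes are
 * still attached. Returns the attach count, or -1 if the segment can't
 * be stat'd (e.g. EINVAL because it's gone, meaning no old backends). */
long segment_attach_count(int shmid)
{
    struct shmid_ds ds;
    if (shmctl(shmid, IPC_STAT, &ds) != 0)
        return -1;
    /* shm_nattch is shmatt_t on Solaris and short on old FreeBSD, but
     * since we only care whether it is zero, the width doesn't matter. */
    return (long)ds.shm_nattch;
}
```

A new postmaster would refuse to start whenever this returns a positive count for the old segment, which is exactly the "are any old backends still alive" question the fcntl scheme answers by other means.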
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Tom Lane [EMAIL PROTECTED] [010306 11:03] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: Are there any portability problems with relying on shm_nattch to be available? If not, I like this a lot... Well, it's available on FreeBSD and Solaris; I'm sure Redhat has some daemon that resets the value to 0 periodically just for kicks, so it might not be viable... :) I notice that our BeOS and QNX emulations of shmctl() don't support IPC_STAT, but that could be dealt with, at least to the extent of stubbing it out. Well, since we already have spinlocks, I can't see why we can't keep the refcount and spinlock in a special place in the shm for all cases. This does raise the question of what to do if shmctl(IPC_STAT) fails for a reason other than EINVAL. I think the conservative thing to do is refuse to start up. On EPERM, for example, it's possible that there is a postmaster running in your PGDATA but with a different userid. Yes, if possible a more meaningful error message and pointer to some docco would be nice, or even a nice "I don't care, I killed all the backends, just start darnit" flag; it's really no fun at all to have to attempt to decipher some cryptic error message at 3am when the database/system is acting up. :) Seriously, there's some dispute on the type of 'shm_nattch': under Solaris it's "shmatt_t" (unsigned long afaik), under FreeBSD it's 'short' (I should fix this. :)). But since you're really only testing for 0'ness it shouldn't really be a problem. We need not copy the value anywhere, so as long as the struct is correctly declared in the system header files I don't think it matters what the field type is ... Yup, my point exactly. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Tom Lane [EMAIL PROTECTED] [010306 11:30] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: * Tom Lane [EMAIL PROTECTED] [010306 11:03] wrote: I notice that our BeOS and QNX emulations of shmctl() don't support IPC_STAT, but that could be dealt with, at least to the extent of stubbing it out. Well, since we already have spinlocks, I can't see why we can't keep the refcount and spinlock in a special place in the shm for all cases. No, we mustn't go there. If the kernel isn't keeping the refcount then it's worse than useless: as soon as some process crashes without decrementing its refcount, you have a condition that you can't recover from without reboot. Not if the postmaster outputs the following: What I'm currently imagining is that the stub implementations will just return a failure code for IPC_STAT, and the outer code will in turn fail with a message along the lines of "It looks like there's a pre-existing shmem block (id XXX) still in use. If you're sure there are no old backends still running, remove the shmem block with ipcrm(1), or just delete $PGDATA/postmaster.pid." I dunno what shmem management tools exist on BeOS/QNX, but deleting the lockfile will definitely suppress the startup interlock ;-). Yes, if possible a more meaningful error message and pointer to some docco would be nice Is the above good enough? Sure. :)
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Lamar Owen [EMAIL PROTECTED] [010306 11:39] wrote: Peter Eisentraut wrote: Not only note the shm_nattch type, but also shm_segsz, and the "unused" fields in between. I don't know a thing about the Linux kernel sources, but this doesn't seem right. Red Hat 7, right? My RedHat 7 system isn't running RH 7 right now (it's this notebook that I'm running Win95 on right now), but see which RPMs own the two headers. You may be in for a shock. IIRC, the first system include is from the 2.4 kernel, and the second in the kernel source tree is from the 2.2 kernel. Odd, but not really broken. Should be fixed in the latest public beta of RedHat, which actually has the 2.4 kernel. I can't really say any more about that, however. Y'know, I was only kidding about Linux going out of its way to defeat the 'shm_nattch' trick... *sigh* As a FreeBSD developer I'm wondering if Linux keeps compatibility calls around for old binaries or not. Any idea? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Tom Lane [EMAIL PROTECTED] [010306 11:49] wrote: Peter Eisentraut [EMAIL PROTECTED] writes: What I don't like is that my /usr/include/sys/shm.h (through other headers) has [foo] whereas /usr/src/linux/include/shm.h has [bar] Are those declarations perhaps bit-compatible? Looks a tad endian-dependent, though ... Of course not: the size of the struct changed (short -> unsigned long, basically int16_t -> uint32_t), and because the kernel and userland in Linux are hardly in sync you have the fun of guessing which combination you get: old struct + old syscall (ok), new struct + old syscall (boom), old struct + new syscall (boom), new struct + new syscall (ok). Honestly I think this problem should be left to the vendor to fix properly (if it needs fixing); the sysV API was published at least 6 years ago, they ought to have it mostly correct by now. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Lamar Owen [EMAIL PROTECTED] [010306 13:27] wrote: Nathan Myers wrote: That is why there is no problem with version skew in the syscall argument structures on a correctly-configured Linux system. (On a Red Hat system it is very easy to get them out of sync, but RH fans are used to problems.) Is RedHat bashing really necessary here? At least they are payrolling Second Chair on the Linux kernel hierarchy. And they are very supportive of PostgreSQL (by shipping us with their distribution). Just because they do some really nice things and have some really nice stuff doesn't mean they should get cut any slack for doing things like shipping out-of-sync kernel/system headers, kill -9'ing databases, and having programs like 'tmpwatch' running on the boxes. It really shows a lack of understanding of how Unix is supposed to run. What they really need to do is hire some grey beards (old-school Unix folks) to QA the releases and keep stuff like this from happening/shipping. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
Alfred Perlstein [EMAIL PROTECTED] writes: Are there any portability problems with relying on shm_nattch to be available? If not, I like this a lot... Well, it's available on FreeBSD and Solaris; I'm sure Redhat has some daemon that resets the value to 0 periodically just for kicks, so it might not be viable... :) I notice that our BeOS and QNX emulations of shmctl() don't support IPC_STAT, but that could be dealt with, at least to the extent of stubbing it out. * Cyril VELTER [EMAIL PROTECTED] [010306 16:15] wrote: BeOS doesn't have this stat (I have a bunch of others, but not this one). If I understand correctly, you want to check if there is some backend still attached to the shared mem segment of a given key? In this case, I have an easy solution to fake the stat, because all segments have an encoded name containing this key, so I can count them. We need to be able to take a single shared memory segment and determine if any other process is using it. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Tom Lane [EMAIL PROTECTED] [010305 14:51] wrote: I think we need a stronger interlock to prevent this scenario, but I'm unsure what it should be. Ideas? Re having multiple postmasters active by accident: the sysV IPC stuff has some hooks in it that may help you. One idea is to check the 'struct shmid_ds' field 'shm_nattch'; basically, at startup, if it's not 1 (or 0) then you have more than one postgresql instance messing with it and it should not proceed. I'd also suggest looking into using sysV semaphores and the semundo stuff; afaik it can be used to track the number of consumers of a resource. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] How to shoot yourself in the foot: kill -9 postmaster
* Tom Lane [EMAIL PROTECTED] [010305 19:13] wrote: Lamar Owen [EMAIL PROTECTED] writes: Tom Lane wrote: Postmaster down, backends alive is not a scenario we're currently prepared for. We need a way to plug that gap. Postmaster can easily enough find out if zombie backends are 'out there' during startup, right? If you think it's easy enough, enlighten the rest of us ;-). Be sure your solution only finds leftover backends from the previous instance of the same postmaster, else it will prevent running multiple postmasters on one system. I'm sure some sort of encoding of the PGDATA directory along with the pids stored in the shm segment... What can postmaster _do_ about it, though? It won't necessarily be able to kill them -- but it also can't control them. If it _can_ kill them, should it try? I think refusal to start is sufficient. They should go away by themselves as their clients disconnect, and forcing the issue doesn't seem like it will improve matters. The admin can kill them (hopefully with just a SIGTERM ;-)) if he wants to move things along ... but I'd not like to see a newly-starting postmaster do that automatically. I agree; shooting down processes incorrectly should be left up to vendors' braindead scripts. :) Should the backend look for the presence of its parent postmaster periodically and gracefully come down if postmaster goes away without the proper handshake? Unless we checked just before every disk write, this wouldn't represent a safe failure mode. The onus has to be on the newly-starting postmaster, I think, not on the old backends. Should a set of backends detect a new postmaster coming up and try to 'sync up' with that postmaster, Nice try ;-). How will you persuade the kernel that these processes are now children of the new postmaster? Oh, easy, use ptrace. :) -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] preproc.y error
* Tom Lane [EMAIL PROTECTED] [010207 17:24] wrote: Vince Vielhaber [EMAIL PROTECTED] writes: Now I get: byacc -d preproc.y byacc: f - maximum table size exceeded gmake[4]: *** [preproc.c] Error 2 Better install bison if you want to work with CVS sources ... the lack of bison probably explains why it's failing for you on this system when it's OK on other FreeBSD boxes. I wonder if we ought not accept byacc as a suitable yacc in configure? Peter, what do you think? I think I reported this broken a couple of months ago, but it was too late to add the check to configure for 7.0. byacc doesn't work, you need bison (or maybe some special flags to byacc). -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Auto-indexing
* Christopher Kings-Lynne [EMAIL PROTECTED] [010206 18:29] wrote: Is it a feasible idea that PostgreSQL could detect when an index would be handy and create it itself, or at least log that a table is being queried but the indices are not appropriate? I suggest this as it's a feature of most Windows databases, and MySQL does it. I think it would be a great timesaver as we have hundreds of different queries, and it's a real pain to have to EXPLAIN them all, etc. Is that possible? Feasible? Probably both, but if it's done there should be options to: .) disable it completely or by table/database, or even by threshold or disk-free parameters (indices can be large) .) log any auto-created indices to inform the DBA .) if disabled, optionally log when it would have created an index on the fly (suggest an index) .) expire old and unused auto-created indices. Generally Postgresql assumes the user knows what he's doing, but it couldn't hurt too much to provide an option to have it assist the user. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] RE: Index grows huge, possible leakage?
* Mikheev, Vadim [EMAIL PROTECTED] [010202 10:39] wrote: After several weeks our indices grow very large (in one case to 4-5 gigabytes). After dropping and recreating the indices they shrink back to something more reasonable (500 megs in the same case). We are currently using Vadim's vacuum patches for VLAZY and MMNB, against 7.0.3. We are using a LAZY vacuum on these tables. However a normal (non-lazy) vacuum doesn't shrink the index; the only thing that helps reduce the size is dropping and recreating. Is this a bug in 7.0.3? A possible bug in Vadim's patches? Or is this somewhat expected behavior that we have to cope with? When an index is created its pages are filled completely, so any insert into such pages results in a page split, i.e. an additional page. So, it's very easy to get 4Gb from 500Mb. Well, that certainly stinks. :( Vacuum was never able to shrink indices - it just removes dead index tuples and so allows the space to be re-used ... if you'll insert the same keys. This doesn't make sense to me; seriously, if the table is locked during a normal vacuum (not VLAZY), why not have vacuum make a new index by copying valid index entries into a new index instead of just vacating slots that aren't used? To know whether VLAZY works properly or not I would need the vacuum debug messages. Did you run vacuum with the verbose option, or do you have the postmaster's logs? With LAZY, vacuum writes messages like: Index _name_: deleted XXX unfound YYY - YYY is supposed to be 0... With what you explained (indices normally growing) I don't think VLAZY is the problem here. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
[HACKERS] Index grows huge, possible leakage?
After several weeks our indices grow very large (in one case to 4-5 gigabytes). After dropping and recreating the indices they shrink back to something more reasonable (500 megs in the same case). We are currently using Vadim's vacuum patches for VLAZY and MMNB, against 7.0.3. We are using a LAZY vacuum on these tables. However a normal (non-lazy) vacuum doesn't shrink the index; the only thing that helps reduce the size is dropping and recreating. Is this a bug in 7.0.3? A possible bug in Vadim's patches? Or is this somewhat expected behavior that we have to cope with? As a side note, the space requirement is actually 'ok'; it's just that performance gets terrible once the indices reach such huge sizes. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Sure enough, the lock file is gone
* Peter Eisentraut [EMAIL PROTECTED] [010126 12:11] wrote: The 'tmpwatch' program on Red Hat will remove the /tmp/.s.PGSQL.5432.lock file after the server has run 6 days. This will be a problem. We could touch (open) the file once every time ServerLoop() runs around. It's not perfect but it should work in practice. Why not have the RPM/configure scripts stick it wherever Redhat says it's safe to? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Libpq async issues
res as well (too long query sends it into an infinite loop trying to queue data, most likely). A possible answer is to specify that a return of +N means "N bytes remain unqueued due to risk of blocking" (after having queued as much as you could). This would put the onus on the caller to update his pointers/counts properly; propagating that into all the internal uses of pqPutBytes would be no fun. (Of course, so far you haven't updated *any* of the internal callers to behave reasonably in case of a won't-block return; PQfn is just one example.) Another possible answer is to preserve pqPutBytes' old API, "queue or bust", by the expedient of enlarging the output buffer to hold whatever we can't send immediately. This is probably more attractive, even though a long query might suck up a lot of space that won't get reclaimed as long as the connection lives. If you don't do this then you are going to have to make a lot of ugly changes in the internal callers to deal with won't-block returns. Actually, a bulk COPY IN would probably be the worst case --- the app could easily load data into the buffer far faster than it could be sent. It might be best to extend PQputline to have a three-way return and add code there to limit the growth of the output buffer, while allowing all internal callers to assume that the buffer is expanded when they need it. pqFlush has the same kind of interface design problem: the same EOF code is returned for either a hard error or can't-flush-yet, but it would be disastrous to treat those cases alike. You must provide a 3-way return code. Furthermore, the same sort of 3-way return code convention will have to propagate out through anything that calls pqFlush (with corresponding documentation updates). pqPutBytes can be made to hide a pqFlush won't- block return by trying to enlarge the output buffer, but in most other places you won't have a choice except to punt it back to the caller. PQendcopy has the same interface design problem. 
It used to be that (unless you passed a null pointer) PQendcopy would *guarantee* that the connection was no longer in COPY state on return --- by resetting it, if necessary. So the return code was mainly informative; the application didn't have to do anything different if PQendcopy reported failure. But now, a nonblocking application does need to pay attention to whether PQendcopy completed or not --- and you haven't provided a way for it to tell. If 1 is returned, the connection might still be in COPY state, or it might not (PQendcopy might have reset it). If the application doesn't distinguish these cases then it will fail. I also think that you want to take a hard look at the automatic "reset" behavior upon COPY failure, since a PQreset call will block the application until it finishes. Really, what is needed to close down a COPY safely in nonblock mode is a pair of entry points along the line of "PQendcopyStart" and "PQendcopyPoll", with API conventions similar to PQresetStart/PQresetPoll. This gives you the ability to do the reset (if one is necessary) without blocking the application. PQendcopy itself will only be useful to blocking applications. I'm sorry if they don't work for some situations other than COPY IN, but it's functionality that I needed and I expect to be expanded on by myself and others that take interest in nonblocking operation. I don't think that the nonblock code is anywhere near production quality at this point. It may work for you, if you don't stress it too hard and never have a communications failure; but I don't want to see us ship it as part of Postgres unless these issues get addressed. regards, tom lane -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Libpq async issues
* Tom Lane [EMAIL PROTECTED] [010124 10:27] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: * Bruce Momjian [EMAIL PROTECTED] [010124 07:58] wrote: I have added this email to TODO.detail and a mention in the TODO list. The bug mentioned here is long gone, Au contraire, the misdesign is still there. The nonblock-mode code will *never* be reliable under stress until something is done about that, and that means fairly extensive code and API changes. The "bug" is the one mentioned in the first paragraph of the email, where I broke _blocking_ connections for a short period. I still need to fix async connections for myself (and of course contribute it back), but I just haven't had the time. If anyone else wants it fixed earlier they can wait for me to do it, do it themselves, contract me to do it, or hope someone else comes along to fix it. I'm thinking that I'll do what you said and have separate paths for writing/reading to the socket, and APIs that give the user the option of a boundary, basically: "buffer this, but don't allow me to write until it's flushed", which would allow for larger-than-8k COPY rows to go into the backend. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
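The two fixes Tom asks for — a "queue or bust" enqueue that grows the output buffer instead of failing partway, and a flush with a genuine three-way return so callers can distinguish a hard error from can't-flush-yet — can be sketched as follows. This is an invented illustration, not libpq's actual internals; the type and function names are made up.

```c
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

enum flush_result { FLUSH_DONE, FLUSH_AGAIN, FLUSH_ERROR };

typedef struct {
    char   *buf;
    size_t  len, cap;
    int     fd;          /* nonblocking socket */
} outqueue;

/* "Queue or bust": grow the buffer rather than failing partway, so
 * internal callers never have to cope with a partial enqueue. */
int oq_put(outqueue *q, const char *data, size_t n)
{
    if (q->len + n > q->cap) {
        size_t ncap = q->cap ? q->cap : 8192;
        while (ncap < q->len + n)
            ncap *= 2;
        char *nbuf = realloc(q->buf, ncap);
        if (!nbuf)
            return -1;
        q->buf = nbuf;
        q->cap = ncap;
    }
    memcpy(q->buf + q->len, data, n);
    q->len += n;
    return 0;
}

/* Three-way return: done, would-block (come back later), or hard error.
 * Conflating the last two behind a single EOF code is exactly the trap
 * discussed above. */
enum flush_result oq_flush(outqueue *q)
{
    while (q->len > 0) {
        ssize_t w = write(q->fd, q->buf, q->len);
        if (w < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
                return FLUSH_AGAIN;
            return FLUSH_ERROR;
        }
        memmove(q->buf, q->buf + w, q->len - w);
        q->len -= (size_t)w;
    }
    return FLUSH_DONE;
}
```

The "boundary" idea then layers on top: a caller doing COPY IN could refuse to oq_put more rows while a FLUSH_AGAIN is outstanding, bounding buffer growth while still letting single oversized rows through.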
Re: [HACKERS] Patches with vacuum fixes available for 7.0.x
* Bruce Momjian [EMAIL PROTECTED] [010122 19:55] wrote: Vadim, did these patches ever make it into 7.1? According to: http://www.postgresql.org/cgi/cvsweb.cgi/pgsql/src/backend/parser/gram.y?rev=2.217&content-type=text/x-cvsweb-markup nope. :( We recently had a very satisfactory contract completed by Vadim. Basically Vadim has been able to reduce the amount of time taken by a vacuum from 10-15 minutes down to under 10 seconds. We've been running with these patches under heavy load for about a week now without any problems, except one: don't 'lazy' (new option for vacuum) a table which has just had an index created on it, or at least don't expect it to take any less time than a normal vacuum would. There are three patchsets and they are available at: http://people.freebsd.org/~alfred/vacfix/ complete diff: http://people.freebsd.org/~alfred/vacfix/v.diff only lazy vacuum option to speed up index vacuums: http://people.freebsd.org/~alfred/vacfix/vlazy.tgz only lazy vacuum option to only scan from start of modified data: http://people.freebsd.org/~alfred/vacfix/mnmb.tgz Although the patches are for 7.0.x I'm hoping that they can be forward-ported (if Vadim hasn't done it already) to 7.1. enjoy! -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk." -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Does Oracle store values in indices?
* Bruce Momjian [EMAIL PROTECTED] [010123 11:17] wrote: [ Charset KOI8-R unsupported, converting... ] The reason you have to visit the main table is that tuple validity status is only stored in the main table, not in each index. See prior discussions in the archives. But how does Oracle handle this? Oracle doesn't have a non-overwriting storage manager but uses rollback segments to maintain MVCC. Rollback segments are used to restore the valid version of an entire index/table page. Are there any plans to have something like this? I mean an overwriting storage manager. We hope to have it some day, hopefully soon. Vadim says that he hopes to have it done by 7.2, so if things go well it shouldn't be that far off... -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Re: [GENERAL] postgres memory management
* Peter Mount [EMAIL PROTECTED] [010122 13:21] wrote: At 13:18 21/01/01 +0100, Alexander Jerusalem wrote: Hi all, I'm experiencing some strange behaviour with postgresql 7.0.3 on Red Hat Linux 7. I'm sending lots of insert statements to the postgresql server from another machine via JDBC. During that process postgresql continues to take up more and more memory and seemingly never returns it to the system. Oddly, if I watch the postmaster and its subprocesses in ktop, I can't see which process takes up this memory. ktop shows that the postgresql-related processes have a constant memory usage, but the overall memory usage always increases as long as I continue to send insert statements. When the database connection is closed, no memory is reclaimed; the overall memory usage stays the same. And when I close down all postgresql processes including postmaster, it's the same. I'm rather new to Linux and postgresql so I'm not sure if I should call this a memory leak :-) Has anybody experienced a similar thing? I'm not sure myself. You can rule out JDBC (or Java) here as you say you are connecting from another machine. When your JDBC app closes, does it call the connection's close() method? Do any messages like "Unexpected EOF from client" appear on the server side? The only other thing that comes to mind is possibly something weird is happening with IPC. After you closed down postgres, does ipcclean free up any memory? I don't know if this is valid for Linux, but it is how FreeBSD works: for the most part used memory is never freed, it is only marked as reclaimable. This is so the system can cache more data. On a freshly booted FreeBSD box you'll have a lot of 'free' memory; after the box has been running for a long time the 'free' memory will probably never go higher than 10 MB, the rest is being used as cache. The main things you have to worry about are: a) really running out of memory (are you using a lot of swap?) b) not cleaning up IPC as Peter suggested.
-- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Transactions vs speed.
* mlw [EMAIL PROTECTED] [010113 17:19] wrote: I have a question about Postgres: Take this update: update table set field = 'X' ; This is a very expensive operation when the table has millions of rows; it takes over an hour. If I dump the database, and process the data with perl, then reload the data, it takes minutes. Most of the time is used creating indexes. I am not asking for a feature, I am just musing. Well, you really haven't said if you've tuned your database at all; the way postgresql ships by default it doesn't use a very large shared memory segment, and all the writing (at least in 7.0.x) is done synchronously. There's a boatload of email out there that explains various ways to tune the system. Here are some of the flags that I use: -B 32768 # uses over 300megs of shared memory -o "-F" # tells database not to call fsync on each update -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
[HACKERS] Re: Transactions vs speed.
* mlw [EMAIL PROTECTED] [010113 19:37] wrote: Alfred Perlstein wrote: * mlw [EMAIL PROTECTED] [010113 17:19] wrote: I have a question about Postgres: Take this update: update table set field = 'X' ; This is a very expensive operation when the table has millions of rows; it takes over an hour. If I dump the database, and process the data with perl, then reload the data, it takes minutes. Most of the time is used creating indexes. I am not asking for a feature, I am just musing. Well, you really haven't said if you've tuned your database at all; the way postgresql ships by default it doesn't use a very large shared memory segment, and all the writing (at least in 7.0.x) is done synchronously. There's a boatload of email out there that explains various ways to tune the system. Here are some of the flags that I use: -B 32768 # uses over 300megs of shared memory -o "-F" # tells database not to call fsync on each update I have a good number of buffers (Not 32768, but a few), I have the "-F" option. Explain a "good number of buffers" :) Also, when was the last time you ran vacuum on this database? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Quite strange crash
* Mikheev, Vadim [EMAIL PROTECTED] [010108 23:08] wrote: Killing an individual backend with SIGTERM is bad luck. The backend will assume that it's being killed by the postmaster, and will exit without a whole lot of concern for cleaning up shared memory --- the SIGTERM --> die() --> elog(FATAL) code path. Is it true that elog(FATAL) doesn't clean up shmem etc? This would be very bad... What code will be returned to postmaster in this case? Right at the moment, the backend will exit with status 0. I think you are thinking the same thing I am: maybe a backend that receives SIGTERM ought to exit with nonzero status. That would mean that killing an individual backend would instantly translate into an installation-wide restart. I am not sure whether that's a good idea. Perhaps this cure is worse than the disease. Well, it's not a good idea because SIGTERM is used for ABORT + EXIT (pg_ctl -m fast stop), but shouldn't ABORT clean up everything? Er, shouldn't ABORT leave the system in the exact state that it's in, so that one can get a crashdump/traceback on a wedged process without it trying to clean up after itself? -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Patches with vacuum fixes available for 7.0.x
* Peter Schmidt [EMAIL PROTECTED] [010102 12:53] wrote: Will these patchsets be available to the public? I get: "You don't have permission to access /~alfred/vacfix/vlazy.tgz on this server" Thanks. Peter There are three patchsets and they are available at: http://people.freebsd.org/~alfred/vacfix/ complete diff: http://people.freebsd.org/~alfred/vacfix/v.diff only lazy vacuum option to speed up index vacuums: http://people.freebsd.org/~alfred/vacfix/vlazy.tgz only lazy vacuum option to only scan from start of modified data: http://people.freebsd.org/~alfred/vacfix/mnmb.tgz Oops! The permissions should be fixed now; if anyone wants to grab these, feel free. Peter, thanks for pointing it out. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
* Bruce Momjian [EMAIL PROTECTED] [010101 23:59] wrote: Alfred Perlstein [EMAIL PROTECTED] writes: One trick that may help is calling sched_yield(2) on a lock miss; it's a POSIX call and quite new, so you'd need a 'configure' test for it. The author of the current s_lock code seems to have thought that select() with a zero delay would do the equivalent of sched_yield(). I'm not sure if that's true on very many kernels, if indeed any... I doubt we could buy much by depending on sched_yield(); if you want to assume POSIX facilities, ISTM you might as well go for user-space semaphores and forget the whole TAS mechanism. Another issue is that sched_yield brings in the pthreads library/hooks on some OS's, which we certainly want to avoid. I know it's a major undertaking, but since the work is sort of done, have you guys considered the port to Solaris threads and seeing about making a pthreads port of that? I know it would probably get you considerable gains under Windows at the expense of dropping some really, really legacy systems. Or you could do what apache (is rumored) does and have it do either threads or processes or both... -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] GNU readline and BSD license
* The Hermit Hacker [EMAIL PROTECTED] [001229 14:11] wrote: On Sat, 23 Dec 2000, Bruce Momjian wrote: FreeBSD has a freely available library called 'libedit' that could be shipped with postgresql, it's under the BSD license. Yes, that is our solution if we have a real problem here. Is there a reason *not* to move towards that for v7.2 so that the functions we are making optional with readline are automatic? Since we could then ship the code, we could make it a standard vs optional "feature" ... My thought would be to put 'make history feature standard using libedit' onto the TODO list and take it from there ... I doubt I'd have the time to do it, but if you guys want to use libedit it'd probably be a good idea, at least to reduce the amount of potential GPL tainting in the source code. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] GNU readline and BSD license
* Tom Lane [EMAIL PROTECTED] [001229 15:43] wrote: Lamar Owen [EMAIL PROTECTED] writes: How different is the feature set? I was going to ask the same thing. If it's an exact replacement then OK, but I do not want to put up with non-Emacs-compatible keybindings, to mention just one likely issue. The whole thing really strikes me as make-work anyway. Linux is GPL'd; does anyone want to argue that we shouldn't run on Linux? Since we are not including libreadline in our distribution, there is NO reason to worry about using it when it's available. Wanting to find a replacement purely because of the license amounts to license bigotry, IMHO. Rasmus Lerdorf warned one of you guys that simply linking to GNU readline can contaminate code with the GPL. Readline isn't LGPL, which permits linking without license issues; it is GPL, which means that if you link to it, you must be GPL as well. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] GNU readline and BSD license
* Peter Eisentraut [EMAIL PROTECTED] [001229 16:01] wrote: The Hermit Hacker writes: Is there a reason *not* to move towards that for v7.2 so that the functions we are making optional with readline are automatic? Since we could then ship the code, we could make it a standard vs optional "feature" ... My thought would be to put 'make history feature standard using libedit' onto the TODO list and take it from there ... In my mind this is a pointless waste of developer time because there is no problem to solve here. I'm sure we all have better things to do than porting libedit to a dozen systems and then explaining to users why the tarball is bloated and their carefully composed readline configuration doesn't work anymore. If there is something functionally wrong with Readline then let's talk about it, but let's not replace it with something because some PHP dude said that RMS said something. From http://www.gnu.org/copyleft/gpl.html This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License. My understanding (from the recent discussion) is that Postgresql has certain dependencies on libreadline and won't compile/work without it; if true, this effectively forces anyone wishing to derive a viable commercial product based on Postgresql to switch to the GPL or port to libedit anyway. If readline is completely optional then there's really no problem. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] GNU readline and BSD license
* Tom Lane [EMAIL PROTECTED] [001229 16:38] wrote: The Hermit Hacker [EMAIL PROTECTED] writes: Actually, IMHO, the pro to moving to libedit is that we could include it as part of the distribution and make history a *standard* feature How big is libedit? If it's tiny, that might be a good argument ... but I don't want to see us bulking up our distro with something that people could and should get directly from its source. ~350k -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] GNU readline and BSD license
* The Hermit Hacker [EMAIL PROTECTED] [001229 17:06] wrote: On Fri, 29 Dec 2000, Tom Lane wrote: Alfred Perlstein [EMAIL PROTECTED] writes: My understanding (from the recent discussion) is that Postgresql has certain dependencies on libreadline and won't compile/work without it, Then you're working from a misconception. I think the misconception that he might be working on here is the point someone brought up that when configure runs, it is adding -lreadline to the backend compile, even though I don't think there is any reason for doing such? I thought psql required libreadline; I'm not sure who said it. If nothing requires it then there's not much point in moving to libedit from a devel cost/benefit analysis. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Assuming that TAS() will succeed the first time is verboten
* Tom Lane [EMAIL PROTECTED] [001228 14:25] wrote: [EMAIL PROTECTED] (Nathan Myers) writes: I wonder about the advisability of using spinlocks in user-level code which might be swapped out any time. The reason we use spinlocks is that we expect the lock to succeed (not block) the majority of the time, and we want the code to fall through as quickly as possible in that case. In particular we do *not* want to expend a kernel call when we are able to acquire the lock immediately. It's not a true "spin" lock because we don't sit in a tight loop when we do have to wait for the lock --- we use select() to delay for a small interval before trying again. See src/backend/storage/buffer/s_lock.c. The design is reasonable, even if a little bit offbeat. It sounds pretty bad: if you have a contested lock you'll trap into the kernel each time you miss, crossing the protection boundary and then waiting. It's a tough call to make, because on UP systems you lose big time by spinning for your quantum; however on SMP systems there's a chance that the lock is owned by a process on another CPU and spinning might be beneficial. One trick that may help is calling sched_yield(2) on a lock miss; it's a POSIX call and quite new, so you'd need a 'configure' test for it. http://www.freebsd.org/cgi/man.cgi?query=sched_yield&apropos=0&sektion=0&manpath=FreeBSD+4.2-RELEASE&format=html -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] GNU readline and BSD license
* Bruce Momjian [EMAIL PROTECTED] [001223 06:59] wrote: Rasmus Lerdorf, the big PHP developer, told me that the existence of GNU readline hooks in our source tree could cause RMS/GNU to force us to a GNU license. Obviously, we could remove readline hooks and ship a BSD line editing library, but does this make any sense to you? It doesn't make sense to me, but he was quite certain. Our ODBC library is also GNU licensed, but I am told this is not a problem because it doesn't link into the backend. However, neither does readline. However, readline does link into psql. FreeBSD has a freely available library called 'libedit' that could be shipped with postgresql; it's under the BSD license. If you have access to a FreeBSD box see the editline(3) manpage, or go to: http://www.freebsd.org/cgi/man.cgi?query=editline&apropos=0&sektion=0&manpath=FreeBSD+4.2-RELEASE&format=html -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Re: Too many open files (was Re: spinlock problems reported earlier)
* Tom Lane [EMAIL PROTECTED] [001223 14:16] wrote: Department of Things that Fell Through the Cracks: Back in August we had concluded that it is a bad idea to trust "sysconf(_SC_OPEN_MAX)" as an indicator of how many files each backend can safely open. FreeBSD was reported to return 4136, and I have since noticed that LinuxPPC returns 1024. Both of those are unreasonably large fractions of the actual kernel file table size. A few dozen backends opening hundreds of files apiece will fill the kernel file table on most Unix platforms. getdtablesize(2) on BSD should tell you the per-process limit. sysconf on FreeBSD shouldn't lie to you. getdtablesize should take into account limits in place. later versions of FreeBSD have a sysctl 'kern.openfiles' which can be checked to see if the system is approaching the systemwide limit. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."
Re: [HACKERS] Unable to check out REL7_1 via cvs
* Yusuf Goolamabbas [EMAIL PROTECTED] [001222 15:34] wrote: Hi, I am using the following command to check out the 7.1 branch of PostgreSQL. cvs -d :pserver:[EMAIL PROTECTED]:/home/projects/pgsql/cvsroot co -r REL7_1 pgsql This is the error I am getting. cvs [server aborted]: cannot write /home/projects/pgsql/cvsroot/CVSROOT/val-tags: Permission denied I can check out HEAD perfectly alright Anybody else seeing similar results ? Try using "cvs -Rq ..." or just use CVSup; it's a lot quicker. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] "I have the heart of a child; I keep it in a jar on my desk."