Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Sun, 2002-10-06 at 04:03, Tom Lane wrote:

Hannu Krosing [EMAIL PROTECTED] writes: Or its solution ;) as instead of predicting we just write all data in the log that is ready to be written. If we postpone writing, there will be hiccups when we suddenly discover that we need to write a whole lot of pages (fsync()) after idling the disk for some period.

This part is exactly the same point that I've been proposing to solve with a background writer process. We don't need aio_write for that. The background writer can handle pushing completed WAL pages out to disk. The sticky part is trying to gang the writes for multiple transactions whose COMMIT records would fit into the same WAL page, and that WAL page isn't full yet.

I just hoped that the kernel could be used as the background writer process, and in the process also solve the multiple-commits-on-the-same-page problem.

The rest of what you wrote seems like wishful thinking about how aio_write might behave :-(. I have no faith in it.

Yeah, and the fact that there are several slightly different implementations of AIO even on Linux alone does not help. I have to test the SGI KAIO implementation for conformance with my wishful thinking ;) Perhaps you could ask around about AIO in RedHat Advanced Server (is it the same AIO as SGI, and how does it behave in the multiple-writes-on-the-same-page case), as you may have better links to RedHat?

-- Hannu

---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Parallel Executors [was RE: [HACKERS] Threaded Sorting]
Tom Lane wrote:

Curtis Faith [EMAIL PROTECTED] writes: What about splitting out parsing, optimization and plan generation from execution and having a separate pool of executor processes? As an optimizer finished with a query plan it would initiate execution by grabbing an executor from a pool and passing it the plan.

So different executors would potentially handle the queries from a single transaction? How will you deal with pushing transaction-local state from one to the other? Even if you restrict it to switching at transaction boundaries, you still have session-local state (at minimum user ID and SET settings) to worry about.

Hmmm, what transaction boundaries did you mean? Since we are talking about single-statement parallelization, there must be some specific internal semantics that you believe need isolation. It seems like we'd be able to get most of the benefit and restrict the parallelization in a way that would preserve this isolation, but I'm curious what you were specifically referring to.

The current transaction/user state seems to be stored in process-global space. This could be changed to be a pointer to a struct stored in a backend-specific shared memory area which would be accessed by the executor process at execution start. The backend would destroy and recreate the shared memory and restart execution in the case where an executor process dies, much like the postmaster does with backends now.

To the extent the executor process might make changes to the state, which I'd try to avoid if possible (don't know if it is), the executors could obtain locks; otherwise, if the executions were constrained to isolated elements (changes to different indexes, for example), it seems like it would be possible using an architecture where you have:

Main Executor: Responsible for updating global metadata from each sub-executor and assembling the results of multiple executions.
In the case of multiple-executor sorts, the main executor would perform a merge sort on the results of its and its subordinates' pre-sorted sub-sets of the relation.

Subordinate Executor: Executes sub-plans and returns results or metadata update information into front-end shared memory directly.

To make this optimal, the index code would have to be changed to support the idea of partial scans. In the case of btrees it would be pretty easy, using the root page to figure out what index values delineate different 1/2's, 1/3's, 1/4's, etc. of the index space. I'm not sure what you'd have to do to support this for table scans, as I don't know the PostgreSQL tuple storage mechanism yet.

This does not seem like too much architectural complexity or performance overhead (even for the single-executor case) for a big gain for complex query performance.

Being able to apply multiple CPUs to a single query is attractive, but I've not yet seen schemes for it that don't look like the extra CPU power would be chewed up in overhead :-(.

Do you remember specific overhead problems/issues?

- Curtis
Re: [HACKERS] Improving speed of copy
On Fri, 20 Sep 2002, Shridhar Daithankar wrote:

On 20 Sep 2002 at 21:22, Shridhar Daithankar wrote: MySQL takes 221 sec. vs. 1121 sec. for postgres. For postgresql, that is around 11.5K rows per second. Each tuple has 23 fields with a fixed length of around 100 bytes.

Yes, postgres is much slower than MySQL for doing bulk loading of data. There's not much, short of hacking on the code, that can be done about this.

One more issue is the time taken for composite index creation. It's 4341 sec as opposed to 436 sec for MySQL. These are three non-unique character fields where the combination itself is not unique as well.

Setting sort_mem appropriately makes a big difference here. I generally bump it up to 2-8 MB for everyone, and when I'm building a big index, I set it to 32 MB or so just for that session. But make sure you don't set it so high you drive your system into swapping, or it will kill your performance. Remember also that in 7.2.x, postgres will actually use almost three times the value you give sort_mem (i.e., a sort_mem of 32 MB will actually allocate close to 96 MB of memory for the sort).

cjs -- Curt Sampson [EMAIL PROTECTED] +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Re: [HACKERS] [GENERAL] Performance while loading data and indexing
On 26 Sep 2002, Neil Conway wrote:

The fact that ext2 defaults to asynchronous mode and UFS (at least on the BSDs) defaults to synchronous mode seems like a total non-issue to me. Is there any more to the alleged difference in reliability?

It was sort of pointed out here, but perhaps not made completely clear, that Berkeley FFS defaults to synchronous meta-data updates but asynchronous data updates. You can also specify entirely synchronous or entirely asynchronous updates. Linux ext2fs supports only these last two modes, which is the problem.

cjs -- Curt Sampson [EMAIL PROTECTED] +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Sat, 2002-10-05 at 14:46, Curtis Faith wrote:

2) aio_write vs. normal write. Since, as you and others have pointed out, aio_write and write are both asynchronous, the issue becomes one of whether or not the copies to the file system buffers happen synchronously or not.

Actually, I believe that write will be *mostly* asynchronous while aio_write will always be asynchronous. In a buffer-poor environment, I believe write will degrade into a synchronous operation. In an ideal situation, I think they will prove to be on par with one another, with a slight bias toward aio_write. In less-than-ideal situations where buffer space is at a premium, I think aio_write will get the leg up.

The kernel doesn't need to know anything about platter rotation. It just needs to keep the disk write buffers full enough not to cause rotational latency.

Which is why in a buffer-poor environment aio_write is generally preferred, as the write is still queued even if the buffer is full. That means the kernel will be ready to begin placing writes into the buffer, all without the process having to wait. On the other hand, when using write, the process must wait. In a worst-case scenario, it seems that aio_write does get a win.

I personally would at least like to see an aio implementation and would be willing to help benchmark it to validate any returns in performance. Surely if testing reflected a performance boost it would be considered for baseline inclusion?

Greg
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
Greg Copeland [EMAIL PROTECTED] writes: I personally would at least like to see an aio implementation and would be willing to even help benchmark it to benchmark/validate any returns in performance. Surely if testing reflected a performance boost it would be considered for baseline inclusion? It'd be considered, but whether it'd be accepted would have to depend on the size of the performance boost, its portability (how many platforms/scenarios do you actually get a boost for), and the extent of bloat/uglification of the code. I can't personally get excited about something that only helps if your server is starved for RAM --- who runs servers that aren't fat on RAM anymore? But give it a shot if you like. Perhaps your analysis is pessimistic. regards, tom lane
Re: [HACKERS] New lock types
On Sat, Oct 05, 2002 at 08:25:35PM -0400, Tom Lane wrote:

Alvaro Herrera [EMAIL PROTECTED] writes: It seems to me I'll have to add some amount of lock types in the lock manager. Does that bother you?

Such as?

There are three new lock modes: R, RX and RS (Reorganizer, Reorganizer Exclusive and Reorganizer Shared). Actually, they are not new lock types; rather, new objects on which locks should be obtained before using the index pages.

We've got a ton of lock modes already; perhaps these operations can be mapped into acquiring some existing lock modes on index pages?

I'll have a look. Can't say right now. Some of the locks have special semantics on what to do when a conflicting request arrives.

Yeah, you can't recycle pages without a freelist :-(.

The access/nbtree/README says that the metapage of the index contains a pointer to the list of free pages. However, I don't see anything like that in BTMetaPageData. Should I suppose that it was ripped out some time ago? I'll add it if that's the case. Or maybe I'm looking at the wrong place?

Maybe the freelist head link can be treated as a separately lockable object.

I think creating a new LWLockId (BTFreeListLock?) can help here. The operations on the freelist are short-lived and rather infrequent, so it doesn't seem to matter that it is global to all indexes. Another way would be to create one LockId per index, but that seems a waste to me.

-- Alvaro Herrera (alvherre[a]atentus.com) No hay ausente sin culpa ni presente sin disculpa (Prov. frances)
Re: [HACKERS] Proposed LogWriter Scheme, WAS: Potential Large
On Sun, 2002-10-06 at 11:46, Tom Lane wrote: I can't personally get excited about something that only helps if your server is starved for RAM --- who runs servers that aren't fat on RAM anymore? But give it a shot if you like. Perhaps your analysis is pessimistic. I do suspect my analysis is somewhat pessimistic too but to what degree, I have no idea. You make a good case on your memory argument but please allow me to further kick it around. I don't find it far fetched to imagine situations where people may commit large amounts of memory for the database yet marginally starve available memory for file system buffers. Especially so on heavily I/O bound systems or where sporadicly other types of non-database file activity may occur. Now, while I continue to assure myself that it is not far fetched I honestly have no idea how often this type of situation will typically occur. Of course, that opens the door for simply adding more memory and/or slightly reducing the amount of memory available to the database (thus making it available elsewhere). Now, after all that's said and done, having something like aio in use would seemingly allowing it to be somewhat more self-tuning from a potential performance perspective. Greg signature.asc Description: This is a digitally signed message part
Re: [HACKERS] New lock types
Alvaro Herrera [EMAIL PROTECTED] writes: I think creating a new LWLockId (BTFreeListLock?) can help here. The operations on freelist are short lived and rather infrequent so it doesn't seem to matter that it is global to all indexes. Seems like a really bad idea to me ... what makes you think that this would not be a bottleneck? You'd have to take such a lock during every index-page split, which is not that uncommon. Another way would be to create one LockId per index, but it seems a waste to me. No, you should be looking at a way to represent index locking in the standard lock manager, not as an LWLock. We've already got a concept of page-level lockable entities there. regards, tom lane
[HACKERS] Naming convention
Hello hackers, What's the naming convention for new functions/variables? I've seen this_way() and ThisWay() used without visible distinction. I've used both in previously submitted and accepted patches... Does it matter? -- Alvaro Herrera (alvherre[a]atentus.com) La libertad es como el dinero; el que no la sabe emplear la pierde (Alvarez)
Re: [HACKERS] Analysis of ganged WAL writes
I said: There is a simple error in the current code that is easily corrected: in XLogFlush(), the wait to acquire WALWriteLock should occur before, not after, we try to acquire WALInsertLock and advance our local copy of the write request pointer. (To be exact, xlog.c lines 1255-1269 in CVS tip ought to be moved down to before line 1275, inside the if that tests whether we are going to call XLogWrite.)

That patch was not quite right, as it didn't actually flush the later-arriving data. The correct patch is

*** src/backend/access/transam/xlog.c.orig	Thu Sep 26 18:58:33 2002
--- src/backend/access/transam/xlog.c	Sun Oct  6 18:45:57 2002
***************
*** 1252,1279 ****
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
- 		/* if something was added to log cache then try to flush this too */
- 		if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
- 		{
- 			XLogCtlInsert *Insert = &XLogCtl->Insert;
- 			uint32		freespace = INSERT_FREESPACE(Insert);
- 
- 			if (freespace < SizeOfXLogRecord)	/* buffer is full */
- 				WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 			else
- 			{
- 				WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
- 				WriteRqstPtr.xrecoff -= freespace;
- 			}
- 			LWLockRelease(WALInsertLock);
- 		}
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			WriteRqst.Write = WriteRqstPtr;
! 			WriteRqst.Flush = record;
  			XLogWrite(WriteRqst);
  		}
  		LWLockRelease(WALWriteLock);
--- 1252,1284 ----
  	/* done already? */
  	if (!XLByteLE(record, LogwrtResult.Flush))
  	{
  		/* now wait for the write lock */
  		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
  		LogwrtResult = XLogCtl->Write.LogwrtResult;
  		if (!XLByteLE(record, LogwrtResult.Flush))
  		{
! 			/* try to write/flush later additions to XLOG as well */
! 			if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
! 			{
! 				XLogCtlInsert *Insert = &XLogCtl->Insert;
! 				uint32		freespace = INSERT_FREESPACE(Insert);
! 
! 				if (freespace < SizeOfXLogRecord)	/* buffer is full */
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 				else
! 				{
! 					WriteRqstPtr = XLogCtl->xlblocks[Insert->curridx];
! 					WriteRqstPtr.xrecoff -= freespace;
! 				}
! 				LWLockRelease(WALInsertLock);
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = WriteRqstPtr;
! 			}
! 			else
! 			{
! 				WriteRqst.Write = WriteRqstPtr;
! 				WriteRqst.Flush = record;
! 			}
  			XLogWrite(WriteRqst);
  		}
  		LWLockRelease(WALWriteLock);

To test this, I made a modified version of pgbench in which each transaction consists of a simple

	insert into table_NNN values(0);

where each client thread has a separate insertion target table. This is about the simplest transaction I could think of that would generate a WAL record each time.

Running this modified pgbench with postmaster parameters

	postmaster -i -N 120 -B 1000 --wal_buffers=250

and all other configuration settings at default, CVS-tip code gives me a pretty consistent 115-118 transactions per second for anywhere from 1 to 100 pgbench client threads. This is exactly what I expected, since the database (including the WAL file) is on a 7200 RPM SCSI drive. The theoretical maximum rate of sync'd writes to the WAL file is therefore 120 per second (one per disk revolution), but we lose a little because once in a while the disk has to seek to a data file.

Inserting the above patch, and keeping all else the same, I get:

$ mybench -c 1 -t 1 bench1
number of clients: 1
number of transactions per client: 1
number of transactions actually processed: 1/1
tps = 116.694205 (including connections establishing)
tps = 116.722648 (excluding connections establishing)

$ mybench -c 5 -t 2000 -S -n bench1
number of clients: 5
Re: [HACKERS] Analysis of ganged WAL writes
On Sun, 2002-10-06 at 18:07, Tom Lane wrote: CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients to 10% idle at 10 or more. So this does seem to be a nice win, and unless I hear objections I will apply it ... Wow Tom! That's wonderful! On the other hand, maybe people needed the extra idle CPU time that was provided by the unpatched code. ;) Greg
Re: [HACKERS] Analysis of ganged WAL writes
On Sun, 2002-10-06 at 19:35, Greg Copeland wrote: On Sun, 2002-10-06 at 18:07, Tom Lane wrote: CPU loading goes from 80% idle at 1 client to 50% idle at 5 clients to 10% idle at 10 or more. So this does seem to be a nice win, and unless I hear objections I will apply it ... Wow Tom! That's wonderful! On the other hand, maybe people needed the extra idle CPU time that was provided by the unpatched code. ;) Naw. Distributed.net finally got through RC5-64. Lots of CPU to spare now. -- Rod Taylor
Re: [HACKERS] [GENERAL] Large databases, performance
On Thu, 3 Oct 2002, Shridhar Daithankar wrote:

Well, we were comparing ext3 vs. reiserfs. I don't remember the journalling mode of ext3, but we did a 10 GB write test. Besides, converting the RAID to RAID-0 from RAID-5 might have something to do with it.

That will have a massive, massive effect on performance. Depending on your RAID subsystem, you can expect RAID-0 to be between two and twenty times as fast for writes as RAID-5. If you compared one filesystem on RAID-5 and another on RAID-0, your results are likely not at all indicative of file system performance.

Note that I've redirected followups to the pgsql-performance list. Avoiding cross-posting would be nice, since I am getting lots of duplicate messages these days.

cjs -- Curt Sampson [EMAIL PROTECTED] +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
Re: [HACKERS] [GENERAL] Large databases, performance
On Thu, 3 Oct 2002, Shridhar Daithankar wrote:

Our major concern remains load time, as data is generated in real time and is expected in the database within a specified time period.

If your time period is long enough, you can do what I do, which is to use partial indexes so that the portion of the data being loaded is not indexed. That will speed your loads quite a lot. Afterwards you can either generate another partial index for the range you loaded, or generate a new index over both old and new data and then drop the old index. The one trick is that the optimizer is not very smart about combining multiple indexes, so you often need to split your queries across the two partitions of the table that have separate indexes.

Shall I subscribe to performance?

Yes, you really ought to. The list is [EMAIL PROTECTED]

cjs -- Curt Sampson [EMAIL PROTECTED] +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC
[HACKERS] cross-posts (was Re: [GENERAL] Large databases, performance)
Curt Sampson [EMAIL PROTECTED] writes: ... Avoiding cross-posting would be nice, since I am getting lots of duplicate messages these days. Cross-posting is a fact of life, and in fact encouraged, on the pg lists. I suggest adapting. Try sending set all unique your-email-address to the PG majordomo server; this sets you up to get only one copy of each cross-posted message. regards, tom lane
Re: [HACKERS] Naming convention
Alvaro Herrera [EMAIL PROTECTED] writes: What's the naming convention for new functions/variables? I've seen this_way() and ThisWay() used without visible distinction. I've used both in previously submitted and accepted patches... Does it matter? Consistency? We don't need no steenking consistency ;-) Seriously, you can find a wide range of naming conventions in the PG sources. It might be better if the range weren't so wide, but I doubt anyone really wants to engage in wholesale renaming (let alone getting into the flamewars that would ensue if we tried to pick a One True Naming Style). I'd suggest conforming to the namestyle that you see in code closely related to what you are doing, or at least some namestyle you can find precedent for somewhere in the backend. Beyond that, no one will question you. My own two cents: pay more attention to the semantic content of your names, and not so much to how you capitalize 'em. FooBar() is a useless name no matter how beautifully you present it. regards, tom lane
[HACKERS] pg_filedump
Hello hackers, I'm trying to get something from pg_filedump. However, the version published in sources.redhat.com/rhdb doesn't grok a lot of changes in current CVS. I changed all those and made it compile... but it looks like that's only the easy part. I get bogus values everywhere (block sizes, item numbers, etc). Does somebody know whether it's maintained for current versions? -- Alvaro Herrera (alvherre[a]dcc.uchile.cl) Cuando miro a alguien, mas me atrae como cambia que quien es (J. Binoche)
Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP
Hello! On Sat, 5 Oct 2002, Bruce Momjian wrote:

Yes, I agree with you Manfred, but more people _don't_ want it to change, and like it the way it is, so we will just keep it and add now(string).

IMHO the best way could be a GUC-default/SET session-based variable controlling the behaviour. By default the old Pg one, but people can set the standard-compliant one. Such changes were done often in the past; look at the GROUP BY behaviour changes in 6.4-6.5, the default Pg datetime representation format change, etc. I think those who are concerned about interoperability will confirm that it's much easier to add one SET per session than to replace all CURRENT_TIMESTAMP with now(blah-blah-blah). Moreover, people who need the old behaviour can easily revert to it with just one SET (in case the GUC is set to the new behaviour).

Added to TODO: * Add now(transaction|statement|clock) functionality

I have attached an SGML patch that explains the issues with CURRENT_TIMESTAMP in more detail.

-- WBR, Yury Bokhoncovich, Senior System Administrator, NOC of F1 Group. Phone: +7 (3832) 106228, ext.140, E-mail: [EMAIL PROTECTED] Unix is like a wigwam -- no Gates, no Windows, and an Apache inside.
Re: [HACKERS] [SQL] [GENERAL] CURRENT_TIMESTAMP
Yury Bokhoncovich wrote: Hello! On Sat, 5 Oct 2002, Bruce Momjian wrote:

Yes, I agree with you Manfred, but more people _don't_ want it to change, and like it the way it is, so we will just keep it and add now(string).

IMHO the best way could be a GUC-default/SET session-based variable controlling the behaviour. By default the old Pg one, but people can set the standard-compliant one. Such changes were done often in the past; look at the GROUP BY behaviour changes in 6.4-6.5, the default Pg datetime representation format change, etc. I think those who are concerned about interoperability will confirm that it's much easier to add one SET per session than to replace all CURRENT_TIMESTAMP with now(blah-blah-blah). Moreover, people who need the old behaviour can easily revert to it with just one SET (in case the GUC is set to the new behaviour).

Let's see if people want the more standards-compliant behavior before adding a GUC, no?

-- Bruce Momjian | http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073