Re: [PATCHES] Dead Space Map version 3 (simplified)
ITAGAKI Takahiro wrote:
> Attached is an updated DSM patch. I've left the core function of DSM
> only and dropped other complicated features in this release.

We discussed it a long time ago already, but I really wish the DSM didn't need a fixed-size shared memory area. It's one more thing the DBA needs to tune manually. It also means we need to have an algorithm for deciding what to keep in the DSM and what to leave out. And I don't see a good way to extend the current approach to implement the index-only scans that we've been talking about, and the same goes for recovery. :(

The way you update the DSM is quite interesting. When a page is dirtied, the BM_DSM_DIRTY flag is set in the buffer descriptor. The corresponding bit in the DSM is set lazily in FlushBuffer whenever BM_DSM_DIRTY is set. That's a clever way to avoid contention on updates. But does it work for tables that have a small hot part that's updated very frequently? That's exactly the scenario where the DSM is most useful. Hot pages stay in the buffer cache because they're frequently accessed, which means that FlushBuffer isn't getting called for them and the bits in the DSM aren't getting set until checkpoint. This could lead to unnecessary bloating of the hot part. A straightforward fix would be to scan the buffer cache for buffers marked with BM_DSM_DIRTY and update the DSM before starting the vacuum scan.

It might not be a problem in practice, but it bothers me that the DSM isn't 100% accurate. You can end up with a page that has dead tuples on it marked as non-dirty in the DSM, at least when a page is vacuumed while there are some RECENTLY_DEAD tuples on it that become dead later on. There might be other scenarios as well.

If I'm reading the code correctly, the DSM makes no attempt to keep the chunks ordered by block number. If that's the case, vacuum needs to be modified, because it currently relies on the fact that blocks are scanned, and the dead tuple list therefore populated, in order.
--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] [HACKERS] parser dilemma
Zoltan Boszormenyi wrote:
> On the other hand, marking GENERATED as %right solves this issue.
> I hope it's an acceptable solution.

If anything I should have thought it would be marked %nonassoc.

cheers

andrew
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
On Fri, 2007-04-20 at 10:16 +0200, Zeugswetter Andreas ADI SD wrote:
> Your work in this area is extremely valuable and I hope my comments are
> not discouraging.

I think it's too late in the day to make the changes suggested by yourself and Tom. They make the patch more invasive and more error-prone, and we don't have much time. This really means the patch is likely to be rejected at the 11th hour, when what we have essentially works...

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Here's only a part of the reply I should do, but as to I/O error checking, here's a list of system calls and other external function/library calls used in the pg_lesslog patch series, together with how the current patch checks each error and how the current PostgreSQL source handles similar calls:

1. No error check is done

1-1. fileno()
fileno() is called against stdin and stdout from pg_compresslog.c and pg_decompresslog.c. They are intended to be invoked from a shell, so stdin and stdout are both available. A fileno() error occurs only if the invoker of pg_compresslog or pg_decompresslog closes stdin and/or stdout before executing them. I found similar fileno() usage in pg_dump/pg_backup_archive.c and postmaster/syslogger.c. I don't think this is an issue.

1-2. fflush()
fflush() is called against stdout within a debug routine, debug.c. Such usage can also be found in bin/initdb.c, bin/scripts/createdb.c, bin/psql/common.c and more. I don't think this is an issue either.

1-3. printf() and fprintf()
It is common practice not to check these for errors. We can find such calls in much of the existing source code.

1-4. strerror()
It is checked that the system call returned an error before calling strerror(). Similar code can be found elsewhere in the PostgreSQL source too.

2. Error check is done

All the following function calls are associated with a return value check: open(), close(), fstat(), read(), write()

3. Functions that do not return errors

The following functions will not return errors, so no error check is needed: exit(), memcpy(), memset(), strcmp()

I hope this helps.

Regards;

Tom Lane wrote:
> "Simon Riggs" <[EMAIL PROTECTED]> writes:
>> Writing lots of additional code simply to remove a parameter that
>> *might* be mis-interpreted doesn't sound useful to me, especially when
>> bugs may leak in that way. My take is that this is simple and useful
>> *and* we have it now; other ways don't yet exist, nor will they in time
>> for 8.3.
>
> The potential for misusing the switch is only one small part of the
> argument; the larger part is that this has been done in the wrong way
> and will cost performance unnecessarily. The fact that it's ready
> now is not something that I think should drive our choices.
>
> I believe that it would be possible to make the needed core-server
> changes in time for 8.3, and then to work on compress/decompress
> on its own time scale and publish it on pgfoundry; with the hope
> that it would be merged to contrib or core in 8.4. Frankly the
> compress/decompress code needs work anyway before it could be
> merged (eg, I noted a distinct lack of I/O error checking).
>
> regards, tom lane

--
- Koichi Suzuki
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

I agree that pg_compresslog should be aware of all the WAL records' details so that it can optimize the archive log safely. In my patch, I've examined 8.2's WAL records to make pg_compresslog/pg_decompresslog safe. I also agree that further pg_compresslog maintenance needs to examine changes in the WAL record format. Because the number of such formats will be limited, I think the amount of work will be reasonable enough.

Regards;

Simon Riggs wrote:
> On Fri, 2007-04-13 at 10:36 -0400, Tom Lane wrote:
>> "Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]> writes:
>>> But you also turn off the optimization that avoids writing regular
>>> WAL records when the info is already contained in a full-page image
>>> (increasing the uncompressed size of WAL). It was that part I
>>> questioned.
>
> I think its right to question it, certainly.
>
>> That's what bothers me about this patch, too. It will be increasing
>> the cost of writing WAL (more data -> more CRC computation and more
>> I/O, not to mention more contention for the WAL locks) which
>> translates directly to a server slowdown.
>
> I don't really understand this concern. Koichi-san has included a
> parameter setting that would prevent any change at all in the way WAL
> is written. If you don't want this slight increase in WAL, don't
> enable it. If you do enable it, you'll also presumably be compressing
> the xlog too, which works much better than gzip using less CPU. So
> overall it saves more than it costs, ISTM, and nothing at all if you
> choose not to use it.
>
>> The main arguments that I could see against Andreas' alternative are:
>>
>> 1. Some WAL record types are arranged in a way that actually would
>> not permit the reconstruction of the short form from the long form,
>> because they throw away too much data when the full-page image is
>> substituted. An example that's fresh in my mind is that the current
>> format of the btree page split WAL record discards newitemoff in
>> that case, so you couldn't identify the inserted item in the page
>> image. Now this is only saving two bytes in what's usually going to
>> be a darn large record anyway, and it complicates the code to do it,
>> so I wouldn't cry if we changed btree split to include newitemoff
>> always. But there might be some other cases where more data is
>> involved. In any case, someone would have to look through every
>> single WAL record type to determine whether reconstruction is
>> possible and fix it if not.
>>
>> 2. The compresslog utility would have to have specific knowledge
>> about every compressible WAL record type, to know how to convert it
>> to the short format. That means an ongoing maintenance commitment
>> there. I don't think this is unacceptable, simply because we need
>> only teach it about a few common record types, not everything under
>> the sun --- anything it doesn't know how to fix, just leave alone,
>> and if it's an uncommon record type it really doesn't matter. (I
>> guess that means that we don't really have to do #1 for every last
>> record type, either.)
>>
>> So I don't think either of these is a showstopper. Doing it this way
>> would certainly make the patch more acceptable, since the argument
>> that it might hurt rather than help performance in some cases would
>> go away.
>
> Yeh, its additional code paths, but it sounds like Koichi-san and
> colleagues are going to be trail blazing any bugs there and will be
> around to fix any more that emerge.
>
>>> What about disconnecting WAL LSN from physical WAL record position
>>> during replay ? Add simple short WAL records in pg_compresslog
>>> like: advance LSN by 8192 bytes.
>
>> I don't care for that, as it pretty much destroys some of the more
>> important sanity checks that xlog replay does. The page boundaries
>> need to match the records contained in them. So I think we do need
>> to have pg_decompresslog insert dummy WAL entries to fill up the
>> space saved by omitting full pages.
>
> Agreed. I don't want to start touching something that works so well.
> We've been thinking about doing this for at least 3 years now, so I
> don't see any reason to baulk at it now.
>
> I'm happy with Koichi-san's patch as-is, assuming further extensive
> testing will be carried out on it during beta.

--
- Koichi Suzuki
Re: [PATCHES] [HACKERS] parser dilemma
Zoltan Boszormenyi wrote:
> Martijn van Oosterhout wrote:
>> On Thu, Apr 19, 2007 at 11:19:40AM +0200, Zoltan Boszormenyi wrote:
>>> The problem comes from cases like
>>>
>>>     colname coltype DEFAULT 5! GENERATED ...
>>>
>>> Since b_expr allows postfix operators, it takes one more token of
>>> lookahead than we have to tell if the default expression is "5!" or
>>> "5! GENERATED ...".
>>
>> ISTM that as long as:
>>
>>     colname coltype DEFAULT (5!) GENERATED ...
>>
>> works I don't see why it would be a problem to require the
>> parentheses in this case. Postfix operators are not going to be that
>> common here I think.
>>
>> Have a nice day,
>
> You mean like this one?
>
> *** gram.y.old	2007-04-20 09:23:16.0 +0200
> --- gram.y	2007-04-20 09:25:34.0 +0200
> ***************
> *** 7550,7557 ****
>   			{ $$ = (Node *) makeA_Expr(AEXPR_OP, $2, $1, $3, @2); }
>   		| qual_Op b_expr		%prec Op
>   			{ $$ = (Node *) makeA_Expr(AEXPR_OP, $1, NULL, $2, @1); }
> ! 		| b_expr qual_Op		%prec POSTFIXOP
> ! 			{ $$ = (Node *) makeA_Expr(AEXPR_OP, $2, $1, NULL, @2); }
>   		| b_expr IS DISTINCT FROM b_expr	%prec IS
>   			{ $$ = (Node *) makeSimpleA_Expr(AEXPR_DISTINCT, "=", $1, $5, @2);
> --- 7550,7557 ----
>   			{ $$ = (Node *) makeA_Expr(AEXPR_OP, $2, $1, $3, @2); }
>   		| qual_Op b_expr		%prec Op
>   			{ $$ = (Node *) makeA_Expr(AEXPR_OP, $1, NULL, $2, @1); }
> ! 		| '(' b_expr qual_Op ')'	%prec POSTFIXOP
> ! 			{ $$ = (Node *) makeA_Expr(AEXPR_OP, $3, $2, NULL, @3); }
>   		| b_expr IS DISTINCT FROM b_expr	%prec IS
>   			{ $$ = (Node *) makeSimpleA_Expr(AEXPR_DISTINCT, "=", $1, $5, @2);
>
> This change alone brings 13 reduce/reduce conflicts.

On the other hand, marking GENERATED as %right solves this issue. I hope it's an acceptable solution.

Best regards,
Zoltán Böszörményi

--
Zoltán Böszörményi
Cybertec Geschwinde & Schönig GmbH
http://www.postgresql.at/

psql-serial-42.diff.gz
Description: Unix tar archive
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
> With DBT-2 benchmark, I've already compared the amount of WAL. The
> result was as follows:
>
> Amount of WAL after 60min. run of DBT-2 benchmark
> wal_add_optimization_info = off (default) 3.13GB

how about wal_fullpage_optimization = on (default)

> wal_add_optimization_info = on (new case) 3.17GB -> can be
> optimized to 0.31GB by pg_compresslog.
>
> So the difference will be around a couple of percent. I think this is
> a very good figure.
>
> For information,
> DB Size: 12.35GB (120WH)
> Checkpoint timeout: 60min. Checkpoint occurred only once in the run.

Unfortunately I think DBT-2 is not a good benchmark to test the disabled WAL optimization. The test should contain some larger rows (maybe some updates on large toasted values), and maybe more frequent checkpoints. Actually, the poor ratio between full pages and normal WAL content in this benchmark is strange to begin with. Tom fixed a bug recently, and it would be nice to see the new ratio.

Have you read Tom's comment on not really having to be able to reconstruct all record types from the full-page image? I think that sounded very promising (e.g. start out with only heap insert/update). Then:
- we would not need the wal optimization switch (the full-page flag would always be added, depending only on backup)
- pg_compresslog would only remove such "full page" images where it knows how to reconstruct a "normal" WAL record from them
- with time and effort pg_compresslog would be able to compress [nearly] all record types' full images (no change in backend)

> I don't think replacing LSN works fine. For full recovery to
> the current time, we need both archive log and WAL.
> Replacing LSN will make archive log LSN inconsistent with
> WAL's LSN and the recovery will not work.

WAL recovery would have had to be modified (decouple LSN from WAL position during recovery). An "archive log" would have been a valid WAL (with appropriate LSN advance records).

> Reconstruction to regular WAL is proposed as
> pg_decompresslog. We should be careful enough not to make
> redo routines confused with the dummy full page writes, as
> Simon suggested. So far, it works fine.

Yes, Tom didn't like "LSN replacing" either. I withdraw my concern regarding pg_decompresslog.

Your work in this area is extremely valuable and I hope my comments are not discouraging.

Thank you
Andreas
Re: [PATCHES] HOT Patch - Ready for review
On 4/19/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
> What's the purpose of the "HeapScanHintPagePrune" mechanism in index
> builds? I lost track of the discussion on create index, is it
> necessary for correctness?

It's not required strictly for correctness, but it helps us prune the HOT-chains while index building. During an index build, if we skip a tuple which is RECENTLY_DEAD, existing transactions cannot use the index for queries. Pruning the HOT-chains reduces the possibility of finding such tuples while building the index.

> A comment in IndexBuildHeapScan explaining why it's done would be
> nice.

I would do that.

> In any case a PG_TRY/CATCH block should be used to make sure it's
> turned off after an unsuccessful index build.

Oh, thanks. Would do that too.

I would wait for other review comments before submitting a fresh patch. I hope that's ok.

Thanks,
Pavan

--
EnterpriseDB   http://www.enterprisedb.com
[PATCHES] actualised forgotten Magnus's patch for plpgsql MOVE statement
Hello

I refreshed Magnus's patch http://archives.postgresql.org/pgsql-patches/2007-02/msg00275.php from February.

Regards
Pavel Stehule

p.s. scrollable cursors in plpgsql still need a little work. I forgot about the nonstandard (PostgreSQL extension) directions forward all, forward n, backward n. Forward all probably doesn't make sense.

*** ./doc/src/sgml/plpgsql.sgml.orig	2007-04-20 09:01:50.0 +0200
--- ./doc/src/sgml/plpgsql.sgml	2007-04-20 09:09:04.0 +0200
***************
*** 1524,1529 ****
--- 1524,1536 ----
+
+      A MOVE statement sets FOUND
+      true if is success, false if is out of table.
+
   A FOR statement sets FOUND true
   if it iterates one or more times, else false. This applies to all
   three variants of the FOR statement (integer
***************
*** 2567,2572 ****
--- 2574,2624 ----
+
+    MOVE
+
+    MOVE direction FROM cursor;
+
+    MOVE repositions a cursor without retrieving any data. MOVE works
+    exactly like the FETCH command, except it only positions the
+    cursor and does not return rows. As with SELECT
+    INTO, the special variable FOUND can
+    be checked to see whether a cursor was repositioned or not.
+
+    The direction clause can be any of the
+    variants allowed in the SQL command except the ones that can fetch
+    more than one row; namely, it can be
+    NEXT,
+    PRIOR,
+    FIRST,
+    LAST,
+    ABSOLUTE count,
+    RELATIVE count,
+    FORWARD, or
+    BACKWARD.
+    Omitting direction is the same
+    as specifying NEXT.
+    direction values that require moving
+    backward are likely to fail unless the cursor was declared or opened
+    with the SCROLL option.
+
+    Examples:
+
+    MOVE curs1;
+    MOVE LAST FROM curs3;
+    MOVE RELATIVE -2 FROM curs4;
+
   CLOSE

*** ./src/pl/plpgsql/src/gram.y.orig	2007-04-19 19:15:28.0 +0200
--- ./src/pl/plpgsql/src/gram.y	2007-04-19 19:39:36.0 +0200
***************
*** 125,131 ****
  %type stmt_assign stmt_if stmt_loop stmt_while stmt_exit
  %type stmt_return stmt_raise stmt_execsql stmt_execsql_insert
  %type stmt_dynexecute stmt_for stmt_perform stmt_getdiag
! %type stmt_open stmt_fetch stmt_close stmt_null
  %type proc_exceptions
  %type exception_sect
--- 125,131 ----
  %type stmt_assign stmt_if stmt_loop stmt_while stmt_exit
  %type stmt_return stmt_raise stmt_execsql stmt_execsql_insert
  %type stmt_dynexecute stmt_for stmt_perform stmt_getdiag
! %type stmt_open stmt_fetch stmt_move stmt_close stmt_null
  %type proc_exceptions
  %type exception_sect
***************
*** 179,184 ****
--- 179,185 ----
  %token K_IS
  %token K_LOG
  %token K_LOOP
+ %token K_MOVE
  %token K_NEXT
  %token K_NOSCROLL
  %token K_NOT
***************
*** 635,640 ****
--- 636,643 ----
  				{ $$ = $1; }
  		| stmt_fetch
  				{ $$ = $1; }
+ 		| stmt_move
+ 				{ $$ = $1; }
  		| stmt_close
  				{ $$ = $1; }
  		| stmt_null
***************
*** 1478,1483 ****
--- 1481,1499 ----
  					fetch->rec = rec;
  					fetch->row = row;
  					fetch->curvar = $4->varno;
+ 					fetch->is_move = false;
+
+ 					$$ = (PLpgSQL_stmt *)fetch;
+ 				}
+ 		;
+
+ stmt_move : K_MOVE lno opt_fetch_direction cursor_variable ';'
+ 				{
+ 					PLpgSQL_stmt_fetch *fetch = $3;
+
+ 					fetch->lineno = $2;
+ 					fetch->curvar = $4->varno;
+ 					fetch->is_move = true;

  					$$ = (PLpgSQL_stmt *)fetch;
  				}

*** ./src/pl/plpgsql/src/pl_exec.c.orig	2007-04-20 09:24:27.0 +0200
--- ./src/pl/plpgsql/src/pl_exec.c	2007-04-20 09:25:14.0 +0200
***************
*** 3112,3118 ****
  	return PLPGSQL_RC_OK;
  }

-
  /* ----------
   * exec_stmt_fetch			Fetch from a cursor into a target
   * ----------
--- 3112,3117 ----
***************
*** 3164,3208 ****
  	}

  	/* ----------
! 	 * Determine if we fetch into a record or a row
! 	 * ----------
! 	 */
! 	if (stmt->rec != NULL)
! 		rec = (PLpgSQL_rec *) (estate->datums[stmt->rec->recno]);
! 	else if (stmt->row != NULL)
! 		row = (PLpgSQL_row *) (estate->datums[stmt->row->rowno]);
! 	else
! 		elog(ERROR, "unsupported target");
!
! 	/* ----------
! 	 * Fetch 1 tuple from the cursor
  	 * ----------
  	 */
! 	SPI_scroll_cursor_fetch(portal, stmt->direction, how_many);
! 	tuptab = SPI_tuptable;
! 	n = SPI_processed;
!
! 	/* ----------
! 	 * Set the target and the global FOUND variable appropriately.
! 	 * ----------
! 	 */
! 	if (n == 0)
  	{
! 		exec_move_row(estate, rec, row, NULL, tuptab->tupdesc);
! 		exec_set_found(estate, false);
  	}
  	else
  	{
! 		exec_move_row(estate, rec