Re: [HACKERS] POSTGRES WAL
Is there any method or utility to convert the content of WAL files into a human-readable format? xlogdump (or xlog viewer) might help. Sorry, I've never used it myself. http://pgfoundry.org/projects/xlogviewer/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Dead Space Map version 2
Hello, long time no see. This topic looks interesting. I'm envious of Itagaki-san and others; I can't do what I want to do now because of other work that I don't want to do (isn't my boss seeing this?). I wish I could join the community some day and contribute to the development like the great experts here.

# I can't wait to try Itagaki-san's latest patch for load distributed checkpoint in my environment and report the result.
# But I may not have enough time...

Let me give some comments below.

From: "Heikki Linnakangas" <[EMAIL PROTECTED]>
> While I understand that 100% reliable coverage is a significantly
> stronger guarantee, I don't see any particular problems in implementing
> that. WAL logging isn't that hard.
>
> I won't insist, I'm not the one doing the programming after all.
> Anything is better than what we have now. However, I do hope that
> whatever is implemented doesn't need a complete rewrite to make it 100%
> reliable in the future.
>
> The basic wish I have is to not use a fixed size shared memory area like
> FSM for the DSM. I'd like it to use the shared buffers instead, which
> makes the memory management and tuning easier. And it also makes it
> easier to get the WAL logging right, even if it's not done for 8.3 but
> added later.

I hope for the same thing as Heikki-san. Though I'm relatively new to the PostgreSQL source code, I don't think it is very difficult (at least for the experts here) to implement a reliable space management scheme, so I proposed the following before: not a separate memory area for FSM, but treating it the same way as data files in the shared buffers. Though Tom-san is worried about performance, what would make the performance degrade greatly? Additional WAL records for updating the space management structures are written sequentially in batches. Additional dirty shared buffers are written efficiently by the kernel (at least for now), and PostgreSQL is freed from the giant LWLock for the FSM. Some performance degradation would surely result. 
However, reliability is more important, because vacuum is almost the greatest concern for real, serious users (not for hobbyists who enjoy performance). Can anybody say to users, "we are working hard, but our work may not be reliable and sometimes fails -- can you check whether our vacuuming effort failed and try this..."? And I'm afraid that increasing the number of configuration parameters is unacceptable for users; that is merely a developers' excuse. PostgreSQL already has more than 100 parameters, and some of them, such as bgwriter_*, are difficult for normal users to understand. It's best to use the shared_buffers parameter and show how to set it in the documentation. Addressing the vacuum problem correctly is very important. I hope you don't introduce new parameters for unfinished work and force users to check the manual to change the parameters in later versions, i.e. "managed_* parameters are not supported from this release. Please use shared_buffers..." Is it a "must" to release 8.3 by this summer? I think that delaying the release a bit for a correct (reliable) vacuum resolution is worthwhile.

From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
> Yes! I'm completely in favor of Itagaki-san. Separating the cache for
> FSM may produce a new configuration parameter like fsm_cache_size,
> which normal users would not desire (unless they enjoy a difficult DBMS).
> I think that integrating the treatment of the space management structure
> and the data area is good. That means, for example, implementing the "Free
> Space Table" described in section 14.2.2.1 of Jim Gray's book
> "Transaction Processing: Concepts and Techniques", though it may have
> been discussed in the PostgreSQL community long ago (really?). Of
> course, some refinements may be necessary to fit PostgreSQL's
> design, say, creating one free space table file for each data file to
> make the implementation easy. It would reduce the source code dedicated
> to FSM. 
> In addition, it would provide transactional space management. If
> I understand correctly, in the current implementation, updates to the FSM
> are lost when the server crashes, aren't they? The idea assumes that
> the FSM will be rebuilt by vacuum, because vacuum is inevitable. If
> updates to the space management area were made transactional, it might
> provide the infrastructure for a "vacuumless PostgreSQL."
Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?
From: "Magnus Hagander" <[EMAIL PROTECTED]>
> Right. Which is why you're likely to see better performance if you keep
> shared buffers smaller. There is something in dealing with it that's
> slow on win32, per reports from the field. It needs to be investigated
> further... We've had reports that it's slow with large shared_buffers, yes.

That's shocking news. I'm sad. I wonder whether the field systems you are talking about set Windows to prefer memory for programs over the filesystem cache, which is selectable from the [System] applet of the Control Panel (oh, I wonder how my machine is set in this respect... I have to check). If the filesystem cache is preferred, the following scenario may be possible:

1. PostgreSQL tries to read data from disk into the database cache.
2. The kernel tries to allocate filesystem buffers by paging out PostgreSQL's memory (possibly shared buffers).
3. PostgreSQL finds the data requested by its clients in the database cache, and tries to access it in memory.
4. But the shared buffers have been paged out, and page-ins occur.

> Are you sure you're not running this on for example
> IDE disks with write-cache that lies? Windows will write through that
> write-cache even if the disk lies, whereas most linux versions won't. At
> least that used to be the case not too long ago, but there has also been
> talk about fixing that in linux, so maybe that's done...

I'm using a PC server whose disks are all SCSI. It has no IDE disk.

> Also note that when you run pg_bench on the local machine, you take a
> much higher hit from the fact that context switching between processes
> is a lot more expensive on Windows than it is on Linux. But it shouldn't
> be big enough to explain the huge difference you had in your test.

Yes, I suspect that, too. That is why Oracle uses one multi-threaded server process on Windows, while it employs a multi-process architecture elsewhere. SQL Server is of course multi-threaded. SRA's original PostgreSQL for Windows (based on 7.x) was also multi-threaded. 
Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?
From: "Magnus Hagander" <[EMAIL PROTECTED]>
> hnetcfg.dll is a part of Windows. "Home Networking Configuration
> Manager". LPK.DLL is also a part of Windows - it's the language pack.

Thank you for the information.

> On Thu, Feb 08, 2007 at 09:50:26PM +0900, Takayuki Tsunakawa wrote:
>> When I try to start PostgreSQL 8.2.1 on Windows 2003 Server with
>> shared_buffers=1024MB, I get the following error messages in the Event
>> Log (with log_min_messages=debug5) and can't start PostgreSQL:
>
> Is this for testing, or for production? From what I've heard, you would
> normally never want that much shared memory - I've seen more reports
> that you should keep it as low as possible, really. For performance
> reasons.

For testing. I wanted to place all data in shared buffers to eliminate reads from disk while I run pgbench repeatedly (actually, most reads should come from the kernel cache anyway). Does PostgreSQL for Windows have any problem when using a large database cache, unlike the UNIX versions?

I'm excited about your current great work to enable building all of PostgreSQL with MSVC. I thought you were aiming at making PostgreSQL 64-bit on Windows in the near future (though you may not have said so on the list). I'm afraid MinGW will not extend to 64-bit (for x64 and Itanium), at least not reliably or soon, due to the difference in data models -- 'long' is still 32-bit in 64-bit applications on Windows. I thought Magnus-san was worried about this and started the activity of completely switching to MSVC.

BTW, the current PostgreSQL for Windows is very slow, isn't it? I compared the performance of PostgreSQL 8.2.x on Linux (RHEL4 for x86, kernel 2.6.x) and on Windows Server 2003. I ran 'pgbench -c32 -t500' on the same machine with the same disk layout for data files and WAL, i.e. they are stored on separate disks. 
The settings in postgresql.conf are the same, except for wal_sync_method -- it is set to open_sync on Linux and open_datasync on Windows, because those are the best for each platform. The Linux version shows 1100 tps, but the Windows version shows only 450 tps.
Re: [HACKERS] What is the motivation of include directive and
From: "Andrew Dunstan" <[EMAIL PROTECTED]>
> Meeting FHS requirements is no bad thing, though. And the ability to
> include a common configuration set in multiple instances is surely
> useful to a number of people. After all, you aren't forced to use these
> facilities - I typically don't.

Thank you, Andrew-san. What I want to know is a more concrete thing: how useful are those facilities, to what kind of users, and in what cases? Is there a reason why users in the real world actively use them? If I want to develop an add-on tool that manipulates PostgreSQL settings even when PostgreSQL is stopped, should that tool support environments where the include directive is used and the config files are placed outside the data directory? If the need for the include directive and flexible config file placement is very low or vague, I thought I could say "this tool does not support those environments." Are any facilities (functions) provided in PostgreSQL that external tool developers can use to read/write the config files?

Suppose a user created a database instance manually, with the config files as follows:

[$PGDATA/postgresql.conf]
...
max_connections=10
include '/some_dir/some.conf'
include '/other_dir/other.conf'

[/some_dir/some.conf]
...
max_connections=50

[/other_dir/other.conf]
...
max_connections=100

If an external tool wants to determine the current value of max_connections or modify the setting, what should the tool do? Should the tool parse and rewrite the files by itself?
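To illustrate what such an external tool would have to do, here is a minimal sketch in Python (not part of PostgreSQL; the parsing rules are simplified assumptions: later assignments override earlier ones, and an include is processed at the point where it appears, which matches the server's documented behavior but ignores quoting and other GUC details):

```python
import re
from pathlib import Path

def resolve_setting(conf_path, name):
    """Return the last value assigned to `name` in the given config file,
    following include directives recursively. Returns None if unset."""
    value = None
    for raw in Path(conf_path).read_text().splitlines():
        line = raw.split('#', 1)[0].strip()   # strip comments and whitespace
        if not line:
            continue
        m = re.match(r"include\s+'([^']+)'", line)
        if m:
            # Process the included file in place; its settings can be
            # overridden by later lines in this file.
            included = resolve_setting(m.group(1), name)
            if included is not None:
                value = included
            continue
        m = re.match(re.escape(name) + r"\s*=\s*'?([^']+?)'?\s*$", line)
        if m:
            value = m.group(1)
    return value
```

For the example above, the tool would report max_connections = 100, because '/other_dir/other.conf' is included last and later settings win.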
Re: [HACKERS] Idea for fixing the Windows fsync problem
From: "Tom Lane" <[EMAIL PROTECTED]>
> It's still not 100% bulletproof, because it's possible that some other
> backend is holding an open file in the database as a consequence of
> having had to dump some shared buffer for itself, but that should be
> pretty darn rare if the bgwriter is getting its job done.

I've understood that you are talking about the case where backends have to evict dirty buffers containing data of a database they are not connected to. The problem is that the backends don't close the data file after writing the dirty buffers. Then, how about making the backends close the data files? Concretely, in FlushBuffer() in src/backend/storage/buffer/bufmgr.c, call smgrclose() after smgrwrite(), like this:

--
if (the reln passed to FlushBuffer() was NULL &&
    reln->smgr_rnode.dbNode != my database's OID (where is this stored?))
    smgrclose(reln);
--

Or, to make the intention clearer, it may be better to add calls to smgropen() and smgrclose() in succession after FlushBuffer() in BufferAlloc().

BTW, fsync() is causing trouble here in addition to the load-distributed checkpoint that Itagaki-san has been addressing, isn't it? If fsync were eliminated by using O_SYNC as commercial databases do, Tom-san wouldn't have had to make efforts to solve this problem.
Re: [HACKERS] Idea for fixing the Windows fsync problem
Hello, Stefan-san

> tom is talking about the postgresql distributed buildfarm:
> http://buildfarm.postgresql.org/cgi-bin/show_status.pl

Thank you for telling me. This is a great system, isn't it?

- Original Message -
From: "Stefan Kaltenbrunner" <[EMAIL PROTECTED]>
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
Cc: "Magnus Hagander" <[EMAIL PROTECTED]>; ; "Tom Lane" <[EMAIL PROTECTED]>
Sent: Wednesday, January 17, 2007 4:44 PM
Subject: Re: [HACKERS] Idea for fixing the Windows fsync problem

> Takayuki Tsunakawa wrote:
>> From: "Tom Lane" <[EMAIL PROTECTED]>
>>> I wrote:
>>>> I've committed a tentative patch along these lines to HEAD. Please
>>>> test.
>>> So I come home from dinner out, and find the buildfarm all red :-(
>>>
>>> I'm not sure why I didn't see this failure in my own testing, but in
>>> hindsight it's quite obvious that if the bgwriter is to take a hard
>>> line about fsync failures, it's got to be told about DROP DATABASE
>>> not only DROP TABLE --- that is, there has to be a signaling message
>>> for "revoke fsync requests across whole database".
>>>
>>
>> Excuse me if I misunderstand English and say something strange.
>> I thought "buildfarm is red" meant the failure of regression test.
>> What kind of errors did you get in what operation (e.g. DROP INDEX)?
>> Is everyone able to see the test result freely? Sorry, I'll read
>> the developer's FAQ when I have more time.
>
> tom is talking about the postgresql distributed buildfarm:
> http://buildfarm.postgresql.org/cgi-bin/show_status.pl
>
> and right now most of the members are indeed "red" due to the fsync
> related changes.
>
> Stefan
Re: [HACKERS] Idea for fixing the Windows fsync problem
From: "Tom Lane" <[EMAIL PROTECTED]>
> I wrote:
>> I've committed a tentative patch along these lines to HEAD. Please
>> test.
>
> So I come home from dinner out, and find the buildfarm all red :-(
>
> I'm not sure why I didn't see this failure in my own testing, but in
> hindsight it's quite obvious that if the bgwriter is to take a hard
> line about fsync failures, it's got to be told about DROP DATABASE
> not only DROP TABLE --- that is, there has to be a signaling message
> for "revoke fsync requests across whole database".

Excuse me if I misunderstand English and say something strange. I thought "buildfarm is red" meant the failure of the regression test. What kind of errors did you get in what operation (e.g. DROP INDEX)? Is everyone able to see the test result freely? Sorry, I'll read the developer's FAQ when I have more time.

And I thought you meant you have a question: why doesn't bgwriter report "checkpoint request failed" when a checkpoint occurs after dropping a database? No, a checkpoint after DROP DATABASE should not report "checkpoint request failed." When dropping a database, a checkpoint is forced before removing the database files, so the bgwriter does not try to open the database files for fsync. I saw this in the following code fragment in src/backend/commands/dbcommands.c:

/*
 * On Windows, force a checkpoint so that the bgwriter doesn't hold any
 * open files, which would cause rmdir() to fail.
 */
#ifdef WIN32
RequestCheckpoint(true, false);
#endif

/*
 * Remove all tablespace subdirs belonging to the database.
 */
remove_dbtablespaces(db_id);

> I think that it should not be necessary to invent a signal for "drop
> across tablespace", though, because we don't allow DROP TABLESPACE to
> remove any tables --- you've got to drop tables and/or databases to
> clean out the tablespace, first. Anyone see a flaw in that?

I think you are right. 
> BTW: what happens on Windows if we're trying to do the equivalent
> of "rm -rf database-dir" and someone is holding open one of the files
> in the directory? Or has the directory itself open for readdir()?

And I wonder what happens if the Windows "copy" command is accessing the data files when bgwriter tries to open them for fsync, or the reverse. Would copy fail? If so, it means that online backup sometimes fails. And how about a checkpoint after ALTER TABLE/INDEX ... SET TABLESPACE? When bgwriter tries to open the table/index file, the original file no longer exists.
Re: [HACKERS] What is the motivation of include directive and
From: "Tom Lane" <[EMAIL PROTECTED]>
> "Takayuki Tsunakawa" <[EMAIL PROTECTED]> writes:
>> Still, I don't understand well why config files need to be placed
>> outside the data directory, except to conform to the FHS.
>
> The killer argument for it is that most of what is in $PGDATA should be
> excluded from your normal filesystem backup method, because you need to
> be using some database-aware mechanism for backing up the database. But
> the config files are perfectly suited for standard filesystem backup,
> and indeed will *not* be covered by, say, pg_dumpall. So putting them
> somewhere else helps in creating a coherent backup strategy.

Thank you, I've understood the reason for the placement. As I supposed, it is provided mainly for placing config files in /etc, so that the configuration can be backed up frequently along with the rest of the system: most UNIX config files are stored in /etc (as the FHS recommends), and /etc is small.
Re: [HACKERS] What is the motivation of include directive and
From: "Bruce Momjian" <[EMAIL PROTECTED]>
> Takayuki Tsunakawa wrote:
>> In section "17.1. Setting Parameters", include directive is described.
>> Why was this directive prepared? What usage is assumed? Is it for
>> GUI tools, or for placing custom parameters in other files?
>>
>> In section "17.2. File Locations", the following parameters are
>> described:
>> What demand is assumed for placing configuration files outside the
>> data directory? Is it for placing configuration files in /etc to
>> conform to the FHS (Filesystem Hierarchy Standard, as documented
>> below) to enable the backup of all configuration files on the system?
>
> It was designed for people who have multiple postmasters and want
> to centralize some of that configuration.

Thank you. But I'm still a bit unclear. Do you mean those facilities were designed for, e.g., organizations that provide a hosting service and run multiple database instances on a single large machine? The image is:

/DataDirForCustomer1/postgresql.conf
[content]
include '/etc/postgresql_common.conf'
hba_file='/etc/pg_hba.conf'
...settings specific to customer 1

/DataDirForCustomer2/postgresql.conf
[content]
include '/etc/postgresql_common.conf'
hba_file='/etc/pg_hba.conf'
...settings specific to customer 2

/etc/postgresql_common.conf
[content]
...settings common among all customers

Was this mechanism invented in response to demand from users, or from an idea of the PostgreSQL developers? Still, I don't understand well why config files need to be placed outside the data directory, except to conform to the FHS.
[HACKERS] What is the motivation of include directive and configuration files outside PGDATA
Hello,

Let me ask about the background of the configuration files. I couldn't find the relevant information in the 8.2 documentation. I'm sorry to cause you trouble.

In section "17.1. Setting Parameters", the include directive is described. Why was this directive prepared? What usage is assumed? Is it for GUI tools, or for placing custom parameters in other files?

In section "17.2. File Locations", the following parameters are described:

data_directory (string)
config_file (string)
hba_file (string)
ident_file (string)
external_pid_file (string)

What demand is assumed for placing configuration files outside the data directory? Is it for placing configuration files in /etc to conform to the FHS (Filesystem Hierarchy Standard, as documented below) to enable the backup of all configuration files on the system?

http://www.pathname.com/fhs/
Re: [HACKERS] Idea for fixing the Windows fsync problem
From: "Tom Lane" <[EMAIL PROTECTED]>
> I suggested that here
> http://archives.postgresql.org/pgsql-hackers/2007-01/msg00642.php
> but have received no feedback about it ...

I'm sorry, I missed it.

From: "Tom Lane" <[EMAIL PROTECTED]>
> Magnus Hagander <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> So: maybe the solution is to add a step to the drop sequence, namely
>>> revoking any pending fsync request, before unlink.
>
>> Perhaps we could have the bgwriter check the queue *if* it gets the
>> ENOENT/EACCESS error and then re-check the queue for drops on that file?
>
> I've committed a tentative patch along these lines to HEAD. Please
> test.

I agree with Magnus-san's suggestion, too. Though I'm willing to test, I'm not familiar with building on Windows yet and do not have enough time right now because of other work. If someone builds the new postgres.exe and gives it to me, I'll put it on my 8.2 installation and test. Or, could anyone do the following? These are the steps I took in yesterday's test:

1. Open two psql sessions. Let me call them session1 and session2.
2. On session1, execute: create table a (c int); insert into a values(1);
3. On session2, execute: select * from a;
4. On session1, execute: drop table a; checkpoint;

The checkpoint command here reported an error yesterday. If Tom-san's patch is effective, it should not fail, and no messages should be put in the event log.
Re: [HACKERS] [GENERAL] Checkpoint request failed on version 8.2.1.
From: "Magnus Hagander" <[EMAIL PROTECTED]>
> But yeah, that's probably a good idea. A quick look at the code says we
> should at least ask people who have this problem to give it a run with
> logging at DEBUG5 which should then log exactly what the errorcode was.
> Or are you seeing more places that need such logging first?

I'm sorry, but we can't get the Win32 error code to be displayed. I got the following messages:

2007-01-16 09:24:48 DEBUG: checkpoint starting
2007-01-16 09:24:48 ERROR: could not open relation 1663/10819/18296: Permission denied
2007-01-16 09:24:48 ERROR: checkpoint request failed
2007-01-16 09:24:48 HINT: Consult recent messages in the server log for details.
2007-01-16 09:24:48 STATEMENT: checkpoint;

The reason is that src/port/open.c does not use _dosmaperr(). It converts the Win32 error code to errno directly, and EACCES is converted from ERROR_ACCESS_DENIED only. Mmm, we may have to compromise as Tom-san says.

BTW, why does the bgwriter try to open and write the pages of already-dropped relations? When dropping relations, DropRelFileNodeBuffers is called to discard the dirty buffers. If the relation being dropped has already been registered in the list of files to be fsynced, isn't it possible to remove the file from the list before unlinking it, asking the bgwriter in a similar way to ForwardFsyncRequest()?

# The timestamp at the head of each message is noisy, since the event viewer already has the time info, isn't it? Besides, several PostgreSQL messages appeared as one entry of the event log, separated by a LF instead of CR LF. On Windows, CR LF should separate lines.
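To make the open.c issue concrete: a _dosmaperr()-style approach is a lookup table from Win32 error codes to errno values, rather than special-casing ERROR_ACCESS_DENIED. A rough sketch of the idea (Python used only for illustration; the real fix would be C in src/port/open.c, and the table below is a tiny assumed subset, not the actual PostgreSQL mapping):

```python
import errno

# Win32 error codes (numeric values from the Windows SDK).
ERROR_FILE_NOT_FOUND    = 2
ERROR_PATH_NOT_FOUND    = 3
ERROR_ACCESS_DENIED     = 5
ERROR_SHARING_VIOLATION = 32

# Assumed subset of a _dosmaperr()-style translation table.
_WIN32_TO_ERRNO = {
    ERROR_FILE_NOT_FOUND:    errno.ENOENT,
    ERROR_PATH_NOT_FOUND:    errno.ENOENT,
    ERROR_ACCESS_DENIED:     errno.EACCES,
    ERROR_SHARING_VIOLATION: errno.EACCES,  # "file in use" also surfaces as EACCES
}

def map_win32_error(code):
    # Conservative fallback for codes not in the table.
    return _WIN32_TO_ERRNO.get(code, errno.EINVAL)
```

With a table like this, a sharing violation (another process holding the file open, as in the "copy" scenario) would at least map to a sensible errno instead of being lost; logging the raw Win32 code alongside would still be more informative.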
Re: [HACKERS] Dynamically sizing FSM?
From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
> Tom Lane <[EMAIL PROTECTED]> wrote:
>
>> I'm of the opinion that the solution to FSM being fixed-size is to keep
>> it somewhere else, ie, on disk (possibly with some sort of cache in
>> shared memory for currently-used entries).
>
> What do you think about dynamic allocation from shared_buffers? ie, remove
> a buffer page in the shared buffer pool and use the 8kB of memory
> for another purpose. To be sure, we cannot free memory when out of
> FSM memory, but it gets rid of deciding the amount of FSM buffers.
> I think we could use the above as a "shared memory allocator".
> It is useful for the Dead Space Map, shared prepared statements, and so on.

Yes! I'm completely in favor of Itagaki-san. Separating the cache for FSM may produce a new configuration parameter like fsm_cache_size, which normal users would not desire (unless they enjoy a difficult DBMS). I think that integrating the treatment of the space management structure and the data area is good. That means, for example, implementing the "Free Space Table" described in section 14.2.2.1 of Jim Gray's book "Transaction Processing: Concepts and Techniques", though it may have been discussed in the PostgreSQL community long ago (really?). Of course, some refinements may be necessary to fit PostgreSQL's design, say, creating one free space table file for each data file to make the implementation easy. It would reduce the source code dedicated to FSM.

In addition, it would provide transactional space management. If I understand correctly, in the current implementation, updates to the FSM are lost when the server crashes, aren't they? The idea assumes that the FSM will be rebuilt by vacuum, because vacuum is inevitable. If updates to the space management area were made transactional, it might provide the infrastructure for a "vacuumless PostgreSQL."
Re: [HACKERS] Load distributed checkpoint
Happy new year.

From: "Simon Riggs" <[EMAIL PROTECTED]>
> Have you tried setting deadline scheduler on the WAL device and CFQ on
> the data device? That should allow the I/Os to move through different
> queues and prevent interference.

No, I've not tried that yet. Inaam-san told me that Linux has a few I/O schedulers, but I'm not familiar with them. I'll find information about them (how to change the scheduler settings) and try the same test.

- Original Message -
From: "Simon Riggs" <[EMAIL PROTECTED]>
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>
Cc: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>;
Sent: Thursday, December 28, 2006 7:07 AM
Subject: Re: [HACKERS] Load distributed checkpoint

> On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote:
>> Hello, Itagaki-san, all
>>
>> Sorry for my long mail. I've had trouble in sending this mail because
>> it's too long for pgsql-hackers to accept (I couldn't find how large
>> mail is accepted.) So I'm trying to send several times.
>> Please see the attachment for the content.
>
> Your results for fsync are interesting.
>
> I've noticed that a checkpoint seems to increase the activity on the WAL
> drive as well as increasing I/O wait times. That doesn't correspond to
> any real increase in WAL traffic I'm aware of.
>
> Have you tried setting deadline scheduler on the WAL device and CFQ on
> the data device? That should allow the I/Os to move through different
> queues and prevent interference.
>
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Load distributed checkpoint
From: "Bruce Momjian" <[EMAIL PROTECTED]>
> On an idle system, would someone dirty a large file, and watch the disk
> I/O to see how long it takes for the I/O to complete to disk?

I ran "dd if=/dev/zero of= bs=8k count=`expr 1048576 / 8`", that is, writing a 1GB file with 8KB write()'s. It took about 175 seconds for the kernel to flush the buffers. Below is the output of "iostat -x 5", which was started at the same time as dd. The 175 seconds correspond to the 35 rows with nonzero %util * 5 seconds of interval.

Device: rrqm/s   wrqm/s  r/s    w/s  rsec/s    wsec/s  rkB/s     wkB/s  avgrq-sz avgqu-sz    await  svctm  %util
sde       0.00   981.56 0.00  50.90    0.00   8439.28   0.00   4219.64    165.80    34.38   525.01   6.25  31.82
sde       0.00 25737.00 0.00 118.60    0.00 219820.80   0.00 109910.40   1853.46  6529.74  1587.08   8.43 100.02
sde       0.00  1912.97 0.00 127.74    0.00  16325.75   0.00   8162.87    127.80  8192.67  6502.98   7.81  99.82
sde       0.00  1728.00 0.00 117.80    0.00  14745.60   0.00   7372.80    125.17  8209.36 11227.29   8.49 100.02
sde       0.00  1536.00 0.00 103.80    0.00  13107.20   0.00   6553.60    126.27  8209.10 16729.09   9.64 100.02
sde       0.00  1344.00 0.00  91.60    0.00  11468.80   0.00   5734.40    125.21  8208.69 21481.22  10.92 100.02
sde       0.00  1532.80 0.00  98.00    0.00  13081.60   0.00   6540.80    133.49  8209.34 26583.57  10.21 100.02
sde       0.00  1632.00 0.00 107.00    0.00  13926.40   0.00   6963.20    130.15  8208.89 31662.93   9.35 100.02
sde       0.00  1536.00 0.00 106.40    0.00  13107.20   0.00   6553.60    123.19  8209.66 36443.07   9.40 100.02
sde       0.00  1442.89 0.00  95.59    0.00  12312.63   0.00   6156.31    128.81  8227.23 41446.78  10.49 100.24
sde       0.00  1532.80 0.00 103.00    0.00  13081.60   0.00   6540.80    127.01  8210.77 46606.03   9.71 100.04
sde       0.00  1440.00 0.00  92.00    0.00  12288.00   0.00   6144.00    133.57  8208.82 51421.78  10.87 100.02
sde       0.00  1344.00 0.00  91.80    0.00  11468.80   0.00   5734.40    124.93  8209.86 56524.37  10.90 100.02
sde       0.00  1539.08 0.00 101.00    0.00  13133.47   0.00   6566.73    130.03  8225.59 61477.93   9.92 100.22
sde       0.00  1436.80 0.00  95.40    0.00  12262.40   0.00   6131.20    128.54  8208.88 66566.42  10.48 100.02
sde       0.00  1344.00 0.00  92.40    0.00  11468.80   0.00   5734.40    124.12  8209.47 71466.12  10.82 100.02
sde       0.00  1102.40 0.00 101.00    0.00   9408.38   0.00   4704.19     93.15  8174.36 76538.41   9.88  99.82
sde       0.00     0.00 0.00  89.00    0.00      0.00   0.00      0.00      0.00  7855.72 80795.64  11.24 100.02
sde       0.00     0.00 0.00  91.38    0.00      0.00   0.00      0.00      0.00  7422.53 81823.89  10.97 100.22
sde       0.00     0.00 0.00  96.80    0.00      0.00   0.00      0.00      0.00  6919.07 83194.91  10.33 100.02
sde       0.00     0.00 0.00  91.78    0.00      0.00   0.00      0.00      0.00  6480.85 84657.04  10.92 100.22
sde       0.00     0.00 0.00  92.60    0.00      0.00   0.00      0.00      0.00  5997.22 84749.79  10.80 100.02
sde       0.00     0.00 0.00  91.02    0.00      0.00   0.00      0.00      0.00  5528.97 85345.07  10.97  99.82
sde       0.00     0.00 0.00 102.61    0.00      0.00   0.00      0.00      0.00  5059.61 85057.91   9.77 100.22
sde       0.00     0.00 0.00  93.20    0.00      0.00   0.00      0.00      0.00  4572.57 85284.49  10.73 100.02
sde       0.00     0.00 0.00  98.20    0.00      0.00   0.00      0.00      0.00  4109.15 86086.50  10.21 100.22
sde       0.00     0.00 0.00  91.42    0.00      0.00   0.00      0.00      0.00  3611.72 86405.24  10.92  99.82
sde       0.00     0.00 0.00 100.00    0.00      0.00   0.00      0.00      0.00  3135.62 86292.49  10.00 100.02
sde       0.00     0.00 0.00 100.40    0.00      0.00   0.00      0.00      0.00  2652.63 86609.79   9.96 100.02
sde       0.00     0.00 0.00  92.80    0.00      0.00   0.00      0.00      0.00  2153.69 86168.58  10.78 100.02
sde       0.00     0.00 0.00  88.80    0.00      0.00   0.00      0.00      0.00  1694.74 86275.58  11.26 100.02
sde       0.00     0.00 0.00  98.20    0.00      0.00   0.00      0.00      0.00  1241.41 86708.40  10.19 100.02
sde       0.00     0.00 0.00  95.20    0.00      0.00   0.00      0.00      0.00   747.52 86505.59  10.51 100.02
sde       0.00     0.00 0.00  89.20    0.00      0.00   0.00      0.00      0.00   283.44 86551.11  11.21 100.02
sde       0.00    51.20 0.00  17.20    0.00    449.60   0.00    224.80     26.14     4.30 61572.65   9.05  15.56
sde       0.00     0.00 0.00   0.00    0.00      0.00   0.00      0.00      0.00     0.00     0.00   0.00   0.00
sde       0.00     0.00 0.00   0.00    0.00      0.00   0.00      0.00      0.00     0.00     0.00   0.00   0.00
sde       0.00     0.00 0.00   0.00    0.00      0.00   0.00      0.00      0.00     0.00     0.00   0.00   0.00
sde       0.00     0.00 0.00   0.00    0.00      0.00   0.00      0.00      0.00     0.00     0.00   0.00   0.00
sde       0.00     0.00 0.00   0.00    0.00      0.00   0.00      0.00      0.00     0.00
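As a sanity check on those numbers, the sustained flush rate can be computed directly from the measurement (a rough back-of-the-envelope calculation, assuming the full 1GB was flushed over the ~175 seconds of nonzero %util):

```python
total_bytes = 1 * 1024**3    # the 1GB file written by dd
flush_seconds = 35 * 5       # ~35 iostat intervals of 5 seconds with nonzero %util
mb_per_sec = total_bytes / flush_seconds / 1024**2
print(round(mb_per_sec, 1))  # roughly 6 MB/s sustained to disk
```

That is a very low sustained rate for a SCSI disk doing sequential writes, which supports the suspicion that something (queueing behavior, RAID configuration, or the scheduler) is throttling the flush.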
Re: [HACKERS] Load distributed checkpoint
Hello, Inaam-san,

> There are four IO schedulers in Linux: anticipatory, CFQ (the default),
> deadline, and noop. For typical OLTP-type loads, deadline is generally
> recommended. If you are constrained on CPU and you have a good
> controller, then it's better to use noop.
> Deadline attempts to merge requests by maintaining two red-black trees
> in sector sort order, and it also ensures that a request is serviced in
> a given time by using a FIFO. I don't expect it to do magic, but I was
> wondering whether it may dilute the issue of fsync() elbowing out WAL
> writes.
> You can look into /sys/block//queue/scheduler to see which scheduler
> you are using.

Thank you for your information. I could only find the following files in /sys/block//queue/:

iosched
max_hw_sectors_kb
max_sectors_kb
nr_requests
read_ahead_kb

In iosched, the following files exist:

quantum (the content is "4")
queued (the content is "8")
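For reference, on kernels that do expose the scheduler file, it lists all available schedulers with the active one in brackets, e.g. "noop anticipatory deadline [cfq]". A small sketch of reading that format (the exact file layout is an assumption about kernels newer than the one above, which apparently predates the per-queue scheduler file):

```python
def active_scheduler(line):
    """Extract the active I/O scheduler from a /sys/block/<dev>/queue/scheduler
    line such as 'noop anticipatory deadline [cfq]' -- the bracketed entry
    is the one currently in effect."""
    start = line.index('[') + 1
    return line[start:line.index(']')]
```

Switching at runtime would then be something like `echo deadline > /sys/block/sde/queue/scheduler` (run as root; the device name sde here matches the iostat output above).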
Re: [HACKERS] Load distributed checkpoint
> (3) is very strange. Your machine seems to be too restricted > by WAL so that other factors cannot be measured properly. Right... It takes as long as 15 seconds to fsync a 1GB file. It's strange. This is a borrowed PC server, so the disk may be RAID 5? However, the WAL disk and DB disks show the same throughput. I'll investigate; I may have to find another machine. - Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-) Oh, Windows. Maybe the fsync() problem Itagaki-san pointed out does not exist. BTW, your env is showing an attractive result, isn't it? - Original Message - From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]> To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]> Cc: Sent: Friday, December 22, 2006 6:09 PM Subject: Re: [HACKERS] Load distributed checkpoint "Takayuki Tsunakawa" <[EMAIL PROTECTED]> wrote: > (1) Default case(this is show again for comparison and reminder) > 235 80 226 77 240 > (2) Default + WAL 1MB case > 302 328 82 330 85 > (3) Default + wal_sync_method=open_sync case > 162 67 176 67 164 > (4) (2)+(3) case > 322 350 85 321 84 > (5) (4) + /proc/sys/vm/dirty* tuning > 308 349 84 349 84 (3) is very strange. Your machine seems to be too restricted by WAL so that other factors cannot be measured properly. I'll send results on my machine. - Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-) - shared_buffers=1GB - wal_sync_method = open_datasync - wal_buffers = 1MB - checkpoint_segments = 16 - checkpoint_timeout = 5min I repeated "pgbench -c16 -t500 -s50" and picked up results around checkpoints. [HEAD] ... 560.8 373.5 <- checkpoint is here 570.8 ... [with patch] ... 562.0 528.4 <- checkpoint (fsync) is here 547.0 ... Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Load distributed checkpoint
From: "Greg Smith" <[EMAIL PROTECTED]> > This is actually a question I'd been meaning to throw out myself to this > list. How hard would it be to add an internal counter to the buffer > management scheme that kept track of the current number of dirty pages? > I've been looking at the bufmgr code lately trying to figure out how to > insert one as part of building an auto-tuning bgwriter, but it's unclear > to me how I'd lock such a resource properly and scalably. I have a > feeling I'd be inserting a single-process locking bottleneck into that > code with any of the naive implementations I considered. To put it in an extreme way, how about making bgwriter count the dirty buffers by periodically scanning all the buffers? Do you know the book "Principles of Transaction Processing"? Jim Gray was one of the reviewers of this book. http://www.amazon.com/gp/aa.html?HMAC=&CartId=&Operation=ItemLookup&&ItemId=1558604154&ResponseGroup=Request,Large,Variations&bStyle=aaz.jpg&MerchantId=All&isdetail=true&bsi=Books&logo=foo&Marketplace=us&AssociateTag=pocketpc In chapter 8, the author describes the fuzzy checkpoint combined with the two-checkpoint approach. In his explanation, the recovery manager (which would be bgwriter in PostgreSQL) scans the buffers and records the list of dirty buffers at each checkpoint. This won't need any locking in PostgreSQL if I understand correctly. Then, the recovery manager performs the next checkpoint after writing those dirty buffers. In the two-checkpoint approach, crash recovery starts redoing from the second-to-last checkpoint. The two-checkpoint approach is described in Jim Gray's book, too. But neither book describes how the recovery manager tunes the speed of writing. > slightly different from the proposals here. What if all the database page > writes (background writer, buffer eviction, or checkpoint scan) were > counted and periodic fsync requests sent to the bgwriter based on that?
> For example, when I know I have a battery-backed caching controller that > will buffer 64MB worth of data for me, if I forced a fsync after every > 6000 8K writes, no single fsync would get stuck waiting for the disk to > write for longer than I'd like. That seems interesting. > You can do sync > writes with perfectly good performance on systems with a good > battery-backed cache, but I think you'll get creamed in comparisons > against MySQL on IDE disks if you start walking down that path; since > right now a fair comparison with similar logging behavior is an even match > there, that's a step backwards. I wonder what characteristics SATA disks have compared to IDE. Recent PCs are equipped with SATA disks, aren't they? How do you think your approach compares to MySQL on IDE disks? > Also on the topic of sync writes to the database proper: wouldn't using > O_DIRECT for those be potentially counter-productive? I was under the > impression that one of the behaviors counted on by Postgres was that data > evicted from its buffer cache, eventually intended for writing to disk, > was still kept around for a bit in the OS buffer cache. A subsequent read > because the data was needed again might find the data already in the OS > buffer, therefore avoiding an actual disk read; that substantially reduces > the typical penalty for the database engine making a bad choice on what to > evict. I fear a move to direct writes would put more pressure on the LRU > implementation to be very smart, and that's code that you really don't > want to be more complicated. I'm worried about this, too. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Load distributed checkpoint
From: Inaam Rana > Which IO scheduler (elevator) are you using? Elevator? Sorry, I'm not familiar with the kernel implementation, so I don't know what it is. My Linux distribution is Red Hat Enterprise Linux 4.0 for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp. I probably haven't changed any kernel settings, except for IPC settings to run PostgreSQL.
Re: [HACKERS] Load distributed checkpoint
From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]> > (5) (4) + /proc/sys/vm/dirty* tuning > dirty_background_ratio is changed from 10 to 1, and dirty_ratio is > changed from 40 to 4. > > 308 349 84 349 84 Sorry, I forgot to include the result when using Itagaki-san's patch. The patch showed the following tps for case (5). 323 350 340 59 225 The best response time was 4 msec, and the worst one was 16 seconds. - Original Message - From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]> To: "ITAGAKI Takahiro" <[EMAIL PROTECTED]> Cc: Sent: Friday, December 22, 2006 3:20 PM Subject: Re: [HACKERS] Load distributed checkpoint > Hello, Itagaki-san, > > Thank you for an interesting piece of information. > > From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]> >> If you use linux, try the following settings: >> 1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio. >> 2. Increase wal_buffers to reduce WAL flushing. >> 3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync(). >> 4. Separate data and WAL files into different partitions or disks. >> >> I suppose 1 is important for you, because kernel will not write > dirty >> buffers until 10% of buffers become dirty in default settings. >> You have large memory (8GB), but small data set (800MB). So kernel >> almost never writes buffers not in checkpoints. Accumulated dirty > buffers >> are written in a burst in fsync(). > > I'll show the results of this tuning to share information with people > who don't have experience of this kind. > The numbers shown below are the tps when running "pgbench -c16 -t100 > postgres" five times in succession. > > (1) Default case (this is shown again for comparison and reminder) > The bgwriter_* and checkpoint_* are set to those defaults. > wal_buffers and wal_sync_method are also set to those defaults (64kB > and fdatasync respectively.) > > 235 80 226 77 240 > > > (2) Default + WAL 1MB case > The configuration is the same as case (1) except that wal_buffers is > set to 1024kB.
> > 302 328 82 330 85 > > This is better improvement than I expected. > > > (3) Default + wal_sync_method=open_sync case > The configuration is the same as case (1) except that wal_sync_method > is set to open_sync. > > 162 67 176 67 164 > > Too bad compared to case (2). Do you know the reason? > > > (4) (2)+(3) case > > 322 350 85 321 84 > > This is good, too. > > > (5) (4) + /proc/sys/vm/dirty* tuning > dirty_background_ratio is changed from 10 to 1, and dirty_ratio is > changed from 40 to 4. > > 308 349 84 349 84 > > The tuning of kernel cache doesn't appear to bring performance > improvement in my env. The kernel still waits too long before it > starts flushing dirty buffers because the cache is large? If so, > increasingly available RAM may cause trouble more frequently in the > near future. Do the dirty_*_ratio accept values less than 1? > > BTW, in case (1), the best response time of a transaction was 6 > milliseconds. On the other hand, the worst response time was 13 > seconds. > > >> We would be happy if we would be free from a difficult combination >> of tuning. If you have *idea for improvements*, please suggest it. >> I think we've already understood *problem itself*. > > I agree with you. Let's make the ideas more concrete, doing some > experimentations. > > > > > ---(end of broadcast)--- > TIP 7: You can help support the PostgreSQL project by donating at > >http://www.postgresql.org/about/donate > ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Load distributed checkpoint
Hello, Itagaki-san, Thank you for an interesting piece of information. From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]> > If you use linux, try the following settings: > 1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio. > 2. Increase wal_buffers to reduce WAL flushing. > 3. Set wal_sync_method to open_sync; O_SYNC is faster than fsync(). > 4. Separate data and WAL files into different partitions or disks. > > I suppose 1 is important for you, because kernel will not write dirty > buffers until 10% of buffers become dirty in default settings. > You have large memory (8GB), but small data set (800MB). So kernel > almost never writes buffers not in checkpoints. Accumulated dirty buffers > are written in a burst in fsync(). I'll show the results of this tuning to share information with people who don't have experience of this kind. The numbers shown below are the tps when running "pgbench -c16 -t100 postgres" five times in succession. (1) Default case (this is shown again for comparison and reminder) The bgwriter_* and checkpoint_* are set to those defaults. wal_buffers and wal_sync_method are also set to those defaults (64kB and fdatasync respectively.) 235 80 226 77 240 (2) Default + WAL 1MB case The configuration is the same as case (1) except that wal_buffers is set to 1024kB. 302 328 82 330 85 This is a better improvement than I expected. (3) Default + wal_sync_method=open_sync case The configuration is the same as case (1) except that wal_sync_method is set to open_sync. 162 67 176 67 164 Too bad compared to case (2). Do you know the reason? (4) (2)+(3) case 322 350 85 321 84 This is good, too. (5) (4) + /proc/sys/vm/dirty* tuning dirty_background_ratio is changed from 10 to 1, and dirty_ratio is changed from 40 to 4. 308 349 84 349 84 The tuning of kernel cache doesn't appear to bring performance improvement in my env. The kernel still waits too long before it starts flushing dirty buffers because the cache is large?
If so, increasingly available RAM may cause trouble more frequently in the near future. Do the dirty_*_ratio parameters accept values less than 1? BTW, in case (1), the best response time of a transaction was 6 milliseconds. On the other hand, the worst response time was 13 seconds. > We would be happy if we would be free from a difficult combination > of tuning. If you have *idea for improvements*, please suggest it. > I think we've already understood *problem itself*. I agree with you. Let's make the ideas more concrete, doing some experiments. ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] [PATCHES] Load distributed checkpoint patch
Hello, Mr. Grittner, From: "Kevin Grittner" <[EMAIL PROTECTED]> > We have 3,000 "directly connected" users, various business partner > interfaces, and public web entry doing OLTP in 72 databases distributed > around the state, with real-time replication to central databases which > are considered derived copies. What a big system you have. > If all the pages modified on the central > databases were held in buffers or cache until after peak hours, query > performance would suffer -- assuming it would all even fit in cache. We > must have a way for dirty pages to be written under load while > responding to hundreds of thousands of queries per hour without > disturbing "freezes" during checkpoints. I agree with you. My words were not good. I think it is necessary to always advance checkpoints even under heavy load, while taking care of OLTP transactions. > I raise this only to be sure that such environments are considered with > these changes, not to discourage improvements in the checkpoint > techniques. We have effectively eliminated checkpoint problems in our > environment with a combination of battery backed controller cache and > aggressive background writer configuration. When you have a patch which > seems to help those who still have problems, I'll try to get time > approved to run a transaction replication stream onto one of our servers > (in "catch up mode") while we do a web "stress test" by playing back > requests from our production log. That should indicate how the patch > will affect us. Thank you very much for your kind offer. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Load distributed checkpoint
- Original Message - From: "Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]> To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>; "ITAGAKI Takahiro" <[EMAIL PROTECTED]> > > Yes, I used half the size of RAM as the shared buffers, which is > > reasonable. And I cached all the data. > For pg, half RAM for shared_buffers is too much. The ratio is good for > other db software, that does not use the OS cache. What percentage of RAM is recommended for shared buffers in general? 40%? 30%? Or, is the general recommendation like "According to the amount of your data, this much RAM should be left for the kernel cache. But that's the story on Linux. It may be different for other OSes."? Hmm, if it is so, it sounds hard for system designers/administrators to judge. - Original Message - From: "Zeugswetter Andreas ADI SD" <[EMAIL PROTECTED]> To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]>; "ITAGAKI Takahiro" <[EMAIL PROTECTED]> Cc: Sent: Thursday, December 21, 2006 11:04 PM Subject: RE: [HACKERS] Load distributed checkpoint > > You were running the test on the very memory-dependent machine. > >> shared_buffers = 4GB / The scaling factor is 50, 800MB of data. > > That would be why the patch did not work. I tested it with DBT-2, 10GB of > > data and 2GB of memory. Storage is always the main part of performance here, > > even not in checkpoints. > > Yes, I used half the size of RAM as the shared buffers, which is > reasonable. And I cached all the data. For pg, half RAM for shared_buffers is too much. The ratio is good for other db software, that does not use the OS cache. Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Load distributed checkpoint
To: "Takayuki Tsunakawa" <[EMAIL PROTECTED]> Cc: Sent: Thursday, December 21, 2006 6:46 PM Subject: Re: [HACKERS] Load distributed checkpoint > From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]> > "Takayuki Tsunakawa" <[EMAIL PROTECTED]> wrote: >> Oh, really, what an evil fsync is! Yes, I sometimes saw a backend >> waiting for lseek() to complete when it committed. But why does the >> backend which is syncing WAL/pg_control have to wait for syncing the >> data file? They are, not to mention, different files, and WAL and >> data files are stored on separate disks. > > Backends call lseek() in planning, so they have to wait for fsync() to > the table that they will access. Even if all of the data in the file is in > the cache, lseek() conflicts with fsync(). You can see a lot of backends > waiting in the planning phase during checkpoints, not the executing phase. I see. I found one backend like the following, but only one out of 16 backends in my case. Most of the others are waiting to acquire the WALWRITE lock. #0 0x003a629c6902 in __lseek_nocancel () from /lib64/tls/libc.so.6 #1 0x0056789f in FileSeek () #2 0x00574053 in mdnblocks () #3 0x00574f4a in smgrnblocks () #4 0x005489e8 in estimate_rel_size () #5 0x00548bee in get_relation_info () #6 0x0054aa3d in build_simple_rel () #7 0x00539c6b in add_base_rels_to_query () #8 0x0053b955 in query_planner () #9 0x0053c1c9 in grouping_planner () #10 0x0053d3b4 in subquery_planner () #11 0x0053d5b3 in planner () #12 0x005778fc in pg_plan_query () #13 0x0057798c in pg_plan_queries () #14 0x00577c53 in exec_simple_query () ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Load distributed checkpoint
From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]> > You were running the test on a very memory-dependent machine. >> shared_buffers = 4GB / The scaling factor is 50, 800MB of data. > That would be why the patch did not work. I tested it with DBT-2, 10GB of > data and 2GB of memory. Storage is always the main part of performance here, > even not in checkpoints. Yes, I used half the size of RAM as the shared buffers, which is reasonable. And I cached all the data. The effect of fsync() is a heavier offence, isn't it? System administrators would say "I have enough memory. The data hasn't exhausted the DB cache yet. But the users complain to me about the response. Why? What should I do? What? Checkpoint?? Why doesn't PostgreSQL take care of frontend users?" BTW, is DBT-2 an OLTP benchmark which randomly accesses some parts of the data, or a batch application which accesses all data? I'm not familiar with it. I know that IPA opens it to the public. > If you use Linux, it has very unpleasant behavior in fsync(): it locks all > metadata of the file being fsync-ed. We have to wait for the completion of > fsync when we do read(), write(), and even lseek(). > Almost all of your data is in the accounts table and it was stored in a single > file. All transactions must wait for fsync to the single largest file, > so you saw the bottleneck was in the fsync. Oh, really, what an evil fsync is! Yes, I sometimes saw a backend waiting for lseek() to complete when it committed. But why does the backend which is syncing WAL/pg_control have to wait for syncing the data file? They are, not to mention, different files, and WAL and data files are stored on separate disks. >> [Conclusion] >> I believe that the problem cannot be solved in a real sense by >> avoiding fsync/fdatasync(). > > I think so, too. However, I assume we can resolve a part of the > checkpoint spikes with smoothing of write() alone. First, what's the goal (if possible, numerically)?
Have you explained to community members why the patch would help many people? At least, I haven't heard an explanation of how seriously bad fsync() can be, and we shouldn't close our eyes to what fsync() does. By the way, what good results did you get with DBT-2? If you don't mind, can you show us? > BTW, can we use the same way to fsync? We call fsync()s to all modified > files without rest in mdsync(), but it's not difficult at all to insert > sleeps between fsync()s. Do you think it helps us? One of issues is that > we have to sleep in file unit, which is maybe rough granularity. No, it definitely won't help us. There is no reason why it would help. It might help in some limited environments, though; how can we characterize such environments? Can we say "our approach helps our environments, but it won't help you. The kernel VM settings may help you. Good luck!"? We have to consider this seriously. I think it's time to face the problem, and we should follow the approaches of experts like Jim Gray and the DBMS vendors, unless we have a new clever idea like theirs. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Load distributed checkpoint
On 12/20/06, Takayuki Tsunakawa <[EMAIL PROTECTED]> wrote: > > [Conclusion] > > I believe that the problem cannot be solved in a real sense by > > avoiding fsync/fdatasync(). We can't ignore what commercial databases > > have done so far. The kernel does as much as he likes when PostgreSQL > > requests him to fsync(). From: Inaam Rana > I am new to the community and am very interested in the tests that you have done. I am also working on resolving the sudden IO spikes at checkpoint time. I agree with you that fsync() is the core issue here. Thank you for understanding my bad English correctly. Yes, what I've been insisting is that it is necessary to avoid fsync()/fdatasync() and to use O_SYNC (plus O_DIRECT if supported on the target platform) to really eliminate the big spikes. In my mail, the following sentence made a small mistake. "I believe that the problem cannot be solved in a real sense by avoiding fsync/fdatasync()." The correct sentence is: "I believe that the problem cannot be solved in a real sense without avoiding fsync/fdatasync()." > Being a new member I was wondering if someone on this list has done testing with O_DIRECT and/or O_SYNC for datafiles as that seems to be the most logical way of dealing with fsync() flood at checkpoint time. If so, I'll be very interested in the results. Could you see the mail I sent on Dec 18? Its content was so long that I zipped it and attached it to the mail. I just performed the same test, simply adding O_SYNC to open() in mdopen() and another function in md.c. I couldn't succeed in running with O_DIRECT because O_DIRECT requires the shared buffers to be aligned on a sector-size boundary. To perform the O_DIRECT test, a little more modification is necessary to the code where the shared buffers are allocated. The result was bad. But that's just a starting point. We need some improvements that commercial databases have done.
I think some approaches we should take are: (1) two-checkpoint (described in Jim Gray's textbook "Transaction Processing: Concepts and Techniques") (2) what Oracle suggests in its manual (see my previous mails) (3) write multiple contiguous buffers with one write() to decrease the count of write() calls > As mentioned in this thread that a single bgwriter with O_DIRECT will not be able to keep pace with cleaning effort causing backend writes. I think (i.e. IMHO) multiple bgwriters and/or AsyncIO with O_DIRECT can resolve this issue. I agree with you. Oracle provides a parameter called DB_WRITER_PROCESSES to set the number of database writer processes. Oracle also provides asynchronous I/O to solve the problem you are talking about. Please see section 10.3.9 of the following page: http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14211/instance_tune.htm#sthref1049 > Talking of bgwriter_* parameters I think we are missing a crucial internal counter i.e. number of dirty pages. How much work bgwriter has to do at each wakeup call should be a function of total buffers and currently dirty buffers. Relying on both these values instead of just one static NBuffers should allow bgwriter to adapt more quickly to workload changes and ensure that not much work is accumulated for checkpoint. I agree with you in the sense that the current bgwriter is a bit careless about the system load. I believe that PostgreSQL should be more gentle to OLTP transactions -- many users of the system as a result. I think the speed of WAL accumulation should also be taken into account. Let's list the problems and ideas. --
Re: [HACKERS] Load distributed checkpoint
> That implies that fsyncing a datafile blocks fsyncing the WAL. That > seems terribly unlikely (although...). What OS/Kernel/Filesystem is > this. I note a sync bug in Linux for ext3 that may have relevance. Oh, really? What bug? I've heard that ext3 reports wrong data to iostat when it performs writes (the data is correct when performing reads.) My env is: OS: RHEL 4.0 for AMD64/EM64T kernel: 2.6.9-42.ELsmp The file system is ext3. Terribly unlikely? But I've seen it in the disk utilization quite often. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [PATCHES] Load distributed checkpoint patch
From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]> > Bruce Momjian <[EMAIL PROTECTED]> wrote: >> Do you use the same delay autovacuum uses? > > What do you mean 'the same delay'? Autovacuum does VACUUM, not CHECKPOINT. > If you think cost-based-delay, I think we cannot use it here. It's hard to > estimate how much checkpoints delay by cost-based sleeping, but we should > finish asynchronous checkpoints by the start of next checkpoint. So I gave > priority to punctuality over load smoothing. I consider that smoothing the load (more meaningfully, response time) has higher priority over checkpoint punctuality in a practical sense, because the users of a system benefit from good steady response and give a good reputation to the system. If the checkpoint processing is not punctual, crash recovery would take a longer time. But which would you give higher priority, the unlikely event (= a crash of the system) or the likely event (= peak hours of the system)? I believe the latter should be regarded. The system can write dirty buffers after the peak hours pass. User experience should be taken much care of. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Load distributed checkpoint
Hello, Itagaki-san, all I have to report a sad result. Your patch didn't work. Let's consider the solution together. What you are addressing is very important for the system designers in the real world -- smoothing response time. Recall that unpatched PostgreSQL showed the following tps's in case (1) (i.e. with default bgwriter_* and checkpoint_* settings.) 235 80 226 77 240 The patched PostgreSQL showed the following tps's: 230 228 77 209 66 [disk usage] The same tendency can be seen as with the unpatched PostgreSQL. That is: When the tps is low, the %util of the disk for data files is high, and the %util of the disk for WAL is low. Why is transaction logging disturbed by cleaning and/or syncing activity? While the bgwriter is fsync()ing, it does not lock any data structures that the transactions want to access. Even though they share the same SCSI controller and bus, they are different disks. The bandwidth does not appear to be exhausted, since Ultra320 is said to have 256MB/s of bandwidth in practice. (Recall that WAL is on sdd and data files are on sde.)
[iostat output for sdd (WAL) and sde (data files): during normal operation sdd sustains roughly 70-90% utilization at about 4MB/s of WAL writes while sde stays below 10%; when the checkpoint fsync begins, sde jumps to 100% utilization with avgqu-sz in the thousands, and sdd's WAL traffic collapses to nearly zero for several intervals before recovering]

I suspect that fsync() is the criminal, as I feared. I'll show you some interesting data. I collected the stack traces of backend processes while a checkpoint was happening. [bgwriter] Oh, he is fsync()ing hard. #0 0x003a629bfbb2 in __fsync_nocancel () from /lib64/tls/libc.so.6 #1 0x005742a1 in mdsync () #2 0x005753d7 in smgrsync () #3 0x00564d65 in FlushBufferPool () ... [some backends] They are forced to wait for some lock pertaining to WAL when they try to insert a log record. #0
Re: [HACKERS] Load distributed checkpoint
Hello, Itagaki-san > I posted a patch to PATCHES. Please try out it. Really!? I've just joined pgsql-patches. When did you post it, yesterday? I couldn't find the patch in the following page, which lists the mails to pgsql-patches of this month: http://archives.postgresql.org/pgsql-patches/2006-12/index.php Could you send me the patch if it has not already been registered on the above page? I want to try the patch by all means, because smoothing response times is very important so that we can recommend PostgreSQL to system designers. I really hope your patch will be the real solution. > It does write() smoothly, but fsync() at a burst. > I suppose the result will be between (3) and (5). Hmm... I think some logical reasoning is needed to win the understanding of community members (sorry if the community members have already agreed.) Excuse me for repeating myself, but I'm afraid fsync() will be the evil in some environments. Success in one test environment is not the real success. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Question about debugging bootstrapping and catalog entries
From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]> How about mimicking postgres with a script that starts gdb to run > postgres? That is, rename the original postgres module to > postgres.org and create a shell script named postgres like this: > > #!/bin/bash > gdb postgres $* Sorry, this should be postgres.org $*. - Original Message - From: "Takayuki Tsunakawa" <[EMAIL PROTECTED]> To: "Gregory Stark" <[EMAIL PROTECTED]>; "PostgreSQL Hackers" Sent: Tuesday, December 19, 2006 9:37 AM Subject: Re: [HACKERS] Question about debugging bootstrapping and catalog entries > Hello, Mr. Stark > >> Are there any tricks people have for debugging bootstrapping > processing? I >> just need to know what index it's trying to build here and that > should be >> enough to point me in the right direction: > > As Mr. Lane says, it would be best to be able to make postgres sleep > for an arbitrary time. The direction may be either a command line > option or an environment variable (like BOOTSTRAP_SLEEP) or both. I > think the env variable is easy to handle in this case. > > How about mimicking postgres with a script that starts gdb to run > postgres? That is, rename the original postgres module to > postgres.org and create a shell script named postgres like this: > > #!/bin/bash > gdb postgres $* > > Tell me if it works. > > > > > > > ---(end of broadcast)--- > TIP 7: You can help support the PostgreSQL project by donating at > >http://www.postgresql.org/about/donate > ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Question about debugging bootstrapping and catalog entries
Hello, Mr. Stark

> Are there any tricks people have for debugging bootstrapping processing? I
> just need to know what index it's trying to build here and that should be
> enough to point me in the right direction:

As Mr. Lane says, it would be best to be able to make postgres sleep for an arbitrary time. The direction may be either a command line option or an environment variable (like BOOTSTRAP_SLEEP) or both. I think the env variable is easy to handle in this case.

How about mimicking postgres with a script that starts gdb to run postgres? That is, rename the original postgres module to postgres.org and create a shell script named postgres like this:

#!/bin/bash
gdb postgres $*

Tell me if it works.
Re: [HACKERS] Is there any limitation on the size of a mail?
Hello, Mr. Lane

> "Takayuki Tsunakawa" <[EMAIL PROTECTED]> writes:
>> But none has appeared on pgsql-hackers ML yet. What's wrong?
>> One thing I worry about is the size. The size of my mail is 42KB. It
>> has only text and no attachment. Is there any limitation on size?
>
> Yes. Consider gzip ... or if appropriate, post to pgsql-patches,
> which has a higher limit.

Thank you for your advice. I've just succeeded in sending my mail. 22KB has been accepted.
[HACKERS] Is there any limitation on the size of a mail?
Hello,

Sorry for this noisy mail. If there is a more appropriate address to send to, please tell me. I sent one mail about load-distributed checkpoint three times on the following dates, as I couldn't see the mail on the ML:

2006/12/16 17:53
2006/12/18 9:07
2006/12/18 12:10

But none has appeared on the pgsql-hackers ML yet. What's wrong? One thing I worry about is the size. The size of my mail is 42KB. It has only text and no attachment. Is there any limitation on size? I'm sending this as a test too.
Re: [HACKERS] Load distributed checkpoint
Hello,

From: "Jim C. Nasby" <[EMAIL PROTECTED]>
> Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track
> of what buffer the bgwriter_all scan is looking at, which means that
> it should remember where it was at the end of the last scan; yet it's
> initialized to 0 every time BgBufferSync is called. Is there someplace
> else that is remembering where the complete scan is leaving off when
> bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the scan
> in fact just keep re-scanning the beginning of the buffers?

No. BgBufferSync() correctly keeps track of the position to restart scanning at. buf_id1 is not initialized to 0 every time BgBufferSync() is called, because buf_id1 is a static local variable. Please see the following code. It prints:

a=0
a=1
a=2

#include <stdio.h>

void func(void)
{
    static int a = 0;

    printf("a=%d\n", a);
    a++;
}

int main(void)
{
    func();
    func();
    func();
    return 0;
}
Re: [HACKERS] Load distributed checkpoint
Mr. Riggs,

Thank you for teaching me the following. I seem to have misunderstood. I'll learn more.

From: "Simon Riggs" <[EMAIL PROTECTED]>
> On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
>> I understand that checkpoints occur during crash
>> recovery and PITR, so time for those operations would get longer.
>
> A restorepoint happens during recovery, not a checkpoint. The recovery
> is merely repeating the work of the checkpoint that occurred in the
> original WAL stream. Elongating the checkpoint would not have any effect
> on a restorepoint: we only record the checkpoint when it is complete and
> we only create a restorepoint when we see the checkpoint record.
>
> Crash recovery and PITR use almost exactly the same code path (by
> design), so there isn't anything special to say about PITR either.
>
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com

Regards,
Re: [HACKERS] Load distributed checkpoint
Hello,

From: "ITAGAKI Takahiro" <[EMAIL PROTECTED]>
> "Takayuki Tsunakawa" <[EMAIL PROTECTED]> wrote:
>> I'm afraid it is difficult for system designers to expect steady
>> throughput/response time, as long as PostgreSQL depends on the
>> flushing of file system cache. How does Oracle provide stable
>> performance?
>> Though I'm not sure, isn't it the key to use O_SYNC so that write()s
>> transfer data to disk?
>
> AFAIK, other databases use write() and fsync() in combination. They call
> fsync() immediately after they write buffers in some small batches. Otherwise,
> they use asynchronous and direct I/O options. Therefore, dirty pages in
> kernel buffers are kept low at all times.

Oracle seems to use O_SYNC. I've found a related page in the Oracle manuals:

--
http://download-west.oracle.com/docs/cd/B19306_01/win.102/b15688/ap_unix.htm

Direct Writes to Disk

On both UNIX and Windows platforms, bypassing the file system buffer cache ensures data is written to disk. On UNIX, Oracle Database uses the O_SYNC flag to bypass the file system buffer cache. The flag name depends on the UNIX port. On Windows, Oracle Database bypasses the file system buffer cache completely.
--

As Itagaki-san says, asynchronous + direct I/O provides the best performance, I believe. Oracle supplies the combination as follows:

--
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref870

Oracle Database supports kernel asynchronous I/O. This feature is disabled by default. By default, the DISK_ASYNCH_IO initialization parameter in the parameter file is set to TRUE to enable asynchronous I/O on raw devices. To enable asynchronous I/O on file system files:

1. Ensure that all Oracle Database files are located on file systems that support asynchronous I/O.
2. Set the FILESYSTEMIO_OPTIONS initialization parameter in the parameter file to one of the following values:

   Linux Distribution                  Recommended Value
   SUSE Linux Enterprise Server 9      SETALL
   Other distributions                 ASYNCH
--

I believe SQL Server also uses direct + asynchronous I/O, because Microsoft recommends in MSDN that the combination, plus appropriate multi-threading, provides the best performance.

I tested Oracle9i on RHEL 2.1. I straced DBWn (the database writer, which is like the bgwriter of PostgreSQL) while creating tables, indexes, etc. and shutting down the database server. Oracle surely uses O_SYNC as follows, but it doesn't use fsync():

24462 open("/work4/ora/tuna/users01.dbf", O_RDWR|O_SYNC|O_LARGEFILE) = 16

I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san referring to DB2?