Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
Hi,

With your tests did you try to write the hot buffers first? I.e. buffers with a high refcount, either by sorting them on refcount or at least by sweeping the buffer list in reverse?

In my understanding there's an 'impedance mismatch' between what PostgreSQL wants and what the OS offers. When it calls fsync(), PostgreSQL wants a set of buffers, selected quickly at checkpoint start time, written to disk; but the OS only offers to write all dirty buffers of the file at fsync time. Not exactly the same contract. On a loaded server with checkpoint spreading the difference can be big; worst case, the checkpoint wants 8KB written and fsync writes 1GB.

As for control, there's 150 years of math behind it, up to Maxwell himself: adding as little energy as possible, in packets injected as randomly as possible, to a control system whose actuators you can't measure doesn't make for good control. By writing the least-likely-to-be-recycled buffers to the OS first, the checkpointer may have less work to do at fsync time: hopefully they have been written by the OS background task during the spread and are not re-dirtied by other backends.

Didier
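The ordering being suggested could be sketched like this; the DirtyBuf struct and everything around it are invented for illustration (PostgreSQL's real buffer headers and checkpoint scan are nothing this simple):

```c
#include <stdlib.h>

/* Hypothetical flattened view of a dirty buffer: just an id and the
 * refcount observed at checkpoint start. */
typedef struct
{
    int buf_id;
    int refcount;
} DirtyBuf;

/* Order by descending refcount, so the hottest buffers (the ones least
 * likely to be recycled soon) are handed to the OS first. */
static int by_refcount_desc(const void *a, const void *b)
{
    const DirtyBuf *x = a;
    const DirtyBuf *y = b;

    return (y->refcount > x->refcount) - (y->refcount < x->refcount);
}

static void sort_dirty_buffers(DirtyBuf *bufs, size_t n)
{
    qsort(bufs, n, sizeof(DirtyBuf), by_refcount_desc);
}
```

Sweeping the buffer list in reverse, the cheaper alternative mentioned above, would approximate the same ordering without the sort.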
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith g...@2ndquadrant.com wrote:

On 7/20/13 4:48 AM, didier wrote:
With your tests did you try to write the hot buffers first? ie buffers with a high refcount, either by sorting them on refcount or at least sweeping the buffer list in reverse?

I never tried that version. After a few rounds of seeing that all changes I tried were just rearranging the good and bad cases, I got pretty bored with trying new changes in that same style.

by writing to the OS the less likely to be recycle buffers first it may have less work to do at fsync time, hopefully they have been written by the OS background task during the spread and are not re-dirtied by other backends.

That is the theory. In practice write caches are so large now, there is almost no pressure forcing writes to happen until the fsync calls show up. It's easily possible to enter the checkpoint fsync phase only to discover there are 4GB of dirty writes ahead of you, ones that have nothing to do with the checkpoint's I/O.

Backends are constantly pounding the write cache with new writes in situations with checkpoint spikes. The writes and fsync calls made by the checkpoint process are only a fraction of the real I/O going on. The volume of data being squeezed out by each fsync call is based on total writes to that relation since the checkpoint. That's connected to the writes to that relation happening during the checkpoint, but the checkpoint writes can easily be the minority there.

It is not a coincidence that the next feature I'm working on attempts to quantify the total writes to each 1GB relation chunk. That's the most promising path forward on the checkpoint problem I've found.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.com
Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses
Hi,

On Sat, Jul 20, 2013 at 6:28 PM, Greg Smith g...@2ndquadrant.com wrote:

That is the theory. In practice write caches are so large now, there is almost no pressure forcing writes to happen until the fsync calls show up. It's easily possible to enter the checkpoint fsync phase only to discover there are 4GB of dirty writes ahead of you, ones that have nothing to do with the checkpoint's I/O.

Isn't adding another layer of cache the usual answer? The best would be in the OS: a filesystem with a journal big enough to write a lot of blocks sequentially.

If not, and if you can spare at worst 2 bits in memory per data block, don't mind preallocated data files (assuming metadata is then stable), and have working mmap(MAP_NONBLOCK) and mincore() syscalls, you could have a checkpoint in bounded time: worst case you sequentially write the whole server RAM to a separate disk every checkpoint. Not sure I would trust such a beast with my data though :)

Didier
[HACKERS] small typo in src/backend/access/transam/xlog.c
Hi,

In void BootStrapXLOG(void):

 * to seed it other than the system clock value...)  The upper half of the
 * uint64 value is just the tv_sec part, while the lower half is the XOR
 * of tv_sec and tv_usec.  This is to ensure that we don't lose uniqueness
 * unnecessarily if "uint64" is really only 32 bits wide.  A person
 * knowing this encoding can determine the initialization time of the
 * installation, which could perhaps be useful sometimes.
 */
gettimeofday(&tv, NULL);
sysidentifier = ((uint64) tv.tv_sec) << 32;
sysidentifier |= (uint32) (tv.tv_sec | tv.tv_usec);

should be

sysidentifier |= (uint32) (tv.tv_sec ^ tv.tv_usec);

Regards
Didier
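To make the difference concrete, here is a self-contained sketch of the computation with the XOR applied (the helper name is invented; the real code operates inline on a struct timeval in BootStrapXLOG). With the buggy OR, any bit set in either tv_sec or tv_usec is set in the low half, so distinct timestamps collide far more often; XOR preserves the mixing the comment promises:

```c
#include <stdint.h>

/* Hypothetical helper mirroring the BootStrapXLOG computation,
 * parameterized so it can be exercised with fixed inputs. */
static uint64_t sysid_from(uint64_t tv_sec, uint64_t tv_usec)
{
    uint64_t sysidentifier = tv_sec << 32;           /* upper half: seconds  */
    sysidentifier |= (uint32_t) (tv_sec ^ tv_usec);  /* lower half: XOR mix  */
    return sysidentifier;
}
```

For tv_sec = 3, tv_usec = 1, the low word is 2 with XOR but would be 3 with OR.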
Re: [HACKERS] Design proposal: fsync absorb linear slider
Hi,

On Tue, Jul 23, 2013 at 5:48 AM, Greg Smith g...@2ndquadrant.com wrote:

Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative. I have a simple one in mind that captures the biggest problem I see: that the number of backend and checkpoint writes to a file are not connected at all. We know that a 1GB relation segment can take a really long time to write out. That could include up to 128 changed 8K pages, and we allow all of them to get dirty before any are forced to disk with fsync.

It was surely already discussed, but why isn't PostgreSQL writing its cache sequentially to a temporary file? With storage random speed at least five to ten times slower, it could help a lot.

Thanks
Didier
Re: [HACKERS] Design proposal: fsync absorb linear slider
Hi,

Sure, that's what the WAL does. But you still have to checkpoint eventually.

Sure — when you run pg_ctl stop. Unlike the WAL it only needs two files, of shared_buffers size.

I did bogus tests by replacing mask |= BM_PERMANENT; with mask = -1; in BufferSync(), and simulating the checkpoint with a periodic dd if=/dev/zero of=foo conv=fsync.

On saturated storage with %usage locked solid at 100% I got up to 30% speed improvement, and fsync latency down by one order of magnitude; some fsyncs were still slow, of course, if buffers were already in the OS cache. But that's the upper bound: it was done on slow storage with bad ratios — (OS cache write)/(disk sequential write) around 50, (sequential write)/(effective random write) in the 10 range — and a proper implementation would have a 'little' more work to do... (only the checkpoint task can write BM_CHECKPOINT_NEEDED buffers, keeping them dirty, and so on)

Didier
Re: [HACKERS] Design proposal: fsync absorb linear slider
Hi,

On Fri, Jul 26, 2013 at 11:42 AM, Greg Smith g...@2ndquadrant.com wrote:

On 7/25/13 6:02 PM, didier wrote:
It was surely already discussed but why isn't postgresql writing sequentially its cache in a temporary file?

If you do that, reads of the data will have to traverse that temporary file to assemble their data. You'll make every later reader pay the random I/O penalty that's being avoided right now. Checkpoints are already postponing these random writes as long as possible. You have to take care of them eventually though.

No, the log file is only used at recovery time.

In checkpoint code:
- loop over the cache and mark dirty buffers with BM_CHECKPOINT_NEEDED, as in the current code
- other workers can't write and evict these marked buffers to disk; there's a race with fsync
- checkpoint fsyncs now, or after the next step
- checkpoint loops again and saves these buffers to the log, clearing BM_CHECKPOINT_NEEDED but *not* clearing BM_DIRTY; of course many buffers will be written again, as they are when a checkpoint isn't running
- checkpoint done

During recovery you have to load the log in cache first, before applying WAL.

Didier
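A rough sketch of the two passes described above, reusing PostgreSQL's flag names but with everything else invented for illustration (the real BufferSync() works on locked buffer headers, and the sequential snapshot-log write is elided):

```c
/* Toy model of the proposed checkpoint flow over a tiny buffer pool. */
#define BM_DIRTY              (1 << 0)
#define BM_CHECKPOINT_NEEDED  (1 << 1)
#define NBUFFERS 8

static int buf_flags[NBUFFERS];

/* Pass 1: mark every dirty buffer, as BufferSync() does today. */
static int mark_dirty_buffers(void)
{
    int marked = 0;

    for (int i = 0; i < NBUFFERS; i++)
        if (buf_flags[i] & BM_DIRTY)
        {
            buf_flags[i] |= BM_CHECKPOINT_NEEDED;
            marked++;
        }
    return marked;
}

/* Pass 2: append every marked buffer sequentially to the snapshot log,
 * clearing BM_CHECKPOINT_NEEDED but - the key point - leaving BM_DIRTY
 * set, so the buffer is still written out normally later. */
static int log_marked_buffers(void)
{
    int logged = 0;

    for (int i = 0; i < NBUFFERS; i++)
        if (buf_flags[i] & BM_CHECKPOINT_NEEDED)
        {
            /* sequential write of buffer i to the snapshot file goes here */
            buf_flags[i] &= ~BM_CHECKPOINT_NEEDED;
            logged++;
        }
    return logged;
}
```

Because BM_DIRTY survives pass 2, the snapshot log never becomes the authoritative copy of the data; it is only read at recovery, as stated above.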
Re: [HACKERS] Design proposal: fsync absorb linear slider
Hi,

On Fri, Jul 26, 2013 at 3:41 PM, Greg Smith g...@2ndquadrant.com wrote:

On 7/26/13 9:14 AM, didier wrote:
During recovery you have to load the log in cache first before applying WAL.

Checkpoints exist to bound recovery time after a crash. That is their only purpose. What you're suggesting moves a lot of work into the recovery path, which will slow down how long it takes to process.

Yes, it's slower, but you're sequentially reading only one file, at most the size of your buffer cache; moreover it's constant time. Say you make a checkpoint and crash just after, with a next-to-empty WAL. Now recovery is very fast, but you have to repopulate your cache with random reads from requests. With the snapshot it's slower, but you read, sequentially again, a lot of hot cache you will need later when the db starts serving requests. Of course the worst case is a crash just before a checkpoint: most of the snapshot data is stale and will be overwritten by WAL ops. But if WAL recovery is CPU bound, loading from the snapshot may be done concurrently while replaying the WAL.

More work at recovery time means someone who uses the default of checkpoint_timeout='5 minutes', expecting that crash recovery won't take very long, will discover it does take a longer time now. They'll be forced to shrink the value to get the same recovery time as they do currently. You might need to make checkpoint_timeout 3 minutes instead, if crash recovery now has all this extra work to deal with. And when the time between checkpoints drops, it will slow the fundamental efficiency of checkpoint processing down. You will end up writing out more data in the end.

Yes, it's a trade-off: now you're paying the price at checkpoint time, every time; with the log you're paying only once, at recovery. The interval between checkpoints and recovery time are all related.
If you let any one side of the current requirements slip, it makes the rest easier to deal with. Those are all trade-offs though, not improvements. And this particular one is already an option. If you want less checkpoint I/O per capita and don't care about recovery time, you don't need a code change to get it. Just make checkpoint_timeout huge. A lot of checkpoint I/O issues go away if you only do one checkpoint per hour, because instead of random writes you're getting sequential ones to the WAL. But when you crash, expect to be down for a significant chunk of an hour, as you go back to sort out all of the work postponed before.

It's not the same: it's a snapshot, saved and loaded in constant time, unlike the WAL log.

Didier
Re: [HACKERS] Properly initialize negative/empty cache entries in relfilenodemap
Hi,

On Thu, Aug 29, 2013 at 2:35 PM, MauMau maumau...@gmail.com wrote:

Great! Could anybody find the root cause for the following memory leak problem, and if possible, fix this?
http://www.postgresql.org/message-id/214653D8DF574BFEAA6ED53E545E99E4@maumau
Heikki helped to solve this and found that pg_statistic entries are left in CacheMemoryContext, but we have no idea where and how they are created and left. This seems difficult to me.

Valgrind won't help you for this one. You hit 2 issues:
- a user can create negative cache entries in pg_statistic with SELECT, but they are unbounded (at first there was LRU aging, but it was removed in 2006)
- if there's no row in pg_statistic for a relation/column then RemoveStatistics, called by DROP ..., doesn't invalidate the cache (which should remove these negative entries)
Re: [HACKERS] Freezing without write I/O
Hi,

On Fri, Sep 20, 2013 at 5:11 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
I think we should go through the various implementations and make sure they are actual compiler barriers and then change the documented policy.

From a quick look:
* S_UNLOCK for PPC isn't a compiler barrier
* S_UNLOCK for MIPS isn't a compiler barrier
* I don't know enough about unixware (do we still support that as a platform even?) to judge
* True64 Alpha I have no clue about
* PA-RISC's tas() might not be a compiler barrier for !GCC
* PA-RISC's S_UNLOCK might not be a compiler barrier
* HP-UX !GCC might not
* IRIX 5 seems to be a compiler barrier
* SINIX - I don't care
* AIX PPC - compiler barrier
* Sun - TAS is implemented in external assembly, normal function call, compiler barrier
* Win(32|64) - compiler barrier
* Generic S_UNLOCK *NOT* necessarily a compiler barrier

Ok, so I might have been a bit too optimistic...

Greetings,

Andres Freund

--
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freezing without write I/O
Hi,

IMO it's a bug if S_UNLOCK is not a compiler barrier. Moreover, for volatile, remember:
https://www.securecoding.cert.org/confluence/display/seccode/DCL17-C.+Beware+of+miscompiled+volatile-qualified+variables
Who is double-checking compiler output? :)

regards
Didier

On Fri, Sep 20, 2013 at 5:11 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-09-20 16:47:24 +0200, Andres Freund wrote:
I think we should go through the various implementations and make sure they are actual compiler barriers and then change the documented policy.

From a quick look:
* S_UNLOCK for PPC isn't a compiler barrier
* S_UNLOCK for MIPS isn't a compiler barrier
* I don't know enough about unixware (do we still support that as a platform even?) to judge
* True64 Alpha I have no clue about
* PA-RISC's tas() might not be a compiler barrier for !GCC
* PA-RISC's S_UNLOCK might not be a compiler barrier
* HP-UX !GCC might not
* IRIX 5 seems to be a compiler barrier
* SINIX - I don't care
* AIX PPC - compiler barrier
* Sun - TAS is implemented in external assembly, normal function call, compiler barrier
* Win(32|64) - compiler barrier
* Generic S_UNLOCK *NOT* necessarily a compiler barrier

Ok, so I might have been a bit too optimistic...

Greetings,

Andres Freund

--
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
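For what it's worth, a compiler barrier on GCC-style compilers is just an empty asm with a "memory" clobber; a sketch of why S_UNLOCK needs one (names invented, and note the real ports on weakly ordered CPUs like PPC additionally need a hardware barrier, which this is not):

```c
/* Empty asm with a "memory" clobber: forbids the compiler from moving
 * memory accesses across it. It emits no instructions, so it is only a
 * compiler barrier, not a CPU one. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static volatile int lock = 1;
static int shared_data;

/* Without the barrier, the compiler is free to sink the shared_data
 * store below the lock release, exposing it outside the critical
 * section - the class of bug being discussed for several S_UNLOCK
 * ports above. */
static void s_unlock_sketch(void)
{
    shared_data = 42;       /* store inside the critical section */
    COMPILER_BARRIER();
    lock = 0;               /* release the spinlock */
}
```

The volatile on the lock alone is not enough: volatile only orders accesses to other volatiles, so the non-volatile shared_data store may still be reordered without the clobber.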
[HACKERS] trivial one-off memory leak in guc-file.l ParseConfigFile
Hi,

Fix a small memory leak in guc-file.l ParseConfigFile: AbsoluteConfigLocation() returns a strdup'd string, but it's never freed or referenced outside ParseConfigFile.

Courtesy of Valgrind and Noah Misch's MEMPOOL work.

Regards
Didier

memory_leak_in_parse_config_file.patch
Description: Binary data
Re: [HACKERS] OSX doesn't accept identical source/target for strcpy() anymore
Hi,

On Mon, Oct 28, 2013 at 7:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:

If copying takes place between objects that overlap, the behavior is undefined. Both gcc and glibc have been moving steadily in the direction of aggressively exploiting undefined behavior cases for optimization purposes. I don't know if there is yet a platform where strncpy with src == dest behaves oddly, but we'd be foolish to imagine that it's not going to happen eventually. If anything, Apple is probably doing us a service by making it obvious where we're failing to adhere to spec. However ... I still can't replicate this here, and as you say, there's about zero chance of keeping our code clean of this problem unless we can set up a buildfarm member that will catch it.

regards, tom lane

I don't have a 10.9 box for double checking, but there's a gcc command line triggering the same assert for strcpy and gcc at http://lists.gnu.org/archive/html/bug-bash/2013-07/msg00011.html.

Didier
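For reference, the portable way to stay inside the spec looks like this sketch (safe_copy is an invented helper, not something in the PostgreSQL tree): strcpy() with overlapping or identical source and destination is undefined behavior, whereas memmove() is defined for overlap, and an exact src == dst copy can simply be skipped:

```c
#include <string.h>

/* Hypothetical helper: copy a NUL-terminated string without invoking
 * undefined behavior when source and destination are the same object. */
static void safe_copy(char *dst, const char *src)
{
    if (dst != src)
        memmove(dst, src, strlen(src) + 1); /* defined even for overlap */
}
```

memmove() may cost a few cycles more than strcpy() for the non-overlapping case, but it never trips the kind of runtime check Apple added.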
Re: [HACKERS] postgresql latency bgwriter not doing its job
Hi,

On Thu, Sep 4, 2014 at 7:01 PM, Robert Haas robertmh...@gmail.com wrote:

On Thu, Sep 4, 2014 at 3:09 AM, Ants Aasma a...@cybertec.at wrote:
On Thu, Sep 4, 2014 at 12:36 AM, Andres Freund and...@2ndquadrant.com wrote:
It's imo quite clearly better to keep it allocated. For one, after postmaster started the checkpointer successfully you don't need to be worried about later failures to allocate memory if you allocate it once (unless the checkpointer FATALs out, which should be exceedingly rare - we're catching ERRORs). It's much much more likely to succeed initially. Secondly, it's not like there's really that much time where no checkpointer is running.

In principle you could do the sort with the full sized array and then compress it to a list of buffer IDs that need to be written out. This way most of the time you only need a small array and the large array is only needed for a fraction of a second.

It's not the size of the array that's the problem; it's the size of the detonation when the allocation fails.

You could use a file-backed memory array. Or, because it's only a hint, and:
- the keys are in the buffers (BufferTag), right?
- the only transition is from 'data I care about' to 'data I don't care about', if a buffer is concurrently evicted while sorting
you could use a preallocated buffer-index array and read the keys from the buffers while sorting, without memory barriers, spinlocks, whatever.
Re: [HACKERS] posix_fadvise() and pg_receivexlog
Hi,

Well, I'd like to hear someone from the field complaining that pg_receivexlog is thrashing the cache and thus reducing the performance of some other process. Or at least a synthetic test case that demonstrates that happening.

It's not with pg_receivexlog, but it's related. On a small box without a replication server connected perfs were good enough, but not so with a replication server connected: there was 1GB worth of WAL sitting in RAM, vs. next to nothing without the slave!

Setup:
- 8GB RAM
- 2GB shared_buffers (smaller has other issues)
- checkpoint_segments 40 (smaller values trigger too many xlog checkpoints)
- checkpoints spread over 10 min, writing 30 to 50% of shared buffers
- live data set fits in RAM
- constant load

On startup (1 or 2/hour) applications were running requests on cold data, which were now saturating IO. I'm not sure it's an OS bug, as the WAL was 'hotter' than the cold data. A cron task every minute running vmtouch -e to evict old WAL files from memory has solved the issue.

Regards
[HACKERS] proposal: adding a GUC for BAS_BULKREAD strategy
Hi,

Currently the value is hard-coded to NBuffers / 4, but ISTM that with bigger shared_buffers it's too much; i.e. even with a DB 10 to 20 times the memory size there are a lot of tables under this limit, and nightly batch reports are thrashing the shared buffers cache as if there's no tomorrow.

regards,
Re: [HACKERS] Failback to old master
Hi,

On Sat, Nov 15, 2014 at 5:31 PM, Maeldron T. maeld...@gmail.com wrote:

A safely shut down master (-m fast is safe) can be safely restarted as a slave to the newly promoted master. Fast shutdown shuts down all normal connections, does a shutdown checkpoint and then waits for this checkpoint to be replicated to all active streaming clients. Promoting a slave to master creates a timeline switch, which prior to version 9.3 was only possible to replicate using the archive mechanism. As of version 9.3 you don't need to configure archiving to follow timeline switches; just add a recovery.conf to the old master to start it up as a slave and it will fetch everything it needs from the new master.

I took your advice and I understood that removing the recovery.conf followed by a restart is wrong. I will not do that on my production servers. However, I can't make it work with promotion. What did I do wrong? It was 9.4beta3.

mkdir 1
mkdir 2
initdb -D 1/
edit config: change port, wal_level to hot_standby, hot_standby to on, max_wal_senders=7, wal_keep_segments=100, uncomment replication in hba.conf
pg_ctl -D 1/ start
createdb -p 5433
psql -p 5433
pg_basebackup -p 5433 -R -D 2/
mcedit 2/postgresql.conf  (change port)
chmod -R 700 1
chmod -R 700 2
pg_ctl -D 2/ start
psql -p 5433
psql -p 5434

everything works

pg_ctl -D 1/ stop
pg_ctl -D 2/ promote
psql -p 5434
cp 2/recovery.done 1/recovery.conf
mcedit 1/recovery.conf  (change port)
pg_ctl -D 1/ start

LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/3000AE0.
LOG: restarted WAL streaming at 0/300 on timeline 1
LOG: replication terminated by primary server
DETAIL: End of WAL reached on timeline 1 at 0/3000AE0.

This is what I experienced in the past when I tried with promote. The old master disconnects from the new. What am I missing?

I think you have to add recovery_target_timeline = '2' in recovery.conf, with '2' being the new primary's timeline.
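In this test setup, the old master's recovery.conf would then look something like the sketch below (the host is assumed to be localhost; pg_basebackup -R generated the conninfo, so only the port needs changing and the timeline line adding):

```ini
standby_mode = 'on'
primary_conninfo = 'host=localhost port=5434'
recovery_target_timeline = '2'   # timeline of the newly promoted master
```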
cf http://www.postgresql.org/docs/9.4/static/recovery-target-settings.html

Didier
Re: [HACKERS] WALWriter active during recovery
Hi,

On Tue, Dec 16, 2014 at 6:07 PM, Simon Riggs si...@2ndquadrant.com wrote:

On 16 December 2014 at 14:12, Heikki Linnakangas hlinnakan...@vmware.com wrote:
On 12/15/2014 08:51 PM, Simon Riggs wrote:
Currently, WALReceiver writes and fsyncs data it receives. Clearly, while we are waiting for an fsync we aren't doing any other useful work. Following patch starts WALWriter during recovery and makes it responsible for fsyncing data, allowing WALReceiver to progress other useful actions.

On many Linux systems it may not gain that much (2.6.32 and 3.2 are bad, 3.13 is better but the writes still slow the fsync). If there's an fsync in progress, WALReceiver will:
1- slow the fsync down, because its writes to the same file are grabbed by the fsync
2- stall until the end of the fsync

From 'stracing' a test program simulating this pattern (two processes, one writes to a file, the second fsyncs it):

20279 11:51:24.037108 fsync(5 <unfinished ...>
20278 11:51:24.053524 <... nanosleep resumed> NULL) = 0 <0.020281>
20278 11:51:24.053691 lseek(3, 1383612416, SEEK_SET) = 1383612416 <0.000119>
20278 11:51:24.053965 write(3, ..., 8192) = 8192 <0.000111>
20278 11:51:24.054190 nanosleep({0, 2000}, NULL) = 0 <0.020243>
20278 11:51:24.404386 lseek(3, 194772992, SEEK_SET <unfinished ...>
20279 11:51:24.754123 <... fsync resumed> ) = 0 <0.716971>
20279 11:51:24.754202 close(5 <unfinished ...>
20278 11:51:24.754232 <... lseek resumed> ) = 194772992 <0.349825>

Yes, that's a 300ms lseek...

What other useful actions can WAL receiver do while it's waiting?

It doesn't do much else than receive WAL, and fsync it to disk. So now it will only need to do one of those two things.

Regards
Didier
Re: [HACKERS] WALWriter active during recovery
Hi,

On Wed, Dec 17, 2014 at 2:39 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

didier wrote:
On many Linux systems it may not gain that much (2.6.32 and 3.2 are bad, 3.13 is better but the writes still slow the fsync). If there's an fsync in progress, WALReceiver will:
1- slow the fsync down, because its writes to the same file are grabbed by the fsync
2- stall until the end of the fsync

Is this behavior filesystem-dependent?

I don't know; I only tested ext4. Attached is the trivial code I used; there's a lot of junk in it.

Didier

/*
 * Compile with: gcc testf.c -Wall -W -O0
 */
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/fcntl.h>
#include <sys/time.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/file.h>
#include <errno.h>
#include <time.h>

static long long microseconds(void)
{
    struct timeval tv;
    long long mst;

    gettimeofday(&tv, NULL);
    mst = ((long long) tv.tv_sec) * 1000000;
    mst += tv.tv_usec;
    return mst;
}

int out = 0;

//#define FLOCK(a,b) flock(a,b)
#define FLOCK(a,b) (0)

//==================================================================
// fsync
void child(void)
{
    int fd, retval;
    long long start;

    while (1)
    {
        fd = open("/tmp/foo.txt", O_RDONLY);
        //usleep(3000);
        usleep(500);
        FLOCK(fd, LOCK_EX);
        if (out)
        {
            printf("Start sync\n");
            fflush(stdout);
            start = microseconds();
        }
        retval = fsync(fd);
        FLOCK(fd, LOCK_UN);
        if (out)
        {
            printf("Sync in %lld microseconds (%d)\n", microseconds() - start, retval);
            fflush(stdout);
        }
        close(fd);
    }
    exit(0);
}

char buf[8 * 1024];
#define f_size (2lu * 1024 * 1024 * 1024)

//==================================================================
// read
void child2(void)
{
    int fd;
    long long start;
    off_t lfsr;

    fd = open("/tmp/foo.txt", O_RDWR /*|O_CREAT | O_SYNC*/, 0644);
    srandom(2000 + time(NULL));
    while (1)
    {
        if (out)
            start = microseconds();
        lfsr = random() / sizeof(buf);
        if (pread(fd, buf, sizeof(buf), sizeof(buf) * lfsr) == -1)
        {
            perror("read");
            exit(1);
        }
        // posix_fadvise(fd, sizeof(buf)*lfsr, sizeof(buf), POSIX_FADV_DONTNEED);
        if (out)
        {
            printf("read %lu in %lld microseconds\n", lfsr * sizeof(buf), microseconds() - start);
            fflush(stdout);
        }
        usleep(500);
    }
    close(fd);
    exit(0);
}

//==================================================================
void child3(int end)
{
    int fd;
    int i;
    int j = 2;

    fd = open("/tmp/foo.txt", O_RDWR /*|O_CREAT | O_SYNC*/, 0644);
    for (i = 0; i < 131072 / j; i++)
    {
        lseek(fd, sizeof(buf) * (i * j), SEEK_SET);
        write(fd, buf, sizeof(buf));
    }
    close(fd);
    if (end)
        exit(0);
    sleep(60);
}

int main(void)
{
    int fd0 = open("/tmp/foo.txt", O_RDWR | O_CREAT /*| O_SYNC*/, 0644);
    int fd1 = open("/tmp/foo1.txt", O_RDWR | O_CREAT /*| O_SYNC*/, 0644);
    int fd;
    long long start;
    long long end = 0;
    off_t lfsr = 0;

    memset(buf, 'a', sizeof(buf));
    ftruncate(fd0, f_size);
    ftruncate(fd1, f_size);
    printf("%d\n", RAND_MAX);
    // child3(0);
    if (!fork())
    {
        child();
        exit(1);
    }
#if 0
    if (!fork())
    {
        child2();
        exit(1);
    }
    if (!fork())
    {
        child3(1);
        exit(1);
    }
#endif
    srandom(1000 + time(NULL));
    while (1)
    {
        fd = fd0;
        if (FLOCK(fd, LOCK_EX | LOCK_NB) == -1)
        {
            if (errno == EWOULDBLOCK)
                fd = fd1;
        }
        lfsr = random() / sizeof(buf);
        if (out)
            start = microseconds();
        // if (pwrite(fd, buf, sizeof(buf), sizeof(buf)*lfsr) == -1) {
        lseek(fd, sizeof(buf) * lfsr, SEEK_SET);
        if (write(fd, buf, sizeof(buf)) == -1)
        {
            perror("write");
            exit(1);
        }
        if (out)
        {
            printf("Write %lu in %lld microseconds\n", lfsr * sizeof(buf), microseconds() - start);
            fflush(stdout);
        }
        if (fd == fd0)
            FLOCK(fd, LOCK_UN);
        usleep(2);
    }
    close(fd);
    exit(0);
}
Re: [HACKERS] PATCH: pgbench - merging transaction logs
Hi,

On Sat, Mar 21, 2015 at 10:37 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:

no logging: 18672 18792 18667 18518 18613 18547
with logging: 18170 18093 18162 18273 18307 18234

So on average, that's 18634 vs. 18206, i.e. less than 2.5% difference. And with more expensive transactions (larger scale, writes, ...) the difference will be much smaller. Ok.

Great! Let us take this as a worst-case figure and try some maths. If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the collision probability would be about 1/40, and the delayed thread would be waiting for half this time on average, so the performance impact due to fprintf locking would be negligible (1/80 delay occurring in 1/40 cases = 1/3200 time added on the computed average, if I'm not mistaken).

If threads run more or less the same code with the same timing, after a while they will lockstep on synchronization primitives, and your collision probability will be very close to 1. Moreover they will write to the same cache lines for every fprintf, and this is very, very bad even without atomic operations.

Regards
Didier
Re: [HACKERS] PATCH: pgbench - merging transaction logs
Hi,

On Sat, Mar 21, 2015 at 8:42 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:

Hello Didier,

If fprintf takes p = 0.025 (1/40) of the time, then with 2 threads the collision probability would be about 1/40, and the delayed thread would be waiting for half this time on average, so the performance impact due to fprintf locking would be negligible (1/80 delay occurring in 1/40 cases = 1/3200 time added on the computed average, if I'm not mistaken).

Yes, but for a third thread (each on a physical core) it will be 1/40 + 1/40, and so on, up to roughly 40/40 for 40 cores.

If threads run more or less the same code with the same timing, after a while they will lockstep on synchronization primitives and your collision probability will be very close to 1.

I'm not sure I understand. If transaction times were really constant, then after a while the mutexes would be synchronised so as to avoid contention, i.e. the collision probability would be 0?

But they aren't constant, only close to it. It may or may not show up in this case, but I've noticed that the collision rate is often a lot higher than the probability would suggest; I'm not sure why.

Moreover they will write to the same cache lines for every fprintf and this is very very bad even without atomic operations.

We're talking of transactions that involve network messages and possibly disk IOs on the server, so cache issues within pgbench would not a priori be the main performance driver.

Sure, but:
- good measurement is hard, and adding locking in fprintf makes its timing noisier
- it's against 'good practices' for scalable code

Trivial code can show that the elapsed time for as few as four cores writing to the same cache line in a loop, without locking or synchronization, is greater than the elapsed time for running those four loops sequentially on one core. If they write to different cache lines it scales linearly.
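The cache-line claim above is easy to reproduce; below is a sketch (assuming 64-byte cache lines) of the two layouts: N threads each bumping only their own counter, first packed into one line, then padded to a line apiece. Each thread owns its slot, so there is no logical sharing at all, yet on multicore hardware the packed layout still bounces one cache line between cores and typically runs several times slower. Timing is machine-dependent, so only the setup is shown:

```c
#include <pthread.h>
#include <stdint.h>

#define NTHREADS 4
#define ITERS    1000000L

struct padded
{
    volatile uint64_t v;
    char pad[64 - sizeof(uint64_t)];    /* one counter per cache line */
};

static volatile uint64_t packed_ctr[NTHREADS];  /* all likely on one line */
static struct padded padded_ctr[NTHREADS];

static void *bump_packed(void *arg)
{
    intptr_t i = (intptr_t) arg;

    for (long n = 0; n < ITERS; n++)
        packed_ctr[i]++;                /* false sharing with neighbors */
    return NULL;
}

static void *bump_padded(void *arg)
{
    intptr_t i = (intptr_t) arg;

    for (long n = 0; n < ITERS; n++)
        padded_ctr[i].v++;              /* private cache line: scales */
    return NULL;
}

/* Run one variant across NTHREADS threads; wrap calls to this with your
 * clock of choice to compare the two layouts. */
static void run_threads(void *(*fn) (void *))
{
    pthread_t t[NTHREADS];

    for (intptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, fn, (void *) i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
}
```

A per-thread log file (or a padded per-thread buffer flushed occasionally) avoids both the fprintf lock and this false-sharing pattern.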
Regards
Didier
Re: [HACKERS] Foreign key wierdness
Dear Tom, Dave,

Tom Lane wrote:
Ah-hah, and I'll bet that the column being linked to this one by the foreign key constraint is still an integer?

It sure is; being a PostgreSQL novice (BTW: many thanks to the whole of the PG development team for such an excellent product), I got on this track by means of http://archives.postgresql.org/pgsql-sql/2001-05/msg00395.php. With two tables each containing some 20.000 entries, the fk creation time between both of them increases from ~ 1.8 secs to ~ 221 secs.

Seems odd that the cost would get *that* much worse. Maybe we need to look at whether the FK checking queries need to include explicit casts ...

Well, I reproduced the slowdown with some 20 to 30 different tables. Anyway, glad I could be of some help, albeit only by testing some (probably quite meaningless) border cases ... :)

Regards,
Didier

--
Didier Moens - RUG/VIB - Dept. Molecular Biomedical Research - Core IT
tel ++32(9)2645309  fax ++32(9)2645348  http://www.dmb.rug.ac.be

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Foreign key wierdness
Hi all,

Dave Page wrote:
If you really think the schema qualification has something to do with it, try issuing the ADD FOREIGN KEY command manually in psql, with and without schema name.

Well, to be honest I'm having a hard time believing it, but having looked at this in some depth, it's the only thing that the 2 versions of pgAdmin are doing differently. Even the PostgreSQL logs agree with that. I'm relying on Didier for test results though as I don't have a test system I can use for this at the moment. But it gives us something to try - Didier, can you create a new database please, and load the data from 2 tables. VACUUM ANALYZE, then add the foreign key in psql using the syntax 1.4.2 uses. Then drop the database, and load exactly the same data in the same way, VACUUM ANALYZE again, and create the fkey using the qualified tablename syntax.

I did some extensive testing using PostgreSQL 7.3.1 (logs and results available upon request), and the massive slowdown is NOT related to qualified tablename syntax or (lack of) VACUUM ANALYZE, but to the following change:

pgAdminII 1.4.2:

CREATE TABLE articles (
    article_id integer DEFAULT nextval('articles_article_id_key'::text) NOT NULL,
    ...

test=# \d articles
                          Table "public.articles"
   Column   |  Type   |                        Modifiers
------------+---------+----------------------------------------------------------
 article_id | integer | not null default nextval('articles_article_id_key'::text)
 ...

pgAdminII 1.4.12:

CREATE TABLE articles (
    article_id bigint DEFAULT nextval('articles_article_id_key'::text) NOT NULL,
    ...

test=# \d articles
                          Table "public.articles"
   Column   |  Type   |                        Modifiers
------------+---------+----------------------------------------------------------
 article_id | bigint  | not null default nextval('articles_article_id_key'::text)
 ...

With two tables each containing some 20.000 entries, the fk creation time between both of them increases from ~ 1.8 secs to ~ 221 secs.

Regards,
Didier

--
Didier Moens - RUG/VIB - Dept. Molecular Biomedical Research - Core IT
tel ++32(9)2645309  fax ++32(9)2645348  http://www.dmb.rug.ac.be
Re: [HACKERS] Foreign key wierdness
Dave Page wrote:
From what Tom has said in his response, I think the answer for you Didier is to remap your integer columns to int8 instead of int4 and see what happens. When I get a couple of minutes I will look at putting a "Serials as..." option in the type map.

Thanks Dave, for all of your invested time. I think the value of tools such as pgAdmin, which provide an almost bumpless cross-platform migration path, cannot be underestimated.

Regards,
Didier

--
Didier Moens - RUG/VIB - Dept. Molecular Biomedical Research - Core IT
http://www.dmb.rug.ac.be