Re: SCHED_ULE should not be the default
Are you able to go through the emails here and grab out Attilio's example for generating KTR scheduler traces?

Adrian

On 21 December 2011 16:52, Steve Kargl s...@troutmask.apl.washington.edu wrote:
On Fri, Dec 16, 2011 at 12:14:24PM +0100, Attilio Rao wrote:
2011/12/15 Steve Kargl s...@troutmask.apl.washington.edu:
On Thu, Dec 15, 2011 at 05:25:51PM +0100, Attilio Rao wrote:

I basically went through all the e-mail you just sent, identified 4 real reports which we could work on, and summarized them in the attached Excel file. I'd like George, Steve, Doug, Andrey and Mike to review the data there and add more, if they want, or make further clarifications, in particular about the presence (or absence) of Xorg in their workload.

Your summary of my observations appears correct. I have grabbed an up-to-date /usr/src, built and installed world, and built and installed a new kernel on one of the nodes in my cluster. It has [...]

It seems a perfect environment; just please make sure you built a debug-free userland (basically, setting MALLOC_PRODUCTION in jemalloc). The first thing is: can you try reproducing your case? As far as I understood it, for you it was enough to run N + small_amount of CPU-bound threads to show the performance penalty, so I'd ask you to start with dnetc or just your preferred CPU-bound workload and verify you can reproduce the issue. While it runs, please monitor thread bouncing and CPU utilization via top(1) (you don't need to be 100% precise, just get an idea, and keep an eye on things like excessive thread migration, overly sticky thread binding, and low CPU throughput). One note: if your workload needs to do I/O, please use a tmpfs or memory-backed storage, in order to eliminate I/O effects. Also, verify this doesn't happen with the 4BSD scheduler, just in case. Finally, if the problem is still in place, please recompile your kernel adding:

options KTR
options KTR_ENTRIES=262144
options KTR_COMPILE=(KTR_SCHED)
options KTR_MASK=(KTR_SCHED)

and reproduce the issue. When you are in the middle of the scheduling issue, go with:

# ktrdump -ctf ktr-ule-problem-YOURNAME.out

and send it to the mailing list along with your dmesg and the information on CPU utilization you gathered from top(1). That should cover it all, but if you have further questions, please just go ahead.

Attilio, I have placed several files at http://troutmask.apl.washington.edu/~kargl/freebsd

dmesg.txt -- dmesg for the ULE kernel
summary -- a summary that includes top(1) output of all runs
sysctl.ule.txt -- sysctl -a for the ULE kernel
ktr-ule-problem-kargl.out.gz

I performed a series of tests with both 4BSD and ULE kernels. The 4BSD and ULE kernels are identical except, of course, for the scheduler. Both witness and invariants are disabled, and malloc has been compiled without debugging. Here's what I did. On the master node in my cluster, I ran an OpenMPI code that sends N jobs off to the node with the kernel of interest. There is communication between the master and slaves to generate 16 independent chunks of data. Note, there is no disk IO. So, for example, N=4 will start 4 essentially identical, numerically intensive jobs. At the start of a run, the master node instructs each slave job to create a chunk of data. After the data is created, the slave sends it back to the master and the master sends instructions to create the next chunk of data. This communication continues until the 16 chunks have been assigned, computed, and returned to the master.
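As a concrete illustration, the master/worker pattern described above might look like the following minimal sketch. This is not the actual sasmp code; the message tags, the dummy compute loop, and the single-double "chunk" payload are assumptions made purely for illustration.

#include <mpi.h>

#define NCHUNKS  16
#define TAG_WORK 1
#define TAG_DATA 2
#define TAG_STOP 3

/* Stand-in for the real computation; purely CPU-bound, no disk IO. */
static double
compute_chunk(int id)
{
	double x = id;

	for (long i = 0; i < 100000000L; i++)
		x = x * 0.9999999 + 1.0;
	return (x);
}

int
main(int argc, char **argv)
{
	int rank, size;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	if (rank == 0) {		/* master */
		int next = 0, done = 0, id;
		double result;
		MPI_Status st;

		/* Seed every worker with one chunk id (or a stop marker). */
		for (int w = 1; w < size; w++) {
			id = (next < NCHUNKS) ? next++ : -1;
			MPI_Send(&id, 1, MPI_INT, w,
			    (id < 0) ? TAG_STOP : TAG_WORK, MPI_COMM_WORLD);
		}
		/* Collect results; reassign chunks until all 16 are done. */
		while (done < NCHUNKS) {
			MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
			    TAG_DATA, MPI_COMM_WORLD, &st);
			done++;
			id = (next < NCHUNKS) ? next++ : -1;
			MPI_Send(&id, 1, MPI_INT, st.MPI_SOURCE,
			    (id < 0) ? TAG_STOP : TAG_WORK, MPI_COMM_WORLD);
		}
	} else {			/* worker */
		int id;
		double result;
		MPI_Status st;

		for (;;) {
			MPI_Recv(&id, 1, MPI_INT, 0, MPI_ANY_TAG,
			    MPI_COMM_WORLD, &st);
			if (st.MPI_TAG == TAG_STOP)
				break;
			result = compute_chunk(id);
			MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_DATA,
			    MPI_COMM_WORLD);
		}
	}
	MPI_Finalize();
	return (0);
}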
Here is a rough measurement of the problem with ULE and numerically intensive loads. This command is executed on the master:

time mpiexec -machinefile mf3 -np N sasmp sas.in

Since time is executed on the master, only the 'real' time is of interest (the summary file includes user and sys times). This command was run 5 times for each N value, and up to 10 times for some N values with the ULE kernel. The following table records the average 'real' time; the number in (...) is the mean absolute deviation.

# N    ULE              4BSD
# ---------------------------------------
# 4    223.27 (0.502)   221.76 (0.551)
# 5    404.35 (73.82)   270.68 (0.866)
# 6    627.56 (173.0)   247.23 (1.442)
# 7    475.53 (84.07)   285.78 (1.421)
# 8    429.45 (134.9)   223.64 (1.316)

To me, these numbers demonstrate that ULE is not a good choice for an HPC workload. If you need more information, feel free to ask. If you would like access to the node, I can probably arrange that; but we can discuss that off-line.

-- Steve
Re: directory listing hangs in ufs state
On Wed, Dec 21, 2011 at 09:03:02PM +0400, Andrey Zonov wrote:
On 15.12.2011 17:01, Kostik Belousov wrote:
On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick free...@jdc.parodius.com wrote:
On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
On 14.12.2011 22:22, Jeremy Chadwick wrote:
On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:

Hi Jeremy, this is not a hardware problem; I've already checked that. I also ran fsck today and got no errors. After some more exploration of how mongodb works, I found that when the listing hangs, one of the mongodb threads is in the biowr state for a long time. It periodically calls msync(MS_SYNC), according to the ktrace output. If I remove the msync() calls from mongodb, how often will the data be synced by the OS?

-- Andrey Zonov

On 14.12.2011 2:15, Jeremy Chadwick wrote:
On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:

Have you any ideas what is going on, or how to catch the problem?

Assuming this isn't a file on the root filesystem, try booting the machine in single-user mode and using fsck -f on the filesystem in question. Can you verify there are no problems with the disk this file lives on as well (smartctl -a /dev/disk)? I'm doubting this is the problem, but thought I'd mention it. I have no real answer, I'm sorry. msync(2) indicates it's effectively deprecated (see BUGS). It looks like this is effectively an mmap-version of fsync(2).

I replaced msync(2) with fsync(2). Unfortunately, from the man pages it is not obvious that I can do this. Anyway, thanks.

Sorry, that wasn't what I was implying. Let me try to explain differently. msync(2) looks, to me, like an mmap-specific version of fsync(2). Based on the man page, it seems that with msync() you can effectively guarantee flushing of certain pages within an mmap()'d region to disk. fsync() would cause **all** buffers/internal pages to be flushed to disk. One would need to look at the mongodb code to find out what it's actually doing with msync(). That is to say, if it's doing something like this (I probably have the semantics wrong -- I've never spent much time with mmap()):

fd = open("/some/file", O_RDWR);
ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
ret = msync(ptr, 65536, MS_SYNC);
/* or alternatively, perhaps: ret = msync(ptr, 0, MS_SYNC); */

then this, to me, would be mostly equivalent to:

fd = open("/some/file", O_RDWR);
ret = fsync(fd);

Otherwise, if it's calling msync() only on an address/location within the region ptr points to, then that may be more efficient (fewer pages to flush).

They call msync() for the whole file. So there will not be any difference.

The mmap() arguments -- specifically flags (see the man page) -- also play a role here. The one that catches my attention is MAP_NOSYNC. So you may need to look at the mongodb code to figure out what its mmap() call is. One might wonder why they don't just use open() with O_SYNC. I imagine that has to do with, again, performance; possibly they don't want all I/O synchronous, and would rather flush certain pages in the mmap'd region to disk as needed. I see the legitimacy in that approach (vs. just using O_SYNC). There's really no easy way for me to tell you which is more efficient, better, blah blah without spending a lot of time with a benchmarking program that tests all of this, *plus* an entire system (world) built with profiling.
I ran mongodb with fsync() for two hours and got the following:

STARTED                     INBLK   OUBLK      MAJFLT   MINFLT
Thu Dec 15 10:34:52 2011    3       192744314  3080182

This is the output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'. Then I ran it with the default msync():

STARTED                     INBLK   OUBLK      MAJFLT   MINFLT
Thu Dec 15 12:34:53 2011    0       7241555    79       5401945

There are also two graphs of disk busyness [1] [2]. The difference is significant: a factor of 37! That is what I expected to get. In the comments for vm_object_page_clean() I found this:

* When stuffing pages asynchronously, allow clustering. XXX we need a
* synchronous clustering mode implementation.

It suggests to me that msync(MS_SYNC) flushes each page to disk in its own single-page IO transaction. If we multiply 4K by 37 we get about 150K, which is the size of a single transaction in my experience.

+alc@, kib@ Am I right? Is there any plan to implement this?

The current buffer clustering code can only do async writes. In fact, I am not quite sure what would constitute sync clustering, because the ability to delay a write is important to be able to cluster at all. Also, I am not sure that lack of clustering is the biggest problem. IMO, the fact that each write is sync is the first problem there. It would be quite a lot of work to add tracking of the issued writes [...]
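To illustrate Jeremy's earlier point about flushing only part of a mapping, here is a minimal sketch: the file name, mapping size, offset, and dirty-region size are all made up, and error handling is abbreviated.

#include <sys/mman.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	size_t len = 16 * 1024 * 1024;
	long pg = sysconf(_SC_PAGESIZE);
	int fd = open("/some/file", O_RDWR);

	if (fd == -1)
		return (1);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return (1);

	/* Dirty an 8 KB region somewhere in the middle of the mapping. */
	size_t off = 5 * 1024 * 1024;
	memset(p + off, 0x5a, 8192);

	/*
	 * Flush only the dirty pages: round the start down to a page
	 * boundary and msync() just that range, not the whole file.
	 */
	size_t start = off & ~((size_t)pg - 1);
	msync(p + start, 8192 + (size_t)pg, MS_SYNC);

	munmap(p, len);
	close(fd);
	return (0);
}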
Re: SCHED_ULE should not be the default
On Wed, Dec 21, 2011 at 04:52:50PM -0800, Steve Kargl wrote:

[... test setup and file list quoted in full; see Steve's message above ...]

The following table records the average 'real' time; the number in (...) is the mean absolute deviation.

# N    ULE              4BSD
# ---------------------------------------
# 4    223.27 (0.502)   221.76 (0.551)
# 5    404.35 (73.82)   270.68 (0.866)
# 6    627.56 (173.0)   247.23 (1.442)
# 7    475.53 (84.07)   285.78 (1.421)
# 8    429.45 (134.9)   223.64 (1.316)

One explanation for the 1.5-2x longer times is that with ULE the threads are not migrated properly, so you end up with idle cores and ready threads not running (the other possible explanation would be that there are migrations, but they are so frequent and expensive that they completely trash the caches; but this seems unlikely for this type of task).

Also, perhaps one could build a simple test process that replicates this workload (so one can run it as part of regression tests):

1. define a CPU-intensive function f(n) which issues no system calls, optionally touching a lot of memory, where n determines the number of iterations;

2. by trial and error (or let the program find it), pick a value N1 so that the minimum execution time of f(N1) is in the 10..100 ms range;

3. now run the function f() again from an outer loop so that the total execution time is large (10..100 s), again with no intervening system calls;

4. use an external shell script that can rerun a process when it terminates, and then run multiple instances in parallel. Instead of the external script one could fork new instances before terminating, but I am a bit unclear how CPU inheritance works when a process forks. Going through the shell possibly breaks the chain.
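A minimal self-contained sketch of steps 1-3 follows; the mixing function used for f(), the 10 ms calibration threshold, and the 30-second total are arbitrary choices for illustration, not from any existing test suite.

#include <stdio.h>
#include <time.h>

static volatile unsigned long sink;

/* Step 1: CPU-intensive, no system calls; n sets the iteration count. */
static void
f(unsigned long n)
{
	unsigned long x = 1;

	for (unsigned long i = 0; i < n; i++)
		x = x * 6364136223846793005UL + 1442695040888963407UL;
	sink = x;	/* keep the loop from being optimized away */
}

/* Wall-clock seconds for one call of f(n). */
static double
run_once(unsigned long n)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	f(n);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int
main(void)
{
	/* Step 2: find N1 so that f(N1) takes on the order of 10-100 ms. */
	unsigned long n1 = 1000;

	while (run_once(n1) < 0.01)
		n1 *= 2;
	fprintf(stderr, "calibrated N1 = %lu\n", n1);

	/*
	 * Step 3: rerun f() until the total run time is large (~30 s),
	 * with no system calls in the hot path other than the timing.
	 */
	double total = 0.0;

	while (total < 30.0)
		total += run_once(n1);
	printf("total %.2f s\n", total);
	return (0);
}

Step 4 would then be a small shell wrapper that starts, say, ncpu+1 instances of this program in parallel and restarts each one as it exits.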
Re: Using mmap(2) with a hint address
Hi Artem, Tijl,

On Tue, 20 Dec 2011 09:27:43 -0800, Artem Belevich wrote:
Something like that. [...] These days malloc() by default uses mmap, so if you don't force it to use sbrk() you can probably lower MAXDSIZ and let the kernel use most of the address space for hinted mmaps. [...]

On Tue, 20 Dec 2011 18:45:08 +0100, Tijl Coosemans wrote:
I don't know about NetBSD, but Linux maps from the stack downwards when there's no hint, and FreeBSD maps from the program upwards. [...] malloc(3) used to be implemented on top of brk(2), so the size was increased on amd64 so you could malloc more memory. Nowadays malloc can use mmap(2), so a large datasize isn't really needed anymore.

I will use setrlimit(2) to lower the datasize then. Thanks a lot for your time and explanations.

Best regards,

-- Ganael LAPLANCHE ganael.laplan...@martymac.org
http://www.martymac.org | http://contribs.martymac.org
FreeBSD: martymac marty...@freebsd.org, http://www.FreeBSD.org
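For reference, the setrlimit(2) approach mentioned above could look like this minimal sketch; the 512 MB figure is an arbitrary example, and this lowers only the soft limit for the calling process and its children.

#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_DATA, &rl) == -1)
		return (1);
	rl.rlim_cur = 512UL * 1024 * 1024;	/* illustrative: 512 MB */
	if (setrlimit(RLIMIT_DATA, &rl) == -1) {
		perror("setrlimit");
		return (1);
	}
	printf("datasize soft limit now %ju bytes\n",
	    (uintmax_t)rl.rlim_cur);
	return (0);
}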
Re: SCHED_ULE should not be the default
On 12/22/11 04:07, Adrian Chadd wrote:
Are you able to go through the emails here and grab out Attilio's example for generating KTR scheduler traces? Adrian [...]

I've put up two such files:

http://www.m5p.com/~george/ktr-ule-problem.out
http://www.m5p.com/~george/ktr-ule-interact.out

but I don't know how to analyze them myself. What do all of us do next?

-- George Mitchell
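For what it's worth, later messages in this thread feed such dumps to the schedgraph tool shipped in the source tree. Assuming a stock source tree, the invocation would be something like the following (the trace file name is from above; per a later message, schedgraph also accepts the CPU clock frequency as an argument for time scaling):

python /usr/src/tools/sched/schedgraph.py ktr-ule-problem.out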
Re: SCHED_ULE should not be the default
On Thu, Dec 22, 2011 at 01:07:58AM -0800, Adrian Chadd wrote:
Are you able to go through the emails here and grab out Attilio's example for generating KTR scheduler traces?

Did you read this part of my email?

Attilio, I have placed several files at http://troutmask.apl.washington.edu/~kargl/freebsd

dmesg.txt -- dmesg for the ULE kernel
summary -- a summary that includes top(1) output of all runs
sysctl.ule.txt -- sysctl -a for the ULE kernel
ktr-ule-problem-kargl.out.gz

ktr-ule-problem-kargl.out is a 43 MB file. I don't think the freebsd.org email server would allow that file through.

-- Steve
emacs-devel glib-warning
Hello! I considered switching from emacs23 to emacs24 over the Christmas holidays, so I removed emacs23 and installed emacs-devel via ports. Emacs runs fine in a terminal, but it crashes my whole X system when I try to start it as an X client... The error message tells me that there is a glib problem:

GLib-WARNING **: In call to g_spawn_sync(), exit status of a child process was requested but SIGCHLD action was set to SIG_IGN and ECHILD was received by waitpid(), so exit status can't be returned. This is a bug in the program calling g_spawn_sync(); either don't request the exit status, or don't set the SIGCHLD action.

I have emacs-devel installed, and glib-2.28.8_2. I am running xmonad as WM, but it happened on awesome as well. Does anyone know what I can do to get emacs to work as an X client? ;) Thanks in advance!

Greetings from rainy Cologne,
1126
Re: FreeBSD 9 RC3 and VirtualBox
On Wed, Dec 21, 2011 at 11:56 PM, Adam Vande More amvandem...@gmail.com wrote:

VT-x (or the AMD equivalent) is a CPU feature and is necessary to run 64-bit guests. VT-d (or the AMD equivalent)/IOMMU is implemented in the chipset; however, it isn't necessary to run 64-bit guests. Both of these features are only found on CPUs supporting long mode.

Exactly. The E7300 lacks the VT-x bits.

-- Joshua Boyd
E-mail: boy...@jbip.net
http://www.jbip.net
Re: SCHED_ULE should not be the default
On Thu, Dec 15, 2011 at 05:25:51PM +0100, Attilio Rao wrote:
If someone else thinks he has a specific problem that is not characterized by one of the cases above, please let me know and I will put this in the chart.

It seems I stumbled over another thing. Setup: 2 servers providing devices via ggated, 1 server using ggatec for those devices. ZFS runs over each pair of disks provided by both ggated servers. I use rsync to fill up the 6 zpools/zfs from an existing storage (2 TB zpools, about 500 to 700 GiB used per pool), with 2 rsyncs running in parallel to fill the partitions. The main server (ggate client with ZFS and rsync) has an Intel Xeon X3450 2.66 GHz quad-core processor (+HTT or whatever it's called nowadays; gives 8 cpus in FreeBSD).

With ULE, ZFS gets slower after some time and finally gets stuck after 1 to 3 days of continuous synchronisation (ggate works like a charm as far as I can tell); with 4BSD (online for 6 days now) the rsync seems to run a lot faster and I didn't get ZFS to stall. There is nearly no local I/O (the system is on a local SSD) and the load/CPU usage is not actually high. All of this is running a quite recent RELENG_9.

If anyone's interested, I can provide more detail and carry out some tests.

- Oliver

-- 
| Oliver Brandmueller  http://sysadm.in/  o...@sysadm.in |
| Ich bin das Internet. Sowahr ich Gott helfe. |
Re: SCHED_ULE should not be the default
On Thu, Dec 22, 2011 at 11:31:45AM +0100, Luigi Rizzo wrote:
On Wed, Dec 21, 2011 at 04:52:50PM -0800, Steve Kargl wrote:
[... file list and timing table quoted; see above ...]
One explanation for taking 1.5-2x times is that with ULE the threads are not migrated properly, so you end up with idle cores and ready threads not running.

That's what I guessed back in 2008 when I first reported the behavior:

http://freebsd.monkey.org/freebsd-current/200807/msg00278.html
http://freebsd.monkey.org/freebsd-current/200807/msg00280.html

The top(1) output at the above URLs shows 10 completely independent instances of the same numerically intensive application running on a circa-2008 ULE kernel. Look at the PRI column. The high-PRI jobs are not only pinned to a cpu, but are running at 100% WCPU. The low-PRI jobs seem to be pinned to a subset of the available cpus and simply ping-pong in and out of the same cpus. In this instance, there are 5 jobs competing for time on 3 cpus.

Also, perhaps one could build a simple test process that replicates this workload (so one can run it as part of regression tests):
[... four-step proposal quoted; see Luigi's message above ...]

The tests at the above URLs do essentially what you propose, except that in 2008 the kzk90 programs were doing some IO.

-- Steve
Re: FreeBSD 9 RC3 and VirtualBox
On 12/22/2011 9:56 AM, Joshua Boyd wrote:
On Wed, Dec 21, 2011 at 11:56 PM, Adam Vande More amvandem...@gmail.com wrote:
VT-x (or the AMD equivalent) is a CPU feature and is necessary to run 64-bit guests. [...]
Exactly. The E7300 lacks the VT-x bits.

Actually, there are three different part numbers for the E7300. Two of them have VT-x; one does not.
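As an aside, one quick way to check which variant a given box has is to look at the CPU feature flags FreeBSD prints at boot: VT-x shows up as VMX in the Features2 line on Intel parts (AMD's equivalent appears as SVM). For example:

grep -i features /var/run/dmesg.boot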
Re: emacs-devel glib-warning
On Thursday, December 22, 2011 10:02:09 am 1126 wrote:
[...]
GLib-WARNING **: In call to g_spawn_sync(), exit status of a child process was requested but SIGCHLD action was set to SIG_IGN and ECHILD was received by waitpid(), so exit status can't be returned. This is a bug in the program calling g_spawn_sync(); either don't request the exit status, or don't set the SIGCHLD action.

That is just a bug in emacs (or some library emacs is using). It happens even when emacs doesn't crash. I suspect it is unrelated to the problem you are having with your X server, and that the crash is caused by something else emacs is doing. What do you mean in detail by "crashes my whole X-system"? Does X actually core dump? Does X freeze or spin using 100% CPU? Does your window manager crash, etc.? One thing you can maybe try is building emacs without dbus or gconf and seeing if that works better.

-- John Baldwin
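For what it's worth, the POSIX behavior behind that warning is easy to demonstrate with a minimal standalone sketch (this is not emacs or GLib code, just an illustration of the SIGCHLD/waitpid interaction):

#include <sys/wait.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	int status;
	pid_t pid;

	/*
	 * With SIGCHLD ignored, terminated children are reaped
	 * automatically and their exit status is discarded.
	 */
	signal(SIGCHLD, SIG_IGN);

	pid = fork();
	if (pid == 0)
		_exit(42);		/* this status is thrown away */

	if (waitpid(pid, &status, 0) == -1)
		printf("waitpid: %s\n", strerror(errno)); /* ECHILD */
	return (0);
}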
Re: SCHED_ULE should not be the default
On Thu, Dec 22, 2011 at 11:31:45AM +0100, Luigi Rizzo wrote:
On Wed, Dec 21, 2011 at 04:52:50PM -0800, Steve Kargl wrote:

I have placed several files at http://troutmask.apl.washington.edu/~kargl/freebsd

dmesg.txt -- dmesg for the ULE kernel
summary -- a summary that includes top(1) output of all runs
sysctl.ule.txt -- sysctl -a for the ULE kernel
ktr-ule-problem-kargl.out.gz

I've replaced the original version of the ktr file with a new version. The old version was corrupt due to my failure to set 'sysctl debug.ktr.mask=0' prior to the dump.

One explanation for taking 1.5-2x times is that with ULE the threads are not migrated properly, so you end up with idle cores and ready threads not running (the other possible explanation would be that there are migrations, but they are so frequent and expensive that they completely trash the caches; but this seems unlikely for this type of task).

I've used schedgraph to look at the ktrdump output. A jpg is available at

http://troutmask.apl.washington.edu/~kargl/freebsd/ktr.jpg

This shows the ping-pong effect, where 3 processes appear to be using 2 cpus while the remaining 2 processes are pinned to their cpus.

-- Steve
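For anyone reproducing this, the capture sequence implied above is roughly the following (the output file name is from the earlier message, and the ktrdump invocation is the one Attilio gave; freezing the trace buffer first keeps new events from overwriting the interval of interest):

sysctl debug.ktr.mask=0
ktrdump -ctf ktr-ule-problem-kargl.out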
Re: SCHED_ULE should not be the default
on 22/12/2011 20:45 Steve Kargl said the following:
I've used schedgraph to look at the ktrdump output. A jpg is available at http://troutmask.apl.washington.edu/~kargl/freebsd/ktr.jpg This shows the ping-pong effect, where 3 processes appear to be using 2 cpus while the remaining 2 processes are pinned to their cpus.

I'd recommend enabling CPU-specific background colors via the menu in schedgraph for a better illustration of your findings.

NB: I still don't understand the point of purposefully running N+1 CPU-bound processes.

-- Andriy Gapon
Re: SCHED_ULE should not be the default
On Thu, Dec 22, 2011 at 09:01:15PM +0200, Andriy Gapon wrote:
[...]
NB: I still don't understand the point of purposefully running N+1 CPU-bound processes.

The point is that this is a node in an HPC cluster with multiple users. Sure, I can start my job on this node with only N cpu-bound jobs. But when user John Doe wants to run his OpenMPI program, should he log in to each of the 12 nodes in the cluster to see if someone is already running N cpu-bound jobs on a given node? 4BSD gives my jobs and John Doe's jobs a fair share of the available cpus. ULE does not give a fair share, and if you read the summary file I put up on the web, you see that it is fairly non-deterministic when an OpenMPI run will finish (see the mean absolute deviations in the table of 'real' times that I posted).

There is the additional observation in one of my 2008 emails (URLs have been posted) that if you have N+1 cpu-bound jobs with, say, job0 and job1 ping-ponging on cpu0 (due to ULE's cpu-affinity feature), and I kill job2 running on cpu1, then neither job0 nor job1 will migrate to cpu1. So one now has N cpu-bound jobs running on N-1 cpus.

Finally, my initial post in this email thread was to tell O. Hartmann to quit beating his head against a wall with ULE (in an HPC environment) and switch to 4BSD. This was based on my 2008 observations, and I've now wasted 2 days gathering additional information which only re-affirms my recommendation.

-- Steve
Re: SCHED_ULE should not be the default
on 22/12/2011 21:47 Steve Kargl said the following:

The point is that this is a node in an HPC cluster with multiple users. [...] 4BSD gives my jobs and John Doe's jobs a fair share of the available cpus. ULE does not give a fair share, and it is fairly non-deterministic when an OpenMPI run will finish (see the mean absolute deviations in the table of 'real' times that I posted).

OK. I think I know why the uneven load occurs; I remember even trying to explain my observations. There are two things:

1. ULE has neither a runqueue common across CPUs nor any other kind of mechanism for enforcing true global fairness of CPU resource sharing.

2. ULE's rebalancing code is biased, and that leads to situations where sub-groups of threads can share subsets of CPUs rather fairly, but there won't be any global fairness.

I haven't really given any thought as to how to fix or work around these issues. One dumb idea is to add an element of randomness to the choice between equally loaded CPUs (and their subsets) instead of having a permanent bias (an illustrative sketch follows at the end of this message).

There is the additional observation in one of my 2008 emails (URLs have been posted) that if you have N+1 cpu-bound jobs with, say, job0 and job1 ping-ponging on cpu0 (due to ULE's cpu-affinity feature), and I kill job2 running on cpu1, then neither job0 nor job1 will migrate to cpu1. So one now has N cpu-bound jobs running on N-1 cpus.

Have you checked recently that that is still the case? I would consider this a rather serious bug, as opposed to merely sub-optimal scheduling.

Finally, my initial post in this email thread was to tell O. Hartmann to quit beating his head against a wall with ULE (in an HPC environment) and switch to 4BSD. [...]

I think that any objective information has its value, so maybe the time is not really wasted. There is no argument that for your usage pattern 4BSD is better than ULE at the moment, because of the inherent design choices of both schedulers and their current implementations. But I think that ULE could be improved to produce more global fairness.

P.S. But, but, this thread has seen so many different problem reports about ULE heaped together that it's very easy to get confused about what is caused by what, and what is real and what is not. E.g. I don't think that there is a direct relation between this issue (N+1 CPU-bound tasks) and "my X is sluggish with ULE when I untar a large file."

P.P.S. About the subject line. Let's recall why ULE has become the default.
It has happened because of many observations from users and developers that things were faster/snappier with ULE than with 4BSD, and a significant stream of requests to make it the default. So it's business as usual. The schedulers are different, so there are those for whom one scheduler works better, those for whom the other works better, those for whom both work reasonably well, those for whom neither is satisfactory, and those who don't really care/compare. There is a silent majority and there are vocal minorities. There are specific bugs and quirks, advantages and disadvantages, usage patterns, hardware configurations and what not. When everybody starts to talk at the same time, it's a huge mess. But silently triaging and debugging one problem at a time also doesn't always work. There, I've said it. Let me now try to recall why I felt a need to say all of this :-)

-- Andriy Gapon
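To make the "element of randomness" idea above concrete, here is an illustrative user-space sketch. This is not ULE code; the load array, CPU count, and function name are stand-ins for whatever the scheduler actually tracks.

#include <stdlib.h>

#define NCPU 8

/*
 * Pick the least-loaded CPU; break ties at random instead of
 * always favoring the lowest-numbered CPU, removing the
 * permanent bias described above.
 */
int
pick_cpu(const int load[NCPU])
{
	int best = load[0], ties[NCPU], nties = 0;

	for (int i = 0; i < NCPU; i++) {
		if (load[i] < best) {
			best = load[i];
			nties = 0;
		}
		if (load[i] == best)
			ties[nties++] = i;
	}
	return (ties[arc4random_uniform(nties)]);
}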
Mystery panic, FreeBSD 7.2-PRE
We've got another mystery panic in 7.2-PRE. Upgrading is not an option; however, if this is familiar to anyone, backporting a patch would be. The stack trace is:

db_trace_self_wrapper() at 0x8019120a = db_trace_self_wrapper+0x2a^M
panic() at 0x80308797 = panic+0x187^M
devfs_populate_loop() at 0x802a45c8 = devfs_populate_loop+0x548^M
devfs_populate() at 0x802a46ab = devfs_populate+0x3b^M
devfs_lookup() at 0x802a7824 = devfs_lookup+0x264^M
VOP_LOO[24165][irq261: plx0] DEBUG (hasc_sv_rcv_cb): rcvd hrtbt ts 24051, 7/9, rc 0^M
KUP_APV() at 0x804d5995 = VOP_LOOKUP_APV+0x95^M
lookup() at 0x80384a3e = lookup+0x4ce^M
namei() at 0x80385768 = namei+0x2c8^M
vn_open_cred() at 0x8039b283 = vn_open_cred+0x1b3^M
kern_open() at 0x8039a4a0 = kern_open+0x110^M
syscall() at 0x804b0e3c = syscall+0x1ec^M
Xfast_syscall() at 0x80494ecb = Xfast_syscall+0xab^M
--- syscall (5, FreeBSD ELF64, open), rip = 0x800e022fc, rsp = 0x7fbfa128, rbp = 0x801002240 ---^M
KDB: enter: panic^M

-- 
Charles R. (Charlie) Martin
Senior Software Engineer, SGI
1900 Pike Road, Longmont, CO 80501
Phone: 303-532-0209
E-Mail: crmar...@sgi.com
Website: http://www.sgi.com
Re: Benchmark (Phoronix): FreeBSD 9.0-RC2 vs. Oracle Linux 6.1 Server
On 12/21/11 19:41, Alexander Leidinger wrote:

Hi, while the discussion continued here, some work started at some other place. Now... in case someone here is willing to help instead of talking, feel free to go to http://wiki.freebsd.org/BenchmarkAdvice and have a look at what can be improved. The page is far from perfect and needs some additional people who are willing to improve it.

This is only part of the problem. A tuning page in the wiki - which could be referenced from the benchmark page - would be great too. Any volunteers? A first step would be to take the tuning man page and wikify it. Other tuning sources are welcome too.

Every FreeBSD dev with a wiki account can hand out write access to the wiki. The benchmark page gives contributor-access. If someone wants write access, create a FirstnameLastname account and ask here for contributor-access. Don't worry if you think your English is not good enough; even some one-word notes can help (and _my_ English already got corrected by other people on the benchmark page).

Bye, Alexander.

Nice to see movement ;-) But something seems unclear: make.conf(5) says that MALLOC_PRODUCTION is a knob set in /etc/make.conf. The wiki says MALLOC_PRODUCTION is to be set in /etc/src.conf. Which is right and which is wrong?

Oliver
Re: Benchmark (Phoronix): FreeBSD 9.0-RC2 vs. Oracle Linux 6.1 Server
On Fri, Dec 23, 2011 at 12:44:14AM +0100, O. Hartmann wrote:
[...]
But something seems unclear: make.conf(5) says that MALLOC_PRODUCTION is a knob set in /etc/make.conf. The wiki says MALLOC_PRODUCTION is to be set in /etc/src.conf. Which is right and which is wrong?

I can say with certainty that this value belongs in /etc/make.conf (on RELENG_8 and earlier, at least). src/share/mk/bsd.own.mk has no framework for MK_MALLOC_PRODUCTION, so this is definitely a make.conf variable.

-- 
| Jeremy Chadwick                         jdc at parodius.com |
| Parodius Networking                http://www.parodius.com/ |
| UNIX Systems Administrator            Mountain View, CA, US |
| Making life hard for others since 1977.        PGP 4BD6C0CB |
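So, on such a system the relevant line would simply be the following; the "yes" value is conventional, since (if I recall the libc Makefile correctly) the build only tests whether the variable is defined:

# /etc/make.conf -- takes effect on the next buildworld
MALLOC_PRODUCTION=yes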
Re: Mystery panic, FreeBSD 7.2-PRE
On Thu, Dec 22, 2011 at 04:04:48PM -0700, Charlie Martin wrote:
We've got another mystery panic in 7.2-PRE. Upgrading is not an option; however, if this is familiar to anyone, backporting a patch would be. The stack trace is:
[... trace quoted in full; see the original message above ...]

devfs(5) has been massively worked on in RELENG_8 and newer. You should go through the below commits and see if you can find one that references a PR with a similar backtrace, or mentions things like devfs_lookup():

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/fs/devfs/

Also, be aware that the above stack trace is interspersed; ultimately you get to clean up the output yourself. This is a long-standing problem with FreeBSD which can be helped, but only slightly/barely, by using "options PRINTF_BUFR_SIZE=256" in your kernel configuration (the default configs use a value of 128; do not increase the value too high -- there are concerns about it causing major issues; I can dig up the post that says that, but I'd rather not). It *will not* solve the problem of interspersed output entirely. There still is no fix for this problem... :-(

What I'm referring to:

devfs_lookup() at 0x802a7824 = devfs_lookup+0x264^M
VOP_LOO[24165][irq261: plx0] DEBUG (hasc_sv_rcv_cb): rcvd hrtbt ts 24051, 7/9, rc 0^M
lookup() at 0x80384a3e = lookup+0x4ce^M

This should actually read (I think):

devfs_lookup() at 0x802a7824 = devfs_lookup+0x264^M
VOP_LOOKUP_APV() at 0x804d5995 = VOP_LOOKUP_APV+0x95^M
[24165][irq261: plx0] DEBUG (hasc_sv_rcv_cb): rcvd hrtbt ts 24051, 7/9, rc 0^M

-- 
| Jeremy Chadwick                         jdc at parodius.com |
| Parodius Networking                http://www.parodius.com/ |
| UNIX Systems Administrator            Mountain View, CA, US |
| Making life hard for others since 1977.        PGP 4BD6C0CB |
Re: SCHED_ULE should not be the default
On 22 December 2011 11:47, Steve Kargl s...@troutmask.apl.washington.edu wrote:
[snip]

Thank you for posting some actual measurements!

There is the additional observation in one of my 2008 emails (URLs have been posted) that if you have N+1 cpu-bound jobs with, say, job0 and job1 ping-ponging on cpu0 (due to ULE's cpu-affinity feature), and if I kill job2 running on cpu1, then neither job0 nor job1 will migrate to cpu1. So, one now has N cpu-bound jobs running on N-1 cpus.

.. and this sounds like a pretty serious regression. Have you ever filed a PR for it?

Finally, my initial post in this email thread was to tell O. Hartmann to quit beating his head against a wall with ULE (in an HPC environment) and switch to 4BSD. This was based on my 2008 observations, and I've now wasted 2 days gathering additional information which only re-affirms my recommendation.

I personally don't think this is time wasted. You've done something that no one else has actually done - provided actual results from real-life testing, rather than a hundred posts of "I remember seeing X, so I don't use ULE."

If you can definitely and consistently reproduce that N-1 cpu-bound-job bug, you're now in a great position to easily test and re-report KTR/schedtrace results to see what impact changes have. Please don't underestimate exactly how valuable this is.

How often are those two jobs migrating between CPUs? How am I supposed to read CPU load? Why isn't it just sitting at 100% the whole time?

Would you mind repeating this with 4BSD (the N+1 jobs) so we can see how the jobs are scheduled/interleaved? Something tells me we'll see the jobs being scheduled evenly.

Adrian
Re: SCHED_ULE should not be the default
On 12/22/2011 16:23, Adrian Chadd wrote:
You've done something that no one else has actually done - provided actual results from real-life testing, rather than a hundred posts of "I remember seeing X, so I don't use ULE."

Not to take away from Steve's excellent work on this, but I actually spent weeks following detailed instructions from various people using ktr, dtrace, etc. and was never able to produce any data that helped point anyone to something that could be fixed. I'm pretty sure that others have tried as well. That said, I'm glad that Steve was able to produce useful results, and hopefully it will lead to improvements.

Doug

-- 
Breadth of IT experience, and depth of knowledge in the DNS. Yours for the right price. :) http://SupersetSolutions.com/
Re: SCHED_ULE should not be the default
On Thu, Dec 22, 2011 at 04:23:29PM -0800, Adrian Chadd wrote:
On 22 December 2011 11:47, Steve Kargl s...@troutmask.apl.washington.edu wrote:

There is the additional observation in one of my 2008 emails (URLs have been posted) that if you have N+1 cpu-bound jobs with, say, job0 and job1 ping-ponging on cpu0 (due to ULE's cpu-affinity feature), and if I kill job2 running on cpu1, then neither job0 nor job1 will migrate to cpu1. So, one now has N cpu-bound jobs running on N-1 cpus.

.. and this sounds like a pretty serious regression. Have you ever filed a PR for it?

No. I was interacting directly with jeffr in 2008. I got as far as setting up root access on a node for jeffr. Unfortunately, both jeffr and I got busy with real life, and 4BSD allowed me to get my work done.

Finally, my initial post in this email thread was to tell O. Hartmann to quit beating his head against a wall with ULE (in an HPC environment) and switch to 4BSD. [...]

I personally don't think this is time wasted. [...] If you can definitely and consistently reproduce that N-1 cpu-bound-job bug, you're now in a great position to easily test and re-report KTR/schedtrace results to see what impact they have. Please don't underestimate exactly how valuable this is.

I'll try this tomorrow. I first need to modify the code I used in the 2008 test to disable IO, so that it is nearly completely cpu-bound.

How often are those two jobs migrating between CPUs? How am I supposed to read CPU load? Why isn't it just sitting at 100% the whole time?

This is my first foray into ktr and schedgraph, so I may have done something incorrectly. In particular, it seems that schedgraph takes the cpu clock as a command line argument, so there is probably some scaling that I'm missing.

Would you mind repeating this with 4BSD (the N+1 jobs) so we can see how the jobs are scheduled/interleaved? Something tells me we'll see the jobs being scheduled evenly.

Sure, I'll do this tomorrow as well.

-- Steve
Re: Benchmark (Phoronix): FreeBSD 9.0-RC2 vs. Oracle Linux 6.1 Server
On Dec 22, 2011, at 3:58 PM, Jeremy Chadwick free...@jdc.parodius.com wrote:
[...]
I can say with certainty that this value belongs in /etc/make.conf (on RELENG_8 and earlier, at least). src/share/mk/bsd.own.mk has no framework for MK_MALLOC_PRODUCTION, so this is definitely a make.conf variable.

Take the advice in tuning(7) with a grain of salt, because a number of its suggestions are really outdated. I know because I filed a PR last night after I saw how out of sync some of the defaults it claimed were with reality on 9.x+. And I know other suggestions in the manpage are dated as well ;/.

Thanks,
-Garrett
Re: emacs-devel glib-warning
On 2011/12/22 at 23:02, 1126 mailingli...@elfsechsundzwanzig.de wrote:
[...]
GLib-WARNING **: In call to g_spawn_sync(), exit status of a child process was requested but SIGCHLD action was set to SIG_IGN and ECHILD was received by waitpid(), so exit status can't be returned. This is a bug in the program calling g_spawn_sync(); either don't request the exit status, or don't set the SIGCHLD action.

I am currently using emacs-devel; however, I have been using glib-2.30.x from marcus's experimental ports for a while, and everything seems to be OK. Either tweak some of the configure args available to emacs-devel and see how it goes, or pull the glib-2.30.x port from Marcus's site and give it a try. Good luck!

-- 
The inside contact that you have developed at great expense is the first person to be let go in any reorganization.
Re: directory listing hangs in ufs state
On 12/22/2011 03:48, Kostik Belousov wrote:
[...]
The current buffer clustering code can only do async writes. In fact, I am not quite sure what would constitute sync clustering, because the ability to delay a write is important to be able to cluster at all. Also, I am not sure that lack of clustering is the biggest problem. IMO, the fact that each write is sync is the first problem there. It would be quite a lot of work to add tracking of the issued writes to vm_object_page_clean() and down the stack. Esp. due to custom page write [...]