Re: 4.7.0-rc7 ext4 error in dx_probe
On 2016-08-18 02:28, Darrick J. Wong wrote:
> On Wed, Aug 17, 2016 at 05:27:25PM +0300, Török Edwin wrote:
>> On 2016-08-08 09:28, Darrick J. Wong wrote:
>>> On Sun, Aug 07, 2016 at 11:56:34PM -0400, Theodore Ts'o wrote:
>>>> On Fri, Aug 05, 2016 at 12:15:48PM -0700, Darrick J. Wong wrote:
>>>>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
>>>>>>
>>>>>> I added the patch, rebuilt and rebooted. It will take some time
>>>>>> before I'll report back since the issue is so hard to reproduce.
>>>>>
>>>>> FWIW I could trigger it reliably by running a bunch of directory
>>>>> traversal programs simultaneously on the same directory. I have a
>>>>> script that fires up multiple mutts pointing to the Maildirs for
>>>>> the high traffic Linux lists.
>>>>
>>>> Hmm, I wonder if we should request that this patch be backported to
>>>> -stable. Darrick, what do you think?
>>>
>>> Seems like an excellent idea.
>>
>> Hi,
>>
>> I see 4.7.1 was released; however, before testing it, I can't see the
>> inode.c commit with the checksum fix in the changelog. Is there another
>> commit that would provide an equivalent fix, or is the fix not part of
>> 4.7.1?
>
> There's no 'equivalent' fix; the one we have simply didn't make it
> to the stable maintainers before 4.7.1 came out.
>
> (Hey Ted, can you send the patch to ... whomever's in charge of 4.7?)

Ping? 4.7.2 is out.

The process for submitting to stable seems to be described here [1]. It
doesn't say who can submit, but I guess it'd be better if one of the
people listed in the original commit, or an ext4 maintainer, submitted it.

FWIW 4.7.0+patch has been working reliably here. It would be nice if I
could tell people that it's safe to upgrade to 4.7.x (without manually
patching it).

Best regards,
--Edwin
Re: 4.7.0-rc7 ext4 error in dx_probe
On 2016-08-17 23:02, Johannes Stezenbach wrote:
> On Fri, Aug 05, 2016 at 08:11:36PM +0200, Johannes Stezenbach wrote:
>> On Fri, Aug 05, 2016 at 10:02:28AM -0700, Darrick J. Wong wrote:
>>>
>>> When you're back on 4.7, can you apply this patch[1] to see if it
>>> fixes the problem? I speculate that the new parallel dir lookup code
>>> enables multiple threads to be verifying the same directory block
>>> buffer at the same time.
>>>
>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
>>
>> I added the patch, rebuilt and rebooted. It will take some time
>> before I'll report back since the issue is so hard to reproduce.
>
> FWIW, so far the issue didn't appear again after I applied
> the patch to 4.7.0, and I stressed it a bit with repo syncs,
> AOSP builds, rsync backups etc.

Same here: I've been using 4.7.0+patch daily, and everything is working fine.

Best regards,
--Edwin
Re: 4.7.0-rc7 ext4 error in dx_probe
On 2016-08-08 09:28, Darrick J. Wong wrote:
> On Sun, Aug 07, 2016 at 11:56:34PM -0400, Theodore Ts'o wrote:
>> On Fri, Aug 05, 2016 at 12:15:48PM -0700, Darrick J. Wong wrote:
>>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/ext4/inode.c?id=b47820edd1634dc1208f9212b7ecfb4230610a23
>>>>
>>>> I added the patch, rebuilt and rebooted. It will take some time
>>>> before I'll report back since the issue is so hard to reproduce.
>>>
>>> FWIW I could trigger it reliably by running a bunch of directory
>>> traversal programs simultaneously on the same directory. I have a
>>> script that fires up multiple mutts pointing to the Maildirs for the
>>> high traffic Linux lists.
>>
>> Hmm, I wonder if we should request that this patch be backported to
>> -stable. Darrick, what do you think?
>
> Seems like an excellent idea.

Hi,

I see 4.7.1 was released; however, before testing it, I can't see the
inode.c commit with the checksum fix in the changelog. Is there another
commit that would provide an equivalent fix, or is the fix not part of
4.7.1?

Best regards,
--Edwin
Re: 4.7.0-rc7 ext4 error in dx_probe
On 2016-08-09 05:37, Darrick J. Wong wrote:
> On Tue, Aug 09, 2016 at 12:13:01AM +0300, Török Edwin wrote:
>> On 2016-08-08 19:55, Darrick J. Wong wrote:
>>> On Mon, Aug 08, 2016 at 12:08:18PM -0400, Theodore Ts'o wrote:
>>>> On Sun, Aug 07, 2016 at 11:28:10PM -0700, Darrick J. Wong wrote:
>>>>>
>>>>> I have one lingering concern -- is it a bug that two processes could
>>>>> be computing the checksum of a buffer simultaneously? I would have
>>>>> thought ext4 would serialize that kind of buffer_head access...
>>>>
>>>> Do we know how this is happening? We've always depended on the VFS to
>>>> provide this exclusion. The only way we should be modifying the
>>>> buffer_head at the same time is if two CPUs are trying to modify the
>>>> directory at the same time, and that should _never_ be happening, even
>>>> with the new directory parallelism code, unless the file system has
>>>> given permission and intends to do its own fine-grained locking.
>>>
>>> It's a combination of two things, I think. The first is that the
>>> checksum calculation routine (temporarily) set the checksum field to
>>> zero during the computation, which of course is a no-no. The patch
>>> fixes that problem and should go in.
>>
>> Thanks a lot for the patch.
>> I wrote a small testcase (see below) that triggers the problem quite
>> soon on my box with kernel 4.7.0, and seems to have survived so far
>> with kernel 4.7.0+patch.
>> When it failed it printed something like "readdir: Bad message".
>>
>> The drop caches part is quite important for triggering the bug, and
>> might explain why this bug was hard to reproduce: IIUC this race
>> condition can happen only if 2+ threads/processes try to access the
>> same directory, and the directory's inode is not in the cache (i.e.
>> was never cached, or got kicked out of the cache).
>
> Could you formulate this into an xfstest, please? It would be very
> useful to have this as a regression test.
>
> (Or attach a Signed-off-by and I'll take care of it eventually.)

I've attached a signed-off-by line and a copyright header (feel free to
add yourself in the copyright header too):

Signed-off-by: Török Edwin

>> /*
>> $ gcc trigger.c -o trigger -pthread
>> $ ./trigger
>> */

/*
 * Test concurrent readdir on uncached inodes
 *
 * Copyright (C) 2016 Skylable Ltd.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version 2
 * of the License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
 * 02110-1301, USA.
 */

>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <errno.h>
>> #include <pthread.h>
>> #include <dirent.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>>
>> #define FILES 10
>> #define THREADS 16
>> #define LOOPS 1000
>>
>> static void die(const char *msg)
>> {
>>     perror(msg);
>>     exit(EXIT_FAILURE);
>> }
>>
>> static void* list(void* arg)
>> {
>>     for (int i = 0; i < LOOPS; i++) {
>>         DIR *d = opendir(".");
>>         if (!d) {
>>             die("opendir");
>>         }
>>         errno = 0;
>>         while (readdir(d)) {}
>>         if (errno) {
>>             die("readdir");
>>         }
>>         closedir(d);
>>         FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
>>         if (f) {
>>             fputs("3", f);
>>             fclose(f);
>>         }
>>     }
>>     return NULL;
>> }
>>
>> int main()
>> {
>>     pthread_t t[THREADS];
>>
>>     if (mkdir("ext4test", 0755) < 0 && errno != EEXIST)
>>         die("mkdir");
>>     if (chdir("ext4test") < 0)
>>         die("chdir");
>>     for (unsigned i = 0; i < FILES; i++) {
>>         char name[16];
>>         snprintf(name, sizeof(name), "%d", i);
>>         int fd = open(name, O_WRONLY|O_CREAT, 0600);
>>         if (fd < 0)
>>             die("open");
>>         close(fd);
>>     }
>>     for (unsigned i = 0; i < sizeof(t)/sizeof(t[0]); i++) {
>>         pthread_create(&t[i], NULL, list, NULL);
>>     }
>>     for (unsigned i = 0; i < sizeof(t)/sizeof(t[0]); i++) {
>>         pthread_join(t[i], NULL);
>>     }
>>     return 0;
>> }

--
Edwin Török | Co-founder and Lead Developer

Skylable open-source object storage: reliable, fast, secure
http://www.skylable.com
Re: 4.7.0-rc7 ext4 error in dx_probe
On 2016-08-08 19:55, Darrick J. Wong wrote:
> On Mon, Aug 08, 2016 at 12:08:18PM -0400, Theodore Ts'o wrote:
>> On Sun, Aug 07, 2016 at 11:28:10PM -0700, Darrick J. Wong wrote:
>>>
>>> I have one lingering concern -- is it a bug that two processes could
>>> be computing the checksum of a buffer simultaneously? I would have
>>> thought ext4 would serialize that kind of buffer_head access...
>>
>> Do we know how this is happening? We've always depended on the VFS to
>> provide this exclusion. The only way we should be modifying the
>> buffer_head at the same time is if two CPUs are trying to modify the
>> directory at the same time, and that should _never_ be happening, even
>> with the new directory parallelism code, unless the file system has
>> given permission and intends to do its own fine-grained locking.
>
> It's a combination of two things, I think. The first is that the
> checksum calculation routine (temporarily) set the checksum field to
> zero during the computation, which of course is a no-no. The patch
> fixes that problem and should go in.

Thanks a lot for the patch.
I wrote a small testcase (see below) that triggers the problem quite soon
on my box with kernel 4.7.0, and seems to have survived so far with
kernel 4.7.0+patch.
When it failed it printed something like "readdir: Bad message".

The drop caches part is quite important for triggering the bug, and might
explain why this bug was hard to reproduce: IIUC this race condition can
happen only if 2+ threads/processes try to access the same directory, and
the directory's inode is not in the cache (i.e. was never cached, or got
kicked out of the cache).

/*
$ gcc trigger.c -o trigger -pthread
$ ./trigger
*/
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <pthread.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define FILES 10
#define THREADS 16
#define LOOPS 1000

static void die(const char *msg)
{
    perror(msg);
    exit(EXIT_FAILURE);
}

static void* list(void* arg)
{
    for (int i = 0; i < LOOPS; i++) {
        DIR *d = opendir(".");
        if (!d) {
            die("opendir");
        }
        errno = 0;
        while (readdir(d)) {}
        if (errno) {
            die("readdir");
        }
        closedir(d);
        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (f) {
            fputs("3", f);
            fclose(f);
        }
    }
    return NULL;
}

int main()
{
    pthread_t t[THREADS];

    if (mkdir("ext4test", 0755) < 0 && errno != EEXIST)
        die("mkdir");
    if (chdir("ext4test") < 0)
        die("chdir");
    for (unsigned i = 0; i < FILES; i++) {
        char name[16];
        snprintf(name, sizeof(name), "%d", i);
        int fd = open(name, O_WRONLY|O_CREAT, 0600);
        if (fd < 0)
            die("open");
        close(fd);
    }
    for (unsigned i = 0; i < sizeof(t)/sizeof(t[0]); i++) {
        pthread_create(&t[i], NULL, list, NULL);
    }
    for (unsigned i = 0; i < sizeof(t)/sizeof(t[0]); i++) {
        pthread_join(t[i], NULL);
    }
    return 0;
}

--
Edwin Török | Co-founder and Lead Developer

Skylable open-source object storage: reliable, fast, secure
http://www.skylable.com
Re: 4.7.0-rc7 ext4 error in dx_probe
Hi,

I have just encountered a similar problem after I recently upgraded to 4.7.0:

[Wed Aug  3 11:08:57 2016] EXT4-fs error (device dm-1): dx_probe:740: inode #13295: comm python: Directory index failed checksum
[Wed Aug  3 11:08:57 2016] Aborting journal on device dm-1-8.
[Wed Aug  3 11:08:57 2016] EXT4-fs (dm-1): Remounting filesystem read-only
[Wed Aug  3 11:08:57 2016] EXT4-fs error (device dm-1): ext4_journal_check_start:56: Detected aborted journal

I rebooted in single-user mode, fsck fixed the filesystem, and after
another reboot the filesystem is rw again now. Inode #13295 seems to be
this, and I can list it now:

$ stat /usr/lib64/python3.4/site-packages
  File: '/usr/lib64/python3.4/site-packages'
  Size: 12288      Blocks: 24         IO Block: 4096   directory
Device: fd01h/64769d    Inode: 13295       Links: 180
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Access: 2016-05-09 11:29:44.056661988 +0300
Modify: 2016-08-01 00:34:24.029779875 +0300
Change: 2016-08-01 00:34:24.029779875 +0300
 Birth: -

The filesystem was /; I only noticed it was readonly after several hours,
when I tried to install something:

/dev/mapper/vg--ssd-root on / type ext4 (rw,noatime,errors=remount-ro,data=ordered)

$ uname -a
Linux bolt 4.7.0-gentoo-rr #1 SMP Thu Jul 28 11:28:56 EEST 2016 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

FWIW I've been using ext4 for years and this is the first time I see this
message. Prior to 4.7 I was on 4.6.1 -> 4.6.2 -> 4.6.3 -> 4.6.4. The
kernel is from gentoo-sources plus a patch for enabling AMD LWP (I've had
that patch since 4.6.3 and it's not related to I/O).

If I see this message again, what should I do to obtain more information
to track down the root cause?

Best regards,
--
Edwin Török | Co-founder and Lead Developer

Skylable open-source object storage: reliable, fast, secure
http://www.skylable.com
Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX
David Chinner wrote:
> On Sun, Feb 17, 2008 at 05:51:08PM +0100, Oliver Pinter wrote:
>> On 2/17/08, Török Edwin <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> xfsaild is causing many wakeups, a quick investigation shows
>>> xfsaild_push is always returning a 30 msec timeout value.
>
> That's a bug

Ok. Your patch fixes the 30+ wakeups :)

> , and has nothing to do with power consumption. ;)

I suggest using a sysctl value (such as
/proc/sys/vm/dirty_writeback_centisecs) instead of a hardcoded default of
1000. That would further reduce the wakeups.

> I only managed to reproduce this on a lazy superblock counter
> filesystem (i.e. new mkfs and recent kernel),

The filesystem was created in July 2007.

> Can you see if the patch below fixes the problem.

Yes, it reduces wakeups to 1/sec.

Thanks,
--Edwin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX
Hi,

xfsaild is causing many wakeups; a quick investigation shows xfsaild_push
is always returning a 30 msec timeout value.

This is on an idle system, running only gnome and gnome-terminal.

I suggest changing the timeout logic in xfsaild to be more
power-consumption friendly. See below my original report to the powerTOP
mailing list.

Best regards,
--Edwin

--- Begin Message ---
Török Edwin wrote:
> Török Edwin wrote:
>> Hi,
>>
>> On latest -git of 2.6.25 I am getting lots of wakeups from xfsaild.
>>
>>   23.5% ( 33.3)       xfsaild : schedule_timeout (process_timeout)
>
> [Should I Cc: xfs mailing list / lkml on this?]
>
> The problem seems to be with the timeout logic in xfsaild_push, which
> can return four timeout values (msecs): 1000, 10, 20, 30. I inserted a
> marker and attached a probe function; schedule_timeout always got
> called with 9 jiffies (which is 30 msecs, I have HZ=300).
>
> Changing xfs_trans_ail.c:270 from "tout += 20" to "tout = 1000" made
> xfsaild only do 1 wakeup/s instead of 33!
>
> For some reason xfs_aild always thinks it has work (I/O) to do, and
> never chooses the 1000 msec sleep value.

sounds like an XFS bug... worth reporting to the xfs/lkml folks for sure.
--- End Message ---
LatencyTOP: sync_page description
Hi Arjan,

LatencyTOP says that sync_page is 'Writing a page to disk'; however, I
see it even when no writes are involved, such as during a readdir, lseek,
etc. Naming it a write is misleading, as no program is running that is
doing constant writes to the disk. The only program writing is writing to
a temp dir in /dev/shm.

What would be a better description for sync_page?

Here are some /proc/latency_stats entries containing sync_page:

125 6937678 210821 sync_page sync_page_killable sync_page_killable __lock_page_killable wake_bit_function generic_file_aio_read get_unused_fd_flags path_walk do_sync_read autoremove_wake_function security_file_permission rw_verify_area
306 5677749 215746 sync_page sync_page_killable sync_page_killable __lock_page_killable wake_bit_function generic_file_aio_read do_sync_read autoremove_wake_function security_file_permission rw_verify_area vfs_read vfs_llseek
21 435657 59966 sync_page sync_page __lock_page wake_bit_function read_cache_page_async ntfs_readpage read_cache_page map_mft_record ntfs_read_locked_inode ntfs_alloc_big_inode iget5_locked ntfs_test_inode
195 2716409 133660 blk_unplug sync_page sync_page __lock_page wake_bit_function read_cache_page_async ntfs_readpage read_cache_page map_mft_record ntfs_read_locked_inode ntfs_alloc_big_inode iget5_locked
28 1881278 181986 add_to_page_cache_lru sync_page sync_page_killable sync_page_killable __lock_page_killable wake_bit_function generic_file_aio_read get_unused_fd_flags path_walk do_sync_read autoremove_wake_function security_file_permission
2 17132 9746 add_to_page_cache_lru sync_page sync_page_killable sync_page_killable __lock_page_killable wake_bit_function generic_file_aio_read do_sync_read autoremove_wake_function security_file_permission rw_verify_area vfs_read
1 70 70 irq_exit sync_page sync_page_killable sync_page_killable __lock_page_killable wake_bit_function generic_file_aio_read do_sync_read autoremove_wake_function security_file_permission rw_verify_area vfs_read
23 306682 114514 blk_unplug sync_page sync_page_killable sync_page_killable __lock_page_killable wake_bit_function generic_file_aio_read do_sync_read autoremove_wake_function security_file_permission rw_verify_area vfs_read
1 153 153 hrtimer_interrupt smp_apic_timer_interrupt sync_page sync_page_killable sync_page_killable __lock_page_killable wake_bit_function generic_file_aio_read do_sync_read autoremove_wake_function cfq_idle_slice_timer security_file_permission

Thanks,
--Edwin
Re: Strange interaction between latencytop and the scheduler
Török Edwin wrote:
>> The performance aspect... collecting the data isn't cheap (which is
>> why it's made a sysctl),
>> I still plan to look at optimizing it but it won't ever be free.
>
> Yes, I understand that. Is there a way latencytop could track its own
> overhead? I suppose it would lead to more accurate results
> (not that there would be anything wrong with the current ones).

The latencytop userspace tool shows latencies > 0.1 msec, thus capturing
backtraces for latencies below that could be avoided. If I apply the patch
below, then enabling latencytop doesn't freeze X when running the
"10 threads doing infloop usleep(1)" test.

Still, I don't want to lose track of the latencies we didn't collect
backtraces for, so I added a special "untraced" category, reported as the
first line in /proc/latency_stats. If needed, instead of hardcoding the
threshold, it could be made a sysctl, or set via writing to
/proc/latency_stats, ...

While I am running the test program:
$ grep untraced /proc/latency_stats
4875605 5120414 49 untraced

On an idle system:
$ grep untraced /proc/latency_stats
532 3287 47 untraced
$ grep untraced /proc/latency_stats
853 5778 47 untraced
$ grep untraced /proc/latency_stats
950 6788 47 untraced
$ grep untraced /proc/latency_stats
1343 9977 49 untraced
$ grep untraced /proc/latency_stats
1448 11075 49 untraced

Best regards,
--Edwin

---
 latencytop.c |   19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

--- /tmp/linux-2.6.24-rc8/kernel/latencytop.c	2008-01-24 11:27:32.727487585 +0200
+++ kernel/latencytop.c	2008-01-24 10:42:25.000000000 +0200
@@ -24,8 +24,11 @@
 static DEFINE_SPINLOCK(latency_lock);
 
 #define MAXLR 128
+/* we are not interested in latencies less than 0.1msec, so
+ * don't get backtraces for latencies <0.05msec.*/
+#define LATENCY_TRACE_THRESHOLD 50
 static struct latency_record latency_record[MAXLR];
-
+static struct latency_record untraced;
 int latencytop_enabled;
 
 void clear_all_latency_tracing(struct task_struct *p)
@@ -47,6 +50,7 @@
 	spin_lock_irqsave(&latency_lock, flags);
 	memset(&latency_record, 0, sizeof(latency_record));
+	memset(&untraced, 0, sizeof(untraced));
 	spin_unlock_irqrestore(&latency_lock, flags);
 }
@@ -124,6 +128,15 @@
 	if (inter && usecs > 5000)
 		return;
 
+	if (usecs < LATENCY_TRACE_THRESHOLD) {
+		/* don't get stacktrace for very low latencies */
+		untraced.time += usecs;
+		if (usecs > untraced.max)
+			untraced.max = usecs;
+		untraced.count++;
+		return;
+	}
+
 	memset(&lat, 0, sizeof(lat));
 	lat.count = 1;
 	lat.time = usecs;
@@ -177,6 +190,10 @@
 	seq_puts(m, "Latency Top version : v0.1\n");
 
+	seq_printf(m, "%i %li %li untraced \n",
+		untraced.count,
+		untraced.time,
+		untraced.max);
 	for (i = 0; i < MAXLR; i++) {
 		if (latency_record[i].backtrace[0]) {
 			int q;
Re: Strange interaction between latencytop and the scheduler
Arjan van de Ven wrote:
> Török Edwin wrote:
>> Is this normal? (is overhead really 40msec?)
>
> I'm seeing similar, I would not rule out that this is actual scheduler
> behavior ;(
> at least it seems consistent.

Ok, it is good that we are seeing the same behaviour.

> I also made a patch (to lkml yesterday) that counts the total and
> count of scheduler latencies
> to also show the cumulative effect of these latencies

Found it: http://lkml.org/lkml/2008/1/21/223. I'll try it in the next
days. Do you want me to report results? [And especially, should I Cc:
Ingo Molnar on the report?]

>> I was also seeing an unusually high number of context switches (as
>> shown by vmstat). I usually have 400-800 with non-patched kernels
>> (while running mplayer too), but I was getting steadily over 1100 with
>> the patch (on an idle system).
>
> that's weird; there's nothing blocking in the entire patch at all.
> (the only lock is a spinlock)

Maybe the overhead of latencytop is triggering other behaviour in the
scheduler? Just a guess. Are there any tests I can run to see why there
are more context switches?

> The performance aspect... collecting the data isn't cheap (which is
> why it's made a sysctl),
> I still plan to look at optimizing it but it won't ever be free.

Yes, I understand that. Is there a way latencytop could track its own
overhead? I suppose it would lead to more accurate results
(not that there would be anything wrong with the current ones).

>> * I compile without CONFIG_SCHED_DEBUG, I no longer get *any* latency
>> from the scheduler, even if I run multi-threaded programs, etc. Is
>> this to be expected? (i.e. is this feature available only when
>> enabling CONFIG_SCHED_DEBUG?)
>
> afaik yes.

Ok.

>> * percentages: new feature (nice!), but the values seem all wrong
>> (global perc. are always wrong, per-app perc. are usually ok, but see
>> second example below)
>
> Some minor latencytop userspace issues:
>
> note that the percentages are percentage this entry had compared to the
> total sum of latencies.
> The maximum entry is only loosely related to that; say you have 1
> latency in "foo" for 100ms
> but 900 of 1ms in "bar", "foo" will show 10% and "bar" will show 90%

Thanks, that explains it. Looks like a good substitute for the average
column. This is the functionality I was missing when that column went
away. Still, I wouldn't mind seeing the average column too (maybe
activated via a hotkey, or shown only if enough screen space is
available?).

>> * I miss the Average latency column. If it is too costly to keep
>> account of an overall average, can we have a last-N-seconds average?
>
> it's not costly to calculate, it's the screen space versus the value
> of the information :(

If I stretch my terminal window there's room for 2 or 3 more columns :)

>> * unknown reasons show a backtrace, but the backtrace doesn't have
>> enough room on screen
>
> still working on that; you can pass the --unknown option to dump these
> so that I can add them to the translation table.

I gathered these while writing this reply:

Unknown: put_device elv_insert blk_plug_device default_wake_function blk_execute_rq blk_rq_bio_prep blk_rq_append_bio blk_rq_map_user sg_io scsi_cmd_ioctl ip_queue_xmit tcp_transmit_skb
Unknown: md_thread autoremove_wake_function md_thread md_thread kthread child_rip kthread child_rip
Unknown: kswapd autoremove_wake_function kswapd kswapd kthread child_rip kthread child_rip

Best regards,
--Edwin
Strange interaction between latencytop and the scheduler
Hi Arjan, Ingo,

I am using the latest latencytop from git (commit
92b6ca9d40998697866689f64b95647eca3200cb), and I'm seeing some strange
things [for minor latencytop userspace-specific issues see the end of
this mail]:

* If I compile with CONFIG_SCHED_DEBUG I see a 30-40 msec latency from
the scheduler, always (Scheduler: waiting for cpu). There is also a
constant ~5% user, ~2% sys CPU usage on an idle system, regardless of
whether the latencytop sysctl is on or off. Is this normal? (Is the
overhead really 40 msec?)

I was also seeing an unusually high number of context switches (as shown
by vmstat); I usually have 400-800 with non-patched kernels (while
running mplayer too), but I was getting steadily over 1100 with the patch
(on an idle system).

CPU usage with latencytop on:
Cpu(s):  7.3%us,  7.0%sy,  0.3%ni, 85.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

CPU usage with latencytop sysctl off:
Cpu(s):  5.0%us,  1.7%sy,  0.0%ni, 93.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

However top won't tell me who's using 5% CPU!

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      20   0 10388  648  548 S  0.0  0.0   0:01.70 init
    2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      15  -5     0    0    0 S  0.0  0.0   0:00.13 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0

High number of context switches (>1100):

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0    520  60828  92276 832212    0    0    28   155  330 1215 16  3 73  8
 0  0    520  60828  92276 832212    0    0     0     2  242 1158  5  1 94  0
 0  0    520  60828  92288 832212    0    0     0     3  290 1261  8  1 91  0
 0  0    520  60836  92288 832216    0    0     1     0  227 1407 10  2 89  0

Scheduler latency:

Cause                                               Maximum     Percentage
put_device elv_insert blk_plug_device default_wake  152.4 msec   4.3 %
Creating block layer request                         68.1 msec   2.5 %
Scheduler: waiting for cpu                           26.6 msec   6.5 %
Writing a page to disk                               15.5 msec   1.8 %
Checking for media change                            10.3 msec   7.8 %
SCSI cdrom ioctl                                      9.4 msec   1.3 %
Checking CDROM media present                          7.1 msec   0.3 %
Waiting for event (poll)                              5.0 msec  59.4 %
Application requested delay                           4.9 msec  11.6 %

Process hddtemp (3772)
put_device elv_insert blk_plug_device default_wake  152.4 msec  61.5 %
    blk_rq_bio_prep blk_rq_append_bio blk_rq_map_user sg_io scsi_
Creating block layer request                         68.1 msec  36.3 %
Scheduler: waiting for cpu                            6.5 msec   2.2 %

* If I compile without CONFIG_SCHED_DEBUG, I no longer get *any* latency
from the scheduler, even if I run multi-threaded programs, etc. Is this
to be expected? (i.e. is this feature available only when enabling
CONFIG_SCHED_DEBUG?)

I was trying to trigger some scheduler latencies artificially, so I wrote
a program that creates 10 threads, each doing an usleep(1) in an infinite
loop. Still I get no scheduler latency (I was getting scheduler latency
with version 0.1).

If the latencytop sysctl is on, when running this test, after a short
while X becomes unusable and almost freezes; yet I still can't see
latencies. Before it froze I captured top output (see below); notice how
otherwise idle programs get a high CPU usage! If the latencytop sysctl is
off, no freeze occurs, and the CPU usage% is as normal (a.out getting 99%).

Top output when the latencytop sysctl is on:

top - 19:33:31 up  1:28,  2 users,  load average: 3.45, 1.21, 1.05
Tasks: 120 total,   9 running, 110 sleeping,   1 stopped,   0 zombie
Cpu(s):  3.9%us, 95.5%sy,  0.6%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2060592k total,  1979756k used,    80836k free,    69028k buffers
Swap:  3903724k total,        0k used,  3903724k free,  1473896k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7008 edwin     20   0 38764  472  356 R 58.4  0.0   0:05.94 a.out
 5613 edwin     20   0  179m  10m 7452 R 17.3  0.5   1:45.19 gkrellm
 5920 edwin     20   0  565m  90m  25m R 17.3  4.5   1:41.82 icedove-bin   <- this is an idle program!
 5610 edwin     20   0  152m  21m  13m R  1.9  1.1   1:28.88 konsole
 6749 edwin     20   0  254m 7800 5496 S  1.9  0.4   0:00.87 mplayer
 5209 root      20   0  110m  35m 9016 S  1.3  1.8   1:36.31 Xorg
  908 root      15  -5     0    0    0 S  0.6  0.0   0:01.60 scsi_eh_0
 5132 postgres  20   0 77216 1284  736 S  0.6  0.1   0:00.35 postgres
 5151 root      35  15 41068  29m  648 S  0.6  1.5   0:27.96 preload
 7005 edwin     20   0 18976 1184  880 R  0.6  0.1   0:00.
Re: [PATCH] LatencyTop: make reason (blk_execute_rq+0x7a/0xd) known
Arjan van de Ven wrote:
> Török Edwin wrote:
>>
>> Cause                Maximum        Average
>> SCSI device ioctl    34.2 msec      14.4 msec
>
> great! I'll put this into my patchkit!

Thanks.

>> I also noticed some unsigned overflows, my top latency was sometimes a
>> very large number (18), which was probably -1, or some other negative
>> number. I haven't found a way to reproduce it yet (it happens very
>> rarely).
>
> I've not seen this; I'll take another peek at the code.

I just captured this:

Cause                                    Maximum                   Average
Waiting for userspace lock               18446744073638300.0 msec  9223372036783524.0 msec
Waiting for processes to die (waitpid)   18446744073638296.0 msec  9223372036783520.0 msec
Waiting for event (poll)                 18446744073565216.0 msec  764505121372.0 msec
SCSI device ioctl                        35.2 msec                 13.7 msec
Application requested delay              19.5 msec                 0.0 msec
page fault                               18.6 msec                 4.0 msec
Reading from file                        16.6 msec                 1.5 msec
Waiting for buffer IO                    15.2 msec                 3.0 msec
mutex lock                               15.0 msec                 15.0 msec

I was also looking at /proc/latency_stats using watch, and I've seen that
there were negative numbers. Unfortunately I couldn't copy+paste them,
because they were gone too fast. I was running a program with 4 threads
doing usleep(0) in an inf-loop. However I can't reproduce the overflow
now with the same program.

Best regards,
--Edwin
[PATCH] LatencyTop: make reason (blk_execute_rq+0x7a/0xd) known
Hi Arjan, One of the top reasons in latencytop is Unknown reason (blk_execute_rq+0x7a/0xd) 152.3 msec 49.0 msec Its from hddtemp, hald-addon-storage, ... I wrote a simple systemtap script [1] to find out where its called from, and its definitely from the ioctl (see below). 1820322406 hddtemp(3999): <- latency of 153260223 ns Returning from: 0x80311ab0 : blk_execute_rq+0x0/0xd0 [] Returning to : 0x80315580 : sg_io+0x200/0x3c0 [] 0x80315a02 : scsi_cmd_ioctl+0x2c2/0x410 [] 0x80417153 : ip_queue_xmit+0x2b3/0x420 [] 0x880c3be9 : sd_ioctl+0x99/0xf0 [sd_mod] 0x80313839 : blkdev_driver_ioctl+0x39/0x40 [] 0x80313ab8 : blkdev_ioctl+0x278/0x710 [] 0x803e540d : release_sock+0x8d/0xa0 [] 0x803e54b3 : lock_sock_nested+0x93/0xa0 [] 0x8029a4ff : d_alloc+0x6f/0x1a0 [] 0x802f102f : security_d_instantiate+0x1f/0x30 [] 0x803e540d : release_sock+0x8d/0xa0 [] 0x8043c8c8 : inet_accept+0x88/0xc0 [] 0x803e408e : move_addr_to_user+0x5e/0x70 [] 0x802f0bb1 : security_socket_post_accept+0x11/0x20 [] 0x803e472e : sys_accept+0x13e/0x1d0 [] 0x802af12b : block_ioctl+0x1b/0x20 [] 0x8029482b : do_ioctl+0x1b/0x70 [] 0x80294a6d : vfs_ioctl+0x1ed/0x290 [] 0x80296281 : sys_select+0x41/0x1b0 [] 0x80294b92 : sys_ioctl+0x82/0xa0 [] 0x8020c18e : system_call+0x7e/0x83 [] I suggest adding an annotation to make this reason known to latencytop by annotating sd_ioctl/scsi_cmd_ioctl, see my patch below. After my patch, instead of blk_execute_rq the real reason is showed: Cause Maximum Average SCSI device ioctl 34.2 msec 14.4 msec I noticed there's a similar annotation in sr.c for cdrom_ioctl, but sr.c also has annotations for scsi_nonblockable_ioctl, and scsi_ioctl. If they are needed for sd_block_ioctl too, maybe scsi_nonblockable_ioctl/scsi_ioctl should have the annotation themselves. I also noticed some unsigned overflows, my top latency was sometimes a very large number (18), which was probably -1, or some other negative number. I haven't found a way to reproduce it yet (it happens very rarely). 
[1] Systemtap script:

#!/usr/bin/stap -v
global start

probe kernel.function("blk_execute_rq") {
	start[tid()] = gettimeofday_ns()
}

probe kernel.function("blk_execute_rq").return {
	# skip if we have not seen the entry for this thread
	if (!(tid() in start)) next
	delta = gettimeofday_ns() - start[tid()]
	# show backtraces for latencies above the threshold
	if (delta > 100) {
		printf("%s <- latency of %d ns\n", thread_indent(1), delta)
		print_backtrace()
	}
	delete start[tid()]
}

P.S.: LatencyTop could have a feature to ask for more details on unknown latency reasons: it would itself generate a systemtap script that shows the backtraces, to help figure out what's going on.

Best regards,
--Edwin

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a69b155..93ccfc0 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -686,6 +687,7 @@ static int sd_ioctl(struct inode * inode, struct file * filp,
 	struct scsi_device *sdp = scsi_disk(disk)->device;
 	void __user *p = (void __user *)arg;
 	int error;
+	struct latency_entry reason;
 
 	SCSI_LOG_IOCTL(1, printk("sd_ioctl: disk=%s, cmd=0x%x\n",
 				 disk->disk_name, cmd));
@@ -710,7 +712,9 @@ static int sd_ioctl(struct inode * inode, struct file * filp,
 	case SCSI_IOCTL_GET_BUS_NUMBER:
 		return scsi_ioctl(sdp, cmd, p);
 	default:
+		set_latency_reason("SCSI device ioctl", &reason);
 		error = scsi_cmd_ioctl(filp, disk->queue, disk, cmd, p);
+		restore_latency_reason(&reason);
 		if (error != -ENOTTY)
 			return error;
 	}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Improve hackbench
Ingo Molnar wrote:
> * Zhang, Yanmin <[EMAIL PROTECTED]> wrote:
>
>> hackbench is to test Linux scheduler. The original program is at
>> http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
>> Based on this multi-process version, a nice person created a
>> multi-thread version. Pls. see
>> http://www.bullopensource.org/posix/pi-futex/hackbench_pth.c
>
> great. I've uploaded your unified & improved version to:
>
>   http://redhat.com/~mingo/cfs-scheduler/tools/hackbench.c
>
> (i made some small changes - two warning fixes on gcc 4.2 and a default
> of 10 groups when hackbench is called without parameters, plus a
> printout.)

On x86-64 there's a bug [*] that causes hackbench to segfault when compiled with optimizations. In reap_worker():

	int status;
	...
	pthread_join(id, (void **)(void *)&status);

That is not correct: sizeof(void *) > sizeof(int) on x86-64, so something on the stack gets overwritten. I tried gcc -fstack-protector, but it doesn't detect it!?

This patch fixes it; after applying it, hackbench no longer segfaults:

--- hackbench.c	2008-01-04 10:08:26.0 +0200
+++ ../hackbench.c	2008-01-04 13:45:22.0 +0200
@@ -241,8 +241,10 @@
 		wait(&status);
 		if (!WIFEXITED(status))
 			exit(1);
-	} else
-		pthread_join(id, (void **)(void *)&status);
+	} else {
+		void *status;
+		pthread_join(id, (void **)&status);
+	}
 }
 
 /* One group of senders and receivers */

I also noticed that the thread version is slower than the process version:

$ ./hackbench 5 thread
Running with 5*40 (== 200) tasks.
Time: 0.413
$ ./hackbench 5 thread
Running with 5*40 (== 200) tasks.
Time: 0.423
$ ./hackbench 5 thread 20
Running with 5*40 (== 200) tasks.
Time: 0.093
$ ./hackbench 5 thread 200
Running with 5*40 (== 200) tasks.
Time: 0.827
$ ./hackbench 5 thread 2000
Running with 5*40 (== 200) tasks.
Time: 8.409
$ ./hackbench 5 process 2000
Running with 5*40 (== 200) tasks.
Time: 7.669
$ ./hackbench -pipe 5 process 2000
Running with 5*40 (== 200) tasks.
Time: 3.416
$ ./hackbench -pipe 5 thread 2000
Running with 5*40 (== 200) tasks.
Time: 4.320

[*]
$ uname -a
Linux lightspeed2 2.6.24-rc6-ge697789d #3 Wed Jan 2 11:15:05 EET 2008 x86_64 GNU/Linux
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.2 --program-suffix=-4.2 --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.2.3 20071123 (prerelease) (Debian 4.2.2-4)
$ wget http://redhat.com/~mingo/cfs-scheduler/tools/hackbench.c
--13:40:53--  http://redhat.com/~mingo/cfs-scheduler/tools/hackbench.c
           => `hackbench.c'
Resolving redhat.com... 209.132.177.50
Connecting to redhat.com|209.132.177.50|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.redhat.com/~mingo/cfs-scheduler/tools/hackbench.c [following]
--13:40:54--  http://www.redhat.com/~mingo/cfs-scheduler/tools/hackbench.c
           => `hackbench.c'
Resolving www.redhat.com... 209.132.177.50
Connecting to www.redhat.com|209.132.177.50|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c [following]
--13:40:54--  http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
           => `hackbench.c'
Resolving people.redhat.com... 66.187.233.237
Connecting to people.redhat.com|66.187.233.237|:80... connected.
HTTP request sent, awaiting response...
200 OK
Length: 8,455 (8.3K) [text/plain]
100%[>] 8,455 --.--K/s
13:40:55 (61.93 KB/s) - `hackbench.c' saved [8455/8455]

$ gcc -O2 -g -Wall -o hackbench hackbench.c -lpthread
hackbench.c:32:66: warning: missing terminating ' character
$ ./hackbench 1 thread
Running with 1*40 (== 40) tasks.
Segmentation fault
$ valgrind --trace-children=yes ./hackbench 1 thread
==27332== Memcheck, a memory error detector.
==27332== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==27332== Using LibVEX rev 1804, a library for dynamic binary translation.
==27332== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==27332== Using valgrind-3.3.0-Debian, a dynamic binary instrumentation framework.
==27332== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==27332== For more details, rerun with: -v
==27332==
Running with 1*40 (== 40) tasks.
==27332== Thread 2:
==27332== Syscall param write(buf) p
Re: [RFT] Port 0x80 I/O speed
Rene Herman wrote:
> Good day.
>
> Would some people on x86 (both 32 and 64) be kind enough to compile
> and run the attached program? This is about testing how long I/O port
> access to port 0x80 takes. It measures in CPU cycles so CPU speed is
> crucial in reporting.
>
> Posted a previous incarnation of this before, buried in the outb 0x80
> thread which had a serialising problem. This one should as far as I
> can see measure the right thing though. Please yell if you disagree...

Hi,

Tested on 2 systems.

System I:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 47
model name	: AMD Athlon(tm) 64 Processor 3200+
stepping	: 2
cpu MHz		: 2000.000
cache size	: 512 KB
fpu		: yes
fpu_exception	: yes

Motherboard: Asus A9N-E

With -m32:

[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 619, in 583
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 619, in 583
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067

After changing the timestamp read to __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi)); I get this with the 64-bit build:

[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 618, in 583
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 618, in 583
[EMAIL PROTECTED]:~$ sudo ./port80
cycles: out 1107, in 1067

If I stop cpudyn I get a constant 618/583.

System II:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 14
model name	: Genuine Intel(R) CPU T2300 @ 1.66GHz
stepping	: 8
cpu MHz		: 1667.000
cache size	: 2048 KB
physical id	: 0
siblings	: 2

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 14
model name	: Genuine Intel(R) CPU T2300 @ 1.66GHz
stepping	: 8
cpu MHz		: 1667.000
cache size	: 2048 KB
physical id	: 0
siblings	: 2

Dell Inspiron 6400, Intel Core Duo (ICH7 chipset)

thunder:/home/edwin# ./port80
cycles: out 2480, in 1867
thunder:/home/edwin# ./port80
cycles: out 2482, in 1865
thunder:/home/edwin# ./port80
cycles: out 2968, in 1893
thunder:/home/edwin# ./port80
cycles: out 1991, in 1372
thunder:/home/edwin# ./port80
cycles: out 1979, in 1366
thunder:/home/edwin# ./port80
cycles: out 2473, in 1865
thunder:/home/edwin# ./port80
cycles: out 2484, in 1869

After setting CPU governor to performance:

# ./port80
cycles: out 2368, in 1783
thunder:/home/edwin# ./port80
cycles: out 2377, in 1783
thunder:/home/edwin# ./port80
cycles: out 2367, in 1774
thunder:/home/edwin# ./port80
cycles: out 2370, in 1780
thunder:/home/edwin# ./port80
cycles: out 2365, in 1782
thunder:/home/edwin# ./port80
cycles: out 2369, in 1774
thunder:/home/edwin# ./port80
cycles: out 2366, in 1784
thunder:/home/edwin# ./port80
cycles: out 2379, in 1786
thunder:/home/edwin# ./port80
cycles: out 2367, in 1773
thunder:/home/edwin# ./port80
cycles: out 2376, in 1783
thunder:/home/edwin# ./port80
cycles: out 2360, in 1784
thunder:/home/edwin# ./port80
cycles: out 2367, in 1783
thunder:/home/edwin# ./port80
cycles: out 2370, in 1783
thunder:/home/edwin# ./port80
cycles: out 2382, in 1782

Also tried in a loop, but values are not constant:

while true; do
./port80; done
cycles: out 2415, in 1818
cycles: out 2405, in 1817
cycles: out 2414, in 1810
cycles: out 2411, in 1819
cycles: out 2407, in 1821
cycles: out 2410, in 1820
cycles: out 2418, in 1821
cycles: out 2408, in 1847
cycles: out 2404, in 1805
cycles: out 2411, in 1858
cycles: out 2395, in 1765
cycles: out 2377, in 1786
cycles: out 2378, in 1813
cycles: out 2395, in 1800
cycles: out 2381, in 1793
cycles: out 2382, in 1790
cycles: out 2399, in 1835
cycles: out 1928, in 1327
cycles: out 2410, in 1781
cycles: out 1996, in 1287
cycles: out 2369, in 1768
cycles: out 2401, in 1805
cycles: out 2395, in 1802
cycles: out 2389, in 1786
cycles: out 2359, in 1768
cycles: out 2495, in 1858
cycles: out 2408, in 1809
cycles: out 2919, in 1
Re: [PATCH] Add quirk to set AHCI mode on ICH boards
Alan Cox wrote:
> On Thu, 8 Nov 2007 22:46:22 -0500
> Jeff Garzik <[EMAIL PROTECTED]> wrote:
>
>> On Thu, Nov 08, 2007 at 10:29:37PM -0500, Mark Lord wrote:
>>
>>> And I might even privately patch my own kernels to map the AHCI BAR
>>> in the cases where the BIOS didn't...
>>
>> The inability to do this in the general case is the main reason why
>> AHCI was not unconditionally enabled, even in IDE mode, when it was
>> originally added... :/
>
> We've done it all the time for various devices without problems (eg S3
> video cards). I'd like to see it go in - although perhaps attached to a
> force_ahci boot param initially

There is one problem with force-enabling AHCI: you'll lose the CDROM on Dell laptops.

Prior to force-enabling AHCI there is one "device" that sees the 2 SATA channels and the 2 IDE channels. When you force-enable AHCI, this device becomes the AHCI controller (it changes its device id), and the IDE controller appears as a separate new device (with another device id), but it is disabled. There are registers on the ICH7 that allow you to set the enabled/disabled status, but according to the documentation you should not enable a device after it has been disabled. In practice I couldn't get the CDROM re-enabled:
* either nothing happened
* or spurious irqs were sent that nobody handled, unless I used irq=poll; but still no cdrom.

Force-enabling AHCI and not trying to enable the CDROM works, although I occasionally got NCQ errors.
For a (long) discussion see this thread on the powertop mailing list:
http://www.bughost.org/pipermail/power/2007-June/000533.html
http://www.bughost.org/pipermail/power/2007-June/000573.html

And there is also another slightly different approach:
http://mjg59.livejournal.com/76062.html

Best regards,
--Edwin