Re: [BUG] Linux 2.6.25-rc2 - Kernel Ooops while running dbench
Two x86-64 boxes here lock up on 2.6.25-rc2, shortly after boot. One is running Fedora 8 + X (GNOME) and one is a headless file server. Configs and lspci attached. Unable to capture any splatter so far. Bisecting...

00:00.0 Host bridge: Intel Corporation 82955X Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation 82955X PCI Express Root Port
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5751 Gigabit Ethernet PCI Express (rev 01)
05:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)

00:00.0 Host bridge: Intel Corporation 82975X Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation 82975X PCI Express Root Port
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 01)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GH (ICH7DH) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 VGA compatible controller: ATI Technologies Inc R580 [Radeon X1900 XT] (Primary)
01:00.1 Display controller: ATI Technologies Inc R580 [Radeon X1900 XT] (Secondary)
02:00.0 Multimedia controller: Philips Semiconductors Unknown device 7162
04:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
05:02.0 Network controller: RaLink RT2561/RT61 802.11g PCI
05:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)
05:05.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)

pretzel.bz2 Description: application/bzip
core.bz2 Description: application/bzip
Re: [RFC] basic delayed allocation in VFS
Alex Tomas wrote:

So without the ability to attach specific I/O completions to bios or support for unwritten extents directly in __mpage_writepage, there is no way XFS can use this generic delayed allocation code.

I didn't say generic, see Subject: :)

Well, it shouldn't even be in the VFS layer if it's only usable by one filesystem.

Jeff

- To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] basic delayed allocation in VFS
Alex Tomas wrote:

Jeff Garzik wrote:

Is this based on Christoph's work? Christoph, or some other XFS hacker, already did generic delalloc, modeled on the XFS delalloc code.

nope, this one is simple (something I'd prefer for ext4).

The XFS one is proven and the work was already completed. What were the specific technical issues that made it unsuitable for ext4?

I would rather not reinvent the wheel, particularly if the reinvention is less capable than the existing work.

Jeff
Re: [RFC] basic delayed allocation in VFS
Alex Tomas wrote:

Good day, please review ...

thanks, Alex

basic delayed allocation in VFS:

* block_prepare_write() can be passed a special ->get_block() which doesn't allocate blocks, but reserves them and marks the bh delayed
* a filesystem can use mpage_da_writepages() with another ->get_block() which doesn't defer allocation. mpage_da_writepages() finds all non-allocated blocks and tries to allocate them with minimal calls to ->get_block(), then submits IO using __mpage_writepage()

Signed-off-by: Alex Tomas [EMAIL PROTECTED]

Is this based on Christoph's work? Christoph, or some other XFS hacker, already did generic delalloc, modeled on the XFS delalloc code.

Jeff
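For readers following along, the two-phase scheme in the patch description (reserve at write time, allocate in batches at writeback) can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: `toy_file`, `toy_write()`, `toy_writeback()` and the bump allocator are hypothetical names invented for illustration, and none of this is the kernel code from the patch.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 16

enum blk_state { BLK_HOLE, BLK_DELAYED, BLK_MAPPED };

/* Hypothetical in-memory stand-in for a file's block mapping. */
struct toy_file {
    enum blk_state state[NBLOCKS]; /* state of each logical block */
    long phys[NBLOCKS];            /* physical block number once mapped */
    long reserved;                 /* blocks reserved but not yet allocated */
    long next_free;                /* next free physical block (bump allocator) */
};

void toy_init(struct toy_file *f, long first_free)
{
    memset(f, 0, sizeof(*f));      /* BLK_HOLE is 0, so every block starts as a hole */
    f->next_free = first_free;
}

/* Write path: reserve space and mark the block delayed; nothing is allocated yet. */
void toy_write(struct toy_file *f, int lblk)
{
    assert(lblk >= 0 && lblk < NBLOCKS);
    if (f->state[lblk] == BLK_HOLE) {
        f->state[lblk] = BLK_DELAYED;
        f->reserved++;
    }
}

/* Allocator: hand out `len` contiguous physical blocks in a single call. */
static long toy_alloc_extent(struct toy_file *f, int len)
{
    long start = f->next_free;
    f->next_free += len;
    return start;
}

/*
 * Writeback path: find each run of adjacent delayed blocks and map the
 * whole run with one allocator call. Returns the number of allocator calls.
 */
int toy_writeback(struct toy_file *f)
{
    int calls = 0;
    int i = 0;

    while (i < NBLOCKS) {
        if (f->state[i] != BLK_DELAYED) {
            i++;
            continue;
        }
        int end = i;
        while (end < NBLOCKS && f->state[end] == BLK_DELAYED)
            end++;
        long start = toy_alloc_extent(f, end - i);
        calls++;
        for (int j = i; j < end; j++) {
            f->state[j] = BLK_MAPPED;
            f->phys[j] = start + (j - i);
            f->reserved--;
        }
        i = end;
    }
    return calls;
}
```

Writing logical blocks 0, 1, 2, 5 and 6 and then calling `toy_writeback()` makes two allocator calls (one per run) instead of five, which is the "minimal calls" property the description refers to.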
Re: [PATCH 0/6][TAKE5] fallocate system call
Theodore Tso wrote:

I don't think we have a problem here. What we have now is fine, and

It's fine for ext4, but not the wider world. This is a common problem created by parallel development when code dependencies exist.

In any case, the plan is to push all of the core bits into Linus tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like.

Presumably you mean 2.6.23.

Jeff
Re: [PATCH 0/6][TAKE5] fallocate system call
Andrew Morton wrote:

b) We do what we normally don't do and reserve the syscall slots in mainline.

If everyone agrees it's going to happen... why not?

Jeff
Re: [RFC] Heads up on sys_fallocate()
Amit K. Arora wrote:

This is to give a heads up on a few patches that we will soon be coming up with. These patches implement a new system call sys_fallocate() and a new inode operation fallocate, for persistent preallocation. The new system call, as Andrew suggested, will look like:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);

As we are developing and testing the required patches, we decided to post a preliminary patch and get inputs from the community to give it the right direction and shape.

First, a little description of the feature. Persistent preallocation is a file system feature with which an application (say, a relational database server) can explicitly preallocate blocks to a particular file. This feature can be used to reserve space for a file, mainly to get the following benefits:

1. contiguity - less fragmentation and thus faster access speed, and
2. a guarantee of minimum space availability (depending on how many blocks were preallocated) for the file, even if the filesystem becomes full.

XFS already has an implementation for this, using an ioctl interface. And ext4 is now coming up with this feature. In time we may see a few more file systems implementing this. Thus, it makes sense to have a more standard interface for this, like this new system call.

Here is the initial and incomplete version of the patch, which can be used for the discussion until we come up with a set of more complete patches.

---
 arch/i386/kernel/syscall_table.S |    1 +
 fs/ext4/file.c                   |    1 +
 fs/open.c                        |   18 ++
 include/asm-i386/unistd.h        |    3 ++-
 include/linux/fs.h               |    1 +
 include/linux/syscalls.h         |    1 +
 6 files changed, 24 insertions(+), 1 deletion(-)

I certainly agree that we want something like this.

posix_fallocate() is the glibc interface we want to be compatible with (which your definition is, AFAICS).

Jeff
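As a point of reference for the compatibility remark above, here is a minimal user-space sketch of the existing posix_fallocate() interface, whose (fd, offset, len) arguments line up with the proposed sys_fallocate(). The helper names `reserve_space()`, `file_size()` and `open_scratch_file()` and the /tmp path are assumptions made for illustration; only posix_fallocate(), fstat(), mkstemp() and unlink() are real library calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve `len` bytes starting at `offset` for the open file `fd`.
 * Returns 0 on success or an error number on failure: posix_fallocate()
 * reports errors through its return value rather than through errno.
 */
int reserve_space(int fd, off_t offset, off_t len)
{
    assert(fd >= 0);
    return posix_fallocate(fd, offset, len);
}

/* Return the current file size in bytes, or -1 if fstat() fails. */
off_t file_size(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return st.st_size;
}

/* Hypothetical helper: create an anonymous scratch file under /tmp. */
int open_scratch_file(void)
{
    char path[] = "/tmp/falloc-XXXXXX";
    int fd = mkstemp(path);

    if (fd >= 0)
        unlink(path); /* the file disappears once fd is closed */
    return fd;
}
```

After a successful `reserve_space(fd, 0, 4096)` on an empty file, `file_size(fd)` reports 4096, since POSIX specifies that the file is extended to offset+len when it was shorter.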
Re: How git affects kernel.org performance
Theodore Tso wrote:

The fastest and probably most important thing to add is some readahead smarts to directories --- both to the htree and non-htree cases. If you're using some kind of b-tree structure, such as XFS does for directories, preallocation doesn't help you much. Delayed allocation can save you if your delayed allocator knows how to structure disk blocks so that a btree-traversal is efficient, but I'm guessing the biggest reason why we are losing is because we don't have sufficient readahead. This also has the advantage that it will help without needing to do a backup/restore to improve layout.

Something I just thought of: ATA and SCSI hard disks do their own read-ahead. Seeking all over the place to pick up bits of directory will hurt even more with the disk reading and throwing away data (albeit in its internal elevator and cache).

Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:
On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:

But it is a race that is _easily_ handled, and applications only need to implement one interface, not a different method for every filesystem that requires deep filesystem knowledge. Besides, you still have to handle the case where the block you want has already been allocated because reading the metadata from userspace doesn't prevent the kernel from allocating the block you want before you ask for it...

The race is easily handled either way, by having the block move fail when you tell the kernel the destination blocks.

So why are you arguing that an interface is no good because it is fundamentally racy? ;)

My point was that it is silly to introduce obviously racy code into the kernel, when -- inside the kernel -- it could be handled race-free.

If you accept a racy solution, you might as well do it outside the kernel, where you get the same results, but without adding silliness and bloat to the kernel.

Every major filesystem has a libfoofs library that makes it trivial to read the metadata, so all you need to do is use an existing lib.

IOWs, you are advocating that any application that wants to use this special allocation technique needs to link against every different filesystem library and it then needs to implement filesystem specific searches through their metadata? Nobody in their right mind would ever want to use an interface like this.

Online defrag is OBVIOUSLY highly filesystem specific. You have to link against filesystem specific code somewhere, whether it's inside the kernel or outside the kernel.

Further, in the case being discussed in this thread, ext2meta has already been proven a workable solution.

Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 06:11:37PM +1000, David Chinner wrote:
On Wed, Oct 25, 2006 at 02:01:42AM -0400, Jeff Garzik wrote:
On Wed, Oct 25, 2006 at 03:38:23PM +1000, David Chinner wrote:
On Wed, Oct 25, 2006 at 12:48:44AM -0400, Jeff Garzik wrote:

So why are you arguing that an interface is no good because it is fundamentally racy? ;)

My point was that it is silly to introduce obviously racy code into the kernel, when -- inside the kernel -- it could be handled race-free.

So how do you then get the generic interface to allocate blocks specified by userspace race free?

As has been repeatedly stated, there is no generic. There MUST be filesystem-specific knowledge during these operations.

If userspace directed allocation requires deep knowledge of the filesystem metadata (this is what you are saying they need to do, right?), then these applications will never, ever make use of this interface and we'll continue to have problems with them.

Completely false assumptions. There is no difference in handling of knowledge, be it kernel space or userspace.

Further, in the case being discussed in this thread, ext2meta has already been proven a workable solution.

Sure, but that's not a generic solution to a problem common to all filesystems

You clearly don't know what I'm talking about. ext2meta is an example of a filesystem-specific metadata access method, applicable to tasks such as online optimization. Implement that tiny kernel module for each filesystem, and you have everything you need, without races.

This was discussed years ago; review the mailing lists. Google for 'Alexander Viro' and 'ext2meta'.

Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 04:54:50PM +0200, Jan Kara wrote:

Yes, this sounds feasible. We could split the defrag ioctl into two pieces (addition of given extent to a file and swapping of extents), which can have generic interface...

An ioctl is UGLY.

This was discussed years ago. Google for 'Alexander Viro' and 'ext2meta'. That's a clean, flexible, extensible way to access metadata online. No need for ioctl binary translation across 32bit-64bit, or any other ioctl issue.

Jeff
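To make the 32bit-64bit translation concern concrete, here is a small sketch of why a binary ioctl argument block can need translating. Both structs are hypothetical and invented for illustration (neither is the argument block of the defrag ioctl under discussion): one uses native C types, whose size follows the ABI, and one uses fixed-width types, whose layout is the same everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical argument block built from native types. On Linux, long and
 * pointers are 4 bytes on i386 and 8 bytes on x86-64, so a 32-bit program
 * and a 64-bit kernel disagree about this layout.
 */
struct defrag_args_native {
    long start;
    long len;
    void *buf;
};

/*
 * The same fields with fixed-width types. The layout is 24 bytes on both
 * 32-bit and 64-bit ABIs, so no translation layer is required.
 */
struct defrag_args_fixed {
    uint64_t start;
    uint64_t len;
    uint64_t buf; /* user pointer carried as an integer */
};

size_t native_args_size(void)
{
    return sizeof(struct defrag_args_native);
}

size_t fixed_args_size(void)
{
    return sizeof(struct defrag_args_fixed);
}
```

On Linux the native struct is 12 bytes when built for i386 and 24 bytes for x86-64, while the fixed-width one is 24 bytes in both cases. A file-based interface that is driven through plain read and write calls on metadata objects avoids defining such per-command binary structures in the first place.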
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 08:25:30PM +0200, Jan Kara wrote:

I see. So you mean that in our ext3meta filesystem we'd have a file named add_this_extent_to_inode and a file reloc_inode_interval and they'd be fed essentially the same info as the current ioctl interface and do the same thing as we currently do. Hmm, I don't find it that nice any more but yes, this would work.

It depends on the operation. ext2meta[1] works fine for online defrag, just exporting metadata objects and providing read(2) and write(2) operations on them.

Adding 'trigger' files (like your add_this_extent_to_inode) may make sense for some operations, indeed, but we need to see the whole picture before really understanding whether that interface is optimal.

Jeff

[1] http://linux.yyz.us/misc/ext2meta.c
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 08:36:56PM +0200, Jan Kara wrote:

Yes, but there's a question of the interface to this operation. How to specify which indirect block I mean? Obviously we could introduce separate call for remapping indirect blocks but I find this solution kind of clumsy...

Agreed... that gets nasty real quick.

Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 12:30:02PM +1000, Barry Naujok wrote:

Could we have a more abstract method for asking the filesystem where the free blocks are and then using the same block addressing to tell the fs where to allocate/move the file's data to?

That's fundamentally racy, so you might as well just read the filesystem metadata from userspace. No need to go through the kernel for that.

Jeff
Re: [RFC] Ext3 online defrag
On Wed, Oct 25, 2006 at 02:27:53PM +1000, David Chinner wrote:

But it is a race that is _easily_ handled, and applications only need to implement one interface, not a different method for every filesystem that requires deep filesystem knowledge. Besides, you still have to handle the case where the block you want has already been allocated because reading the metadata from userspace doesn't prevent the kernel from allocating the block you want before you ask for it...

The race is easily handled either way, by having the block move fail when you tell the kernel the destination blocks.

The difference is that you don't unnecessarily bloat the kernel.

Every major filesystem has a libfoofs library that makes it trivial to read the metadata, so all you need to do is use an existing lib.

Jeff
Re: [RFC] Ext3 online defrag
On Mon, Oct 23, 2006 at 06:31:40PM +0400, Alex Tomas wrote:

isn't that a kernel responsibility to find/allocate target blocks? wouldn't it be better to specify desirable target group and minimal acceptable chunk of free blocks?

The kernel doesn't have enough knowledge to know whether or not the defragger prefers one blkdev location over another.

When you are trying to consolidate blocks, you must specify the destination as well as source blocks.

Certainly, to prevent corruption and other nastiness, you must fail if the destination isn't available... (ext2meta did all this...)

Jeff
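The "fail if the destination isn't available" rule that recurs in this thread can be shown with a toy model. This is a sketch under stated assumptions: `toy_fs`, `toy_alloc()` and `toy_move_block()` are hypothetical names, the block map is a plain in-memory array, and nothing here reflects how ext2meta or any real filesystem implements the check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NBLOCKS 8

/* Hypothetical block allocation map: true means the block is in use. */
struct toy_fs {
    bool in_use[NBLOCKS];
};

/* Ordinary allocation, standing in for the filesystem's own activity. */
int toy_alloc(struct toy_fs *fs, int blk)
{
    if (blk < 0 || blk >= NBLOCKS)
        return -EINVAL;
    if (fs->in_use[blk])
        return -EBUSY;
    fs->in_use[blk] = true;
    return 0;
}

/*
 * Move request with caller-specified source and destination. The check on
 * the destination happens at move time, so a stale view of the free map in
 * the caller leads to a clean failure rather than corruption.
 */
int toy_move_block(struct toy_fs *fs, int src, int dst)
{
    if (src < 0 || src >= NBLOCKS || dst < 0 || dst >= NBLOCKS)
        return -EINVAL;
    if (!fs->in_use[src])
        return -ENOENT;   /* nothing to move */
    if (fs->in_use[dst])
        return -EBUSY;    /* destination was taken in the meantime */
    fs->in_use[dst] = true;
    fs->in_use[src] = false;
    return 0;
}
```

If a defragger reads the map, sees block 5 free, and the filesystem then allocates block 5 before the move request arrives, `toy_move_block(&fs, 1, 5)` returns -EBUSY and block 1 is left untouched; the defragger simply picks another destination and retries.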