btrstress caused kernel oops after 8-ish days.
I ported my zfsstress program over to btrfs, and started running it on a test machine a few weeks ago. See here for more information and a link to the program: http://www.tummy.com/journals/entries/jafo_20100418_124309

It looks like after around 8 days of running, there were some issues, as shown in dmesg (below). The system is a 64-bit Atom 330 with 2GB RAM and a single 250GB hard drive; btrfs has 200GB of that. The OS is the Fedora 13 Beta with kernel 2.6.33.1-24.fc13.x86_64.

I had started btrstress and let it run a day or so. Then I went in and deleted the subvolume that btrstress puts everything into, then started it again. A few days later, I did the same. I also tried turning on compression with "mount -o remount,compress /data". Around 6 hours later, it looks like btrstress was no longer working.

The primary issue seems to be that file deletions aren't freeing up space. btrstress will fill the file-system up, but disables any write operations if the df output shows more than 95% full. So normally it would clear up some snapshots or files until it gets back down to 95% or less, and start doing writes again. However, after the oops, it looks like it was able to continue allowing removes of files and snapshots, but df is no longer reflecting that. For example:

   [r...@btrtest btrstress-lZ6C7txz3n]# df -h
   Filesystem            Size  Used Avail Use% Mounted on
   /dev/sda1              29G   13G   16G  45% /
   tmpfs                 991M     0  991M   0% /dev/shm
   /dev/sda4             200G  189G  9.9G  96% /data
   [r...@btrtest btrstress-lZ6C7txz3n]# find /data
   /data
   /data/btrstress-lZ6C7txz3n
   [r...@btrtest btrstress-lZ6C7txz3n]# btrfs subvolume list /data
   ID 28423 top level 5 path btrstress-lZ6C7txz3n
   [r...@btrtest btrstress-lZ6C7txz3n]# du -sh /data
   4.0K    /data
   [r...@btrtest btrstress-lZ6C7txz3n]#

I've left the test system as it is; let me know if there's anything you'd like me to try on the system before I wipe it and start again. Also, let me know if this sort of report helps.
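For readers curious what the 95% throttle described above amounts to, here is a minimal sketch (not btrstress's actual code; the filesystem path and threshold are assumptions for illustration):

```shell
#!/bin/sh
# Hypothetical sketch of the 95%-full write throttle described above.
# FS and THRESHOLD are illustrative assumptions, not btrstress values.
FS=/
THRESHOLD=95

# Parse the Use% column from df -- the same figure btrstress watches.
usage=$(df --output=pcent "$FS" | tail -n 1 | tr -dc '0-9')

if [ "$usage" -gt "$THRESHOLD" ]; then
    echo "over ${THRESHOLD}%: pause writes, remove snapshots/files"
else
    echo "at or under ${THRESHOLD}%: writes allowed"
fi
```

The bug report above is that after the oops, removals succeed but this df figure never drops, so a throttle like this stays latched.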
Note that after enabling compression, but before the oops, dmesg reported a bunch of messages like:

   btrfs: relocating block group 11840520192 flags 1
   btrfs: relocating block group 10766778368 flags 1
   btrfs: relocating block group 9693036544 flags 1
   btrfs: relocating block group 8619294720 flags 1
   btrfs: relocating block group 7545552896 flags 1
   btrfs: relocating block group 6471811072 flags 1

Note that the group numbers started at 212630241280 and decreased by around a billion (1GiB) with every line.

dmesg output of the oops below:

   BUG: unable to handle kernel NULL pointer dereference at 0075
   IP: [810e380f] page_cache_sync_readahead+0x15/0x3a
   PGD 7a937067 PUD 3310c067 PMD 0
   Oops: [#1] SMP
   last sysfs file: /sys/devices/pci:00/:00:1e.0/:04:00.1/irq
   CPU 0
   Pid: 30242, comm: btrfs Not tainted 2.6.33.1-24.fc13.x86_64 #1 D945GCLF2/
   RIP: 0010:[810e380f] [810e380f] page_cache_sync_readahead+0x15/0x3a
   RSP: 0018:88003309fac8 EFLAGS: 00010206
   RAX: RBX: 880046476940 RCX:
   RDX: RSI: 88007ac840d0 RDI: 880046476b70
   RBP: 88003309fac8 R08: 3f6a R09: 0246
   R10: 88003309f8d8 R11: R12: 880077422968
   R13: R14: 880046476608 R15:
   FS: 7f893574d740() GS:880004a0() knlGS:
   CS: 0010 DS: ES: CR0: 8005003b
   CR2: 0075 CR3: 33004000 CR4: 06f0
   DR0: DR1: DR2: DR3:
   DR6: 0ff0 DR7: 0400
   Process btrfs (pid: 30242, threadinfo 88003309e000, task 8800777a8000)
   Stack:
    88003309fb68 a0364899 88003309fae8 000181c1
    0 880046476a30 880046476608 88003309fb28
    3f69 0 88007ac840d0 3f6a 000181c0
   Call Trace:
    [a0364899] relocate_file_extent_cluster+0x18f/0x399 [btrfs]
    [a0364b46] relocate_data_extent+0xa3/0xbb [btrfs]
    [a0364e1a] relocate_block_group+0x2bc/0x384 [btrfs]
    [a036506f] btrfs_relocate_block_group+0x18d/0x312 [btrfs]
    [a034dfe7] btrfs_relocate_chunk+0x6c/0x4c2 [btrfs]
    [a033e051] ? btrfs_item_offset+0xbb/0xcb [btrfs]
    [a034c81b] ? btrfs_item_key_to_cpu+0x2a/0x46 [btrfs]
    [a034ea24] btrfs_balance+0x1ce/0x21b [btrfs]
    [811f02b0] ? inode_has_perm+0xaa/0xce
    [a0355cec] btrfs_ioctl+0x6f9/0x871 [btrfs]
    [81071226] ? sched_clock_cpu+0xc3/0xce
    [8107ba94] ? trace_hardirqs_off+0xd/0xf
    [81071274] ? cpu_clock+0x43/0x5e
    [8112c054] vfs_ioctl+0x32/0xa6
    [8112c5d4] do_vfs_ioctl+0x490/0x4d6
    [8112c670] sys_ioctl+0x56/0x79
    [81009c72] system_call_fastpath+0x16/0x1b
   Code: 47 48 48 85 c0 74 04 31 f6 ff d0 48 83 c4 28
Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
On Apr 26, 2010, at 6:18 AM, KOSAK

> AOP_WRITEPAGE_ACTIVATE was introduced for the ramdisk and tmpfs thing (and later rd chose to use another way). It assumes writepage refusals don't happen on the majority of pages. IOW, the VM assumes many other pages can be written out even though this particular page can't, so the VM only activates the page when AOP_WRITEPAGE_ACTIVATE is returned. But now ext4 and btrfs refuse all writepage(). (right?)

No, not exactly. Btrfs refuses the writepage() in the direct reclaim cases (i.e., if PF_MEMALLOC is set), but will do the writepage() in the case of zone scanning. I don't want to speak for Chris, but I assume it's due to stack depth concerns --- if it were just due to worrying about fs recursion issues, I assume all of the btrfs allocations could be done GFP_NOFS.

Ext4 is slightly different; it refuses the writepage() if the on-disk blocks for the page haven't yet been allocated, regardless of whether it's happening for direct reclaim or zone scanning. However, if the on-disk block has been assigned (i.e., this isn't a delalloc case --- for example, an mmap of an already-existing file, or space that has been pre-allocated using fallocate()), ext4 will honor the writepage(). The reason for ext4's concern is lock ordering, although I'm investigating whether I can fix this. If we call set_page_writeback() to set PG_writeback (plus set the various bits of magic fs accounting), and then drop the page lock, does that protect us from random changes happening to the page (i.e., from vmtruncate, etc.)?

> IOW, I don't think the documentation anticipated the delayed allocation issue ;) The point is, our dirty page accounting only tracks the per-system-memory dirty ratio and per-task dirty pages. It doesn't track a per-numa-node nor per-zone dirty ratio. So refusing writepage(), combined with fake numa abuse, can easily confuse our VM. If _all_ pages in a VM LRU list (it's per-zone) are like this, page activation doesn't help. It also leads to OOM. And I'm sorry, I have to say to all vm developers: fake numa is not production level quality yet. afaik, nobody has seriously tested our vm code in such an environment. (linux/arch/x86/Kconfig says "This is only useful for debugging.")

So I'm sorry I mentioned the fake numa bit, since I think this is a bit of a red herring. That code is in production here, and we've made all sorts of changes so it can be used for more than just debugging. So please ignore it; it's our local hack, and if it breaks that's our problem.

More importantly, just two weeks ago I talked to someone in the financial sector who was testing out ext4 on an upstream kernel, not using our hacks that force 128MB zones, and he ran into the ext4/OOM problem on that upstream kernel. It involved Oracle pinning down 3G worth of pages, and him trying to do a huge streaming backup (which of course wasn't using fallocate or direct I/O) under ext4, and he had the same issue --- an OOM, which I'm pretty sure was caused by the fact that ext4_writepage() was refusing the writepage() and most of the pages that weren't nailed down by Oracle were delalloc. The same test scenario using ext3 worked just fine, of course.

Under normal circumstances it's not a problem, since statistically there should be enough other pages in the system compared to the number of pages subject to delalloc that pages can usually get pushed out until the writeback code gets around to writing them. But in cases where the zones have been made artificially small, or where a big program like Oracle is pinning down a large number of pages, then of course we have problems.

I'm trying to fix things from the file system side, which means trying to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in Documentation/filesystems/Locking as something which MUST be used if writepage() is going to refuse a page. And then I discovered no one is actually using it.
So that's why I was asking whether the Locking documentation file is out of date, or whether all of the file systems are doing it wrong.

As a related example of how file system code isn't necessarily following what is required/recommended by the Locking documentation: ext2 and ext3 are both NOT using set_page_writeback()/end_page_writeback(), but are instead keeping the page locked until after they call block_write_full_page(), out of concern for truncate coming in and screwing things up. But now, looking at Locking, it appears that set_page_writeback() is as good as the page lock for preventing the truncate code from coming in and screwing everything up? It's not clear to me exactly what locking guarantees are provided against truncate by set_page_writeback(). And suppose we are writing out a whole cluster of pages, say 4MB worth of pages; do we need to call
Re: [RFC] btrfs, udev and btrfs
On Fri, Apr 16, 2010 at 20:48, Goffredo Baroncelli kreij...@gmail.com wrote:

> Instead, the first option has the disadvantage that it needs to be used for every new device. Based on this observation, I wrote a udev rule which scans new block devices, excluding floppy and cdrom. Below is my udev rule:
>
> $ cat /etc/udev/rules.d/60-btrfs.rules
> # ghigo 15/04/2010
> ACTION!="add|change", GOTO="btrfs_scan_end"
> SUBSYSTEM!="block", GOTO="btrfs_scan_end"
> KERNEL!="sd[!0-9]*|hd[!0-9]*", GOTO="btrfs_scan_end"
> IMPORT{program}="/sbin/blkid -o udev -p $tempnode"

Udev needs to do this already anyway. People are not encouraged to call this in their own rule files again. Just make sure you place your rule after the existing standard call that always comes with udev; the btrfs rules can then simply depend on the variable set in the environment. Also, there are more devices than sd* which could have a btrfs volume, but this is also covered by the standard udev call to blkid.

Thanks,
Kay
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
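Following Kay's suggestion, a rule that depends on the ID_FS_TYPE variable already imported by udev's standard blkid rule might look like the following. This is a sketch, not a rule shipped by udev or btrfs-progs; the file name and the device-scan invocation are assumptions:

```
# /etc/udev/rules.d/70-btrfs-scan.rules  (hypothetical file name)
# Relies on ID_FS_TYPE being set by udev's standard blkid import,
# so it matches any block device carrying btrfs, not just sd*/hd*.
SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="btrfs", \
    RUN+="/sbin/btrfs device scan $env{DEVNAME}"
```

Because it keys on the environment rather than re-running blkid, it avoids the duplicate probe and the KERNEL name whitelist of the rule quoted above.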
making mirror
Hello,

I've created a btrfs on a single partition (/dev/sdc3) using:

   mkfs.btrfs /dev/sdc3

So far it is running fine, but I would like to switch to a raid1 configuration. Unfortunately, I didn't pass any information about raid (-m raid1 -d raid1) when I created it :(

Now I would like to hot-add /dev/sda3 (which is the same size as the currently used /dev/sdc3). The problem is that I don't have free space on other partitions, so I cannot move the data away, create the FS from scratch, and copy the data back to a running mirror. The perfect solution would be hot-adding the new partition -- but can I do that with btrfs-vol? If so, how?

regards,
--
manio
jabber/e-mail: ma...@skyboo.net
http://manio.skyboo.net
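For reference, the add-then-rebalance sequence would look roughly like the following. This is an untested sketch: the device names and the /mnt mount point are taken from the question, and note that with the tools contemporary to this thread, btrfs-vol spreads chunks across devices but does not by itself convert existing single/DUP chunks to raid1 -- explicit profile conversion arrived later as balance "convert" filters:

```shell
# Sketch only; requires root and the actual devices. Paths assumed.

# With btrfs-vol (tools contemporary to this thread):
btrfs-vol -a /dev/sda3 /mnt   # hot-add the second partition
btrfs-vol -b /mnt             # rebalance chunks across both devices

# With later btrfs-progs, including conversion of the existing
# data and metadata profiles to raid1:
btrfs device add /dev/sda3 /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
```

Either way, no data needs to be copied off the filesystem first; the rebalance rewrites chunks in place across the two devices.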
Re: btrstress caused kernel oops after 8-ish days.
On 04/27/2010 05:46 AM, Chris Mason wrote:

> This oops is fixed in later kernels, and it's why things stopped.

Thanks for the reply. I'm not sure I have the time right now to follow the trunk kernel for this. If the btrfs project doesn't have test machines that could be set up for longer-term testing of something like btrstress, let me know and I'll look at it when I have some more time in the future.

Thanks,
Sean
--
Sean Reifschneider, Member of Technical Staff j...@tummy.com
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability