btrstress caused kernel oops after 8-ish days.

2010-04-27 Thread Sean Reifschneider
I ported my zfsstress program over to btrfs, and started running it on
a test machine a few weeks ago.  See here for more information and a link
to the program:

   http://www.tummy.com/journals/entries/jafo_20100418_124309

It looks like after around 8 days of running, there were some issues, as
shown in dmesg (below).

The system is a 64-bit Atom 330 with 2GB RAM, and a single 250GB hard
drive.  btrfs has 200GB of that.  The OS is the Fedora 13 Beta with kernel
2.6.33.1-24.fc13.x86_64.

I had started btrstress and let it run a day or so.  Then I went in and
deleted the subvolume that btrstress puts everything into, then started it
again.  A few days later, I did the same.  I also tried turning on
compression with mount -o remount,compress /data.  Around 6 hours later,
it looks like btrstress was no longer working.

The primary issue seems to be that file deletions aren't freeing up space.
btrstress fills the filesystem up, but disables all write operations whenever
the df output shows more than 95% usage.  Normally it would then delete some
snapshots or files until usage gets back down to 95% or less, and start
doing writes again.

However, after the oops, it looks like the filesystem continued to allow
removal of files and snapshots, but df no longer reflects the freed space.
For example:

   [r...@btrtest btrstress-lZ6C7txz3n]# df -h
   Filesystem      Size  Used Avail Use% Mounted on
   /dev/sda1        29G   13G   16G  45% /
   tmpfs           991M     0  991M   0% /dev/shm
   /dev/sda4       200G  189G  9.9G  96% /data
   [r...@btrtest btrstress-lZ6C7txz3n]# find /data
   /data
   /data/btrstress-lZ6C7txz3n
   [r...@btrtest btrstress-lZ6C7txz3n]# btrfs subvolume list /data
   ID 28423 top level 5 path btrstress-lZ6C7txz3n
   [r...@btrtest btrstress-lZ6C7txz3n]# du -sh /data
   4.0K    /data
   [r...@btrtest btrstress-lZ6C7txz3n]#

I've left the test system as it is; let me know if there's anything you'd
like me to try on it before I wipe it and start again.

Also, let me know if this sort of report helps.

Note that after enabling compression, but before the oops, dmesg reported a
bunch of messages like:

   btrfs: relocating block group 11840520192 flags 1
   btrfs: relocating block group 10766778368 flags 1
   btrfs: relocating block group 9693036544 flags 1
   btrfs: relocating block group 8619294720 flags 1
   btrfs: relocating block group 7545552896 flags 1
   btrfs: relocating block group 6471811072 flags 1

Note that the block group numbers started at 212630241280 and decreased by
about 1 GiB (1073741824, judging from the lines above) for every line.

dmesg output of oops below.

BUG: unable to handle kernel NULL pointer dereference at 0075
IP: [810e380f] page_cache_sync_readahead+0x15/0x3a
PGD 7a937067 PUD 3310c067 PMD 0
Oops:  [#1] SMP
last sysfs file: /sys/devices/pci:00/:00:1e.0/:04:00.1/irq
CPU 0
Pid: 30242, comm: btrfs Not tainted 2.6.33.1-24.fc13.x86_64 #1 D945GCLF2/
RIP: 0010:[810e380f]  [810e380f]
page_cache_sync_readahead+0x15/0x3a
RSP: 0018:88003309fac8  EFLAGS: 00010206
RAX:  RBX: 880046476940 RCX: 
RDX:  RSI: 88007ac840d0 RDI: 880046476b70
RBP: 88003309fac8 R08: 3f6a R09: 0246
R10: 88003309f8d8 R11:  R12: 880077422968
R13:  R14: 880046476608 R15: 
FS:  7f893574d740() GS:880004a0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 0075 CR3: 33004000 CR4: 06f0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process btrfs (pid: 30242, threadinfo 88003309e000, task 8800777a8000)
Stack:
 88003309fb68 a0364899 88003309fae8 000181c1
0 880046476a30 880046476608 88003309fb28 3f69
0  88007ac840d0 3f6a 000181c0
Call Trace:
 [a0364899] relocate_file_extent_cluster+0x18f/0x399 [btrfs]
 [a0364b46] relocate_data_extent+0xa3/0xbb [btrfs]
 [a0364e1a] relocate_block_group+0x2bc/0x384 [btrfs]
 [a036506f] btrfs_relocate_block_group+0x18d/0x312 [btrfs]
 [a034dfe7] btrfs_relocate_chunk+0x6c/0x4c2 [btrfs]
 [a033e051] ? btrfs_item_offset+0xbb/0xcb [btrfs]
 [a034c81b] ? btrfs_item_key_to_cpu+0x2a/0x46 [btrfs]
 [a034ea24] btrfs_balance+0x1ce/0x21b [btrfs]
 [811f02b0] ? inode_has_perm+0xaa/0xce
 [a0355cec] btrfs_ioctl+0x6f9/0x871 [btrfs]
 [81071226] ? sched_clock_cpu+0xc3/0xce
 [8107ba94] ? trace_hardirqs_off+0xd/0xf
 [81071274] ? cpu_clock+0x43/0x5e
 [8112c054] vfs_ioctl+0x32/0xa6
 [8112c5d4] do_vfs_ioctl+0x490/0x4d6
 [8112c670] sys_ioctl+0x56/0x79
 [81009c72] system_call_fastpath+0x16/0x1b
Code: 47 48 48 85 c0 74 04 31 f6 ff d0 48 83 c4 28 

Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

2010-04-27 Thread KOSAKI Motohiro
 
  On Apr 26, 2010, at 6:18 AM, KOSAKI Motohiro wrote:
   AOP_WRITEPAGE_ACTIVATE was introduced for the ramdisk and tmpfs case
   (and later rd chose to use another way).
   It assumes that writepage() refusals don't happen for the majority of
   pages; IOW, the VM assumes that many other pages can be written out even
   though this page can't.  So the VM only activates the page when
   AOP_WRITEPAGE_ACTIVATE is returned.
   But now ext4 and btrfs refuse all writepage() calls. (right?)
 
 No, not exactly.  Btrfs refuses the writepage() in the direct reclaim case 
 (i.e., if PF_MEMALLOC is set), but will do writepage() in the case of zone 
 scanning.  I don't want to speak for Chris, but I assume it's due to stack 
 depth concerns --- if it were just due to worrying about fs recursion issues, 
 I assume all of the btrfs allocations could be done with GFP_NOFS.
 
 Ext4 is slightly different; it refuses writepage() if the inode blocks for 
 the page haven't yet been allocated, regardless of whether it's happening 
 for direct reclaim or zone scanning.  However, if the on-disk block has been 
 assigned (i.e., this isn't a delalloc case --- for example, an mmap of an 
 already existing file, or space that has been pre-allocated using 
 fallocate()), ext4 will honor the writepage().  The reason for ext4's 
 concern is lock ordering, although I'm investigating whether I can fix this. 
 If we call set_page_writeback() to set PG_writeback (plus set the various 
 bits of magic fs accounting), and then drop the page lock, does that protect 
 us from random changes happening to the page (i.e., from vmtruncate, etc.)?
 
  
   IOW, I don't think that documentation anticipated the delayed allocation
   issue ;)

   The point is, our dirty page accounting only tracks the per-system-memory
   dirty ratio and per-task dirty pages; it doesn't track a per-NUMA-node or
   per-zone dirty ratio.  So refusing writepage() plus fake NUMA abuse can
   easily confuse our VM: if _all_ the pages on a VM LRU list (which is
   per-zone) are like that, page activation doesn't help.  It also leads
   to OOM.

   And I'm sorry, but I have to say to all VM developers that fake NUMA is
   not production quality yet.  AFAIK, nobody has seriously tested our VM
   code in such an environment.  (linux/arch/x86/Kconfig says it is only
   useful for debugging.)
 
 So I'm sorry I mentioned the fake numa bit, since I think this is a bit of a 
 red herring.  That code is in production here, and we've made all sorts of 
 changes so it can be used for more than just debugging.  So please ignore 
 it; it's our local hack, and if it breaks that's our problem.  More 
 importantly, just two weeks ago I talked to someone in the financial sector 
 who was testing out ext4 on an upstream kernel, without our hacks that force 
 128MB zones, and he ran into the ext4/OOM problem.  It involved Oracle 
 pinning down 3G worth of pages, and him trying to do a huge streaming backup 
 (which of course wasn't using fallocate or direct I/O) under ext4, and he 
 had the same issue --- an OOM that I'm pretty sure was caused by the fact 
 that ext4_writepage() was refusing the writepage() and most of the pages 
 that weren't nailed down by Oracle were delalloc.  The same test scenario 
 using ext3 worked just fine, of course.
 
 In normal cases it's not a problem, since statistically there should be 
 enough other pages in the system, compared to the number of pages that are 
 subject to delalloc, that pages can usually get pushed out until the 
 writeback code gets around to writing out the delalloc pages.  But in cases 
 where the zones have been made artificially small, or you have a big program 
 like Oracle pinning down a large number of pages, then of course we have 
 problems.
 
 I'm trying to fix things from the file system side, which means trying to 
 understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in 
 Documentation/filesystems/Locking as something which MUST be used if 
 writepage() is going to refuse a page.  And then I discovered no one is 
 actually using it.  So that's why I was asking whether the Locking 
 documentation file is out of date, or whether all of the file systems are 
 doing it wrong.
 
 As a related example of how file system code isn't necessarily following 
 what is required/recommended by the Locking documentation, ext2 and ext3 are 
 both NOT using set_page_writeback()/end_page_writeback(), but are instead 
 keeping the page locked until after they call block_write_full_page(), 
 because of concerns about truncate coming in and screwing things up.  But 
 now, looking at Locking, it appears that set_page_writeback() is as good as 
 the page lock for preventing the truncate code from coming in and screwing 
 everything up?  It's not clear to me exactly what locking guarantees are 
 provided against truncate by set_page_writeback().  And suppose we are 
 writing out a whole cluster of pages, say 4MB worth of pages; do we need to 
 call 
 

Re: [RFC] btrfs, udev and btrfs

2010-04-27 Thread Kay Sievers
On Fri, Apr 16, 2010 at 20:48, Goffredo Baroncelli kreij...@gmail.com wrote:
 Instead, the first option has the disadvantage that it needs to be used for
 every new device.  Based on this observation I wrote a udev rule which scans
 new block devices, excluding floppy and cdrom.

 Below is my udev rule:

  $ cat /etc/udev/rules.d/60-btrfs.rules

  # ghigo 15/04/2010

   ACTION!="add|change", GOTO="btrfs_scan_end"
   SUBSYSTEM!="block", GOTO="btrfs_scan_end"
   KERNEL!="sd[!0-9]*|hd[!0-9]*", GOTO="btrfs_scan_end"

   IMPORT{program}="/sbin/blkid -o udev -p $tempnode"

Udev already needs to do this anyway, so people are not encouraged to call
blkid again in their own rule files.  Just make sure you place your rule
after the existing standard call that always comes with udev; the btrfs
rules can then just depend on the variables set in the environment.
Also, there are more device types than sd* that could hold a btrfs volume,
but this is also covered by the standard udev call to blkid.
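
For example, a rule placed after that standard blkid call can just match on
the imported variable (an untested sketch; the path to the btrfs tool is an
assumption):

   # Runs only for devices blkid has already identified as btrfs.
   ENV{ID_FS_TYPE}=="btrfs", RUN+="/sbin/btrfs device scan $tempnode"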

Thanks,
Kay


making mirror

2010-04-27 Thread manio

Hello,
I created a btrfs filesystem on a single partition (/dev/sdc3) using:
mkfs.btrfs /dev/sdc3
So far it is running fine, but I would like to switch to a raid1
configuration.  Unfortunately, I didn't pass any raid options (-m raid1
-d raid1) when I created it :(


Now I would like to hot-add /dev/sda3 (which is the same size as the
currently used /dev/sdc3).  The problem is that I don't have free space on
other partitions, so I cannot move the data away, create the filesystem
again from scratch as a mirror, and copy the data back.  The perfect
solution would be hot-adding the new partition - but can I do it with
btrfs-vol?  If so, how?
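
(For reference, a rough sketch of the hot-add and raid1 conversion with
current btrfs-progs; the /mnt/data mount point is an assumption:)

   # Add the second partition to the mounted filesystem.
   btrfs device add /dev/sda3 /mnt/data
   # Rewrite existing data and metadata as raid1 across both devices.
   btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data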


regards,
--
manio
jabber/e-mail: ma...@skyboo.net
http://manio.skyboo.net


Re: btrstress caused kernel oops after 8-ish days.

2010-04-27 Thread Sean Reifschneider
On 04/27/2010 05:46 AM, Chris Mason wrote:
 This oops is fixed in later kernels, and it's why things stopped.

Thanks for the reply.  I'm not sure I have the time right now to follow the
trunk kernel for this.  If the btrfs project doesn't have test machines that
could be set up for longer-term testing of something like btrstress, let me
know and I'll look at it when I have some more time in the future.

Thanks,
Sean
-- 
Sean Reifschneider, Member of Technical Staff j...@tummy.com
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability


