Re: Kernel error during btrfs balance

2011-01-26 Thread Erik Logtenberg
Hi,

It took me a couple of days, because I needed to patch my kernel first
and then issue a rebalance, which ran for more than two days.
Nevertheless, the rebalance succeeded without any kernel BUG-messages,
so apparently your patch works!

I noticed that at first, the messages were like this:

[79329.526490] btrfs: found 1939 extents
[79375.950834] btrfs: found 1939 extents
[79376.083599] btrfs: relocating block group 352220872704 flags 1
[80052.940435] btrfs: found 3786 extents
[80108.439657] btrfs: found 3786 extents
[80112.325548] btrfs: relocating block group 351147130880 flags 1

Just like I saw during previous balance-runs. Then all of a sudden the
messages changed to:

[104178.827594] btrfs allocation failed flags 1, wanted 2013265920
[104178.827599] space_info has 4271198208 free, is not full
[104178.827602] space_info total=214748364800, used=210440957952,
pinned=0, reserved=36208640, may_use=3168993280, readonly=0
[104178.827606] block group 1107296256 has 5368709120 bytes, 5368582144
used 0 pinned 0 reserved
[104178.827610] entry offset 1778384896, bytes 86016, bitmap yes
[104178.827612] entry offset 1855827968, bytes 20480, bitmap no
[104178.827614] entry offset 1855852544, bytes 20480, bitmap no
[104178.827617] block group has cluster?: no
[104178.827618] 0 blocks of free space at or bigger than bytes is
[104178.827621] block group 8623489024 has 5368709120 bytes, 5368705024
used 0 pinned 0 reserved
[104178.827624] entry offset 8891924480, bytes 4096, bitmap yes
[104178.827626] block group has cluster?: no
[104178.827628] 0 blocks of free space at or bigger than bytes is
[104178.827631] block group 17213423616 has 5368709120 bytes, 5368709120
used 0 pinned 0 reserved
[104178.827634] block group has cluster?: no

And so on.

Does this indicate an error of any sort, or is this expected behaviour?

Kind regards,

Erik.


On 01/21/2011 10:19 AM, Yan, Zheng wrote:
 please try patch attached below, Thanks.
 
 ---
 diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
 index b37d723..49d6b13 100644
 --- a/fs/btrfs/relocation.c
 +++ b/fs/btrfs/relocation.c
 @@ -1158,6 +1158,7 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
   new_node->bytenr = dest->node->start;
   new_node->level = node->level;
   new_node->lowest = node->lowest;
 + new_node->checked = 1;
   new_node->root = dest;
 
   if (!node->lowest) {
 ---
 
 
 On Fri, Jan 21, 2011 at 4:50 PM, Erik Logtenberg e...@logtenberg.eu wrote:
 Hi,

 I hit the same bug again I think:

 [291835.724344] ------------[ cut here ]------------
 [291835.724376] kernel BUG at fs/btrfs/relocation.c:836!
 [291835.724401] invalid opcode:  [#1] SMP
 [291835.724424] last sysfs file:
 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
 [291835.724461] CPU 0
 [291835.724472] Modules linked in: uvcvideo snd_usb_audio
 snd_usbmidi_lib videodev v4l1_compat snd_rawmidi v4l2_compat_ioctl32
 btrfs zlib_deflate libcrc32c sha256_generic cryptd aes_x86_64
 aes_generic cbc dm_crypt tun ebtable_nat ebtables ipt_MASQUERADE
 iptable_nat nf_nat bridge stp llc nfsd lockd nfs_acl auth_rpcgss
 exportfs nls_utf8 cifs fscache sunrpc cpufreq_ondemand acpi_cpufreq
 freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6
 ip6table_filter ip6_tables ipv6 kvm_intel kvm dummy uinput
 snd_hda_codec_analog snd_hda_intel snd_hda_codec snd_hwdep snd_seq
 snd_seq_device e1000e snd_pcm snd_timer i2c_i801 snd shpchp iTCO_wdt
 iTCO_vendor_support soundcore dell_wmi sparse_keymap snd_page_alloc
 serio_raw joydev wmi dcdbas microcode usb_storage uas raid1 pata_acpi
 ata_generic radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last
 unloaded: scsi_wait_scan]
 [291835.725002]
 [291835.725013] Pid: 27386, comm: btrfs Tainted: G  I
 2.6.37-2.fc15.x86_64 #1
 [291835.725062] RIP: 0010:[a0565237]  [a0565237]
 build_backref_tree+0x473/0xd6d [btrfs]
 [291835.725126] RSP: 0018:8800373bf9c8  EFLAGS: 00010246
 [291835.725152] RAX: 8801367d5100 RBX: 88020b110880 RCX:
 0040
 [291835.725186] RDX: 0030 RSI: 006dd08d3000 RDI:
 880100069820
 [291835.725219] RBP: 8800373bfaf8 R08: 8050 R09:
 8800373bf980
 [291835.725253] R10: 8800373bf918 R11: 88020b110880 R12:
 8801367d5100
 [291835.725254] R13: 88012c0a24c0 R14: 88021e2013f0 R15:
 88021e201cf0
 [291835.725254] FS:  7fcb1a6cc760() GS:8800bfa0()
 knlGS:
 [291835.725254] CS:  0010 DS:  ES:  CR0: 8005003b
 [291835.725254] CR2: 02feeeb8 CR3: 0001c2943000 CR4:
 000426e0
 [291835.725254] DR0:  DR1:  DR2:
 
 [291835.725254] DR3:  DR6: 0ff0 DR7:
 0400
 [291835.725254] Process btrfs (pid: 27386, threadinfo 8800373be000,
 task 88022452ae40)
 [291835.725254] Stack:
 [291835.725254]  ea0004b5a470 ea00 8800373bf9f8
 

Re: Kernel error during btrfs balance

2011-01-26 Thread Hugo Mills
On Wed, Jan 26, 2011 at 10:04:02AM +0100, Erik Logtenberg wrote:
 Hi,
 
 It took me a couple of days, because I needed to patch my kernel first
 and then issue a rebalance, which ran for more than two days.
 Nevertheless, the rebalance succeeded without any kernel BUG-messages,
 so apparently your patch works!
 
 I noticed that at first, the messages were like this:
 
 [79329.526490] btrfs: found 1939 extents
 [79375.950834] btrfs: found 1939 extents
 [79376.083599] btrfs: relocating block group 352220872704 flags 1
 [80052.940435] btrfs: found 3786 extents
 [80108.439657] btrfs: found 3786 extents
 [80112.325548] btrfs: relocating block group 351147130880 flags 1
 
 Just like I saw during previous balance-runs. Then all of a sudden the
 messages changed to:
 
 [104178.827594] btrfs allocation failed flags 1, wanted 2013265920
 [104178.827599] space_info has 4271198208 free, is not full
 [104178.827602] space_info total=214748364800, used=210440957952,
 pinned=0, reserved=36208640, may_use=3168993280, readonly=0
 [104178.827606] block group 1107296256 has 5368709120 bytes, 5368582144
 used 0 pinned 0 reserved
 [104178.827610] entry offset 1778384896, bytes 86016, bitmap yes
 [104178.827612] entry offset 1855827968, bytes 20480, bitmap no
 [104178.827614] entry offset 1855852544, bytes 20480, bitmap no
 [104178.827617] block group has cluster?: no
 [104178.827618] 0 blocks of free space at or bigger than bytes is
 [104178.827621] block group 8623489024 has 5368709120 bytes, 5368705024
 used 0 pinned 0 reserved
 [104178.827624] entry offset 8891924480, bytes 4096, bitmap yes
 [104178.827626] block group has cluster?: no
 [104178.827628] 0 blocks of free space at or bigger than bytes is
 [104178.827631] block group 17213423616 has 5368709120 bytes, 5368709120
 used 0 pinned 0 reserved
 [104178.827634] block group has cluster?: no
 
 And so on.
 
 Does this indicate an error of any sort, or is this expected behaviour?

   As far as I know, it means that you've run out of space, and not
every block group has been rewritten by the balance process.
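
The counters in the log are at least consistent with that reading. Below is a minimal sketch in C, assuming the "free" figure is total minus used, pinned, reserved and readonly (which reproduces the logged 4271198208), and assuming that bytes counted in may_use are already promised to other operations. The struct and function names are hypothetical, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Counters as printed in the space_info line of the log (hypothetical struct). */
struct space_info_counters {
	uint64_t total, used, pinned, reserved, may_use, readonly;
};

/* Free bytes, assuming: total - used - pinned - reserved - readonly. */
static uint64_t space_info_free(const struct space_info_counters *s)
{
	assert(s->total >= s->used + s->pinned + s->reserved + s->readonly);
	return s->total - s->used - s->pinned - s->reserved - s->readonly;
}

/* Assumption: bytes in may_use cannot be handed to a new request. */
static uint64_t space_info_uncommitted(const struct space_info_counters *s)
{
	uint64_t free_bytes = space_info_free(s);

	return free_bytes > s->may_use ? free_bytes - s->may_use : 0;
}
```

With the logged values, space_info_free() gives 4271198208 and space_info_uncommitted() gives 1102204928, which is less than the 2013265920 bytes wanted. The 200 GiB total covers only the block groups with flags 1 (presumably data), so it need not match the free space reported for the whole volume.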

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- In one respect at least, the Martians are a happy people: ---
  they have no lawyers.  




Re: Kernel error during btrfs balance

2011-01-26 Thread Erik Logtenberg

 [104178.827624] entry offset 8891924480, bytes 4096, bitmap yes
 [104178.827626] block group has cluster?: no
 [104178.827628] 0 blocks of free space at or bigger than bytes is
 [104178.827631] block group 17213423616 has 5368709120 bytes, 5368709120
 used 0 pinned 0 reserved
 [104178.827634] block group has cluster?: no

 And so on.

 Does this indicate an error of any sort, or is this expected behaviour?
 
As far as I know, it means that you've run out of space, and not
 every block group has been rewritten by the balance process.
 
Hugo.
 

It is a 300GB volume with 79GB free. So hardly out of space. Moreover, I
started the balance operation with the sole purpose of reclaiming some
free space. The volume had like 40GB less free space when balance
started, which was used by / reserved for Metadata.

Kind regards,

Erik.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel error during btrfs balance

2011-01-26 Thread Erik Logtenberg

 Yesterday I reported a similar problem on this mailing list, in the
 thread "version".
 
 Running kernel 2.6.37 didn't show this error, but running kernel
 2.6.38-rc2 ended with errors.
 
 Best regards!
 Helmut

Ah, indeed, just like you I use 2.6.38-rc2. Or to be more precise:
2.6.38-0.rc2.git0.1.fc14.x86_64, which is the latest rawhide kernel,
with one additional patch, namely the one-liner from Zheng Yan.

Kind regards,

Erik.


Re: version (was: btrfs, broken design?)

2011-01-26 Thread Erik Logtenberg
On 01/21/2011 03:32 PM, Diego Calleja wrote:
 On Friday, 21 January 2011 10:54:00 Helmut Hullen wrote:
 
 And I have never seen something like a Changelog - that would be fine
 too.
 
 Check the wiki, I keep that updated: 
 https://btrfs.wiki.kernel.org/index.php/Main_Page#News


I like the Changelog on the wiki [1] very much, since it shows the most
important changes in easily understandable language. Unfortunately, the
most recent entry is 2.6.35 (August 2010), while we are currently at
2.6.37 (stable) and 2.6.38-rc2 (mainline). Especially 2.6.38(-rc2)
contains many interesting new btrfs features and lots of important
fixes. It would be very nice if the Changelog could list/explain those.

[1] https://btrfs.wiki.kernel.org/index.php/Changelog

The News [2] section on the Main page does mention one more version,
2.6.37. But the news section is less elaborate than the Changelog and
also I notice that 2.6.36 is not mentioned in the News section. Still,
2.6.36 does contain all kinds of btrfs-related changes.

[2] https://btrfs.wiki.kernel.org/index.php/Main_Page#News

Diego, please don't read anything negative into my comments, I enjoy and
respect your work very much! If you could find time to add those latest
changes to the wiki, it would be greatly appreciated.

Kind regards,

Erik.


[PATCH] Btrfs: Fix balance panic

2011-01-26 Thread Yan, Zheng
Mark the cloned backref_node as checked in clone_backref_node()

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 045c9c2..bef9c22 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1157,6 +1157,7 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
 	new_node->bytenr = dest->node->start;
 	new_node->level = node->level;
 	new_node->lowest = node->lowest;
+	new_node->checked = 1;
 	new_node->root = dest;
 
 	if (!node->lowest) {


Re: version (was: btrfs, broken design?)

2011-01-26 Thread Diego Calleja
On Wednesday, 26 January 2011 11:13:20 Erik Logtenberg wrote:
 Diego, please don't read anything negative into my comments, I enjoy and
 respect your work very much! If you could find time to add those latest
 changes to the wiki, it would be greatly appreciated.

Thanks for your suggestion, I've updated the Changelog and removed the old
items from the news section.

2.6.36 didn't have many btrfs changes; there were no new features.


Re: [PATCH] Btrfs: Fix balance panic

2011-01-26 Thread Helmut Hullen
Hello, Yan,

You wrote on 26.01.11:

 Mark the cloned backref_node as checked in clone_backref_node()

 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
 diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
 index 045c9c2..bef9c22 100644
 --- a/fs/btrfs/relocation.c
 +++ b/fs/btrfs/relocation.c
 @@ -1157,6 +1157,7 @@ static int clone_backref_node(struct btrfs_trans_handle *trans,
   new_node->bytenr = dest->node->start;
   new_node->level = node->level;
   new_node->lowest = node->lowest;
 + new_node->checked = 1;
   new_node->root = dest;

   if (!node->lowest) {
 --

Sorry - didn't solve my problem:

-- last lines from dmesg ---

bio too big device sdc (256 > 240)
bio too big device sdc (256 > 240)

[...] more than 800 such lines

bio too big device sdc (256 > 240)
bio too big device sdc (256 > 240)
bio too big device sdc (256 > 240)
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:2097!
invalid opcode:  [#1]
last sysfs file: 
/sys/devices/pci:00/:00:07.1/host1/target1:0:0/1:0:0:0/block/sdb/dev
Modules linked in: sg nf_nat_ftp nf_conntrack_ftp ipt_MASQUERADE iptable_nat 
nf_nat xt_DSCP xt_multiport xt_recent nf_conntrack_ipv4 nf_defrag_ipv4 xt_state 
nf_conntrack xt_tcpudp ipt_REJECT iptable_filter iptable_mangle ip_tables 
xt_iprange x_tables nfsd exportfs 8139too 8139cp savagefb fb_ddc i2c_algo_bit 
vgastate i2c_piix4 piix e100 mii intel_agp intel_gtt agpgart cmd64x video 
thermal_sys ac battery yenta_socket pcmcia_rsrc pcmcia pcmcia_core 
thinkpad_acpi hwmon rfkill nvram fuse

Pid: 5991, comm: btrfs Not tainted 2.6.38-rc2-OD1 #2 26478EG/26478EG
EIP: 0060:[c1235264] EFLAGS: 00010282 CPU: 0
EIP is at btrfs_balance+0x2d4/0x2e0
EAX: fffb EBX: cfa58000 ECX: d7ce6c18 EDX: d7ce6818
ESI: d0574000 EDI: cf958400 EBP: cc4e3e9c ESP: cc4e3e38
 DS: 007b ES: 007b FS:  GS: 00e0 SS: 0068
Process btrfs (pid: 5991, ti=cc4e2000 task=cce6c3c0 task.ti=cc4e2000)
Stack:
 99cc 0001 00e4 0010  0100 ccdcc000 cce8d058
 aea58000 0001  99cc 0001 0100b790  00e4
 0199cc00  0001 e400   ccc85380 ffea
Call Trace:
 [c123bcf1] btrfs_ioctl+0x2e1/0x9d0
 [c123ba10] ? btrfs_ioctl+0x0/0x9d0
 [c10c3f65] do_vfs_ioctl+0x85/0x590
 [c10206db] ? do_page_fault+0x17b/0x380
 [c10b554b] ? do_sys_open+0xdb/0x110
 [c10c44f7] sys_ioctl+0x87/0x90
 [c1753d0c] syscall_call+0x7/0xb
Code: 1b ff ff ff 89 f0 e8 cc 75 fb ff 8b 55 b4 8b 82 10 01 00 00 05 74 19 00 
00 e8 09 dc 51 00 e9 70 fd ff ff 31 db eb dd 85 c0 74 9d 0f 0b 0f 0b 0f 0b 0f 
0b 0f 0b 66 90 55 89 e5 56 53 83 ec 34 3e
EIP: [c1235264] btrfs_balance+0x2d4/0x2e0 SS:ESP 0068:cc4e3e38
---[ end trace ebf8fd68179e0b7a ]---

#  end dmesg --

Best regards!
Helmut


btrfs BUG during Ceph cosd open() syscall

2011-01-26 Thread Jim Schutt
Hi,

I got this kernel BUG on a server running multiple Ceph
cosd instances, during a heavy write load generated by
multiple Ceph clients.

The server was running the current ceph unstable kernel 
(a3f5274e535 in 
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git).

Please let me know what other information you need to 
make this report useful.

-- Jim

 BUG: unable to handle kernel NULL pointer dereference at 0100
[97221.834832] IP: [a075b3ab] btrfs_drop_inode+0x10/0x36 [btrfs]
[97221.834832] PGD 198d6b067 PUD 13d79f067 PMD 0 
[97221.834832] Oops:  [#1] SMP 
[97221.834832] last sysfs file: 
/sys/devices/pci:00/:00:1e.0/:10:0d.0/local_cpus
[97221.834832] CPU 3 
[97221.834832] Modules linked in: loop btrfs zlib_deflate ipt_MASQUERADE 
iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack 
ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge ]
[97221.834832] 
[97221.834832] Pid: 30295, comm: cosd Not tainted 2.6.37-00017-ga3f5274 #4 
0DT097/PowerEdge 1950
[97221.834832] RIP: 0010:[a075b3ab]  [a075b3ab] 
btrfs_drop_inode+0x10/0x36 [btrfs]
[97221.834832] RSP: 0018:8801cf205c08  EFLAGS: 00010282
[97221.834832] RAX: a075b39b RBX: 88018490a3a0 RCX: 0001
[97221.834832] RDX:  RSI: 819e7ea0 RDI: 88018490a3a0
[97221.834832] RBP: 8801cf205c08 R08: e8ccefa8 R09: 
[97221.834832] R10: 8801488e9658 R11:  R12: 88021b5c6400
[97221.834832] R13: 8801fad145a0 R14: 8801faf8c440 R15: 88017bab9848
[97221.834832] FS:  7f0b011f9940() GS:8800cfcc() 
knlGS:
[97221.834832] CS:  0010 DS:  ES:  CR0: 8005003b
[97221.834832] CR2: 0100 CR3: 0001b8c89000 CR4: 06e0
[97221.834832] DR0:  DR1:  DR2: 
[97221.834832] DR3:  DR6: 0ff0 DR7: 0400
[97221.834832] Process cosd (pid: 30295, threadinfo 8801cf204000, task 
8801488e9610)
[97221.834832] Stack:
[97221.834832]  8801cf205c28 810fd714 fffb 
fffb
[97221.834832]  8801cf205cd8 a07587e8 8801cf205c48 
0102
[97221.834832]  000fcf205c58 88022f5c46a0 8801d1ef8800 
8136a638
[97221.834832] Call Trace:
[97221.834832]  [810fd714] iput+0x5c/0x1e0
[97221.834832]  [a07587e8] btrfs_new_inode+0x2d3/0x2e5 [btrfs]
[97221.834832]  [8136a638] ? _cond_resched+0xe/0x22
[97221.834832]  [8136ae20] ? mutex_lock+0x16/0x3a
[97221.834832]  [a0756da1] ? start_transaction+0x176/0x1bc [btrfs]
[97221.834832]  [a075d1fc] btrfs_create+0xbb/0x1fa [btrfs]
[97221.834832]  [810f49e2] vfs_create+0x76/0x96
[97221.834832]  [810f56af] do_last+0x24d/0x4d3
[97221.834832]  [810f5b16] do_filp_open+0x1e1/0x4c5
[97221.834832]  [81031061] ? should_resched+0xe/0x2f
[97221.834832]  [8136a638] ? _cond_resched+0xe/0x22
[97221.834832]  [811aa669] ? might_fault+0xe/0x10
[97221.834832]  [811aa753] ? __strncpy_from_user+0x20/0x4a
[97221.834832]  [810e9023] do_sys_open+0x62/0xeb
[97221.834832]  [810e90df] sys_open+0x20/0x22
[97221.834832]  [81002c2b] system_call_fastpath+0x16/0x1b
[97221.834832] Code: 53 fc 94 e0 4c 89 e7 e8 f6 8a 95 e0 48 83 c4 18 5b 41 5c 
41 5d 41 5e 41 5f c9 c3 55 48 89 e5 0f 1f 44 00 00 48 8b 97 68 fe ff ff 83 ba 
00 01 00 00 00 75 12 48 8b 82 28 01 00 00 b9 01 00  
[97221.834832] RIP  [a075b3ab] btrfs_drop_inode+0x10/0x36 [btrfs]
[97221.834832]  RSP 8801cf205c08
[97221.834832] CR2: 0100
[97222.207152] ---[ end trace 32eb84782eb8 ]---





Who wants metadata image from botched filesystem?

2011-01-26 Thread Paul Komkoff
Hello.

I had a filesystem that became broken. On 2.6.37 with for-linus
unstable, it hung hard when accessing some directories.
I created a metadata image and can put it somewhere if you want to
use it for something.

Thanks.

-- 
This message represents the official view of the voices in my head


Re: [RFC][PATCH] Btrfs: New inode number allocator

2011-01-26 Thread Goffredo Baroncelli
On 01/26/2011 02:53 AM, Li Zefan wrote:
 Here comes the compatibility issue. It's fine to mount old btrfs, because
 we'll just use the original way to find free ino. But we can't mount new btrfs
 in older kernels, because the OFFSET makes highest objectid overflow when it
 is cast to unsigned long on 32-bit systems.
 
 We can store ino extents in a separate btree, and then the new btrfs can
 be mounted in older kernels, but another problem will arise when remounting it
 in new kernels - creating new files will probably fail (but not oops)
 because the ino extent items are not consistent with inode items.
 
 If the above behavior (failing to create files) is not acceptable, we'll
 have to add an incompat flag.

I can't comment on the patch from a technical point of view. However, I want
to add a comment about the compatibility issues.

I remember that Linus was not happy about a filesystem that becomes
incompatible when the kernel version decreases. IIRC, during the switch
from ext3 to ext4 there were some issues during a git bisect process.

So my suggestions are:
- don't allow an automatic switch to the new inode allocation
policy. It should be the user who forces this switch (via a mount option,
for example)
- in case the performance regressions are noticeable, allow the user to
use the old policy, which, if I understood correctly, works fine on a
64-bit system [*].

Regards
G.Baroncelli


[*] Supposing we continuously create 1000 files per second, it takes
2^64/1000 sec = ~585,000,000 years to exhaust all the available inode
numbers.
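
The footnote's estimate can be checked with a few lines of C. This is only a back-of-the-envelope sketch: the function name is hypothetical and a year is taken as 365.25 days.

```c
#include <assert.h>

/* Years needed to consume `space` identifiers at `per_second`
 * allocations per second, using 365.25-day years. */
static double years_to_exhaust(double space, double per_second)
{
	const double seconds_per_year = 365.25 * 24.0 * 60.0 * 60.0;

	assert(per_second > 0.0);
	return space / per_second / seconds_per_year;
}
```

Calling years_to_exhaust(18446744073709551616.0, 1000.0), i.e. 2^64 identifiers at 1000 per second, gives roughly 5.8e8 years, while a 32-bit space (2^32) at the same rate lasts only about 50 days.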



Re: Atomic file data replace API

2011-01-26 Thread Olaf van der Spek
On Wed, Jan 26, 2011 at 8:30 PM, Chris Mason chris.ma...@oracle.com wrote:
 My answer hasn't really changed ;)  Replacing file data is a common
 operation, but it is still surprisingly complex.  Again, the truncate is
 O(size of the file) and it is actually impossible to do this atomically
 in most filesystems.

Unfortunately life isn't trivial. ;)
Given that it's common, it doesn't make sense to have code duplication
in lots of apps to implement the temp file rename pattern.
If it's too complex to implement in the FS (ATM), would it be possible
to implement it in a higher layer?
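
For reference, the temp file rename pattern mentioned above usually looks something like the following C sketch. It assumes POSIX open/write/fsync/rename; the function name replace_file and the ".new" suffix are made up for illustration, and error handling is kept minimal.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace the contents of `path` with `len` bytes from `data`.
 * Returns 0 on success, -1 on failure (the old file stays in place). */
int replace_file(const char *path, const void *data, size_t len)
{
	char tmp[4096];
	const char *p = data;
	int fd, n;

	assert(path != NULL && data != NULL);
	/* Temp file next to the target, so rename() stays on one filesystem. */
	n = snprintf(tmp, sizeof(tmp), "%s.new", path);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	while (len > 0) {
		ssize_t w = write(fd, p, len);
		if (w <= 0)
			goto fail;
		p += w;
		len -= (size_t)w;
	}
	/* Make sure the new data is on disk before it becomes visible. */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp);
		return -1;
	}
	/* rename() atomically swaps the new file in place of the old one. */
	if (rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
fail:
	close(fd);
	unlink(tmp);
	return -1;
}
```

A caller would use it as replace_file("config.txt", buf, len). Readers see either the old or the new contents, never a mix. To make the rename itself durable one would also fsync the containing directory, which this sketch omits.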

 You don't notice this because xfs/ext34/btrfs (and many others) have
 code that makes sure a truncate is restarted if you crash.  So, it
 appears to be atomic even though we're really just restarting the
 operation.  In order to have a truncate + replacement of data operation,
 we'd have to do a disk format change that includes both the truncate and
 the new data.

I'm not sure why the disk format would have to change.
Conceptually, just like the temp file case, you'd write the new data
to newly allocated blocks.
After (and I guess that's the complex part) they're safely on disk,
you update the meta data, in an atomic way.

 It would look a lot like echo data > file.new ; truncate file ; mv
 file.new file, but recorded in the FS metadata.

 I don't have this in the btrfs roadmap.  It would be nice but most
 people use databases for things that require atomic operations.  I

Executables and files shouldn't be in a DB.

Olaf


Re: LOOP_GET_STATUS(64) truncates pathnames to 64 chars (was Re: Bug in mkfs.btrfs?!)

2011-01-26 Thread Felix Blanke
Hi,

attached is the answer from Jari Ruusu, (one of?!) the main developers of
loop-aes. It seems that checking whether a loop device is mounted by
following the link isn't the best idea :)

I'll have time to look deeper into his example around 14 February. I'll then
try to fix that issue in mkfs.btrfs. If someone wants to fix it earlier, do it :)



Regards,
Felix
---BeginMessage---
First of all, I am not subscribed to linux-cry...@kernelnewbies.org mailing
list. But I do check web archive occasionally, at least for now. If you want
loop-AES maintainer to see your posts, please CC my @users.sourceforge.net
address, or linux-cry...@vger.kernel.org

Felix Blanke wrote:
 If I'm using the unpatched losetup (from util-linux-ng) and looping a
 device like
 
 /dev/disk/by-id/ata-WDC_WD6400AAKS-22A7B2_WD-WCASY7780706-part3
 
 it is using readlink to track down the link to e.g. /dev/sda3. But if
 I'm using the patched losetup I don't see any readlink in the strace.
 Because no readlink is used there is a problem with device names which
 have more than 64 characters (the field lo_name is an array of 64
 char). It is truncated like
 
 /dev/loop0: [0010]:6229
 (/dev/disk/by-id/ata-INTEL_SSDSA2M160G2GC_CVPO939201JX160AGN-par)
 encryption=AES128
 
 That does make trouble with e.g. btrfs, because mkfs.btrfs uses the
 devicename provided by losetup to check if a device is mounted. That
 can't work with an incomplete devicename.

mkfs.btrfs authors goofed. Using truncated name is doomed to fail. Correct
way to check that is to compare loop backing file inode and device numbers
to inode and device numbers of suspected file. Here is sample losetup output:

/dev/loop6: [0902]:209029 (/dev/md3) offset=4096 encryption=AES128 multi-key-v3
             ^^^^  ^^^^^^
            device inode

Device number (0902) is hexadecimal. Inode number (209029) is decimal
number. /dev/md3 block special device node is present at mounted file system
on device 0x0902. Anything that symlinks or hardlinks to that inode 209029
on device 0x0902, is same file or device.

Even if losetup were to follow symlinks for that backing file name, symlink
destination would not necessarily be shorter. Although in your case it would
be shorter. Also, old versions of cryptoloop do not store the backing file
name at all. They use that same space for cipher algorithm name.

Below is source code for correct way for mkfs.btrfs to test loop backing
file status. Feel free to forward it to mkfs.btrfs authors.

-- 
Jari Ruusu  1024R/3A220F51 5B 4B F9 BB D3 3F 52 E9  DB 1D EB E3 24 0E A9 DD



/* loop-backing-dev-check.c / (c) 2011 Jari Ruusu / GNU GPL */

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

/* returns 1 if suspect is the backing file/device of loopdev, else 0 */
int loop_backing_dev_check(char *loopdev, char *suspect)
{
    int fd, ret = 0;
    struct stat statbuf;
    struct loop_info64 info64;
    struct loop_info info;

    if(!loopdev || !*loopdev || !suspect || !*suspect) goto outret0;
    if(stat(loopdev, &statbuf) || !S_ISBLK(statbuf.st_mode)) goto outret0;
    /* loop devices have block major number 7 */
    if((statbuf.st_rdev & 0xfff00) != 0x700) goto outret0;
    if(stat(suspect, &statbuf)) goto outret0;
    if((fd = open(loopdev, O_RDONLY)) < 0) goto outret0;
    if(!ioctl(fd, LOOP_GET_STATUS64, &info64)) {
        if(statbuf.st_dev != info64.lo_device) goto closeret0;
        if(statbuf.st_ino != info64.lo_inode) goto closeret0;
        ret = 1;   /* loop backing device/file is same as suspect */
    } else if(!ioctl(fd, LOOP_GET_STATUS, &info)) {
        /* fall back to the older, narrower status ioctl */
        if(statbuf.st_dev != info.lo_device) goto closeret0;
        if(statbuf.st_ino != info.lo_inode) goto closeret0;
        ret = 1;   /* loop backing device/file is same as suspect */
    }
 closeret0:
    close(fd);
 outret0:
    return ret;
}

/* usage: ./loop-backing-dev-check /dev/loop0 /dev/fd0 */
int main(int argc, char **argv)
{
    if(argc != 3) exit(1);
    if(loop_backing_dev_check(argv[1], argv[2])) {
        printf("loop device %s is associated with %s\n", argv[1], argv[2]);
        exit(0);
    }
    exit(1);
}
---End Message---


2.6.38-rc2 oops's when rebalancing on different size drives (was Re: version)

2011-01-26 Thread Chris Samuel
On 26/01/11 01:37, Helmut Hullen wrote:

 bio too big device sdc (256 > 240)
 bio too big device sdc (256 > 240)
 bio too big device sdc (256 > 240)
 bio too big device sdc (256 > 240)

Oh dear, those are errors from the block layer, looks like
btrfs is doing something wrong there.. :-(

 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/volumes.c:2097!

It looks like btrfs isn't handling errors coming back from the
block layer - at that point it's just called btrfs_relocate_chunk()..

So my guess is that the rebalancing code is naive and assumes
the drives are the same size - but I can't quite follow what
the code above that BUG_ON() is doing to verify that..

Chris M. ?

cheers!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC


Re: full btrfs partition, became unmountable (+ a solution that thankfully worked for me)

2011-01-26 Thread Cyrille Chépélov
Hello Shawn,

it's now performing a sequential read of the volume, which will probably
take significantly more time for you than for me (where I was dealing
with an image of a 16GB SD card, stored on a recent mechanical SATA
disk).

I'm a bit confused by what happens while reading the potential supers.
At first the blocks appear valid, then they are all misplaced (meaning
the bytenr field != the bytenr from which the block has been read, IOW
the block is most probably not part of btrfs structures, from what I
understand).  From the output before the "will attempt to find useful
trees" messages, it seems btrfsck is now doing a sequential read not
just of /dev/sde, but also of every single block device?

disk-io.c: try_emergency_tree_fixup() is probably a bit too silent
for your use case at the moment. You might want to uncomment the
commented-out fprintf there; this will make it very verbose (an extra
line per structure block) but will provide clues as to where on disk
it is working.

-- Cyrille

On Thursday 27 January 2011 at 01:18 -0500, Shawn Stricker wrote:
 any chance of getting a little more informative output?
 I started the command at about 2250 Eastern and now at 0117 Eastern the
 command is still running and all of the attached output happened in the
 first few minutes (under 5).
 On Jan 26, 2011, at 2:46 AM, Cyrille Chépélov wrote:
 
  On Tuesday 25 January 2011 at 23:38 -0500, Shawn Stricker wrote:
  Not sure where you pulled your source from but a fresh checkout of either 
  master or next of 
  git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git 
  does not compile properly.
  They both fail with 
  
  cc1: warnings being treated as errors
  disk-io.c: In function ‘btrfs_read_dev_super’:
  disk-io.c:937: error: format ‘%lu’ expects type ‘long unsigned int’, but 
  argument 4 has type ‘unsigned int’
  disk-io.c:957: error: implicit declaration of function ‘uuid_unparse’
  
  am I patching/compiling from the wrong source or is there something I am 
  missing?
  
  uh, I had been compiling with CFLAGS=-g, where the makefile specifies
  -O2 -Werror
  
  -Werror causes warnings to be treated as errors, which is a good thing
  in a way (makes sure stuff as this gets caught :) )
  
  fixes are:
  * line 937 (patched), should be %llu instead of %lu
  * line 957, there should be a prototype for uuid_unparse(), most
  certainly by including <uuid/uuid.h>
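
As an illustration of the first fix: printing a 64-bit value portably in C usually means %llu plus an explicit cast. This is a minimal sketch, assuming the value being printed is a 64-bit byte number (the thread does not show the actual line); format_bytenr is a hypothetical helper, not a btrfs-progs function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit byte number into buf. Returns what snprintf returns:
 * the number of characters the full output needs, excluding the NUL. */
static int format_bytenr(char *buf, size_t len, uint64_t bytenr)
{
	assert(buf != NULL && len > 0);
	/* The cast keeps the argument type in step with %llu on both
	 * 32-bit and 64-bit builds, so -Werror has nothing to flag. */
	return snprintf(buf, len, "%llu", (unsigned long long)bytenr);
}
```

With a plain %lu the same call compiles cleanly only where unsigned long happens to match the argument type, which is why the warning tends to show up on some platforms and not on others.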
  
  please try this patch instead.
  
  Thanks for the feedback!
  
  -- Cyrille
  
  On Jan 25, 2011, at 1:46 PM, Cyrille Chépélov wrote:
  
  Hello all,
  
  Last Friday, the /var and /home partition on one of my appliances became
  full. This should normally not be much of a problem, except that after
  the incident, I had been unable to mount the partition back again.
  
  The appliance runs 2.6.32 as provided by Debian during the last two
  months. 
  The rescue computer runs 2.6.37; both exhibited the same behaviour at
  mount: an infinite loop-and-abort cycle (I unfortunately did not write
  down the exact messages, but in a nutshell, there was not enough free
  space to replay the log, so it aborted).
  
  After pulling the SD card (yes) to break the loop, I ended up with a
  corrupt file system. Any attempt to mount, debug or fsck (using
  btrfs-tools 0.19+20100601 as shipped by Debian, or compiled from git
  1b444cd2e6ab8dcafdd) aborted with the following message:
  btrfs-debug-tree: disk-io.c:741: open_ctree_fd: Assertion `!(!tree_root->node)' failed.
  
  After much scavenging on the disk image, I finally managed to recover,
  using the (dirty) patch attached here. Since apparently other people had
  similar issues, I'm posting it in the hope it might be useful.
  
-- Cyrille
  
  PS: Chris, if btrfs-images of before and after my butcher fix would
  be useful to you, just let me know. 
  scavenge.patch
  
  
  scavenge-2.patch
 




Re: [RFC][PATCH] Btrfs: New inode number allocator

2011-01-26 Thread Li Zefan
Chris Mason wrote:
 Excerpts from Li Zefan's message of 2011-01-25 20:53:00 -0500:
 (WARNING: this patch is not complete or well-tested)

 We used to allocate inode numbers by searching through inode items, but
 that made allocation slower and slower as more and more files were created.

 The current code just records the highest objectid in the btree without
 reusing old inode numbers, so the filesystem will eventually run out of
 inode numbers as we create and delete files.

 In this patch, free inode numbers are stored in the fs tree with key:

 [start, BTRFS_INO_EXTENT_KEY, end]
 
 Thanks a lot for working on this, it isn't an easy problem.
 
 I think Josef's free space cache for the extent allocation tree is the
 model you want to use.  They are actually solving exactly the same
 problem:
 
 In the extent allocation tree, a free extent is one with no keys in the
 tree.
 
 In the FS tree, a free inode is one with no keys in the tree.
 
 He has a cache that gets written on a per block group basis for the free
 extents in that block group.  It's a somewhat easier problem to solve in
 the inode number cache because you don't have the same problem where you
 need free blocks to store the free block cache ;)
 
 In his code, the cache stores the generation number of the commit that
 was used to create the cache.  If a cache unaware kernel mounts the
 filesystem and makes changes, we notice on the next mount because the
 cache generation number doesn't match the filesystem generation number.
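
 A minimal sketch of the generation check described above, written as a
 Python model rather than kernel code. The names FreeInoCache,
 cache_is_valid and load_free_inos are hypothetical and do not come from
 the btrfs sources; the rebuild callback stands in for rescanning the
 FS tree.

```python
from dataclasses import dataclass, field


@dataclass
class FreeInoCache:
    """Hypothetical on-disk cache of free inode numbers."""
    generation: int                      # commit generation the cache was written at
    free_ranges: list = field(default_factory=list)  # (start, end) inclusive


def cache_is_valid(cache, fs_generation):
    """The cache is trusted only if it was written by the latest commit."""
    return cache is not None and cache.generation == fs_generation


def load_free_inos(cache, fs_generation, rebuild):
    """Return cached free ranges, or rebuild them if the cache is stale.

    A cache-unaware kernel bumps the filesystem generation without
    rewriting the cache, so the two numbers no longer match.
    """
    if cache_is_valid(cache, fs_generation):
        return cache.free_ranges
    return rebuild()
```

 With a cache written at generation 10, a mount that sees filesystem
 generation 11 falls through to rebuild() instead of trusting stale data.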
 
 It will probably be easiest to dedicate a specific objectid to the inode
 number cache in each FS tree (say objectid == -12ULL), and then put the
 caching items directly in the tree under that objectid.
 
 I'd suggest that you also reuse his code to compactly store a range of
 free extents.  It wouldn't be hard to have a simple compression scheme
 that stored ranges for huge chunks of free inode numbers and did a
 bitmask for ranges where there are lots of free individual inodes.
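
 One way the range-plus-bitmask idea could look, as a Python sketch only.
 The thresholds BITMAP_BITS and MIN_RANGE_LEN and the encode/decode helpers
 are assumptions made for illustration; they are not taken from the free
 space cache code.

```python
BITMAP_BITS = 64      # hypothetical: inode numbers covered by one bitmap entry
MIN_RANGE_LEN = 16    # hypothetical: shorter runs are folded into bitmaps


def runs(sorted_inos):
    """Yield inclusive (start, end) runs of consecutive inode numbers."""
    start = prev = None
    for ino in sorted_inos:
        if start is None:
            start = prev = ino
        elif ino == prev + 1:
            prev = ino
        else:
            yield (start, prev)
            start = prev = ino
    if start is not None:
        yield (start, prev)


def encode(free_inos):
    """Store long runs as ranges and scattered numbers as per-window bitmasks."""
    ranges = []
    bitmaps = {}  # aligned window base -> integer bitmask
    for start, end in runs(sorted(set(free_inos))):
        if end - start + 1 >= MIN_RANGE_LEN:
            ranges.append((start, end))
        else:
            for ino in range(start, end + 1):
                base = ino - ino % BITMAP_BITS
                bitmaps[base] = bitmaps.get(base, 0) | (1 << (ino - base))
    return ranges, bitmaps


def decode(ranges, bitmaps):
    """Expand the compact form back into a sorted list of free inode numbers."""
    out = set()
    for start, end in ranges:
        out.update(range(start, end + 1))
    for base, mask in bitmaps.items():
        for bit in range(BITMAP_BITS):
            if mask & (1 << bit):
                out.add(base + bit)
    return sorted(out)
```

 A thousand consecutive free numbers cost one range entry here, while a
 handful of isolated free numbers inside the same 64-number window share a
 single bitmask.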
 

I'll take your suggestion and try to implement it. Thanks.

(btw, I'll be off from Feb 29th to Mar 7th for Chinese Spring Festival)


Re: full btrfs partition, became unmountable (+ a solution that thankfully worked for me)

2011-01-26 Thread Shawn Stricker
Any chance of getting a little more informative output?
I started the command at about 2250 Eastern; now, at 0117 Eastern, it is
still running, and all of the attached output was produced in the first few
minutes (under 5).
btrfsck /dev/sde
trying potential super #0 at bytenr 65536 
super #0 at bytenr 65536 has better generation 134838 than 0, using that
trying potential super #1 at bytenr 67108864 
super #1 at bytenr 67108864 has same generation 134838 than 134838, skipping
   warning: super #1 at bytenr 67108864 has different contents!
trying potential super #2 at bytenr 274877906944 
super #2 at bytenr 274877906944 has same generation 134838 than 134838, skipping
   warning: super #2 at bytenr 274877906944 has different contents!
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 8679965255889070385
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 11385464139938791651
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 9270412280288921994
trying potential super #0 at bytenr 65536 
super #0 at bytenr 65536 has better generation 2155 than 0, using that
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 7739426643357674384
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 15592201610856999042
trying potential super #0 at bytenr 65536 
super #0 at bytenr 65536 has better generation 134838 than 0, using that
trying potential super #1 at bytenr 67108864 
super #1 at bytenr 67108864 has same generation 134838 than 134838, skipping
   warning: super #1 at bytenr 67108864 has different contents!
trying potential super #2 at bytenr 274877906944 
super #2 at bytenr 274877906944 has same generation 134838 than 134838, skipping
   warning: super #2 at bytenr 274877906944 has different contents!
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 13794433748072589868
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 6338804170709571794
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 1827607198315921929
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 0
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 1254821329273892037
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 5355923006792833603
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 15445565961457297964
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 3079817357236378973
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 2007935378006179730
trying potential super #0 at bytenr 65536 
invalid magic
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 5729257636792198197
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 9602773462471183673
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 327680
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 0
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 18446744073709551615
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 0
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 0
trying potential super #2 at bytenr 274877906944 
got only 0 bytes instead of 2859
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 0
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 0
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 4313900536667142911
trying potential super #0 at bytenr 65536 
super #0 at bytenr 65536 has better generation 134838 than 0, using that
trying potential super #1 at bytenr 67108864 
super #1 at bytenr 67108864 has same generation 134838 than 134838, skipping
   warning: super #1 at bytenr 67108864 has different contents!
trying potential super #2 at bytenr 274877906944 
super #2 at bytenr 274877906944 has same generation 134838 than 134838, skipping
   warning: super #2 at bytenr 274877906944 has different contents!
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 1142399309793345613
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 6887355887353813266
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 10874904992214108498
trying potential super #0 at bytenr 65536 
misplaced block thinks it's at 8679965255889070385
trying potential super #1 at bytenr 67108864 
misplaced block thinks it's at 16378195527537296748
trying potential super #2 at bytenr 274877906944 
misplaced block thinks it's at 9378314511156802577
trying potential super #0 at bytenr 65536 
super #0 
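
The three bytenr values in the output above (64 KiB, 64 MiB, 256 GiB) match
the usual btrfs superblock mirror layout. A small Python sketch of that
arithmetic follows; the constant and function names are chosen here for
illustration, and the formula reflects the layout as I understand it rather
than code copied from btrfs-progs.

```python
SUPER_INFO_OFFSET = 64 * 1024   # primary superblock sits at 64 KiB
SUPER_MIRROR_SHIFT = 12         # each further mirror is 4096 times further out


def sb_offset(mirror):
    """Byte offset of superblock copy number `mirror` (0 is the primary)."""
    if mirror == 0:
        return SUPER_INFO_OFFSET
    return (16 * 1024) << (SUPER_MIRROR_SHIFT * mirror)


for i in range(3):
    print(f"super #{i} at bytenr {sb_offset(i)}")
```

The loop prints the same three offsets that the tool keeps retrying: 65536,
67108864 and 274877906944.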

Re: version

2011-01-26 Thread Chris Mason
Excerpts from Helmut Hullen's message of 2011-01-25 09:37:00 -0500:
 crashes with the dmesg lines
 
 - dmesg ---
 
 bio too big device sdc (256 > 240)
 bio too big device sdc (256 > 240)
 bio too big device sdc (256 > 240)
 bio too big device sdc (256 > 240)
 [ cut here ]
 kernel BUG at fs/btrfs/volumes.c:2097!

Ugh, this one is an old friend I thought I had fixed up.  The two
devices have different limits on the max size of the bio, and we're
using one that is too large.
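
To illustrate the mismatch, here is a rough Python model, not the actual
fix: when one I/O may land on any of several devices, it has to respect the
smallest per-device limit. The function names are hypothetical, and the
256-sector limit for the second device is an assumption read off the dmesg
line; only sdc's limit of 240 is certain from it.

```python
def max_bio_sectors(device_limits):
    """The safe bio size is bounded by the most restrictive device."""
    return min(device_limits)


def split_io(total_sectors, device_limits):
    """Split one I/O into pieces that every device in the set can accept."""
    limit = max_bio_sectors(device_limits)
    chunks = []
    remaining = total_sectors
    while remaining > 0:
        n = min(remaining, limit)
        chunks.append(n)
        remaining -= n
    return chunks
```

With limits of 256 and 240 sectors, a 256-sector request would be issued as
240 + 16 instead of as a single bio that sdc rejects.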

I'll get it fixed for the next rc.

-chris