Re: Issue building a file based rootfs image with mkfs.btrfs

2013-09-28 Thread Chris Mason
Quoting Saul Wold (2013-09-19 14:19:34)
 Hi there,
 
 I am attempting to build a rootfs image from an existing rootfs 
 directory tree.  I am using the 0.20 @ 194aa4a of Chris's git repo.
 
 The first couple of problems I saw were that the target image file needed 
 to exist, although I think I can patch that, and then that the FS size was 
 much larger than the actual size. I tracked the latter to the usage of ftw 
 not accounting for symlinks; I have a patch for that which I will send once 
 I finish getting the other issues resolved.
 
 Next issue I hit was an assertion failure after getting a "not enough free 
 space" message:
 
 not enough free space
 add_file_items failed
 unable to traverse_directory
 Making image is aborted.
 mkfs.btrfs: mkfs.c:1542: main: Assertion `!(ret)' failed.
 
 I am kind of stuck on this one; I have taken it as far as I can right now. 
 Would I be better off dropping back to 0.19, or can we move forward 
 fixing this?

Hi Saul,

Update on my end, the problem is the image code expects every file to
fit inside a single chunk.  It's only creating 8MB chunks, so any file
over 8MB in size is causing problems.

I'm fixing it up here, I should have a patch for you on Monday.

Thanks!

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs-progs: device add should check existing FS before adding

2013-09-28 Thread Anand Jain

On 09/28/2013 02:32 AM, Zach Brown wrote:

@@ -49,14 +50,17 @@ static int cmd_add_dev(int argc, char **argv)
int i, fdmnt, ret=0, e;
DIR *dirstream = NULL;
int discard = 1;
+   int force = 0;
+   char estr[100];

+   res = test_dev_for_mkfs(argv[i], force, estr);
+   if (res) {
+   fprintf(stderr, "%s", estr);
continue;
}


This test_dev_for_mkfs() error string interface is bad.  The caller
should not have to magically guess the string size that the function is
going to use.  Especially because users can trivially provide giant paths
that exhaust that tiny buffer.  If an arbitrarily too small buffer in
the caller was needed at all, its length should have been passed in with
the string pointer.  (Or a string struct that all C projects eventually
grow.)

But all the callers just immediately print it anyway.  Get rid of that
string argument entirely and just have test_dev_for_mkfs() print the
strings.


 Right. But this patch didn't introduce test_dev_for_mkfs();
 a revamp of it would be better done in a separate patch, as it touches
 other functions as well.

Thanks, Anand




Re: [PATCH] btrfs-progs: calculate disk space that a subvol could free

2013-09-28 Thread Anand Jain

On 09/28/2013 03:10 AM, Zach Brown wrote:

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index de246ab..0f36cde 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -809,6 +809,7 @@ static int cmd_subvol_show(int argc, char **argv)
int fd = -1, mntfd = -1;
int ret = 1;
DIR *dirstream1 = NULL, *dirstream2 = NULL;
+   u64 freeable_bytes;

if (check_argc_exact(argc, 2))
usage(cmd_subvol_show_usage);
@@ -878,6 +879,8 @@ static int cmd_subvol_show(int argc, char **argv)
goto out;
}

+   freeable_bytes = get_subvol_freeable_bytes(fd);
+
ret = 0;
/* print the info */
printf("%s\n", fullpath);
@@ -915,6 +918,8 @@ static int cmd_subvol_show(int argc, char **argv)
else
printf("\tFlags: \t\t\t-\n");

+   printf("\tUnshared space: \t%s\n",
+   pretty_size(freeable_bytes));


There's no reason to have a local variable:

printf("\tUnshared space: \t%s\n",
pretty_size(get_subvol_freeable_bytes(fd)));



printf("\tSnapshot(s):\n");
filter_set = btrfs_list_alloc_filter_set();
diff --git a/utils.c b/utils.c
index ccb5199..ca30485 100644
--- a/utils.c
+++ b/utils.c
@@ -2062,3 +2062,157 @@ int lookup_ino_rootid(int fd, u64 *rootid)

return 0;
  }
+
+/* gets the ref count for given extent
+ * 0 = didn't find the item
+ * n = number of references
+ */
+u64 get_extent_refcnt(int fd, u64 disk_blk)
+{
+   int ret = 0, i, e;
+   struct btrfs_ioctl_search_args args;
+   struct btrfs_ioctl_search_key *sk = &args.key;
+   struct btrfs_ioctl_search_header sh;
+   unsigned long off = 0;
+
+   memset(&args, 0, sizeof(args));
+
+   sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
+
+   sk->min_type = BTRFS_EXTENT_ITEM_KEY;
+   sk->max_type = BTRFS_EXTENT_ITEM_KEY;
+
+   sk->min_objectid = disk_blk;
+   sk->max_objectid = disk_blk;
+
+   sk->max_offset = (u64)-1;
+   sk->max_transid = (u64)-1;
+
+   while (1) {
+       sk->nr_items = 4096;
+
+       ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+       e = errno;
+       if (ret < 0) {
+           fprintf(stderr, "ERROR: search failed - %s\n",
+               strerror(e));
+           return 0;
+       }
+       if (sk->nr_items == 0)
+           break;
+
+       off = 0;
+       for (i = 0; i < sk->nr_items; i++) {
+           struct btrfs_extent_item *ei;
+           u64 ref;
+
+           memcpy(&sh, args.buf + off, sizeof(sh));
+           off += sizeof(sh);
+
+           if (sh.type != BTRFS_EXTENT_ITEM_KEY) {
+               off += sh.len;
+               continue;
+           }
+
+           ei = (struct btrfs_extent_item *)(args.buf + off);
+           ref = btrfs_stack_extent_refs(ei);
+           return ref;
+       }
+       sk->min_objectid = sh.objectid;
+       sk->min_offset = sh.offset;
+       sk->min_type = sh.type;
+       if (sk->min_offset < (u64)-1)
+           sk->min_offset++;
+       else if (sk->min_objectid < (u64)-1) {
+           sk->min_objectid++;
+           sk->min_offset = 0;
+           sk->min_type = 0;
+       } else
+           break;
+   }
+   return 0;
+}


These two fiddly functions only differ in the tree search and what they
do with each item.  So replace them with a function that takes a
description of the search and calls the caller's callback for each item.

typedef void (*item_func_t)(struct btrfs_key *key, void *data, void *arg);

int btrfs_for_each_item(int fd, min and max and junk,
item_func_t func, void *arg);

u64 get_subvol_freeable_bytes(int fd)
{
u64 size_bytes = 0;

btrfs_for_each_item(fd, , sum_extents, &size_bytes);

return size_bytes;
}

Etc.  You get the idea.



 Will fix them.  Thanks !

-Anand






Re: Issue building a file based rootfs image with mkfs.btrfs

2013-09-28 Thread Saul Wold

On 09/28/2013 05:29 AM, Chris Mason wrote:

Quoting Saul Wold (2013-09-19 14:19:34)

Hi there,

I am attempting to build a rootfs image from an existing rootfs
directory tree.  I am using the 0.20 @ 194aa4a of Chris's git repo.

The first couple of problems I saw were that the target image file needed
to exist, although I think I can patch that, and then that the FS size was
much larger than the actual size. I tracked the latter to the usage of ftw
not accounting for symlinks; I have a patch for that which I will send once
I finish getting the other issues resolved.

Next issue I hit was an assertion failure after getting a "not enough free
space" message:

not enough free space
add_file_items failed
unable to traverse_directory
Making image is aborted.
mkfs.btrfs: mkfs.c:1542: main: Assertion `!(ret)' failed.

I am kind of stuck on this one; I have taken it as far as I can right now.
Would I be better off dropping back to 0.19, or can we move forward
fixing this?


Hi Saul,

Update on my end, the problem is the image code expects every file to
fit inside a single chunk.  It's only creating 8MB chunks, so any file
over 8MB in size is causing problems.

I'm fixing it up here, I should have a patch for you on Monday.

Ah, great news!  I want to verify: is your git repo for btrfs-progs the 
main upstream?  I see loads of other patches flying around, but not 
applied there.


Thanks again

Sau!



Thanks!

-chris





Corrupt btrfs filesystem recovery... (Due to *sata* errors)

2013-09-28 Thread Martin
This may be of interest for the failure cause as well as how to recover...


I have a known good 2TB (4kByte physical sectors) HDD that supports
sata3 (6Gbit/s). Writing data via rsync at the 6Gbit/s sata rate caused
IO errors for just THREE sectors...

Yet btrfsck bombs out with LOTs of errors...

How best to recover from this?

(This is a 'backup' disk so not 'critical' but it would be nice to avoid
rewriting about 1.5TB of data over the network...)


Is there an obvious sequence/recipe to follow for recovery?

Thanks,
Martin



Further details:

Linux  3.10.7-gentoo-r1 #2 SMP Fri Sep 27 23:38:06 BST 2013 x86_64 AMD
E-450 APU with Radeon(tm) HD Graphics AuthenticAMD GNU/Linux

# btrfs version
Btrfs v0.20-rc1-358-g194aa4a

Single 2TB HDD using default mkbtrfs.
Entire disk (/dev/sdc) is btrfs (no partitions).


The IO errors were:

kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3215049328
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3206563752
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248
kernel: end_request: I/O error, dev sdc, sector 3213925248

Lots of sata error noise omitted.


The sata problem was fixed by limiting libata to 3Gbit/s:

libata.force=3.0G

added onto the Grub kernel line.

Running badblocks twice in succession (non-destructive data test!)
shows no surface errors and no further errors on the sata interface.

Running btrfsck twice gives the same result, giving a failure with:

Ignoring transid failure
btrfsck: cmds-check.c:1066: process_file_extent: Assertion `!(rec->ino
!= key->objectid || rec->refs > 1)' failed.


An abridged summary is:

checking extents
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
leaf parent key incorrect 907185135616
bad block 907185135616
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
leaf parent key incorrect 915444883456
bad block 915444883456
leaf parent key incorrect 915445014528
bad block 915445014528
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185082368 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
parent transid verify failed on 907185127424 wanted 15935 found 12264
leaf parent key incorrect 907183771648
bad block 907183771648
leaf parent key incorrect 907183779840
bad block 907183779840
leaf parent key incorrect 907183783936
bad block 907183783936
[...]
leaf parent key incorrect 907185913856
bad block 907185913856
leaf parent key incorrect 907185917952
bad block 907185917952
parent transid verify failed on 915431579648 wanted 16974 found 16972
parent transid verify failed on 915431579648 wanted 16974 found 16972
parent transid verify failed on 915432382464 wanted 16974 found 16972
parent transid verify failed on 915432382464 wanted 16974 found 16972
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915444707328 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445092352 wanted 16974 found 13021
parent transid verify failed on 915445100544 wanted 16974 found 13021
parent transid verify failed on 915445100544 wanted 16974 found 13021
parent transid verify failed on 915432734720 wanted 16974 found 16972
parent transid verify failed on 915432734720 wanted 16974 found 16972
parent transid verify failed on 915433144320 wanted 16974 found 16972
parent transid verify failed on 915433144320 wanted 16974 found 16972
parent transid verify failed on 915431862272 wanted 16974 found 16972
parent transid verify failed on 915431862272 wanted 16974 found 16972
parent transid verify failed on 915444715520 wanted 16974 found 13021
parent transid verify failed on 915444715520 wanted 16974 found 13021
parent transid verify failed on 915445166080 wanted 16974 found 13021
parent transid verify failed on 915445166080 wanted 16974 found 

Not possible to read device stats for devices added after mount

2013-09-28 Thread Ondřej Kunc
Hi,

I discovered one minor bug in the BTRFS filesystem. I made a nagios check
for btrfs which reads device statistics for all devices in a mounted
btrfs filesystem, calling btrfs dev stats /btrfs.

But there is one problem ... its output looks like this:

[/dev/sda].corruption_errs 0
..
...
[/dev/sdt].generation_errs 0
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sdb2 failed: No such device
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sdh failed: No such device
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sdj failed: No such device
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sdk failed: No such device
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sdl failed: No such device
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sdp failed: No such device
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sdq failed: No such device
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sds failed: No such device
ERROR: ioctl(BTRFS_IOC_GET_DEV_STATS) on /dev/sde failed: No such device

But this is not true ... all specified devices exist and are members
of btrfs filesystem. In dmesg I see this:
...
[973077.098957] btrfs: get dev_stats failed, not yet valid
[973077.098984] btrfs: get dev_stats failed, not yet valid
[973077.099011] btrfs: get dev_stats failed, not yet valid
[973077.099038] btrfs: get dev_stats failed, not yet valid
[973077.099065] btrfs: get dev_stats failed, not yet valid
[973077.099092] btrfs: get dev_stats failed, not yet valid
[973077.099118] btrfs: get dev_stats failed, not yet valid


What makes device statistics valid?  I tried doing a full filesystem
scrub ... but it did not fix the issue.

Thank you for any hints

Using this kernel (if it matters):
3.10-2-amd64 #1 SMP Debian 3.10.7-1 (2013-08-17) x86_64 GNU/Linux

Ondřej Kunc


Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)

2013-09-28 Thread Chris Murphy

On Sep 28, 2013, at 1:26 PM, Martin m_bt...@ml1.co.uk wrote:

 Writing data via rsync at the 6Gbit/s sata rate caused
 IO errors for just THREE sectors...
 
 Yet btrfsck bombs out with LOTs of errors…

Any fs will bomb out on write errors.

 How best to recover from this?

Why you're getting I/O errors at SATA 6Gbps link speed needs to be understood. 
Is it a bad cable? Bad SATA port? Drive or controller firmware bug? Or libata 
driver bug?

 Lots of sata error noise omitted.

An entire dmesg might still be useful. I don't know if the list will handle 
the whole dmesg in one email, but it's worth a shot (reply to an email in the 
thread, don't change the subject).

It's possible software or hardware problems are detected well before writes are 
even initiated.

 Running badblocks twice in succession (non-destructive data test!)
 shows no surface errors and no further errors on the sata interface.

SATA link speed related errors aren't related to bad blocks. If you do a 
smartctl -x on the drive, chances are it's recording PHY Event errors that 
might be relevant, and also SMART might record UDMA/CRC errors that would just 
corroborate that the drive also found link errors.


 
 Running btrfsck twice gives the same result, giving a failure with:

Well, honestly, at this point I expect file system corruption, as it's entirely 
possible that before the hardware dropped the link speed down to SATA 3Gbps, 
corrupt data was already sent to the drive, and that's not something Btrfs 
can know about until trying to read the data back in. So *shrug* - I don't see 
Btrfs as a way to totally mitigate hardware problems. It's the same problem 
with bad RAM, and Btrfs doesn't like that either.


Chris Murphy



Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)

2013-09-28 Thread Martin
Chris,

All agreed. Further comment inlined:

(Should have mentioned more prominently that the hardware problem has
been worked-around by limiting the sata to 3Gbit/s on bootup.)


On 28/09/13 21:51, Chris Murphy wrote:
 
 On Sep 28, 2013, at 1:26 PM, Martin m_bt...@ml1.co.uk wrote:
 
 Writing data via rsync at the 6Gbit/s sata rate caused IO errors
 for just THREE sectors...
 
 Yet btrfsck bombs out with LOTs of errors…
 
 Any fs will bomb out on write errors.

Indeed. However, are not the sata errors reported back to btrfs so that
it knows whatever parts haven't been updated?

Is there not a mechanism to then go read-only?

Also, should not the journal limit the damage?


 How best to recover from this?
 
 Why you're getting I/O errors at SATA 6Gbps link speed needs to be
 understood. Is it a bad cable? Bad SATA port? Drive or controller
 firmware bug? Or libata driver bug?

I systematically eliminated suspects such as leads, PSU, and NCQ. Limiting
libata to only use 3Gbit/s is the one change that gives a consistent fix. The
HDD and motherboard both support 6Gbit/s, but hey-ho, that's an
experiment I can try again some other time when I have another HDD/SSD
to test in there.

In any case, for the existing HDD - motherboard combination, using sata2
rather than sata3 speeds shouldn't noticeably impact performance. (Other
than sata2 works reliably and so is infinitely better for this case!)


 Lots of sata error noise omitted.
 
 An entire dmesg might still be useful. I don't know if the list will
 handle the whole dmesg in one email, but it's worth a shot (reply to
 an email in the thread, don't change the subject).

I can email directly if of use/interest. Let me know offlist.


 do a smartctl -x on the drive, chances are it's recording PHY Event

(smartctl -x errors shown further down...)

Nothing untoward noticed:

# smartctl -a /dev/sdc

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD20EARX-00PASB0
Serial Number:WD-...
LU WWN Device Id: ...
Firmware Version: 51.0AB51
User Capacity:2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is:In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:Sat Sep 28 23:35:57 2013 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       9
  3 Spin_Up_Time            0x0027   253   159   021    Pre-fail  Always       -       1983
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       55
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       800
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3115
194 Temperature_Celsius     0x0022   118   110   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


# smartctl -x /dev/sdc

... also shows the errors it saw:

(Just the last 4 copied which look timed for when the HDD was last
exposed to 6Gbit/s sata)

Error 46 [21] occurred at disk power-on lifetime: 755 hours (31 days +
11 hours)
  When the command that caused the error occurred, the device was active
or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 00 08 00 00 6c 1a 4b b0 e0 00  Error: AMNF 8 sectors at LBA =
0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time
Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---

  

Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)

2013-09-28 Thread Martin
On 28/09/13 20:26, Martin wrote:

 ... btrfsck bombs out with LOTs of errors...
 
 How best to recover from this?
 
 (This is a 'backup' disk so not 'critical' but it would be nice to avoid
 rewriting about 1.5TB of data over the network...)
 
 
 Is there an obvious sequence/recipe to follow for recovery?


I've got the drive reliably working with the sata limited to 3Gbit/s.
What is the best sequence to try to tidy-up and carry on with the 1.5TB
or so of data on there, rather than working from scratch?


So far, I've only run btrfsck since the corruption errors for the three
sectors...

Suggestions for recovery?

Thanks,
Martin





Questions regarding logging upon fsync in btrfs

2013-09-28 Thread Aastha Mehta
Hi,

I have a few questions regarding the logging triggered by calling fsync in BTRFS:

1. If I understand correctly, fsync will log the entire inode in
the log tree. Does this mean that the data extents are also logged
into the log tree? Are they copied into the log tree, or just
referenced? Are they copied into the subvolume's extent tree again
upon replay?

2. During replay, when the extents are added into the extent
allocation tree, do they acquire the physical extent number during
replay? Does the physical extent allocated to the data in the log
tree differ from that in the subvolume?

3. I see there is a mount option of notreelog available. After
disabling tree logging, does fsync still lead to flushing of buffers
to the disk directly?

4. Is it possible to selectively identify certain files in the log
tree and flush them to disk directly, without waiting for the replay
to do it?

Thanks

-- 
Aastha Mehta


Re: Questions regarding logging upon fsync in btrfs

2013-09-28 Thread Aastha Mehta
I am using linux kernel 3.1.10-1.16, just to let you know.

Thanks

On 29 September 2013 01:35, Aastha Mehta aasth...@gmail.com wrote:
 Hi,

 I have a few questions regarding the logging triggered by calling fsync in BTRFS:

 1. If I understand correctly, fsync will log the entire inode in
 the log tree. Does this mean that the data extents are also logged
 into the log tree? Are they copied into the log tree, or just
 referenced? Are they copied into the subvolume's extent tree again
 upon replay?

 2. During replay, when the extents are added into the extent
 allocation tree, do they acquire the physical extent number during
 replay? Does the physical extent allocated to the data in the log
 tree differ from that in the subvolume?

 3. I see there is a mount option of notreelog available. After
 disabling tree logging, does fsync still lead to flushing of buffers
 to the disk directly?

 4. Is it possible to selectively identify certain files in the log
 tree and flush them to disk directly, without waiting for the replay
 to do it?

 Thanks

 --
 Aastha Mehta



-- 
Aastha Mehta


Re: Questions regarding logging upon fsync in btrfs

2013-09-28 Thread Hugo Mills
On Sun, Sep 29, 2013 at 01:46:23AM +0200, Aastha Mehta wrote:
 I am using linux kernel 3.1.10-1.16, just to let you know.

   Not that it invalidates the questions below, but that's a really
old kernel. You should update to something recent (3.11, or 3.12-rc2)
as soon as possible. There are major problems in 3.1 (and most of the
subsequent kernels) that have been fixed in 3.11. Of course, there are
still major problems in 3.11 that haven't been fixed yet, but we don't
know about very many of those. :) (And when we do, we'll be
recommending that you upgrade to whatever has them fixed...)

   Hugo.

 Thanks
 
 On 29 September 2013 01:35, Aastha Mehta aasth...@gmail.com wrote:
  Hi,
 
  I have a few questions regarding the logging triggered by calling fsync in BTRFS:
 
  1. If I understand correctly, fsync will log the entire inode in
  the log tree. Does this mean that the data extents are also logged
  into the log tree? Are they copied into the log tree, or just
  referenced? Are they copied into the subvolume's extent tree again
  upon replay?
 
  2. During replay, when the extents are added into the extent
  allocation tree, do they acquire the physical extent number during
  replay? Does the physical extent allocated to the data in the log
  tree differ from that in the subvolume?
 
  3. I see there is a mount option of notreelog available. After
  disabling tree logging, does fsync still lead to flushing of buffers
  to the disk directly?
 
  4. Is it possible to selectively identify certain files in the log
  tree and flush them to disk directly, without waiting for the replay
  to do it?
 
  Thanks
 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Diablo-D3 My code is never released,  it escapes from the ---   
  git repo and kills a few beta testers on the way out.  




Re: Questions regarding logging upon fsync in btrfs

2013-09-28 Thread Josef Bacik
On Sun, Sep 29, 2013 at 01:35:15AM +0200, Aastha Mehta wrote:
 Hi,
 
 I have a few questions regarding the logging triggered by calling fsync in BTRFS:
 
 1. If I understand correctly, fsync will log the entire inode in
 the log tree. Does this mean that the data extents are also logged
 into the log tree? Are they copied into the log tree, or just
 referenced? Are they copied into the subvolume's extent tree again
 upon replay?
 

The data extents are copied as well, as in the metadata that points to the data,
not the actual data itself.  For 3.1 it's all of the extents in the inode; from
3.8 on it's only the extents that have changed in this transaction.

 2. During replay, when the extents are added into the extent
 allocation tree, do they acquire the physical extent number during
 replay? Does the physical extent allocated to the data in the log
 tree differ from that in the subvolume?
 

No, the physical location was picked when we wrote the data out during fsync.  If
we crash and re-mount, the replay will just insert the ref into the extent tree
for the disk offset as it replays the extents.

 3. I see there is a mount option of notreelog available. After
 disabling tree logging, does fsync still lead to flushing of buffers
 to the disk directly?
 

notreelog just means that we write the data and wait on the ordered data extents
and then commit the transaction.  So you get the data for the inode you are
fsyncing and all of the metadata for the entire file system that has changed in
that transaction.

 4. Is it possible to selectively identify certain files in the log
 tree and flush them to disk directly, without waiting for the replay
 to do it?
 

I don't understand this question; replay only happens on mount after a
crash/power loss, and everything that is in the log is replayed. There is no way
to select which inode is replayed.  Thanks,

Josef


Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)

2013-09-28 Thread Chris Murphy

On Sep 28, 2013, at 4:51 PM, Martin m_bt...@ml1.co.uk wrote:

 Indeed. However, are not the sata errors reported back to btrfs so that
 it knows whatever parts haven't been updated?

It's a good question.

My doubtful speculation of such a mechanism is that it is really not the 
responsibility of the file system to be prepared for the hardware face planting 
this spectacularly. The hardware really should do better than this. There are 
specifications that apply here, and the drive and controller and driver all 
agreed long before the mounting of a volume and writes started to occur. But 
then later on, at some point in the middle of the really important part of the 
conversation (writing your data) something in the hardware chain puked and said 
"OHHh, wait, about that prior conversation, I'm really confused, let's talk at a 
slower speed shall we?" So the before part is just a lost conversation, is my 
speculation.

The other thing is that SATA and SAS handle these things differently. When 
there's such a serious error that results in a link speed change, usually the 
bus is reset and for SATA it means the command queue is lost. And I don't think 
Btrfs is informed of what commands were completed vs failed in such a case.

But I'd love someone who actually knows what they're talking about to answer 
that question.

My expectation though, is that unlike perhaps other file systems, Btrfs's 
design goal is to handle the data that did get written, better. In that it's 
still accessible where other file systems will possibly have more difficulty.


 Is there not a mechanism to then go read-only?

I don't know. In this case it does seem sorta reasonable. But the dmesg might 
still be revealing. The PHY Event counters indicate a lot of retries of over 
1000 sectors.
 
 Also, should not the journal limit the damage?

Well, it's COW, so it's not quite like a journaled file system, but yes, it
should be in a position to know, at the next mount, the most recent consistent
state of the file system. But that doesn't mean it can fix the parts that are
just fundamentally broken. I do think "now what?" is a valid question, because
I don't actually know the state of your file system or how to determine it.
So maybe Hugo, or someone else, has some thoughts.

But for sure I would move to kernel 3.11.2 or 3.12.rc2 before mounting this 
file system again.

 
 
 How best to recover from this?
 
 Why you're getting I/O errors at SATA 6Gbps link speed needs to be
 understood. Is it a bad cable? Bad SATA port? Drive or controller
 firmware bug? Or libata driver bug?
 
 I systematically eliminated suspects such as the leads, PSU, and NCQ.
 Limiting libata to only use 3Gbit/s is the one change that gives a
 consistent fix. The HDD and motherboard both support 6Gbit/s, but hey-ho,
 that's an experiment I can try again some other time when I have another
 HDD/SSD to test in there.

Stick with forced 3Gbps, but I think it's worthwhile to find out what the
actual problem is. One day you'll forget about this 3Gbps SATA link, upgrade
or regress to another kernel without the forced speed on the parameter line,
and poof, you've got more problems again. The hardware shouldn't negotiate a
6Gbps link and then do a backwards swan dive at 30,000 feet with your data as
if it's an afterthought.
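For reference, the forced link speed discussed here is typically set via
libata's `libata.force` kernel parameter. A minimal sketch, assuming GRUB2 as
the bootloader (the file paths and regeneration command vary by distro):

```
# /etc/default/grub -- force all SATA links down to 3.0Gbps.
# libata.force also accepts a per-port "ID:VAL" form; see the kernel's
# Documentation/kernel-parameters.txt for the exact syntax.
GRUB_CMDLINE_LINUX="libata.force=3.0Gbps"
```

After editing, regenerate the bootloader config (e.g. `grub2-mkconfig -o
/boot/grub2/grub.cfg`) and reboot; the SATA link lines in dmesg should then
show the link coming up at 3.0 Gbps.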


 In any case, for the existing HDD - motherboard combination, using sata2
 rather than sata3 speeds shouldn't noticeably impact performance. (Other
 than sata2 works reliably and so is infinitely better for this case!)

It's true.


 
 
 Lots of sata error noise omitted.
 
 An entire dmesg might still be useful. I don't know if the list will
 handle the whole dmesg in one email, but it's worth a shot (reply to
 an email in the thread, don't change the subject).
 
 I can email directly if of use/interest. Let me know offlist.

Use pastebin.com and post the link if it's really huge, but I'd consider 
setting it to no expiration because if something interesting is learned, people 
doing searches have a better chance of finding the problem if the link hasn't 
expired.

I would also separately unmount the file system, note the latest kernel 
message, then mount the file system and see if there are any kernel messages 
that might indicate recognition of problems with the fs.
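That check can be scripted roughly as follows. This is only a sketch:
/dev/sdX and /mnt/backup are placeholders for your actual device and mount
point.

```
# /dev/sdX and /mnt/backup are placeholders -- substitute your own.
umount /mnt/backup
dmesg | tail -n 5          # note the last kernel messages before mounting

mount /dev/sdX /mnt/backup
dmesg | tail -n 30         # anything new here is the fs reacting to the mount
```

Any btrfs warnings that appear only after the mount step are the messages
worth posting to the list.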

I would not use btrfsck --repair until someone says it's a good idea. That 
person would not be me.

Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Corrupt btrfs filesystem recovery... What best instructions?

2013-09-28 Thread Martin
On 28/09/13 23:54, Martin wrote:
 On 28/09/13 20:26, Martin wrote:
 
 ... btrfsck bombs out with LOTs of errors...

 How best to recover from this?

 (This is a 'backup' disk so not 'critical' but it would be nice to avoid
 rewriting about 1.5TB of data over the network...)


 Is there an obvious sequence/recipe to follow for recovery?
 
 
 I've got the drive reliably working with the sata limited to 3Gbit/s.
 What is the best sequence to try to tidy-up and carry on with the 1.5TB
 or so of data on there, rather than working from scratch?
 
 
 So far, I've only run btrfsck since the corruption...

So...

Any options for btrfsck to fix things?

Or is anything/everything that is fixable automatically fixed on the
next mount?

Or should:

btrfs scrub /dev/sdX

be run first?

Or?


What does btrfs do (or can do) for recovery?

Advice welcomed,

Thanks,
Martin






Re: Corrupt btrfs filesystem recovery... (Due to *sata* errors)

2013-09-28 Thread Martin
Chris,

Thanks for good comment/discussion.

On 29/09/13 03:06, Chris Murphy wrote:
 
 On Sep 28, 2013, at 4:51 PM, Martin m_bt...@ml1.co.uk wrote:
 

 Stick with forced 3Gbps, but I think it's worthwhile to find out
 what the actual problem is. One day you forget about this 3Gbps SATA
 link, upgrade or regress to another kernel and you don't have the
 3Gbps forced speed on the parameter line, and poof - you've got more
 problems again. The hardware shouldn't negotiate a 6Gbps link and
 then do a backwards swan dive at 30,000 feet with your data as if it's
 an afterthought.

I've got an engineer's curiosity, so that one is very definitely marked
for revisiting at some point... If only to blog that the x-y-z combination
is a tar pit for your data...


 In any case, for the existing HDD - motherboard combination, using
 sata2 rather than sata3 speeds shouldn't noticeably impact
 performance. (Other than sata2 works reliably and so is infinitely
 better for this case!)
 
 It's true.

Well, the IO data rate for badblocks is exactly the same as before,
limited by the speed of the physical rust spinning and data density...


 I would also separately unmount the file system, note the latest
 kernel message, then mount the file system and see if there are any
 kernel messages that might indicate recognition of problems with the
 fs.
 
 I would not use btrfsck --repair until someone says it's a good idea.
 That person would not be me.

It is sitting unmounted until some informed opinion is gained...


Thanks again for your notes,

Regards,
Martin






[PATCH] Btrfs: fix memory leak of chunks' extent map

2013-09-28 Thread Liu Bo
Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/volumes.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0431147..ee1fdac 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4483,6 +4483,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 		btrfs_crit(fs_info, "Invalid mapping for %Lu-%Lu, got "
 			   "%Lu-%Lu\n", logical, logical+len, em->start,
 			   em->start + em->len);
+		free_extent_map(em);
 		return 1;
 	}
 
@@ -4663,6 +4664,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		btrfs_crit(fs_info, "found a bad mapping, wanted %Lu, "
 			   "found %Lu-%Lu\n", logical, em->start,
 			   em->start + em->len);
+		free_extent_map(em);
 		return -EINVAL;
 	}
 
-- 
1.8.1.4



[PATCH v2] Btrfs: fix memory leak of chunks' extent map

2013-09-28 Thread Liu Bo
As we hold a reference on the extent map from looking it up, we need to drop
that reference before returning to callers.

Signed-off-by: Liu Bo bo.li@oracle.com
---
v2: add the missing changelog.

 fs/btrfs/volumes.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0431147..ee1fdac 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4483,6 +4483,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 		btrfs_crit(fs_info, "Invalid mapping for %Lu-%Lu, got "
 			   "%Lu-%Lu\n", logical, logical+len, em->start,
 			   em->start + em->len);
+		free_extent_map(em);
 		return 1;
 	}
 
@@ -4663,6 +4664,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		btrfs_crit(fs_info, "found a bad mapping, wanted %Lu, "
 			   "found %Lu-%Lu\n", logical, em->start,
 			   em->start + em->len);
+		free_extent_map(em);
 		return -EINVAL;
 	}
 
-- 
1.8.1.4



Re: Corrupt btrfs filesystem recovery... What best instructions?

2013-09-28 Thread Duncan
Martin posted on Sun, 29 Sep 2013 03:10:37 +0100 as excerpted:

 So...
 
 Any options for btrfsck to fix things?
 
 Or is anything/everything that is fixable automatically fixed on the
 next mount?
 
 Or should:
 
 btrfs scrub /dev/sdX
 
 be run first?
 
 Or?
 
 
 What does btrfs do (or can do) for recovery?

Here's a general-case answer (link courtesy of gmane) to the "what order do
I try recovery in" question, which Hugo posted a few weeks ago:

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/27999

Note that in specific cases, someone who knew what they were doing could 
omit some steps and focus on others, but I'm not at that "know what I'm 
doing" level, so...

Scrub... would go before this, if it's useful.  But scrub depends on a 
second, valid copy being available in order to fix the bad-checksum one.  
On a single-device btrfs, btrfs defaults to DUP metadata (unless it's an 
SSD), so you may have a second copy of that, but you won't have a second 
copy of the data.  This is a very strong reason to go btrfs raid1 mode (for 
both data and metadata) if you can, because that gives you a second copy of 
everything, thereby actually making use of btrfs' checksum and scrub 
ability.  (Unfortunately, there is as yet no way to do N-way mirroring; 
there's only the second copy, not a third, no matter how many devices you 
have in that raid1.)
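As an aside, the self-heal behaviour scrub relies on can be sketched as a toy
model. This is plain Python for illustration only, not actual btrfs code: each
block carries a checksum, and a bad copy is repairable only while the mirror's
copy still verifies.

```python
# Toy model of scrub on a checksummed, mirrored filesystem: each block is
# stored on two mirrors, and a stored checksum identifies the good copy.
import zlib

def make_block(data: bytes):
    return {"data": data, "csum": zlib.crc32(data)}

def scrub(mirror_a, mirror_b):
    """Cross-repair two mirrors; return (repaired, unrecoverable) counts."""
    repaired = unrecoverable = 0
    for a, b in zip(mirror_a, mirror_b):
        a_ok = zlib.crc32(a["data"]) == a["csum"]
        b_ok = zlib.crc32(b["data"]) == b["csum"]
        if a_ok and not b_ok:
            b["data"], b["csum"] = a["data"], a["csum"]  # rewrite bad copy
            repaired += 1
        elif b_ok and not a_ok:
            a["data"], a["csum"] = b["data"], b["csum"]
            repaired += 1
        elif not a_ok and not b_ok:
            # No valid copy left (the single-copy-data case): scrub can
            # detect the corruption but cannot repair it.
            unrecoverable += 1
    return repaired, unrecoverable

mirror_a = [make_block(b"hello"), make_block(b"world")]
mirror_b = [make_block(b"hello"), make_block(b"world")]
mirror_b[1]["data"] = b"wOrld"          # simulate corruption on one mirror
print(scrub(mirror_a, mirror_b))        # -> (1, 0)
```

With only one copy of the data (the DUP-metadata, single-copy-data layout),
the "both copies bad" branch is effectively what scrub hits for data blocks,
which is the point being made above.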

Finally, if you mentioned your kernel (and btrfs-tools) versions, I missed 
it, but [boilerplate recommendation, stressed repeatedly both in the wiki 
and on-list]: btrfs is still labeled experimental and under serious 
development, and lots of bugs are fixed in every kernel release.  So, as 
Chris Murphy said, if you're not on 3.11-stable or 3.12-rcX already, get 
there.  Not only can the safety of your data depend on it, but by choosing 
to run experimental code we're all testers, and our reports if something 
does go wrong will be far more usable if we're on a current kernel.  
Similarly, btrfs-tools 0.20-rc1 is already somewhat old; you really should 
be on a git snapshot beyond that.  (The master branch is kept stable; work 
is done in other branches and only merged to master when it's considered 
suitably stable, so a recently updated btrfs-tools master HEAD is, at least 
in theory, always the best possible version you can be running.  If that's 
ever NOT the case, then testers need to report it ASAP so it can be fixed, 
too.)

Back to the kernel: it's worth noting that 3.12-rcX includes a config option 
that turns off most btrfs BUG_ONs by default.  Unless you're a btrfs 
developer (which it doesn't sound like you are), you'll want to activate 
that (turning off the BUG_ONs), as they're not helpful for ordinary users 
and just force unnecessary reboots when something minor and otherwise 
immediately recoverable goes wrong.  That's just one of the latest fixes.
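The option referred to here is, as far as I can tell, CONFIG_BTRFS_ASSERT
(new in 3.12; treat that name as my assumption and verify it in
fs/btrfs/Kconfig). You can check how your running kernel was built with
something like:

```
# CONFIG_BTRFS_ASSERT is my guess at the option's name -- verify it against
# fs/btrfs/Kconfig in your kernel tree before relying on it.
grep BTRFS /boot/config-"$(uname -r)"
# "# CONFIG_BTRFS_ASSERT is not set" would mean the extra assertions are
# disabled, which is the state ordinary users want.
```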

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman
