Re: [PATCH V7] Btrfs: enhance raid1/10 balance heuristic

2018-11-13 Thread Goffredo Baroncelli
On 12/11/2018 12.58, Timofey Titovets wrote:
> From: Timofey Titovets 
> 
> Currently the btrfs raid1/10 balancer distributes read requests to mirrors
> based on pid % number of mirrors.
[...]
>   v6 -> v7:
> - Fixes based on Nikolay Borisov review:
>   * Assume num == 2
>   * Remove "for" loop based on that assumption, where possible
>   * No functional changes
> 
> Signed-off-by: Timofey Titovets 
> Tested-by: Dmitrii Tcvetkov 
> Reviewed-by: Dmitrii Tcvetkov 
> ---
[...]
> +/**
> + * guess_optimal - return guessed optimal mirror
> + *
> + * Optimal expected to be pid % num_stripes
> + *
> + * That's generally OK for spreading load.
> + * Add some balancing based on device queue length.
> + *
> + * Basic ideas:
> + *  - Sequential reads generate a low number of requests,
> + *so if the load on the drives is equal, use pid % num_stripes balancing
> + *  - For mixed rotational/non-rotational mirrors, pick the non-rotational
> + *one as optimal, and repick if the other dev has a "significantly"
> + *shorter queue
> + *  - Repick the optimal mirror if the queue of the other mirror is shorter
> + */
> +static int guess_optimal(struct map_lookup *map, int num, int optimal)
> +{
> + int i;
> + int round_down = 8;
> + /* Init for missing bdevs */
> + int qlen[2] = { INT_MAX, INT_MAX };
> + bool is_nonrot[2] = { false, false };
> + bool all_bdev_nonrot = true;
> + bool all_bdev_rotate = true;
> + struct block_device *bdev;
> +
> + ASSERT(num == 2);
> +
> + /* Check accessible bdevs */
> + for (i = 0; i < 2; i++) {

From your function comment it is not clear why you are comparing "num" to
"2". Pay attention that there are some patches floating around which implement
raid profiles with higher redundancy (IIRC up to 4). I suggest putting your
assumption in the comment too ("...this function works for up to 2 mirrors...")
and, better, adding a define like

#define BTRFS_MAX_RAID1_RAID10_MIRRORS 2

And replace "2" with BTRFS_MAX_RAID1_RAID10_MIRRORS



> + bdev = map->stripes[i].dev->bdev;
> + if (bdev) {
> + qlen[i] = 0;
> + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
> + if (is_nonrot[i])
> + all_bdev_rotate = false;
> + else
> + all_bdev_nonrot = false;
> + }
> + }
> +
> + /*
> +  * Don't bother with computation
> +  * if only one of two bdevs are accessible
> +  */
> + if (qlen[0] == INT_MAX)
> + return 1;
> + if (qlen[1] == INT_MAX)
> + return 0;
> +
> + if (all_bdev_nonrot)
> + round_down = 2;
> +
> + for (i = 0; i < 2; i++) {
> + bdev = map->stripes[i].dev->bdev;
> + qlen[i] = bdev_get_queue_len(bdev, round_down);
> + }
> +
> + /* For mixed case, pick non rotational dev as optimal */
> + if (all_bdev_rotate == all_bdev_nonrot) {
> + if (is_nonrot[0])
> + optimal = 0;
> + else
> + optimal = 1;
> + }
> +
> + if (qlen[optimal] > qlen[(optimal + 1) % 2])
> + optimal = (optimal + 1) % 2;
> +
> + return optimal;
> +}
> +
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>   struct map_lookup *map, int first,
>   int dev_replace_is_ongoing)
> @@ -5177,7 +5274,8 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>   else
>   num_stripes = map->num_stripes;
>  
> - preferred_mirror = first + current->pid % num_stripes;
> + preferred_mirror = first + guess_optimal(map, num_stripes,
> +  current->pid % num_stripes);
>  
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH 8/9] btrfs: Make more generic the code for RAID 6 rebuilding

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

The original code which handles the recovery of a RAID 6 disk array
assumes that all reads are multiples of 1 << GRUB_DISK_SECTOR_BITS and
that all the I/O is done via the struct grub_diskfilter_segment.
This is not true for the btrfs code. In order to reuse the native
grub_raid6_recover() code, it is modified to not call
grub_diskfilter_read_node() directly, but to call a handler passed
as an argument.

Signed-off-by: Goffredo Baroncelli 
Reviewed-by: Daniel Kiper 
---
 grub-core/disk/raid6_recover.c | 52 ++
 include/grub/diskfilter.h  |  9 ++
 2 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/grub-core/disk/raid6_recover.c b/grub-core/disk/raid6_recover.c
index aa674f6ca..0cf691ddf 100644
--- a/grub-core/disk/raid6_recover.c
+++ b/grub-core/disk/raid6_recover.c
@@ -74,14 +74,26 @@ mod_255 (unsigned x)
 }
 
 static grub_err_t
-grub_raid6_recover (struct grub_diskfilter_segment *array, int disknr, int p,
-char *buf, grub_disk_addr_t sector, grub_size_t size)
+raid6_recover_read_node (void *data, int disknr,
+   grub_uint64_t sector,
+   void *buf, grub_size_t size)
+{
+struct grub_diskfilter_segment *array = data;
+
+return grub_diskfilter_read_node (&array->nodes[disknr],
+ (grub_disk_addr_t)sector,
+ size >> GRUB_DISK_SECTOR_BITS, buf);
+}
+
+grub_err_t
+grub_raid6_recover_gen (void *data, grub_uint64_t nstripes, int disknr, int p,
+   char *buf, grub_uint64_t sector, grub_size_t size,
+   int layout, raid_recover_read_t read_func)
 {
   int i, q, pos;
   int bad1 = -1, bad2 = -1;
   char *pbuf = 0, *qbuf = 0;
 
-  size <<= GRUB_DISK_SECTOR_BITS;
   pbuf = grub_zalloc (size);
   if (!pbuf)
 goto quit;
@@ -91,17 +103,17 @@ grub_raid6_recover (struct grub_diskfilter_segment *array, 
int disknr, int p,
 goto quit;
 
   q = p + 1;
-  if (q == (int) array->node_count)
+  if (q == (int) nstripes)
 q = 0;
 
   pos = q + 1;
-  if (pos == (int) array->node_count)
+  if (pos == (int) nstripes)
 pos = 0;
 
-  for (i = 0; i < (int) array->node_count - 2; i++)
+  for (i = 0; i < (int) nstripes - 2; i++)
 {
   int c;
-  if (array->layout & GRUB_RAID_LAYOUT_MUL_FROM_POS)
+  if (layout & GRUB_RAID_LAYOUT_MUL_FROM_POS)
c = pos;
   else
c = i;
@@ -109,8 +121,7 @@ grub_raid6_recover (struct grub_diskfilter_segment *array, 
int disknr, int p,
 bad1 = c;
   else
 {
-  if (! grub_diskfilter_read_node (&array->nodes[pos], sector,
-  size >> GRUB_DISK_SECTOR_BITS, buf))
+ if (!read_func(data, pos, sector, buf, size))
 {
   grub_crypto_xor (pbuf, pbuf, buf, size);
   grub_raid_block_mulx (c, buf, size);
@@ -128,7 +139,7 @@ grub_raid6_recover (struct grub_diskfilter_segment *array, 
int disknr, int p,
 }
 
   pos++;
-  if (pos == (int) array->node_count)
+  if (pos == (int) nstripes)
 pos = 0;
 }
 
@@ -139,16 +150,14 @@ grub_raid6_recover (struct grub_diskfilter_segment 
*array, int disknr, int p,
   if (bad2 < 0)
 {
   /* One bad device */
-  if ((! grub_diskfilter_read_node (&array->nodes[p], sector,
-   size >> GRUB_DISK_SECTOR_BITS, buf)))
+  if (!read_func(data, p, sector, buf, size))
 {
   grub_crypto_xor (buf, buf, pbuf, size);
   goto quit;
 }
 
   grub_errno = GRUB_ERR_NONE;
-  if (grub_diskfilter_read_node (&array->nodes[q], sector,
-size >> GRUB_DISK_SECTOR_BITS, buf))
+  if (read_func(data, q, sector, buf, size))
 goto quit;
 
   grub_crypto_xor (buf, buf, qbuf, size);
@@ -160,14 +169,12 @@ grub_raid6_recover (struct grub_diskfilter_segment 
*array, int disknr, int p,
   /* Two bad devices */
   unsigned c;
 
-  if (grub_diskfilter_read_node (&array->nodes[p], sector,
-size >> GRUB_DISK_SECTOR_BITS, buf))
+  if (read_func(data, p, sector, buf, size))
 goto quit;
 
   grub_crypto_xor (pbuf, pbuf, buf, size);
 
-  if (grub_diskfilter_read_node (&array->nodes[q], sector,
-size >> GRUB_DISK_SECTOR_BITS, buf))
+  if (read_func(data, q, sector, buf, size))
 goto quit;
 
   grub_crypto_xor (qbuf, qbuf, buf, size);
@@ -190,6 +197,15 @@ quit:
   return grub_errno;
 }
 
+static grub_err_t
+grub_raid6_recover (struct grub_diskfilter_segment *array, int disknr, int p,
+char *buf, grub_disk_addr_t sector, grub_size_t size)
+{
+  return grub_raid6_recover_gen (array, array->n

[PATCH 9/9] btrfs: Add RAID 6 recovery for a btrfs filesystem.

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

Add the RAID 6 recovery, in order to use a RAID 6 filesystem even if some
disks (up to two) are missing. This code uses the md RAID 6 code already
present in grub.

Signed-off-by: Goffredo Baroncelli 
---
 grub-core/fs/btrfs.c | 63 +++-
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index db8df0eea..b49f714ef 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 GRUB_MOD_LICENSE ("GPLv3+");
 
@@ -702,11 +703,36 @@ rebuild_raid5 (char *dest, struct raid56_buffer *buffers,
 }
 }
 
+static grub_err_t
+raid6_recover_read_buffer (void *data, int disk_nr,
+  grub_uint64_t addr __attribute__ ((unused)),
+  void *dest, grub_size_t size)
+{
+struct raid56_buffer *buffers = data;
+
+if (!buffers[disk_nr].data_is_valid)
+   return grub_errno = GRUB_ERR_READ_ERROR;
+
+grub_memcpy(dest, buffers[disk_nr].buf, size);
+
+return grub_errno = GRUB_ERR_NONE;
+}
+
+static void
+rebuild_raid6 (struct raid56_buffer *buffers, grub_uint64_t nstripes,
+   grub_uint64_t csize, grub_uint64_t parities_pos, void *dest,
+   grub_uint64_t stripen)
+
+{
+  grub_raid6_recover_gen (buffers, nstripes, stripen, parities_pos,
+  dest, 0, csize, 0, raid6_recover_read_buffer);
+}
+
 static grub_err_t
 raid56_read_retry (struct grub_btrfs_data *data,
   struct grub_btrfs_chunk_item *chunk,
-  grub_uint64_t stripe_offset,
-  grub_uint64_t csize, void *buf)
+  grub_uint64_t stripe_offset, grub_uint64_t stripen,
+  grub_uint64_t csize, void *buf, grub_uint64_t parities_pos)
 {
   struct raid56_buffer *buffers;
   grub_uint64_t nstripes = grub_le_to_cpu16 (chunk->nstripes);
@@ -777,6 +803,15 @@ raid56_read_retry (struct grub_btrfs_data *data,
   ret = GRUB_ERR_READ_ERROR;
   goto cleanup;
 }
+  else if (failed_devices > 2 && (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID6))
+{
+  grub_dprintf ("btrfs",
+   "not enough disks for raid6: total %" PRIuGRUB_UINT64_T
+   ", missing %" PRIuGRUB_UINT64_T "\n",
+   nstripes, failed_devices);
+  ret = GRUB_ERR_READ_ERROR;
+  goto cleanup;
+}
   else
 grub_dprintf ("btrfs",
  "enough disks for RAID 5 rebuilding: total %"
@@ -787,7 +822,7 @@ raid56_read_retry (struct grub_btrfs_data *data,
   if (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID5)
 rebuild_raid5 (buf, buffers, nstripes, csize);
   else
-grub_dprintf ("btrfs", "called rebuild_raid6(), NOT IMPLEMENTED\n");
+rebuild_raid6 (buffers, nstripes, csize, parities_pos, buf, stripen);
 
   ret = GRUB_ERR_NONE;
  cleanup:
@@ -877,9 +912,11 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
unsigned redundancy = 1;
unsigned i, j;
int is_raid56;
+   grub_uint64_t parities_pos = 0;
 
-   is_raid56 = !!(grub_le_to_cpu64 (chunk->type) &
-  GRUB_BTRFS_CHUNK_TYPE_RAID5);
+is_raid56 = !!(grub_le_to_cpu64 (chunk->type) &
+  (GRUB_BTRFS_CHUNK_TYPE_RAID5 |
+   GRUB_BTRFS_CHUNK_TYPE_RAID6));
 
if (grub_le_to_cpu64 (chunk->size) <= off)
  {
@@ -1011,6 +1048,9 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
   *size if the latter is smaller.
   *  - nparities is the number of parities (1 for RAID5, 2 for
   *RAID6); used only in RAID5/6 code.
+  *  - parities_pos is the position of the parity in a row (
+  *2 for P1, 3 for P2...)
   */
  stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
 
@@ -1029,6 +1069,17 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
   */
  grub_divmod64 (high + stripen, nstripes, &stripen);
 
+ /*
+  * parities_pos is equal to "(high - nparities) % nstripes"
+  * (see the diagram above).
+  * However "high - nparities" might be negative (eg when high
+  * == 0) leading to an incorrect computation.
+  * Instead "high + nstripes - nparities" is always positive and
+  * in modulo nstripes is equal to "(high - nparities) % nstripes"
+  */
+ grub_divmod64 (high + nstripes - nparities, nstripes,
+&parities_pos);
+
  stripe_offset = low + chunk_stripe_length * high;
  

[PATCH 6/9] btrfs: Refactor the code that read from disk

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

Move the code in charge of reading the data from disk into a separate
function. This helps to separate the error handling logic (which depends on
the different raid profiles) from the read-from-disk logic.
Refactoring this code also improves the general readability.

This is a preparatory patch, to help the adding of the RAID 5/6 recovery
code.

Signed-off-by: Goffredo Baroncelli 
Reviewed-by: Daniel Kiper 
---
 grub-core/fs/btrfs.c | 75 ++--
 1 file changed, 44 insertions(+), 31 deletions(-)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index 179a2da9d..554f350c5 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -625,6 +625,46 @@ find_device (struct grub_btrfs_data *data, grub_uint64_t 
id)
   return ctx.dev_found;
 }
 
+static grub_err_t
+btrfs_read_from_chunk (struct grub_btrfs_data *data,
+  struct grub_btrfs_chunk_item *chunk,
+  grub_uint64_t stripen, grub_uint64_t stripe_offset,
+  int redundancy, grub_uint64_t csize,
+  void *buf)
+{
+struct grub_btrfs_chunk_stripe *stripe;
+grub_disk_addr_t paddr;
+grub_device_t dev;
+grub_err_t err;
+
+stripe = (struct grub_btrfs_chunk_stripe *) (chunk + 1);
+/* Right now the redundancy handling is easy.
+   With RAID5-like it will be more difficult.  */
+stripe += stripen + redundancy;
+
+paddr = grub_le_to_cpu64 (stripe->offset) + stripe_offset;
+
+grub_dprintf ("btrfs", "stripe %" PRIxGRUB_UINT64_T
+ " maps to 0x%" PRIxGRUB_UINT64_T
+ ". Reading paddr 0x%" PRIxGRUB_UINT64_T "\n",
+ stripen, stripe->offset, paddr);
+
+dev = find_device (data, stripe->device_id);
+if (!dev)
+  {
+   grub_dprintf ("btrfs",
+ "couldn't find a necessary member device "
+ "of multi-device filesystem\n");
+   grub_errno = GRUB_ERR_NONE;
+   return GRUB_ERR_READ_ERROR;
+  }
+
+err = grub_disk_read (dev->disk, paddr >> GRUB_DISK_SECTOR_BITS,
+ paddr & (GRUB_DISK_SECTOR_SIZE - 1),
+ csize, buf);
+return err;
+}
+
 static grub_err_t
 grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
 void *buf, grub_size_t size, int recursion_depth)
@@ -638,7 +678,6 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
   grub_err_t err = 0;
   struct grub_btrfs_key key_out;
   int challoc = 0;
-  grub_device_t dev;
   struct grub_btrfs_key key_in;
   grub_size_t chsize;
   grub_disk_addr_t chaddr;
@@ -885,36 +924,10 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
 
for (i = 0; i < redundancy; i++)
  {
-   struct grub_btrfs_chunk_stripe *stripe;
-   grub_disk_addr_t paddr;
-
-   stripe = (struct grub_btrfs_chunk_stripe *) (chunk + 1);
-   /* Right now the redundancy handling is easy.
-  With RAID5-like it will be more difficult.  */
-   stripe += stripen + i;
-
-   paddr = grub_le_to_cpu64 (stripe->offset) + stripe_offset;
-
-   grub_dprintf ("btrfs", "stripe %" PRIxGRUB_UINT64_T
- " maps to 0x%" PRIxGRUB_UINT64_T "\n",
- stripen, stripe->offset);
-   grub_dprintf ("btrfs", "reading paddr 0x%" PRIxGRUB_UINT64_T
- "\n", paddr);
-
-   dev = find_device (data, stripe->device_id);
-   if (!dev)
- {
-   grub_dprintf ("btrfs",
- "couldn't find a necessary member device "
- "of multi-device filesystem\n");
-   err = grub_errno;
-   grub_errno = GRUB_ERR_NONE;
-   continue;
- }
-
-   err = grub_disk_read (dev->disk, paddr >> GRUB_DISK_SECTOR_BITS,
- paddr & (GRUB_DISK_SECTOR_SIZE - 1),
- csize, buf);
+   err = btrfs_read_from_chunk (data, chunk, stripen,
+stripe_offset,
+i, /* redundancy */
+csize, buf);
if (!err)
  break;
grub_errno = GRUB_ERR_NONE;
-- 
2.19.0



[PATCH 3/9] btrfs: Move the error logging from find_device() to its caller.

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

The caller knows better whether this error is fatal or not, i.e. whether
another disk is available or not.

This is a preparatory patch.

Signed-off-by: Goffredo Baroncelli 
Reviewed-by: Daniel Kiper 
---
 grub-core/fs/btrfs.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index bf0dbce21..2b1dd7dd4 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -603,12 +603,7 @@ find_device (struct grub_btrfs_data *data, grub_uint64_t 
id, int do_rescan)
   if (do_rescan)
    grub_device_iterate (find_device_iter, &ctx);
   if (!ctx.dev_found)
-{
-  grub_error (GRUB_ERR_BAD_FS,
- N_("couldn't find a necessary member device "
-"of multi-device filesystem"));
-  return NULL;
-}
+return NULL;
   data->n_devices_attached++;
   if (data->n_devices_attached > data->n_devices_allocated)
 {
@@ -906,6 +901,9 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
dev = find_device (data, stripe->device_id, j);
if (!dev)
  {
+   grub_dprintf ("btrfs",
+ "couldn't find a necessary member device "
+ "of multi-device filesystem\n");
err = grub_errno;
grub_errno = GRUB_ERR_NONE;
continue;
-- 
2.19.0



[PATCH 5/9] btrfs: Move logging code in grub_btrfs_read_logical()

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

A portion of the logging code is moved outside of the internal for(;;). The
part that is left inside is the one which depends on the internal for(;;) index.

This is a preparatory patch. The next one will refactor the code inside
the for(;;) into another function.

Signed-off-by: Goffredo Baroncelli 
Reviewed-by: Daniel Kiper 
---
 grub-core/fs/btrfs.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index c07d91657..179a2da9d 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -871,6 +871,18 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
 
for (j = 0; j < 2; j++)
  {
+   grub_dprintf ("btrfs", "chunk 0x%" PRIxGRUB_UINT64_T
+ "+0x%" PRIxGRUB_UINT64_T
+ " (%d stripes (%d substripes) of %"
+ PRIxGRUB_UINT64_T ")\n",
+ grub_le_to_cpu64 (key->offset),
+ grub_le_to_cpu64 (chunk->size),
+ grub_le_to_cpu16 (chunk->nstripes),
+ grub_le_to_cpu16 (chunk->nsubstripes),
+ grub_le_to_cpu64 (chunk->stripe_length));
+   grub_dprintf ("btrfs", "reading laddr 0x%" PRIxGRUB_UINT64_T "\n",
+ addr);
+
for (i = 0; i < redundancy; i++)
  {
struct grub_btrfs_chunk_stripe *stripe;
@@ -883,20 +895,11 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
 
paddr = grub_le_to_cpu64 (stripe->offset) + stripe_offset;
 
-   grub_dprintf ("btrfs", "chunk 0x%" PRIxGRUB_UINT64_T
- "+0x%" PRIxGRUB_UINT64_T
- " (%d stripes (%d substripes) of %"
- PRIxGRUB_UINT64_T ") stripe %" PRIxGRUB_UINT64_T
+   grub_dprintf ("btrfs", "stripe %" PRIxGRUB_UINT64_T
  " maps to 0x%" PRIxGRUB_UINT64_T "\n",
- grub_le_to_cpu64 (key->offset),
- grub_le_to_cpu64 (chunk->size),
- grub_le_to_cpu16 (chunk->nstripes),
- grub_le_to_cpu16 (chunk->nsubstripes),
- grub_le_to_cpu64 (chunk->stripe_length),
  stripen, stripe->offset);
grub_dprintf ("btrfs", "reading paddr 0x%" PRIxGRUB_UINT64_T
- " for laddr 0x%" PRIxGRUB_UINT64_T "\n", paddr,
- addr);
+ "\n", paddr);
 
dev = find_device (data, stripe->device_id);
if (!dev)
-- 
2.19.0



[PATCH 4/9] btrfs: Avoid a rescan for a device which was already not found.

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

Change the behavior of find_device(): before the patch, a read of a missing
device might trigger a rescan. However, it was never recorded that a device
was missing, so each single read of a missing device might trigger a rescan.
It is the caller who decides whether a rescan is performed in case of a
missing device, and it does so quite often, without considering whether a
device was already found to be missing in the past.
This behavior causes a lot of unneeded rescans, causing a huge slowdown in
case of a missing device.

After the patch, the "missing device" information is stored in the
data->devices_attached[] array (as a NULL value in the field dev). A rescan
is triggered only if no information at all is found. This means that only
the first read of a missing device triggers a rescan.

The change in the code is done by removing "return NULL" when the disk is
not found, so the code which stores in the data->devices_attached[] array
the value returned by grub_device_iterate() is always executed: NULL if the
device is missing, or valid data otherwise.

Signed-off-by: Goffredo Baroncelli 
---
 grub-core/fs/btrfs.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index 2b1dd7dd4..c07d91657 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -588,7 +588,7 @@ find_device_iter (const char *name, void *data)
 }
 
 static grub_device_t
-find_device (struct grub_btrfs_data *data, grub_uint64_t id, int do_rescan)
+find_device (struct grub_btrfs_data *data, grub_uint64_t id)
 {
   struct find_device_ctx ctx = {
 .data = data,
@@ -600,10 +600,9 @@ find_device (struct grub_btrfs_data *data, grub_uint64_t 
id, int do_rescan)
   for (i = 0; i < data->n_devices_attached; i++)
 if (id == data->devices_attached[i].id)
   return data->devices_attached[i].dev;
-  if (do_rescan)
-grub_device_iterate (find_device_iter, &ctx);
-  if (!ctx.dev_found)
-return NULL;
+
+  grub_device_iterate (find_device_iter, &ctx);
+
   data->n_devices_attached++;
   if (data->n_devices_attached > data->n_devices_allocated)
 {
@@ -615,7 +614,8 @@ find_device (struct grub_btrfs_data *data, grub_uint64_t 
id, int do_rescan)
* sizeof (data->devices_attached[0]));
   if (!data->devices_attached)
{
- grub_device_close (ctx.dev_found);
+ if (ctx.dev_found)
+   grub_device_close (ctx.dev_found);
  data->devices_attached = tmp;
  return NULL;
}
@@ -898,7 +898,7 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
  " for laddr 0x%" PRIxGRUB_UINT64_T "\n", paddr,
  addr);
 
-   dev = find_device (data, stripe->device_id, j);
+   dev = find_device (data, stripe->device_id);
if (!dev)
  {
grub_dprintf ("btrfs",
@@ -975,7 +975,8 @@ grub_btrfs_unmount (struct grub_btrfs_data *data)
   unsigned i;
   /* The device 0 is closed one layer upper.  */
   for (i = 1; i < data->n_devices_attached; i++)
-grub_device_close (data->devices_attached[i].dev);
+if (data->devices_attached[i].dev)
+grub_device_close (data->devices_attached[i].dev);
   grub_free (data->devices_attached);
   grub_free (data->extent);
   grub_free (data);
-- 
2.19.0



[PATCH 7/9] btrfs: Add support for recovery for a RAID 5 btrfs profiles.

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

Add support for recovery for a RAID 5 btrfs profile. In addition,
some code is added as preparatory work for the RAID 6 recovery code.

Signed-off-by: Goffredo Baroncelli 
---
 grub-core/fs/btrfs.c | 160 +--
 1 file changed, 155 insertions(+), 5 deletions(-)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index 554f350c5..db8df0eea 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 GRUB_MOD_LICENSE ("GPLv3+");
 
@@ -665,6 +666,139 @@ btrfs_read_from_chunk (struct grub_btrfs_data *data,
 return err;
 }
 
+struct raid56_buffer {
+  void *buf;
+  int  data_is_valid;
+};
+
+static void
+rebuild_raid5 (char *dest, struct raid56_buffer *buffers,
+  grub_uint64_t nstripes, grub_uint64_t csize)
+{
+  grub_uint64_t i;
+  int first;
+
+  for (i = 0; i < nstripes && buffers[i].data_is_valid; i++);
+
+  if (i == nstripes)
+{
+  grub_dprintf ("btrfs", "called rebuild_raid5(), but all disks are OK\n");
+  return;
+}
+
+  grub_dprintf ("btrfs", "rebuilding RAID 5 stripe #%" PRIuGRUB_UINT64_T "\n", i);
+
+  for (i = 0, first = 1; i < nstripes; i++)
+{
+  if (!buffers[i].data_is_valid)
+   continue;
+
+  if (first) {
+   grub_memcpy(dest, buffers[i].buf, csize);
+   first = 0;
+  } else
+   grub_crypto_xor (dest, dest, buffers[i].buf, csize);
+
+}
+}
+
+static grub_err_t
+raid56_read_retry (struct grub_btrfs_data *data,
+  struct grub_btrfs_chunk_item *chunk,
+  grub_uint64_t stripe_offset,
+  grub_uint64_t csize, void *buf)
+{
+  struct raid56_buffer *buffers;
+  grub_uint64_t nstripes = grub_le_to_cpu16 (chunk->nstripes);
+  grub_uint64_t chunk_type = grub_le_to_cpu64 (chunk->type);
+  grub_err_t ret = GRUB_ERR_OUT_OF_MEMORY;
+  grub_uint64_t i, failed_devices;
+
+  buffers = grub_zalloc (sizeof(*buffers) * nstripes);
+  if (!buffers)
+goto cleanup;
+
+  for (i = 0; i < nstripes; i++)
+{
+  buffers[i].buf = grub_zalloc (csize);
+  if (!buffers[i].buf)
+   goto cleanup;
+}
+
+  for (failed_devices = 0, i = 0; i < nstripes; i++)
+{
+  struct grub_btrfs_chunk_stripe *stripe;
+  grub_disk_addr_t paddr;
+  grub_device_t dev;
+  grub_err_t err;
+
+  stripe = (struct grub_btrfs_chunk_stripe *) (chunk + 1) + i;
+
+  paddr = grub_le_to_cpu64 (stripe->offset) + stripe_offset;
+  grub_dprintf ("btrfs", "reading paddr %" PRIxGRUB_UINT64_T
+" from stripe ID %" PRIxGRUB_UINT64_T "\n", paddr,
+stripe->device_id);
+
+  dev = find_device (data, stripe->device_id);
+  if (!dev)
+   {
+ buffers[i].data_is_valid = 0;
+ grub_dprintf ("btrfs", "stripe %" PRIuGRUB_UINT64_T " FAILED (dev ID %"
+   PRIxGRUB_UINT64_T ")\n", i, stripe->device_id);
+ failed_devices++;
+ continue;
+   }
+
+  err = grub_disk_read (dev->disk, paddr >> GRUB_DISK_SECTOR_BITS,
+   paddr & (GRUB_DISK_SECTOR_SIZE - 1),
+   csize, buffers[i].buf);
+  if (err == GRUB_ERR_NONE)
+   {
+ buffers[i].data_is_valid = 1;
+ grub_dprintf ("btrfs", "stripe %" PRIuGRUB_UINT64_T " Ok (dev ID %"
+   PRIxGRUB_UINT64_T ")\n", i, stripe->device_id);
+   }
+  else
+   {
+ buffers[i].data_is_valid = 0;
+ grub_dprintf ("btrfs", "stripe %" PRIuGRUB_UINT64_T
+   " FAILED (dev ID %" PRIxGRUB_UINT64_T ")\n", i,
+   stripe->device_id);
+ failed_devices++;
+   }
+}
+
+  if (failed_devices > 1 && (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID5))
+{
+  grub_dprintf ("btrfs",
+   "not enough disks for RAID 5: total %" PRIuGRUB_UINT64_T
+   ", missing %" PRIuGRUB_UINT64_T "\n",
+   nstripes, failed_devices);
+  ret = GRUB_ERR_READ_ERROR;
+  goto cleanup;
+}
+  else
+grub_dprintf ("btrfs",
+ "enough disks for RAID 5 rebuilding: total %"
+ PRIuGRUB_UINT64_T ", missing %" PRIuGRUB_UINT64_T "\n",
+ nstripes, failed_devices);
+
+  /* if these are enough, try to rebuild the data */
+  if (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID5)
+rebuild_raid5 (buf, buffers, nstripes, csize);
+  else
+grub_dprintf ("btrfs", "called rebuild_raid6(), NOT IMPLEMENTED\n");
+
+  ret = GRUB_ERR_NONE;
+ cleanup:
+  if (

[PATCH 2/9] btrfs: Add helper to check the btrfs header.

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

This helper is used in a few places to help the debugging. As a
conservative approach the error is only logged; this does not impact
the error handling.

Signed-off-by: Goffredo Baroncelli 
Reviewed-by: Daniel Kiper 
---
 grub-core/fs/btrfs.c | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index 9bc6d399d..bf0dbce21 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -77,7 +77,8 @@ struct btrfs_header
 {
   grub_btrfs_checksum_t checksum;
   grub_btrfs_uuid_t uuid;
-  grub_uint8_t dummy[0x30];
+  grub_uint64_t bytenr;
+  grub_uint8_t dummy[0x28];
   grub_uint32_t nitems;
   grub_uint8_t level;
 } GRUB_PACKED;
@@ -286,6 +287,25 @@ free_iterator (struct grub_btrfs_leaf_descriptor *desc)
   grub_free (desc->data);
 }
 
+static grub_err_t
+check_btrfs_header (struct grub_btrfs_data *data, struct btrfs_header *header,
+grub_disk_addr_t addr)
+{
+  if (grub_le_to_cpu64 (header->bytenr) != addr)
+{
+  grub_dprintf ("btrfs", "btrfs_header.bytenr is not equal to the node addr\n");
+  return grub_error (GRUB_ERR_BAD_FS,
+"header bytenr is not equal to the node addr");
+}
+  if (grub_memcmp (data->sblock.uuid, header->uuid, sizeof(grub_btrfs_uuid_t)))
+{
+  grub_dprintf ("btrfs", "btrfs_header.uuid doesn't match sblock uuid\n");
+  return grub_error (GRUB_ERR_BAD_FS,
+"header uuid doesn't match sblock uuid");
+}
+  return GRUB_ERR_NONE;
+}
+
 static grub_err_t
 save_ref (struct grub_btrfs_leaf_descriptor *desc,
  grub_disk_addr_t addr, unsigned i, unsigned m, int l)
@@ -341,6 +361,7 @@ next (struct grub_btrfs_data *data,
 
   err = grub_btrfs_read_logical (data, grub_le_to_cpu64 (node.addr),
 , sizeof (head), 0);
+  check_btrfs_header (data, , grub_le_to_cpu64 (node.addr));
   if (err)
return -err;
 
@@ -402,6 +423,7 @@ lower_bound (struct grub_btrfs_data *data,
   /* FIXME: preread few nodes into buffer. */
   err = grub_btrfs_read_logical (data, addr, , sizeof (head),
 recursion_depth + 1);
+  check_btrfs_header (data, , addr);
   if (err)
return err;
   addr += sizeof (head);
-- 
2.19.0



[PATCH V8] Add support for BTRFS raid5/6 to GRUB

2018-09-27 Thread Goffredo Baroncelli


Hi All,

the aim of this patch set is to provide support for a BTRFS raid5/6
filesystem in GRUB.

The first patch implements the basic support for raid5/6, i.e. it works when
all the disks are present.

The next 5 patches are preparatory ones.

The 7th patch implements the raid5 recovery for btrfs (i.e. handling the
loss of one disk).
The 8th patch makes the code for handling the raid6 recovery more generic.
The last one implements the raid6 recovery for btrfs (i.e. handling the
loss of up to two disks).

I tested the code in grub-emu, and it works both with all the disks present
and with some disks missing. I checked the crc32 computed by grub and by
linux, and they matched. Finally I checked that the support for md raid6
still works properly, and it does (with all drives and with up to 2 drives
missing).

Comments are welcome.

Changelog
v1: initial support for btrfs raid5/6. No recovery allowed
v2: full support for btrfs raid5/6. Recovery allowed
v3: some minor cleanups suggested by Daniel Kiper; reuse of the
original raid6 recovery code of grub
v4: several spelling fixes; better description of the RAID layout
in btrfs and of the variables which describe the stripe
positioning; split patch #5 in two (#5 and #6)
v5: several spelling fixes; improved code comments in patch #1; small
cleanups in the code
v6: small cleanups; improved the wording in the RAID6 layout
description; in the function raid6_recover_read_buffer() avoid
an unnecessary memcpy in case of invalid data
v7: - patches 2,3,5,6,8 received a Reviewed-by from Daniel and were unchanged
since the last time (only minor cleanups in the commit descriptions requested
by Daniel)
- patch 7 received some small updates rearranging a for(), and some
brackets around an if()
- patch 4 received an updated message which better explains why NULL
is stored in data->devices_attached[]
- patch 9 received a blank line to better separate a code line from
a previous comment. A description of 'parities_pos' was added
- patch 1 received a major update to the description of the variable
meanings in the comment. However I suspect that we need some further review
to reach full agreement about this text. NB: the updates relate only to
comments
v8: - patches 2,5,6,8 received a Reviewed-by from Daniel and were unchanged
since the last time (only minor cleanups in the commit descriptions requested
by Daniel)
- patch 1 received some adjustments to the variable descriptions due to
  the different terminology between BTRFS and other RAID implementations.
  Added a description for the "nparities" variable.
- patch 3 removed some unnecessary curly brackets (change requested by Daniel)
- patch 4 received an improved commit description about why and how
  the function find_device() is changed
- patch 7 received an update which transforms an "i = 0; while (...) i++;"
  loop into a for () loop
- patch 9 received an update to the comment

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




[PATCH 1/9] btrfs: Add support for reading a filesystem with a RAID 5 or RAID 6 profile.

2018-09-27 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

Signed-off-by: Goffredo Baroncelli 
---
 grub-core/fs/btrfs.c | 74 
 1 file changed, 74 insertions(+)

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index be195448d..9bc6d399d 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
 #define GRUB_BTRFS_CHUNK_TYPE_RAID1 0x10
 #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED0x20
 #define GRUB_BTRFS_CHUNK_TYPE_RAID100x40
+#define GRUB_BTRFS_CHUNK_TYPE_RAID5 0x80
+#define GRUB_BTRFS_CHUNK_TYPE_RAID6 0x100
   grub_uint8_t dummy2[0xc];
   grub_uint16_t nstripes;
   grub_uint16_t nsubstripes;
@@ -764,6 +766,78 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
  stripe_offset = low + chunk_stripe_length
* high;
  csize = chunk_stripe_length - low;
+ break;
+   }
+ case GRUB_BTRFS_CHUNK_TYPE_RAID5:
+ case GRUB_BTRFS_CHUNK_TYPE_RAID6:
+   {
+ grub_uint64_t nparities, stripe_nr, high, low;
+
+ redundancy = 1;   /* no redundancy for now */
+
+ if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
+   {
+ grub_dprintf ("btrfs", "RAID5\n");
+ nparities = 1;
+   }
+ else
+   {
+ grub_dprintf ("btrfs", "RAID6\n");
+ nparities = 2;
+   }
+
+ /*
+  * A RAID 6 layout consists of several stripes spread on the
+  * disks, following a layout like the one below
+  *
+  *   Disk0  Disk1  Disk2  Disk3
+  *
+  *A1 B1 P1 Q1
+  *Q2 A2 B2 P2
+  *P3 Q3 A3 B3
+  *  [...]
+  *
+  *  Note that the placement of the parities depends on row index.
+  *  Pay attention that the BTRFS terminology may be different
+  *  from other RAID implementations (e.g. lvm/dm or md). In BTRFS
+  *  a contiguous block of data on a disk (like A1) is called a
+  *  stripe.
+  *  In the code below:
+  *  - stripe_nr is the stripe number without the parities
+  *(A1 = 0, B1 = 1, A2 = 2, B2 = 3, ...),
+  *  - high is the row number (0 for A1...Q1, 1 for Q2...P2, ...),
+  *  - stripen is the disk number in a row (0 for A1,Q2,P3, 1 for
+  *B1...),
+  *  - off is the logical address to read,
+  *  - chunk_stripe_length is the size of a stripe (typically 64k),
+  *  - nstripes is the number of disks of a row
+  *  - low is the offset of the data inside a stripe,
+  *  - stripe_offset is the data offset in an array,
+  *  - csize is the "potential" data to read. It will be reduced to
+  *size if the latter is smaller.
+  *  - nparities is the number of parities (1 for RAID5, 2 for
+  *RAID6); used only in RAID5/6 code.
+  */
+ stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
+
+ /*
+  * stripen is computed without the parities (0 for A1, A2, A3...
+  * 1 for B1, B2...).
+  */
+ high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
+
+ /*
+  * The stripes are spread across the disks: in each row their
+  * position is shifted by 1 place. So to compute the real disk
+  * number occupied by a stripe, we also need to add the
+  * "row number" modulo nstripes (0 for A1, 1 for A2, 2 for
+  * A3).
+  */
+ grub_divmod64 (high + stripen, nstripes, &stripen);
+
+ stripe_offset = low + chunk_stripe_length * high;
+ csize = chunk_stripe_length - low;
+
  break;
}
  default:
-- 
2.19.0



Re: GRUB writing to grubenv outside of kernel fs code

2018-09-19 Thread Goffredo Baroncelli
On 18/09/2018 19.15, Goffredo Baroncelli wrote:
>> b. The bootloader code, would have to have sophisticated enough Btrfs
>> knowledge to know if the grubenv has been reflinked or snapshot,
>> because even if +C, it may not be valid to overwrite, and COW must
>> still happen, and there's no way the code in GRUB can do full blow COW
>> and update a bunch of metadata.

> And what if GRUB ignore the possibility of COWing and overwrite the data ? Is 
> it a so big problem that the data is changed in all the snapshots ? 
> It would be interested if the same problem happens for a swap file.

I had a look at Sandoval's patches implementing swap on BTRFS. This 
patch set
prevents the subvolume containing the swapfile from being snapshotted (and the file 
from being balanced and so on...); what if we added the same constraint to the 
grubenv file ?


BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH 8/9] btrfs: Make more generic the code for RAID 6 rebuilding

2018-09-19 Thread Goffredo Baroncelli
From: Goffredo Baroncelli 

The original code which handles the recovery of a RAID 6 disks array
assumes that all reads are multiple of 1 << GRUB_DISK_SECTOR_BITS and it
assumes that all the I/O is done via the struct grub_diskfilter_segment.
This is not true for the btrfs code. In order to reuse the native
grub_raid6_recover() code, it is modified to not call
grub_diskfilter_read_node() directly, but to call a handler passed
as an argument.

Signed-off-by: Goffredo Baroncelli 
Reviewed-by: Daniel Kiper 
---
 grub-core/disk/raid6_recover.c | 52 ++
 include/grub/diskfilter.h  |  9 ++
 2 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/grub-core/disk/raid6_recover.c b/grub-core/disk/raid6_recover.c
index aa674f6ca..0cf691ddf 100644
--- a/grub-core/disk/raid6_recover.c
+++ b/grub-core/disk/raid6_recover.c
@@ -74,14 +74,26 @@ mod_255 (unsigned x)
 }
 
 static grub_err_t
-grub_raid6_recover (struct grub_diskfilter_segment *array, int disknr, int p,
-char *buf, grub_disk_addr_t sector, grub_size_t size)
+raid6_recover_read_node (void *data, int disknr,
+   grub_uint64_t sector,
+   void *buf, grub_size_t size)
+{
+struct grub_diskfilter_segment *array = data;
+
+return grub_diskfilter_read_node (&array->nodes[disknr],
+ (grub_disk_addr_t)sector,
+ size >> GRUB_DISK_SECTOR_BITS, buf);
+}
+
+grub_err_t
+grub_raid6_recover_gen (void *data, grub_uint64_t nstripes, int disknr, int p,
+   char *buf, grub_uint64_t sector, grub_size_t size,
+   int layout, raid_recover_read_t read_func)
 {
   int i, q, pos;
   int bad1 = -1, bad2 = -1;
   char *pbuf = 0, *qbuf = 0;
 
-  size <<= GRUB_DISK_SECTOR_BITS;
   pbuf = grub_zalloc (size);
   if (!pbuf)
 goto quit;
@@ -91,17 +103,17 @@ grub_raid6_recover (struct grub_diskfilter_segment *array, 
int disknr, int p,
 goto quit;
 
   q = p + 1;
-  if (q == (int) array->node_count)
+  if (q == (int) nstripes)
 q = 0;
 
   pos = q + 1;
-  if (pos == (int) array->node_count)
+  if (pos == (int) nstripes)
 pos = 0;
 
-  for (i = 0; i < (int) array->node_count - 2; i++)
+  for (i = 0; i < (int) nstripes - 2; i++)
 {
   int c;
-  if (array->layout & GRUB_RAID_LAYOUT_MUL_FROM_POS)
+  if (layout & GRUB_RAID_LAYOUT_MUL_FROM_POS)
c = pos;
   else
c = i;
@@ -109,8 +121,7 @@ grub_raid6_recover (struct grub_diskfilter_segment *array, 
int disknr, int p,
 bad1 = c;
   else
 {
-  if (! grub_diskfilter_read_node (&array->nodes[pos], sector,
-  size >> GRUB_DISK_SECTOR_BITS, buf))
+ if (!read_func(data, pos, sector, buf, size))
 {
   grub_crypto_xor (pbuf, pbuf, buf, size);
   grub_raid_block_mulx (c, buf, size);
@@ -128,7 +139,7 @@ grub_raid6_recover (struct grub_diskfilter_segment *array, 
int disknr, int p,
 }
 
   pos++;
-  if (pos == (int) array->node_count)
+  if (pos == (int) nstripes)
 pos = 0;
 }
 
@@ -139,16 +150,14 @@ grub_raid6_recover (struct grub_diskfilter_segment 
*array, int disknr, int p,
   if (bad2 < 0)
 {
   /* One bad device */
-  if ((! grub_diskfilter_read_node (&array->nodes[p], sector,
-   size >> GRUB_DISK_SECTOR_BITS, buf)))
+  if (!read_func(data, p, sector, buf, size))
 {
   grub_crypto_xor (buf, buf, pbuf, size);
   goto quit;
 }
 
   grub_errno = GRUB_ERR_NONE;
-  if (grub_diskfilter_read_node (&array->nodes[q], sector,
-size >> GRUB_DISK_SECTOR_BITS, buf))
+  if (read_func(data, q, sector, buf, size))
 goto quit;
 
   grub_crypto_xor (buf, buf, qbuf, size);
@@ -160,14 +169,12 @@ grub_raid6_recover (struct grub_diskfilter_segment 
*array, int disknr, int p,
   /* Two bad devices */
   unsigned c;
 
-  if (grub_diskfilter_read_node (&array->nodes[p], sector,
-size >> GRUB_DISK_SECTOR_BITS, buf))
+  if (read_func(data, p, sector, buf, size))
 goto quit;
 
   grub_crypto_xor (pbuf, pbuf, buf, size);
 
-  if (grub_diskfilter_read_node (&array->nodes[q], sector,
-size >> GRUB_DISK_SECTOR_BITS, buf))
+  if (read_func(data, q, sector, buf, size))
 goto quit;
 
   grub_crypto_xor (qbuf, qbuf, buf, size);
@@ -190,6 +197,15 @@ quit:
   return grub_errno;
 }
 
+static grub_err_t
+grub_raid6_recover (struct grub_diskfilter_segment *array, int disknr, int p,
+char *buf, grub_disk_addr_t sector, grub_size_t size)
+{
+  return grub_raid6_recover_gen (array, array->n

Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Goffredo Baroncelli
On 18/09/2018 20.52, Chris Murphy wrote:
> On Tue, Sep 18, 2018 at 11:15 AM, Goffredo Baroncelli
>  wrote:
>> On 18/09/2018 06.21, Chris Murphy wrote:
>>> b. The bootloader code, would have to have sophisticated enough Btrfs
>>> knowledge to know if the grubenv has been reflinked or snapshot,
>>> because even if +C, it may not be valid to overwrite, and COW must
>>> still happen, and there's no way the code in GRUB can do full blow COW
>>> and update a bunch of metadata.
>>
>> And what if GRUB ignore the possibility of COWing and overwrite the data ? 
>> Is it a so big problem that the data is changed in all the snapshots ?
>> It would be interested if the same problem happens for a swap file.
> 
> I think it's an abomination :-) It totally perverts the idea of
> reflinks and snapshots and blurs the line between domains. 

:-)

> Is it a
> user file or not and are these user space commands or not and are they
> reliable or do they have exceptions?

On this statement I fully agree, on the one below a bit less
> 
> I have a boot subvolume mounted at /boot, and this boot subvolume gets
> snapshot, and if GRUB can overwrite grubenv, it overwrites the
> purported GRUB state information in every one of those boots, going
> back maybe months, even when these are read only subvolumes.

Also the 'suse' behavior has the same issue: storing the data somewhere in the 
storage reserved area suffers from the same problem. We should be realistic: 
without implementing a full btrfs filesystem engine, it is nearly impossible to have 
a grubenv file visible to the filesystem and snapshottable.


> 
> I think it's a problem, and near as I can tell it'll be a problem for
> all kinds of complex storage. I don't see how the bootloader itself
> can do an overwrite onto raid5 or raid6. 


> That's certainly supported by GRUB for reading
Not yet, I am working on that [1]

> but is the bootloader overwrite of gruvenv going to
> recompute parity and write to multiple devices? Eek!

Recomputing the parity should not be a big deal. Updating all the (b)trees would 
be too complex a goal.
> 
> 

[1] http://lists.gnu.org/archive/html/grub-devel/2018-06/msg00064.html
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Goffredo Baroncelli
On 18/09/2018 06.21, Chris Murphy wrote:
> b. The bootloader code, would have to have sophisticated enough Btrfs
> knowledge to know if the grubenv has been reflinked or snapshot,
> because even if +C, it may not be valid to overwrite, and COW must
> still happen, and there's no way the code in GRUB can do full blow COW
> and update a bunch of metadata.

And what if GRUB ignores the possibility of COWing and overwrites the data ? Is 
it such a big problem that the data is changed in all the snapshots ? 
It would be interesting to know if the same problem happens for a swap file.


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Goffredo Baroncelli
On 07/20/2018 07:17 AM, Andrei Borzenkov wrote:
> 18.07.2018 22:42, Goffredo Baroncelli пишет:
>> On 07/18/2018 09:20 AM, Duncan wrote:
>>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>>> excerpted:
>>>
>>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>>> excerpted:
>>>>>
>>>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>>>
>>>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>>>> parity are mutually exclusive.
>>>>>
>>>>> I can't agree.  I don't know whether you meant that in the global
>>>>> sense,
>>>>> or purely in the btrfs context (which I suspect), but either way I
>>>>> can't agree.
>>>>>
>>>>> In the pure btrfs context, while striping and mirroring/pairing are
>>>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>>>> flexible enough to allow both together and the feature may at some
>>>>> point be added, so it makes sense to have a layout notation format
>>>>> flexible enough to allow it as well.
>>>>
>>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>>> have - striping (RAID0)
>>>> - parity  (?)
>>>> - striping + parity  (e.g. RAID5/6)
>>>> - mirroring  (RAID1)
>>>> - mirroring + striping  (RAID10)
>>>>
>>>> However you can't have mirroring+parity; this means that a notation
>>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>>> too verbose.
>>>
>>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
>>> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
>>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
>>> on top of raid0.  
>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on 
>> top of) ???
>>
>> Seriously, of course you can combine a lot of different profile; however the 
>> only ones that make sense are the ones above.
> 
> RAID50 (striping across RAID5) is common.

Yeah, someone else reported that. But other than reducing the number of disks per 
raid5 (increasing the ratio of data disks to parity disks), what other advantages 
does it have ? 
Limiting the number of disks per raid would be quite simple to 
implement in the BTRFS "chunk allocator"

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Goffredo Baroncelli
On 07/19/2018 09:10 PM, Austin S. Hemmelgarn wrote:
> On 2018-07-19 13:29, Goffredo Baroncelli wrote:
[...]
>>
>> So until now you are repeating what I told: the only useful raid profile are
>> - striping
>> - mirroring
>> - striping+paring (even limiting the number of disk involved)
>> - striping+mirroring
> 
> No, not quite.  At least, not in the combinations you're saying make sense if 
> you are using standard terminology.  RAID05 and RAID06 are not the same thing 
> as 'striping+parity' as BTRFS implements that case, and can be significantly 
> more optimized than the trivial implementation of just limiting the number of 
> disks involved in each chunk (by, you know, actually striping just like what 
> we currently call raid10 mode in BTRFS does).

Could you provide more information ?

>>
>>>
>>> RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they 
>>> might actually make sense in BTRFS to provide a backup means of rebuilding 
>>> blocks that fail checksum validation if both copies fail.
>> If you need further redundancy, it is easy to implement a parity3 and 
>> parity4 raid profile instead of stacking a raid6+raid1
> I think you're misunderstanding what I mean here.
> 
> RAID15/16 consist of two layers:
> * The top layer is regular RAID1, usually limited to two copies.
> * The lower layer is RAID5 or RAID6.
> 
> This means that the lower layer can validate which of the two copies in the 
> upper layer is correct when they don't agree.  

This happens only because there is a redundancy greater than 1. Anyway, BTRFS 
has checksums, which help a lot in this area

> It doesn't really provide significantly better redundancy (they can 
> technically sustain more disk failures without failing completely than simple 
> two-copy RAID1 can, but just like BTRFS raid10, they can't reliably survive 
> more than one (or two if you're using RAID6 as the lower layer) disk 
> failure), so it does not do the same thing that higher-order parity does.
>>
>>>>
>>>> The fact that you can combine striping and mirroring (or pairing) makes 
>>>> sense because you could have a speed gain (see below).
>>>> []
>>>>>>>
>>>>>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>>>>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>>>>>> much works this way except with huge (gig size) chunks.
>>>>>>
>>>>>> As implemented in BTRFS, raid1 doesn't have striping.
>>>>>
>>>>> The argument is that because there's only two copies, on multi-device
>>>>> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
>>>>> alternate device pairs, it's effectively striped at the macro level, with
>>>>> the 1 GiB device-level chunks effectively being huge individual device
>>>>> strips of 1 GiB.
>>>>
>>>> The striping concept is based to the fact that if the "stripe size" is 
>>>> small enough you have a speed benefit because the reads may be performed 
>>>> in parallel from different disks.
>>> That's not the only benefit of striping though.  The other big one is that 
>>> you now have one volume that's the combined size of both of the original 
>>> devices.  Striping is arguably better for this even if you're using a large 
>>> stripe size because it better balances the wear across the devices than 
>>> simple concatenation.
>>
>> Striping means that the data is interleaved between the disks with a 
>> reasonable "block unit". Otherwise which would be the difference between 
>> btrfs-raid0 and btrfs-single ?
> Single mode guarantees that any file less than the chunk size in length will 
> either be completely present or completely absent if one of the devices 
> fails.  BTRFS raid0 mode does not provide any such guarantee, and in fact 
> guarantees that all files that are larger than the stripe unit size (however 
> much gets put on one disk before moving to the next) will all lose data if a 
> device fails.
> 
> Stupid as it sounds, this matters for some people.

I think that having separate filesystems would be even better.

>>
>>>
>>>> With a "stripe size" of 1GB, it is very unlikely that this would happens.
>>> That's a pretty big assumption.  There are all kinds of access patterns 
>>> that will still distribute the load reasonably evenly across the 
>>> constituent devices, even if they don't parallelize thin

Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Goffredo Baroncelli
On 07/18/2018 09:20 AM, Duncan wrote:
> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
> excerpted:
> 
>> On 07/17/2018 11:12 PM, Duncan wrote:
>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>> excerpted:
>>>
>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>
>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>> parity are mutually exclusive.
>>>
>>> I can't agree.  I don't know whether you meant that in the global
>>> sense,
>>> or purely in the btrfs context (which I suspect), but either way I
>>> can't agree.
>>>
>>> In the pure btrfs context, while striping and mirroring/pairing are
>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>> flexible enough to allow both together and the feature may at some
>>> point be added, so it makes sense to have a layout notation format
>>> flexible enough to allow it as well.
>>
>> When I say orthogonal, It means that these can be combined: i.e. you can
>> have - striping (RAID0)
>> - parity  (?)
>> - striping + parity  (e.g. RAID5/6)
>> - mirroring  (RAID1)
>> - mirroring + striping  (RAID10)
>>
>> However you can't have mirroring+parity; this means that a notation
>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>> too verbose.
> 
> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
> on top of raid0.  
And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top 
of) ???

Seriously, of course you can combine a lot of different profiles; however the 
only ones that make sense are the ones above.

The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below). 
[]
>>>
>>> As someone else pointed out, md/lvm-raid10 already work like this. 
>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>> much works this way except with huge (gig size) chunks.
>>
>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device 
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to 
> alternate device pairs, it's effectively striped at the macro level, with 
> the 1 GiB device-level chunks effectively being huge individual device 
> strips of 1 GiB.

The striping concept is based on the fact that if the "stripe size" is small 
enough you get a speed benefit, because the reads may be performed in parallel 
from different disks.
With a "stripe size" of 1GB, it is very unlikely that this would happen.

 
> At 1 GiB strip size it doesn't have the typical performance advantage of 
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
> strips/chunks.



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Goffredo Baroncelli
On 07/17/2018 11:12 PM, Duncan wrote:
> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
> excerpted:
> 
>> On 07/15/2018 04:37 PM, waxhead wrote:
> 
>> Striping and mirroring/pairing are orthogonal properties; mirror and
>> parity are mutually exclusive.
> 
> I can't agree.  I don't know whether you meant that in the global sense, 
> or purely in the btrfs context (which I suspect), but either way I can't 
> agree.
> 
> In the pure btrfs context, while striping and mirroring/pairing are 
> orthogonal today, Hugo's whole point was that btrfs is theoretically 
> flexible enough to allow both together and the feature may at some point 
> be added, so it makes sense to have a layout notation format flexible 
> enough to allow it as well.

When I say orthogonal, I mean that these can be combined: i.e. you can have
- striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However you can't have mirroring+parity; this means that a notation where both 
'C' ( = number of copies) and 'P' ( = number of parities) appear is too verbose.

[...]
> 
>> Question #2: historically RAID10 is requires 4 disks. However I am
>> guessing if the stripe could be done on a different number of disks:
>> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
>> that every 64k, the data are stored on a different disk
> 
> As someone else pointed out, md/lvm-raid10 already work like this.  What 
> btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much 
> works this way except with huge (gig size) chunks.

As implemented in BTRFS, raid1 doesn't have striping.

 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-16 Thread Goffredo Baroncelli
On 07/15/2018 04:37 PM, waxhead wrote:
> David Sterba wrote:
>> An interesting question is the naming of the extended profiles. I picked
>> something that can be easily understood but it's not a final proposal.
>> Years ago, Hugo proposed a naming scheme that described the
>> non-standard raid varieties of the btrfs flavor:
>>
>> https://marc.info/?l=linux-btrfs=136286324417767
>>
>> Switching to this naming would be a good addition to the extended raid.
>>
> As just a humble BTRFS user I agree and really think it is about time to move 
> far away from the RAID terminology. However adding some more descriptive 
> profile names (or at least some aliases) would be much better for the 
> commoners (such as myself).
> 
> For example:
> 
> Old format / New Format / My suggested alias
> SINGLE  / 1C / SINGLE
> DUP / 2CD    / DUP (or even MIRRORLOCAL1)
> RAID0   / 1CmS   / STRIPE


> RAID1   / 2C / MIRROR1
> RAID1c3 / 3C / MIRROR2
> RAID1c4 / 4C / MIRROR3
> RAID10  / 2CmS   / STRIPE.MIRROR1

Striping and mirroring/pairing are orthogonal properties; mirror and parity are 
mutually exclusive. What about

RAID1 -> MIRROR1
RAID10 -> MIRROR1S
RAID1c3 -> MIRROR2
RAID1c3+striping -> MIRROR2S

and so on...

> RAID5   / 1CmS1P / STRIPE.PARITY1
> RAID6   / 1CmS2P / STRIPE.PARITY2

To me these should be called something like

RAID5 -> PARITY1S
RAID6 -> PARITY2S

The final S is due to the fact that RAID5/6 usually spread the data on all 
available disks

Question #1: for "parity" profiles, does it make sense to limit the maximum number 
of disks the data may be spread across ? If the answer is no, we could omit the 
last S. IMHO it should. 
Question #2: historically RAID10 requires 4 disks. However I am wondering whether 
the striping could be done on a different number of disks: what about 
RAID1+striping on 3 (or 5) disks ? The key of striping is that every 64k the 
data is stored on a different disk







> 
> I find that writing something like "btrfs balance start 
> -dconvert=stripe5.parity2 /mnt" is far less confusing and therefore less 
> error prone than writing "-dconvert=1C5S2P".
> 
> While Hugo's suggestion is compact and to the point I would call for 
> expanding that so it is a bit more descriptive and human readable.
> 
> So for example : STRIPE<n> where <n> obviously is the same as Hugo 
> proposed - the number of storage devices for the stripe and no <n> would be 
> best to mean 'use max devices'.
> For PARITY then <n> is obviously required
> 
> Keep in mind that most people (...and I am willing to bet even Duncan which 
> probably HAS backups ;) ) get a bit stressed when their storage system is 
> degraded. With that in mind I hope for more elaborate, descriptive and human 
> readable profile names to be used to avoid making mistakes using the 
> "compact" layout.
> 
> ...and yes, of course this could go both ways. A more compact (and dare I say 
> cryptic) variant can cause people to stop and think before doing something 
> and thus avoid errors,
> 
> Now that I made my point I can't help being a bit extra harsh, obnoxious and 
> possibly difficult, so I would also suggest that Hugo's format could have been 
> changed (dare I say improved?) from
> 
> numCOPIESnumSTRIPESnumPARITY
> 
> to.
> 
> REPLICASnum.STRIPESnum.PARITYnum
> 
> Which would make the above table look like so:
> 
> Old format / My Format / My suggested alias
> SINGLE  / R0.S0.P0 / SINGLE
> DUP / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
> RAID0   / R0.Sm.P0 / STRIPE
> RAID1   / R1.S0.P0 / MIRROR1
> RAID1c3 / R2.S0.P0 / MIRROR2
> RAID1c4 / R3.S0.P0 / MIRROR3
> RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
> RAID5   / R1.Sm.P1 / STRIPE.PARITY1
> RAID6   / R1.Sm.P2 / STRIPE.PARITY2
> 
> And i think this is much more readable, but others may disagree. And as a 
> side note... from a (hobby) coders perspective this is probably simpler to 
> parse as well.
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3)

2018-07-13 Thread Goffredo Baroncelli
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 5ca1d21fc4a7..137952d3375d 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -825,7 +825,8 @@ enum btrfs_err_code {
>   BTRFS_ERROR_DEV_TGT_REPLACE,
>   BTRFS_ERROR_DEV_MISSING_NOT_FOUND,
>   BTRFS_ERROR_DEV_ONLY_WRITABLE,
> - BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
> + BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
> + BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
>  };
>  
>  #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index aff1356c2bb8..fa75b63dd928 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -836,6 +836,7 @@ struct btrfs_dev_replace_item {
>  #define BTRFS_BLOCK_GROUP_RAID10 (1ULL << 6)
>  #define BTRFS_BLOCK_GROUP_RAID5 (1ULL << 7)
>  #define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)
> +#define BTRFS_BLOCK_GROUP_RAID1C3   (1ULL << 9)
>  #define BTRFS_BLOCK_GROUP_RESERVED   (BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
>BTRFS_SPACE_INFO_GLOBAL_RSV)
>  
> @@ -847,6 +848,7 @@ enum btrfs_raid_types {
>   BTRFS_RAID_SINGLE,
>   BTRFS_RAID_RAID5,
>   BTRFS_RAID_RAID6,
> + BTRFS_RAID_RAID1C3,
>   BTRFS_NR_RAID_TYPES
>  };
>  
> @@ -856,6 +858,7 @@ enum btrfs_raid_types {
>  
>  #define BTRFS_BLOCK_GROUP_PROFILE_MASK   (BTRFS_BLOCK_GROUP_RAID0 |   \
>BTRFS_BLOCK_GROUP_RAID1 |   \
> +  BTRFS_BLOCK_GROUP_RAID1C3 | \
>BTRFS_BLOCK_GROUP_RAID5 |   \
>BTRFS_BLOCK_GROUP_RAID6 |   \
>BTRFS_BLOCK_GROUP_DUP | \
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Goffredo Baroncelli
On 06/28/2018 04:17 PM, Chris Murphy wrote:
> Btrfs does two, maybe three, bad things:
> 1. No automatic resync. This is a net worse behavior than mdadm and
> lvm, putting data at risk.
> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.
> 3. Apparently if nodatacow, given a file with two copies of different
> transid, Btrfs won't always pick the higher transid copy? If true
> that's terrible, and again not at all what mdadm/lvm are doing.

All these could be avoided simply by not allowing a multi-device filesystem to 
mount without ensuring that all the devices have the same generation.

In the past I proposed a mount.btrfs helper; I still think that it would 
be the right place to
a) put all the checks before mounting the filesystem
b) print the correct information in order to help the user understand what he has 
to do to solve the issues

Regarding your point 3), it must be pointed out that in the case of NOCOW files, even 
having the same transid is not enough. It is still possible that a copy is 
updated before a power failure, preventing the super-block update.
I think that the only way to prevent this from happening is:
  1) using a data journal (which means that each data block is written twice)
OR
  2) using a cow filesystem (with cow enabled, of course !)

I think that this is a good example of why a battery-backed HW RAID controller 
could be better than a SW raid. Of course the likelihood of a lot of these 
problems could be reduced using an uninterruptible power supply.


BR


Re: unsolvable technical issues?

2018-06-24 Thread Goffredo Baroncelli
> this dramatically lessens the required new code and shortens time-
> to-stability to something reasonable, in contrast to the about a decade 
> btrfs has taken already, without yet reaching a full feature set and full 
> stability.  IMO they may well have a point, tho AFAIK they're still new 
> and immature themselves and (I believe) don't have it either, so it's a 
> point that AFAIK has yet to be fully demonstrated.
> 
> We'll see how they evolve.  I do actually expect them to move faster than 
> btrfs, but also expect the interface may not be as smooth and unified as 
> they'd like to present as I expect there to remain some hiccups in 
> smoothing over the layering issues.  Also, because they've deliberately 
> chosen to go with existing technology where possible in ordered to evolve 
> to stability faster, by the same token they're deliberately limiting the 
> evolution to incremental over existing technology, and I expect there's 
> some stuff btrfs will do better as a result... at least until btrfs (or a 
> successor) becomes stable enough for them to integrate (parts of?) it as 
> existing demonstrated-stable technology.

I fully agree with the above sentences...
> 
> The other difference, AFAIK, is that stratis is specifically a 
> corporation making it a/the main money product, whereas btrfs was always 
> something the btrfs devs used at their employers (oracle, facebook), who 
> have other things as their main product.  As such, stratis is much more 
> likely to prioritize things like raid status monitors, hot-spares, etc, 
> that can be part of the product they sell, where they've been lower 
> priority for btrfs.
> 




Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2018-05-03 Thread Goffredo Baroncelli
On 08/02/2017 08:47 PM, Chris Mason wrote:
>> I agree, MD pretty much needs a separate device simply because they can't 
>> allocate arbitrary space on the other array members.  BTRFS can do that 
>> though, and I would actually think that that would be _easier_ to implement 
>> than having a separate device.
>>
>> That said, I do think that it would need to be a separate chunk type, 
>> because things could get really complicated if the metadata is itself using 
>> a parity raid profile.
> 
> Thanks for running with this Liu, I'm reading through all the patches. I do 
> agree that it's better to put the logging into a dedicated chunk type, that 
> way we can have it default to either double or triple mirroring.

Sorry for replying a bit late :-), however it should be sufficient to start the 
writes from a stripe boundary. For a filesystem this is complicated to 
guarantee; for a journal, however, it would be simpler to do.

BR
G.Baroncelli



Re: RAID56 - 6 parity raid

2018-05-03 Thread Goffredo Baroncelli
On 05/03/2018 02:47 PM, Alberto Bursi wrote:
> 
> 
> On 01/05/2018 23:57, Gandalf Corvotempesta wrote:
>> Hi to all
>> I've found some patches from Andrea Mazzoleni that adds support up to 6
>> parity raid.
>> Why these are wasn't merged ?
>> With modern disk size, having something greater than 2 parity, would be
>> great.
> 
> His patch was about a generic library to do RAID6, it wasn't directly 
> for btrfs.
> 
> To actually use that for btrfs someone would have to actually port btrfs 
> to that library.

In the past Andrea ported this library to btrfs too

https://lwn.net/Articles/588106/

> -Alberto

G.Baroncelli



Re: RAID56 - 6 parity raid

2018-05-03 Thread Goffredo Baroncelli
On 05/03/2018 01:26 PM, Austin S. Hemmelgarn wrote:
>> My intention was to highlight that the parity-checksum is not related to the 
>> reliability and safety of raid5/6.
> It may not be related to the safety, but it is arguably indirectly related to 
> the reliability, dependent on your definition of reliability.  Spending less 
> time verifying the parity means you're spending less time in an indeterminate 
> state of usability, which arguably does improve the reliability of the 
> system.  However, that does still have nothing to do with the write hole.

If you start a scrub once per week, the fact that the scrub requires 1 hr or 1 
day doesn't impact the reliability, because in any case you have 1 week of 
un-scrubbed data.


BR 
G.Baroncelli



Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
On 05/02/2018 11:20 PM, waxhead wrote:
> 
> 
[...]
> 
> Ok, before attempting and answer I have to admit that I do not know enough 
> about how RAID56 is laid out on disk in BTRFS terms. Is data checksummed pr. 
> stripe or pr. disk? Is parity calculated on the data only or is it calculated 
> on the data+checksum ?!
> 

Data is checksummed per block. The parity is invisible to the checksum. The 
parity is allocated in an "address space" parallel to the data "address space" 
exposed by the BG.

BR
G.Baroncelli



Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
On 05/02/2018 09:29 PM, Austin S. Hemmelgarn wrote:
> On 2018-05-02 13:25, Goffredo Baroncelli wrote:
>> On 05/02/2018 06:55 PM, waxhead wrote:
>>>>
>>>> So again, which problem would solve having the parity checksummed ? On the 
>>>> best of my knowledge nothing. In any case the data is checksummed so it is 
>>>> impossible to return corrupted data (modulo bug :-) ).
>>>>
>>> I am not a BTRFS dev , but this should be quite easy to answer. Unless you 
>>> checksum the parity there is no way to verify that that the data (parity) 
>>> you use to reconstruct other data is correct.
>>
>> In any case you could catch that the compute data is wrong, because the data 
>> is always checksummed. And in any case you must check the data against their 
>> checksum.
>>
>> My point is that storing the checksum is a cost that you pay *every time*. 
>> Every time you update a part of a stripe you need to update the parity, and 
>> then in turn the parity checksum. It is not a problem of space occupied nor 
>> a computational problem. It is a a problem of write amplification...
>>
>> The only gain is to avoid to try to use the parity when
>> a) you need it (i.e. when the data is missing and/or corrupted)
>> and b) it is corrupted.
>> But the likelihood of this case is very low. And you can catch it during the 
>> data checksum check (which has to be performed in any case !).
>>
>> So from one side you have a *cost every time* (the write amplification), to 
>> other side you have a gain (cpu-time) *only in case* of the parity is 
>> corrupted and you need it (eg. scrub or corrupted data)).
>>
>> IMHO the cost are very higher than the gain, and the likelihood the gain is 
>> very lower compared to the likelihood (=100% or always) of the cost.
> You do realize that a write is already rewriting checksums elsewhere? It 
> would be pretty trivial to make sure that the checksums for every part of a 
> stripe end up in the same metadata block, at which point the only cost is 
> computing the checksum (because when a checksum gets updated, the whole block 
> it's in gets rewritten, period, because that's how CoW works).
> 
> Looking at this another way (all the math below uses SI units):
> 
[...]
Good point: precomputing the checksum of the parity saves a lot of time for the 
scrub process. You can see this in a simpler way by saying that the parity 
calculation (which is dominated by the memory bandwidth) is O(n) (where n is the 
number of disks); checking the parity against a checksum (again dominated by the 
memory bandwidth) is O(1). And when the data written is 2 orders of magnitude 
less than the data stored, the effort required to precompute the checksum is 
negligible.

Anyway, my "rant" started when Duncan put the missing parity checksum next to 
the write hole. The first might be a performance problem. The write hole, 
instead, could lead to losing data. My intention was to highlight that the 
parity checksum is not related to the reliability and safety of raid5/6.

> 
> So, lets look at data usage:
> 
> 1GB of data is translates to 62500 16kB blocks of data, which equates to an 
> additional 15625 blocks for parity.  Adding parity checksums adds a 25% 
> overhead to checksums being written, but that actually doesn't translate to a 
> huge increase in the number of _blocks_ of checksums written.  One 16k block 
> can hold roughly 500 checksums, so it would take 125 blocks worth of 
> checksums without parity, and 157 (technically 156.25, but you can't write a 
> quarter block) with parity checksums. Thus, without parity checksums, writing 
> 1GB of data involves writing 78250 blocks, while doing the same with parity 
> checksums involves writing 78282 blocks, a net change of only 32 blocks, or 
> **0.0409%**.

How would you store the checksum ?
I ask because I am not sure that we could use the "standard" btrfs metadata to 
store the checksum of the parity. Doing so you could face some pathological 
effects like:
- update a block(1) in a stripe(1)
- update the parity of stripe(1) containing block(1)
- update the checksum of parity stripe (1), which is contained in another 
stripe(2) [**]

- update the parity of stripe (2) containing the checksum of parity stripe(1)
- update the checksum of parity stripe (2), which is contained in another 
stripe(3)

and so on...

[**] pay attention that the checksum and the parity *have* to be in different 
stripes, otherwise you have a chicken-and-egg problem: compute the parity, then 
update the checksum, then update the parity again because the checksum has 
changed

In order to avoid that, I fear that you can't store the checksum of the parity 
on top of a raid5/6 profile

Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
On 05/02/2018 08:17 PM, waxhead wrote:
> Goffredo Baroncelli wrote:
>> On 05/02/2018 06:55 PM, waxhead wrote:
>>>>
>>>> So again, which problem would solve having the parity checksummed ? On the 
>>>> best of my knowledge nothing. In any case the data is checksummed so it is 
>>>> impossible to return corrupted data (modulo bug :-) ).
>>>>
>>> I am not a BTRFS dev , but this should be quite easy to answer. Unless you 
>>> checksum the parity there is no way to verify that that the data (parity) 
>>> you use to reconstruct other data is correct.
>>
>> In any case you could catch that the compute data is wrong, because the data 
>> is always checksummed. And in any case you must check the data against their 
>> checksum.
>>
> What if you lost an entire disk? or had corruption for both data AND 
> checksum? How do you plan to safely reconstruct that without checksummed 
> parity?

As a general rule, the returned data is *always* checked against its checksum, 
so in any case wrong data is never returned. Let me say it in other words: 
having the parity checksummed doesn't increase the reliability and/or the 
safety of the RAID rebuilding. I want to repeat it again: even if the parity is 
corrupted, the rebuilt (wrong) data fails the check against its checksum and is 
not returned !

Back to your questions:

1) Losing 1 disk -> 1 fault
2) Losing both data and checksum -> 2 faults

RAID5 is single-fault proof. So if case #2 happens, raid5 can't protect you. 
However BTRFS is capable of detecting that the data is wrong thanks to the 
checksum.
In case #1, there is no problem, because for each stripe you have enough data 
to rebuild the missing one.

Because I have read several times that checksumming the parity would increase 
the reliability and/or the safety of the raid5/6 profile, let me explain the 
logic:

read_from_disk() {
    data = read_data()
    if (data != ERROR && check_checksum(data))
        return data;

    /* checksum mismatch or data is missing */
    full_stripe = read_full_stripe()
    if (raid5_profile) {
        /* raid5 has only one way of rebuilding data */
        data = rebuild_raid5_data(full_stripe)
        if (data != ERROR && check_checksum(data)) {
            rebuild_stripe(data, full_stripe)
            return data;
        }
        /* parity and/or another data is corrupted/missed */
        return ERROR
    }

    for_each raid6_rebuilding_strategies(full_stripe) {
        /*
         * raid6 might have more than one way of rebuilding data
         * depending on the degree of failure
         */
        data = rebuild_raid6_data(full_stripe)
        if (data != ERROR && check_checksum(data)) {
            rebuild_stripe(data, full_stripe)
            /* data is good, return it */
            return data;
        }
    }
    return ERROR
}

In the case of raid5, there is only one way of rebuilding the data. In the case 
of raid6 with 1 fault, there are several ways of rebuilding the data (however 
in case of two failures, there is only one way). So more possibilities have to 
be tested when rebuilding the data.
If the parity is corrupted, the rebuilt data is corrupted too, and it fails the 
check against its checksum.


> 
>> My point is that storing the checksum is a cost that you pay *every time*. 
>> Every time you update a part of a stripe you need to update the parity, and 
>> then in turn the parity checksum. It is not a problem of space occupied nor 
>> a computational problem. It is a a problem of write amplification...
> How much of a problem is this? no benchmarks have been run since the feature 
> is not yet there I suppose.

It is simple: for each stripe touched you need to update the parity(1); then 
you need to update the parity(1) checksum (which in turn requires an update of 
the parity(2) of the stripe where the parity(1) checksum is stored, which in 
turn requires updating the parity(2) checksum... and so on)

> 
>>
>> The only gain is to avoid to try to use the parity when
>> a) you need it (i.e. when the data is missing and/or corrupted)
> I'm not sure I can make out your argument here , but with RAID5/6 you don't 
> have another copy to restore from. You *have* to use the parity to 
> reconstruct data and it is a good thing if this data is trusted.
I never said the opposite

> 
>> and b) it is corrupted.
>> But the likelihood of this case is very low. And you can catch it during the 
>> data checksum check (which has to be performed in any case !).
>>
>> So from one side you have a *cost every time* (the write

Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
On 05/02/2018 06:55 PM, waxhead wrote:
>>
>> So again, which problem would solve having the parity checksummed ? On the 
>> best of my knowledge nothing. In any case the data is checksummed so it is 
>> impossible to return corrupted data (modulo bug :-) ).
>>
> I am not a BTRFS dev , but this should be quite easy to answer. Unless you 
> checksum the parity there is no way to verify that that the data (parity) you 
> use to reconstruct other data is correct.

In any case you could catch that the computed data is wrong, because the data 
is always checksummed. And in any case you must check the data against its 
checksum.

My point is that storing the checksum is a cost that you pay *every time*. 
Every time you update a part of a stripe you need to update the parity, and 
then in turn the parity checksum. It is not a problem of space occupied nor a 
computational problem. It is a problem of write amplification...

The only gain is to avoid trying to use the parity when 
a) you need it (i.e. when the data is missing and/or corrupted)
and b) it is corrupted. 
But the likelihood of this case is very low. And you can catch it during the 
data checksum check (which has to be performed in any case !).

So on one side you have a *cost every time* (the write amplification); on the 
other side you have a gain (cpu-time) *only in the case* that the parity is 
corrupted and you need it (e.g. scrub or corrupted data).

IMHO the costs are much higher than the gain, and the likelihood of the gain is 
much lower than the likelihood (=100%, i.e. always) of the cost.


BR
G.Baroncelli




Re: RAID56 - 6 parity raid

2018-05-02 Thread Goffredo Baroncelli
Hi
On 05/02/2018 03:47 AM, Duncan wrote:
> Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as
> excerpted:
> 
>> Hi to all I've found some patches from Andrea Mazzoleni that adds
>> support up to 6 parity raid.
>> Why these are wasn't merged ?
>> With modern disk size, having something greater than 2 parity, would be
>> great.
> 1) [...] the parity isn't checksummed, 

Why is the fact that the parity is not checksummed a problem ?
I have read several times that this is a problem. However, each time, the 
thread reached the conclusion that... it is not a problem.

So again, which problem would be solved by having the parity checksummed ? To 
the best of my knowledge, none. In any case the data is checksummed, so it is 
impossible to return corrupted data (modulo bugs :-) ).

On the other side, having the parity checksummed would increase both the code 
complexity and the write amplification, because every time a part of the stripe 
is touched not only the parity has to be updated, but also its checksum.


BR
G.Baroncelli



Re: [RFC] Add support for BTRFS raid5/6 to GRUB

2018-04-23 Thread Goffredo Baroncelli
On 04/23/2018 01:50 PM, Daniel Kiper wrote:
> On Tue, Apr 17, 2018 at 09:57:40PM +0200, Goffredo Baroncelli wrote:
>> Hi All,
>>
>> Below you can find a patch to add support for accessing files from
>> grub in a RAID5/6 btrfs filesystem. This is a RFC because it is
>> missing the support for recovery (i.e. if some devices are missed). In
>> the next days (weeks ?) I will extend this patch to support also this
>> case.
>>
>> Comments are welcome.
> 
> More or less LGTM. Just a nitpick below... I am happy to take full blown
> patch into GRUB if it is ready.

Thanks for the comments; however, I have now also implemented the recovery. It 
is under testing. Give me a few days and I will resubmit the patches.

> 
>> BR
>> G.Baroncelli
>>
>>
>> ---
>>
>> commit 8c80a1b7c913faf50f95c5c76b4666ed17685666
>> Author: Goffredo Baroncelli <kreij...@inwind.it>
>> Date:   Tue Apr 17 21:40:31 2018 +0200
>>
>> Add initial support for btrfs raid5/6 chunk
>>
>> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
>> index be195448d..4c5632acb 100644
>> --- a/grub-core/fs/btrfs.c
>> +++ b/grub-core/fs/btrfs.c
>> @@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
>>  #define GRUB_BTRFS_CHUNK_TYPE_RAID1 0x10
>>  #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED0x20
>>  #define GRUB_BTRFS_CHUNK_TYPE_RAID100x40
>> +#define GRUB_BTRFS_CHUNK_TYPE_RAID5 0x80
>> +#define GRUB_BTRFS_CHUNK_TYPE_RAID60x100
>>grub_uint8_t dummy2[0xc];
>>grub_uint16_t nstripes;
>>grub_uint16_t nsubstripes;
>> @@ -764,6 +766,39 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
>> grub_disk_addr_t addr,
>>stripe_offset = low + chunk_stripe_length
>>  * high;
>>csize = chunk_stripe_length - low;
>> +  break;
>> +}
>> +  case GRUB_BTRFS_CHUNK_TYPE_RAID5:
>> +  case GRUB_BTRFS_CHUNK_TYPE_RAID6:
>> +{
>> +  grub_uint64_t nparities;
>> +  grub_uint64_t parity_pos;
>> +  grub_uint64_t stripe_nr, high;
>> +  grub_uint64_t low;
>> +
>> +  redundancy = 1;   /* no redundancy for now */
>> +
>> +  if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
>> +{
>> +  grub_dprintf ("btrfs", "RAID5\n");
>> +  nparities = 1;
>> +}
>> +  else
>> +{
>> +  grub_dprintf ("btrfs", "RAID6\n");
>> +  nparities = 2;
>> +}
>> +
>> +  stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
>> +
>> +  high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
>> +  grub_divmod64 (high+nstripes-nparities, nstripes, &parity_pos);
>> +  grub_divmod64 (parity_pos+nparities+stripen, nstripes, &stripen);
> 
> Missing spaces around "+" and "-".
> 
> Daniel




[RFC] Add support for BTRFS raid5/6 to GRUB

2018-04-17 Thread Goffredo Baroncelli
Hi All,

Below you can find a patch to add support for accessing files from grub in a 
RAID5/6 btrfs filesystem. This is an RFC because it is missing the support for 
recovery (i.e. if some devices are missing). In the next days (weeks ?) I will 
extend this patch to support this case too.

Comments are welcome.

BR
G.Baroncelli


---

commit 8c80a1b7c913faf50f95c5c76b4666ed17685666
Author: Goffredo Baroncelli <kreij...@inwind.it>
Date:   Tue Apr 17 21:40:31 2018 +0200

Add initial support for btrfs raid5/6 chunk

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index be195448d..4c5632acb 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
 #define GRUB_BTRFS_CHUNK_TYPE_RAID1 0x10
 #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED0x20
 #define GRUB_BTRFS_CHUNK_TYPE_RAID100x40
+#define GRUB_BTRFS_CHUNK_TYPE_RAID5 0x80
+#define GRUB_BTRFS_CHUNK_TYPE_RAID60x100
   grub_uint8_t dummy2[0xc];
   grub_uint16_t nstripes;
   grub_uint16_t nsubstripes;
@@ -764,6 +766,39 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
  stripe_offset = low + chunk_stripe_length
* high;
  csize = chunk_stripe_length - low;
+ break;
+   }
+ case GRUB_BTRFS_CHUNK_TYPE_RAID5:
+ case GRUB_BTRFS_CHUNK_TYPE_RAID6:
+   {
+ grub_uint64_t nparities;
+ grub_uint64_t parity_pos;
+ grub_uint64_t stripe_nr, high;
+ grub_uint64_t low;
+
+ redundancy = 1;   /* no redundancy for now */
+
+ if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
+   {
+ grub_dprintf ("btrfs", "RAID5\n");
+ nparities = 1;
+   }
+ else
+   {
+ grub_dprintf ("btrfs", "RAID6\n");
+ nparities = 2;
+   }
+
+ stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
+
+ high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
+ grub_divmod64 (high+nstripes-nparities, nstripes, &parity_pos);
+ grub_divmod64 (parity_pos+nparities+stripen, nstripes, &stripen);
+
+ stripe_offset = low + chunk_stripe_length * high;
+ csize = chunk_stripe_length - low;
+
  break;
    }
  default:




Re: How to dump/find parity of RAID-5 file?

2018-04-17 Thread Goffredo Baroncelli
On 02/06/2017 09:40 PM, Goffredo Baroncelli wrote:
> On 2017-02-03 11:44, Lakshmipathi.G wrote:
>> Hi.
>>
>> Came across this thread 
>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html 
>> Exploring possibility of adding test-scripts around these area using 
>> dump-tree & corrupt-block.But
>> unable to figure-out how to get parity of file or find its location.  
>> dump-tree output gave,
>>
>> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 145096704) itemoff 15557 
>> itemsize 144
>> length 134217728 owner 2 stripe_len 65536 type DATA|RAID5
>> io_align 65536 io_width 65536 sector_size 4096
>> num_stripes 3 sub_stripes 0
>> stripe 0 devid 3 offset 6368 # Is this 
>> parity?Seems empty?
>> dev_uuid f62df114-186c-4e48-8152-9ed15aa078b4
>> stripe 1 devid 2 offset 6368 # Contains file 
>> data-stripe-1
>> dev_uuid c0aeaab0-e57e-4f7a-9356-db1878876d9f
>> stripe 2 devid 1 offset 83034112 # Contains file 
>> data-stripe-2
>> dev_uuid 637b3666-9d8f-4ec4-9969-53b0b933b9b1
>> thanks.
> 
> IIRC, the parity is spread across the disk stripes of the chunk.
> 
> So first you have to find the logical-offset [LO] where the file begins. 
> Then you have to map this offset to the chunk which holds the data. The chunk 
> has the following info:
> - chunk start [CS], chunk length [CL]
> - for each stripe:
>   where the stripe starts
> 
> If you subtract the chunk-start from the logical-offset [ CO == LO-CS], you 
> will find the offset where the data belongs in the chunk.
> 
> As stated above, the PARITY is spread across the chunk stripes. So (supposing 
> that the stripe size is 64K, the raid level is 5, the disks are three), 
> 
> - the first 64k of stripe 0, is data [0..64K)
> - the first 64k of stripe 1, is data [64..128K)
> - the first 64k of stripe 2 is parity, 
> 
> - the 2nd 64k of stripe 0 is parity, 
> - the 2nd 64k of stripe 1, is data [128..196K)
> - the 2nd 64k of stripe 2, is data [192..256K)
> 
> - the 3rd 64k of stripe 0, is data [256..320K)
> - the 3rd 64k of stripe 1 is parity, 
> - the 3rd 64k of stripe 2, is data [320..384K)
> and so on,

I was wrong !
- the 3rd 64k of stripe 0 is data [320..384K)
- the 3rd 64k of stripe 1 is parity
- the 3rd 64k of stripe 2 is data [256..320K)

Basically stripes 0 and 2 were swapped. The idea is that after the parity there 
is stripe 1, then stripe 2, and so on 

disk:   1  2  3

        0  1  P
        P  2  3
        5  P  4
        6  7  P

Where 
  P = parity
  0...7 are the slices of data

So the parity 'P' moves from left to right, starting from the last disk; the 
data slices are placed at increasing addresses from left to right, *starting* 
after the parity, in a circular fashion.

I am trying to implement raid5/6 in grub, so I went deep into the raid5 layout
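
The rotation described above can be written as two small helper functions. This 
is only a sketch using the same modular arithmetic as my grub patch, for 3 
disks and 1 parity stripe; disk indices here are 0-based (0, 1, 2 correspond to 
disks 1, 2, 3):

```c
/* Sketch of the raid5 rotation, same modular arithmetic as the grub
 * patch (nstripes = 3 disks, nparities = 1).  One "row" is one full
 * stripe across all disks. */
#include <assert.h>

enum { NSTRIPES = 3, NPARITIES = 1 };

/* disk (0-based) holding the parity of row 'high' */
static int parity_disk(int high)
{
    return (high + NSTRIPES - NPARITIES) % NSTRIPES;
}

/* disk (0-based) holding data slice 'slice' (0..7 in the diagram) */
static int data_disk(int slice)
{
    int high    = slice / (NSTRIPES - NPARITIES); /* row number */
    int stripen = slice % (NSTRIPES - NPARITIES); /* offset after parity */

    return (parity_disk(high) + NPARITIES + stripen) % NSTRIPES;
}
```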

BR
G.Baroncelli




> 
> To find the data, You have to compare the CO to the data [...) range.
> 
> If you look to an my old patch (unfinished :-( ), you can find some example 
> to dump the different stripe
> 
> [BTRFS-PROGS][PATCH][V2] Add two new commands: 'btrfs insp physical-find' and 
> 'btrfs insp physical-dump'
> 
> 
> 
> 
> 
> 
>> Cheers.
>> Lakshmipathi.G
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
> 




Re: [PATCH v2] btrfs-progs: dump-super: Refactor print function and add extra check

2018-04-11 Thread Goffredo Baroncelli
On 04/11/2018 02:32 AM, Qu Wenruo wrote:
[...]
>>>> so to get rid  of generate_tab_indent and indent_str
>>>
>>> And we need to call such functions in each helper macros, with
>>> duplicated codes.
>>
>> Please look at the asm generated: even if the "source generated" by the 
>> expansion of the macro is bigger, the binary code is smaller.
>> E.g. the code below 
> 
> No, I don't mean asm code, but C code.

Maybe there is some misunderstanding: my code is about 20 LOC, yours is about 
50 LOC... am I missing something ?

[...]
>>> When passing random stream to dump-super, such reason will make output
>>> quite nasty.
>>> So just INVALID to info the user that some of the members don't look
>>> valid is good enough, as the tool is only to help guys who are going to
>>> manually patching superblocks.
>>
>> I think that we should increase the possible target also to who want to make 
>> some debugging :-)
> 
> There are several problems here to output the condition
> 
> 1) Loose condition
> for basic alignment check it may looks good to output the condition, but
> the fact is, the condition is not 100% correct for 64K pages system.
> So when output IS_ALIGN(value, SZ_4K), it's not 100% correct.

I don't understand your statement: is the alignment the same on all systems ? 
If not, does this mean that a filesystem created on x86 might not work on PPC64 
(which IIRC is 64k-page hardware) ?


> 
> 2) Too long condition for some case.
> !(is_power_of_2(value) && value >= SZ_4K && value <= SZ_64K)
> This seems pretty strange in fact.

I agree that this is quite verbose... however, if this is the constraint, why 
not print it ?

> 
> 3) C macro usage
>Just a minor problem, macro usage such as SZ_4K/SZ_64K may not looks
>good for new developers.

I find them quite clear and intuitive, even though this is the first time I 
have noticed these macros

[...]
> 
> Thanks,
> Qu
BR
G.Baroncelli



Re: [PATCH v2] btrfs-progs: dump-super: Refactor print function and add extra check

2018-04-09 Thread Goffredo Baroncelli
 btrfs_stack_device_bytes_used(&sb->dev_item));
> - printf("dev_item.io_align\t%u\n", (unsigned int)
> -btrfs_stack_device_io_align(&sb->dev_item));
> - printf("dev_item.io_width\t%u\n", (unsigned int)
> -btrfs_stack_device_io_width(&sb->dev_item));
> - printf("dev_item.sector_size\t%u\n", (unsigned int)
> -btrfs_stack_device_sector_size(&sb->dev_item));
> - printf("dev_item.devid\t\t%llu\n",
> -btrfs_stack_device_id(&sb->dev_item));
> - printf("dev_item.dev_group\t%u\n", (unsigned int)
> -btrfs_stack_device_group(&sb->dev_item));
> - printf("dev_item.seek_speed\t%u\n", (unsigned int)
> -btrfs_stack_device_seek_speed(&sb->dev_item));
> - printf("dev_item.bandwidth\t%u\n", (unsigned int)
> -btrfs_stack_device_bandwidth(&sb->dev_item));
> - printf("dev_item.generation\t%llu\n", (unsigned long long)
> -btrfs_stack_device_generation(&sb->dev_item));
> + /* For embedded device item, don't do extra check, just like kernel */
> + print_super_dev(sb, type);
> + print_super_dev(sb, total_bytes);
> + print_super_dev(sb, bytes_used);
> + print_super_dev(sb, io_align);
> + print_super_dev(sb, io_width);
> + print_super_dev(sb, sector_size);
> + print_super_dev(sb, id);
> + print_super_dev(sb, group);
> + print_super_dev(sb, seek_speed);
> + print_super_dev(sb, bandwidth);
> + print_super_dev(sb, generation);
>   if (full) {
>   printf("sys_chunk_array[%d]:\n", BTRFS_SYSTEM_CHUNK_ARRAY_SIZE);
>   print_sys_chunk_array(sb);
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-04 Thread Goffredo Baroncelli
On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
>> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
[...]
>> Before you pointed out that the non-contiguous block written has
>> an impact on performance. I am replaying  that the switching from a
>> different BG happens at the stripe-disk boundary, so in any case the
>> block is physically interrupted and switched to another disk
> 
> The difference is that the write is switched to a different local address
> on the disk.
> 
> It's not "another" disk if it's a different BG.  Recall in this plan
> there is a full-width BG that is on _every_ disk, which means every
> small-width BG shares a disk with the full-width BG.  Every extent tail
> write requires a seek on a minimum of two disks in the array for raid5,
> three disks for raid6.  A tail that is strip-width minus one will hit
> N - 1 disks twice in an N-disk array.

Below I made a little simulation; my results tell me something different:

Current BTRFS (w/write hole)

Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)

Case A.1): extent size = 192kb:
5 writes of 64kb spread on 5 disks (3data + 2 parity)

Case A.2.2): extent size = 256kb: (optimistic case: contiguous space available)
5 writes of 64kb spread on 5 disks (3 data + 2 parity)
2 reads of 64 kb spread on 2 disks (two old data of the stripe) [**]
3 writes of 64 kb spread on 3 disks (data + 2 parity)

Note that the two reads are contiguous to the 5 writes both in terms of space 
and in terms of time. The three writes are contiguous only in terms of space, 
not in terms of time, because they can happen only after the 2 reads and the 
consequent parity computations. So we should assume that some disk activity 
happens between these two events; this means seeks between the 2 reads and the 
3 writes.


BTRFS with multiple BG (wo/write hole)

Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)

Case B.1): extent size = 192kb:
5 writes of 64kb spread on 5 disks

Case B.2): extent size = 256kb:
5 writes of 64kb spread on 5 disks in BG#1
3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks)

So if I count correctly:
- case B1 vs A1: these are equivalent
- case B2 vs A2.1/A2.2:
8 writes vs 8 writes
3 seeks vs 3 seeks
0 reads vs 2 reads

So to me it seems that the cost of an RMW cycle is worse than the cost of 
seeking to another BG.
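The counting above can be encoded in a small sketch. The numbers (5-disk RAID6, 64KB disk stripe, so a full stripe holds 192KB of data) and the seek accounting follow my reading of cases A and B; the function names are mine and this is purely illustrative, not btrfs code:

```c
/* Operation counts for one extent write: 5-disk RAID6, 64KB disk
 * stripe, full stripe = 3 data + 2 parity strips (192KB of data). */
struct io_ops { int writes, reads, seeks; };

/* Case A: current btrfs, RMW cycle for the partial-stripe tail. */
static struct io_ops rmw_tail(int extent_kb)
{
	struct io_ops o = {0, 0, 0};
	int tail = extent_kb % 192;            /* partial-stripe tail  */

	o.writes = (extent_kb / 192) * 5;      /* 3 data + 2 parity    */
	if (tail) {
		int tail_disks = tail / 64;
		o.reads  = 3 - tail_disks;     /* old data of stripe   */
		o.writes += tail_disks + 2;    /* new data + 2 parity  */
		o.seeks  = tail_disks + 2;     /* writes wait on reads */
	}
	return o;
}

/* Case B: multiple BGs, the tail goes to a narrower BG instead. */
static struct io_ops bg_tail(int extent_kb)
{
	struct io_ops o = {0, 0, 0};
	int tail = extent_kb % 192;

	o.writes = (extent_kb / 192) * 5;
	if (tail) {
		int tail_disks = tail / 64;
		o.writes += tail_disks + 2;    /* new data + 2 parity  */
		o.seeks  = tail_disks + 2;     /* switch to other BG   */
	}
	return o;
}
```

For a 256KB extent this gives 8 writes/2 reads/3 seeks in case A versus 8 writes/0 reads/3 seeks in case B, matching the tally above.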

Anyway, also thanks to this discussion, I am reaching the conclusion that this 
is not enough. Even if we solved the problem of the "extent smaller than 
stripe" write, we would still face this issue again when part of a file is 
changed. In that case the file update breaks the old extent and creates three 
extents: the first part, the new part, the last part. Up to that point 
everything is OK. However, the "old" part of the file would be marked as free 
space, and reusing that space could require an RMW cycle.

I am concluding that the only two reliable solutions are 
a) variable stripe size (like ZFS does) 
or b) logging the RMW cycle of a stripe 


[**] Does someone know whether the checksums are verified during this read?
[...]
 
BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-03 Thread Goffredo Baroncelli
On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
>> I have to point out that in any case the extent is physically
>> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
>> you want to write 128KB, the first half is written in the first disk,
>> the other in the 2nd disk.  If you want to write 96kb, the first 64
>> are written in the first disk, the last part in the 2nd, only on a
>> different BG.
> The "only on a different BG" part implies something expensive, either
> a seek or a new erase page depending on the hardware.  Without that,
> nearby logical blocks are nearby physical blocks as well.

In any case it happens on a different disk

> 
>> So yes there is a fragmentation from a logical point of view; from a
>> physical point of view the data is spread on the disks in any case.

> What matters is the extent-tree point of view.  There is (currently)
> no fragmentation there, even for RAID5/6.  The extent tree is unaware
> of RAID5/6 (to its peril).

Before, you pointed out that the non-contiguous block written has an impact on 
performance. I am replying that the switch to a different BG happens at the 
disk-stripe boundary, so in any case the block is physically interrupted and 
switched to another disk.

However, yes: from an extent-tree point of view there will be an increase in 
the number of extents, because the tail of the write is allocated in another 
BG (if the size is not stripe-aligned).

> If an application does a loop writing 68K then fsync(), the multiple-BG
> solution adds two seeks to read every 68K.  That's expensive if sequential
> read bandwidth is more scarce than free space.

Why do you talk about additional seeks? In any case (even without the 
additional BG) the read happens from another disk.

>> * c),d),e) are applied only for the tail of the extent, in case the
> size is less than the stripe size.
> 
> It's only necessary to split an extent if there are no other writes
> in the same transaction that could be combined with the extent tail
> into a single RAID stripe.  As long as everything in the RAID stripe
> belongs to a single transaction, there is no write hole

Maybe a simpler optimization would be to close the transaction when the data 
reaches the stripe boundary... but I suspect that this is not so simple to 
implement.

> Not for d.  Balance doesn't know how to get rid of unreachable blocks
> in extents (it just moves the entire extent around) so after a balance
> the writes would still be rounded up to the stripe size.  Balance would
> never be able to free the rounded-up space.  That space would just be
> gone until the file was overwritten, deleted, or defragged.

If balance is capable of moving extents, why not place one near the other 
during a balance? The goal is not to limit writing the tail of an extent, but 
to avoid writing the tail of an extent without further data (i.e. the gap up 
to the stripe boundary has to be filled in the same transaction).

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-03 Thread Goffredo Baroncelli
On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
>>>> I thought that a possible solution is to create BG with different
>>> number of data disks. E.g. supposing to have a raid 6 system with 6
>>> disks, where 2 are parity disk; we should allocate 3 BG
>>>> BG #1: 1 data disk, 2 parity disks
>>>> BG #2: 2 data disks, 2 parity disks,
>>>> BG #3: 4 data disks, 2 parity disks
>>>>
>>>> For simplicity, the disk-stripe length is assumed = 4K.
>>>>
>>>> So If you have a write with a length of 4 KB, this should be placed
>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
>>> should be placed in in BG#2, then in BG#1.
>>>> This would avoid space wasting, even if the fragmentation will
>>> increase (but shall the fragmentation matters with the modern solid
>>> state disks ?).
>> I don't really see why this would increase fragmentation or waste space.

> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> remaining 2 blocks).  It also flips the usual order of "determine size
> of extent, then allocate space for it" which might require major surgery
> on the btrfs allocator to implement.

I have to point out that in any case the extent is physically interrupted at 
the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, 
the first half is written to the first disk and the other half to the 2nd disk. 
If you want to write 96KB, the first 64KB are written to the first disk and the 
last part to the 2nd, only in a different BG.
So yes, there is fragmentation from a logical point of view; from a physical 
point of view the data is spread across the disks in any case.

In any case, you are right: we should gather some data, because the performance 
impact is not so clear.

I am not worried about having different BGs; we have problems with these only 
because we never developed tools to handle the issue properly (i.e. a daemon 
which starts a balance when needed). But I hope this will be solved in the 
future.

In any case, all the proposed solutions have their trade-offs:

- a) as is: write hole bug
- b) variable stripe size (like ZFS): big impact on how btrfs handles extents; 
limited waste of space
- c) logging data before writing: we write the data twice in a short time 
window. Moreover, the log area is written several orders of magnitude more 
often than the other areas; there were some patches around
- d) rounding the write up to the stripe size: waste of space; simple to 
implement
- e) different BGs with different stripe sizes: limited waste of space; logical 
fragmentation.


* c), d), e) are applied only to the tail of the extent, in case its size is 
less than the stripe size.
* for b), d), e), the waste of space may be reduced with a balance 

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-04-02 Thread Goffredo Baroncelli
On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
[...]
> It is possible to combine writes from a single transaction into full
> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> Any partially-filled stripe is effectively read-only and the space within
> it is inaccessible until all data within the stripe is overwritten,
> deleted, or relocated by balance.
>
> btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> update, but that has a significant write magnification effect (and before
> kernel 4.14, non-trivial CPU load as well).
> 
> btrfs could also just allocate the full stripe to an extent, but emit
> only extent ref items for the blocks that are in use.  No fragmentation
> but lots of extra disk space used.  Also doesn't quite work the same
> way for metadata pages.
> 
> If btrfs adopted the ZFS approach, the extent allocator and all higher
> layers of the filesystem would have to know about--and skip over--the
> parity blocks embedded inside extents.  Making this change would mean
> that some btrfs RAID profiles start interacting with stuff like balance
> and compression which they currently do not.  It would create a new
> block group type and require an incompatible on-disk format change for
> both reads and writes.

I thought that a possible solution is to create BGs with different numbers of 
data disks. E.g., supposing a RAID6 system with 6 disks, where 2 are parity 
disks, we could allocate 3 BGs:

BG #1: 1 data disk, 2 parity disks
BG #2: 2 data disks, 2 parity disks,
BG #3: 4 data disks, 2 parity disks

For simplicity, the disk-stripe length is assumed = 4K.

So if you have a write with a length of 4KB, it should be placed in BG#1; if 
you have a write with a length of 3*4KB, the first 8KB should be placed in 
BG#2 and the remaining 4KB in BG#1.
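A minimal sketch of that placement rule, under the 4K disk-stripe assumption above; the greedy 4/2/1 split is my reading of the proposal, and all names are mine:

```c
/* Greedy placement of an n-block write (4K blocks) across the
 * proposed BGs.  out[0..2] receive the number of blocks sent to
 * BG#3 (4 data disks), BG#2 (2 data disks) and BG#1 (1 data disk). */
static void place_write(int nblocks, int out[3])
{
	out[0] = out[1] = out[2] = 0;
	while (nblocks >= 4) { out[0] += 4; nblocks -= 4; } /* BG#3 */
	if (nblocks >= 2)    { out[1] += 2; nblocks -= 2; } /* BG#2 */
	if (nblocks == 1)    { out[2] += 1; nblocks -= 1; } /* BG#1 */
}
```

For a 12KB (3-block) write this yields 8KB in BG#2 and 4KB in BG#1, as in the example; no stripe is ever partially filled, so no RMW cycle is needed.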

This would avoid wasting space, even if fragmentation would increase (but does 
fragmentation still matter with modern solid-state disks?).

From time to time, a rebalance should be performed to empty BG#1 and BG#2; 
otherwise a new BG should be allocated.

The cost should be comparable to logging/journaling (each write shorter than a 
full stripe has to be written twice); the implementation should be quite easy, 
because btrfs already supports BGs spanning different sets of disks.

BR 
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-03-31 Thread Goffredo Baroncelli
On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
>> The key is that if a data write is interrupted, all the transaction
>> is interrupted and aborted. And due to the COW nature of btrfs, the
>> "old state" is restored at the next reboot.

> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> RMW operations which are not COW and don't provide any data integrity
> guarantee.  Old data (i.e. data from very old transactions that are not
> part of the currently written transaction) can be destroyed by this.

Could you elaborate a bit?

Generally speaking, updating part of a stripe requires an RMW cycle, because:
- you need to read the whole data stripe (plus parity, in case of a problem)
- then you have to write:
	- the new data
	- the new parity (calculated on the basis of the first read and the 
new data)

However, the "old" data should be untouched; or are you saying that the "old" 
data is rewritten with the same data?
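The parity-recompute step of that cycle can be sketched as plain XOR over the data strips (this is the RAID5-style P strip only; RAID6 adds a second, Reed-Solomon Q strip). The function name is mine and this is illustrative, not btrfs code:

```c
#include <stddef.h>
#include <stdint.h>

/* Recompute the XOR parity strip from all data strips of one stripe,
 * as done after the read phase of the RMW cycle described above. */
static void compute_parity(const uint8_t *const data[], int ndata,
			   uint8_t *parity, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t p = 0;
		for (int d = 0; d < ndata; d++)
			p ^= data[d][i];
		parity[i] = p;
	}
}
```

Note that because XOR is self-inverse, new_P = old_P ^ old_data ^ new_data also holds, so for the P strip it is enough to read only the strip being modified plus the old parity.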

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-03-31 Thread Goffredo Baroncelli
On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
>>> is always a full-device operation.  In theory btrfs could track
>>> modifications at the chunk level but this isn't even specified in the
>>> on-disk format, much less implemented.
>> It could go even further; it would be sufficient to track which
>> *partial* stripes update will be performed before a commit, in one
>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
>> a scrub on these stripes would be sufficient.

> A scrub cannot fix a raid56 write hole--the data is already lost.
> The damaged stripe updates must be replayed from the log.

Your statement is correct, but it doesn't consider the COW nature of btrfs.

The key is that if a data write is interrupted, all the transaction is 
interrupted and aborted. And due to the COW nature of btrfs, the "old state" is 
restored at the next reboot.

What is needed in any case is a rebuild of the parity to avoid the "write-hole" 
bug, and this is needed only for a partial stripe write. For a full stripe 
write, since the commit is not flushed, no scrub is needed at all.

Of course for NODATACOW files this is not entirely true, but I don't see the 
gain in trading the cost of COW for the cost of a log.

The above sentences are correct (IMHO) if we don't consider the 
power-failure-plus-missing-device case. However, in that case even logging the 
"new data" would not be sufficient.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC PATCH v3 0/7] btrfs-progs: Allow normal user to call "subvolume list/show"

2018-03-30 Thread Goffredo Baroncelli
On 03/30/2018 06:46 AM, Misono Tomohiro wrote:
> On 2018/03/30 2:35, Goffredo Baroncelli wrote:
>> Hi Misono,
> 
> Hello,
[...]
>> My conclusion, is that your work is not consistent with the current 
>> implementation; so I am suggesting to create a new subcommand ("btrfs sub 
>> tree" ?) where you can use, without constraint, your api.
> 
> I agree that the current output of "sub list" is confusing first of all and
> it is not easy to keep consistent behavior between user and root.
> 
> So, I think "sub list" (or new command) should be:
>   - (default) list the subvolumes under the specified path, including 
> subvolumes mounted below the specified path.
>   Any user can do this (with appropriate permission checks).
>   The path to be printed is the relative from the specified path.
>   - (-a option) list all the subvolmumes in the filesystem. Only root can do 
> this.
>   The path to be printed is the absolute path from the toplevel 
> subvolume.
> 
> This would need some changes in progs code, but I think we can use the same 
> new ioctls I proposed[1] for the default behavior.



> 
> I'd like to update "sub list" command eventually rather than adding new 
> command, even though this breaks the compatibility.
> I want to know what other developer/user think.
> 
> [1] https://www.spinics.net/lists/linux-btrfs/msg76003.html
> 
>>
>> Another small differences that I found is in the subcommand "btrfs sub 
>> show". This is not capable to show the snapshots of a subvolume; I think 
>> that, in "user mode", it should be reported that there is a "missing 
>> capabilities" problem than to show an empty list
> 
> Yes, this is because that only the snapshots *under the specified path* are 
> searched.
> This can be easily changed to search under the mount point, but this also 
> results in
> slight change in print format of the path for root. I tried to keep as much 
> consistency in this version.

When I take a snapshot, I think of the new snapshot more as a sibling than a 
child, so I would put it at the same level instead of below the source 
subvolume. Moreover, the path printed on the first line seems to be relative 
to the root of the filesystem.

> Thanks,
> Tomohiro Misono
> 
>>
>> Below an example of what I am say:
>>
>> ghigo@venice:/tmp$ btrfs sub cre a
>> ghigo@venice:/tmp$ btrfs sub snap a c
>>
>> ghigo@venice:/tmp$ # patched btrfs
>> ghigo@venice:/tmp$ /home/ghigo/btrfs/btrfs-progs/btrfs sub show a
>> tmp/a
>>  Name:   a
>>  UUID:   0c19a7a4-a7f5-1849-9262-88ab3da504d0
>>  Parent UUID:-
>>  Received UUID:  -
>>  Creation time:  2018-03-28 22:48:09 +0200
>>  Subvolume ID:   598
>>  Generation: 696908
>>  Gen at creation:696907
>>  Parent ID:  257
>>  Top level ID:   257
>>  Flags:  -
>>  Snapshot(s):
>> 
>>
>> ghigo@venice:/tmp$ # stock btrfs
>> ghigo@venice:/tmp$ sudo /home/ghigo/btrfs/btrfs-progs/btrfs sub show a
>>   ^^
>> tmp/a
>>  Name:   a
>>  UUID:   0c19a7a4-a7f5-1849-9262-88ab3da504d0
>>  Parent UUID:-
>>  Received UUID:  -
>>  Creation time:  2018-03-28 22:48:09 +0200
>>  Subvolume ID:   598
>>  Generation: 696908
>>  Gen at creation:696907
>>  Parent ID:  257
>>  Top level ID:   257
>>  Flags:  -
>>  Snapshot(s):
>>  debian/tmp/c
>> ^
>>
>>
>> BR
>> G.Baroncelli
>>
>>
>> -
>> Test performed:
>>
>> ghigo@venice:/tmp$ sudo btrfs sub crea a
>> ghigo@venice:/tmp$ sudo btrfs sub crea a/a1
>> ghigo@venice:/tmp$ sudo btrfs sub crea a/a1/a2
>> ghigo@venice:/tmp$ sudo btrfs sub crea b
>> ghigo@venice:/tmp$ sudo btrfs sub crea b/b1
>> ghigo@venice:/tmp$ sudo chmod og-rwx b/.
>>
>>
>> # stock btrfs progs
>> ghigo@venice:/tmp$ sudo btrfs sub l .
>> ID 257 gen 696886 top level 5 path debian
>> ID 289 gen 587461 top level 257 path var/lib/machines
>> ID 299 gen 693561 top level 5 path boot
>> ID 582 gen 683965 top level 5 path i386
>> ID 592 gen 6968

Re: [RFC PATCH v3 0/7] btrfs-progs: Allow normal user to call "subvolume list/show"

2018-03-30 Thread Goffredo Baroncelli
On 03/30/2018 06:46 AM, Misono Tomohiro wrote:
> [Un]fortunately, the stock "btrfs sub list" has the capability to show all 
> the subvolumes, even the ones not mounted, so this can be entirely replaced 
> by your API.

s/can be entirely/can't be entirely/

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-03-30 Thread Goffredo Baroncelli
On 03/29/2018 11:50 PM, Zygo Blaxell wrote:
> On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
>> Hey.
>>
>> Some things would IMO be nice to get done/clarified (i.e. documented in
>> the Wiki and manpages) from users'/admin's  POV:
[...]
> 
>>   - changing raid lvls?
> 
> btrfs uses a brute-force RAID conversion algorithm which always works, but
> takes zero short cuts.  e.g. there is no speed optimization implemented
> for cases like "convert 2-disk raid1 to 1-disk single" which can be
> very fast in theory.  The worst-case running time is the only running
> time available in btrfs.

[...]
What Zygo reports is an excellent source of information. However, I have to 
point out that BTRFS has a small optimization: scrub/balance only works on 
allocated chunks, so a partially filled filesystem requires less time than a 
nearly full one.

> 
> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> is always a full-device operation.  In theory btrfs could track
> modifications at the chunk level but this isn't even specified in the
> on-disk format, much less implemented.

It could go even further: it would be sufficient to track, in one of the btrfs 
logs, which *partial* stripe updates will be performed before a commit. Then, 
when mounting an unclean filesystem, a scrub on those stripes would be 
sufficient.
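The idea amounts to a write-intent record at stripe granularity. A toy sketch (the fixed-size bitmap and all names are mine, purely illustrative of the concept, not of any on-disk format):

```c
#include <stdbool.h>
#include <stdint.h>

/* Before a commit, mark each stripe that a partial write will touch;
 * after mounting an unclean filesystem, only the marked stripes need
 * a scrub/parity rebuild instead of a full-device operation. */
#define MAX_STRIPES 1024u

struct stripe_log {
	uint8_t dirty[MAX_STRIPES / 8];
};

static void log_init(struct stripe_log *log)
{
	for (unsigned i = 0; i < MAX_STRIPES / 8; i++)
		log->dirty[i] = 0;
}

static void log_partial_write(struct stripe_log *log, uint32_t stripe)
{
	log->dirty[stripe / 8] |= (uint8_t)(1u << (stripe % 8));
}

static bool needs_scrub(const struct stripe_log *log, uint32_t stripe)
{
	return log->dirty[stripe / 8] & (1u << (stripe % 8));
}
```

This is essentially the md write-intent bitmap idea, restricted to partial-stripe updates; the log itself would have to be persisted before the stripe writes for recovery to work.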

BR
G.Baroncelli

[...]


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC PATCH v3 0/7] btrfs-progs: Allow normal user to call "subvolume list/show"

2018-03-29 Thread Goffredo Baroncelli
handling
> v1 -> v2
>   - add independent error handling patch (1st patch)
>   - reimplement according to ioctl change
>   - various cleanup
> ===
> 
> This RFC implements user version of "subvolume list/show" using three new 
> ioctls.
> The ioctl patch to the kernel can be found in the ML titled 
>   "[PATCH v3 0/3] btrfs: Add three new unprivileged ioctls to allow normal 
> users to call "sub list/show" etc.
> 
> 1th patch is independent and improvements of error handling
> 2nd-4th are some prepartion works.
> 5th patch is the main part.
> 6th-7th adds the test for "subvolume list"
> 
> The main behavior differences between root and normal users are:
> 
> - "sub list" list the subvolumes which exist under the specified path 
> (including the path itself). The specified path itself is not needed to be
>  a subvolume. Also If the subvolume cannot be opend but the parent
> directory can be, the information other than name or id would be zeroed out.
> 
> - snapshot filed of "subvolume show" just lists
> the snapshots under the specified subvolume.
> 
> 
> This is a part of RFC I sent last December[1] whose aim is to improve normal 
> users' usability.
> The remaining works of RFC are: 
>   - Allow "sub delete" for empty subvolume
>   - Allow "qgroup show" to check quota limit
> 
> [1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg70991.html
> 
> 
> Tomohiro Misono (7):
>   btrfs-progs: sub list: Call rb_free_nodes() in error path
>   btrfs-progs: ioctl: Add 3 definitions of new unprivileged ioctl
>   btrfs-progs: sub list: Pass specified path down to
> btrfs_list_subvols()
>   btrfs-progs: fallback to open without O_NOATIME flag in
> find_mount_root()
>   btrfs-progs: sub list: Allow normal user to call "subvolume list/show"
>   btrfs-progs: test: Add helper function to check if test user exists
>   btrfs-porgs: test: Add cli-test/009 to check subvolume list for both
> root and normal user
> 
>  btrfs-list.c   | 376 
> +++--
>  btrfs-list.h   |   7 +-
>  cmds-subvolume.c   |  14 +-
>  ioctl.h    |  86 +++
>  tests/cli-tests/009-subvolume-list/test.sh | 136 +++
>  tests/common   |  10 +
>  utils.c|  13 +-
>  7 files changed, 609 insertions(+), 33 deletions(-)
>  create mode 100755 tests/cli-tests/009-subvolume-list/test.sh
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



Re: [PATCH v2 0/3] Allow rmdir(2) to delete a subvolume

2018-03-26 Thread Goffredo Baroncelli
Hi Misono,

On 03/26/2018 10:28 AM, Misono Tomohiro wrote:
> changelog:
>   v1 -> v2 ... split the patch to hopefully make review easier
> 
> 1st patch is a preparation work just moving the declaration of
> may_destroy_subvol().
> 
> 2nd patch is the main part. New function btrfs_delete_subvolume() is
> introduced and used in btrfs_rmdir() when a direcoty is an empty
> subvolume. The function is almost the copy of second half of
> btrfs_ioctl_snap_destroy().
> The code path for "sub delete" is not changed yet.
> 
> 3rd patch is a cleanup of btrfs_ioctl_snap_destroy() and uses 
> brrfs_delete_subvolume() for "sub delete" too.
> 
> Tomohiro Misono (3):
>   btrfs: move may_destroy_subvol() from ioctl.c to inode.c
>   btrfs: Allow rmdir(2) to delete a subvolume
>   btrfs: cleanup btrfs_ioctl_snap_destroy() by using btrfs_delete_subvolume()
> 
>  fs/btrfs/ctree.h |   5 +-
>  fs/btrfs/inode.c | 199 
> ++-
>  fs/btrfs/ioctl.c | 185 +--
>  3 files changed, 200 insertions(+), 189 deletions(-)
> 

I checked this patch, and it works. You can add
Tested-by: Goffredo Baroncelli <kreij...@inwind.it>

-

ghigo@venice:/tmp$ btrfs sub cre a
Create subvolume './a'
ghigo@venice:/tmp$ btrfs sub cre a/b
Create subvolume 'a/b'
ghigo@venice:/tmp$ btrfs sub cre a/b/d
Create subvolume 'a/b/d'
ghigo@venice:/tmp$ touch a/b/d/c
ghigo@venice:/tmp$ rm -rf a
ghigo@venice:/tmp$ ls -la
total 4
drwxrwxrwt 1 root  root  492 Mar 26 21:18 .
drwxr-xr-x 1 root  root  196 Mar 14 19:41 ..
drwxrwxrwt 1 root  root0 Mar 26 21:17 .font-unix
drwxrwxrwt 1 root  root8 Mar 26 21:18 .ICE-unix
drwxrwxrwt 1 root  root0 Mar 26 21:17 .Test-unix
-r--r--r-- 1 root  root   11 Mar 26 21:17 .X0-lock
drwxrwxrwt 1 root  root4 Mar 26 21:17 .X11-unix
drwxrwxrwt 1 root  root0 Mar 26 21:17 .XIM-unix



ghigo@venice:/tmp$ btrfs sub crea a
Create subvolume './a'
ghigo@venice:/tmp$ touch a/b
ghigo@venice:/tmp$ rmdir a
rmdir: failed to remove 'a': Directory not empty
ghigo@venice:/tmp$ rm a/b 
ghigo@venice:/tmp$ rmdir a



ghigo@venice:/tmp$ btrfs sub cre d
Create subvolume './d'
ghigo@venice:/tmp$ su - pippo
Password: 
pippo@venice:~$ cd /tmp/
pippo@venice:/tmp$ rmdir d
rmdir: failed to remove 'd': Operation not permitted
pippo@venice:/tmp$ sudo chown pippo d
[sudo] password for pippo: 
pippo@venice:/tmp$ rmdir d

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] btrfs: Allow non-privileged user to delete empty subvolume by default

2018-03-22 Thread Goffredo Baroncelli
On 03/22/2018 01:15 PM, Austin S. Hemmelgarn wrote:
> On 2018-03-21 16:38, Goffredo Baroncelli wrote:
>> On 03/21/2018 12:47 PM, Austin S. Hemmelgarn wrote:
>>> I agree as well, with the addendum that I'd love to see a new ioctl that 
>>> does proper permissions checks.  While letting rmdir(2) work for an empty 
>>> subvolume with the appropriate permissions would be great (it will let rm 
>>> -r work correctly), it doesn't address the usefulness of being able to just 
>>> `btrfs subvolume delete` and not have to wait for the command to finish 
>>> before you can reuse the name.
>>
>> How this could work ?
>>
>> If you want to check all the subvolumes files permissions, this will require 
>> some time: you need to traverse all the subvolume-filesystem; and only if 
>> all the checks are passed, you can delete the subvolume.
>>
>> Unfortunately I think that only two options exist:
>> - don't check permissions, and you can quick remove a subvolume
>> - check all the permissions, i.e. check all the files permissions, and only 
>> if all the permissions are OK, you can delete the subvolume. However this 
>> cannot be a "quick" subvolume delete
> 
> Why exactly would you need to check everything?  What I'm talking about is 
> having behavior like `user_subvol_rm_allowed` be the default, with an 
> additional check emulating the regular dentry removal check (namely that the 
> user has appropriate permissions on the parent directory) so that people 
> can't delete things like their own home directories.  We're already _way_ 
> beyond POSIX semantics here because we're debating the handling of 
> permissions for an ioctl that takes a different fd than what it functionally 
> operates on, so I see no reason whatsoever that we need to enforce POSIX 
> semantics to that degree.
> 

Why doesn't "user_subvol_rm_allowed" satisfy your needs? Does it not perform 
enough permission checking?

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] btrfs: Allow non-privileged user to delete empty subvolume by default

2018-03-21 Thread Goffredo Baroncelli
On 03/21/2018 12:47 PM, Austin S. Hemmelgarn wrote:
> I agree as well, with the addendum that I'd love to see a new ioctl that does 
> proper permissions checks.  While letting rmdir(2) work for an empty 
> subvolume with the appropriate permissions would be great (it will let rm -r 
> work correctly), it doesn't address the usefulness of being able to just 
> `btrfs subvolume delete` and not have to wait for the command to finish 
> before you can reuse the name.

How could this work?

If you want to check the permissions of all the files in a subvolume, this will 
take some time: you need to traverse the whole subvolume filesystem, and only if 
all the checks pass can you delete the subvolume.

Unfortunately I think that only two options exist:
- don't check permissions, so you can quickly remove a subvolume
- check all the permissions, i.e. check the permissions of every file, and only 
if all of them are OK delete the subvolume. However this cannot be a "quick" 
subvolume delete

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] btrfs: Allow non-privileged user to delete empty subvolume by default

2018-03-20 Thread Goffredo Baroncelli
On 03/20/2018 07:45 AM, Misono, Tomohiro wrote:
> Deletion of subvolume by non-privileged user is completely restricted
> by default because we can delete a subvolume even if it is not empty
> and may cause data loss. In other words, when user_subvol_rm_allowed
> mount option is used, a user can delete a subvolume containing the
> directory which cannot be deleted directly by the user.
> 
> However, there should be no harm to allow users to delete empty subvolumes
> when rmdir(2) would have been allowed if they were normal directories.
> This patch allows deletion of empty subvolume by default.

Instead of modifying the ioctl, what about allowing rmdir(2) to work on an 
_empty_ subvolume (when all the permission checks are satisfied)?



> 
> Note that user_subvol_rm_allowed option requires write+exec permission
> of the subvolume to be deleted, but they are not required for empty
> subvolume.
> 
> The comment in the code is also updated accordingly.
> 
> Signed-off-by: Tomohiro Misono <misono.tomoh...@jp.fujitsu.com>
> ---
>  fs/btrfs/ioctl.c | 55 +++
>  1 file changed, 31 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 111ee282b777..838406a7a7f5 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2366,36 +2366,43 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
> file *file,
>   dest = BTRFS_I(inode)->root;
>   if (!capable(CAP_SYS_ADMIN)) {
>   /*
> -  * Regular user.  Only allow this with a special mount
> -  * option, when the user has write+exec access to the
> -  * subvol root, and when rmdir(2) would have been
> -  * allowed.
> +  * By default, regular user is only allowed to delete
> +  * empty subvols when rmdir(2) would have been allowed
> +  * if they were normal directories.
>*
> -  * Note that this is _not_ check that the subvol is
> -  * empty or doesn't contain data that we wouldn't
> +  * If the mount option 'user_subvol_rm_allowed' is set,
> +  * it allows users to delete non-empty subvols when the
> +  * user has write+exec access to the subvol root and when
> +  * rmdir(2) would have been allowed (except the emptiness
> +  * check).
> +  *
> +  * Note that this option does _not_ check that if the subvol
> +  * is empty or doesn't contain data that the user wouldn't
>* otherwise be able to delete.
>*
> -  * Users who want to delete empty subvols should try
> -  * rmdir(2).
> +  * Users who want to delete empty subvols created by
> +  * snapshot (ino number == 2) can use rmdir(2).
>*/
> - err = -EPERM;
> - if (!btrfs_test_opt(fs_info, USER_SUBVOL_RM_ALLOWED))
> - goto out_dput;
> + err = -ENOTEMPTY;
> + if (inode->i_size != BTRFS_EMPTY_DIR_SIZE) {
> + if (!btrfs_test_opt(fs_info, USER_SUBVOL_RM_ALLOWED))
> + goto out_dput;
>  
> - /*
> -  * Do not allow deletion if the parent dir is the same
> -  * as the dir to be deleted.  That means the ioctl
> -  * must be called on the dentry referencing the root
> -  * of the subvol, not a random directory contained
> -  * within it.
> -  */
> - err = -EINVAL;
> - if (root == dest)
> - goto out_dput;
> + /*
> +  * Do not allow deletion if the parent dir is the same
> +  * as the dir to be deleted.  That means the ioctl
> +  * must be called on the dentry referencing the root
> +  * of the subvol, not a random directory contained
> +  * within it.
> +  */
> + err = -EINVAL;
> + if (root == dest)
> + goto out_dput;
>  
> - err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
> - if (err)
> - goto out_dput;
> +     err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
> + if (err)
> + goto out_dput;
> + }
>   }
>  
>   /* check if subvolume may be deleted by a user */
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Crashes running btrfs scrub

2018-03-20 Thread Goffredo Baroncelli
On 03/19/2018 07:06 PM, Liu Bo wrote:
[...]
> 
> So Mike's case is, that both metadata and data are configured as
> raid6, and the operations, balance and scrub, that he tried, need to
> set the existing block group as readonly (in order to avoid any
> further changes being applied during operations are running), then we
> got into the place where another system chunk is needed.
> 
> However, I think it'd be better to have some warnings about this when
> doing a) mkfs.btrfs -mraid6, b) btrfs device add.

What about a warning in the kernel dmesg during the mount?

> 
> David, any idea?
> 
> thanks,
> liubo
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC PATCH v3 5/7] btrfs-progs: sub list: Allow normal user to call "subvolume list/show"

2018-03-19 Thread Goffredo Baroncelli
   return ret;
> + }
> + /*
> +  * now we have an rbtree full of root_info objects, but we
> +  * need to fill in their path names within the subvol that
> +  * is referencing each one.
> +  */
> + ret = list_subvol_fill_paths(fd, root_lookup);
> + } else {
> + ret = list_subvol_user(fd, root_lookup, path);
> + if (ret) {
> + if (ret == -ENOTTY)
> +error("can't perform the search: Operation by non-privileged user is not 
> supported by this kernel.");
> + else
> + error("can't perform the search: %s",
> + strerror(-ret));
> + }

Sorry for noticing this only now: what is the reason for not using the new API 
also in the "root" case? I know that the behavior is a bit different, so my 
question is: would it be possible to extend the new ioctls to also support the 
old behavior in the root case? Then we could get rid of the "tree search" ioctl, 
using it only for old kernels.

>   }
>  
> - /*
> -  * now we have an rbtree full of root_info objects, but we need to fill
> -  * in their path names within the subvol that is referencing each one.
> -  */
> - ret = list_subvol_fill_paths(fd, root_lookup);
>   return ret;
>  }
>  
> @@ -1598,12 +1920,12 @@ int btrfs_get_subvol(int fd, struct root_info 
> *the_ri, const char *path)
>   return ret;
>   }
>  
> + ret = -ENOENT;
>   rbn = rb_first();
>   while(rbn) {
>   ri = rb_entry(rbn, struct root_info, rb_node);
>   rr = resolve_root(, ri, root_id);
> - if (rr == -ENOENT) {
> - ret = -ENOENT;
> + if (rr == -ENOENT || uuid_is_null(ri->uuid)) {
>   rbn = rb_next(rbn);
>   continue;
>   }
> diff --git a/cmds-subvolume.c b/cmds-subvolume.c
> index faa10c5a..68de7db7 100644
> --- a/cmds-subvolume.c
> +++ b/cmds-subvolume.c
> @@ -488,6 +488,7 @@ static int cmd_subvol_list(int argc, char **argv)
>   int is_only_in_path = 0;
>   DIR *dirstream = NULL;
>   enum btrfs_list_layout layout = BTRFS_LIST_LAYOUT_DEFAULT;
> + uid_t uid;
>  
>   filter_set = btrfs_list_alloc_filter_set();
>   comparer_set = btrfs_list_alloc_comparer_set();
> @@ -596,6 +597,13 @@ static int cmd_subvol_list(int argc, char **argv)
>   goto out;
>   }
>  
> + uid = geteuid();
> + if (uid != 0 && is_list_all) {
> + ret = -1;
> + error("only root can use -a option");
> + goto out;
> + }
> +
>   if (flags)
>   btrfs_list_setup_filter(_set, BTRFS_LIST_FILTER_FLAGS,
>   flags);
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Crashes running btrfs scrub

2018-03-18 Thread Goffredo Baroncelli
On 03/18/2018 08:57 AM, Goffredo Baroncelli wrote:
>   BTRFS_SYSTEM_CHUNK_ARRAY_SIZE = 2048
>   sizeof(struct btrfs_chunk)) = 48
>   sizeof(struct btrfs_stripe) = 32
> 
> So
> 
>   (2048/2-48)/32+1 = 31
> 
> If my math is correct

my math was wrong:

sizeof(struct btrfs_chunk)=80

so the maximum number of disk is: (2048/2-80)/32+1 = 30

Does it make sense to put some warning in the btrfs kernel module, and in some 
btrfs commands like:
- btrfs fi show
- btrfs dev add

and in the wiki? 30 devices is a big number, however not unreachable...


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Crashes running btrfs scrub

2018-03-18 Thread Goffredo Baroncelli
On 03/18/2018 07:41 AM, Liu Bo wrote:
> ((BTRFS_SYSTEM_CHUNK_ARRAY_SIZE / 2) - sizeof(struct btrfs_chunk)) /
> sizeof(struct btrfs_stripe) + 1

BTRFS_SYSTEM_CHUNK_ARRAY_SIZE = 2048
sizeof(struct btrfs_chunk)) = 48
sizeof(struct btrfs_stripe) = 32

So

(2048/2-48)/32+1 = 31

If my math is correct
gb
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC PATCH v2 6/8] btrfs-progs: sub list: Allow normal user to call "subvolume list/show"

2018-03-17 Thread Goffredo Baroncelli
On 03/15/2018 09:15 AM, Misono, Tomohiro wrote:
> Allow normal user to call "subvolume list/show" by using 3 new
> unprivileged ioctls (BTRFS_IOC_GET_SUBVOL_INFO,
> BTRFS_IOC_GET_SUBVOL_ROOTREF and BTRFS_IOC_INO_LOOKUP_USER).
> 
> Note that for root, "subvolume list" returns all the subvolume in the
> filesystem by default, but for normal user, it returns subvolumes
> which exist under the specified path (including the path itself).

I found the original "btrfs sub list" behavior quite confusing. I think the 
problem is that the output is too technical, and the '-a' switch increases this 
confusion. Maybe I am just not smart enough :(

The "normal user" behavior seems more coherent to me. However I am not sure 
that this difference is acceptable. In any case it should be documented in the 
man page.

Is it time to add another command (something like "btrfs sub ls") with a more 
"human friendly" output?

> The specified path itself is not needed to be a subvolume.
> If the subvolume cannot be opened but the parent directory can be,
> the information other than name or id would be zeroed out.
> 
> Also, for normal user, snapshot filed of "subvolume show" just lists
> the snapshots under the specified subvolume.
> 
> Signed-off-by: Tomohiro Misono <misono.tomoh...@jp.fujitsu.com>
> ---
>  btrfs-list.c | 326 
> +--
>  cmds-subvolume.c |  13 +++
>  2 files changed, 332 insertions(+), 7 deletions(-)
> 

[]

>  static void print_subvolume_column(struct root_info *subv,
>  enum btrfs_list_column_enum column)
>  {
> @@ -1527,17 +1826,28 @@ static int btrfs_list_subvols(int fd, struct 
> root_lookup *root_lookup,
>  {
>   int ret;
>  
> - ret = list_subvol_search(fd, root_lookup);
> - if (ret) {
> - error("can't perform the search: %m");
> - return ret;
> + ret = check_perm_for_tree_search(fd);
> + if (ret < 0) {
> + error("can't check the permission for tree search: %s",
> + strerror(-ret));
> + return -1;
>   }
>  
>   /*
>* now we have an rbtree full of root_info objects, but we need to fill
>* in their path names within the subvol that is referencing each one.
>*/
> - ret = list_subvol_fill_paths(fd, root_lookup);
> + if (ret) {
> + ret = list_subvol_search(fd, root_lookup);
> + if (ret) {
> + error("can't perform the search: %s", strerror(-ret));
> + return ret;
> + }
> + ret = list_subvol_fill_paths(fd, root_lookup);
> + } else {
> + ret = list_subvol_user(fd, root_lookup, path);
> + }
> +
>   return ret;

I think that the check above should be refined: if I run a patched "btrfs sub 
list" on a kernel without the patch, I don't get any error and/or warning:

ghigo@venice:~/btrfs/btrfs-progs$ # kernel w/o patch; btrfs-progs w/patch
ghigo@venice:~/btrfs/btrfs-progs$ ./btrfs sub list /
ghigo@venice:~/btrfs/btrfs-progs$ # kernel w/o patch; btrfs-progs w/o patch
ghigo@venice:~/btrfs/btrfs-progs$ btrfs sub list /
ERROR: can't perform the search: Operation not permitted

I think that in both cases an error should be raised.


>  }
>  
> @@ -1631,12 +1941,14 @@ int btrfs_get_subvol(int fd, struct root_info 
> *the_ri, const char *path)
>   return ret;
>   }
>  
> + ret = -ENOENT;
>   rbn = rb_first();
>   while(rbn) {
>   ri = rb_entry(rbn, struct root_info, rb_node);
>   rr = resolve_root(, ri, root_id);
> - if (rr == -ENOENT) {
> - ret = -ENOENT;
> + if (rr == -ENOENT ||
> + ri->root_id == BTRFS_FS_TREE_OBJECTID ||
> + uuid_is_null(ri->uuid)) {
>   rbn = rb_next(rbn);
>   continue;
>   }
> diff --git a/cmds-subvolume.c b/cmds-subvolume.c
> index faa10c5a..7a7c6f3b 100644
> --- a/cmds-subvolume.c
> +++ b/cmds-subvolume.c
> @@ -596,6 +596,19 @@ static int cmd_subvol_list(int argc, char **argv)
>   goto out;
>   }
>  
> + ret = check_perm_for_tree_search(fd);
> + if (ret < 0) {
> + ret = -1;
> + error("can't check the permission for tree search: %s",
> +     strerror(-ret));
> + goto out;
> + }
> + if (!ret && is_list_all) {
> + ret = -1;
> + error("only

Re: [RFC PATCH v2 3/8] btrfs-progs: sub list: Add helper function which checks the permission for tree search ioctl

2018-03-17 Thread Goffredo Baroncelli
On 03/15/2018 09:13 AM, Misono, Tomohiro wrote:
> This is a preparetion work to allow normal user to call
> "subvolume list/show".
> 
> Signed-off-by: Tomohiro Misono <misono.tomoh...@jp.fujitsu.com>
> ---
>  btrfs-list.c | 33 +
>  btrfs-list.h |  1 +
>  2 files changed, 34 insertions(+)
> 
> diff --git a/btrfs-list.c b/btrfs-list.c
> index 50e5ce5f..88689a9d 100644
> --- a/btrfs-list.c
> +++ b/btrfs-list.c
> @@ -958,6 +958,39 @@ out:
>   return 0;
>  }
>  
> +/*
> + * Check the permission for tree search ioctl by searching a key
> + * which alwasys exists
> + */
> +int check_perm_for_tree_search(int fd)
> +{

How is this different from '(getuid() == 0)'?
In which cases would an error other than EPERM be returned (other than the 
filesystem not being a BTRFS one)?

I am not against this kind of function. However, with this implementation its 
behavior is not clear:
- does it check whether the user has enough privileges to perform a 
BTRFS_IOC_TREE_SEARCH
or 
- does it check whether BTRFS_IOC_TREE_SEARCH might return some error?

If the former, I would put this function into utils.[hc], checking only the 
[e]uid of the user. If the latter, the name is confusing...

> + struct btrfs_ioctl_search_args args;
> + struct btrfs_ioctl_search_key *sk = 
> + int ret;
> +
> + memset(, 0, sizeof(args));
> + sk->tree_id = BTRFS_ROOT_TREE_OBJECTID;
> + sk->min_objectid = BTRFS_EXTENT_TREE_OBJECTID;
> + sk->max_objectid = BTRFS_EXTENT_TREE_OBJECTID;
> + sk->min_type = BTRFS_ROOT_ITEM_KEY;
> + sk->max_type = BTRFS_ROOT_ITEM_KEY;
> + sk->min_offset = 0;
> + sk->max_offset = (u64)-1;
> + sk->min_transid = 0;
> + sk->max_transid = (u64)-1;
> + sk->nr_items = 1;
> + ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, );
> +
> + if (!ret)
> + ret = 1;
> + else if (ret < 0 && errno == EPERM)
> + ret = 0;
> + else
> + ret = -errno;
> +
> + return ret;
> +}
> +
>  static int list_subvol_search(int fd, struct root_lookup *root_lookup)
>  {
>   int ret;
> diff --git a/btrfs-list.h b/btrfs-list.h
> index 6e5fc778..6225311d 100644
> --- a/btrfs-list.h
> +++ b/btrfs-list.h
> @@ -176,5 +176,6 @@ char *btrfs_list_path_for_root(int fd, u64 root);
>  int btrfs_list_get_path_rootid(int fd, u64 *treeid);
>  int btrfs_get_subvol(int fd, struct root_info *the_ri);
>  int btrfs_get_toplevel_subvol(int fd, struct root_info *the_ri);
> +int check_perm_for_tree_search(int fd);
>  
>  #endif
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



Re: Ongoing Btrfs stability issues

2018-03-14 Thread Goffredo Baroncelli
On 03/14/2018 08:27 PM, Austin S. Hemmelgarn wrote:
> On 2018-03-14 14:39, Goffredo Baroncelli wrote:
>> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
>> [...]
>>>>
>>>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In 
>>>> a conventional filesystem (or a btrfs filesystem w/o datasum) there is no 
>>>> checksum, so this problem doesn't exist.
>>>>
>>>> I am curious how ZFS solves this problem.
>>> It doesn't support disabling COW or the O_DIRECT flag, so it just never has 
>>> the problem in the first place.
>>
>> I would like to perform some tests: however I think that you are right. if 
>> you make a "double buffering" approach (copy the data in the page cache, 
>> compute the checksum, then write the data to disk), the mismatch should not 
>> happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT 
>> is a prerequisite to the "double buffering"; alone it couldn't be 
>> sufficient; what about mmap ? Are we sure that this does a double buffering ?
> There's a whole lot of applications that would be showing some pretty serious 
> issues if checksumming didn't work correctly with mmap(), so I think it does 
> work correctly given that we don't have hordes of angry users and sysadmins 
> beating down the doors.

I tried updating a page and writing it in parallel from different threads; I 
was unable to reproduce a checksum mismatch, so it seems that mmap is safe from 
this point of view.

>>
>> I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I 
>> prefer this to the checksum mismatch bug.
> This is only reasonable if you are writing to the files.  Checksums appear to 
> be checked on O_DIRECT reads, and outside of databases and VM's, read-only 
> access accounts for a significant percentage of O_DIRECT usage, partly 
> because it is needed for AIO support (nginx for example can serve files using 
> AIO and O_DIRECT and gets a pretty serious performance boost on heavily 
> loaded systems by doing so).
> 

So O_DIRECT should be unsupported/ignored only for writing? It could be a good 
compromise...

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-14 Thread Goffredo Baroncelli
On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
[...]
>>
>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a 
>> conventional filesystem (or a btrfs filesystem w/o datasum) there is no 
>> checksum, so this problem doesn't exist.
>>
>> I am curious how ZFS solves this problem.
> It doesn't support disabling COW or the O_DIRECT flag, so it just never has 
> the problem in the first place.

I would like to perform some tests; however I think that you are right. If you 
take a "double buffering" approach (copy the data into the page cache, compute 
the checksum, then write the data to disk), the mismatch should not happen. Of 
course this is incompatible with O_DIRECT; but disabling O_DIRECT is only a 
prerequisite to the "double buffering"; alone it wouldn't be sufficient. What 
about mmap? Are we sure that it does double buffering?

I would prefer that btrfs didn't allow O_DIRECT on COW files. I prefer that to 
the checksum mismatch bug.


>>
>> However I have to point out that this problem is not solved by the COW. COW 
>> solved only the problem about an interrupted commit of the filesystem, where 
>> the data is update in place (so it is available by the user), but the 
>> metadata not.
> COW is irrelevant if you're bypassing it.  It's only enforced for metadata so 
> that you don't have to check the FS every time you mount it (because the way 
> BTRFS uses it guarantees consistency of the metadata).
>>
>>>
>>> Even if not... I should be only a problem in case of a crash during
>>> that,.. and than I'd still prefer to get the false positive than bad
>>> data.
>>
>> How you can know if it is a "bad data" or a "bad checksum" ?
> You can't directly.  Just like you can't know which copy in a two-device MD 
> RAID1 array is bad when they mismatch.
> 
> That's part of why I'm not all that fond of the idea of having checksums 
> without COW, you need to verify the data using secondary means anyway, so why 
> exactly should you waste time verifying it twice?

This is true

>>
>>>
>>> Anyway... it's not going to happen so the discussion is pointless.
>>> I think people can probably use dm-integrity (which btw: does no CoW
>>> either (IIRC) and still can provide integrity... ;-) ) to see whether
>>> their data is valid.
>>> No nice but since it won't change on btrfs, a possible alternative.
>>
>> Even in this case I am curious about dm-integrity would sole this issue.
> dm-integrity uses journaling, and actually based on the testing I've done, 
> will typically have much worse performance than the overhead of just enabling 
> COW on files on BTRFS and manually defragmenting them on a regular basis.

Good to know
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-13 Thread Goffredo Baroncelli
On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote:
> On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:
>> Unfortunately no, the likelihood might be 100%: there are some
>> patterns which trigger this problem quite easily. See The link which
>> I posted in my previous email. There was a program which creates a
>> bad checksum (in COW+DATASUM mode), and the file became unreadable.
> But that rather seems like a plain bug?!

You are right; unfortunately it seems that it is categorized as WONT-FIX :(

> No reason that would conceptually make checksumming+notdatacow
> impossible.
> 
> AFAIU, the conceptual thin would be about:
> - data is written in nodatacow
>   => thus a checksum must be written as well, so write it
> - what can then of course happen is
>   - both csum and data are written => fine
>   - csum is written but data not and then some crash => csum will show
> that => fine
>   - data is written but csum not and then some crash => csum will give
> false positive
> 
> Still better few false positives, as many unnoticed data corruptions
> and no true raid repair.

A checksum mismatch is returned as -EIO by a read() syscall. This is an event 
handled badly by most programs.
E.g. suppose that a page of a VM RAM image file has a wrong checksum. When the 
VM starts, it tries to read the page, gets -EIO and aborts. It is even possible 
that it cannot print which page is corrupted. In this case, how does the user 
understand the problem, and what could he do?


[]

> 
>> Again, you are assuming that the likelihood of having a bad checksum
>> is low. Unfortunately this is not true. There are pattern which
>> exploits this bug with a likelihood=100%.
> 
> Okay I don't understand why this would be so and wouldn't assume that
> the IO pattern can affect it heavily... but I'm not really btrfs
> expert.
> 
> My blind assumption would have been that writing an extent of data
> takes much longer to complete than writing the corresponding checksum.

The problem is the following: there is a time window between the checksum 
computation and the writing of the data to disk (which is done at a lower level 
via a DMA channel) in which, if the data is updated, the checksum will 
mismatch. This happens if we have two threads, where the first commits the data 
to disk and the second one updates the data (I think that both VMs and 
databases could behave this way).

In btrfs, a checksum mismatch causes an -EIO error during reading. In a 
conventional filesystem (or a btrfs filesystem w/o datasum) there is no 
checksum, so this problem doesn't exist. 

I am curious how ZFS solves this problem.

However I have to point out that this problem is not solved by COW. COW only 
solves the problem of an interrupted commit of the filesystem, where the data 
is updated in place (so it is visible to the user) but the metadata is not.


> 
> Even if not... I should be only a problem in case of a crash during
> that,.. and than I'd still prefer to get the false positive than bad
> data.

How can you know whether it is "bad data" or a "bad checksum"?


> 
> 
> Anyway... it's not going to happen so the discussion is pointless.
> I think people can probably use dm-integrity (which btw: does no CoW
> either (IIRC) and still can provide integrity... ;-) ) to see whether
> their data is valid.
> No nice but since it won't change on btrfs, a possible alternative.

Even in this case I am curious how dm-integrity would solve this issue.

> 
> 
> Cheers,
> Chris.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-12 Thread Goffredo Baroncelli
On 03/11/2018 11:37 PM, Christoph Anton Mitterer wrote:
> On Sun, 2018-03-11 at 18:51 +0100, Goffredo Baroncelli wrote:
>>
>> COW is needed to properly checksum the data. Otherwise is not
>> possible to ensure the coherency between data and checksum (however I
>> have to point out that BTRFS fails even in this case [*]).
>> We could rearrange this sentence, saying that: if you want checksum,
>> you need COW...
> 
> No,... not really... the meta-data is anyway always CoWed... so if you
> do checksum *and* notdatacow,..., the only thing that could possibly
> happen (in the worst case) is, that data that actually made it
> correctly to the disk is falsely determined bad, as the metadata (i.e.
> the checksums) weren't upgraded correctly.
> 
> That however is probably much less likely than the other way round,..
> i.e. bad data went to disk and would be detected with checksuming.

Unfortunately no, the likelihood might be 100%: there are some patterns which 
trigger this problem quite easily. See the link which I posted in my previous 
email: there was a program which created a bad checksum (in COW+DATASUM mode), 
and the file became unreadable.

> 
> 
> I had lots of discussions about this here on the list, and no one ever
> brought up a real argument against it... I also had an off-list
> discussion with Chris Mason who IIRC confirmed that it would actually
> work as I imagine it... with the only two problems:
> - good data possibly be marked bad because of bad checksums
> - reads giving back EIO where people would rather prefer bad data

If you cannot know whether the checksum is bad or the data is bad, the checksum 
is not useful at all!

If I read what you wrote correctly, it seems that you consider it a "minor 
issue" that the checksum is not correct. If you accept the possibility that a 
checksum might be wrong, you won't trust the checksum anymore; so the checksum 
becomes useless.
 

> (not really sure if this were really his two arguments,... I'd have to
> look it up, so don't nail me down).
> 
> 
> Long story short:
> 
> In any case, I think giving back bad data without EIO is unacceptable.
> If someone really doesn't care (e.g. because he has higher level
> checksumming and possibly even repair) he could still manually disable
> checksumming.
> 
> The little chance of having a false positive weights IMO far less that
> have very large amounts of data (DBs, VM images are our typical cases)
> completely unprotected.

Again, you are assuming that the likelihood of having a bad checksum is low. 
Unfortunately this is not true: there are patterns which exploit this bug with 
a likelihood of 100%.

> 
> And not having checksumming with notdatacow breaks any safe raid repair
> (so in that case "repair" may even overwrite good data),... which is
> IMO also unacceptable.
> And the typical use cases for nodatacow (VMs, DBs) are in turn not so
> uncommon to want RAID.
> 
> 
> I really like btrfs,... and it's not that other fs (which typically
> have no checksumming at all) would perform better here... but not
> having it for these major use case is a big disappointment for me.
> 
> 
> Cheers,
> Chris.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Ongoing Btrfs stability issues

2018-03-11 Thread Goffredo Baroncelli
On 03/10/2018 03:29 PM, Christoph Anton Mitterer wrote:
> On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
>> So for OLTP workloads you definitely want nodatacow enabled, bear in
>> mind this also disables crc checksumming, but your db engine should
>> already have such functionality implemented in it.
> 
> Unlike repeated claims made here on the list and other places... I
> woudln't now *any* DB system which actually does this per default and
> or in a way that would be comparable to filesystem lvl checksumming.
> 

I agree with you; also, nobody warns that without checksums, in the case of a 
RAID filesystem, BTRFS is no longer capable of checking whether a stripe is 
correct or not.

> 
> Look back in the archives... when I've asked several times for
> checksumming support *with* nodatacow, I evaluated the existing status
> for the big ones (postgres,mysql,sqlite,bdb)... and all of them had
> this either not enabled per default, not at all, or requiring special
> support for the program using the DB.
> 
> 
> Similar btw: no single VM image type I've evaluated back then had any
> form of checksumming integrated.
> 
> 
> Still, one of the major deficiencies (not in comparison to other fs,
> but in comparison to how it should be) of btrfs unfortunately :-(

COW is needed to properly checksum the data. Otherwise it is not possible to 
ensure the coherency between data and checksum (however, I have to point out 
that BTRFS fails even in this case [*]).
We could rearrange this sentence, saying that: if you want checksums, you need 
COW...

> 
> 
> Cheers,
> Chris.
> 

[*] https://www.spinics.net/lists/linux-btrfs/msg69185.html

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 1/2] btrfs: Add unprivileged subvolume search ioctl

2018-03-07 Thread Goffredo Baroncelli
On 03/07/2018 01:40 AM, Misono, Tomohiro wrote:
> On 2018/03/07 5:29, Goffredo Baroncelli wrote:
>> On 03/06/2018 09:30 AM, Misono, Tomohiro wrote:
>>> Add new unprivileged ioctl (BTRFS_IOC_GET_SUBVOL_INFO) which searches
>>> and returns only subvolume related item (ROOT_ITEM/ROOT_BACKREF/ROOT_REF)
>>> from root tree. The arguments of this ioctl are the same as treesearch
>>> ioctl and can be used like treesearch ioctl.
>>
>> Is it a pro ? The treesearch ioctl is tightly coupled to the btrfs internal 
>> structure, this means that if we would change the btrfs internal structure, 
>> we have to update all the clients of this api. For the treesearch it is an 
>> acceptable compromise between flexibility and speed of developing. But for a 
>> more specialized API, I think that it would make sense to provide a less 
>> coupled api to the internal structure.
> 
> Thanks for the comments.
> 
> The reason I choose the same api is just to minimize the code change in 
> btrfs-progs.
> For tree search part, it works just switching the ioctl number from 
> BTRFS_IOC_TREE_SEARCH
> to BTRFS_IOC_GET_SUBVOL_INFO in list_subvol_search()[1].
> 
> [1] https://marc.info/?l=linux-btrfs=152032537911218=2

I suggest avoiding this approach. An API is forever; we already have a 
"root-only" one which is quite unfriendly and error-prone; I think that we 
should put all our energy into making a better one. 

I think that the major weaknesses of this API are:
- it exports the data in "le" (little-endian) format (see struct btrfs_root_item 
as an example);
- it requires the user to increase the key for the next ioctl call. This could 
be done in kernel space before returning;
- this ioctl exports both the ROOT_BACKREF and ROOT_ITEM info. Why not make two 
separate (simpler) ioctls?
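The second weakness is the boilerplate every TREE_SEARCH-style client has to carry: after each call, userspace must bump the key by hand before resuming. A minimal sketch of that logic (the struct and field names here mirror, but are not, the exact UAPI):

```c
#include <stdint.h>

/* Illustrative stand-in for the (objectid, type, offset) triple of
 * struct btrfs_ioctl_search_key; not the exact UAPI layout. */
struct search_key {
	uint64_t objectid;
	uint8_t  type;
	uint64_t offset;
};

/* Advance the key just past the last item returned by the previous
 * ioctl call, so the next call resumes where the last one stopped.
 * Today every userspace client must implement this itself. */
static void key_advance(struct search_key *k)
{
	if (k->offset < UINT64_MAX) {
		k->offset++;
	} else if (k->type < UINT8_MAX) {
		k->offset = 0;
		k->type++;
	} else {
		/* carry into the objectid, wrapping the whole triple forward */
		k->offset = 0;
		k->type = 0;
		k->objectid++;
	}
}
```

Doing this increment in the kernel before returning would remove one easy-to-get-wrong step from every caller.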

>>
>> Below some comments
[...]

>>> +   if ((key->objectid == BTRFS_FS_TREE_OBJECTID ||
>>> +   (key->objectid >= BTRFS_FIRST_FREE_OBJECTID &&
>>> +key->objectid <= BTRFS_LAST_FREE_OBJECTID)) &&
>>> +   key->type >= BTRFS_ROOT_ITEM_KEY &&
>>> +   key->type <= BTRFS_ROOT_BACKREF_KEY)
>>
>> Why returning all the range [BTRFS_ROOT_ITEM_KEY...BTRFS_ROOT_BACKREF_KEY]. 
>> It would be sufficient to return only
>>
>>   +  (key->type == BTRFS_ROOT_ITEM_KEY ||
>>   +   key->type == BTRFS_ROOT_BACKREF_KEY))
> 
> Sorry, it is a mistake, I mean "key->type <= BTRFS_ROOTREF_KEY".
> Although btrfs-progs only uses BTRFS_ROOT_BACKREF_KEY, I notice libbtrfs 
> uses BTRFS_ROOT_REF_KEY instead. So, I think it is better to return both
> ROOT_BACKREF_KEY and ROOT_REF_KEY. I will fix this in v2.
> 

I was referring to the '>=' and '<=' instead of '=='. If another type is added 
in the middle, it would be returned too. I find it a bit error-prone.
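The concern can be made concrete with a toy version of the two predicates; the numeric values below are hypothetical, only their ordering matters:

```c
#include <stdbool.h>

/* Hypothetical key-type values: imagine a future key type being added
 * between the two existing ones. Only the ordering is meaningful here. */
enum { ROOT_ITEM = 132, FUTURE_KEY = 135, ROOT_BACKREF = 144 };

/* Range check, as in the patch: silently matches any type that is
 * later added between ROOT_ITEM and ROOT_BACKREF. */
static bool range_check(int type)
{
	return type >= ROOT_ITEM && type <= ROOT_BACKREF;
}

/* Explicit equality check: a new in-between type is NOT leaked. */
static bool exact_check(int type)
{
	return type == ROOT_ITEM || type == ROOT_BACKREF;
}
```

With the range check, FUTURE_KEY would slip through an unprivileged ioctl the day it is introduced; with the explicit check it would not.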

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 1/2] btrfs: Add unprivileged subvolume search ioctl

2018-03-06 Thread Goffredo Baroncelli
_no_name(info, );
> + if (IS_ERR(root)) {
> + btrfs_free_path(path);
> + return -ENOENT;
> + }
> +
> + key.objectid = sk->min_objectid;
> + key.type = sk->min_type;
> + key.offset = sk->min_offset;
> +
> + while (1) {
> + ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> + if (ret != 0) {
> + if (ret > 0)
> + ret = 0;
> + goto err;
> + }
> +
> + ret = copy_subvol_item(path, &key, sk, buf_size, ubuf,
> +  &sk_offset, &num_found);
> + btrfs_release_path(path);
> + if (ret)
> + break;
> + }
> +
> + if (ret > 0)
> + ret = 0;
> +err:
> + sk->nr_items = num_found;
> + btrfs_free_path(path);
> + return ret;
> +}
> +
> +/*
> + * Unprivileged ioctl for getting subvolume related item.
> + * The arguments are the same as btrfs_ioctl_tree_search()
> + * and can be used like tree search ioctl.
> + *
> + * Only returns ROOT_ITEM/ROOT_BACKREF/ROOT_REF key of subvolumes
> + * from root tree. Also, the name of subvolume will be omitted from
> + * ROOT_BACKREF/ROOT_REF item to prevent name leak.
> + */
> +static noinline int btrfs_ioctl_get_subvol_info(struct file *file,
> +void __user *argp)
> +{
> + struct btrfs_ioctl_search_args __user *uargs;
> + struct btrfs_ioctl_search_key sk;
> + struct inode *inode;
> + size_t buf_size;
> + int ret;
> +
> + uargs = (struct btrfs_ioctl_search_args __user *)argp;
> +
> + if (copy_from_user(&sk, &uargs->key, sizeof(sk)))
> + return -EFAULT;
> +
> + buf_size = sizeof(uargs->buf);
> +
> + inode = file_inode(file);
> + ret = search_subvol(inode, &sk, &buf_size, uargs->buf);
> +
> + /*
> +  * In the original implementation an overflow is handled by returning a
> +  * search header with a len of zero, so reset ret.
> +  */
> + if (ret == -EOVERFLOW)
> + ret = 0;
> +
> + if (ret == 0 && copy_to_user(&uargs->key, &sk, sizeof(sk)))
> + ret = -EFAULT;
> + return ret;
> +}
> +
>  static noinline int btrfs_ioctl_snap_destroy(struct file *file,
>void __user *arg)
>  {
> @@ -5663,6 +5915,8 @@ long btrfs_ioctl(struct file *file, unsigned int
>   return btrfs_ioctl_get_features(file, argp);
>   case BTRFS_IOC_SET_FEATURES:
>   return btrfs_ioctl_set_features(file, argp);
> + case BTRFS_IOC_GET_SUBVOL_INFO:
> + return btrfs_ioctl_get_subvol_info(file, argp);
>   }
>  
>   return -ENOTTY;
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index c8d99b9ca550..1e9361cf66d5 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -843,5 +843,7 @@ enum btrfs_err_code {
>  struct btrfs_ioctl_vol_args_v2)
>  #define BTRFS_IOC_LOGICAL_INO_V2 _IOWR(BTRFS_IOCTL_MAGIC, 59, \
>   struct btrfs_ioctl_logical_ino_args)
> +#define BTRFS_IOC_GET_SUBVOL_INFO _IOWR(BTRFS_IOCTL_MAGIC, 60, \
> + struct btrfs_ioctl_search_args)
>  
>  #endif /* _UAPI_LINUX_BTRFS_H */
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs

2018-02-09 Thread Goffredo Baroncelli
On 02/08/2018 08:07 PM, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
>> On 02/06/2018 12:15 AM, Liu Bo wrote:
>> [...]
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>
>> [...]
>> [...]
>>
>>> +
>>> +   /* read lock please */
>>> +   do {
>>> +   seq = read_seqbegin(&fs_info->bc_lock);
>>> +   list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +   len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +   bc->chunk_offset);
>>> +   /* chunk offset is u64 */
>>> +   if (len >= PAGE_SIZE)
>>> +   break;
>>> +   }
>>> +   } while (read_seqretry(&fs_info->bc_lock, seq));
>>
>> Using this interface, how many chunks can you list ? If I read the code 
>> correctly, only up to fill a kernel page.
>>
>> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) 
>> it is possible to list only few hundred of chunks (~200). Not more; and the 
>> last one could be even listed incomplete.
>>
> 
> That's true.
> 
>> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to 
>> list could be more than 200.
>>
>> My first suggestion is to limit the number of chunks to show to 200 (a page 
>> should be big enough to contains all these chunks offset). If the chunks 
>> number are greater, ends the list with a marker (something like '[...]\n').
>> This would solve the ambiguity about the fact that the list chunks are 
>> complete or not. Anyway you cannot list all the chunks.
>>
> 
> Good point, and I need to add one more field to each record to specify
> it's metadata or data.
> 
> So what I have in my mind is that this kind of error is rare and
> reflects bad sectors on disk, but if there are really that many
> errors, I think we need to replace the disk without hesitation.

What happens if you lose an entire disk? If so, you fill "bad_sector"


> 
>> However, my second suggestions is to ... change completely the interface. 
>> What about adding a directory in sysfs, where each entry is a chunk ?
>>
>> Something like:
>>
>> /sys/fs/btrfs/<uuid>/chunks/<chunk>/type  # data/metadata/sys
>> /sys/fs/btrfs/<uuid>/chunks/<chunk>/profile   # dup/linear
>> /sys/fs/btrfs/<uuid>/chunks/<chunk>/size  # size
>> /sys/fs/btrfs/<uuid>/chunks/<chunk>/devs  # chunk devs 
>>
>> And so on. 
>>
>> Checking "[...]/devs", it would be easy to understand whether the 
>> chunk is in "degraded" mode or not.
> 
> I'm afraid we cannot do that as it'll occupy too much memory for large
> filesystems given a typical chunk size is 1GB.
> 
>>
>> However I have to admit that I don't know how feasible is iterate over a 
>> sysfs directory which is a map of a kernel objects list.
>>
>> I think that if these interface would be good enough, we could get rid of a 
>> lot of ioctl(TREE_SEARCH) from btrfs-progs. 
>>
> 
> TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?

Likely yes. However, TREE_SEARCH is bad because it is tightly coupled to the 
filesystem structure. I.e., if for some reason BTRFS changes the on-disk layout, 
what happens to old applications which expect the previous disk format? 
And for the same reason, it is root-only (UID == 0).

Let me offer another idea: what about exporting this information via an 
ioctl? It could be extended to query all information about (all) the chunks...

> 
> thanks,
> -liubo
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs

2018-02-08 Thread Goffredo Baroncelli
On 02/06/2018 12:15 AM, Liu Bo wrote:
[...]
> One way to mitigate the data loss pain is to expose 'bad chunks',
> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
> to relocate the whole chunk and get the full raid6 protection again
> (if the relocation works).

[...]
[...]

> +
> + /* read lock please */
> + do {
> + seq = read_seqbegin(&fs_info->bc_lock);
> + list_for_each_entry(bc, &fs_info->bad_chunks, list) {
> + len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
> + bc->chunk_offset);
> + /* chunk offset is u64 */
> + if (len >= PAGE_SIZE)
> + break;
> + }
> + } while (read_seqretry(&fs_info->bc_lock, seq));

Using this interface, how many chunks can you list? If I read the code 
correctly, only enough to fill a kernel page.

If my math is correct (PAGE_SIZE=4k, a u64 could require up to 20 bytes), it 
is possible to list only a few hundred chunks (~200). Not more; and the last 
one could even be listed incomplete.

IIRC a chunk size is max 1GB; if you lose 500GB of disk, there could be more 
than 200 chunks to list.

My first suggestion is to limit the number of chunks shown to 200 (a page 
should be big enough to contain all these chunk offsets). If the number of 
chunks is greater, end the list with a marker (something like '[...]\n').
This would remove the ambiguity about whether the listed chunks are complete 
or not. Anyway, you cannot list all the chunks.
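The back-of-the-envelope math above can be checked directly: a u64 printed with "%llu" takes up to 20 digits, so with the trailing newline each record is at most 21 bytes. A small sketch mirroring the snprintf loop in the patch:

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Worst-case size of one "%llu\n" record, and how many such records
 * fit in one kernel page of the given size. */
static size_t chunks_per_page(size_t page_size)
{
	char buf[32];
	/* UINT64_MAX has 20 decimal digits; plus '\n' this is 21 bytes. */
	int rec = snprintf(buf, sizeof(buf), "%llu\n",
			   (unsigned long long)UINT64_MAX);
	return page_size / (size_t)rec;
}
```

So a 4KiB page holds at most 195 worst-case offsets, consistent with the ~200 estimate above.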

However, my second suggestion is to... completely change the interface. What 
about adding a directory in sysfs, where each entry is a chunk?

Something like:

/sys/fs/btrfs/<uuid>/chunks/<chunk>/type # data/metadata/sys
/sys/fs/btrfs/<uuid>/chunks/<chunk>/profile  # dup/linear
/sys/fs/btrfs/<uuid>/chunks/<chunk>/size # size
/sys/fs/btrfs/<uuid>/chunks/<chunk>/devs # chunk devs 

And so on. 

Checking "[...]/devs", it would be easy to understand whether the chunk 
is in "degraded" mode or not.

However, I have to admit that I don't know how feasible it is to iterate over a 
sysfs directory which maps a kernel object list.

I think that if this interface were good enough, we could get rid of a lot 
of ioctl(TREE_SEARCH) calls in btrfs-progs. 

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: degraded permanent mount option

2018-01-27 Thread Goffredo Baroncelli
ed to do most
> of the mount process (whether you continue or abort is another matter). 
> This can't be done sanely from outside the kernel.  Adding finer control
> would be reasonable ("wait and block" vs "try and return immediately") but
> that's about all.  It's be also wrong to have a different interface for
> daemon X than for humans.
> 
> Ie, the thing systemd can safely do, is to stop trying to rule everything,
> and refrain from telling the user whether he can mount something or not.
> And especially, unmounting after the user mounts manually...
> 
> 
> Meow!
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 11/26] libbtrfsutil: add btrfs_util_create_snapshot()

2018-01-27 Thread Goffredo Baroncelli
On 01/27/2018 06:45 AM, Omar Sandoval wrote:
> On Sat, Jan 27, 2018 at 01:00:58PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018-01-27, Omar Sandoval wrote:
>>> On Fri, Jan 26, 2018 at 08:31:06PM +0100, Goffredo Baroncelli wrote:
>>>> On 01/26/2018 07:40 PM, Omar Sandoval wrote:
>>>>> From: Omar Sandoval <osan...@fb.com>
[...]
>> Unfortunately, at least there is also some planned work to bring a
>> shared code base between kernel and btrfs-progs, which is also named
>> libbtrfs, inspired by libxfs.
> 
> That's right, I forgot about that. There's definitely value in having a
> distinction between btrfs_util_ (userspace interfaces) and btrfs_
> (filesystem disk format).
> 
>> And depending on the respect of view, some developer may prefer the
>> short btrfs_ prefix for libbtrfs, while other developers/users will
>> definitely prefer btrfs_ prefix for libbtrfsutil.
>>
>> What about shorted prefix like butil_ or btrutil_?
> 
> Those aren't very informative, I think sticking with btrfs_util_ is
> fine, it's not that bad to type out.

Let me try another one: what about 'ubtrfs_'...



> 
>> Thanks,
>> Qu
>>
>>>
>>> I'll wait a bit for people to bikeshed on the naming before I go and
>>> rename everything, but I'm leaning towards the shorter name and
>>> appending _fd instead of prepending f_.
>>>
>>>> 3) regarding the btrfs_util_create_snapshot() function, I think that it 
>>>> would be useful to add some more information:
>>>> a) if used recursive is NOT atomic
>>>> b) if used recursive, root capabilities are needed
>>>>
>>>> The same for the other functions: mark with a 'root required' tag all the 
>>>> functions which require the root capabilities.
>>>
>>> That's a great point, I'll document that.
>>>
>>
> 
> 
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH 11/26] libbtrfsutil: add btrfs_util_create_snapshot()

2018-01-26 Thread Goffredo Baroncelli
m btrfs_util_error btrfs_util_f2_create_snapshot(int fd, int parent_fd,
> + const char *name, int flags,
> + uint64_t *async_transid,
> + struct 
> btrfs_util_qgroup_inherit *qgroup_inherit)
> +{
> + struct btrfs_ioctl_vol_args_v2 args = {.fd = fd};
> + enum btrfs_util_error err;
> + size_t len;
> + int ret;
> +
> + if ((flags & ~BTRFS_UTIL_CREATE_SNAPSHOT_MASK) ||
> + ((flags & BTRFS_UTIL_CREATE_SNAPSHOT_READ_ONLY) &&
> +  (flags & BTRFS_UTIL_CREATE_SNAPSHOT_RECURSIVE))) {
> +     errno = EINVAL;
> + return BTRFS_UTIL_ERROR_INVALID_ARGUMENT;
> + }
> +
> + if (flags & BTRFS_UTIL_CREATE_SNAPSHOT_READ_ONLY)
> + args.flags |= BTRFS_SUBVOL_RDONLY;
> + if (async_transid)
> + args.flags |= BTRFS_SUBVOL_CREATE_ASYNC;
> + if (qgroup_inherit) {
> + args.flags |= BTRFS_SUBVOL_QGROUP_INHERIT;
> + args.qgroup_inherit = (struct btrfs_qgroup_inherit 
> *)qgroup_inherit;
> + args.size = (sizeof(*args.qgroup_inherit) +
> +  args.qgroup_inherit->num_qgroups *
> +  sizeof(args.qgroup_inherit->qgroups[0]));
> + }
> +
> + len = strlen(name);
> + if (len >= sizeof(args.name)) {
> + errno = ENAMETOOLONG;
> + return BTRFS_UTIL_ERROR_INVALID_ARGUMENT;
> + }
> + memcpy(args.name, name, len);
> + args.name[len] = '\0';
> +
> + ret = ioctl(parent_fd, BTRFS_IOC_SNAP_CREATE_V2, &args);
> + if (ret == -1)
> + return BTRFS_UTIL_ERROR_SUBVOL_CREATE_FAILED;
> +
> + if (async_transid)
> + *async_transid = args.transid;
> +
> + if (flags & BTRFS_UTIL_CREATE_SNAPSHOT_RECURSIVE) {
> + err = snapshot_subvolume_children(fd, parent_fd, name,
> +   async_transid);
> + if (err)
> + return err;
> + }
> +
> + return BTRFS_UTIL_OK;
> +}
> +
>  __attribute__((visibility("default")))
>  void btrfs_util_destroy_subvolume_iterator(struct 
> btrfs_util_subvolume_iterator *iter)
>  {
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC] Improve subvolume usability for a normal user

2017-12-05 Thread Goffredo Baroncelli
On 12/05/2017 09:17 PM, Austin S. Hemmelgarn wrote:
> On 2017-12-05 14:09, Goffredo Baroncelli wrote:
>> On 12/05/2017 07:46 PM, Graham Cobb wrote:
>>> On 05/12/17 18:01, Goffredo Baroncelli wrote:
>>>> On 12/05/2017 04:42 PM, Graham Cobb wrote:
>> []
>>>>>>> Then no impact to kernel, all complex work is done in user space.
>>>>>> Exactly how hard is it to just check ownership of the root inode of a
>>>>>> subvolume from the ioctl context?  You could just as easily push all the
>>>>>> checking out to the VFS layer by taking an open fd for the subvolume
>>>>>> root (and probably implicitly closing it) instead of taking a path, and
>>>>>> that would give you all the benefits of ACL's and whatever security
>>>>>> modules the local system is using.
>>>>>
>>>>> +1 - stop inventing new access control rules for each different action!
>>>>
>>>> A subvolume is like a directory; In all filesystems you cannot remove an 
>>>> inode if you don't have write access to the parent directory. I assume 
>>>> that this is a POSIX requirement; and if so this should be true even for 
>>>> BTRFS.
>>>> This means that in order to remove a subvolume and you are not root, you 
>>>> should check all the contained directories. It is not sufficient to check 
>>>> one inode.
>>>>
>>>> In the past I create a patch [1][2] which extended the unlink (2) syscall 
>>>> to allow removing a subvolume if:
>>>> a) the user is root  or
>>>> b.1) if the subvolume is empty and
>>>> b.2) the user has the write capability to the parent directory
>>>
>>> That is also OK, as it follows a logical and consistent model without
>>> introducing new rules, but it has two disadvantages with snapshots the
>>> user creates and then wants to delete:
>>>
>>> i) they have to change the snapshot to readwrite to remove all the
>>> contents -- how many users will know that that can even be done?
>>
>> I think that this is an orthogonal question.  However the user should use 
>> btrfs set
>>
>> Anyway I am not against to use something more specific like "btrfs sub 
>> del...", where it could be possible to implement a more specific checks. I 
>> am only highlight that destroy a subvolume without looking to its content 
>> could break the POSIX rules
> No, it doesn't break POSIX rules, because you can't do it with POSIX defined 
> interfaces (not counting ioctls, but ioctls are by definition system 
> dependent).

user$ mkdir -p dir1/dir2
user$ touch dir1/dir2/file1
user$ sudo chown foo:foo dir1/dir2

user$ rm -rf dir1/
rm: cannot remove 'dir1/dir2/file1': Permission denied

In a POSIX filesystem, you (as user) cannot remove file1.

But if instead of dir1 you have a subvolume (called sub1), and the check for 
removing the subvolume is performed only on the root of the subvolume, you could 
remove it.

Is it a problem? I think not, but I am not a security expert. But even if POSIX 
doesn't have the concept of a subvolume, this is definitely a break of a POSIX 
rule.

On the other side, if the user makes a snapshot of a subvolume containing a 
non-empty directory they do not own, it would be impossible for the user to 
delete their own snapshot (!).


>>
>>>
>>> ii) if the snapshot contains a file the user cannot remove then the user
>>> cannot remove the snapshot at all (even though they could create it).
>>
>> It is the same for the other filesystem: if in a filesystem tree, there is a 
>> directory which is not changeable by the user, the user cannot remove all 
>> the tree. Again, I am only highlighting that there is a possible break of 
>> POSIX rules.
> POSIX says nothing about non-permissions related reasons for not being able 
> to remove a directory. 
Nobody has said the opposite.

>  A subvolume is treated (too much in my opinion) like a mount point, 
>regardless of whether or not it's explicitly mounted, and that seems to be 
>most of why you can't remove it with unlink().



>>
>>>
>>> That is why I preferred Austin's first proposal. I suppose your proposal
>>> could be extended:
>>>
>>> a) the user is root  or
>>> b.1) if the subvolume is empty and
>>> b.2) the user has the write capability to the parent directory, or
>>> c) the subvolume is readonly
>>
>> If a subvolume is marked read-only, the user has to change to RW via "btrfs 
>> property set"
> I actually do agree with this assessment to a 

Re: [RFC] Improve subvolume usability for a normal user

2017-12-05 Thread Goffredo Baroncelli
On 12/05/2017 07:46 PM, Graham Cobb wrote:
> On 05/12/17 18:01, Goffredo Baroncelli wrote:
>> On 12/05/2017 04:42 PM, Graham Cobb wrote:
[]
>>>>> Then no impact to kernel, all complex work is done in user space.
>>>> Exactly how hard is it to just check ownership of the root inode of a
>>>> subvolume from the ioctl context?  You could just as easily push all the
>>>> checking out to the VFS layer by taking an open fd for the subvolume
>>>> root (and probably implicitly closing it) instead of taking a path, and
>>>> that would give you all the benefits of ACL's and whatever security
>>>> modules the local system is using.  
>>>
>>> +1 - stop inventing new access control rules for each different action!
>>
>> A subvolume is like a directory; In all filesystems you cannot remove an 
>> inode if you don't have write access to the parent directory. I assume that 
>> this is a POSIX requirement; and if so this should be true even for BTRFS.
>> This means that in order to remove a subvolume and you are not root, you 
>> should check all the contained directories. It is not sufficient to check 
>> one inode.
>>
>> In the past I create a patch [1][2] which extended the unlink (2) syscall to 
>> allow removing a subvolume if:
>> a) the user is root  or
>> b.1) if the subvolume is empty and
>> b.2) the user has the write capability to the parent directory
> 
> That is also OK, as it follows a logical and consistent model without
> introducing new rules, but it has two disadvantages with snapshots the
> user creates and then wants to delete:
> 
> i) they have to change the snapshot to readwrite to remove all the
> contents -- how many users will know that that can even be done?

I think that this is an orthogonal question. However, the user should use "btrfs 
property set".

Anyway, I am not against using something more specific like "btrfs sub del...", 
where it would be possible to implement more specific checks. I am only 
highlighting that destroying a subvolume without looking at its content could 
break the POSIX rules.

> 
> ii) if the snapshot contains a file the user cannot remove then the user
> cannot remove the snapshot at all (even though they could create it).

It is the same for other filesystems: if in a filesystem tree there is a 
directory which is not changeable by the user, the user cannot remove the whole 
tree. Again, I am only highlighting that there is a possible break of POSIX 
rules.

> 
> That is why I preferred Austin's first proposal. I suppose your proposal
> could be extended:
> 
> a) the user is root  or
> b.1) if the subvolume is empty and
> b.2) the user has the write capability to the parent directory, or
> c) the subvolume is readonly

If a subvolume is marked read-only, the user has to change it to RW via "btrfs 
property set".

> 
> However it is done, I think it is important that all normal VFS access
> rules (such as ACLs) are followed for the subvolume itself.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC] Improve subvolume usability for a normal user

2017-12-05 Thread Goffredo Baroncelli
On 12/05/2017 09:25 AM, Misono, Tomohiro wrote:
> Hello all,
> 
> I want to address some issues of subvolume usability for a normal user.
> i.e. a user can create subvolumes, but
>  - Cannot delete their own subvolume (by default)
>  - Cannot tell subvolumes from directories (in a straightforward way)
>  - Cannot check the quota limit when qgroup is enabled
> 
> Here I show the initial thoughts and approaches to this problem.
> I want to check if this is a right approach or not before I start writing 
> code.
> 
> Comments are welcome.
> Tomohiro Misono
> 
> ==
> - Goal and current problem
> The goal of this RFC is to give a normal user more control to their own 
> subvolumes.
> Currently the control to subvolumes for a normal user is restricted as below: 
> 
> +-+--+--+
> |   command   | root | user |
> +-+--+--+
> | sub create  | Y| Y|
> | sub snap| Y| Y|
> | sub del | Y| N|
> | sub list| Y| N|
> | sub show| Y| N|
> | qgroup show | Y| N|
> +-+--+--+
> 

I suggest also considering 

   btrfs fi usage

see my patches [1] and [2]

> In short, I want to change this as below in order to improve user's usability:
> 
> +-+--++
> |   command   | root | user   |
> +-+--++
> | sub create  | Y| Y  |
> | sub snap| Y| Y  |
> | sub del | Y| N -> Y |
> | sub list| Y| N -> Y |
> | sub show| Y| N -> Y |
> | qgroup show | Y| N -> Y |
> +-+--++
> 
> In words,
> (1) allow deletion of subvolume if a user owns it, and 
I commented on this in another part of this thread. My suggestion is to extend 
unlink(2) to remove a subvolume if it is empty and if the user has the needed 
privileges. This would be fully POSIX-compliant.

[]

> 
>  (2) getting subvolume/quota info
> 
>   TREE_SEARCH ioctl is used to retrieve the subvolume/quota info by 
> btrfs-progs
>   (sub show/list, qgroup show etc.). This requires the root permission.
>   
>   The easiest way to allow a user to get subvolume/quota info is just relaxing
>   the permission of TREE_SEARCH. However, since all the tree information (inc.
>   file name) will be exposed, this poses a sequrity risk and is not 
> acceptable.
>   
>   So, I want to introduce 2 ioctls to get subvolume/quota info respectively 
> for
>   a normal user, which return only the info readable by the user:
>   
>[subvolume info]
> Mainly search ROOT tree for ROOT_ITEM/ROOT_BACKREF, but check its read
>right by searching FS/FILE tree and comparing it with caller's uid.

Is it POSIX-compliant? Currently I can't access an entry if its parent 
directory doesn't have the 'x' bit enabled.
From another point of view, a subvolume is more like a "mount --bind". And I 
can know which devices are mounted, and where, by looking at /proc/self/mountinfo.

I fear that we should build the path at the kernel level, then check (again at 
the kernel level) whether all path elements are accessible by the user, and then 
return the full path to the user. I am inclined to think that accessing 
element by element (on the basis of the parent directory access bits) would be 
a nightmare from a POSIX-compliance POV.

A simpler approach would be to traverse the whole filesystem tree (via 
opendir/readdir) and return all the items with inode == 256 (something like 
'find / -inum 256'), even though it would be very inefficient.
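A sketch of that brute-force walk, assuming only that every btrfs subvolume root has inode number 256 (note that on other filesystems, or below a mount point, an unrelated file may also happen to have inode 256, so matches are only candidates):

```c
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

static int nfound;	/* entries with inode 256 seen during the walk */

static int visit(const char *path, const struct stat *sb,
		 int typeflag, struct FTW *ftwbuf)
{
	(void)typeflag; (void)ftwbuf;
	if (sb->st_ino == 256) {
		printf("%s\n", path);	/* candidate subvolume root */
		nfound++;
	}
	return 0;			/* 0 == keep walking */
}

/* Equivalent of 'find root -inum 256': returns the number of matches,
 * or -1 if the walk itself failed. FTW_PHYS avoids following symlinks. */
static int find_subvol_roots(const char *root)
{
	nfound = 0;
	if (nftw(root, visit, 16, FTW_PHYS) != 0)
		return -1;
	return nfound;
}
```

This needs no privileged ioctl at all, but it stats every inode in the tree, which is exactly the inefficiency mentioned above.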


>   
>[quota info]
> Same as above, but mainly search QUOTA tree and only returns level 0 info.
>   
>   Also, in order to construct subvolume path, permission of INO_LOOKUP ioctl
>   needs to be relaxed for the user who has read access to the subvolume.
> 
> 
> - Summary of Proposal
>  - Change the default behavior to allow a user to delete their own subvolumes
>  - Add 2 new non-root ioctls to get subvolume/quota info for accessible 
> subvolumes
> ==
> 
> 

[1] https://www.spinics.net/lists/linux-btrfs/msg69564.html
[2] https://www.spinics.net/lists/linux-btrfs/msg69566.html

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC] Improve subvolume usability for a normal user

2017-12-05 Thread Goffredo Baroncelli
 open fd for the subvolume
>> root (and probably implicitly closing it) instead of taking a path, and
>> that would give you all the benefits of ACL's and whatever security
>> modules the local system is using.  
> 
> +1 - stop inventing new access control rules for each different action!

A subvolume is like a directory; in all filesystems you cannot remove an inode 
if you don't have write access to the parent directory. I assume that this is a 
POSIX requirement; and if so, this should be true for BTRFS too.
This means that in order to remove a subvolume when you are not root, you have 
to check all the contained directories; it is not sufficient to check one inode.

In the past I created a patch [1][2] which extended the unlink(2) syscall to 
allow removing a subvolume if:
a) the user is root, or
b.1) the subvolume is empty and
b.2) the user has write access to the parent directory

This would solve all the problems:
a) removing a subvolume is like removing a directory; no different check is 
performed; no new (and strange) rule
b) an ordinary user can remove a subvolume, like a directory
c) it is quite easy to implement

Root still has another, more efficient way to remove a subvolume: it can invoke 
BTRFS_IOC_SNAP_DESTROY, which doesn't need to traverse the whole tree.

BR
G.Baroncelli


[1] https://www.spinics.net/lists/linux-btrfs/msg06499.html
[2] https://patchwork.kernel.org/patch/260301/


Re: [PATCH v2] btrfs: handle dynamically reappearing missing device

2017-11-20 Thread Goffredo Baroncelli
On 11/20/2017 09:19 AM, Anand Jain wrote:
> 
> 
> On 11/18/2017 09:52 PM, Goffredo Baroncelli wrote:
>> On 11/17/2017 01:36 PM, Anand Jain wrote:
>>> If the device is not present at the time of (-o degrade) mount,
>>> the mount context will create a dummy missing struct btrfs_device.
>>> Later this device may reappear after the FS is mounted and
>>> then device is included in the device list but it missed the
>>> open_device part. So this patch handles that case by going
>>> through the open_device steps which this device missed and finally
>>> adds to the device alloc list.
>>
>> What happens if the first devices got writes before the "last device" is 
>> joined ?
>>
>> Supposing to have a raid1 pair of devices: sda, sdb
>>
>> - sda is dissappeared (usb detached ?)
>> - the filesystem is mounted as
>> mount -o degraded /dev/sdb
>> - some writes happens on /dev/sdb
>> - the user reattach /dev/sda
>> - udev run "btrfs dev scan"
>> - the system joins /dev/sda to the filesystem disks pool
>>
>> Because the filesystem is a raid1, btrfs may read from /dev/sdb (updated 
>> data) or /dev/sda (old data), or worse read something from the former and 
>> something from the later (metadata from sda and data from sdb ) ???
> 
>  Thanks for the test scenario, this case is fine as its read from
>  the disk having highest generation number, so we pick sdb in the
>  case above.

OK, so in this case what is the benefit of adding the disk? With a lower 
generation number, will the added disk be used at all?
In this case, wouldn't an explicit user intervention be better than an 
automatic one?


> 
> Thanks, Anand
> 
> 
>> Am I missing something ?
>>
>> BR
>> G.Baroncelli
>>
> 




Re: [PATCH v2] btrfs: handle dynamically reappearing missing device

2017-11-18 Thread Goffredo Baroncelli
On 11/17/2017 01:36 PM, Anand Jain wrote:
> If the device is not present at the time of (-o degrade) mount,
> the mount context will create a dummy missing struct btrfs_device.
> Later this device may reappear after the FS is mounted and
> then device is included in the device list but it missed the
> open_device part. So this patch handles that case by going
> through the open_device steps which this device missed and finally
> adds to the device alloc list.

What happens if the first device gets writes before the "last" device is 
joined?

Suppose a raid1 pair of devices: sda, sdb

- sda disappears (USB detached?)
- the filesystem is mounted with
mount -o degraded /dev/sdb
- some writes happen on /dev/sdb
- the user reattaches /dev/sda
- udev runs "btrfs dev scan"
- the system joins /dev/sda to the filesystem disk pool

Because the filesystem is a raid1, btrfs may read from /dev/sdb (updated data) 
or /dev/sda (old data), or worse read something from the former and something 
from the latter (metadata from sda and data from sdb)???

Am I missing something ?

BR
G.Baroncelli



Re: [PATCH] btrfs: avoid overflow when sector_t is 32 bit

2017-10-04 Thread Goffredo Baroncelli
On 10/04/2017 07:13 PM, Liu Bo wrote:
> On Wed, Oct 04, 2017 at 04:22:28PM +0200, David Sterba wrote:
>> On Tue, Oct 03, 2017 at 07:31:10PM +0200, Goffredo Baroncelli wrote:
>>> From: Goffredo Baroncelli <kreij...@inwind.it>
>>>
>>> Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
>>> btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/) introduces a
>>> regression on 32 bit machines.
>>> When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
>>> (2TB+) block devices and files) sector_t is 32 bit on 32bit machines.
>>>
>>> In the function submit_extent_page, 'sector' (which is sector_t type) is
>>> multiplied by 512 to convert it from sectors to bytes, leading to an
>>> overflow when the disk is bigger than 4GB (!).
>>
>> That's not good. There are some known typedefs that hide the 32bit/64bit
>> differences but the LBDAF and sector_t is new to me. Thanks for the
>> report and fix, I'll get it to linus/master tree in the next batch so it
>> can go to stable tree.
>>
>> I've seen sector_t used in places where it is not necessary so I'll try
>> to minimize the usage and more surprises from the << 9 shifts.
>>
>> Reviewed-by: David Sterba <dste...@suse.com>
>> Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
>> CC: sta...@vger.kernel.org # 4.13+
> 
> However, this sector_t is passed from its callers, e.g.
> 
> __do_readpage()
> {
>   sector_t sector;
>   ...
>   sector = em->block_start >> 9;
>   ...
> }
> 
> if sector_t is 32bit, the above %sector could also lose high bits.
> Some cleanups are needed to use u64 directly.

If sector_t is 32 bit, the kernel can't access disks bigger than 2TB. So I 
suppose that block_start is less than 2TB, and block_start >> 9 still fits in 
32 bits.
> 
> Even with this patch, I suspect that there might be errors from block
> layer as sector_t is used everywhere in block layer.
> 
> For a btrfs FS that is created and used on 64bit system, it'll be
> causing trouble anyway if letting it mount 32bit system, lets refuse
> the mount firstly.

I think that the check should be:

if sizeof(sector_t) < 8 && filesystem_end > 2TB then
do_not_mount_it


I am not sure what the other filesystems do

> 
> Thanks,
> 
> -liubo
> 




[PATCH] btrfs: avoid overflow when sector_t is 32 bit

2017-10-03 Thread Goffredo Baroncelli
From: Goffredo Baroncelli <kreij...@inwind.it>

Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/) introduces a
regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).

I added a cast to u64 to avoid overflow.

Based on v4.14-rc3.


Signed-off-by: Goffredo Baroncelli <kreij...@inwind.it>
Tested-by: Jean-Denis Girard <jd.gir...@sysnux.pf>

---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 12ab19a4b93e..970190cd347e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2801,7 +2801,7 @@ static int submit_extent_page(unsigned int opf, struct 
extent_io_tree *tree,
}
}
 
-   bio = btrfs_bio_alloc(bdev, sector << 9);
+   bio = btrfs_bio_alloc(bdev, (u64)sector << 9);
bio_add_page(bio, page, page_size, offset);
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
-- 
2.14.2



Re: Mount error on 32 bits, ok on 64 bits

2017-10-02 Thread Goffredo Baroncelli
On 10/02/2017 11:46 PM, Goffredo Baroncelli wrote:
> which on 32bit is 32 bit if CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == 
> Support for large (2TB+) block devices and files)

Only now did I notice that in your first email you attached the config; indeed, 
CONFIG_LBDAF is not set in your config.

BR
G.Baroncelli


Re: Mount error on 32 bits, ok on 64 bits

2017-10-02 Thread Goffredo Baroncelli
On 10/02/2017 07:59 PM, Jean-Denis Girard wrote:
> Le 30/09/2017 à 17:29, Jean-Denis Girard a écrit :
>> Le 28/09/2017 à 19:26, Jean-Denis Girard a écrit :
>> The problem seems to come from commit c821e7f3 "pass bytes to
>> btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/): the
>> system is now running fine on 4.13.4 with only that patch reverted.
> 
> Same situation with 4.14-rc3: my system cannot mount root file-system.
> If I revert the patch, the system boots normally. This is 100%
> reproducible. How can I help resolve that issue?


Looking at the patch, this chunk seems suspect:

@@ -2798,7 +2798,7 @@  static int submit_extent_page(int op, int op_flags, 
struct extent_io_tree *tree,
}
}
 
-   bio = btrfs_bio_alloc(bdev, sector);
+   bio = btrfs_bio_alloc(bdev, sector << 9);
bio_add_page(bio, page, page_size, offset);
bio->bi_end_io = end_io_func;
bio->bi_private = tree;

Now 'sector' is defined as

sector_t   [1]

which in turn might be defined as

unsigned long   [2]

which on 32-bit machines is 32 bits wide if CONFIG_LBDAF is _not_ defined 
(CONFIG_LBDAF == Support for large (2TB+) block devices and files).

The point is that 

sector << 9 

may overflow if the disk is bigger than 4GB (and in your case it seems to be 8 
GB).

If I am correct, could you please be so kind as to:

- reapply the patch
- AND replace
bio = btrfs_bio_alloc(bdev, sector << 9);
with
bio = btrfs_bio_alloc(bdev, (u64)sector << 9);


[1] 
http://elixir.free-electrons.com/linux/latest/source/fs/btrfs/extent_io.c#L2762
[2] 
http://elixir.free-electrons.com/linux/latest/source/include/linux/types.h#L133
> 
> 
> Thanks,
> 
BR
G.Baroncelli



Re: Why do full balance and deduplication reduce available free space?

2017-10-02 Thread Goffredo Baroncelli
On 10/02/2017 12:02 PM, Niccolò Belli wrote:
> Hi,
> I have several subvolumes mounted with compress-force=lzo and autodefrag.
> Since I use lots of snapshots (snapper keeps around 24 hourly snapshots,
> 7 daily snapshots and 4 weekly snapshots) I had to create a systemd timer
> to perform a full balance and deduplication each night. In fact data needs
> to be already deduplicated when snapshots are created, otherwise I have no
> other way to deduplicate snapshots.


[...] 
> Data,single: Size:44.00GiB, Used:40.00GiB
>   /dev/sda5  44.00GiB
> 
> Metadata,single: Size:5.00GiB, Used:3.78GiB
>   /dev/sda5   5.00GiB
[...]

> Data,single: Size:41.00GiB, Used:40.01GiB
>   /dev/sda5  41.00GiB
> 
> Metadata,single: Size:5.00GiB, Used:3.77GiB
>   /dev/sda5   5.00GiB
[...]

> Data,single: Size:41.00GiB, Used:40.03GiB
>   /dev/sda5  41.00GiB
> 
> Metadata,single: Size:5.00GiB, Used:3.84GiB
>   /dev/sda5   5.00GiB
[...]

> Data,single: Size:41.00GiB, Used:40.04GiB
>   /dev/sda5  41.00GiB
> 
> Metadata,single: Size:5.00GiB, Used:3.97GiB
>   /dev/sda5   5.00GiB
[]

 
> 
> It further reduced the available free space! Balance and deduplication 
> actually reduced my available free space of 400MB!
> 400MB each night!

Your data increased by 40MB (over 40GB, so about ~0.1%); your metadata instead 
increased by about 200MB (over ~4GB, so about ~5%); so

1) it seems to me that your data is already quite "deduped"
2) (NB: this is my guess) I think that deduping (and/or re-balancing) 
rearranges the metadata, leading to increased disk usage. The only explanation 
that I found is that deduping breaks the sharing of metadata with the 
snapshots:
- a snapshot shares the metadata, which in turn refers to the data. Because the 
metadata is shared, there is only one copy. The metadata remains shared until 
it is changed/updated.
- dedupe, when it shares a file block, updates the metadata, breaking the 
sharing with the snapshots and thus creating a copy of it.
NB: updating snapshot metadata is the same as updating subvolume metadata


> How is it possible? Should I avoid doing balances and deduplications at all?

Try a few days without deduplication and check if something changes. It may be 
sufficient to delay the deduping: not every night, but every week or month.
Another option is running dedupe on all the files (including the snapshotted 
ones). This would still break the metadata sharing, but the extents should 
still be shared (IMHO :-)). Of course the cost of deduping will increase a lot 
(about 24+7+4 = 35 times)


> 
> Thanks,
> Niccolò

BR
G.Baroncelli



Re: What means "top level" in "btrfs subvolume list" ?

2017-10-02 Thread Goffredo Baroncelli
On 09/30/2017 04:53 PM, Peter Grandi wrote:
>> My Hypothesis, it should be the ID of the root subvolume ( or
>> 5 if it is not mounted). [ ... ]
> Well, a POSIX filesystem typically has a root directory, and it
> can be mounted as the system root or any other point. A Btrfs
> filesystem has multiple root directories, that are mounted by
> default "somewhere" (a design decision that I think was unwise,
> but "whatever").

I agree that the possibility of creating a subvolume/snapshot anywhere could 
create a bit of confusion; e.g. the "mountpoint" of a subvolume after a 
snapshot is a very strange beast:

ghigo@venice:/tmp$ btrfs sub cre sv1
Create subvolume './sv1'
ghigo@venice:/tmp$ btrfs sub cre sv1/sv2
Create subvolume 'sv1/sv2'
ghigo@venice:/tmp$ btrfs sub snap  sv1 sn1
Create a snapshot of 'sv1' in './sn1'
ghigo@venice:/tmp$ ls -lid sn1/
256 drwxr-xr-x 1 ghigo ghigo 6 Oct  2 20:54 sn1/
ghigo@venice:/tmp$ ls -li sn1/
total 0
2 drwxr-xr-x 1 root root 0 Oct  2 20:55 sv2
ghigo@venice:/tmp$ touch sn1/sv2/foo
touch: cannot touch 'sn1/sv2/foo': Permission denied
ghigo@venice:/tmp$ sudo touch sn1/sv2/foo
touch: cannot touch 'sn1/sv2/foo': Permission denied

It seems to be a directory, but it is not allowed to create files inside it (!)

However, the user is free to limit themselves to creating snapshots only under 
/ (rootid == 5) :-).

 
> The subvolume containing the mountpoint directory of another
> subvolume's root directory is is no way or sense its "parent",
> as there is no derivation relationship; root directories are
> independent of each other and their mountpoint is (or should be)
> a runtime entity.

This is true. However I think that "btrfs sub list" was designed to show where 
subvolumes (or snapshots) are, for two reasons:
1) to avoid searching through the whole filesystem looking for directories 
having inode == 256
2) when the user mounts a subvolume as root, he has no direct access to all 
subvolumes unless he also mounts / (rootid=5) in another place. "btrfs sub 
list" does not have this limit

BTW, my preferred choice is:
- mount a subvolume as root;
- mount the root filesystem (rootid=5) under /var/btrfs
- make snapshots and/or subvolumes only under / (rootid=5), and handle these in 
/var/btrfs
- if needed, mount a subvolume in the filesystem through fstab;
- however I am still searching for a way of handling the snapshots which fully 
satisfies me (putting them under / is not the best)
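As a hypothetical /etc/fstab sketch of that layout (the UUID is a placeholder, and the mount options beyond subvol=/subvolid= are omitted):

```
# root of the system: the "debian" subvolume
UUID=<fsuuid>  /           btrfs  subvol=debian  0  0
# the whole filesystem tree (rootid=5), for managing snapshots/subvolumes
UUID=<fsuuid>  /var/btrfs  btrfs  subvolid=5     0  0
```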

BR



Re: [PATCH v3 07/14] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-10-02 Thread Goffredo Baroncelli
On 10/02/2017 01:47 PM, Austin S. Hemmelgarn wrote:
>>>
>>> What about doing it on a file instead of a device ? As sparse file, it 
>>> would be less expensive to enlarge then shrink. I think that who need to 
>>> build a filesystem with "shrink", doesn't need to create it on a real block 
>>> device
>>
>> For device, nothing different, just v3 patchset will handle it.
> Agreed on this point.
>>
>> For file, sparse file of course.
> But not on this one.  Unless the image gets properly compacted, a sparse file 
> will only help when you're just storing the image on a filesystem.

I think that you have misunderstood my proposal... My suggestion is to create 
the image using a file, compact it after image creation, and then cut it at the 
end. So you don't have to care about the space needed during the process 
(before the shrinking).

Today mkfs.btrfs fails to create an image on an empty file

ghigo@venice:/tmp$ mkdir test
ghigo@venice:/tmp$ mkdir test/test1
ghigo@venice:/tmp$ touch test/test1/file
ghigo@venice:/tmp$ mkfs.btrfs --root test disk.img
btrfs-progs v4.12
See http://btrfs.wiki.kernel.org for more information.

ERROR: failed to check size for disk.img: No such file or directory
ghigo@venice:/tmp$ touch disk.img
ghigo@venice:/tmp$ mkfs.btrfs --root test disk.img
btrfs-progs v4.12
See http://btrfs.wiki.kernel.org for more information.

ERROR: 'disk.img' is too small to make a usable filesystem
ERROR: minimum size for each btrfs device is 41943040

you have to create a big enough file first:

ghigo@venice:/tmp$ truncate -s 10G disk.img 
ghigo@venice:/tmp$ mkfs.btrfs --root test disk.img
btrfs-progs v4.12
See http://btrfs.wiki.kernel.org for more information.
[...]


BR


What means "top level" in "btrfs subvolume list" ?

2017-09-30 Thread Goffredo Baroncelli
(please ignore my previous email, because I wrote somewhere "top id" instead of 
"top level")
Hi All,

I am trying to figure out what "top level" means in the output of "btrfs sub 
list"


ghigo@venice:~$ sudo btrfs sub list /
[sudo] password for ghigo: 
ID 257 gen 548185 top level 5 path debian
ID 289 gen 418851 top level 257 path var/lib/machines
ID 299 gen 537230 top level 5 path boot
ID 532 gen 502364 top level 257 path tmp/test1
ID 533 gen 502277 top level 532 path tmp/test1/test2
ID 534 gen 502407 top level 257 path tmp/test1.snap
ID 535 gen 502363 top level 532 path tmp/test1/test3
ID 536 gen 502407 top level 257 path tmp/test1.snap2
ID 537 gen 537132 top level 5 path test
ID 538 gen 537130 top level 537 path test/sub1

The root filesystem is mounted as

mount -o subvol=debian 


I think that "btrfs sub list" is buggy: it outputs a "top level ID" equal to 
the "parent ID" (on the basis of the code). But I am still asking what the 
RIGHT "top level id" would be. My hypothesis: it should be the ID of the root 
subvolume (or 5 if it is not mounted). So the output should be

 
ghigo@venice:~$ sudo btrfs sub list /
[sudo] password for ghigo: 
ID 257 gen 548185 top level 5 path debian
ID 289 gen 418851 top level 257 path var/lib/machines
ID 299 gen 537230 top level 5 path boot
ID 532 gen 502364 top level 257 path tmp/test1
ID 533 gen 502277 top level 257 path tmp/test1/test2
ID 534 gen 502407 top level 257 path tmp/test1.snap
ID 535 gen 502363 top level 257 path tmp/test1/test3
ID 536 gen 502407 top level 257 path tmp/test1.snap2
ID 537 gen 537132 top level 5 path test
ID 538 gen 537130 top level 5 path test/sub1


The subvolumes in the file system (mounted from /debian) are

/   ID = 5
/debian ID = 257
/debian/var/lib/machinesID = 289
/boot   ID = 299
/debian/tmp/test1   ID = 532
/debian/tmp/test1/test2 ID = 533
/debian/tmp/test1.snap  ID = 534
/debian/tmp/test1/test3 ID = 535
/debian/tmp/test1.snap2 ID = 536
/test   ID = 537
/test/sub1      ID = 538

BR
G.Baroncelli





Re: Can't remove device -> I/O error

2017-09-30 Thread Goffredo Baroncelli
GlobalReserve, single: total=16.00MiB, used=0.00B

 Remove the device /dev/sda (id == 4).
 Note: because /dev/sda doesn't exist anymore, I had to pass the id
 of the device instead of the path to "btrfs dev del ..."

ghigo@emulato:~$ sudo -i
root@emulato:~# echo 1 >/sys/block/sda/device/delete 

root@emulato:~# btrfs dev us /mnt/btrfs1/
/dev/sda, ID: 4
   Device size:   0.00B
   Device slack:   16.00EiB
   Data,RAID5:  1.00GiB
   Metadata,RAID1:256.00MiB
   System,RAID1:   32.00MiB
   Unallocated: 8.72GiB

/dev/vdb, ID: 1
   Device size:10.00GiB
   Device slack:  0.00B
   Data,RAID5:  1.00GiB
   Metadata,RAID1:  1.00GiB
   Unallocated: 8.00GiB

/dev/vdc, ID: 2
   Device size:10.00GiB
   Device slack:  0.00B
   Data,RAID5:  1.00GiB
   Metadata,RAID1:  1.00GiB
   Unallocated: 8.00GiB

/dev/vdd, ID: 3
   Device size:10.00GiB
   Device slack:  0.00B
   Data,RAID5:  1.00GiB
   Metadata,RAID1:256.00MiB
   System,RAID1:   32.00MiB
   Unallocated: 8.72GiB

root@emulato:~# btrfs dev del 4 /mnt/btrfs1/

In dmesg a *few* errors were present, but the process was successful and I was 
able to run a scrub without any issue.




root@emulato:~# btrfs scrub start /mnt/btrfs1/
scrub started on /mnt/btrfs1/, fsid 9b13bdf1-539f-4529-bbc7-f392a072ee5c 
(pid=689)
[]
root@emulato:~# btrfs scrub status /mnt/btrfs1/
scrub status for 9b13bdf1-539f-4529-bbc7-f392a072ee5c
scrub started at Sat Sep 30 13:30:35 2017 and finished after 00:00:14
total bytes scrubbed: 1.59GiB with 0 errors


In both cases the device delete phase completed in a few seconds (about 30-60 
seconds for 1GB of data). It was an emulated environment, but I suppose that on 
bare metal it will also be quite quick.

BR
G.Baroncelli

> 
> 
> 
> Am 30.09.2017 um 09:16 schrieb Goffredo Baroncelli:
>> On 09/30/2017 01:06 AM, DocMAX wrote:
>>>>> Did you removed the disk before mounting (physically or doing echo 1 
>>>>> >/sys/block/xxx/device/delete)? Which steps you performed ?
>>> - removed drive physically
>>>
>>> - mounted degraded mode
>>>
>>> - btrfs dev del -> same i/o error
>>>
>> Did you switch off the machine ? If not, before mounting in degraded mode, 
>> do "echo 1 >/sys/block/xxx/device/delete". After the monting do a btrfs dev 
>> del missing
>>
>>
>>
>>
> 


Re: Can't remove device -> I/O error

2017-09-30 Thread Goffredo Baroncelli
On 09/30/2017 01:06 AM, DocMAX wrote:
>>> Did you removed the disk before mounting (physically or doing echo 1 
>>> >/sys/block/xxx/device/delete)? Which steps you performed ?
> 
> - removed drive physically
> 
> - mounted degraded mode
> 
> - btrfs dev del -> same i/o error
> 

Did you switch off the machine? If not, before mounting in degraded mode, do 
"echo 1 >/sys/block/xxx/device/delete". After the mounting, do a "btrfs dev del 
missing".



> 




Re: Can't remove device -> I/O error

2017-09-29 Thread Goffredo Baroncelli
On 09/29/2017 11:09 PM, DocMAX wrote:
> Thanks for the reply.
> 
> I don't want to replace the drive. I want to remove.
> 
> Also tried in degraded mode. I get the exact same error.

Did you remove the disk before mounting (physically, or by doing echo 1 
>/sys/block/xxx/device/delete)? Which steps did you perform?

> 
> I'm not sure but i think i formated the drive on Kernel 4.11.

This shouldn't matter
> 
> I am on Kernel 4.13 now.

OK, it is quite recent
> 
> 
> I have the bad feeling that i will never get rid of that small drive unless i 
> re-format.

No, it should not be necessary.

> 
> 
> 
> Am 29.09.2017 um 23:04 schrieb Goffredo Baroncelli:
>> On 09/29/2017 10:00 PM, Dirk Diggler wrote:
>>> Hi,
>>>
>>> is there any chance to get my device removed?
>> I simulated a device removing in KVM with
>>
>> echo 1 >/sys/block/sdj/device/delete
>>
>> then
>>
>> btrfs dev del 6 /mnt/
>>
>>
>> And I got success. But I am not sure if this is the right thing todo.
>>
>> You can use "btrfs replace start -r ". But you need another device.
>>
>> Otherwise, you can shutdown the filesystem, removing (physically) the disk 
>> then remount with a "mount -o degraded " followed by a "btrfs dev del 
>> missing /..."
>> Before doing so, please tell us which kernel you are using.
>>
>> RAID5/6 until few months ago has a lot of bugs, so if you have an old kernel 
>> it is very difficult to remove a device with success.
>>
> 
> 




Re: Can't remove device -> I/O error

2017-09-29 Thread Goffredo Baroncelli
On 09/29/2017 10:00 PM, Dirk Diggler wrote:
> Hi,
> 
> is there any chance to get my device removed?

I simulated a device removing in KVM with

echo 1 >/sys/block/sdj/device/delete

then 

btrfs dev del 6 /mnt/


And I got success. But I am not sure if this is the right thing todo.

You can use "btrfs replace start -r ". But you need another device.

Otherwise, you can shut down the filesystem, remove (physically) the disk, then 
remount with a "mount -o degraded " followed by a "btrfs dev del missing 
/..."
Before doing so, please tell us which kernel you are using.

RAID5/6 had a lot of bugs until a few months ago, so if you have an old kernel 
it is very difficult to remove a device successfully.

> Scrub literally takes months to complete (SATA 2/3 mix, about 1 minute
> per gigabyte) and i'm not sure if that helps.
> I guess same with balance. Mabye there is a quicker way. I can do
> without some data if it's corrupted. I have a backup, but i want to
> avoid to copy all data from scratch!
> 
> Whenever i try to remove dev 6, i get:
> 
> console:
> ERROR: error removing device '/dev/sdj': Input/output error
> 
> dmesg (i/o error right after this):
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520
> csum 0x98f94189 expected csum 0x585e5744 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616
> csum 0x98f94189 expected csum 0xcefd2ae0 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520
> csum 0x98f94189 expected csum 0x585e5744 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616
> csum 0x98f94189 expected csum 0xcefd2ae0 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520
> csum 0x4023cac1 expected csum 0x585e5744 mirror 2
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616
> csum 0xea91b663 expected csum 0xcefd2ae0 mirror 2
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520
> csum 0x98f94189 expected csum 0x585e5744 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520
> csum 0x4023cac1 expected csum 0x585e5744 mirror 2
> 
> My setup:
> /dev/sdf, ID: 3
>Device size: 2.73TiB
>Device slack:  0.00B
>Data,RAID5:333.00GiB
>Data,RAID5:  5.00GiB
>Unallocated: 2.40TiB
> 
> /dev/sdg, ID: 2
>Device size: 1.82TiB
>Device slack:  0.00B
>Data,RAID5:333.00GiB
>Data,RAID5:955.00GiB
>Data,RAID5:  5.57GiB
>Metadata,RAID1:  3.00GiB
>Unallocated:   566.44GiB
> 
> /dev/sdh, ID: 4
>Device size: 1.82TiB
>Device slack:  0.00B
>Data,RAID5:333.00GiB
>Data,RAID5:955.00GiB
>Data,RAID5:  5.57GiB
>Metadata,RAID1:  2.00GiB
>System,RAID1:   32.00MiB
>Unallocated:   567.41GiB
> 
> /dev/sdi, ID: 7
>Device size: 2.73TiB
>Device slack:  0.00B
>Data,RAID5:333.00GiB
>Data,RAID5:955.00GiB
>Data,RAID5:  5.57GiB
>Metadata,RAID1: 11.00GiB
>System,RAID1:   32.00MiB
>Unallocated: 1.45TiB
> 
> /dev/sdj, ID: 6
>Device size:   465.76GiB
>Device slack:  0.00B
>Data,RAID5:333.00GiB
>Data,RAID5:587.38MiB
>Unallocated:   132.19GiB
> 
> /dev/sdk, ID: 1
>Device size: 1.82TiB
>Device slack:  0.00B
>Data,RAID5:333.00GiB
>Data,RAID5:955.00GiB
>Data,RAID5:  5.57GiB
>Metadata,RAID1:  3.00GiB
>Unallocated:   566.44GiB
> 
> /dev/sdl, ID: 5
>Device size: 1.82TiB
>Device slack:  0.00B
>Data,RAID5:333.00GiB
>Data,RAID5:955.00GiB
>Data,RAID5:  5.57GiB
>Metadata,RAID1:  3.00GiB
>    Unallocated:   566.44GiB
> 
> Thanks,
> DocMAX


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH v3 07/14] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-09-29 Thread Goffredo Baroncelli
On 09/28/2017 02:00 AM, Qu Wenruo wrote:
> 
> 
> On 2017年09月28日 00:20, David Sterba wrote:
>> On Mon, Sep 25, 2017 at 07:15:30AM -0400, Austin S. Hemmelgarn wrote:
>>> On 2017-09-24 10:08, Goffredo Baroncelli wrote:
>>>> On 09/24/2017 12:10 PM, Anand Jain wrote:
>>
>> A lot of points in this thread, let me address them here.
>>
>> I don't want to remove --rootdir functionality, the fix that's being
>> proposed removes the minimalization -- feature that was not well known,
>> but I understand it's useful (and already used).
>>
>> I'd like to fix that in another way, eg. as an option to mkfs or a
>> separate tool.
>>
>> I'm not worried about adding more code or code complexity. If we do it
>> right it's not a problem in the long run. But people for some reason
>> like to delete other people's code or functionality.
>>
>> Regarding guessing number of users, this is always hard. So if there's
>> one vocal enough to let us know about the usecase, it's IMHO good time
>> to explore the it, code-wise and documentation-wise, and fix it or
>> improve.
>>
>> So, what next. I'd like to get rid of the custom chunk allocator, namely
>> because of the known 1MB area misuse and duplication.
>>
>> Implementing the minimalization might need some preparatory work and I
>> don't have a realistic ETA now.
> 
> Well, using an over-reserve-then-shrink method, it could be done without 
> much hassle.

What about doing it on a file instead of a device? As a sparse file, it would be 
less expensive to enlarge and then shrink. I think that whoever needs to build a 
filesystem with "shrink" doesn't need to create it on a real block device
> 
> However the ETA may be delayed to the middle of Oct, as I'm going to enjoy my 
> holiday from 1st Oct to 7th Oct.
> 
> Thanks,
> Qu
> 
>> Ideally we'd fix both problems in one
>> version, as I'd rather avoid "temporary" solution to drop the
>> minimalization with the promise to put it back later.
>>
> -- 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH] New ioctl BTRFS_IOC_GET_CHUNK_INFO

2017-09-25 Thread Goffredo Baroncelli
From: Goffredo Baroncelli <kreij...@inwind.it>

Add a new ioctl to get info about chunks without requiring root privileges.
This allows a non-root user to know how the space of the filesystem is
allocated.

Signed-off-by: Goffredo Baroncelli <kreij...@inwind.it>
---
 fs/btrfs/ioctl.c   | 215 +
 include/uapi/linux/btrfs.h |  38 
 2 files changed, 253 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index fa1b78cf25f6..d9ba9ed52693 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2220,6 +2220,219 @@ static noinline int btrfs_ioctl_tree_search_v2(struct 
file *file,
return ret;
 }
 
+/*
+ * Return:
+ * 0   -> copied all data, possible further data
+ * 1   -> copied all data, no further data
+ * -EAGAIN -> not enough space, restart it
+ * -EFAULT -> the user passed an invalid address/size pair
+ */
+static noinline int copy_chunk_info(struct btrfs_path *path,
+  char __user *ubuf,
+  size_t buf_size,
+  u64 *used_buf,
+  int *num_found,
+  u64 *offset)
+{
+   struct extent_buffer *leaf;
+   unsigned long item_off;
+   unsigned long item_len;
+   int nritems;
+   int i;
+   int slot;
+   int ret = 0;
+   struct btrfs_key key;
+
+   leaf = path->nodes[0];
+   slot = path->slots[0];
+   nritems = btrfs_header_nritems(leaf);
+
+   for (i = slot; i < nritems; i++) {
+   u64 destsize;
+   struct btrfs_chunk_info ci;
+   struct btrfs_chunk chunk;
+   int j, chunk_size;
+
+   item_off = btrfs_item_ptr_offset(leaf, i);
+   item_len = btrfs_item_size_nr(leaf, i);
+
+   btrfs_item_key_to_cpu(leaf, &key, i);
+   /*
+* we are not interested in other items type
+*/
+   if (key.type != BTRFS_CHUNK_ITEM_KEY) {
+   ret = 1;
+   goto out;
+   }
+
+   /*
+* In any case, the next search must start from here
+*/
+   *offset = key.offset;
+   read_extent_buffer(leaf, &chunk, item_off, sizeof(chunk));
+
+   /*
+* chunk.num_stripes-1 is correct, because btrfs_chunk includes
+* already a stripe
+*/
+   destsize = sizeof(struct btrfs_chunk_info) +
+   (chunk.num_stripes - 1) * sizeof(struct btrfs_stripe);
+
+   BUG_ON(destsize > item_len);
+
+   if (buf_size < destsize + *used_buf) {
+   if (*num_found)
+   /* try another time */
+   ret = -EAGAIN;
+   else
+   /* in any case the buffer is too small */
+   ret = -EOVERFLOW;
+   goto out;
+   }
+
+   /* copy chunk */
+   chunk_size = offsetof(struct btrfs_chunk_info, stripes);
+   memset(&ci, 0, chunk_size);
+   ci.length = btrfs_stack_chunk_length(&chunk);
+   ci.stripe_len = btrfs_stack_chunk_stripe_len(&chunk);
+   ci.type = btrfs_stack_chunk_type(&chunk);
+   ci.num_stripes = btrfs_stack_chunk_num_stripes(&chunk);
+   ci.sub_stripes = btrfs_stack_chunk_sub_stripes(&chunk);
+   ci.offset = key.offset;
+
+   if (copy_to_user(ubuf + *used_buf, &ci, chunk_size)) {
+   ret = -EFAULT;
+   goto out;
+   }
+   *used_buf += chunk_size;
+
+   /* copy stripes */
+   for (j = 0 ; j < chunk.num_stripes ; j++) {
+   struct btrfs_stripe chunk_stripe;
+   struct btrfs_chunk_info_stripe csi;
+
+   /*
+* j-1 is correct, because btrfs_chunk includes already
+* a stripe
+*/
+   read_extent_buffer(leaf, &chunk_stripe,
+   item_off + sizeof(struct btrfs_chunk) +
+   sizeof(struct btrfs_stripe) *
+   (j - 1), sizeof(chunk_stripe));
+
+   memset(&csi, 0, sizeof(csi));
+
+   csi.devid = btrfs_stack_stripe_devid(&chunk_stripe);
+   csi.offset = btrfs_stack_stripe_offset(&chunk_stripe);
+   memcpy(csi.dev_uuid, chunk_stripe.dev_uuid,
+   sizeof(chunk_stripe.dev_uuid));
+   if (copy_to_user(ubuf + *used_buf, &csi, sizeof(csi))) {
+  

[RFC] btrfs: new ioctl BTRFS_IOC_GET_CHUNK_INFO

2017-09-25 Thread Goffredo Baroncelli
The aim of this patch is to replace the BTRFS_IOC_TREE_SEARCH ioctl when
it is used to obtain information about the chunks/block groups. 

The problem with using BTRFS_IOC_TREE_SEARCH is that it accesses
the very low-level data structures of BTRFS. This means: 
1) it would complicate a possible change of the disk format
2) it requires root privileges

The new BTRFS_IOC_GET_CHUNK_INFO is an attempt to address these issues.
This ioctl can be called by a non-root user: I think that the data exposed 
is not sensitive.

In a separate patch set, there is an update to the btrfs utility to
allow it to use this new ioctl in the load_chunk_info() function.
So now it is possible to do "btrfs fi us" without root privileges.

This is an RFC, because I want to be sure that
1) my understanding that the chunk info is not "sensitive data" is right
2) the API is a good one. In particular, I am not sure that returning -EAGAIN
is the right thing to do when further data is available but the buffer is not
big enough.

Comments are welcome
BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



[PATCH 2/2] Use the new ioctl BTRFS_IOC_GET_CHUNK_INFO for load_chunk_info()

2017-09-25 Thread Goffredo Baroncelli
From: Goffredo Baroncelli <kreij...@inwind.it>

Use the new ioctl BTRFS_IOC_GET_CHUNK_INFO in load_chunk_info() instead of
BTRFS_IOC_TREE_SEARCH. The old method is still present as a fallback for
compatibility reasons.

The goal is to avoid BTRFS_IOC_TREE_SEARCH because it requires root
privileges.

Signed-off-by: Goffredo Baroncelli <kreij...@inwind.it>
---
 cmds-fi-usage.c | 108 ++--
 ioctl.h |  60 +++
 2 files changed, 165 insertions(+), 3 deletions(-)

diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index 6c846c15..6a101e28 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -36,7 +36,7 @@
 /*
  * Add the chunk info to the chunk_info list
  */
-static int add_info_to_list(struct chunk_info **info_ptr,
+static int legacy_add_info_to_list(struct chunk_info **info_ptr,
int *info_count,
struct btrfs_chunk *chunk)
 {
@@ -127,7 +127,8 @@ static int cmp_chunk_info(const void *a, const void *b)
((struct chunk_info *)b)->type);
 }
 
-static int load_chunk_info(int fd, struct chunk_info **info_ptr, int 
*info_count)
+static int legacy_load_chunk_info(int fd, struct chunk_info **info_ptr,
+   int *info_count)
 {
int ret;
struct btrfs_ioctl_search_args args;
@@ -180,7 +181,7 @@ static int load_chunk_info(int fd, struct chunk_info 
**info_ptr, int *info_count
off += sizeof(*sh);
item = (struct btrfs_chunk *)(args.buf + off);
 
-   ret = add_info_to_list(info_ptr, info_count, item);
+   ret = legacy_add_info_to_list(info_ptr, info_count, 
item);
if (ret) {
*info_ptr = NULL;
return 1;
@@ -213,6 +214,107 @@ static int load_chunk_info(int fd, struct chunk_info 
**info_ptr, int *info_count
return 0;
 }
 
+/*
+ * Add the chunk info to the chunk_info list
+ */
+static int add_info_to_list(struct chunk_info **info_ptr,
+   int *info_count,
+   struct btrfs_chunk_info *chunk)
+{
+
+   u64 type = chunk->type;
+   u64 size = chunk->length;
+   int num_stripes = chunk->num_stripes;
+   int j;
+
+   for (j = 0 ; j < num_stripes ; j++) {
+   int i;
+   struct chunk_info *p = NULL;
+   u64 devid;
+
+   devid = chunk->stripes[j].devid;
+
+   for (i = 0 ; i < *info_count ; i++)
+   if ((*info_ptr)[i].type == type &&
+   (*info_ptr)[i].devid == devid &&
+   (*info_ptr)[i].num_stripes == num_stripes) {
+   p = (*info_ptr) + i;
+   break;
+   }
+
+   if (!p) {
+   int tmp = sizeof(struct chunk_info) *
+   (*info_count + 1);
+   struct chunk_info *res = realloc(*info_ptr, tmp);
+
+   if (!res) {
+   free(*info_ptr);
+   error("not enough memory");
+   return -ENOMEM;
+   }
+
+   *info_ptr = res;
+   p = res + *info_count;
+   (*info_count)++;
+
+   p->devid = devid;
+   p->type = type;
+   p->size = 0;
+   p->num_stripes = num_stripes;
+   }
+
+   p->size += size;
+
+   }
+
+   return 0;
+
+}
+
+static int load_chunk_info(int fd, struct chunk_info **info_ptr,
+   int *info_count)
+{
+
+   char buf[4096];
+   struct btrfs_ioctl_chunk_info *bici =
+   (struct btrfs_ioctl_chunk_info *)buf;
+   int cont;
+
+   bici->buf_size = sizeof(buf);
+   bici->offset = (u64)0;
+
+   do {
+   int i;
+   struct btrfs_chunk_info *ci;
+   int ret;
+
+   cont = false;
+   ret = ioctl(fd, BTRFS_IOC_GET_CHUNK_INFO, bici);
+   if (ret < 0) {
+   int e = errno;
+
+   if (e == ENOTTY)
+   return legacy_load_chunk_info(fd, info_ptr,
+   info_count);
+   else if (e == EAGAIN)
+   cont = true;
+   else
+   return -e;
+   }
+
+   ci = btrfs_first_chunk_info(bici);
+   for (i = 0 ; i < bici->items_count ; i++) {
+   add_info

[RFC] btrfs-progs: use the new ioctl BTRFS_IOC_GET_CHUNK_INFO

2017-09-25 Thread Goffredo Baroncelli

Hi all,

this patch set uses the new ioctl BTRFS_IOC_GET_CHUNK_INFO to get
information about the chunks (see my other emails). 

The first patch changes the function get_partition_size() to avoid an
ioctl which would require root privileges (BLKGETSIZE64).
Instead it obtains the information via the sysfs filesystem.

The second patch uses BTRFS_IOC_GET_CHUNK_INFO instead of
TREE_SEARCH_IOCTL. The old code is still in place as a fallback
for when the kernel doesn't have the new ioctl.

These patches make it possible to use "btrfs fi usage" without root privileges.

Comments are welcome
BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH 1/2] get_partition_size() doesn't require root privileges

2017-09-25 Thread Goffredo Baroncelli
From: Goffredo Baroncelli <kreij...@inwind.it>

Allow the get_partition_size() to be called without
root privileges.

Signed-off-by: Goffredo baroncelli <kreij...@inwind.it>
---
 utils.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/utils.c b/utils.c
index 7a2710fe..ee3ed544 100644
--- a/utils.c
+++ b/utils.c
@@ -2080,18 +2080,29 @@ u64 disk_size(const char *path)
 
 u64 get_partition_size(const char *dev)
 {
-   u64 result;
-   int fd = open(dev, O_RDONLY);
+   struct stat statbuf;
+   int r;
+   int fd;
+   char buf[100];
 
-   if (fd < 0)
-   return 0;
-   if (ioctl(fd, BLKGETSIZE64, &result) < 0) {
-   close(fd);
+   r = stat(dev, );
+   if (r != 0)
return 0;
-   }
+
+   snprintf(buf, sizeof(buf),
+   "/sys/dev/block/%d:%d/size", major(statbuf.st_rdev),
+minor(statbuf.st_rdev));
+
+   fd = open(buf, O_RDONLY);
+   BUG_ON(fd < 0);
+
+   r = read(fd, buf, sizeof(buf)-1);
close(fd);
 
-   return result;
+   BUG_ON(r < 0);
+   buf[r] = 0;
+
+   return atoll(buf) * 512;
 }
 
 /*
-- 
2.14.1
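The trick the patch above relies on is that /sys/dev/block/MAJOR:MINOR/size is
world-readable and always reports the size in 512-byte sectors, regardless of
the device's logical block size. A minimal sketch of the same approach in
Python (names here are illustrative, not part of btrfs-progs):

```python
import os
import stat

def get_partition_size(dev: str) -> int:
    """Best-effort size in bytes of a block device, without root.

    Mirrors the sysfs approach of the patch: read
    /sys/dev/block/MAJ:MIN/size (512-byte sectors) instead of
    calling BLKGETSIZE64, which requires opening the device.
    Returns 0 if `dev` is missing or not a block device.
    """
    try:
        st = os.stat(dev)
    except OSError:
        return 0
    if not stat.S_ISBLK(st.st_mode):
        return 0
    major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
    try:
        with open(f"/sys/dev/block/{major}:{minor}/size") as f:
            sectors = int(f.read().strip())
    except OSError:
        return 0
    return sectors * 512
```

Unlike the C patch, this sketch returns 0 instead of hitting a BUG_ON when the
sysfs file is unreadable, which is probably the friendlier behavior for a
query-only helper.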



Re: [PATCH v3 07/14] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option

2017-09-24 Thread Goffredo Baroncelli
On 09/24/2017 12:10 PM, Anand Jain wrote:
> 
> 
>> All my points are clear for this patchset:
>> I know I removed one function, and my reason is:
>> 1) No or little usage
>>     And it's counterintuitive.
>> 2) Dead code (not tested nor well documented)
>> 3) Possible workaround
>>
>> I can add several extra reasons as I stated before, but number of reasons 
>> won't validate anything anyway.
> 
>  End user convenience wins over the developer's technical difficulties.

Sorry, but I have to agree with Qu; there is not a big demand (other than 
Austin) for this functionality; even you stated that "...at some point it 
may..."; until now the UI has been quite unfriendly (we have to use a big enough 
device, and then cut it by hand on the basis of the output of mkfs.btrfs)...

I fear that this is another feature which increases the complexity of btrfs (and 
the tools) with little or no usage

The work of Qu is a nice cleanup; I hope that this will be the direction 
BTRFS takes: removing "strange/unused" features and improving the 
reliability of the others.

BR
G.Baroncelli



> 
>  Pls don't remove this feature; it needs a fix such as #2 (above) to improve 
> on #1 (above) in your list.
> 
> Thanks, Anand
> -- 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] Btrfs: fix unexpected result when dio reading corrupted blocks

2017-09-18 Thread Goffredo Baroncelli
On 09/15/2017 11:06 PM, Liu Bo wrote:
> commit 4246a0b63bd8 ("block: add a bi_error field to struct bio")
> changed the logic of how dio read endio reports errors.
> 
> For single stripe dio read, %bio->bi_status reflects the error before
> verifying checksum, and now we're updating it when data block matches
> with its checksum, while in the mismatching case, %bio->bi_status is
> not updated to reflect that.
> 
> When some blocks in a file have been corrupted on disk, reading such a
> file ends up with
> 
> 1) checksum errors are reported in kernel log
> 2) read(2) returns successfully with some content being 0x01.
> 
> In order to fix it, we need to report its checksum mismatch error to
> the upper layer (dio layer in this case) as well.
> 
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> Reported-by: Goffredo Baroncelli <kreij...@inwind.it>
> cc: Goffredo Baroncelli <kreij...@inwind.it>

Tested-by: Goffredo Baroncelli <kreij...@inwind.it>

> ---
>  fs/btrfs/inode.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 24bcd5c..a46799e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8267,11 +8267,8 @@ static void btrfs_endio_direct_read(struct bio *bio)
>   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>   blk_status_t err = bio->bi_status;
>  
> - if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) {
> + if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
>   err = btrfs_subio_endio_read(inode, io_bio, err);
> - if (!err)
> - bio->bi_status = 0;
> - }
>  
>   unlock_extent(_I(inode)->io_tree, dip->logical_offset,
> dip->logical_offset + dip->bytes - 1);
> @@ -8279,7 +8276,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>  
>   kfree(dip);
>  
> - dio_bio->bi_status = bio->bi_status;
> + dio_bio->bi_status = err;
>   dio_end_io(dio_bio);
>  
>   if (io_bio->end_io)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH] btrfs-progs: load_chunk_info() wrong set search key 'type'

2017-09-17 Thread Goffredo Baroncelli
The function load_chunk_info() doesn't correctly initialize the
sk->min_type/max_type fields when it calls the TREE_SEARCH ioctl: these keys are
swapped. Moreover, this function assumes that all the items contained in the
BTRFS_CHUNK_TREE_OBJECTID tree are of type BTRFS_CHUNK_ITEM_KEY; however, some
items are of type DEV_ITEM. Add an if () to pick only the right ones.




[PATCH] load_chunk_info() : wrong search key 'type'

2017-09-17 Thread Goffredo Baroncelli
From: Goffredo Baroncelli <kreij...@inwind.it>

The function load_chunk_info() doesn't correctly initialize the
sk->min_type/max_type fields when it calls the TREE_SEARCH ioctl: these keys are
swapped. Moreover, this function assumes that all the items contained in the
BTRFS_CHUNK_TREE_OBJECTID tree are of type BTRFS_CHUNK_ITEM_KEY; however, some
items are of type DEV_ITEM. Add an if () to pick only the right ones.

Signed-off-by: G.Baroncelli <kreij...@inwind.it>
---
 cmds-fi-usage.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index 6c846c15..52d63524 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -147,8 +147,8 @@ static int load_chunk_info(int fd, struct chunk_info 
**info_ptr, int *info_count
 
sk->min_objectid = 0;
sk->max_objectid = (u64)-1;
-   sk->max_type = 0;
-   sk->min_type = (u8)-1;
+   sk->min_type = 0;
+   sk->max_type = (u8)-1;
sk->min_offset = 0;
sk->max_offset = (u64)-1;
sk->min_transid = 0;
@@ -174,20 +174,21 @@ static int load_chunk_info(int fd, struct chunk_info 
**info_ptr, int *info_count
off = 0;
for (i = 0; i < sk->nr_items; i++) {
struct btrfs_chunk *item;
+   u8 type;
sh = (struct btrfs_ioctl_search_header *)(args.buf +
  off);
-
+   type = btrfs_search_header_type(sh);
off += sizeof(*sh);
-   item = (struct btrfs_chunk *)(args.buf + off);
-
-   ret = add_info_to_list(info_ptr, info_count, item);
-   if (ret) {
-   *info_ptr = NULL;
-   return 1;
+   if (type == BTRFS_CHUNK_ITEM_KEY) {
+   item = (struct btrfs_chunk *)(args.buf + off);
+   ret = add_info_to_list(info_ptr, info_count, 
item);
+   if (ret) {
+   *info_ptr = NULL;
+   return 1;
+   }
}
 
off += btrfs_search_header_len(sh);
-
sk->min_objectid = btrfs_search_header_objectid(sh);
sk->min_type = btrfs_search_header_type(sh);
sk->min_offset = btrfs_search_header_offset(sh)+1;
-- 
2.14.1



Re: subvolume list as user?

2017-09-16 Thread Goffredo Baroncelli
On 09/16/2017 04:37 PM, Ulli Horlacher wrote:
> On Sat 2017-09-16 (14:39), Hans van Kranenburg wrote:
>> On 09/16/2017 11:45 AM, Ulli Horlacher wrote:
>>
>>> Every user can create subvolumes (and root cannot prevent it!).
>>> But how can a user list these subvolumes?
>>>
>>> tux@xerus:/test/tux: btrfs subvolume create test
>>> Create subvolume './test'
>>
>> From your other posts I don't quickly get if you actually do want to
>> have this possible, or accept that it's currently like that and try to
>> do damage control by having users also remove their things again.
> 
> I want both, country AND western :-)
> 
> Without joking:
> I just want to learn more about btrfs and I have various use cases to
> handle.
> 
> Different use cases need different actions.
> 
> 
>> Actually, if you don't want this I think it's quite easily to patch your
>> kernel with one or two lines of code to disallow it.
> 
> Kernel patching is a no-go for me. My boss does not allow it. No
> discussion possible on this topic.
> 
> 
>>> tux@xerus:/test/tux: btrfs subvolume list .
>>> ERROR: can't perform the search - Operation not permitted
>>
>> Yes, because the SEARCH ioctl only allows root to directly query any of
>> the filesystem metadata from kernel memory. subvolume list uses this
>> SEARCH ioctl to find its info.
> 
> This is the explanation of why it does not work, but it does not help me. I
> still have the problem: how can a user get a list of his subvolumes? He
> may have created them some time ago and forgotten. He now wants to have a
> list of them.


Far from perfect, but this could help

find / -inum 256  2>/dev/null

each subvolume root has inode number 256
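A slightly friendlier variant of the same tip (a sketch, not a complete
solution: it only finds subvolume roots the user is allowed to traverse):

```shell
# List btrfs subvolume roots reachable under $HOME as an unprivileged user.
# A subvolume root always has inode number 256; -xdev keeps find from
# crossing into other mounted filesystems, and errors from unreadable
# directories are discarded.
find "$HOME" -xdev -type d -inum 256 2>/dev/null
```

Note that a plain directory on another filesystem can also happen to have
inode 256, which is why -xdev (or a check that the path is on btrfs) matters.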

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] Btrfs: fix unexpected result when dio reading corrupted blocks

2017-09-16 Thread Goffredo Baroncelli
On 09/15/2017 11:06 PM, Liu Bo wrote:
> commit 4246a0b63bd8 ("block: add a bi_error field to struct bio")
> changed the logic of how dio read endio reports errors.
> 
> For single stripe dio read, %bio->bi_status reflects the error before
> verifying checksum, and now we're updating it when data block matches
> with its checksum, while in the mismatching case, %bio->bi_status is
> not updated to reflect that.
> 
> When some blocks in a file have been corrupted on disk, reading such a
> file ends up with
> 
> 1) checksum errors are reported in kernel log
> 2) read(2) returns successfully with some content being 0x01.
> 
> In order to fix it, we need to report its checksum mismatch error to
> the upper layer (dio layer in this case) as well.

I tested it, and now it works: even using O_DIRECT, -EIO is returned if the file 
is corrupted.

ghigo@venice:~/btrfs/crash-o-direct/t$ ls -li
total 16384
257 -rw-r--r-- 1 root root 16777216 Sep 15 20:51 abcd
ghigo@venice:~/btrfs/crash-o-direct/t$ date
Sat Sep 16 13:56:26 CEST 2017

ghigo@venice:~/btrfs/crash-o-direct/t$ cat abcd 
cat: abcd: Input/output error
ghigo@venice:~/btrfs/crash-o-direct/t$ dmesg -T | tail -1
[Sat Sep 16 13:56:29 2017] BTRFS warning (device sdd5): csum failed root 5 ino 
257 off 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1

ghigo@venice:~/btrfs/crash-o-direct/t$ dd if=abcd iflag=direct
dd: error reading 'abcd': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.000404156 s, 0.0 kB/s
ghigo@venice:~/btrfs/crash-o-direct/t$ dmesg -T | tail -1
[Sat Sep 16 13:56:41 2017] BTRFS warning (device sdd5): csum failed root 5 ino 
257 off 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1




> 
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> Reported-by: Goffredo Baroncelli <kreij...@inwind.it>
> cc: Goffredo Baroncelli <kreij...@inwind.it>
> ---
>  fs/btrfs/inode.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 24bcd5c..a46799e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8267,11 +8267,8 @@ static void btrfs_endio_direct_read(struct bio *bio)
>   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
>   blk_status_t err = bio->bi_status;
>  
> - if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) {
> + if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED)
>   err = btrfs_subio_endio_read(inode, io_bio, err);
> - if (!err)
> - bio->bi_status = 0;
> - }
>  
>   unlock_extent(_I(inode)->io_tree, dip->logical_offset,
> dip->logical_offset + dip->bytes - 1);
> @@ -8279,7 +8276,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>  
>   kfree(dip);
>  
> - dio_bio->bi_status = bio->bi_status;
> + dio_bio->bi_status = err;
>   dio_end_io(dio_bio);
>  
>   if (io_bio->end_io)
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: A user cannot remove his readonly snapshots?!

2017-09-16 Thread Goffredo Baroncelli
On 09/16/2017 01:22 AM, Kai Krakow wrote:
> Am Sat, 16 Sep 2017 00:02:01 +0200
> schrieb Ulli Horlacher <frams...@rus.uni-stuttgart.de>:
> 
>> On Fri 2017-09-15 (23:44), Ulli Horlacher wrote:
[...]
> 
> See "man mount" in section btrfs mount options: There is a mount option
> to allow normal user to delete snapshots. But this is said to has
> security implication I cannot currently tell. Maybe someone else knows.

"btrfs sub del" removes a subvolume regardless of its contents: it doesn't 
check the subvolume's files/directories or their permissions/ownership. 

This is different from an "rm -rf", which (e.g.) can't delete a directory owned 
by a different user that contains files

ghigo@venice:/tmp$ mkdir d
ghigo@venice:/tmp$ mkdir d/d
ghigo@venice:/tmp$ touch d/d/f
ghigo@venice:/tmp$ sudo chown nobody d/d
ghigo@venice:/tmp$ rm -rf d
rm: cannot remove 'd/d/f': Permission denied

In the past I proposed to allow an ordinary user to remove an *empty* subvolume 
with a simple rmdir (if he has the permissions). This would solve this kind of 
problem.

https://www.spinics.net/lists/linux-btrfs/msg06499.html

or to relax the check around "btrfs sub del", so a user can remove an _empty_ 
subvolume

https://www.spinics.net/lists/linux-btrfs/msg06522.html

> 
> 
BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

