Re: Copy on write of unmodified data

2016-05-25 Thread H. Peter Anvin
On 05/25/16 02:29, Hugo Mills wrote:
> On Wed, May 25, 2016 at 01:58:15AM -0700, H. Peter Anvin wrote:
>> Hi,
>>
>> I'm looking at using a btrfs with snapshots to implement a generational
>> backup capacity.  However, doing it the naïve way would have the side
>> effect that for a file that has been partially modified, after
>> snapshotting the file would be written with *mostly* the same data.  How
>> does btrfs' COW algorithm deal with that?  If necessary I might want to
>> write some smarter user space utilities for this.
> 
>    Sounds like it might be a job for one of the dedup tools
> (duperemove, bedup), or, if you're writing your own, the safe
> deduplication ioctl which underlies those tools.
> 

I guess I would prefer if data wasn't first duplicated and then
deduplicated if possible.  It sounds like I ought to write a "smart
copy-overwrite" tool for this.

-hpa
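
A minimal sketch of what such a "smart copy-overwrite" tool could look
like (an illustration under assumptions, not an existing utility; the 64K
granularity and all names are arbitrary): compare the incoming data with
the file already in place and pwrite() only the blocks that differ, so
unchanged extents stay shared with the snapshot instead of being COWed.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLK 65536	/* arbitrary comparison granularity */

/* Copy src over dst, skipping writes for blocks that are already
 * identical; untouched extents remain shared with any snapshot. */
static int smart_overwrite(const char *src_path, const char *dst_path)
{
	char a[BLK], b[BLK];
	int src = open(src_path, O_RDONLY);
	int dst = open(dst_path, O_RDWR | O_CREAT, 0644);
	off_t off = 0;
	ssize_t n;

	if (src < 0 || dst < 0)
		return -1;
	while ((n = pread(src, a, BLK, off)) > 0) {
		ssize_t m = pread(dst, b, BLK, off);
		if (m != n || memcmp(a, b, n) != 0)	/* block changed */
			if (pwrite(dst, a, n, off) != n)
				return -1;
		off += n;
	}
	if (n < 0 || ftruncate(dst, off) < 0)	/* drop any stale tail */
		return -1;
	close(src);
	close(dst);
	return 0;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s src dst\n", argv[0]);
		return 1;
	}
	return smart_overwrite(argv[1], argv[2]) ? 1 : 0;
}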





Re: [RFC 00/32] making inode time stamps y2038 ready

2014-06-04 Thread H. Peter Anvin
On 06/04/2014 12:24 PM, Arnd Bergmann wrote:
 
> For other timekeeping stuff in the kernel, I agree that using some
> 64-bit representation (nanoseconds, 32/32 unsigned seconds/nanoseconds,
> ...) has advantages; that's exactly the point I was making earlier
> against simply extending the internal time_t/timespec to 64-bit
> seconds for everything.
 

How much of a performance issue is it to make time_t 64 bits, and for
the bits of overhead there are, how hard are they to fix?

-hpa




Re: [RFC 00/32] making inode time stamps y2038 ready

2014-06-02 Thread H. Peter Anvin
On 06/02/2014 12:19 PM, Arnd Bergmann wrote:
> On Monday 02 June 2014 13:52:19 Joseph S. Myers wrote:
>> On Fri, 30 May 2014, Arnd Bergmann wrote:
>>
>>> a) is this the right approach in general? The previous discussion
>>>    pointed this way, but there may be other opinions.
>>
>> The syscall changes seem like the sort of thing I'd expect, although
>> patches adding new syscalls or otherwise affecting the kernel/userspace
>> interface (as opposed to those relating to an individual filesystem)
>> should go to linux-api as well as other relevant lists.
>
> Ok. Sorry about missing linux-api, I confused it with linux-arch, which
> may not be as relevant here, except for the one question whether we
> actually want to have the new ABI on all 32-bit architectures or only
> as an opt-in for those that expect to stay around for another 24 years.
>
> Two more questions for you:
>
> - are you (and others) happy with adding this type of stat syscall
>   (fstatat64/fstat64) as opposed to the more generic xstat that has
>   been discussed in the past and that never made it through the bike-
>   shedding discussion?
>
> - once we have enough buy-in from reviewers to merge this initial
>   series, should we proceed to define the rest of the syscall ABI
>   (minus driver ioctls) so glibc and kernel can do the conversion
>   on top of that, or should we better try to do things one syscall
>   family at a time and actually get the kernel to handle them
>   correctly internally?

The bit that is really going to hurt is every single ioctl that uses a
timespec.

Honestly, though, I really don't understand the point with struct
inode_time.  It seems like the zeroeth-order thing is to change the
kernel internal version of struct timespec to have a 64-bit time... it
isn't just about inodes.  We then should be explicit about the external
uses of time, and use accessors.

-hpa
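
A sketch of that zeroth-order change (hypothetical code; the
time64_t/timespec64 names here are illustrative): widen the seconds
field internally, and confine any narrowing to explicit accessors so the
y2038-lossy spots are auditable.

#include <stdio.h>
#include <time.h>

typedef long long time64_t;	/* 64-bit seconds for internal use */

struct timespec64 {
	time64_t tv_sec;
	long     tv_nsec;
};

/* widening at the boundary is always safe */
static inline struct timespec64 ts_to_ts64(struct timespec ts)
{
	struct timespec64 r = { ts.tv_sec, ts.tv_nsec };
	return r;
}

/* narrowing is lossy where time_t is 32 bits: this accessor is the one
 * explicit place where truncation can happen */
static inline struct timespec ts64_to_ts(struct timespec64 ts)
{
	struct timespec r = { (time_t)ts.tv_sec, ts.tv_nsec };
	return r;
}

int main(void)
{
	struct timespec64 t = { (time64_t)1 << 35, 0 };	/* well past 2038 */
	printf("%lld\n", (long long)t.tv_sec);
	return 0;
}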




Re: [RFC 00/32] making inode time stamps y2038 ready

2014-06-02 Thread H. Peter Anvin
On 06/02/2014 12:55 PM, Arnd Bergmann wrote:

>> The bit that is really going to hurt is every single ioctl that uses a
>> timespec.
>>
>> Honestly, though, I really don't understand the point with struct
>> inode_time.  It seems like the zeroeth-order thing is to change the
>> kernel internal version of struct timespec to have a 64-bit time... it
>> isn't just about inodes.  We then should be explicit about the external
>> uses of time, and use accessors.
>
> I picked these because they are fairly isolated from all other uses,
> in particular since inode times are the only things where we really
> care about times in the distant past or future (decades away as opposed
> to things that happened between boot and shutdown).

If nothing else, I would expect to be able to set the system time to
weird values for testing.  So I'm not so sure I agree with that...

> For other kernel-internal uses, we may be better off migrating to
> a completely different representation, such as nanoseconds since
> boot or the architecture specific ktime_t, but this is really something
> to decide for each subsystem.

Having a bunch of different time representations in the kernel seems
like a real headache...

-hpa




Re: [RFC 00/32] making inode time stamps y2038 ready

2014-05-31 Thread H. Peter Anvin
Typically they are using 64-bit signed seconds.

On May 31, 2014 11:22:37 AM PDT, Richard Cochran richardcoch...@gmail.com 
wrote:
> On Sat, May 31, 2014 at 05:23:02PM +0200, Arnd Bergmann wrote:
>>
>> It's an approximation:
>
> (Approximately never ;)
>
>> with 64-bit timestamps, you can represent close to 300 billion
>> years, which is way past the time that our planet can sustain
>> life of any form[1].
>
> Did you mean 64 bits worth of seconds?
>
>   2^64 / (3600*24*365) = 584,942,417,355
>
> That is more than 300 billion years, and still, it is not quite the
> same as never.
>
> In any case, that term is not too helpful in the comparison table,
> IMHO. One could think that some sort of clever running count relative
> to the last mount time was implied.
>
> Thanks,
> Richard
>
> [1] You are forgetting the immortal robotic overlords.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


Re: Formalizing the use of Boot Area B

2014-05-20 Thread H. Peter Anvin
On 05/14/2014 05:01 PM, H. Peter Anvin wrote:
> It turns out that the primary 64K Boot Area A is too small for some
> applications and/or some architectures.
>
> When I discussed this with Chris Mason, he pointed out that the area
> beyond the superblock is also unused, up until at least the megabyte
> point (from my reading of the mkfs code, it is actually slightly more
> than a megabyte.)
>
> This is present in all versions of mkfs.btrfs that have the superblock at
> 64K (some very early ones had the superblock at 16K, but that format is
> no longer supported), so all that is needed is formalizing the specs as
> to the use of this area.
>
> My suggestion is that 64-128K is reserved for extension of the
> superblock and/or any other filesystem uses, and 128-1024K is defined as
> Boot Area B.  However, if there may be reason to reserve more, then we
> should do that.  Hence requesting a formal decision as to the extent and
> ownership of this area.
>
>   -hpa
>

Ping on this?  If I don't hear back on this I will probably just go
ahead and use 128K-1024K.

-hpa




Re: Formalizing the use of Boot Area B

2014-05-20 Thread H. Peter Anvin
On 05/20/2014 04:37 PM, Chris Mason wrote:
> On 05/20/2014 07:29 PM, H. Peter Anvin wrote:
>> On 05/14/2014 05:01 PM, H. Peter Anvin wrote:
>>> It turns out that the primary 64K Boot Area A is too small for some
>>> applications and/or some architectures.
>>>
>>> When I discussed this with Chris Mason, he pointed out that the area
>>> beyond the superblock is also unused, up until at least the megabyte
>>> point (from my reading of the mkfs code, it is actually slightly more
>>> than a megabyte.)
>>>
>>> This is present in all versions of mkfs.btrfs that have the superblock at
>>> 64K (some very early ones had the superblock at 16K, but that format is
>>> no longer supported), so all that is needed is formalizing the specs as
>>> to the use of this area.
>>>
>>> My suggestion is that 64-128K is reserved for extension of the
>>> superblock and/or any other filesystem uses, and 128-1024K is defined as
>>> Boot Area B.  However, if there may be reason to reserve more, then we
>>> should do that.  Hence requesting a formal decision as to the extent and
>>> ownership of this area.
>>>
>>>   -hpa
>>
>> Ping on this?  If I don't hear back on this I will probably just go
>> ahead and use 128K-1024K.
>
> Hi Peter,
>
> We do leave the first 1MB of each device alone.  Can we do 256K-1024K
> for the boot loader?  We don't have an immediate need for the extra
> space, but I'd like to reserve a little more than the extra 64KB.
>
 

Works for me.  So 64-256K (192K) is reserved for the file system, and
Boot Area B is 256-1024K (768K).

-hpa
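
Restated as constants for reference (a sketch only -- these names are
illustrative, not from any btrfs header; offsets are bytes from the
start of each device):

enum {
	BOOT_AREA_A_START = 0,		/* the original 64K boot area     */
	BOOT_AREA_A_END   = 64 * 1024,	/* btrfs superblock sits at 64K   */
	FS_RESERVED_START = 64 * 1024,	/* 64-256K (192K): superblock     */
	FS_RESERVED_END   = 256 * 1024,	/* extension / filesystem use     */
	BOOT_AREA_B_START = 256 * 1024,	/* 256-1024K (768K): Boot Area B  */
	BOOT_AREA_B_END   = 1024 * 1024,
};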




Re: Formalizing the use of Boot Area B

2014-05-20 Thread H. Peter Anvin
On 05/20/2014 04:37 PM, Chris Mason wrote:
 
> Hi Peter,
>
> We do leave the first 1MB of each device alone.  Can we do 256K-1024K
> for the boot loader?  We don't have an immediate need for the extra
> space, but I'd like to reserve a little more than the extra 64KB.
 

Incidentally, the current version of mkfs.btrfs actually leaves not
1 MB (1024K) but rather 1104K (64K+16K+1024K).  Not sure if that is
intentional.

-hpa



Formalizing the use of Boot Area B

2014-05-14 Thread H. Peter Anvin
It turns out that the primary 64K Boot Area A is too small for some
applications and/or some architectures.

When I discussed this with Chris Mason, he pointed out that the area
beyond the superblock is also unused, up until at least the megabyte
point (from my reading of the mkfs code, it is actually slightly more
than a megabyte.)

This is present in all versions of mkfs.btrfs that have the superblock at
64K (some very early ones had the superblock at 16K, but that format is
no longer supported), so all that is needed is formalizing the specs as
to the use of this area.

My suggestion is that 64-128K is reserved for extension of the
superblock and/or any other filesystem uses, and 128-1024K is defined as
Boot Area B.  However, if there may be reason to reserve more, then we
should do that.  Hence requesting a formal decision as to the extent and
ownership of this area.

-hpa



Re: Triple parity and beyond

2013-11-21 Thread H. Peter Anvin
On 11/21/2013 04:30 PM, Stan Hoeppner wrote:
 
> The rebuild time of a parity array normally has little to do with CPU
> overhead.

Unless you have to fall back to table driven code.

Anyway, this looks like a great concept.  Now we just need to implement
it ;)

-hpa



Re: Triple parity and beyond

2013-11-20 Thread H. Peter Anvin
It is also possible to quickly multiply by 2^-1 which makes for an interesting 
R parity.

Andrea Mazzoleni amadva...@gmail.com wrote:
> Hi David,
>
>>> The choice of ZFS to use powers of 4 was likely not optimal,
>>> because to multiply by 4, it has to do two multiplications by 2.
>> I can agree with that.  I didn't copy ZFS's choice here
> David, it was not my intention to suggest that you copied from ZFS.
> Sorry to have expressed myself badly. I just mentioned ZFS because it's
> an implementation that I know uses powers of 4 to generate triple
> parity, and I saw in the code that it's implemented with two
> multiplications by 2.
>
> Ciao,
> Andrea

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


Re: Triple parity and beyond

2013-11-20 Thread H. Peter Anvin
On 11/20/2013 10:56 AM, Andrea Mazzoleni wrote:
> Hi,
>
> Yep. At present to multiply by 2^-1 I'm using in C:
>
> static inline uint64_t d2_64(uint64_t v)
> {
>     uint64_t mask = v & 0x0101010101010101U;
>     mask = (mask << 8) - mask;
>     v = (v >> 1) & 0x7f7f7f7f7f7f7f7fU;
>     v ^= mask & 0x8e8e8e8e8e8e8e8eU;
>     return v;
> }
>
> and for SSE2:
>
> asm volatile("movdqa %xmm2,%xmm4");
> asm volatile("pxor %xmm5,%xmm5");
> asm volatile("psllw $7,%xmm4");
> asm volatile("psrlw $1,%xmm2");
> asm volatile("pcmpgtb %xmm4,%xmm5");
> asm volatile("pand %xmm6,%xmm2"); /* with xmm6 == 7f7f7f7f7f7f... */
> asm volatile("pand %xmm3,%xmm5"); /* with xmm3 == 8e8e8e8e8e... */
> asm volatile("pxor %xmm5,%xmm2");
>
> where xmm2 is the input/output
>

Now, that doesn't sound like something that can get neatly meshed into
the Cauchy matrix scheme, I assume.  It is somewhat nice to have a
scheme which is arbitrarily expandable without having to fall back to
dual parity during the restripe operation.  It probably also reduces the
amount of code necessary.

-hpa
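
The scalar routine is easy to sanity-check against the familiar
multiply-by-2.  A small self-test sketch (the x2_64 companion and the
harness are added here in the same packed-byte style; only d2_64 is from
the posting above -- 0x8e is 0x11d >> 1, and 0x1d its low byte):

#include <stdint.h>
#include <stdio.h>

/* multiply 8 packed GF(2^8) bytes by 2^-1, as posted above */
static inline uint64_t d2_64(uint64_t v)
{
	uint64_t mask = v & 0x0101010101010101U;
	mask = (mask << 8) - mask;		/* expand bit 0 to 0xff */
	v = (v >> 1) & 0x7f7f7f7f7f7f7f7fU;
	v ^= mask & 0x8e8e8e8e8e8e8e8eU;
	return v;
}

/* the multiply-by-2 counterpart, same trick on bit 7 */
static inline uint64_t x2_64(uint64_t v)
{
	uint64_t mask = v & 0x8080808080808080U;
	mask = (mask << 1) - (mask >> 7);	/* expand bit 7 to 0xff */
	v = (v << 1) & 0xfefefefefefefefeU;
	v ^= mask & 0x1d1d1d1d1d1d1d1dU;
	return v;
}

int main(void)
{
	uint64_t v = 0x0123456789abcdefULL;
	int i;

	for (i = 0; i < 1000; i++) {
		if (x2_64(d2_64(v)) != v || d2_64(x2_64(v)) != v) {
			printf("mismatch at %016llx\n", (unsigned long long)v);
			return 1;
		}
		v = v * 6364136223846793005ULL + 1442695040888963407ULL;
	}
	printf("x2/d2 round-trip OK\n");
	return 0;
}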




Re: Triple parity and beyond

2013-11-20 Thread H. Peter Anvin
On 11/20/2013 11:05 AM, Andrea Mazzoleni wrote:
 
> For the first row with j=0, I use xi = 2^-i and y0 = 0, that results in:
 

How can xi = 2^-i if x is supposed to be constant?

That doesn't mean that your approach isn't valid, of course, but it
might not be a Cauchy matrix and thus needs additional analysis.

> row j=0 -> 1/(xi+y0) = 1/(2^-i + 0) = 2^i (RAID-6 coefficients)
>
> For the next rows with j>0, I use yj = 2^j, resulting in:
>
> rows j>0 -> 1/(xi+yj) = 1/(2^-i + 2^j)

Even more so here... 2^-i and 2^j don't seem to be of the form xi and yj
respectively.

-hpa



Re: Triple parity and beyond

2013-11-20 Thread H. Peter Anvin
On 11/20/2013 01:04 PM, Andrea Mazzoleni wrote:
> Hi Peter,
>
>>> static inline uint64_t d2_64(uint64_t v)
>>> {
>>>     uint64_t mask = v & 0x0101010101010101U;
>>>     mask = (mask << 8) - mask;
>>
>> (mask << 7) I assume...
> No. It's (mask << 8) - mask. We want to expand the bit at position 0
> (in each byte) to the full byte, resulting in 0xFF if the bit is at 1,
> and 0x00 if the bit is 0.
>
> (0 << 8) - 0 = 0x00
> (1 << 8) - 1 = 0x100 - 1 = 0xFF
 

Oh, right... it is the same as (v << 1) - (v >> 7) except everything is
shifted over one.

-hpa




Re: Triple parity and beyond

2013-11-20 Thread H. Peter Anvin
On 11/20/2013 12:30 PM, James Plank wrote:
> Peter, I think I understand it differently.  Concrete example in GF(256) for
> k=6, m=4:
>
> First, create a 3 by 6 Cauchy matrix, using x_i = 2^-i, and y_i = 0 for i=0,
> and y_i = 2^i for other i.  In this case, x = { 1, 142, 71, 173, 216, 108 }
> and y = { 0, 2, 4 }.  The Cauchy matrix is:

Sorry, I took xi and yj to mean a constant x multiplied with i and a
constant y multiplied with j, rather than x_i and y_j.

-hpa
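
To make the construction concrete, a small generator sketch (an
illustration, not SnapRAID's actual mktables.c): it prints the raw
Cauchy entries 1/(x_i + y_j) over GF(2^8) with x_i = 2^-i, y_0 = 0 and
y_j = 2^j.  The j=0 row comes out as the RAID-6 powers of 2; the real
code additionally prepends the all-ones RAID-5 row and rescales the
remaining rows, so those will not match the posted hex table verbatim.

#include <stdint.h>
#include <stdio.h>

/* GF(2^8) multiply, polynomial 0x11d (same field as Linux RAID-6) */
static uint8_t gfmul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;
	while (b) {
		if (b & 1)
			r ^= a;
		a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);
		b >>= 1;
	}
	return r;
}

static uint8_t gfinv(uint8_t a)	/* brute force; fine for table generation */
{
	int i;
	for (i = 1; i < 256; i++)
		if (gfmul(a, (uint8_t)i) == 1)
			return (uint8_t)i;
	return 0;
}

static uint8_t gfpow2(int e)	/* 2^e */
{
	uint8_t v = 1;
	while (e--)
		v = gfmul(v, 2);
	return v;
}

int main(void)
{
	int nparity = 5, ndata = 10, i, j;

	for (j = 0; j < nparity; j++) {
		for (i = 0; i < ndata; i++) {
			uint8_t xi = gfinv(gfpow2(i));	/* x_i = 2^-i     */
			uint8_t yj = j ? gfpow2(j) : 0;	/* y_0=0, y_j=2^j */
			printf("%02x ", gfinv(xi ^ yj));
		}
		printf("\n");
	}
	return 0;
}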



Re: Triple parity and beyond

2013-11-18 Thread H. Peter Anvin
On 11/18/2013 02:08 PM, Andrea Mazzoleni wrote:
> Hi,
>
> I want to report that I recently implemented support for an
> arbitrary number of parities that could be useful also for Linux
> RAID and Btrfs, both currently limited to double parity.
>
> In short, to generate the parity I use a Cauchy matrix specifically
> built to be compatible with the existing Linux parity computation,
> and extensible to an arbitrary number of parities. This without
> limitations on the number of data disks.
>
> The Cauchy matrix for six parities is:
>
> 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01...
> 01 02 04 08 10 20 40 80 1d 3a 74 e8 cd 87 13 26 4c 98 2d 5a b4 75...
> 01 f5 d2 c4 9a 71 f1 7f fc 87 c1 c6 19 2f 40 55 3d ba 53 04 9c 61...
> 01 bb a6 d7 c7 07 ce 82 4a 2f a5 9b b6 60 f1 ad e7 f4 06 d2 df 2e...
> 01 97 7f 9c 7c 18 bd a2 58 1a da 74 70 a3 e5 47 29 07 f5 80 23 e9...
> 01 2b 3f cf 73 2c d6 ed cb 74 15 78 8a c1 17 c9 89 68 21 ab 76 3b...
>
> You can easily recognize the first row as RAID5 based on a simple
> XOR, and the second row as RAID6 based on multiplications by powers
> of 2. The other rows are for additional parity levels and they
> require multiplications by arbitrary values that can be implemented
> using the PSHUFB instruction.

Hello,

This looks very interesting indeed.  Could you perhaps describe how the
Cauchy matrix is derived, and under what conditions it would become
singular?

-hpa
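
The PSHUFB technique Andrea names is worth illustrating.  A hedged
sketch (an added example, not SnapRAID's code) of multiplying 16 GF(2^8)
bytes by a constant with two nibble-table shuffles; it works because GF
multiplication is linear over XOR, so c*v = c*(hi<<4) ^ c*lo.  Real code
would precompute the two 16-byte tables per coefficient.  Compile with
-mssse3.

#include <stdint.h>
#include <stdio.h>
#include <tmmintrin.h>	/* SSSE3 _mm_shuffle_epi8 == PSHUFB */

/* reference GF(2^8) multiply, polynomial 0x11d; used to build tables */
static uint8_t gfmul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;
	while (b) {
		if (b & 1)
			r ^= a;
		a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);
		b >>= 1;
	}
	return r;
}

/* multiply 16 bytes by the constant c via two PSHUFB lookups */
static __m128i gf_mulc_pshufb(__m128i v, uint8_t c)
{
	uint8_t lo_tbl[16], hi_tbl[16];
	int i;

	for (i = 0; i < 16; i++) {
		lo_tbl[i] = gfmul(c, (uint8_t)i);	 /* c * low nibble  */
		hi_tbl[i] = gfmul(c, (uint8_t)(i << 4)); /* c * high nibble */
	}
	__m128i lot = _mm_loadu_si128((const __m128i *)lo_tbl);
	__m128i hit = _mm_loadu_si128((const __m128i *)hi_tbl);
	__m128i msk = _mm_set1_epi8(0x0f);
	__m128i lo  = _mm_and_si128(v, msk);
	__m128i hi  = _mm_and_si128(_mm_srli_epi64(v, 4), msk);
	return _mm_xor_si128(_mm_shuffle_epi8(lot, lo),
			     _mm_shuffle_epi8(hit, hi));
}

int main(void)
{
	uint8_t in[16], out[16];
	int i;

	for (i = 0; i < 16; i++)
		in[i] = (uint8_t)(i * 17 + 3);
	_mm_storeu_si128((__m128i *)out,
			 gf_mulc_pshufb(_mm_loadu_si128((const __m128i *)in),
					0xf5));
	for (i = 0; i < 16; i++)
		if (out[i] != gfmul(in[i], 0xf5))
			return printf("mismatch\n"), 1;
	printf("PSHUFB multiply OK\n");
	return 0;
}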




Re: Triple parity and beyond

2013-11-18 Thread H. Peter Anvin
On 11/18/2013 02:35 PM, Andrea Mazzoleni wrote:
> Hi Peter,
>
> The Cauchy matrix has the mathematical property that the matrix itself
> and all its submatrices are nonsingular. So, we are sure that we can
> always solve the equations to recover the data disks.
>
> Besides the mathematical proof, I've also inverted all the
> 377,342,351,231 possible submatrices for up to 6 parities and 251 data
> disks, and got an experimental confirmation of this.
>

Nice.


> The only limit is coming from the GF(2^8). You have a maximum number
> of disks = 2^8 + 1 - number_of_parities. For example, with 6 parities,
> you can have no more than 251 data disks. Over this limit it's not
> possible to build a Cauchy matrix.

251?  Not 255?

> Note that instead with a Vandermonde matrix you don't have the
> guarantee that all the submatrices are nonsingular. This is the
> reason why, using power coefficients, sooner or later you end up
> with unsolvable equations.
>
> You can find the code that generates the Cauchy matrix, with some
> explanation in the comments, at (see the set_cauchy() function):
>
> http://sourceforge.net/p/snapraid/code/ci/master/tree/mktables.c

OK, need to read up on the theoretical aspects of this, but it sounds
promising.

-hpa




Re: experimental raid5/6 code in git

2013-02-04 Thread H. Peter Anvin
@@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root, char
*device_path)
 	}
 	btrfs_dev_replace_unlock(&root->fs_info->dev_replace);

+	if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
+			  BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
+		printk(KERN_ERR "btrfs: unable to go below three devices "
+		       "on raid5 or raid6\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
 		printk(KERN_ERR "btrfs: unable to go below four devices "
 		       "on raid10\n");
@@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root, char
*device_path)
 		goto out;
 	}

+	if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
+	    root->fs_info->fs_devices->rw_devices <= 2) {
+		printk(KERN_ERR "btrfs: unable to go below two "
+		       "devices on raid5\n");
+		ret = -EINVAL;
+		goto out;
+	}
+	if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
+	    root->fs_info->fs_devices->rw_devices <= 3) {
+		printk(KERN_ERR "btrfs: unable to go below three "
+		       "devices on raid6\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if (strcmp(device_path, "missing") == 0) {
 		struct list_head *devices;
 		struct btrfs_device *tmp;

This seems inconsistent?

-hpa



Re: experimental raid5/6 code in git

2013-02-04 Thread H. Peter Anvin
Also, a 2-member raid5 or a 3-member raid6 is effectively a raid1 and can be
treated as such.

Chris Mason chris.ma...@fusionio.com wrote:

> On Mon, Feb 04, 2013 at 02:42:24PM -0700, H. Peter Anvin wrote:
>> @@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root, char
>> *device_path)
>>  	}
>>  	btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
>>
>> +	if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
>> +			  BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
>> +		printk(KERN_ERR "btrfs: unable to go below three devices "
>> +		       "on raid5 or raid6\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>>  	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
>>  		printk(KERN_ERR "btrfs: unable to go below four devices "
>>  		       "on raid10\n");
>> @@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root, char
>> *device_path)
>>  		goto out;
>>  	}
>>
>> +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
>> +	    root->fs_info->fs_devices->rw_devices <= 2) {
>> +		printk(KERN_ERR "btrfs: unable to go below two "
>> +		       "devices on raid5\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
>> +	    root->fs_info->fs_devices->rw_devices <= 3) {
>> +		printk(KERN_ERR "btrfs: unable to go below three "
>> +		       "devices on raid6\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>>  	if (strcmp(device_path, "missing") == 0) {
>>  		struct list_head *devices;
>>  		struct btrfs_device *tmp;
>>
>> This seems inconsistent?
>
> Whoops, missed that one.  Thanks!
>
> -chris

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


Re: Feature request: true RAID-1 mode

2012-07-02 Thread H. Peter Anvin

On 06/25/2012 04:00 PM, H. Peter Anvin wrote:


> I am aware of that, and it is not a problem... the one-device
> bootloader can find out *which* disk it is talking to by comparing
> uuids, and the btrfs data structures will tell it how to find the data
> on that specific disk.  It does of course mean the bootloader needs to
> be aware of the multidisk nature of btrfs, but that isn't a problem in
> itself.



So, also, let me address the question why we should care about a
one-device bootloader.  It is quite common, especially in fileservers,
for a subset of the boot devices to be inaccessible by the firmware, due
to bugs, boot time concerns (spinning up all the media in the firmware
is SLOW) or just plain lack of support for plug-in cards.  As such, the
reliable thing to do is to make sure that any disk being seen is enough
to bring up the system; since this is such a small amount of data by
modern standards, there is just no reason to do anything less robust.


Once the kernel comes up it has all the device drivers, of course.

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.





Re: Feature request: true RAID-1 mode

2012-06-25 Thread H. Peter Anvin
On 06/25/2012 08:21 AM, Chris Mason wrote:
> Yes and no.  If you have 2 drives and you add one more, we can make it
> do all new chunks over 3 drives.  But, turning the existing double
> mirror chunks into a triple mirror requires a balance.
>
> -chris

So trigger one.  This is the exact analogue to the resync pass that is
required in classic RAID after adding new media.

-hpa



Re: Feature request: true RAID-1 mode

2012-06-25 Thread H. Peter Anvin
On 06/25/2012 03:28 PM, Gareth Pye wrote:
> To me one doesn't have to be triggered, a user expects to have to tell
> the disks to rebuild/resync/balance after adding a disk, they may want
> to wait till they've added all 4 disks and run a few extra commands
> before they run the rebalance.

They do?  E.g. mdadm doesn't make them...

> What is important is having a mode that
> doesn't require the user to remember that what they had used as the
> closest analogue to RAID1 that BTRFS supports requires them to run
> another command to change the 'RAID level' to be the RAID1 analogue for
> the new number of disks.
>
> Users will forget that and they will lose data because of it. At least
> with an M=N mode BTRFS can say they tried to make it easy to avoid that
> pitfall.

Doesn't that contradict your previous statement?  In either case, I
agree with the latter...

-hpa


Re: Feature request: true RAID-1 mode

2012-06-25 Thread H. Peter Anvin
On 06/25/2012 03:54 PM, Hugo Mills wrote:
> On Mon, Jun 25, 2012 at 10:46:01AM -0700, H. Peter Anvin wrote:
>> On 06/25/2012 08:21 AM, Chris Mason wrote:
>>> Yes and no.  If you have 2 drives and you add one more, we can
>>> make it do all new chunks over 3 drives.  But, turning the
>>> existing double mirror chunks into a triple mirror requires a
>>> balance.
>>>
>>> -chris
>>
>> So trigger one.  This is the exact analogue to the resync pass
>> that is required in classic RAID after adding new media.
>
> You'd have to cancel and restart if a second new disk was added
> while the first balance was ongoing. Fortunately, this isn't a
> problem these days.
>
> Also, it occurs to me that I should just check -- are you aware
> that the btrfs implementation of RAID-1 makes no guarantees about
> the location of any given piece of data? i.e. if I have a piece of
> data stored at block X on disk 1, it's not guaranteed to be stored
> at block X on disks 2, 3, 4, ... I'm not sure if this is important
> to you, but it's a significant difference between the btrfs
> implementation of RAID-1 and the MD implementation.

I am aware of that, and it is not a problem... the one-device
bootloader can find out *which* disk it is talking to by comparing
uuids, and the btrfs data structures will tell it how to find the data
on that specific disk.  It does of course mean the bootloader needs to
be aware of the multidisk nature of btrfs, but that isn't a problem in
itself.

-hpa



Re: R: Re: Subvolumes and /proc/self/mountinfo

2012-06-21 Thread H. Peter Anvin
On 06/20/2012 10:47 PM, Goffredo Baroncelli wrote:
 
> This leads to having a separate /boot filesystem. In this case I agree
> with you: it makes sense that the kernel is near the bootloader files.
>
> But if /boot has to be in a separate filesystem, what is the point of
> supporting btrfs at all? Does it make sense to support only a subset of
> btrfs features?
 

Yes, and that's another good reason for /boot: btrfs supports that kind
of policy (e.g. no compression or encryption in this subtree.)


>>> Now we have the possibility to move the kernel near the modules, and
>>> this could lead to some interesting possibilities: think about different
>>> linux installations, each with its own kernel version and own modules
>>> version; what are the reasons to put together under /boot different
>>> kernels with potentially conflicting names? de facto standard?
>>> historical reasons? Nothing wrong here; but also the idea of moving the
>>> kernel under /lib/modules is not so wrong.
>>
>> No, it is completely, totally and very very seriously wrong.
>
> When a bootloader (and the BIOSes) will be able to address whole
> disks, this will change.  Not now.
 

People have said that for 15 years.  The reality is that firmware will
always be behind the curve, and *that's ok*, we just need to deal with it.

-hpa



Re: R: Re: Subvolumes and /proc/self/mountinfo

2012-06-21 Thread H. Peter Anvin
On 06/21/2012 10:05 AM, Goffredo Baroncelli wrote:
> On 06/21/2012 03:38 PM, H. Peter Anvin wrote:
>>> But if /boot has to be in a separate filesystem, what is the point of
>>> supporting btrfs at all? Does it make sense to support only a subset of
>>> btrfs features?
>>
>> Yes, and that's another good reason for /boot: btrfs supports that kind
>> of policy (e.g. no compression or encryption in this subtree.)
>
> But what about large disks? Is Syslinux able to handle large disks? Or
> does it use BIOS interrupts?

It uses the firmware... how well the firmware can handle large disks is
another matter.
-hpa



-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Subvolumes and /proc/self/mountinfo

2012-06-20 Thread H. Peter Anvin
On 06/20/2012 06:34 AM, Chris Mason wrote:

>> I want an algorithm, it doesn't have an API per se.  I would really like
>> to avoid relying on blkid and udev for this, though... that is pretty
>> much a nonstarter.
>>
>> If the answer is to walk the tree then I'm fine with that.
>
> Ok, fair enough.
>
> Right, the subvolume number doesn't change over the life of the subvol,
> regardless of the path that was used for mounting the subvol.  So all
> you need is that number (64 bits) and the filename relative to the
> subvol root and you're set.
>
> We'll have to add an ioctl for that.
>
> Finding the path relative to the subvol is easy, just walk backwards up
> the directory chain (cd ..) until you get to inode number 256.  All the
> subvol roots have inode number 256.
 

... assuming I can actually see the root (that it is not obscured
because of bindmounts and so on).  Yes, I know I'm weird for worrying
about these hyper-obscure corner cases, but I have a pretty explicit
goal of trying to write Syslinux so it will very rarely if ever do the
wrong thing, and I can already get that information for other
filesystems.

The other thing, of course, is what is the desirable behavior,which I
have brought up in a few posts already.  Specifically, I see two
possibilities:

a. Always handle a path from the global root, and treat subvolumes as
   directories.  This would mostly require that the behavior of
   /proc/self/mountinfo with regards to mount -o subvolid= would need
   to be fixed.  I also have no idea how one would deal with a detached
   subvolume, or if that subcase even matters.

   A major problem with this is that it may be *very* confusing to a
   user to have to specify a path in their bootloader configuration as
   "/subvolume/foo/bar" when the string "subvolume" doesn't show up in
   any way in their normal filesystem.

b. Treat the subvolume as the root (which is what I so far have been
   assuming.)  In this case, I think the (subvolume ID,
   path_in_subvolume) ioctl is the way to go, unless there is a way
   to do this with BTRFS_IOC_TREE_SEARCH already.

I think I'm leaning, still, toward b just because of the very high
potential for user confusion with a.
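
For option b, the walk Chris described earlier (cd .. until inode 256)
is simple enough to sketch.  This is an added illustration, assuming an
absolute, symlink-free directory path and none of the bind-mount/chroot
hardening worried about above:

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#define SUBVOL_ROOT_INO 256	/* every btrfs subvolume root is inode 256 */

static int subvol_relative_path(const char *start, char *out, size_t outlen)
{
	char dir[PATH_MAX], rel[PATH_MAX] = "", tmp[PATH_MAX];
	struct stat st;
	char *slash;

	snprintf(dir, sizeof(dir), "%s", start);
	for (;;) {
		if (stat(*dir ? dir : "/", &st) < 0)
			return -1;
		if (st.st_ino == SUBVOL_ROOT_INO || !*dir)
			break;
		slash = strrchr(dir, '/');
		if (!slash)
			return -1;
		/* prepend the final component of dir onto rel */
		snprintf(tmp, sizeof(tmp), "%s%s%s",
			 slash + 1, *rel ? "/" : "", rel);
		snprintf(rel, sizeof(rel), "%s", tmp);
		*slash = '\0';		/* step up to the parent */
	}
	snprintf(out, outlen, "%s", rel);
	return 0;
}

int main(int argc, char **argv)
{
	char rel[PATH_MAX];

	if (argc != 2 || subvol_relative_path(argv[1], rel, sizeof(rel)) < 0) {
		fprintf(stderr, "usage: %s /absolute/btrfs/dir\n", argv[0]);
		return 1;
	}
	printf("%s\n", rel);
	return 0;
}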


>> Well, I'd be interested in what Kay's stuff actually does.  Other than
>> that, I would suggest adding a pair of ioctls that when executed on an
>> arbitrary btrfs inode returns the corresponding subvolume and one which
>> returns the path relative to the subvolume root.
>
> udev already scans block devices as they appear.  When it finds btrfs,
> it calls the btrfs dev scan ioctl for that one device.  It also reads in
> the FS uuid and the device uuid and puts them into a tree.
>
> Very simple stuff, but it gets rid of the need to manually call btrfs
> dev scan yourself.
 

For the record, I implemented the use of BTRFS_IOC_DEV_INFO yesterday;
it is still way better than what I had there before and will make an
excellent fallback for a new ioctl.

This would be my suggestion for a new ioctl:

1. Add the device number to the information already returned by
   BTRFS_IOC_DEV_INFO.
2. Allow returning more than one device at a time.  Userspace can
   already know the number of devices from BTRFS_IOC_FS_INFO(*), and
   it'd be better to just size a buffer and return N items rather
   having to iterate over the potentially sparse devid space.

I might write this one up if I can carve out some time today...

-hpa

(*) - because race conditions are still possible, a buffer size/limit
check is still needed.
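
For reference, the fallback iteration looks roughly like this with the
existing ioctls (a sketch with minimal error handling, using the btrfs
ioctl definitions that today live in linux/btrfs.h; BTRFS_IOC_FS_INFO
bounds the scan, and the devid space may be sparse, hence the continue):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

static int list_btrfs_devices(const char *mnt)
{
	struct btrfs_ioctl_fs_info_args fs;
	unsigned long long id;
	int fd = open(mnt, O_RDONLY);

	if (fd < 0)
		return -1;
	memset(&fs, 0, sizeof(fs));
	if (ioctl(fd, BTRFS_IOC_FS_INFO, &fs) < 0) {
		close(fd);
		return -1;
	}
	for (id = 0; id <= fs.max_id; id++) {
		struct btrfs_ioctl_dev_info_args dev;

		memset(&dev, 0, sizeof(dev));
		dev.devid = id;
		if (ioctl(fd, BTRFS_IOC_DEV_INFO, &dev) < 0)
			continue;	/* hole in the devid space */
		printf("devid %llu: %s\n",
		       (unsigned long long)dev.devid, (char *)dev.path);
	}
	close(fd);
	return 0;
}

int main(int argc, char **argv)
{
	return (argc == 2 && list_btrfs_devices(argv[1]) == 0) ? 0 : 1;
}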

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Feature request: true RAID-1 mode

2012-06-20 Thread H. Peter Anvin
Yet another boot loader support request.

Right now btrfs' definition of RAID-1 with more than two devices is a
bit unorthodox: it stores on any two drives.  True RAID-1 would
instead store N copies on each of N devices, the same way an actual
RAID-1 would operate with an arbitrary number of devices.

This means that a bootloader can consider a single device in isolation:
if the firmware gives access only to a single device, it can be booted.
 Since /boot is usually a very small amount of data, this is a very
reasonable tradeoff.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: R: Re: Subvolumes and /proc/self/mountinfo

2012-06-20 Thread H. Peter Anvin
On 06/20/2012 09:34 AM, Goffredo Baroncelli wrote:
 
> At first I thought that having /boot separate could be a good
> thing. Unfortunately /boot contains both the bootloader code and the
> kernel image. The kernel image should be in sync with the contents of
> /lib/modules/.
>
> This is the tricky point. If I handle /boot inside the filesystem
> submodule, a de-sync between the bootloader code and the boot sector
> could happen. If I handle /boot as a separate subvolume/filesystem, a
> de-sync between the kernel image and the modules could happen.
>
> Anyway, from a bootloader POV I think that /boot should be handled
> separately (either as a filesystem or as a subvolume identified by a
> specific ID). The best would be to move the kernel into the same
> subvolume as /lib/modules, so a switch of the subvolume as root
> filesystem would be coherent.
 

You're not really answering the question.  "The best would be to move the
kernel into the same subvolume as /lib/modules" isn't really going to
happen... the whole *point* of /boot is that /boot contains everything
needed to get to the point of kernel initialization.

So, sorry, you're out to sea here...

-hpa



Re: R: Re: Subvolumes and /proc/self/mountinfo

2012-06-20 Thread H. Peter Anvin
On 06/20/2012 11:06 AM, Goffredo Baroncelli wrote:
 
> I am not saying that we *should* move the kernel away from /boot. I am
> only saying that having the kernel near /lib/modules *has* some advantages.
>
> A few years ago there were some gains in having a separate /boot (ah, the
> time when BIOSes were unable to address the bigger disks), where there
> are the minimum things to bootstrap the system.
 

There still is (in fact this exact problem has made a comeback, as there
are plenty of BIOSes which have bugs above the 2 TB mark); however,
there are also issues with RAID (firmware often cannot address all the
devices in the system -- and no, that isn't ancient history, I have a
system exactly like that that I bought last year), remote boot media
(your / might be on an iSCSI device, or even a network filesystem!) and
all kinds of situations like that.

The bottom line is that /boot is what the bootloader needs to be able to
address, whereas / can wait until the kernel has device drivers.  That
is a *HUGE* difference.

> Now we have the possibility to move the kernel near the modules, and
> this could lead to some interesting possibilities: think about different
> linux installations, each with its own kernel version and own modules
> version; what are the reasons to put together under /boot different
> kernels with potentially conflicting names? de facto standard?
> historical reasons? Nothing wrong here; but also the idea of moving the
> kernel under /lib/modules is not so wrong.

No, it is completely, totally and very very seriously wrong.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Subvolumes and /proc/self/mountinfo

2012-06-20 Thread H. Peter Anvin
On 06/19/2012 11:31 PM, Fajar A. Nugraha wrote:
 
> IMHO a more elegant solution would be similar to what
> (open)solaris/indiana does: make the boot parts (bootloader,
> configuration) a separate area, separate from root snapshots. In
> Solaris' case IIRC this will be /rpool/grub.
 

It is both more and less elegant; it means you don't get the same kind
of atomic update for the bootloader itself.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Feature request: true RAID-1 mode

2012-06-20 Thread H. Peter Anvin
Could you have a mode, though, where M = N at all times, so a user doesn't
end up adding a new drive and getting a nasty surprise?

Chris Mason chris.ma...@fusionio.com wrote:

> On Wed, Jun 20, 2012 at 06:35:30PM -0600, Marios Titas wrote:
>> On Wed, Jun 20, 2012 at 12:27 PM, H. Peter Anvin h...@zytor.com wrote:
>>> Yet another boot loader support request.
>>>
>>> Right now btrfs' definition of RAID-1 with more than two devices is a
>>> bit unorthodox: it stores on any two drives.  True RAID-1 would
>>> instead store N copies on each of N devices, the same way an actual
>>> RAID-1 would operate with an arbitrary number of devices.
>>>
>>> This means that a bootloader can consider a single device in isolation:
>>> if the firmware gives access only to a single device, it can be booted.
>>> Since /boot is usually a very small amount of data, this is a very
>>> reasonable tradeoff.
>>
>> +1
>>
>> In fact, the current RAID-1 should not have been called RAID-1 at all,
>> it is confusing.
>
> With the raid5/6 code, I'm changing raid1 (and raid10) to have a
> configurable number of copies.  So, you'll be able to have N copies on
> M drives, where N <= M.
>
> -chris

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.


Re: Subvolumes and /proc/self/mountinfo

2012-06-19 Thread H. Peter Anvin
On 06/19/2012 07:22 AM, Calvin Walton wrote:
 
> All subvolumes are accessible from the volume mounted when you use -o
> subvolid=0. (Note that 0 is not the real ID of the root volume, it's
> just a shortcut for mounting it.)
 

Could you clarify this bit?  Specifically, what is the real ID of the
root volume, then?

I found that after having set the default subvolume to something other
than the root, and then mounting it without the -o subvol= option, then
the subvolume name does *not* show in /proc/self/mountinfo; the same
happens if a subvolume is mounted by -o subvolid= rather than -o subvol=.

Is this a bug?  This would seem to give the worst of both worlds in
terms of actually knowing what the underlying filesystem path would end
up looking like.

-hpa


Re: Device names

2012-06-19 Thread H. Peter Anvin
On 06/19/2012 04:51 PM, Chris Mason wrote:
 
> At mount time, we go through and verify the path names still belong to
> the filesystem you thought they belonged to.  The bdev is locked during
> the verification, so it won't be able to go away or change.
>
> This is a long way of saying right, we don't spit out device numbers.
> Even device numbers can change.  We can easily add a uuid based listing,
> which I think is what you want.
 

No, I want to find the actual devices.  I know I can get the UUID, but
scanning all the block devices in the system looking for that UUID is a
nonstarter.

Device path names can change while the system is operating (and, worse,
are dependent on namespace changes and chroot); device *numbers* cannot
as long as the device is in use (e.g. mounted.)  They can indeed change
while not in use, of course.

-hpa


Re: Subvolumes and /proc/self/mountinfo

2012-06-19 Thread H. Peter Anvin
On 06/19/2012 04:49 PM, Chris Mason wrote:
> On Mon, Jun 18, 2012 at 06:39:31PM -0600, H. Peter Anvin wrote:
>> I'm trying to figure out an algorithm for taking an arbitrary mounted
>> btrfs directory and breaking it down into:
>>
>> device(s), subvolume, subpath
>>
>> where, keep in mind, subpath may not actually be part of the mount.
>
> Do you want an API for this, or is it enough to wander through /dev/disk
> style symlinks?
>
> The big reason it isn't here yet is because Kay had this neat patch to
> blkid and udev to just put all the info you need into /dev/btrfs (or
> some other suitable location).  It would allow you to see which devices
> belong to which filesystems etc.
 

I want an algorithm, it doesn't have an API per se.  I would really like
to avoid relying on blkid and udev for this, though... that is pretty
much a nonstarter.

If the answer is to walk the tree then I'm fine with that.

> subvolumes may become disconnected from the root namespace.  In this
> case we can find it just by the subvol id, and mount it into an
> arbitrary directory.

OK, so it sounds like the best thing is actually to record the subvolume
*number* (ID) where (in my case) Syslinux is installed. This is actually
a good thing because the fewer O(n) strings I have to stick into the
boot block the better.

>> b. Are there better ways (walking the tree using BTRFS_IOC_TREE_SEARCH?)
>> to accomplish this than using /proc/self/mountinfo?
>
> Not yet, but I'm definitely open to adding them.  Lets just hash out
> what you need and we'll either go through Kay's stuff or add ioctls for
> you.

Well, I'd be interested in what Kay's stuff actually does.  Other than
that, I would suggest adding a pair of ioctls that when executed on an
arbitrary btrfs inode returns the corresponding subvolume and one which
returns the path relative to the subvolume root.

-hpa



Re: Subvolumes and /proc/self/mountinfo

2012-06-19 Thread H. Peter Anvin
On 06/19/2012 06:16 PM, cwillu wrote:
>>> The big reason it isn't here yet is because Kay had this neat patch
>>> to blkid and udev to just put all the info you need into /dev/btrfs
>>> (or some other suitable location).  It would allow you to see which
>>> devices belong to which filesystems etc.
>>
>> btrfs should work even without any udev installation.
>
> It does; you can always mount with an explicit -o
> device=/dev/foo,device=/dev/bar if you're inclined to punish
> yourself^w^w^w^w^w your requirements dictate that you don't rely on
> udev.

I think you're misunderstanding what this is about.

I'm working on trying to make the Syslinux installer for btrfs as robust
as it possibly can be.  I really don't like leaving corner cases where
it will do the wrong thing and leave your system unbootable.

Now, that having been said, there are a lot of things that are not
really very clear how they should work given btrfs.  Specifically, what
is needed is:

1. The underlying device(s) for boot block installation.
2. A concept of a root.
3. A concept of a path within that root to the installation directory,
   where we can find syslinux.cfg and the other bootloader modules.

All of this needs to be installed in the fixed-sized boot block, so a
compact representation is very much a plus.

The concept of "what is the root" and "what is the path" is
straightforward for lesser filesystems: the root of the filesystem is
defined by the root inode, and the path is a unique sequence of
directories from that root.  Note that this is completely independent of
how the filesystem was mounted when the boot loader was installed.

For btrfs, a lot of things aren't so clear-cut, especially in the light
of explicit and implicit subvolumes.  Furthermore, sorting out these
semantic issues is really important in order to support the atomic
update scenario:

a. Make a snapshot of the current root;
b. Mount said snapshot;
c. Install the new distro on the snapshot;
d. Change the bootloader configuration *inside* the snapshot to point
   to the snapshot as the root;
e. Install the bootloader on the snapshot, thereby making the boot
   block point to it and making it live.

If the root also contains subvolumes, e.g. /boot may be a subvolume
because it has different policies, this gets pretty gnarly to get right.
It is also of very high value to get right.

So it is possible I'm approaching this wrong.  I would love to have a
discussion about this.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Subvolumes and /proc/self/mountinfo

2012-06-18 Thread H. Peter Anvin
I'm trying to figure out an algorithm for taking an arbitrary mounted
btrfs directory and breaking it down into:

device(s), subvolume, subpath

where, keep in mind, subpath may not actually be part of the mount.

/proc/self/mountinfo seems to have some of that information, however, it
does not appear to distinguish between non-default subvolumes and
directories.  At the same time, once I have mounted a subvolume I see
its name in the root btrfs directory even if I didn't access it.

Questions, thus:

a. Are subvolumes always part of the root namespace?  If so, is it the
mounted root, the default subvolume, or subvolume 0 which always exposes
these other subvolumes?  Are there disambiguation rules for the case
where I have /btrfs/root/blah and blah is both a subvolume and a
directory (I presume that can happen)?

b. Are there better ways (walking the tree using BTRFS_IOC_TREE_SEARCH?)
to accomplish this than using /proc/self/mountinfo?

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/

2009-07-29 Thread H. Peter Anvin

> http://www.cs.utk.edu/~plank/plank/papers/CS-96-332.html even describes an
> implementation _very_ similar to the current code, right down to using a
> table for the logarithm and inverse logarithm calculations.


We don't use a table for logarithm and inverse logarithm calculations.
Any time you do a table lookup you commit suicide from a performance
standpoint.

-hpa



Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/

2009-07-19 Thread H. Peter Anvin
David Woodhouse wrote:
 
> At this point we've actually implemented the fundamental parts of
> RAID[56] support in btrfs, and it's looking like all we really want is
> the arithmetic routines.
 

Given that you have no legacy requirements, and that supporting more
than two disks may be interesting, it may very well be worth spending
some time on new codes now rather than later.  Part of that
investigation, though, is going to have to be if and how they can be
accelerated.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/

2009-07-18 Thread H. Peter Anvin

David Woodhouse wrote:


> I'm only interested in what we can use directly within btrfs -- and
> ideally I do want something which gives me an _arbitrary_ number of
> redundant blocks, rather than limiting me to 2. But the legacy code is
> good enough for now¹.
>
> When I get round to wanting more, I was thinking of lifting something
> like http://git.infradead.org/mtd-utils.git?a=blob;f=fec.c to start
> with, and maybe hoping that someone cleverer will come up with something
> better.
>
> The less I have to deal with Galois Fields, the happier I'll be.



Well, if you want something with more than 2-block redundancy you need
something other than the existing RAID-6 code which, as you know, is a
special case of general Reed-Solomon coding that I happen to have spent
a lot of time optimizing.  The FEC code is not optimized at all as far
as I can tell, and certainly doesn't use SSE in any way -- never mind
the GF accelerators that are starting to appear.  That doesn't mean it
*couldn't*, just that noone has done the work to either implement it or
prove it can't be done.

Either way, perhaps the Plank paper that Rik pointed to could be useful
as a starting point; it's probably worth taking their performance
numbers with a *major* grain of salt: their implementation of RAID-6
"RS-Opt", which is supposed to be equivalent to my code, performs at
400 MB/s, which is less than Pentium III-era performance of the real
world code (they compare not to real code but to their own
implementation in Java, called Jerasure.)  Implementability using real
array instruction sets is key to decent performance.


-hpa


Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/

2009-07-17 Thread H. Peter Anvin

Ric Wheeler wrote:


> Worth sharing a pointer to a really neat set of papers that describe
> open source friendly RAID6 and erasure encoding algorithms that were
> presented last year and this at FAST:
>
> http://www.cs.utk.edu/~plank/plank/papers/papers.html
>
> If I remember correctly, James Plank's papers also have implemented and
> benchmarked the various encodings,




I have seen the papers; I'm not sure it really makes that much 
difference.  One of the things that bugs me about these papers is that 
he compares to *his* implementation of my optimizations, but not to my 
code.  In real life implementations, on commodity hardware, we're 
limited by memory and disk performance, not by CPU utilization.


-hpa



Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/

2009-07-17 Thread H. Peter Anvin

Ric Wheeler wrote:


>> The bottom line is pretty much this: the cost of changing the encoding
>> would appear to outweigh the benefit. I'm not trying to claim the Linux
>> RAID-6 implementation is optimal, but it is simple and appears to be
>> fast enough that the math isn't the bottleneck.
>
> Cost? Think about how to get free grad student hours testing out things
> that you might or might not want to leverage on down the road :-)




Cost, yes, of changing an on-disk format.

-hpa



Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/

2009-07-17 Thread H. Peter Anvin

Ric Wheeler wrote:


>> The main flaw, as I said, is in the phrase "as implemented by the
>> Jerasure library".  He's comparing his own implementations of various
>> algorithms, not optimized implementations.
>>
>> The bottom line is pretty much this: the cost of changing the encoding
>> would appear to outweigh the benefit. I'm not trying to claim the Linux
>> RAID-6 implementation is optimal, but it is simple and appears to be
>> fast enough that the math isn't the bottleneck.
>
> Cost? Think about how to get free grad student hours testing out things
> that you might or might not want to leverage on down the road :-)




Anyway... I don't really care too much.  If someone wants to redesign
the Linux RAID-6 and Neil decides to take it, I'm not going to object.
I'm also not very likely to do any work on it.


-hpa



Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-20 Thread H. Peter Anvin

Ingo Molnar wrote:


> Hm, GCC uses __restrict__, right?
>
> I'm wondering whether there's any internal tie-up between alias analysis
> and the __restrict__ keyword - so if we turn off aliasing optimizations
> the __restrict__ keyword's optimizations are turned off as well.




Actually I suspect that restrict makes little difference for inlines
or even statics, since gcc generally can do alias analysis fine there.
However, in the presence of an intermodule function call, all alias
analysis is off.  This is presumably why type-based analysis is used at
all ... to at least be able to do a modicum of, say, loop invariant
removal in the presence of a library call.  This is also where restrict
comes into play.


-hpa
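
A small illustration of that last point (an added example, not from the
thread): with restrict-qualified parameters the compiler may keep *scale
in a register across the opaque call, because the qualifier promises no
other access path to it; drop the qualifiers and *scale must be reloaded
every iteration, since log_progress() might alias and modify it.

#include <stddef.h>

void log_progress(size_t i);	/* opaque call into another module */

void scale_all(float *restrict dst, const float *restrict src,
	       const float *restrict scale, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		dst[i] = src[i] * *scale;	/* hoistable with restrict */
		log_progress(i);
	}
}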



Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread H. Peter Anvin

Andi Kleen wrote:

> On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote:
>>>> Something at the back of my mind said aliasing.
>>>>
>>>> $ gcc linus.c -O2 -S ; grep subl linus.s
>>>>         subl    $1624, %esp
>>>> $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
>>>>         subl    $824, %esp
>>>>
>>>> That's with 4.3.2.
>>> Interesting.
>>
>> Nonsensical, but interesting.
>
> What I find nonsensical is that -fno-strict-aliasing generates
> better code here. Normally one would expect the compiler seeing
> more aliases with that option and then be more conservative
> regarding any sharing. But it seems to be the other way round
> here.


For this to be convolved with aliasing *AT ALL* indicates this is done 
incorrectly.


This is about storage allocation, not aliases.  Storage allocation only 
depends on lifetime.
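
A sketch of the lifetime point (this is not the linus.c test case from
the thread, just an invented illustration):

#include <string.h>

/* a[] and b[] have disjoint lifetimes, so stack slot allocation --
   which depends only on lifetime, not on alias analysis -- can overlay
   them in one 800-byte slot instead of reserving 1600 bytes. */
static int consume(char *p, size_t n)
{
        memset(p, 1, n);
        return p[n - 1];
}

int f(void)
{
        int r = 0;

        {
                char a[800];
                r += consume(a, sizeof a);      /* a's lifetime ends here */
        }
        {
                char b[800];
                r += consume(b, sizeof b);      /* b may reuse a's slot */
        }
        return r;
}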


-hpa



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread H. Peter Anvin
Linus Torvalds wrote:
> On Fri, 9 Jan 2009, Ingo Molnar wrote:
>>
>> -static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
>> +static __asm_inline int
>> +constant_test_bit(int nr, const volatile unsigned long *addr)
>>  {
>>  	return ((1UL << (nr % BITS_PER_LONG)) &
>>  		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
>
> This makes absolutely no sense.
>
> It's called __always_inline, not __asm_inline.
>
> Why add a new nonsensical annotation like that?

__asm_inline was my suggestion, to distinguish "inline this
unconditionally because gcc screws up in the presence of asm()" versus
"inline this unconditionally because the world ends if it isn't" -- to
tell the human reader, not gcc.  I guess the above is a good indicator
that the __asm_inline might have been a bad name.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread H. Peter Anvin
Ingo Molnar wrote:
> My goal is to make the kernel smaller and faster, and as far as the
> placement of 'inline' keywords goes, i don't have too strong feelings about
> how it's achieved: they have a certain level of documentation value
> [signalling that a function is _intended_ to be lightweight] but otherwise
> they are pretty neutral attributes to me.

As far as naming is concerned, gcc effectively supports four levels,
which *currently* map onto macros as follows:

__always_inline	Inline unconditionally
inline		Inlining hint
(nothing)	Standard heuristics
noinline	Uninline unconditionally
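
Concretely, a sketch of the four levels in use (the macro definitions
here are simplified; the kernel's real compiler.h differs in detail
across versions and architectures):

#define __always_inline inline __attribute__((__always_inline__))
#define noinline        __attribute__((__noinline__))

static __always_inline int bit_a(unsigned long w, int n)   /* inline unconditionally */
{
        return (w >> n) & 1;
}

static inline int bit_b(unsigned long w, int n)            /* inlining hint */
{
        return (w >> n) & 1;
}

static int bit_c(unsigned long w, int n)                   /* standard heuristics */
{
        return (w >> n) & 1;
}

static noinline int bit_d(unsigned long w, int n)          /* uninline unconditionally */
{
        return (w >> n) & 1;
}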

A lot of noise is being made about the naming of the levels (and I
personally believe we should have a different annotation for "inline
unconditionally for correctness" and "inline unconditionally for
performance", as a documentation issue), but those are the four we get.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread H. Peter Anvin
Dirk Hohndel wrote:
> Does gcc actually follow the promise? If that's the case (and if it's
> considered a bug when it doesn't), then we can get what Linus wants by
> annotating EVERY function with either __always_inline or noinline.

__always_inline and noinline do work.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread H. Peter Anvin

Linus Torvalds wrote:
> So we do have special issues. And exactly _because_ we have special issues
> we should also expect that some compiler defaults simply won't ever really
> be appropriate for us.

That is, of course, true.

However, the Linux kernel (and quite a few other kernels) is a very
important customer of gcc, and adding sustainable modes for the kernel
that we can rely on is probably something we can work with them on.

I think the relationship between the gcc and Linux kernel people is
unnecessarily infected, and cultivating a more constructive relationship
would be good.  I suspect a big part of the reason for the oddities is
that the timeline for the kernel community from making a request into
gcc until we can actually rely on it is *very* long, so we end up
having to work things around no matter what (usually with copious
invective), while the gcc people have other customers with shorter lead
times which therefore drive their development more.


-hpa


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread H. Peter Anvin

Richard Guenther wrote:
>> But it's also not inconceivable that gcc adds a -fkernel-inlining or
>> similar that changes the parameters if we ask nicely. I suppose
>> actually such a parameter would be useful for far more programs
>> than the kernel.
>
> I think that the kernel is a perfect target to optimize default -Os
> behavior for (whereas template-heavy C++ programs are a target to
> optimize -O2 for).  And I think we did a good job in listening to
> kernel developers if once in a while they tried to talk to us - GCC 4.3
> should be good in compiling the kernel with default -Os settings.  We,
> unfortunately, cannot retroactively fix old versions that kernel
> developers happen to like and still use.

Unfortunately I think there has been a lot of "we can't talk to them"
on both sides of the kernel-gcc interface, which is incredibly
unfortunate.  I personally try to at least observe gcc development,
including monitoring #gcc and knowing enough about gcc internals to
write a (crappy) port, but I can hardly call myself a gcc expert.
Still, I am willing to spend some significant time interfacing with
anyone in the gcc community willing to spend the effort.  I think we can
do good stuff.


-hpa


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread H. Peter Anvin

Richard Guenther wrote:
> On Fri, Jan 9, 2009 at 8:21 PM, Andi Kleen a...@firstfloor.org wrote:
>>> GCC 4.3 should be good in compiling the
>>> kernel with default -Os settings.
>>
>> It's unfortunately not. It doesn't inline a lot of simple asm() inlines
>> for example.
>
> Reading Ingo's posting with the actual numbers states the opposite.

Well, Andi's patch forcing inlining of the bitops chops quite a bit of
size off the kernel, so there is clearly room for improvement.  From my
post yesterday:

: voreg 64 ; size o.*/vmlinux
     text     data      bss      dec     hex filename
 57590217 24940519 15560504 98091240 5d8c0e8 o.andi/vmlinux
 59421552 24912223 15560504 99894279 5f44407 o.noopt/vmlinux
 57700527 24950719 15560504 98211750 5da97a6 o.opty/vmlinux

110 KB of code size reduction by force-inlining the small bitops.

-hpa


Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread H. Peter Anvin

Andi Kleen wrote:
>> Fetch a gigabyte's worth of data for the debuginfo RPM?
>
> The suse 11.0 kernel debuginfo is ~120M.


Still, though, hardly worth doing client-side when it can be done 
server-side for all the common distro kernels.  For custom kernels, not 
so, but there you should already have the debuginfo locally.


And yes, there are probably residual holes, but it's questionable if it 
matters.


-hpa



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread H. Peter Anvin
Linus Torvalds wrote:
> And quite often, some of them go away - or at least shrink a lot - when
> some config option or other isn't set. So sometimes it's an inline because
> a certain class of people really want it inlined, simply because for
> _them_ it makes sense, but when you enable debugging or something, it
> absolutely explodes.


And this is really why getting static inline annotations right is really
hard if not impossible in the general case (especially when considering
the sheer number of architectures we compile on.)  So making it possible
for the compiler to do the right thing for at least this class of
functions really does seem like a good idea.
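
A sketch of the kind of function being discussed (all names invented
for illustration, not actual kernel code):

#include <assert.h>

struct mylock { unsigned int raw; };

static void arch_lock(unsigned int *raw)        /* stand-in for the real primitive */
{
        *raw = 1;
}

#ifdef CONFIG_MYLOCK_DEBUG
static inline void mylock_acquire(struct mylock *l)
{
        /* in real life: magic-number checks, owner tracking, tracing, ... */
        assert(!l->raw);
        arch_lock(&l->raw);
}
#else
static inline void mylock_acquire(struct mylock *l)
{
        arch_lock(&l->raw);     /* a couple of instructions: inlining wins */
}
#endif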

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact

2009-01-09 Thread H. Peter Anvin
Arjan van de Ven wrote:
> thinking about this.. making a pastebin-like thing for oopses is
> relatively trivial for me; all the building blocks I have already.
>
> The hard part is getting the vmlinux files in place. Right now I do
> this manually for popular released kernels.. if the fedora/suse guys
> would help to at least have the vmlinux for their released updates
> easily available, that would be a huge help; without that it's going
> to suck.


We could just pick them up automatically from the kernel.org mirrors
with a little bit of scripting.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread H. Peter Anvin
Ingo Molnar wrote:
> Apparently it messes up with asm()s: it doesn't know the contents of the
> asm() and hence it over-estimates the size [based on string heuristics]
> ...

Right.   gcc simply doesn't have any way to know how heavyweight an
asm() statement is, and it WILL do the wrong thing in many cases --
especially the ones which involve an out-of-line recovery stub.  This is
due to a fundamental design decision in gcc to not integrate the
compiler and assembler (which some compilers do.)
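
For illustration, a get_user()-style pattern of the kind described, as
a simplified sketch (x86 assumed; the real kernel version also emits an
__ex_table entry wiring the faulting instruction to the stub):

/* The hot path is a single mov, but the estimator also charges for the
   out-of-line recovery stub placed in .fixup, so the function looks far
   more expensive to inline than it really is. */
static inline int load_checked(const unsigned long *p, unsigned long *val)
{
        int err = 0;

        asm volatile("1:     mov %2, %0\n"
                     "2:\n"
                     ".section .fixup,\"ax\"\n"
                     "3:     mov $-14, %1\n"            /* -EFAULT */
                     "       jmp 2b\n"
                     ".previous"
                     : "=r" (*val), "+r" (err)
                     : "m" (*p));
        return err;
}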

> Which is bad - asm()s tend to be the most important entities to inline -
> all over our fastpaths.
>
> Despite that messup it's still a 1% net size win:
>
>    text    data     bss     dec     hex filename
> 7109652 1464684  802888 9377224  8f15c8 vmlinux.always-inline
> 7046115 1465324  802888 9314327  8e2017 vmlinux.optimized-inlining
>
> That win is mixed in slowpath and fastpath as well.

The good part here is that the assembly ones really don't have much
subtlety -- a function call is at least five bytes, usually more once
you count in the register spill penalties -- so __always_inline-ing them
should still end up with numbers looking very much like the above.

> I see three options:
>
>  - Disable CONFIG_OPTIMIZE_INLINING=y altogether (it's already
>    default-off)
>
>  - Change the asm() inline markers to something new like asm_inline, which
>    defaults to __always_inline.
>
>  - Just mark all asm() inline markers as __always_inline - realizing that
>    these should never ever be out of line.
>
> We might still try the second or third options, as i think we shouldn't go
> back into the business of managing the inline attributes of ~100,000
> kernel functions.
>
> I'll try to annotate the inline asms (there's not _that_ many of them),
> and measure what the size impact is.

The main reason to do #2 over #3 would be for programmer documentation.
There simply should be no reason to ever out-of-line these.  However,
documenting the reason to the programmer is a valuable thing in itself.
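
A sketch of what option 2 amounts to (the asm_inline name is the one
proposed in this thread, not something the tree had at the time; the
__always_inline definition is simplified from compiler.h):

#define __always_inline inline __attribute__((__always_inline__))
#define asm_inline      __always_inline /* forced because of asm(), not for correctness */

static asm_inline void rep_nop(void)
{
        asm volatile("rep; nop" ::: "memory");
}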

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread H. Peter Anvin
Harvey Harrison wrote:
>> We might still try the second or third options, as i think we shouldn't go
>> back into the business of managing the inline attributes of ~100,000
>> kernel functions.
>
> Or just make it clear that inline shouldn't (unless for a very good reason)
> _ever_ be used in a .c file.

The question is whether that would produce acceptable quality code.  In
theory it should, but I'm more than wondering if it really will.

It would be ideal, of course, as it would mean less typing.  I guess we
could try it out by disabling any "inline" in the current code that
isn't __always_inline...

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread H. Peter Anvin
Linus Torvalds wrote:
> First off, gcc _does_ have a perfectly fine notion of how heavy-weight an
> asm statement is: just count it as a single instruction (and count the
> argument setup cost that gcc _can_ estimate).


True.  It's not what it's doing, though.  It looks for '\n' and ';'
characters, and counts the maximum instruction size for each possible
instruction.

The reason why is that gcc's size estimation is partially designed to
select what kind of branches it needs to use on architectures which have
more than one type of branch.  As a result, it tends to drastically
overestimate, on purpose.
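
A sketch of the resulting overestimate, assuming the 15-byte x86
maximum instruction length:

/* Five one-byte nops: 5 bytes of actual code, but an estimator that
   splits the string at '\n'/';' and charges each resulting line at the
   maximum instruction length books it as roughly 5 * 15 = 75 bytes. */
static inline void five_nops(void)
{
        asm("nop\n\t"
            "nop\n\t"
            "nop\n\t"
            "nop\n\t"
            "nop");
}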

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-08 Thread H. Peter Anvin
Harvey Harrison wrote:
> A lot of code was written assuming inline means __always_inline, I'd suggest
> keeping that assumption and working on removing inlines that aren't
> strictly necessary as there's no way to know what inlines meant 'try to
> inline' and what ones really should have been __always_inline.
>
> Not that I feel _that_ strongly about it.

Actually, we have that reasonably well down by now.  There seem to be a
couple of minor tweaks still necessary, but I think we're 90-95% there
already.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
