Re: Copy on write of unmodified data
On 05/25/16 02:29, Hugo Mills wrote:
> On Wed, May 25, 2016 at 01:58:15AM -0700, H. Peter Anvin wrote:
>> Hi,
>>
>> I'm looking at using a btrfs with snapshots to implement a generational
>> backup capacity. However, doing it the naïve way would have the side
>> effect that for a file that has been partially modified, after
>> snapshotting the file would be written with *mostly* the same data. How
>> does btrfs' COW algorithm deal with that? If necessary I might want to
>> write some smarter user space utilities for this.
>
> Sounds like it might be a job for one of the dedup tools
> (duperemove, bedup), or, if you're writing your own, the safe
> deduplication ioctl which underlies those tools.

I guess I would prefer it if data wasn't first duplicated and then
deduplicated, if possible. It sounds like I ought to write a "smart
copy-overwrite" tool for this.

	-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
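For what it's worth, the core of such a "smart copy-overwrite" tool is small. Here is a minimal sketch (my own illustration, not an existing utility): compare the two files block by block and rewrite only the blocks that actually differ, so extents still shared with a snapshot are left untouched. A real tool would want more careful error handling and cleanup.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096

/* Hypothetical "smart copy-overwrite": overwrite dst with the contents
 * of src, but only pwrite() the blocks that differ, so unmodified
 * blocks keep sharing extents with any snapshot of dst. */
int smart_copy(const char *src_path, const char *dst_path)
{
    char sbuf[BLKSZ], dbuf[BLKSZ];
    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_RDWR | O_CREAT, 0644);
    off_t off = 0;
    ssize_t sn, dn;

    if (src < 0 || dst < 0)
        return -1;

    while ((sn = pread(src, sbuf, BLKSZ, off)) > 0) {
        dn = pread(dst, dbuf, BLKSZ, off);
        if (dn != sn || memcmp(sbuf, dbuf, sn) != 0) {
            /* only modified blocks are rewritten */
            if (pwrite(dst, sbuf, sn, off) != sn)
                return -1;
        }
        off += sn;
    }
    if (ftruncate(dst, off))    /* drop any tail beyond the source */
        return -1;
    close(src);
    close(dst);
    return sn < 0 ? -1 : 0;
}
```

The interesting design question is the block size: matching the filesystem's extent granularity keeps the comparison aligned with what COW actually shares.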
Re: [RFC 00/32] making inode time stamps y2038 ready
On 06/04/2014 12:24 PM, Arnd Bergmann wrote:
> For other timekeeping stuff in the kernel, I agree that using some
> 64-bit representation (nanoseconds, 32/32 unsigned seconds/nanoseconds,
> ...) has advantages, that's exactly the point I was making earlier
> against simply extending the internal time_t/timespec to 64-bit
> seconds for everything.

How much of a performance issue is it to make time_t 64 bits, and for
the bits there are, how hard are they to fix?

	-hpa
Re: [RFC 00/32] making inode time stamps y2038 ready
On 06/02/2014 12:19 PM, Arnd Bergmann wrote:
> On Monday 02 June 2014 13:52:19 Joseph S. Myers wrote:
>> On Fri, 30 May 2014, Arnd Bergmann wrote:
>>> a) is this the right approach in general? The previous discussion
>>> pointed this way, but there may be other opinions.
>>
>> The syscall changes seem like the sort of thing I'd expect, although
>> patches adding new syscalls or otherwise affecting the
>> kernel/userspace interface (as opposed to those relating to an
>> individual filesystem) should go to linux-api as well as other
>> relevant lists.
>
> Ok. Sorry about missing linux-api, I confused it with linux-arch,
> which may not be as relevant here, except for the one question whether
> we actually want to have the new ABI on all 32-bit architectures or
> only as an opt-in for those that expect to stay around for another 24
> years.
>
> Two more questions for you:
>
> - are you (and others) happy with adding this type of stat syscall
>   (fstatat64/fstat64) as opposed to the more generic xstat that has
>   been discussed in the past and that never made it through the
>   bike-shedding discussion?
>
> - once we have enough buy-in from reviewers to merge this initial
>   series, should we proceed to define the rest of the syscall ABI
>   (minus driver ioctls) so glibc and kernel can do the conversion on
>   top of that, or should we better try to do things one syscall
>   family at a time and actually get the kernel to handle them
>   correctly internally?

The bit that is really going to hurt is every single ioctl that uses a
timespec.

Honestly, though, I really don't understand the point with struct
inode_time. It seems like the zeroth-order thing is to change the
kernel-internal version of struct timespec to have a 64-bit time... it
isn't just about inodes. We then should be explicit about the external
uses of time, and use accessors.

	-hpa
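As a sketch of what that zeroth-order change could look like (illustrative names and layout only, not taken from the actual patch series): a kernel-internal timespec with 64-bit seconds, plus accessors for converting at the boundaries where the legacy 32-bit layout is externally visible.

```c
#include <stdint.h>

/* Sketch: kernel-internal timespec with 64-bit seconds. The names
 * here are illustrative, not from the RFC series. */
struct timespec64 {
    int64_t tv_sec;              /* seconds; no 2038 limit */
    long    tv_nsec;             /* nanoseconds [0, 999999999] */
};

struct old_timespec32 {
    int32_t tv_sec;              /* overflows in January 2038 */
    int32_t tv_nsec;
};

/* Widening is always safe... */
static inline struct timespec64
timespec32_to_timespec64(struct old_timespec32 ts)
{
    struct timespec64 ret = { .tv_sec = ts.tv_sec, .tv_nsec = ts.tv_nsec };
    return ret;
}

/* ...but narrowing back to the legacy ABI truncates after 2038, which
 * is why the external uses need to be explicit. */
static inline struct old_timespec32
timespec64_to_timespec32(struct timespec64 ts)
{
    struct old_timespec32 ret = { .tv_sec = (int32_t)ts.tv_sec,
                                  .tv_nsec = (int32_t)ts.tv_nsec };
    return ret;
}
```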
Re: [RFC 00/32] making inode time stamps y2038 ready
On 06/02/2014 12:55 PM, Arnd Bergmann wrote:
>> The bit that is really going to hurt is every single ioctl that uses
>> a timespec.
>>
>> Honestly, though, I really don't understand the point with struct
>> inode_time. It seems like the zeroth-order thing is to change the
>> kernel-internal version of struct timespec to have a 64-bit time...
>> it isn't just about inodes. We then should be explicit about the
>> external uses of time, and use accessors.
>
> I picked these because they are fairly isolated from all other uses,
> in particular since inode times are the only things where we really
> care about times in the distant past or future (decades away as
> opposed to things that happened between boot and shutdown).

If nothing else, I would expect to be able to set the system time to
weird values for testing. So I'm not so sure I agree with that...

> For other kernel-internal uses, we may be better off migrating to a
> completely different representation, such as nanoseconds since boot
> or the architecture-specific ktime_t, but this is really something to
> decide for each subsystem.

Having a bunch of different time representations in the kernel seems
like a real headache...

	-hpa
Re: [RFC 00/32] making inode time stamps y2038 ready
Typically they are using 64-bit signed seconds.

On May 31, 2014 11:22:37 AM PDT, Richard Cochran <richardcoch...@gmail.com> wrote:
> On Sat, May 31, 2014 at 05:23:02PM +0200, Arnd Bergmann wrote:
>> It's an approximation: (Approximately never ;) with 64-bit
>> timestamps, you can represent close to 300 billion years, which is
>> way past the time that our planet can sustain life of any form[1].
>
> Did you mean 64 bits worth of seconds?
>
>     2^64 / (3600*24*365) = 584,942,417,355
>
> That is more than 300 billion years, and still, it is not quite the
> same as never. In any case, that term is not too helpful in the
> comparison table, IMHO. One could think that some sort of clever
> running count relative to the last mount time was implied.
>
> Thanks,
> Richard
>
> [1] You are forgetting the immortal robotic overlords.

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
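As a sanity check on the arithmetic in this exchange (a throwaway sketch of mine, assuming a plain 365-day year as in the quoted message):

```c
#include <stdint.h>

/* Richard's 365-day year: 3600 * 24 * 365 seconds. */
#define SECS_PER_YEAR 31536000ULL

/* Range of an unsigned 64-bit second counter, in years. */
uint64_t unsigned_64bit_years(void) { return UINT64_MAX / SECS_PER_YEAR; }

/* Signed 64-bit seconds (as typically used): ~292 billion years each
 * side of the epoch. */
int64_t signed_64bit_years(void) { return INT64_MAX / (int64_t)SECS_PER_YEAR; }

/* For comparison, the 32-bit counter being replaced. */
uint64_t unsigned_32bit_years(void) { return UINT32_MAX / SECS_PER_YEAR; }
```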
Re: Formalizing the use of Boot Area B
On 05/14/2014 05:01 PM, H. Peter Anvin wrote:
> It turns out that the primary 64K Boot Area A is too small for some
> applications and/or some architectures.
>
> When I discussed this with Chris Mason, he pointed out that the area
> beyond the superblock is also unused, up until at least the megabyte
> point (from my reading of the mkfs code, it is actually slightly more
> than a megabyte.) This is present in all versions of mkfs.btrfs that
> have the superblock at 64K (some very early ones had the superblock
> at 16K, but that format is no longer supported), so all that is
> needed is formalizing the specs as to the use of this area.
>
> My suggestion is that 64-128K is reserved for extension of the
> superblock and/or any other filesystem uses, and 128-1024K is defined
> as Boot Area B. However, if there may be reason to reserve more, then
> we should do that. Hence requesting a formal decision as to the
> extent and ownership of this area.
>
> 	-hpa

Ping on this? If I don't hear back on this I will probably just go
ahead and use 128K-1024K.

	-hpa
Re: Formalizing the use of Boot Area B
On 05/20/2014 04:37 PM, Chris Mason wrote:
> On 05/20/2014 07:29 PM, H. Peter Anvin wrote:
>> On 05/14/2014 05:01 PM, H. Peter Anvin wrote:
>>> It turns out that the primary 64K Boot Area A is too small for some
>>> applications and/or some architectures.
>>>
>>> When I discussed this with Chris Mason, he pointed out that the
>>> area beyond the superblock is also unused, up until at least the
>>> megabyte point (from my reading of the mkfs code, it is actually
>>> slightly more than a megabyte.) This is present in all versions of
>>> mkfs.btrfs that have the superblock at 64K (some very early ones
>>> had the superblock at 16K, but that format is no longer supported),
>>> so all that is needed is formalizing the specs as to the use of
>>> this area.
>>>
>>> My suggestion is that 64-128K is reserved for extension of the
>>> superblock and/or any other filesystem uses, and 128-1024K is
>>> defined as Boot Area B. However, if there may be reason to reserve
>>> more, then we should do that. Hence requesting a formal decision as
>>> to the extent and ownership of this area.
>>
>> Ping on this? If I don't hear back on this I will probably just go
>> ahead and use 128K-1024K.
>
> Hi Peter,
>
> We do leave the first 1MB of each device alone. Can we do 256K-1024K
> for the boot loader? We don't have an immediate need for the extra
> space, but I'd like to reserve a little more than the extra 64KB.

Works for me. So 64-256K (192K) is reserved for the file system, and
Boot Area B is 256-1024K (768K).

	-hpa
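The layout agreed here can be summarized with a few constants (a sketch; the identifier names are made up for illustration and do not come from any btrfs header):

```c
/* Front-of-device layout per the agreement in this thread.
 * Identifier names are illustrative only. */
#define BOOT_AREA_A_START   0x0                /* 0K: Boot Area A */
#define BOOT_AREA_A_SIZE    (64 * 1024)
#define SUPERBLOCK_OFFSET   (64 * 1024)        /* primary superblock */
#define FS_RESERVED_START   (64 * 1024)        /* 64K-256K: filesystem */
#define FS_RESERVED_SIZE    (192 * 1024)
#define BOOT_AREA_B_START   (256 * 1024)       /* 256K-1024K: bootloader */
#define BOOT_AREA_B_SIZE    (768 * 1024)
```

The three regions tile exactly the first megabyte that mkfs.btrfs leaves alone.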
Re: Formalizing the use of Boot Area B
On 05/20/2014 04:37 PM, Chris Mason wrote:
> Hi Peter,
>
> We do leave the first 1MB of each device alone. Can we do 256K-1024K
> for the boot loader? We don't have an immediate need for the extra
> space, but I'd like to reserve a little more than the extra 64KB.

Incidentally, the current version of mkfs.btrfs actually leaves not
1 MB (1024K) but rather 1104K (64K+16K+1024K). Not sure if that is
intentional.

	-hpa
Formalizing the use of Boot Area B
It turns out that the primary 64K Boot Area A is too small for some
applications and/or some architectures.

When I discussed this with Chris Mason, he pointed out that the area
beyond the superblock is also unused, up until at least the megabyte
point (from my reading of the mkfs code, it is actually slightly more
than a megabyte.) This is present in all versions of mkfs.btrfs that
have the superblock at 64K (some very early ones had the superblock at
16K, but that format is no longer supported), so all that is needed is
formalizing the specs as to the use of this area.

My suggestion is that 64-128K is reserved for extension of the
superblock and/or any other filesystem uses, and 128-1024K is defined
as Boot Area B. However, if there may be reason to reserve more, then
we should do that. Hence requesting a formal decision as to the extent
and ownership of this area.

	-hpa
Re: Triple parity and beyond
On 11/21/2013 04:30 PM, Stan Hoeppner wrote:
> The rebuild time of a parity array normally has little to do with CPU
> overhead.

Unless you have to fall back to table-driven code.

Anyway, this looks like a great concept. Now we just need to implement
it ;)

	-hpa
Re: Triple parity and beyond
It is also possible to quickly multiply by 2^-1, which makes for an
interesting R parity.

Andrea Mazzoleni <amadva...@gmail.com> wrote:
> Hi David,
>
>> The choice of ZFS to use powers of 4 was likely not optimal, because
>> to multiply by 4, it has to do two multiplications by 2.
>
>> I can agree with that. I didn't copy ZFS's choice here
>
> David, it was not my intention to suggest that you copied from ZFS.
> Sorry to have expressed myself badly. I just mentioned ZFS because
> it's an implementation that I know uses powers of 4 to generate
> triple parity, and I saw in the code that it's implemented with two
> multiplications by 2.
>
> Ciao,
> Andrea

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.
Re: Triple parity and beyond
On 11/20/2013 10:56 AM, Andrea Mazzoleni wrote:
> Hi,
>
> Yep. At present to multiply by 2^-1 I'm using in C:
>
> static inline uint64_t d2_64(uint64_t v)
> {
> 	uint64_t mask = v & 0x0101010101010101U;
> 	mask = (mask << 8) - mask;
> 	v = (v >> 1) & 0x7f7f7f7f7f7f7f7fU;
> 	v ^= mask & 0x8e8e8e8e8e8e8e8eU;
> 	return v;
> }
>
> and for SSE2:
>
> asm volatile("movdqa %xmm2,%xmm4");
> asm volatile("pxor %xmm5,%xmm5");
> asm volatile("psllw $7,%xmm4");
> asm volatile("psrlw $1,%xmm2");
> asm volatile("pcmpgtb %xmm4,%xmm5");
> asm volatile("pand %xmm6,%xmm2");	/* with xmm6 == 7f7f7f7f7f7f... */
> asm volatile("pand %xmm3,%xmm5");	/* with xmm3 == 8e8e8e8e8e... */
> asm volatile("pxor %xmm5,%xmm2");
>
> where xmm2 is the input/output

Now, that doesn't sound like something that can get neatly meshed into
the Cauchy matrix scheme, I assume.

It is somewhat nice to have a scheme which is arbitrarily expandable
without having to fall back to dual parity during the restripe
operation. It probably also reduces the amount of code necessary.

	-hpa
Re: Triple parity and beyond
On 11/20/2013 11:05 AM, Andrea Mazzoleni wrote:
> For the first row with j=0, I use xi = 2^-i and y0 = 0, that results
> in:

How can xi = 2^-i if x is supposed to be constant? That doesn't mean
that your approach isn't valid, of course, but it might not be a
Cauchy matrix and thus needs additional analysis.

> row j=0 -> 1/(xi+y0) = 1/(2^-i + 0) = 2^i (RAID-6 coefficients)
>
> For the next rows with j>0, I use yj = 2^j, resulting in:
>
> rows j>0 -> 1/(xi+yj) = 1/(2^-i + 2^j)

Even more so here... 2^-i and 2^j don't seem to be of the form xi and
yj respectively.

	-hpa
Re: Triple parity and beyond
On 11/20/2013 01:04 PM, Andrea Mazzoleni wrote:
> Hi Peter,
>
>>> static inline uint64_t d2_64(uint64_t v)
>>> {
>>> 	uint64_t mask = v & 0x0101010101010101U;
>>> 	mask = (mask << 8) - mask;
>>
>> (mask << 7) I assume...
>
> No. It's (mask << 8) - mask. We want to expand the bit at position 0
> (in each byte) to the full byte, resulting in 0xFF if the bit is 1,
> and 0x00 if the bit is 0.
>
> (0 << 8) - 0 = 0x00
> (1 << 8) - 1 = 0x100 - 1 = 0xFF

Oh, right... it is the same as (v << 1) - (v >> 7) except everything
is shifted over one.

	-hpa
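For what it's worth, the identity is easy to check exhaustively against the ordinary RAID-6 multiply-by-2 (a quick test sketch of mine, using the usual 0x11d polynomial; not part of Andrea's code):

```c
#include <stdint.h>

/* Reference multiply-by-2 in GF(2^8) with the RAID-6 polynomial 0x11d. */
uint8_t gf_mul2(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/* Andrea's bytewise multiply-by-2^-1 on a 64-bit word, as posted. */
uint64_t d2_64(uint64_t v)
{
    uint64_t mask = v & 0x0101010101010101ULL;
    mask = (mask << 8) - mask;         /* expand bit 0 of each byte to 0xFF */
    v = (v >> 1) & 0x7f7f7f7f7f7f7f7fULL;
    v ^= mask & 0x8e8e8e8e8e8e8e8eULL; /* 0x8e == 2^-1 in this field */
    return v;
}

/* d2_64 should invert gf_mul2 in every byte lane simultaneously. */
int check_all_bytes(void)
{
    for (int b = 0; b < 256; b++) {
        uint64_t w = (uint64_t)b * 0x0101010101010101ULL;
        uint64_t m = (uint64_t)gf_mul2((uint8_t)b) * 0x0101010101010101ULL;
        if (d2_64(m) != w)
            return 0;
    }
    return 1;
}
```

Note that the `(mask << 8) - mask` trick relies on modular wraparound for the top byte, which is why the "overflow" there is harmless.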
Re: Triple parity and beyond
On 11/20/2013 12:30 PM, James Plank wrote:
> Peter, I think I understand it differently. Concrete example in
> GF(256) for k=6, m=4: First, create a 3 by 6 Cauchy matrix, using
> x_i = 2^-i, and y_i = 0 for i=0, and y_i = 2^i for other i. In this
> case:
>
>     x = { 1, 142, 71, 173, 216, 108 }
>     y = { 0, 2, 4 }
>
> The Cauchy matrix is:

Sorry, I took xi and yj to mean a constant x multiplied with i and a
constant y multiplied with j, rather than x_i and y_j.

	-hpa
Re: Triple parity and beyond
On 11/18/2013 02:08 PM, Andrea Mazzoleni wrote:
> Hi,
>
> I want to report that I recently implemented a support for arbitrary
> number of parities that could be useful also for Linux RAID and
> Btrfs, both currently limited to double parity.
>
> In short, to generate the parity I use a Cauchy matrix specifically
> built to be compatible with the existing Linux parity computation,
> and extensible to an arbitrary number of parities. This without
> limitations on the number of data disks.
>
> The Cauchy matrix for six parities is:
>
> 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01...
> 01 02 04 08 10 20 40 80 1d 3a 74 e8 cd 87 13 26 4c 98 2d 5a b4 75...
> 01 f5 d2 c4 9a 71 f1 7f fc 87 c1 c6 19 2f 40 55 3d ba 53 04 9c 61...
> 01 bb a6 d7 c7 07 ce 82 4a 2f a5 9b b6 60 f1 ad e7 f4 06 d2 df 2e...
> 01 97 7f 9c 7c 18 bd a2 58 1a da 74 70 a3 e5 47 29 07 f5 80 23 e9...
> 01 2b 3f cf 73 2c d6 ed cb 74 15 78 8a c1 17 c9 89 68 21 ab 76 3b...
>
> You can easily recognize the first row as RAID5 based on a simple
> XOR, and the second row as RAID6 based on multiplications by powers
> of 2. The other rows are for additional parity levels and they
> require multiplications by arbitrary values that can be implemented
> using the PSHUFB instruction.

Hello,

This looks very interesting indeed. Could you perhaps describe how the
Cauchy matrix is derived, and under what conditions it would become
singular?

	-hpa
Re: Triple parity and beyond
On 11/18/2013 02:35 PM, Andrea Mazzoleni wrote:
> Hi Peter,
>
> The Cauchy matrix has the mathematical property that itself and all
> its submatrices are always non-singular. So, we are sure that we can
> always solve the equations to recover the data disks.
>
> Besides the mathematical proof, I've also inverted all the
> 377,342,351,231 possible submatrices for up to 6 parities and 251
> data disks, and got an experimental confirmation of this.

Nice.

> The only limit is coming from the GF(2^8). You have a maximum number
> of disks = 2^8 + 1 - number_of_parities. For example, with 6
> parities, you can have no more than 251 data disks. Over this limit
> it's not possible to build a Cauchy matrix.

251? Not 255?

> Note that instead with a Vandermonde matrix you don't have the
> guarantee to always have all the submatrices non-singular. This is
> the reason why, using power coefficients, sooner or later it happens
> that you hit unsolvable equations.
>
> You can find the code that generates the Cauchy matrix, with some
> explanation in the comments, at (see the set_cauchy() function):
>
> http://sourceforge.net/p/snapraid/code/ci/master/tree/mktables.c

OK, need to read up on the theoretical aspects of this, but it sounds
promising.

	-hpa
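A minimal sketch of the construction (modeled loosely on what set_cauchy() does, not a copy of it): build the entries 1/(x_i + y_j) in GF(2^8) with x_i = 2^-i, y_0 = 0 and y_j = 2^j for j > 0, then scale each row so that its first column is 1. The row built from y_0 then reproduces the RAID-6 power-of-2 coefficients exactly.

```c
#include <stdint.h>

/* GF(2^8) multiply with the Linux RAID-6 polynomial 0x11d. */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return r;
}

uint8_t gf_inv(uint8_t a)        /* brute force is fine at this size */
{
    for (int i = 1; i < 256; i++)
        if (gf_mul(a, (uint8_t)i) == 1)
            return (uint8_t)i;
    return 0;                    /* a == 0 has no inverse */
}

/* Fill an extended Cauchy matrix: row 0 is all 1s (plain XOR parity),
 * rows 1..nparity-1 are the scaled Cauchy rows.  Loosely modeled on
 * snapraid's set_cauchy(); ndata <= 8, nparity <= 6 in this sketch. */
void cauchy_rows(uint8_t m[6][8], int nparity, int ndata)
{
    uint8_t pow2[8], x[8], y[6];

    pow2[0] = 1;
    for (int i = 1; i < 8; i++)
        pow2[i] = gf_mul(pow2[i - 1], 2);
    for (int i = 0; i < ndata; i++)
        x[i] = gf_inv(pow2[i]);              /* x_i = 2^-i */
    y[0] = 0;
    for (int j = 1; j < nparity - 1; j++)
        y[j] = pow2[j];                      /* y_j = 2^j for j > 0 */

    for (int i = 0; i < ndata; i++)
        m[0][i] = 1;                         /* XOR (RAID-5) row */
    for (int j = 0; j < nparity - 1; j++) {
        for (int i = 0; i < ndata; i++)
            m[j + 1][i] = gf_inv(x[i] ^ y[j]);   /* 1/(x_i + y_j) */
        uint8_t s = gf_inv(m[j + 1][0]);         /* scale: first column 1 */
        for (int i = 0; i < ndata; i++)
            m[j + 1][i] = gf_mul(m[j + 1][i], s);
    }
}
```

The row scaling does not affect invertibility of the submatrices, which is why it is safe to normalize the first column this way.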
Re: experimental raid5/6 code in git
@@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	}
 	btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
 
+	if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
+			  BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
+		printk(KERN_ERR "btrfs: unable to go below three devices "
+		       "on raid5 or raid6\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
 		printk(KERN_ERR "btrfs: unable to go below four devices "
 		       "on raid10\n");
@@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		goto out;
 	}
 
+	if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
+	    root->fs_info->fs_devices->rw_devices <= 2) {
+		printk(KERN_ERR "btrfs: unable to go below two "
+		       "devices on raid5\n");
+		ret = -EINVAL;
+		goto out;
+	}
+	if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
+	    root->fs_info->fs_devices->rw_devices <= 3) {
+		printk(KERN_ERR "btrfs: unable to go below three "
+		       "devices on raid6\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	if (strcmp(device_path, "missing") == 0) {
 		struct list_head *devices;
 		struct btrfs_device *tmp;

This seems inconsistent?

	-hpa
Re: experimental raid5/6 code in git
Also, a 2-member raid5 or 3-member raid6 are a raid1 and can be treated
as such.

Chris Mason <chris.ma...@fusionio.com> wrote:
> On Mon, Feb 04, 2013 at 02:42:24PM -0700, H. Peter Anvin wrote:
>> @@ -1389,6 +1392,14 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>  	}
>>  	btrfs_dev_replace_unlock(&root->fs_info->dev_replace);
>>  
>> +	if ((all_avail & (BTRFS_BLOCK_GROUP_RAID5 |
>> +			  BTRFS_BLOCK_GROUP_RAID6) && num_devices <= 3)) {
>> +		printk(KERN_ERR "btrfs: unable to go below three devices "
>> +		       "on raid5 or raid6\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>>  	if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) && num_devices <= 4) {
>>  		printk(KERN_ERR "btrfs: unable to go below four devices "
>>  		       "on raid10\n");
>> @@ -1403,6 +1414,21 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>>  		goto out;
>>  	}
>>  
>> +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID5) &&
>> +	    root->fs_info->fs_devices->rw_devices <= 2) {
>> +		printk(KERN_ERR "btrfs: unable to go below two "
>> +		       "devices on raid5\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +	if ((all_avail & BTRFS_BLOCK_GROUP_RAID6) &&
>> +	    root->fs_info->fs_devices->rw_devices <= 3) {
>> +		printk(KERN_ERR "btrfs: unable to go below three "
>> +		       "devices on raid6\n");
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>>  	if (strcmp(device_path, "missing") == 0) {
>>  		struct list_head *devices;
>>  		struct btrfs_device *tmp;
>>
>> This seems inconsistent?
>
> Whoops, missed that one. Thanks!
>
> -chris

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: Feature request: true RAID-1 mode
On 06/25/2012 04:00 PM, H. Peter Anvin wrote:
> I am aware of that, and it is not a problem... the one-device
> bootloader can find out *which* disk it is talking to by comparing
> uuids, and the btrfs data structures will tell it how to find the
> data on that specific disk. It does of course mean the bootloader
> needs to be aware of the multidisk nature of btrfs, but that isn't a
> problem in itself.

So, also, let me address the question of why we should care about a
one-device bootloader.

It is quite common, especially in fileservers, for a subset of the
boot devices to be inaccessible by the firmware, due to bugs, boot
time concerns (spinning up all the media in the firmware is SLOW) or
just plain lack of support for plug-in cards. As such, the reliable
thing to do is to make sure that any disk being seen is enough to
bring up the system; since this is such a small amount of data by
modern standards, there is just no reason to do anything less robust.

Once the kernel comes up it has all the device drivers, of course.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Re: Feature request: true RAID-1 mode
On 06/25/2012 08:21 AM, Chris Mason wrote:
> Yes and no. If you have 2 drives and you add one more, we can make it
> do all new chunks over 3 drives. But, turning the existing double
> mirror chunks into a triple mirror requires a balance.
>
> -chris

So trigger one. This is the exact analogue of the resync pass that is
required in classic RAID after adding new media.

	-hpa
Re: Feature request: true RAID-1 mode
On 06/25/2012 03:28 PM, Gareth Pye wrote:
> To me one doesn't have to be triggered; a user expects to have to
> tell the disks to rebuild/resync/balance after adding a disk. They
> may want to wait till they've added all 4 disks and run a few extra
> commands before they run the rebalance.

They do? E.g. mdadm doesn't make them...

> What is important is having a mode that doesn't require the user to
> remember that what they had used as the closest analogue to RAID1
> that BTRFS supports requires them to run another command to change
> the 'RAID level' to be the RAID1 analogue for the new number of
> disks. Users will forget that and they will lose data because of it.
> At least with a M=N mode BTRFS can say they tried to make it easy to
> avoid that pitfall.

Doesn't that contradict your previous statement? In either case, I
agree with the latter...

	-hpa
Re: Feature request: true RAID-1 mode
On 06/25/2012 03:54 PM, Hugo Mills wrote:
> On Mon, Jun 25, 2012 at 10:46:01AM -0700, H. Peter Anvin wrote:
>> On 06/25/2012 08:21 AM, Chris Mason wrote:
>>> Yes and no. If you have 2 drives and you add one more, we can make
>>> it do all new chunks over 3 drives. But, turning the existing
>>> double mirror chunks into a triple mirror requires a balance.
>>>
>>> -chris
>>
>> So trigger one. This is the exact analogue to the resync pass that
>> is required in classic RAID after adding new media.
>
> You'd have to cancel and restart if a second new disk was added while
> the first balance was ongoing.

Fortunately, this isn't a problem these days.

> Also, it occurs to me that I should just check -- are you aware that
> the btrfs implementation of RAID-1 makes no guarantees about the
> location of any given piece of data? i.e. if I have a piece of data
> stored at block X on disk 1, it's not guaranteed to be stored at
> block X on disks 2, 3, 4, ... I'm not sure if this is important to
> you, but it's a significant difference between the btrfs
> implementation of RAID-1 and the MD implementation.

I am aware of that, and it is not a problem... the one-device
bootloader can find out *which* disk it is talking to by comparing
uuids, and the btrfs data structures will tell it how to find the data
on that specific disk. It does of course mean the bootloader needs to
be aware of the multidisk nature of btrfs, but that isn't a problem in
itself.

	-hpa
Re: R: Re: Subvolumes and /proc/self/mountinfo
On 06/20/2012 10:47 PM, Goffredo Baroncelli wrote:
> This leads to having a separate /boot filesystem. In this case I
> agree with you: it makes sense that the kernel is near the bootloader
> files. But if /boot has to be in a separate filesystem, what is the
> point of supporting btrfs at all? Does it make sense to support only
> a subset of btrfs features?

Yes, and that's another good reason for /boot: btrfs supports that
kind of policy (e.g. no compression or encryption in this subtree.)

> Now we have the possibility to move the kernel near the modules, and
> this could lead to some interesting possibilities: think about
> different Linux installations, each with its own kernel version and
> its own module versions; what are the reasons to put different
> kernels with potentially conflicting names together under /boot? De
> facto standard? Historical reasons? Nothing wrong here; but also the
> idea of moving the kernel under /lib/modules is not so wrong.

No, it is completely, totally and very very seriously wrong.

> When a bootloader (and the BIOSes) will be able to address the whole
> disks, this will change... not now.

People have said that for 15 years. The reality is that firmware will
always be behind the curve, and *that's ok*, we just need to deal with
it.

	-hpa
Re: R: Re: Subvolumes and /proc/self/mountinfo
On 06/21/2012 10:05 AM, Goffredo Baroncelli wrote:
> On 06/21/2012 03:38 PM, H. Peter Anvin wrote:
>>> But if /boot has to be in a separate filesystem, what is the point
>>> of supporting btrfs at all? Does it make sense to support only a
>>> subset of btrfs features?
>>
>> Yes, and that's another good reason for /boot: btrfs supports that
>> kind of policy (e.g. no compression or encryption in this subtree.)
>
> But what about large disks? Is Syslinux able to handle large disks?
> Or does it use BIOS interrupts?

It uses the firmware... how well the firmware can handle large disks
is another matter.

	-hpa
Re: Subvolumes and /proc/self/mountinfo
On 06/20/2012 06:34 AM, Chris Mason wrote:
>> I want an algorithm, it doesn't have an API per se. I would really
>> like to avoid relying on blkid and udev for this, though... that is
>> pretty much a nonstarter. If the answer is to walk the tree then I'm
>> fine with that.
>
> Ok, fair enough. Right, the subvolume number doesn't change over the
> life of the subvol, regardless of the path that was used for mounting
> the subvol. So all you need is that number (64 bits) and the filename
> relative to the subvol root and you're set. We'll have to add an
> ioctl for that.
>
> Finding the path relative to the subvol is easy, just walk backwards
> up the directory chain (cd ..) until you get to inode number 256. All
> the subvol roots have inode number 256.

... assuming I can actually see the root (that it is not obscured
because of bind mounts and so on). Yes, I know I'm weird for worrying
about these hyper-obscure corner cases, but I have a pretty explicit
goal of trying to write Syslinux so it will very rarely if ever do the
wrong thing, and since I can already get that information for other
filesystems.

The other thing, of course, is what is the desirable behavior, which I
have brought up in a few posts already. Specifically, I see two
possibilities:

a. Always handle a path from the global root, and treat subvolumes as
   directories. This would mostly require that the behavior of
   /proc/self/mountinfo with regards to mount -o subvolid= would need
   to be fixed. I also have no idea how one would deal with a detached
   subvolume, or if that subcase even matters.

   A major problem with this is that it may be *very* confusing to a
   user to have to specify a path in their bootloader configuration as
   /subvolume/foo/bar when the string "subvolume" doesn't show up in
   any way in their normal filesystem.

b. Treat the subvolume as the root (which is what I so far have been
   assuming.)
In this case, I think the (subvolume ID, path_in_subvolume) ioctl is
the way to go, unless there is a way to do this with
BTRFS_IOC_TREE_SEARCH already.

I think I'm leaning, still, toward b just because of the very high
potential for user confusion with a.

Well, I'd be interested in what Kay's stuff actually does. Other than
that, I would suggest adding a pair of ioctls: one that, when executed
on an arbitrary btrfs inode, returns the corresponding subvolume, and
one which returns the path relative to the subvolume root.

> udev already scans block devices as they appear. When it finds btrfs,
> it calls the btrfs dev scan ioctl for that one device. It also reads
> in the FS uuid and the device uuid and puts them into a tree. Very
> simple stuff, but it gets rid of the need to manually call btrfs dev
> scan yourself.

For the record, I implemented the use of BTRFS_IOC_DEV_INFO yesterday;
it is still way better than what I had there before and will make an
excellent fallback for a new ioctl.

This would be my suggestion for a new ioctl:

1. Add the device number to the information already returned by
   BTRFS_IOC_DEV_INFO.

2. Allow returning more than one device at a time. Userspace can
   already know the number of devices from BTRFS_IOC_FS_INFO(*), and
   it'd be better to just size a buffer and return N items rather than
   having to iterate over the potentially sparse devid space.

I might write this one up if I can carve out some time today...

	-hpa

(*) - because race conditions are still possible, a buffer size/limit
check is still needed.
Feature request: true RAID-1 mode
Yet another boot loader support request. Right now btrfs' definition of RAID-1 with more than two devices is a bit unorthodox: it stores two copies of each block, on any two of the drives. True RAID-1 would instead store N copies, one on each of the N devices, the same way an actual RAID-1 would operate with an arbitrary number of devices. This means that a bootloader can consider a single device in isolation: if the firmware gives access only to a single device, it can be booted. Since /boot is usually a very small amount of data, this is a very reasonable tradeoff.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: R: Re: Subvolumes and /proc/self/mountinfo
On 06/20/2012 09:34 AM, Goffredo Baroncelli wrote: At first I thought that having /boot separate could be a good thing. Unfortunately /boot contains both the bootloader code and the kernel image. The kernel image should be in sync with the contents of /lib/modules/. This is the tricky point. If I handle /boot inside the filesystem submodule, a de-sync between the bootloader code and the boot sector could happen. If I handle /boot as a separate subvolume/filesystem, a de-sync between the kernel image and the modules could happen. Anyway, from a bootloader POV I think that /boot should be handled separately (either as a filesystem or as a subvolume identified by a specific ID). The best option could be to move the kernel into the same subvolume as /lib/modules, so a switch of the subvolume used as the root filesystem would be coherent.

You're not really answering the question. "The best option could be to move the kernel into the same subvolume as /lib/modules" isn't really going to happen... the whole *point* of /boot is that /boot contains everything needed to get to the point of kernel initialization. So, sorry, you're out to sea here...

-hpa
Re: R: Re: Subvolumes and /proc/self/mountinfo
On 06/20/2012 11:06 AM, Goffredo Baroncelli wrote: I am not saying that we *should* move the kernel away from /boot. I am only saying that having the kernel near /lib/modules *has* some advantages. A few years ago there were real gains from having a separate /boot (ah, the days when BIOSes were unable to address the larger disks), where there are the minimum things needed to bootstrap the system.

There still is (in fact this exact problem has made a comeback, as there are plenty of BIOSes which have bugs above the 2 TB mark); however, there are also issues with RAID (firmware often cannot address all the devices in the system -- and no, that isn't ancient history, I have a system exactly like that that I bought last year), remote boot media (your / might be on an iSCSI device, or even a network filesystem!) and all kinds of situations like that. The bottom line is that /boot is what the bootloader needs to be able to address, whereas / can wait until the kernel has device drivers. That is a *HUGE* difference.

Now we have the possibility to move the kernel near the modules, and this could lead to some interesting possibilities: think about different Linux installations, each with its own kernel version and its own modules version; what are the reasons to put the different kernels, with potentially conflicting names, together under /boot? De facto standard? Historical reasons? Nothing wrong here; but also the idea of moving the kernel under /lib/modules is not so wrong.

No, it is completely, totally and very very seriously wrong.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: Subvolumes and /proc/self/mountinfo
On 06/19/2012 11:31 PM, Fajar A. Nugraha wrote: IMHO a more elegant solution would be similar to what (open)solaris/indiana does: make the boot parts (bootloader, configuration) a separate area, separate from root snapshots. In solaris' case IIRC this will be /rpool/grub.

It is both more and less elegant; it means you don't get the same kind of atomic update for the bootloader itself.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: Feature request: true RAID-1 mode
Could you have a mode, though, where M = N at all times, so a user doesn't end up adding a new drive and getting a nasty surprise?

Chris Mason chris.ma...@fusionio.com wrote: On Wed, Jun 20, 2012 at 06:35:30PM -0600, Marios Titas wrote: On Wed, Jun 20, 2012 at 12:27 PM, H. Peter Anvin h...@zytor.com wrote: Yet another boot loader support request. Right now btrfs' definition of RAID-1 with more than two devices is a bit unorthodox: it stores two copies of each block, on any two of the drives. True RAID-1 would instead store N copies, one on each of the N devices, the same way an actual RAID-1 would operate with an arbitrary number of devices. This means that a bootloader can consider a single device in isolation: if the firmware gives access only to a single device, it can be booted. Since /boot is usually a very small amount of data, this is a very reasonable tradeoff.

+1 In fact, the current RAID-1 should not have been called RAID-1 at all, it is confusing.

With the raid5/6 code, I'm changing raid1 (and raid10) to have a configurable number of copies. So, you'll be able to have N copies on M drives, where N <= M.

-chris

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: Subvolumes and /proc/self/mountinfo
On 06/19/2012 07:22 AM, Calvin Walton wrote: All subvolumes are accessible from the volume mounted when you use -o subvolid=0. (Note that 0 is not the real ID of the root volume, it's just a shortcut for mounting it.)

Could you clarify this bit? Specifically, what is the real ID of the root volume, then? I found that after setting the default subvolume to something other than the root and then mounting without the -o subvol= option, the subvolume name does *not* show in /proc/self/mountinfo; the same happens if a subvolume is mounted by -o subvolid= rather than -o subvol=. Is this a bug? This would seem to give the worst of both worlds in terms of actually knowing what the underlying filesystem path would end up looking like.

-hpa
Re: Device names
On 06/19/2012 04:51 PM, Chris Mason wrote: At mount time, we go through and verify the path names still belong to the filesystem you thought they belonged to. The bdev is locked during the verification, so it won't be able to go away or change. This is a long way of saying "right, we don't spit out device numbers." Even device numbers can change. We can easily add a uuid based listing, which I think is what you want.

No, I want to find the actual devices. I know I can get the UUID, but scanning all the block devices in the system looking for that UUID is a nonstarter. Device path names can change while the system is operating (and, worse, are dependent on namespace changes and chroot); device *numbers* cannot as long as the device is in use (e.g. mounted.) They can indeed change while not in use, of course.

-hpa
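For contrast, here is what a userspace installer can already get for any mounted path: the containing filesystem's device number from stat(2). The catch, and part of why this thread asks for a btrfs-specific ioctl, is that btrfs reports a synthetic (anonymous) st_dev per subvolume rather than the real member devices. A quick illustration (works on any path, not just btrfs):

```python
import os

def containing_device(path):
    """Return the (major, minor) device number of the filesystem
    holding 'path' -- stable for as long as the fs stays mounted."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)

major, minor = containing_device("/")
```

For single-device filesystems this is enough to go from a path to the block device; for multi-device btrfs it is a dead end, hence the proposed extension to BTRFS_IOC_DEV_INFO.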
Re: Subvolumes and /proc/self/mountinfo
On 06/19/2012 04:49 PM, Chris Mason wrote: On Mon, Jun 18, 2012 at 06:39:31PM -0600, H. Peter Anvin wrote: I'm trying to figure out an algorithm for taking an arbitrary mounted btrfs directory and breaking it down into: device(s), subvolume, subpath -- where, keep in mind, subpath may not actually be part of the mount.

Do you want an API for this, or is it enough to wander through /dev/disk style symlinks? The big reason it isn't here yet is because Kay had this neat patch to blkid and udev to just put all the info you need into /dev/btrfs (or some other suitable location). It would allow you to see which devices belong to which filesystems etc.

I want an algorithm, it doesn't have an API per se. I would really like to avoid relying on blkid and udev for this, though... that is pretty much a nonstarter. If the answer is to walk the tree then I'm fine with that.

Subvolumes may become disconnected from the root namespace. In this case we can find it just by the subvol id, and mount it into an arbitrary directory.

OK, so it sounds like the best thing is actually to record the subvolume *number* (ID) where (in my case) Syslinux is installed. This is actually a good thing because the fewer O(n) strings I have to stick into the boot block the better.

b. Are there better ways (walking the tree using BTRFS_IOC_TREE_SEARCH?) to accomplish this than using /proc/self/mountinfo?

Not yet, but I'm definitely open to adding them. Let's just hash out what you need and we'll either go through Kay's stuff or add ioctls for you.

Well, I'd be interested in what Kay's stuff actually does. Other than that, I would suggest adding a pair of ioctls: one that, when executed on an arbitrary btrfs inode, returns the corresponding subvolume, and one which returns the path relative to the subvolume root.

-hpa
Re: Subvolumes and /proc/self/mountinfo
On 06/19/2012 06:16 PM, cwillu wrote: The big reason it isn't here yet is because Kay had this neat patch to blkid and udev to just put all the info you need into /dev/btrfs (or some other suitable location). It would allow you to see which devices belong to which filesystems etc. btrfs should work even without any udev installation. It does; you can always mount with an explicit -o device=/dev/foo,device=/dev/bar if you're inclined to punish yourself^w^w^w^w^w your requirements dictate that you don't rely on udev.

I think you're misunderstanding what this is about. I'm working on trying to make the Syslinux installer for btrfs as robust as it possibly can be. I really don't like leaving corner cases where it will do the wrong thing and leave your system unbootable. Now, that having been said, there are a lot of things that are not really very clear how they should work given btrfs. Specifically, what is needed is:

1. The underlying device(s) for boot block installation.
2. A concept of a root.
3. A concept of a path within that root to the installation directory, where we can find syslinux.cfg and the other bootloader modules.

All of this needs to be installed in the fixed-sized boot block, so a compact representation is very much a plus. The concept of what is the root and what is the path is straightforward for lesser filesystems: the root of the filesystem is defined by the root inode, and the path is a unique sequence of directories from that root. Note that this is completely independent of how the filesystem was mounted when the boot loader was installed.

For btrfs, a lot of things aren't so clear-cut, especially in the light of explicit and implicit subvolumes. Furthermore, sorting out these semantic issues is really important in order to support the atomic update scenario:

a. Make a snapshot of the current root;
b. Mount said snapshot;
c. Install the new distro on the snapshot;
d.
Change the bootloader configuration *inside* the snapshot to point to the snapshot as the root;
e. Install the bootloader on the snapshot, thereby making the boot block point to it and making it live.

If the root also contains subvolumes, e.g. /boot may be a subvolume because it has different policies, this gets pretty gnarly to get right. It is also very high-value to get right. So it is possible I'm approaching this wrong. I would love to have a discussion about this.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Subvolumes and /proc/self/mountinfo
I'm trying to figure out an algorithm for taking an arbitrary mounted btrfs directory and breaking it down into: device(s), subvolume, subpath -- where, keep in mind, subpath may not actually be part of the mount. /proc/self/mountinfo seems to have some of that information; however, it does not appear to distinguish between non-default subvolumes and directories. At the same time, once I have mounted a subvolume I see its name in the root btrfs directory even if I didn't access it. Questions, thus:

a. Are subvolumes always part of the root namespace? If so, is it the mounted root, the default subvolume, or subvolume 0 which always exposes these other subvolumes? Are there disambiguation rules, so that if I have /btrfs/root/blah and blah is both a subvolume and a directory (I presume that can happen?) it is clear which one is meant?

b. Are there better ways (walking the tree using BTRFS_IOC_TREE_SEARCH?) to accomplish this than using /proc/self/mountinfo?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
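For reference, the pieces /proc/self/mountinfo does carry can be pulled apart mechanically: proc(5) defines the fixed fields, a variable run of optional fields, a lone "-" separator, and then fstype, source, and super options. A sketch follows; the sample line is made up for illustration, and (as the thread notes) kernels of this era did not always report the subvol name at all:

```python
def parse_mountinfo_line(line):
    """Split one /proc/self/mountinfo line. Assumes no field contains
    a literal ' - ', which holds for ordinary mount entries."""
    left, right = line.rstrip("\n").split(" - ", 1)
    lf = left.split(" ")
    fstype, source, super_opts = right.split(" ")[:3]
    return {
        "mount_id": lf[0],
        "root": lf[3],          # path of the mount root *within* the fs
        "mount_point": lf[4],
        "fstype": fstype,
        "source": source,
        "super_options": super_opts.split(","),
    }

# Illustrative btrfs entry: the subvolume appears as the 'root' field
sample = "36 25 0:32 /subvol /mnt rw,relatime shared:1 - btrfs /dev/sda2 rw,subvolid=257,subvol=/subvol"
info = parse_mountinfo_line(sample)
```

The 'root' field is the only place the subvolume path is visible here, and it cannot distinguish a subvolume from a plain directory, which is exactly the ambiguity raised in question a.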
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
http://www.cs.utk.edu/~plank/plank/papers/CS-96-332.html even describes an implementation _very_ similar to the current code, right down to using a table for the logarithm and inverse logarithm calculations.

We don't use a table for logarithm and inverse logarithm calculations. Any time you do a table lookup you commit suicide from a performance standpoint.

-hpa
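To make the arithmetic concrete: the table-free trick is that multiplying by the generator g = 2 in GF(2^8), with the RAID-6 polynomial x^8 + x^4 + x^3 + x^2 + 1, is just a shift plus a conditional XOR with 0x1d -- an operation that maps straight onto SIMD instructions, unlike a log/antilog table lookup. A toy Python sketch of the P/Q syndrome computation built on that idea (illustrative only, not the kernel implementation, which does the same thing a vector register at a time):

```python
RAID6_GFPOLY = 0x1d  # low byte of x^8 + x^4 + x^3 + x^2 + 1

def gf_mul2(x):
    """Multiply by the generator g = 2 in GF(2^8): shift, then reduce.
    No table lookup; the branch-free form of this vectorizes trivially."""
    return ((x << 1) ^ (RAID6_GFPOLY if x & 0x80 else 0)) & 0xff

def syndromes(disks):
    """RAID-6 syndromes over equal-length data blocks:
    P = XOR of all blocks, Q = sum of g**i * D_i via Horner's rule."""
    n = len(disks[0])
    p, q = bytearray(n), bytearray(n)
    for d in reversed(disks):          # q = q*g + d at each step
        for i, b in enumerate(d):
            p[i] ^= b
            q[i] = gf_mul2(q[i]) ^ b
    return p, q

def recover_one_from_p(disks, lost, p):
    """Rebuild a single failed data block from P and the survivors."""
    out = bytearray(p)
    for j, d in enumerate(disks):
        if j != lost:
            for i, b in enumerate(d):
                out[i] ^= b
    return bytes(out)
```

Recovering from a double failure additionally needs Q and division in GF(2^8); that is where the Reed-Solomon generality, and the optimization effort discussed in this thread, comes in.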
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
David Woodhouse wrote: At this point we've actually implemented the fundamental parts of RAID[56] support in btrfs, and it's looking like all we really want is the arithmetic routines.

Given that you have no legacy requirements, and that supporting more than two disks may be interesting, it may very well be worth spending some time on new codes now rather than later. Part of that investigation, though, is going to have to be if and how they can be accelerated.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
David Woodhouse wrote: I'm only interested in what we can use directly within btrfs -- and ideally I do want something which gives me an _arbitrary_ number of redundant blocks, rather than limiting me to 2. But the legacy code is good enough for now¹. When I get round to wanting more, I was thinking of lifting something like http://git.infradead.org/mtd-utils.git?a=blob;f=fec.c to start with, and maybe hoping that someone cleverer will come up with something better. The less I have to deal with Galois Fields, the happier I'll be.

Well, if you want something with more than 2-block redundancy you need something other than the existing RAID-6 code which, as you know, is a special case of general Reed-Solomon coding that I happen to have spent a lot of time optimizing. The FEC code is not optimized at all as far as I can tell, and certainly doesn't use SSE in any way -- never mind the GF accelerators that are starting to appear. That doesn't mean it *couldn't*, just that noone has done the work to either implement it or prove it can't be done.

Either way, perhaps the Plank paper that Ric pointed to could be useful as a starting point; it's probably worth taking their performance numbers with a *major* grain of salt: their implementation of RAID-6 "RS-Opt", which is supposed to be equivalent to my code, performs at 400 MB/s, which is less than Pentium III-era performance of the real-world code (they compare not to real code but to their own implementation in Java, called Jerasure.) Implementability using real array instruction sets is key to decent performance.

-hpa
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
Ric Wheeler wrote: Worth sharing a pointer to a really neat set of papers that describe open source friendly RAID6 and erasure encoding algorithms that were presented last year and this at FAST: http://www.cs.utk.edu/~plank/plank/papers/papers.html If I remember correctly, James Plank's papers also have implemented and benchmarked the various encodings,

I have seen the papers; I'm not sure it really makes that much difference. One of the things that bugs me about these papers is that he compares to *his* implementation of my optimizations, but not to my code. In real life implementations, on commodity hardware, we're limited by memory and disk performance, not by CPU utilization.

-hpa
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
Ric Wheeler wrote: The bottom line is pretty much this: the cost of changing the encoding would appear to outweigh the benefit. I'm not trying to claim the Linux RAID-6 implementation is optimal, but it is simple and appears to be fast enough that the math isn't the bottleneck. Cost? Think about how to get free grad student hours testing out things that you might or might not want to leverage on down the road :-)

Cost, yes, of changing an on-disk format.

-hpa
Re: [PATCH 1/4] md: Factor out RAID6 algorithms into lib/
Ric Wheeler wrote: The main flaw, as I said, is in the phrase "as implemented by the Jerasure library". He's comparing his own implementations of various algorithms, not optimized implementations. The bottom line is pretty much this: the cost of changing the encoding would appear to outweigh the benefit. I'm not trying to claim the Linux RAID-6 implementation is optimal, but it is simple and appears to be fast enough that the math isn't the bottleneck. Cost? Think about how to get free grad student hours testing out things that you might or might not want to leverage on down the road :-)

Anyway... I don't really care too much. If someone wants to redesign the Linux RAID-6 and Neil decides to take it I'm not going to object. I'm also not very likely to do any work on it.

-hpa
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote: Hm, GCC uses __restrict__, right? I'm wondering whether there's any internal tie-up between alias analysis and the __restrict__ keyword - so if we turn off aliasing optimizations the __restrict__ keyword's optimizations are turned off as well.

Actually I suspect that restrict makes little difference for inlines or even statics, since gcc generally can do alias analysis fine there. However, in the presence of an intermodule function call, all alias analysis is off. This is presumably why type-based analysis is used at all ... to at least be able to do a modicum of, say, loop-invariant removal in the presence of a library call. This is also where restrict comes into play.

-hpa
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Andi Kleen wrote: On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote: Something at the back of my mind said aliasing.

$ gcc linus.c -O2 -S ; grep subl linus.s
	subl	$1624, %esp
$ gcc linus.c -O2 -S -fno-strict-aliasing ; grep subl linus.s
	subl	$824, %esp

That's with 4.3.2. Interesting. Nonsensical, but interesting. What I find nonsensical is that -fno-strict-aliasing generates better code here. Normally one would expect the compiler seeing more aliases with that option and then be more conservative regarding any sharing. But it seems to be the other way round here.

For this to be convolved with aliasing *AT ALL* indicates this is done incorrectly. This is about storage allocation, not aliases. Storage allocation only depends on lifetime.

-hpa
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: On Fri, 9 Jan 2009, Ingo Molnar wrote:

-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static __asm_inline int
+constant_test_bit(int nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr % BITS_PER_LONG)) &
 		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;

This makes absolutely no sense. It's called __always_inline, not __asm_inline. Why add a new nonsensical annotation like that?

__asm_inline was my suggestion, to distinguish "inline this unconditionally because gcc screws up in the presence of asm()" versus "inline this unconditionally because the world ends if it isn't" -- to tell the human reader, not gcc. I guess the above is a good indicator that the __asm_inline might have been a bad name.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Ingo Molnar wrote: My goal is to make the kernel smaller and faster, and as far as the placement of 'inline' keywords goes, i dont have too strong feelings about how it's achieved: they have a certain level of documentation value [signalling that a function is _intended_ to be lightweight] but otherwise they are pretty neutral attributes to me.

As far as naming is concerned, gcc effectively supports four levels, which *currently* map onto macros as follows:

__always_inline   Inline unconditionally
inline            Inlining hint
(nothing)         Standard heuristics
noinline          Uninline unconditionally

A lot of noise is being made about the naming of the levels (and I personally believe we should have a different annotation for "inline unconditionally for correctness" and "inline unconditionally for performance", as a documentation issue), but those are the four we get.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Dirk Hohndel wrote: Does gcc actually follow the promise? If that's the case (and if it's considered a bug when it doesn't), then we can get what Linus wants by annotating EVERY function with either __always_inline or noinline.

__always_inline and noinline do work.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Linus Torvalds wrote: So we do have special issues. And exactly _because_ we have special issues we should also expect that some compiler defaults simply won't ever really be appropriate for us.

That is, of course, true. However, the Linux kernel (and quite a few other kernels) is a very important customer of gcc, and adding sustainable modes for the kernel that we can rely on is probably something we can work with them on. I think the relationship between the gcc and Linux kernel people is unnecessarily infected, and cultivating a more constructive relationship would be good. I suspect a big part of the reason for the oddities is that the timeline for the kernel community from making a request of gcc until we can actually rely on it is *very* long, and so we end up having to work around things no matter what (usually with copious invective), and the gcc people have other customers with shorter lead times which therefore drive their development more.

-hpa
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Richard Guenther wrote: But it's also not inconceivable that gcc adds a -fkernel-inlining or similar that changes the parameters if we ask nicely. I suppose actually such a parameter would be useful for far more programs than the kernel. I think that the kernel is a perfect target to optimize default -Os behavior for (whereas template-heavy C++ programs are a target to optimize -O2 for). And I think we did a good job in listening to kernel developers when, once in a while, they tried to talk to us - GCC 4.3 should be good at compiling the kernel with default -Os settings. We, unfortunately, cannot retroactively fix old versions that kernel developers happen to like and still use.

Unfortunately I think there have been a lot of "we can't talk to them" on both sides of the kernel-gcc interface, which is incredibly unfortunate. I personally try to at least observe gcc development, including monitoring #gcc and knowing enough about gcc internals to write a (crappy) port, but I can hardly call myself a gcc expert. Still, I am willing to spend some significant time interfacing with anyone in the gcc community willing to spend the effort. I think we can do good stuff.

-hpa
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Richard Guenther wrote: On Fri, Jan 9, 2009 at 8:21 PM, Andi Kleen a...@firstfloor.org wrote: GCC 4.3 should be good in compiling the kernel with default -Os settings. It's unfortunately not. It doesn't inline a lot of simple asm() inlines for example. Reading Ingo's posting with the actual numbers states the opposite.

Well, Andi's patch forcing inlining of the bitops chops quite a bit of size off the kernel, so there is clearly room for improvement. From my post yesterday:

: voreg 64 ; size o.*/vmlinux
    text     data      bss      dec     hex filename
57590217 24940519 15560504 98091240 5d8c0e8 o.andi/vmlinux
59421552 24912223 15560504 99894279 5f44407 o.noopt/vmlinux
57700527 24950719 15560504 98211750 5da97a6 o.opty/vmlinux

110 KB of code size reduction by force-inlining the small bitops.

-hpa
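As a quick sanity check of the quoted figure, taking o.opty (the default-heuristics build) as the baseline against o.andi (Andi's force-inlined build):

```python
# text-segment sizes in bytes, copied from the `size` output quoted above
text_opty = 57_700_527  # o.opty/vmlinux
text_andi = 57_590_217  # o.andi/vmlinux

saving = text_opty - text_andi
print(f"force-inlining the bitops saves {saving} bytes (~{saving / 1000:.0f} kB)")
```

The difference is 110,310 bytes, matching the "110 KB" claim.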
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Andi Kleen wrote: Fetch a gigabyte's worth of data for the debuginfo RPM? The suse 11.0 kernel debuginfo is ~120M.

Still, though, hardly worth doing client-side when it can be done server-side for all the common distro kernels. For custom kernels, not so, but there you should already have the debuginfo locally. And yes, there are probably residual holes, but it's questionable if it matters.

-hpa
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: And quite often, some of them go away - or at least shrink a lot - when some config option or other isn't set. So sometimes it's an inline because a certain class of people really want it inlined, simply because for _them_ it makes sense, but when you enable debugging or something, it absolutely explodes.

And this is really why getting static inline annotations right is really hard if not impossible in the general case (especially when considering the sheer number of architectures we compile on.) So making it possible for the compiler to do the right thing for at least this class of functions really does seem like a good idea.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Arjan van de Ven wrote: thinking about this.. making a pastebin-like thing for oopses is relatively trivial for me; all the building blocks I have already. The hard part is getting the vmlinux files in place. Right now I do this manually for popular released kernels.. if the fedora/suse guys would help to at least have the vmlinux for their released updates easily available, that would be a huge help; without that it's going to suck.

We could just pick them up automatically from the kernel.org mirrors with a little bit of scripting.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote: Apparently it messes up with asm()s: it doesnt know the contents of the asm() and hence it over-estimates the size [based on string heuristics] ...

Right. gcc simply doesn't have any way to know how heavyweight an asm() statement is, and it WILL do the wrong thing in many cases -- especially the ones which involve an out-of-line recovery stub. This is due to a fundamental design decision in gcc to not integrate the compiler and assembler (which some compilers do.)

Which is bad - asm()s tend to be the most important entities to inline - all over our fastpaths. Despite that messup it's still a 1% net size win:

   text    data    bss     dec    hex filename
7109652 1464684 802888 9377224 8f15c8 vmlinux.always-inline
7046115 1465324 802888 9314327 8e2017 vmlinux.optimized-inlining

That win is mixed in slowpath and fastpath as well.

The good part here is that the assembly ones really don't have much subtlety -- a function call is at least five bytes, usually more once you count in the register spill penalties -- so __always_inline-ing them should still end up with numbers looking very much like the above.

I see three options:

- Disable CONFIG_OPTIMIZE_INLINING=y altogether (it's already default-off)
- Change the asm() inline markers to something new like asm_inline, which defaults to __always_inline.
- Just mark all asm() inline markers as __always_inline - realizing that these should never ever be out of line.

We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. I'll try to annotate the inline asms (there's not _that_ many of them), and measure what the size impact is.

The main reason to do #2 over #3 would be for programmer documentation. There simply should be no reason to ever out-of-line these. However, documenting the reason to the programmer is a valuable thing in itself.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.
I don't speak on their behalf.
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote:
>> We might still try the second or third options, as I think we
>> shouldn't go back into the business of managing the inline
>> attributes of ~100,000 kernel functions.
>
> Or just make it clear that inline shouldn't (unless for a very good
> reason) _ever_ be used in a .c file.

The question is if that would produce acceptable quality code.  In
theory it should, but I'm more than wondering if it really will.

It would be ideal, of course, as it would mean less typing.

I guess we could try it out by disabling any inline in the current
code that isn't __always_inline...

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
> First off, gcc _does_ have a perfectly fine notion of how
> heavy-weight an asm statement is: just count it as a single
> instruction (and count the argument setup cost that gcc _can_
> estimate).

True.  It's not what it's doing, though.  It looks for '\n' and ';'
characters, and counts the maximum instruction size for each possible
instruction.

The reason why is that gcc's size estimation is partially designed to
select what kind of branches it needs to use on architectures which
have more than one type of branch.  As a result, it tends to
drastically overestimate, on purpose.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote:
> A lot of code was written assuming inline means __always_inline.  I'd
> suggest keeping that assumption and working on removing inlines that
> aren't strictly necessary, as there's no way to know which inlines
> meant 'try to inline' and which ones really should have been
> __always_inline.
>
> Not that I feel _that_ strongly about it.

Actually, we have that reasonably well in hand by now.  There are
still a couple of minor tweaks necessary, but I think we're 90-95%
there already.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.