Re: Readahead for compressed data
On 10/22/21 4:17 AM, Matthew Wilcox wrote:
> As far as I can tell, the following filesystems support compressed data:
>
> bcachefs, btrfs, erofs, ntfs, squashfs, zisofs

Hi Matthew,

There is a new bcachefs mailing list, linux-bcach...@vger.kernel.org, for
bcachefs. I have added it in Cc on this reply, just FYI for you and the
other recipients. Thanks.

Coly Li

> I'd like to make it easier and more efficient for filesystems to
> implement compressed data.  There are a lot of approaches in use today,
> but none of them seem quite right to me.  I'm going to lay out a few
> design considerations next and then propose a solution.  Feel free to
> tell me I've got the constraints wrong, or suggest alternative solutions.
>
> When we call ->readahead from the VFS, the VFS has decided which pages
> are going to be the most useful to bring in, but it doesn't know how
> pages are bundled together into blocks.  As I've learned from talking to
> Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> something we can teach the VFS.
>
> We (David) added readahead_expand() recently to let the filesystem
> opportunistically add pages to the page cache "around" the area
> requested by the VFS.  That reduces the number of times the filesystem
> has to decompress the same block.  But it can fail (due to memory
> allocation failures or pages already being present in the cache).  So
> filesystems still have to implement some kind of fallback.
>
> For many (all?) compression algorithms, the data must be mapped at all
> times.  Calling kmap() and kunmap() would be an intolerable overhead.
> At the same time, we cannot write to a page in the page cache which is
> marked Uptodate.  It might be mapped into userspace, or a read() might
> be in progress against it.  For writable filesystems, it might even be
> dirty!  As far as I know, no compression algorithm supports "holes",
> implying that we must allocate memory which is then discarded.
>
> To me, this calls for a vmap() based approach.  So I'm thinking
> something like ...
>
> void *readahead_get_block(struct readahead_control *ractl, loff_t start,
>                           size_t len);
> void readahead_put_block(struct readahead_control *ractl, void *addr,
>                          bool success);
>
> Once you've figured out which bytes this compressed block expands to,
> you call readahead_get_block(), specifying the offset in the file and
> the length, and get back a pointer.  When you're done decompressing
> that block of the file, you get rid of it again.
>
> It's the job of readahead_get_block() to allocate additional pages into
> the page cache or temporary pages.  readahead_put_block() will mark
> page cache pages as Uptodate if 'success' is true, and unlock them.
> It'll free any temporary pages.
>
> Thoughts?  Anyone want to be the guinea pig?  ;-)
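To make the proposed calls concrete, here is a rough sketch of how a
compressed filesystem's ->readahead might drive them. This is illustrative
pseudocode, not compilable kernel code: readahead_get_block() and
readahead_put_block() are only proposed above, and myfs_readahead(),
fs_block_extent() and fs_decompress() are hypothetical helpers invented
for the sketch.

```c
/* Sketch only: the readahead_get_block()/readahead_put_block() API is
 * the proposal above; fs_block_extent() and fs_decompress() stand in
 * for filesystem-specific extent lookup and decompression.
 */
static void myfs_readahead(struct readahead_control *ractl)
{
	loff_t pos = readahead_pos(ractl);
	loff_t end = pos + readahead_length(ractl);

	while (pos < end) {
		loff_t start;
		size_t len;
		void *addr;
		bool ok;

		/* Find the uncompressed byte range this block expands to. */
		fs_block_extent(ractl->mapping->host, pos, &start, &len);

		/*
		 * Get one contiguous (vmapped) view of that range; behind
		 * it the VFS allocates page-cache pages where it can and
		 * temporary pages where it can't.
		 */
		addr = readahead_get_block(ractl, start, len);
		if (!addr)
			break;

		/* Decompress the on-disk block into the mapped range. */
		ok = fs_decompress(ractl->mapping->host, start, addr, len);

		/*
		 * On success the page-cache pages are marked Uptodate and
		 * unlocked; temporary pages are freed either way.
		 */
		readahead_put_block(ractl, addr, ok);
		pos = start + len;
	}
}
```

The point of the shape above is that the filesystem never sees which pages
were page-cache pages and which were temporary fallback allocations; that
bookkeeping stays behind the two proposed calls.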
Re: Readahead for compressed data
Jan Kara writes:

> Well, one of the problems with keeping compressed data is that for
> mmap(2) you have to have pages decompressed so that the CPU can access
> them.  So keeping compressed data in the page cache would add a bunch of
> complexity.  That being said, keeping compressed data cached somewhere
> else than in the page cache may certainly be worth it, and then just
> filling the page cache on demand from this data...

True...  Did that multi-generational LRU cache ever get merged?  I was
thinking you could use that to make sure that the kernel prefers to
reclaim the decompressed pages in favor of keeping the compressed ones
around.
Re: Readahead for compressed data
On 2021/10/22 17:54, Gao Xiang wrote:
> On Fri, Oct 22, 2021 at 05:39:29PM +0800, Gao Xiang wrote:
> > Hi Qu,
> >
> > On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
> > > On 2021/10/22 17:11, Gao Xiang wrote:
> > > > On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> > > > > On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > > > > > Matthew Wilcox writes:
> > > > > > > As far as I can tell, the following filesystems support
> > > > > > > compressed data:
> > > > > > >
> > > > > > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > > > > >
> > > > > > > I'd like to make it easier and more efficient for filesystems
> > > > > > > to implement compressed data.  There are a lot of approaches
> > > > > > > in use today, but none of them seem quite right to me.  I'm
> > > > > > > going to lay out a few design considerations next and then
> > > > > > > propose a solution.  Feel free to tell me I've got the
> > > > > > > constraints wrong, or suggest alternative solutions.
> > > > > > >
> > > > > > > When we call ->readahead from the VFS, the VFS has decided
> > > > > > > which pages are going to be the most useful to bring in, but
> > > > > > > it doesn't know how pages are bundled together into blocks.
> > > > > > > As I've learned from talking to Gao Xiang, sometimes the
> > > > > > > filesystem doesn't know either, so this isn't something we
> > > > > > > can teach the VFS.
> > > > > > >
> > > > > > > We (David) added readahead_expand() recently to let the
> > > > > > > filesystem opportunistically add pages to the page cache
> > > > > > > "around" the area requested by the VFS.  That reduces the
> > > > > > > number of times the filesystem has to decompress the same
> > > > > > > block.  But it can fail (due to memory allocation failures or
> > > > > > > pages already being present in the cache).  So filesystems
> > > > > > > still have to implement some kind of fallback.
> > > > > >
> > > > > > Wouldn't it be better to keep the *compressed* data in the
> > > > > > cache and decompress it multiple times if needed rather than
> > > > > > decompress it once and cache the decompressed data?  You would
> > > > > > use more CPU time decompressing multiple times, but be able to
> > > > > > cache more data and avoid more disk IO, which is generally far
> > > > > > slower than the CPU can decompress the data.
> > > > >
> > > > > Well, one of the problems with keeping compressed data is that
> > > > > for mmap(2) you have to have pages decompressed so that the CPU
> > > > > can access them.  So keeping compressed data in the page cache
> > > > > would add a bunch of complexity.  That being said, keeping
> > > > > compressed data cached somewhere else than in the page cache may
> > > > > certainly be worth it, and then just filling the page cache on
> > > > > demand from this data...
> > > >
> > > > It can be cached with a special internal inode, so no need to take
> > > > care of the memory reclaim or migration by yourself.
> > >
> > > There is another problem for btrfs (and maybe other fses).
> > >
> > > Btrfs can refer to part of the uncompressed data extent.
> > >
> > > Thus we could have the following cases:
> > >
> > > 0     4K    8K    12K
> > > |     |     |     |
> > > |     \- 4K of a 128K compressed extent,
> > > |        compressed size is 16K
> > > \- 4K of a 64K compressed extent,
> > >    compressed size is 8K
> >
> > Thanks for this, but the diagram is broken on my side.
> > Could you help fix this?
>
> Ok, I understand it.  I think this is really a strategy problem out of
> CoW: since only 2*4K is needed, you could
> 1) cache the whole compressed extent and hope it can be accessed later,
>    so no I/O at all later;
> 2) not cache such incomplete compressed extents;
> 3) add some trace records and apply some finer strategy.

Yeah, this should be determined by each fs: whether they want to cache
compressed extents at all, and under which conditions to cache a
compressed extent.

(e.g. for btrfs, if we find the range we want is even smaller than the
compressed size, we can skip such caching)

Thus I don't think there would be a silver bullet for such cases.

> > > In the above case, we really only need 8K of page cache, but if
> > > we're caching the compressed extents, they will take an extra 24K.
>
> > Apart from that, with my wild guess, could we cache whatever the real
> > I/O is rather than the whole compressed extent unconditionally?  If
> > the whole compressed extent is needed later, we could check whether
> > it's all available in the cache, and read the rest otherwise.
> >
> > Also, I think there is no need to cache uncompressed CoW data...

Yeah, that's definitely the case, as the page cache is already doing
that work for us.

Thanks,
Qu

> > Thanks,
> > Gao Xiang
> >
> > > It's definitely not really worth it for this particular corner case.
> > >
> > > But it would be pretty common for btrfs, as CoW on compressed data
> > > can easily lead to the above cases.
> > >
> > > Thanks,
> > > Qu
> > >
> > > > Otherwise, these all need to be taken care of.  For fixed-sized
> > > > input compression, since pages are reclaimed in page units, it
> > > > won't be quite friendly, since such data is all coupled.  But for
> > > > fixed-sized output compression, it's quite natural.
> > > >
> > > > Thanks,
> > > > Gao Xiang
> > > >
> > > > > Honza
> > > > > --
> > > > > Jan Kara
> > > > > SUSE Labs, CR
Re: Readahead for compressed data
On Fri, Oct 22, 2021 at 05:39:29PM +0800, Gao Xiang wrote:
> Hi Qu,
>
> On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
> > On 2021/10/22 17:11, Gao Xiang wrote:
> > > On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> > > > On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > > > > Matthew Wilcox writes:
> > > > > > As far as I can tell, the following filesystems support
> > > > > > compressed data:
> > > > > >
> > > > > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > > > >
> > > > > > I'd like to make it easier and more efficient for filesystems
> > > > > > to implement compressed data.  There are a lot of approaches in
> > > > > > use today, but none of them seem quite right to me.  I'm going
> > > > > > to lay out a few design considerations next and then propose a
> > > > > > solution.  Feel free to tell me I've got the constraints wrong,
> > > > > > or suggest alternative solutions.
> > > > > >
> > > > > > When we call ->readahead from the VFS, the VFS has decided
> > > > > > which pages are going to be the most useful to bring in, but it
> > > > > > doesn't know how pages are bundled together into blocks.  As
> > > > > > I've learned from talking to Gao Xiang, sometimes the
> > > > > > filesystem doesn't know either, so this isn't something we can
> > > > > > teach the VFS.
> > > > > >
> > > > > > We (David) added readahead_expand() recently to let the
> > > > > > filesystem opportunistically add pages to the page cache
> > > > > > "around" the area requested by the VFS.  That reduces the
> > > > > > number of times the filesystem has to decompress the same
> > > > > > block.  But it can fail (due to memory allocation failures or
> > > > > > pages already being present in the cache).  So filesystems
> > > > > > still have to implement some kind of fallback.
> > > > >
> > > > > Wouldn't it be better to keep the *compressed* data in the cache
> > > > > and decompress it multiple times if needed rather than decompress
> > > > > it once and cache the decompressed data?  You would use more CPU
> > > > > time decompressing multiple times, but be able to cache more data
> > > > > and avoid more disk IO, which is generally far slower than the
> > > > > CPU can decompress the data.
> > > >
> > > > Well, one of the problems with keeping compressed data is that for
> > > > mmap(2) you have to have pages decompressed so that the CPU can
> > > > access them.  So keeping compressed data in the page cache would
> > > > add a bunch of complexity.  That being said, keeping compressed
> > > > data cached somewhere else than in the page cache may certainly be
> > > > worth it, and then just filling the page cache on demand from this
> > > > data...
> > >
> > > It can be cached with a special internal inode, so no need to take
> > > care of the memory reclaim or migration by yourself.
> >
> > There is another problem for btrfs (and maybe other fses).
> >
> > Btrfs can refer to part of the uncompressed data extent.
> >
> > Thus we could have the following cases:
> >
> > 0     4K    8K    12K
> > |     |     |     |
> > |     \- 4K of a 128K compressed extent,
> > |        compressed size is 16K
> > \- 4K of a 64K compressed extent,
> >    compressed size is 8K
>
> Thanks for this, but the diagram is broken on my side.
> Could you help fix this?

Ok, I understand it.  I think this is really a strategy problem out of
CoW: since only 2*4K is needed, you could
1) cache the whole compressed extent and hope it can be accessed later,
   so no I/O at all later;
2) not cache such incomplete compressed extents;
3) add some trace records and apply some finer strategy.

> > In the above case, we really only need 8K of page cache, but if we're
> > caching the compressed extents, they will take an extra 24K.

Apart from that, with my wild guess, could we cache whatever the real
I/O is rather than the whole compressed extent unconditionally?  If the
whole compressed extent is needed later, we could check whether it's all
available in the cache, and read the rest otherwise.

Also, I think there is no need to cache uncompressed CoW data...

Thanks,
Gao Xiang

> > It's definitely not really worth it for this particular corner case.
> >
> > But it would be pretty common for btrfs, as CoW on compressed data can
> > easily lead to the above cases.
> >
> > Thanks,
> > Qu
> >
> > > Otherwise, these all need to be taken care of.  For fixed-sized
> > > input compression, since pages are reclaimed in page units, it won't
> > > be quite friendly, since such data is all coupled.  But for
> > > fixed-sized output compression, it's quite natural.
> > >
> > > Thanks,
> > > Gao Xiang
> > >
> > > > Honza
Re: Readahead for compressed data
Hi Qu,

On Fri, Oct 22, 2021 at 05:22:28PM +0800, Qu Wenruo wrote:
> On 2021/10/22 17:11, Gao Xiang wrote:
> > On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> > > On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > > > Matthew Wilcox writes:
> > > > > As far as I can tell, the following filesystems support
> > > > > compressed data:
> > > > >
> > > > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > > >
> > > > > I'd like to make it easier and more efficient for filesystems to
> > > > > implement compressed data.  There are a lot of approaches in use
> > > > > today, but none of them seem quite right to me.  I'm going to lay
> > > > > out a few design considerations next and then propose a solution.
> > > > > Feel free to tell me I've got the constraints wrong, or suggest
> > > > > alternative solutions.
> > > > >
> > > > > When we call ->readahead from the VFS, the VFS has decided which
> > > > > pages are going to be the most useful to bring in, but it doesn't
> > > > > know how pages are bundled together into blocks.  As I've learned
> > > > > from talking to Gao Xiang, sometimes the filesystem doesn't know
> > > > > either, so this isn't something we can teach the VFS.
> > > > >
> > > > > We (David) added readahead_expand() recently to let the
> > > > > filesystem opportunistically add pages to the page cache "around"
> > > > > the area requested by the VFS.  That reduces the number of times
> > > > > the filesystem has to decompress the same block.  But it can fail
> > > > > (due to memory allocation failures or pages already being present
> > > > > in the cache).  So filesystems still have to implement some kind
> > > > > of fallback.
> > > >
> > > > Wouldn't it be better to keep the *compressed* data in the cache
> > > > and decompress it multiple times if needed rather than decompress
> > > > it once and cache the decompressed data?  You would use more CPU
> > > > time decompressing multiple times, but be able to cache more data
> > > > and avoid more disk IO, which is generally far slower than the CPU
> > > > can decompress the data.
> > >
> > > Well, one of the problems with keeping compressed data is that for
> > > mmap(2) you have to have pages decompressed so that the CPU can
> > > access them.  So keeping compressed data in the page cache would add
> > > a bunch of complexity.  That being said, keeping compressed data
> > > cached somewhere else than in the page cache may certainly be worth
> > > it, and then just filling the page cache on demand from this data...
> >
> > It can be cached with a special internal inode, so no need to take
> > care of the memory reclaim or migration by yourself.
>
> There is another problem for btrfs (and maybe other fses).
>
> Btrfs can refer to part of the uncompressed data extent.
>
> Thus we could have the following cases:
>
> 0     4K    8K    12K
> |     |     |     |
> |     \- 4K of a 128K compressed extent,
> |        compressed size is 16K
> \- 4K of a 64K compressed extent,
>    compressed size is 8K

Thanks for this, but the diagram is broken on my side.
Could you help fix this?

> In the above case, we really only need 8K of page cache, but if we're
> caching the compressed extents, they will take an extra 24K.

Apart from that, with my wild guess, could we cache whatever the real
I/O is rather than the whole compressed extent unconditionally?  If the
whole compressed extent is needed later, we could check whether it's all
available in the cache, and read the rest otherwise.

Also, I think there is no need to cache uncompressed CoW data...

Thanks,
Gao Xiang

> It's definitely not really worth it for this particular corner case.
>
> But it would be pretty common for btrfs, as CoW on compressed data can
> easily lead to the above cases.
>
> Thanks,
> Qu
>
> > Otherwise, these all need to be taken care of.  For fixed-sized input
> > compression, since pages are reclaimed in page units, it won't be
> > quite friendly, since such data is all coupled.  But for fixed-sized
> > output compression, it's quite natural.
> >
> > Thanks,
> > Gao Xiang
> >
> > > Honza
> > > --
> > > Jan Kara
> > > SUSE Labs, CR
Re: Readahead for compressed data
On 2021/10/22 17:11, Gao Xiang wrote:
> On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> > On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > > Matthew Wilcox writes:
> > > > As far as I can tell, the following filesystems support compressed
> > > > data:
> > > >
> > > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > > >
> > > > I'd like to make it easier and more efficient for filesystems to
> > > > implement compressed data.  There are a lot of approaches in use
> > > > today, but none of them seem quite right to me.  I'm going to lay
> > > > out a few design considerations next and then propose a solution.
> > > > Feel free to tell me I've got the constraints wrong, or suggest
> > > > alternative solutions.
> > > >
> > > > When we call ->readahead from the VFS, the VFS has decided which
> > > > pages are going to be the most useful to bring in, but it doesn't
> > > > know how pages are bundled together into blocks.  As I've learned
> > > > from talking to Gao Xiang, sometimes the filesystem doesn't know
> > > > either, so this isn't something we can teach the VFS.
> > > >
> > > > We (David) added readahead_expand() recently to let the filesystem
> > > > opportunistically add pages to the page cache "around" the area
> > > > requested by the VFS.  That reduces the number of times the
> > > > filesystem has to decompress the same block.  But it can fail (due
> > > > to memory allocation failures or pages already being present in
> > > > the cache).  So filesystems still have to implement some kind of
> > > > fallback.
> > >
> > > Wouldn't it be better to keep the *compressed* data in the cache and
> > > decompress it multiple times if needed rather than decompress it
> > > once and cache the decompressed data?  You would use more CPU time
> > > decompressing multiple times, but be able to cache more data and
> > > avoid more disk IO, which is generally far slower than the CPU can
> > > decompress the data.
> >
> > Well, one of the problems with keeping compressed data is that for
> > mmap(2) you have to have pages decompressed so that the CPU can access
> > them.  So keeping compressed data in the page cache would add a bunch
> > of complexity.  That being said, keeping compressed data cached
> > somewhere else than in the page cache may certainly be worth it, and
> > then just filling the page cache on demand from this data...
>
> It can be cached with a special internal inode, so no need to take care
> of the memory reclaim or migration by yourself.

There is another problem for btrfs (and maybe other fses).

Btrfs can refer to part of the uncompressed data extent.

Thus we could have the following cases:

0     4K    8K    12K
|     |     |     |
|     \- 4K of a 128K compressed extent,
|        compressed size is 16K
\- 4K of a 64K compressed extent,
   compressed size is 8K

In the above case, we really only need 8K of page cache, but if we're
caching the compressed extents, they will take an extra 24K.

It's definitely not really worth it for this particular corner case.

But it would be pretty common for btrfs, as CoW on compressed data can
easily lead to the above cases.

Thanks,
Qu

> Otherwise, these all need to be taken care of.  For fixed-sized input
> compression, since pages are reclaimed in page units, it won't be quite
> friendly, since such data is all coupled.  But for fixed-sized output
> compression, it's quite natural.
>
> Thanks,
> Gao Xiang
>
> > Honza
> > --
> > Jan Kara
> > SUSE Labs, CR
Re: Readahead for compressed data
On Fri, Oct 22, 2021 at 10:41:27AM +0200, Jan Kara wrote:
> On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> > Matthew Wilcox writes:
> > > As far as I can tell, the following filesystems support compressed
> > > data:
> > >
> > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > >
> > > I'd like to make it easier and more efficient for filesystems to
> > > implement compressed data.  There are a lot of approaches in use
> > > today, but none of them seem quite right to me.  I'm going to lay out
> > > a few design considerations next and then propose a solution.  Feel
> > > free to tell me I've got the constraints wrong, or suggest
> > > alternative solutions.
> > >
> > > When we call ->readahead from the VFS, the VFS has decided which
> > > pages are going to be the most useful to bring in, but it doesn't
> > > know how pages are bundled together into blocks.  As I've learned
> > > from talking to Gao Xiang, sometimes the filesystem doesn't know
> > > either, so this isn't something we can teach the VFS.
> > >
> > > We (David) added readahead_expand() recently to let the filesystem
> > > opportunistically add pages to the page cache "around" the area
> > > requested by the VFS.  That reduces the number of times the
> > > filesystem has to decompress the same block.  But it can fail (due to
> > > memory allocation failures or pages already being present in the
> > > cache).  So filesystems still have to implement some kind of
> > > fallback.
> >
> > Wouldn't it be better to keep the *compressed* data in the cache and
> > decompress it multiple times if needed rather than decompress it once
> > and cache the decompressed data?  You would use more CPU time
> > decompressing multiple times, but be able to cache more data and avoid
> > more disk IO, which is generally far slower than the CPU can
> > decompress the data.
>
> Well, one of the problems with keeping compressed data is that for
> mmap(2) you have to have pages decompressed so that the CPU can access
> them.  So keeping compressed data in the page cache would add a bunch of
> complexity.  That being said, keeping compressed data cached somewhere
> else than in the page cache may certainly be worth it, and then just
> filling the page cache on demand from this data...

It can be cached with a special internal inode, so no need to take care
of the memory reclaim or migration by yourself.  Otherwise, these all
need to be taken care of.  For fixed-sized input compression, since
pages are reclaimed in page units, it won't be quite friendly, since
such data is all coupled.  But for fixed-sized output compression, it's
quite natural.

Thanks,
Gao Xiang

> Honza
> --
> Jan Kara
> SUSE Labs, CR
Re: Readahead for compressed data
On Thu 21-10-21 21:04:45, Phillip Susi wrote:
> Matthew Wilcox writes:
> > As far as I can tell, the following filesystems support compressed
> > data:
> >
> > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> >
> > I'd like to make it easier and more efficient for filesystems to
> > implement compressed data.  There are a lot of approaches in use
> > today, but none of them seem quite right to me.  I'm going to lay out
> > a few design considerations next and then propose a solution.  Feel
> > free to tell me I've got the constraints wrong, or suggest alternative
> > solutions.
> >
> > When we call ->readahead from the VFS, the VFS has decided which pages
> > are going to be the most useful to bring in, but it doesn't know how
> > pages are bundled together into blocks.  As I've learned from talking
> > to Gao Xiang, sometimes the filesystem doesn't know either, so this
> > isn't something we can teach the VFS.
> >
> > We (David) added readahead_expand() recently to let the filesystem
> > opportunistically add pages to the page cache "around" the area
> > requested by the VFS.  That reduces the number of times the filesystem
> > has to decompress the same block.  But it can fail (due to memory
> > allocation failures or pages already being present in the cache).  So
> > filesystems still have to implement some kind of fallback.
>
> Wouldn't it be better to keep the *compressed* data in the cache and
> decompress it multiple times if needed rather than decompress it once
> and cache the decompressed data?  You would use more CPU time
> decompressing multiple times, but be able to cache more data and avoid
> more disk IO, which is generally far slower than the CPU can decompress
> the data.

Well, one of the problems with keeping compressed data is that for
mmap(2) you have to have pages decompressed so that the CPU can access
them.  So keeping compressed data in the page cache would add a bunch of
complexity.  That being said, keeping compressed data cached somewhere
else than in the page cache may certainly be worth it, and then just
filling the page cache on demand from this data...

Honza
--
Jan Kara
SUSE Labs, CR
Re: Readahead for compressed data
On 21/10/2021 21:17, Matthew Wilcox wrote:
> As far as I can tell, the following filesystems support compressed data:
>
> bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
>
> I'd like to make it easier and more efficient for filesystems to
> implement compressed data.  There are a lot of approaches in use today,
> but none of them seem quite right to me.  I'm going to lay out a few
> design considerations next and then propose a solution.  Feel free to
> tell me I've got the constraints wrong, or suggest alternative solutions.
>
> When we call ->readahead from the VFS, the VFS has decided which pages
> are going to be the most useful to bring in, but it doesn't know how
> pages are bundled together into blocks.  As I've learned from talking to
> Gao Xiang, sometimes the filesystem doesn't know either, so this isn't
> something we can teach the VFS.
>
> We (David) added readahead_expand() recently to let the filesystem
> opportunistically add pages to the page cache "around" the area
> requested by the VFS.  That reduces the number of times the filesystem
> has to decompress the same block.  But it can fail (due to memory
> allocation failures or pages already being present in the cache).  So
> filesystems still have to implement some kind of fallback.
>
> For many (all?) compression algorithms, the data must be mapped at all
> times.  Calling kmap() and kunmap() would be an intolerable overhead.
> At the same time, we cannot write to a page in the page cache which is
> marked Uptodate.  It might be mapped into userspace, or a read() might
> be in progress against it.  For writable filesystems, it might even be
> dirty!  As far as I know, no compression algorithm supports "holes",
> implying that we must allocate memory which is then discarded.
>
> To me, this calls for a vmap() based approach.  So I'm thinking
> something like ...
>
> void *readahead_get_block(struct readahead_control *ractl, loff_t start,
>                           size_t len);
> void readahead_put_block(struct readahead_control *ractl, void *addr,
>                          bool success);
>
> Once you've figured out which bytes this compressed block expands to,
> you call readahead_get_block(), specifying the offset in the file and
> the length, and get back a pointer.  When you're done decompressing
> that block of the file, you get rid of it again.

Hi Matthew,

I quite like this new interface.  It should be fairly straightforward to
make Squashfs use it, and it will simplify some of the code and make some
of the decompressors more efficient.  As I see it, it removes many of the
hoops that Squashfs has to jump through to push the additional
uncompressed data into the page cache.

It is also a generic solution, one which doesn't rely on a particular
decompressor API or way of working, which means it shouldn't break any of
the existing decompressor usage in the kernel.  The one issue with this
generic solution is I fear a lot of people will complain it is too
generic, and that it prevents some optimisations which they could have
made on their particular decompressor or filesystem.  The issue that it
requires temporary pages to be allocated upfront (if the page cannot be
added to the page cache) has already been brought up.

At this point I will try to play devil's advocate.  What is the
alternative to passing back a vmapped area of memory to the filesystem?
The obvious alternative is to pass back an array of pointers to the
individual page structures, with a NULL pointer representing a hole where
the page could not be allocated in the cache.  This alternative interface
has the advantage that a NULL pointer is passed representing a hole,
rather than temporary memory being allocated upfront.  It is then up to
the filesystem and/or decompressor to deal with the NULL pointer hole,
which may avoid the use of so much temporary memory.

But the fact is, in the kernel there are many different decompressors
with different APIs and different ways of working.  There are some (like
zlib, zstd, xz) which are "multi-shot" and some (like lzo, lz4) which are
"single-shot".

Multi-shot decompressors allow themselves to be called with only a small
output buffer.  Once the output buffer runs out, they exit and ask to be
called again with another output buffer.  Squashfs uses that to pass in
the set of discontiguous PAGE-sized buffers allocated from the page
cache.  Obviously, if Squashfs got a NULL pointer hole, it could switch
to using a single temporary 4K buffer at that point.

But single-shot decompressors expect to be called once, with a single
contiguous output buffer.  They don't work with a set of discontiguous
PAGE-sized buffers.  Due to this, Squashfs has to use a contiguous
"bounce buffer" which the decompressor outputs to, and then copy it to
the page cache buffers.

The vmap-based interface proposed is a generic interface: it works with
both "multi-shot" and "single-shot" decompressors, because it presents a
single contiguous output buffer, and avoids
Re: Readahead for compressed data
On Fri, Oct 22, 2021 at 03:09:03AM +0100, Phillip Lougher wrote:
> On 22/10/2021 02:04, Phillip Susi wrote:
> > Matthew Wilcox writes:
> > > As far as I can tell, the following filesystems support compressed
> > > data:
> > >
> > > bcachefs, btrfs, erofs, ntfs, squashfs, zisofs
> > >
> > > I'd like to make it easier and more efficient for filesystems to
> > > implement compressed data.  There are a lot of approaches in use
> > > today, but none of them seem quite right to me.  I'm going to lay
> > > out a few design considerations next and then propose a solution.
> > > Feel free to tell me I've got the constraints wrong, or suggest
> > > alternative solutions.
> > >
> > > When we call ->readahead from the VFS, the VFS has decided which
> > > pages are going to be the most useful to bring in, but it doesn't
> > > know how pages are bundled together into blocks.  As I've learned
> > > from talking to Gao Xiang, sometimes the filesystem doesn't know
> > > either, so this isn't something we can teach the VFS.
> > >
> > > We (David) added readahead_expand() recently to let the filesystem
> > > opportunistically add pages to the page cache "around" the area
> > > requested by the VFS.  That reduces the number of times the
> > > filesystem has to decompress the same block.  But it can fail (due
> > > to memory allocation failures or pages already being present in the
> > > cache).  So filesystems still have to implement some kind of
> > > fallback.
> >
> > Wouldn't it be better to keep the *compressed* data in the cache and
> > decompress it multiple times if needed rather than decompress it once
> > and cache the decompressed data?  You would use more CPU time
> > decompressing multiple times
>
> There is a fairly obvious problem here.  A malicious attacker could
> "trick" the filesystem into endlessly decompressing the same blocks,
> which, because the compressed data is cached, could cause it to use all
> CPU, and cause a DoS attack.  Caching the uncompressed data prevents
> these decompression attacks from occurring so easily.

Yes, that is another good point.  But under artificial memory pressure
there is no difference between caching compressed and decompressed data;
otherwise these pages become unreclaimable, and reclaiming any of the
cached decompressed data also requires decompressing the whole pcluster
again.  The same applies to zram, zswap, or any other software
compression solution: what we can do is limit the CPU utilization if
there is such a requirement.  But that is quite hard for lz4, since
in-memory lz4 decompression is usually faster than the backend storage.

By the way, as far as I know there have been some experimental hardware
accelerators integrated into storage devices, so they don't need to
expand decompressed pages either, and can do in-place I/O only for such
devices.

Thanks,
Gao Xiang

> Phillip
Re: Readahead for compressed data
On 22/10/2021 02:04, Phillip Susi wrote:
> Matthew Wilcox writes:
> > [...]
> >
> > We (David) added readahead_expand() recently to let the filesystem
> > opportunistically add pages to the page cache "around" the area
> > requested by the VFS. That reduces the number of times the filesystem
> > has to decompress the same block. But it can fail (due to memory
> > allocation failures or pages already being present in the cache).
> > So filesystems still have to implement some kind of fallback.
>
> Wouldn't it be better to keep the *compressed* data in the cache and
> decompress it multiple times if needed rather than decompress it once
> and cache the decompressed data? You would use more CPU time
> decompressing multiple times

There is a fairly obvious problem here. A malicious attacker could
"trick" the filesystem into endlessly decompressing the same blocks,
which, because the compressed data is cached, could cause it to use all
CPU and cause a DoS attack. Caching the uncompressed data prevents
these decompression attacks from occurring so easily.

Phillip
Re: Readahead for compressed data
On Fri, Oct 22, 2021 at 09:28:14AM +0800, Gao Xiang wrote:
> On Thu, Oct 21, 2021 at 09:04:45PM -0400, Phillip Susi wrote:
> > Matthew Wilcox writes:
> > [...]
> >
> > Wouldn't it be better to keep the *compressed* data in the cache and
> > decompress it multiple times if needed rather than decompress it once
> > and cache the decompressed data? You would use more CPU time
> > decompressing multiple times, but be able to cache more data and avoid
> > more disk IO, which is generally far slower than the CPU can decompress
> > the data.
>
> Yes, that was also another point I made yesterday on the #xfs IRC
> channel. For high-decompression-speed algorithms like lz4, caching
> compressed data for incomplete read requests is more effective than
> caching uncompressed data, much like zcache or zram (so we don't need
> zcache for EROFS at all). But if such data will be used completely and
> immediately, EROFS will do inplace I/O only, since caching the
> compressed data would only increase memory overhead. Also, since some
> algorithms are slow, inplace I/O is more useful for them. Anyway, it
> depends on the detailed strategy of each algorithm and can be refined
> later.

But I don't think unbounded compressed data caching is generally
useful, especially for devices that need as much free memory as
possible (for example, on Android we need to keep as many apps alive
as possible without jank, otherwise the user experience suffers). So
excessive prefetching/caching only increases memory reclaim overhead
and page cache thrashing without benefit, since compressed data cannot
be used immediately without decompression.

Thanks,
Gao Xiang
Re: Readahead for compressed data
On Thu, Oct 21, 2021 at 09:04:45PM -0400, Phillip Susi wrote:
> Matthew Wilcox writes:
> > [...]
> >
> > We (David) added readahead_expand() recently to let the filesystem
> > opportunistically add pages to the page cache "around" the area
> > requested by the VFS. That reduces the number of times the filesystem
> > has to decompress the same block. But it can fail (due to memory
> > allocation failures or pages already being present in the cache).
> > So filesystems still have to implement some kind of fallback.
>
> Wouldn't it be better to keep the *compressed* data in the cache and
> decompress it multiple times if needed rather than decompress it once
> and cache the decompressed data? You would use more CPU time
> decompressing multiple times, but be able to cache more data and avoid
> more disk IO, which is generally far slower than the CPU can decompress
> the data.

Yes, that was also another point I made yesterday on the #xfs IRC
channel. For high-decompression-speed algorithms like lz4, caching
compressed data for incomplete read requests is more effective than
caching uncompressed data, much like zcache or zram (so we don't need
zcache for EROFS at all). But if such data will be used completely and
immediately, EROFS will do inplace I/O only, since caching the
compressed data would only increase memory overhead. Also, since some
algorithms are slow, inplace I/O is more useful for them. Anyway, it
depends on the detailed strategy of each algorithm and can be refined
later.

Thanks,
Gao Xiang
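The per-pcluster inplace-vs-cached choice Gao describes can be sketched in userspace roughly as follows (the names and the decision heuristic are hypothetical illustrations, not actual EROFS code):

```python
# Hypothetical sketch: with fixed-sized output compression, each
# pcluster decompresses independently, so a filesystem can choose per
# pcluster whether to read the compressed bytes into the page cache
# (cached I/O, useful when only part of the pcluster is requested and
# the rest may be read soon) or directly into the destination pages for
# in-place decompression (inplace I/O, no extra memory overhead).

from dataclasses import dataclass

@dataclass
class Pcluster:
    compressed_len: int      # bytes on disk
    uncompressed_len: int    # bytes after decompression
    fully_requested: bool    # does the read cover the whole pcluster?

def choose_io_strategy(p: Pcluster) -> str:
    # If the read consumes the whole pcluster immediately, caching the
    # compressed bytes only adds memory overhead: decompress in place.
    if p.fully_requested:
        return "inplace"
    # A partial read may be followed by reads of the rest of the
    # pcluster; keeping the compressed bytes avoids re-reading them.
    return "cached"

reqs = [Pcluster(4096, 16384, True), Pcluster(8192, 65536, False)]
print([choose_io_strategy(p) for p in reqs])  # -> ['inplace', 'cached']
```

A real implementation would also fold in the algorithm's decompression speed, as noted above: for a slow algorithm, inplace I/O may be preferable even for partial reads.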
Re: Readahead for compressed data
Matthew Wilcox writes:
> [...]
>
> We (David) added readahead_expand() recently to let the filesystem
> opportunistically add pages to the page cache "around" the area
> requested by the VFS. That reduces the number of times the filesystem
> has to decompress the same block. But it can fail (due to memory
> allocation failures or pages already being present in the cache).
> So filesystems still have to implement some kind of fallback.

Wouldn't it be better to keep the *compressed* data in the cache and
decompress it multiple times if needed rather than decompress it once
and cache the decompressed data? You would use more CPU time
decompressing multiple times, but be able to cache more data and avoid
more disk IO, which is generally far slower than the CPU can decompress
the data.
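A back-of-the-envelope model of the trade-off Phillip raises (all numbers hypothetical, chosen only to illustrate the shape of the argument):

```python
# Illustrative model: with a fixed cache budget, caching compressed
# data holds more file data (higher hit rate) but every hit pays a
# decompression cost; caching uncompressed data pays nothing on a hit
# but misses more often and falls back to much slower disk I/O.

def expected_access_cost(cache_mb, working_set_mb, ratio,
                         disk_us, decompress_us):
    """Expected microseconds per access under a uniform-access model.

    `ratio` is compressed/uncompressed size (e.g. 0.5 for 2:1).
    Returns (cost with uncompressed cache, cost with compressed cache).
    """
    hit_uncomp = min(1.0, cache_mb / working_set_mb)
    # The same budget holds 1/ratio times more data when compressed.
    hit_comp = min(1.0, (cache_mb / ratio) / working_set_mb)
    cost_uncomp = (1 - hit_uncomp) * (disk_us + decompress_us)
    cost_comp = (hit_comp * decompress_us
                 + (1 - hit_comp) * (disk_us + decompress_us))
    return cost_uncomp, cost_comp

# Hypothetical numbers: 256 MiB cache, 1 GiB working set, 2:1 ratio,
# 100 us disk read vs 5 us lz4 decompression per block.
u, c = expected_access_cost(256, 1024, 0.5, 100.0, 5.0)
print(f"uncompressed cache: {u:.2f} us/access")   # 78.75
print(f"compressed cache:   {c:.2f} us/access")   # 55.00
```

With these numbers the compressed cache wins, but the balance flips as decompression gets slower relative to the storage, which is exactly the per-algorithm distinction drawn later in the thread.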
Re: Readahead for compressed data
On Thu, Oct 21, 2021 at 09:17:40PM +0100, Matthew Wilcox wrote:
> [...]
>
> For many (all?) compression algorithms, the data must be mapped at
> all times. Calling kmap() and kunmap() would be an intolerable
> overhead. At the same time, we cannot write to a page in the page
> cache which is marked Uptodate. It might be mapped into userspace, or
> a read() might be in progress against it. For writable filesystems,
> it might even be dirty! As far as I know, no compression algorithm
> supports "holes", implying that we must allocate memory which is then
> discarded.
>
> To me, this calls for a vmap() based approach. So I'm thinking
> something like ...
>
> void *readahead_get_block(struct readahead_control *ractl, loff_t start,
>		size_t len);
> void readahead_put_block(struct readahead_control *ractl, void *addr,
>		bool success);
>
> Once you've figured out which bytes this compressed block expands to,
> you call readahead_get_block(), specifying the offset in the file and
> length, and get back a pointer. When you're done decompressing that
> block of the file, you get rid of it again.
>
> It's the job of readahead_get_block() to allocate additional pages
> into the page cache or temporary pages. readahead_put_block() will
> mark page cache pages as Uptodate if 'success' is true, and unlock
> them. It'll free any temporary pages.
>
> Thoughts? Anyone want to be the guinea pig? ;-)

Copied from IRC for reference as well. As for the vmap() strategy:
there is no need to allocate temporary pages in advance, before the
compressed I/O completes and the async decompression work is
triggered, since both the I/O and the queued work can incur noticeable
latencies. Logically, only the inplace-vs-cached I/O strategy needs to
be decided before I/O, and only the compressed pages need to be
prepared.

The benefit of fixed-sized output compression, aside from its
relatively higher compression ratios, is that each compressed pcluster
can be decompressed completely independently, so inplace or cached I/O
can be selected per pcluster. And when inplace I/O is chosen for a
pcluster, no extra incompressible data is read from disk or cached
uselessly.

As I said, even if the overall read request (spanning many pclusters)
is 1MiB or 2MiB or whatever, at most 16 extra pages (64KiB, used for
the lz4 sliding window) need to be allocated per read request; there
is no need for so many extra temporary pages. Allocating many pages in
advance of I/O, IMO, just increases the overall memory overhead:
low-end devices cannot handle I/O quickly yet have limited memory.
Temporary pages are only needed before decompression, not before I/O.

Thanks,
Gao Xiang
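The 16-page bound mentioned here follows from the lz4 format: a match copy can reference at most the previous 64KiB of output, so only that much recent output ever needs to stay mapped as bounce pages, no matter how large the read request is. A toy illustration (4KiB page size assumed; not kernel code):

```python
# Illustrative: lz4 match offsets are limited to 64 KiB, so a
# decompressor writing into scattered destination pages needs at most
# a 64 KiB window of recent output (16 pages at a 4 KiB page size)
# kept available as temporary pages, independent of the total size of
# the read request.

PAGE = 4096
LZ4_WINDOW = 64 * 1024  # maximum lz4 match-copy back-reference distance

def temp_pages_needed(read_request_bytes: int) -> int:
    # The bound saturates at the window size for any large request.
    return min(read_request_bytes, LZ4_WINDOW) // PAGE

print(temp_pages_needed(1 << 20))  # 1 MiB request -> 16
print(temp_pages_needed(2 << 20))  # 2 MiB request -> 16
```

This is why Gao argues the temporary pages should be allocated just before decompression rather than up front for the whole request: the steady-state need is a small constant, not proportional to the request.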