Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
On Tue, Jun 27, 2017 at 05:27:51AM +, Nick Terrell wrote:
> Adam, I’ve applied the same patch in my tree. I’ll send out the update [1]
> once it's reviewed, since I also reduced the stack usage of functions
> using over 1 KB of stack space.
>
> I have userland tests set up mocking the linux kernel headers, and tested
> 32-bit mode there, but neglected to test the kernel on a 32-bit VM, which
> I’ve now corrected. Thanks for testing the patch on your ARM machine!

Is there a version I should be testing?  I got a bunch of those:

[10170.448783] kworker/u8:6: page allocation stalls for 60720ms, order:0, mode:0x14000c2(GFP_KERNEL|__GFP_HIGHMEM), nodemask=(null)
[10170.448819] kworker/u8:6 cpuset=/ mems_allowed=0
[10170.448842] CPU: 3 PID: 13430 Comm: kworker/u8:6 Not tainted 4.12.0-rc7-00034-gdff47ed160bb #1
[10170.448846] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[10170.448872] Workqueue: btrfs-endio btrfs_endio_helper
[10170.448910] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[10170.448925] [] (show_stack) from [] (dump_stack+0x78/0x8c)
[10170.448942] [] (dump_stack) from [] (warn_alloc+0xc0/0x170)
[10170.448952] [] (warn_alloc) from [] (__alloc_pages_nodemask+0x97c/0xe30)
[10170.448964] [] (__alloc_pages_nodemask) from [] (__vmalloc_node_range+0x144/0x27c)
[10170.448976] [] (__vmalloc_node_range) from [] (__vmalloc_node.constprop.10+0x48/0x50)
[10170.448982] [] (__vmalloc_node.constprop.10) from [] (vmalloc+0x2c/0x34)
[10170.448990] [] (vmalloc) from [] (zstd_alloc_workspace+0x6c/0xb8)
[10170.448997] [] (zstd_alloc_workspace) from [] (find_workspace+0x120/0x1f4)
[10170.449002] [] (find_workspace) from [] (end_compressed_bio_read+0x1d4/0x3b0)
[10170.449016] [] (end_compressed_bio_read) from [] (process_one_work+0x1d8/0x3f0)
[10170.449026] [] (process_one_work) from [] (worker_thread+0x38/0x558)
[10170.449035] [] (worker_thread) from [] (kthread+0x124/0x154)
[10170.449042] [] (kthread) from [] (ret_from_fork+0x14/0x3c)

which never happened with compress=lzo -- and a 2GB RAM machine that runs
4 threads of various builds runs into memory pressure quite often. On the
other hand, I used 4.11 for lzo, so this needs more testing before I can
blame the zstd code.

Also, I had network problems all day today, so the machine was mostly idle
instead of doing further tests -- I'm not quite going to pull sources to
build over a phone connection.

I'm on linus:4.12-rc7 with only a handful of btrfs patches (v3 of Qu's
chunk check, some misc crap) -- I guess I should use at least
btrfs-for-4.13. Or would you prefer full-blown next?

Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁  A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀  A smart species invents a can opener.
⠈⠳⣄⠀⠀⠀⠀  A master species delegates.
Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
> Please don't top post.

Sorry about that.

> Which function needs 1KB of stack space? That's quite a lot.

FSE_buildCTable_wksp(), FSE_compress_wksp(), and HUF_readDTableX4()
required over 1 KB of stack space.

> I can see in [1] that there are some on-stack buffers replaced by
> pointers to the workspace. That's good, but I would like to know if
> there's any hidden gem that grabs the precious stack space.

I've been hunting down the functions that use the most stack space and
replacing their buffers with pointers into the workspace. I compiled the
code with -Wframe-larger-than=512 and reduced the stack usage of all
offending functions. In the next version of the patch, no function uses
more than 400 B of stack space. We'll be porting the changes back upstream
as well.

> Hm, I'd suggest to create a version optimized for kernel, eg. expecting
> that 4+ GB buffer will never be used and you can use the most fitting
> type. This should affect only the function signatures, not the
> algorithm implementation, so porting future zstd changes should be
> straightforward.

If the functions were exposed, then I would agree 100%. However, since
these are internal functions, and the rest of zstd uses size_t to
represent buffer sizes, I think it would be awkward to change just the
FSE/HUF functions. I also prefer size_t because it is friendlier to the
optimizer, especially the loop optimizer, since the compiler doesn't have
to worry about unsigned overflow.

On a related note, zstd performs automatic optimizations to improve
compression speed and reduce memory usage when given small sources, which
is the common case in the kernel.
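The buffer-to-workspace change described above can be sketched in plain C. This is a hypothetical illustration, not zstd's actual code: `count_symbols_stack`, `count_symbols_wksp`, and `COUNT_SIZE` are made-up names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COUNT_SIZE 256	/* one counter per possible byte value */

/* Before: roughly 1 KB of stack per call (256 * sizeof(uint32_t)). */
static size_t count_symbols_stack(const uint8_t *src, size_t len)
{
	uint32_t count[COUNT_SIZE];	/* 1 KB on this function's frame */
	size_t distinct = 0, i;

	memset(count, 0, sizeof(count));
	for (i = 0; i < len; i++)
		count[src[i]]++;
	for (i = 0; i < COUNT_SIZE; i++)
		distinct += (count[i] != 0);
	return distinct;
}

/*
 * After: the buffer lives in a caller-provided workspace, so this
 * function's own frame stays tiny. The caller must supply at least
 * COUNT_SIZE * sizeof(uint32_t) bytes of workspace.
 */
static size_t count_symbols_wksp(const uint8_t *src, size_t len, void *workspace)
{
	uint32_t *count = workspace;
	size_t distinct = 0, i;

	memset(count, 0, COUNT_SIZE * sizeof(uint32_t));
	for (i = 0; i < len; i++)
		count[src[i]]++;
	for (i = 0; i < COUNT_SIZE; i++)
		distinct += (count[i] != 0);
	return distinct;
}
```

Both versions compute the same result; only where the scratch buffer lives changes, which is exactly what -Wframe-larger-than= measures.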
Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
Please don't top post.

On Tue, Jun 27, 2017 at 05:27:51AM +, Nick Terrell wrote:
> Adam, I’ve applied the same patch in my tree. I’ll send out the update [1]
> once it's reviewed, since I also reduced the stack usage of functions
> using over 1 KB of stack space.

Which function needs 1KB of stack space? That's quite a lot.

I can see in [1] that there are some on-stack buffers replaced by
pointers to the workspace. That's good, but I would like to know if
there's any hidden gem that grabs the precious stack space.

> You’re right that div_u64() will work, since the FSE functions are only
> called on blocks of at most 128 KB at a time. Perhaps a u32 would be
> clearer, but I would prefer to leave the signatures as is, to stay closer
> to upstream. Upstream FSE should work with sizes larger than 4 GB, but
> since it can't happen in zstd, it isn't a priority.

Hm, I'd suggest creating a version optimized for the kernel, e.g. expecting
that a 4+ GB buffer will never be used, and using the most fitting type.
This should affect only the function signatures, not the algorithm
implementation, so porting future zstd changes should be straightforward.
Re: [PATCH] lib/zstd: use div_u64() to let it build on 32-bit
Adam, I’ve applied the same patch in my tree. I’ll send out the update [1]
once it's reviewed, since I also reduced the stack usage of functions using
over 1 KB of stack space.

You’re right that div_u64() will work, since the FSE functions are only
called on blocks of at most 128 KB at a time. Perhaps a u32 would be
clearer, but I would prefer to leave the signatures as is, to stay closer
to upstream. Upstream FSE should work with sizes larger than 4 GB, but
since it can't happen in zstd, it isn't a priority.

I have userland tests set up mocking the linux kernel headers, and tested
32-bit mode there, but neglected to test the kernel on a 32-bit VM, which
I’ve now corrected. Thanks for testing the patch on your ARM machine!

[1] https://github.com/facebook/zstd/pull/738/files

On 6/26/17, 9:18 PM, "Adam Borowski" wrote:

    David Sterba wrote:
    > > Thus, you want do_div() instead of /; do check widths and signedness
    > > of arguments.
    >
    > No do_div please, div_u64 or div64_u64.

    Good to know, the interface of do_div() is indeed weird.

    I guess Nick has found and fixed the offending divisions in his tree
    already, but this patch I'm sending is what I'm testing. One thing to
    note is that it divides u64 by size_t, so the actual operation differs
    on 32- vs 64-bit. Yet the code fails to handle compressing pieces
    bigger than 4GB in other places -- so the use of size_t is misleading.
    Perhaps u32 would better convey this limitation?

    Anyway, that this code didn't even compile on 32-bit also means it
    hasn't been tested. I just happen to have such an ARM machine doing
    Debian archive rebuilds; I've rewritten the chroots with compress=zstd;
    this should be a nice non-artificial test. The load consists of
    snapshot+dpkg+gcc/etc + assorted testsuites, two sbuild instances.
    Seems to work fine for a whole hour (yay!) already, let's see if
    there'll be any explosions.

    -- >8 >8 >8 >8 >8 >8 >8 >8 >8 --

    Note that "total" is limited to 2³²-1 elsewhere despite being declared
    as size_t, so it's ok to use 64/32 -- it's much faster on eg. x86-32
    than 64/64.

    Signed-off-by: Adam Borowski
    ---
     lib/zstd/fse_compress.c | 5 +++--
     1 file changed, 3 insertions(+), 2 deletions(-)

    diff --git a/lib/zstd/fse_compress.c b/lib/zstd/fse_compress.c
    index e016bb177833..f59f9ebfe9c0 100644
    --- a/lib/zstd/fse_compress.c
    +++ b/lib/zstd/fse_compress.c
    @@ -49,6 +49,7 @@
     #include "fse.h"
     #include
     #include	/* memcpy, memset */
    +#include <linux/math64.h>

     /* **************************************************************
     *  Error Management
    @@ -575,7 +576,7 @@ static size_t FSE_normalizeM2(short *norm, U32 tableLog, const unsigned *count,
     {
     	U64 const vStepLog = 62 - tableLog;
     	U64 const mid = (1ULL << (vStepLog - 1)) - 1;
    -	U64 const rStep = ((((U64)1 << vStepLog) * ToDistribute) + mid) / total; /* scale on remaining */
    +	U64 const rStep = div_u64((((U64)1 << vStepLog) * ToDistribute) + mid, total); /* scale on remaining */
     	U64 tmpTotal = mid;
     	for (s = 0; s <= maxSymbolValue; s++) {
     		if (norm[s] == NOT_YET_ASSIGNED) {
    @@ -609,7 +610,7 @@ size_t FSE_normalizeCount(short *normalizedCounter, unsigned tableLog, const uns
     {
     	U32 const rtbTable[] = {0, 473195, 504333, 520860, 550000, 700000, 750000, 830000};
     	U64 const scale = 62 - tableLog;
    -	U64 const step = ((U64)1 << 62) / total; /* <== here, one division ! */
    +	U64 const step = div_u64((U64)1 << 62, total); /* <== here, one division ! */
     	U64 const vStep = 1ULL << (scale - 20);
     	int stillToDistribute = 1 << tableLog;
     	unsigned s;
    --
    2.13.1
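For context on why the plain `/` breaks the 32-bit build: a 64-by-64 division on a 32-bit target compiles to a call to a libgcc helper (__udivdi3) that the kernel does not link against, whereas div_u64() takes a 32-bit divisor. Below is a simplified userland model -- not the kernel's actual implementation -- relying on the fact noted above that `total` always fits in 32 bits.

```c
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/*
 * Simplified userland model of the kernel's div_u64(): 64-bit dividend,
 * 32-bit divisor. The narrow divisor type is the point: on a 32-bit
 * kernel the real helper avoids the 64/64 libgcc routine (__udivdi3)
 * that a plain `/` on two 64-bit operands would require.
 */
static u64 div_u64(u64 dividend, u32 divisor)
{
	return dividend / divisor;
}

/*
 * The step computation from the patch: total is a sum of symbol counts,
 * bounded by zstd's 128 KB block size, so it always fits in a u32.
 */
static u64 fse_step(u32 total)
{
	return div_u64((u64)1 << 62, total);
}
```

For a full 128 KB block, fse_step(128 * 1024) is 2^62 / 2^17 = 2^45, matching what the `((U64)1 << 62) / total` expression computed before the patch.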