Re: Stuck in Needbuf state, trying to understand (6.7)
On Mon, Jun 29, 2020 at 03:56:43PM -0400, sven falempin wrote:
> On Mon, Jun 29, 2020 at 12:58 PM sven falempin
> wrote:
>
> It works in the original problematic setup.
>
> Will it go to base ?
>

Yes.

revision 1.201
date: 2020/07/14 06:02:50; author: beck; state: Exp; lines: +9 -3; commitid: G6yRUUYskLjLY0oH;
Do not convert the NOCACHE buffers that come from a vnd strategy routine
into more delayed writes if the vnd is mounted from a file on an MNT_ASYNC
filesystem. This prevents a situation where the cleaner can not clean delayed
writes out without making more delayed writes, and we end up waiting for the
syncer to spit things occasionally when it runs.

noticed and reported by sven falempin on tech, who validated
this fixes his issue.

ok krw@
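A toy model of the stall described in the commit message may help picture it. This is only a sketch under loud assumptions: the function name `sk_simulate_stall`, the fixed-size pool, and the rule that a full pool waits one whole syncer period are all invented for illustration and are not how the kernel actually accounts for buffers.

```c
#include <stdbool.h>

/*
 * Toy model, NOT kernel code. All names and rules here are assumptions
 * made for illustration only.
 *
 * nbufs          size of the (hypothetical) buffer pool
 * writes         number of writes issued through the vnd
 * syncer_period  seconds between syncer runs (the thread mentions 60)
 * delayed        true if each write is turned into a delayed write
 *
 * Returns the simulated number of seconds spent waiting for a free buffer.
 */
unsigned
sk_simulate_stall(unsigned nbufs, unsigned writes, unsigned syncer_period,
    bool delayed)
{
	unsigned dirty = 0;	/* buffers currently held by delayed writes */
	unsigned waited = 0;	/* accumulated stall time in seconds */
	unsigned i;

	/* Writes that are not delayed go straight to the backing file. */
	if (!delayed || nbufs == 0)
		return 0;

	for (i = 0; i < writes; i++) {
		if (dirty == nbufs) {
			/* Pool exhausted: assume we sleep until the syncer flushes. */
			waited += syncer_period;
			dirty = 0;
		}
		dirty++;
	}
	return waited;
}
```

With a pool of 4 buffers and 10 writes, the delayed variant stalls twice in this model, while the non-delayed variant never waits. The real kernel is far more subtle; the point is only that every pool exhaustion costs up to one syncer interval.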
Re: Stuck in Needbuf state, trying to understand (6.7)
On Mon, Jun 29, 2020 at 12:58 PM sven falempin wrote:

> On Mon, Jun 29, 2020 at 12:12 PM Bob Beck wrote:
>
>> > Awesome, thanks!
>> >
>> > I will test that, ASAP,
>> > do not hesitate to slay dragon,
>> > i heard the bathing in the blood pool is good for the skin
>> >
>> > Little concern, I did the test without the MFS and ran into issues,
>> > anyway i get back to you (or list ?) when i have test report with
>> > patched kernel
>>
>> Yes, however, you didn't tell me what options you had on the filesystem
>> mounted when you did the test without MFS, because it matters. If you had
>> your filesystem mounted ASYNC it would have exhibited the same behavior.
>> The issue is due to the async mount, which MFS does by default, not
>> strictly to do with MFS.
>
> tmpfs was just not mounted so it was the option of the underlying /home
>
> /dev/sd0d on /home type ffs (local, nodev, nosuid)
>
> So far the above patch improves the situation drastically
>
> I will now perform a test in the original device.

It works in the original problematic setup.

Will it go to base ?

--
Knowing is not enough; we must apply. Willing is not enough; we must do
Re: Stuck in Needbuf state, trying to understand (6.7)
On Mon, Jun 29, 2020 at 12:12 PM Bob Beck wrote:

> > Awesome, thanks!
> >
> > I will test that, ASAP,
> > do not hesitate to slay dragon,
> > i heard the bathing in the blood pool is good for the skin
> >
> > Little concern, I did the test without the MFS and ran into issues,
> > anyway i get back to you (or list ?) when i have test report with
> > patched kernel
>
> Yes, however, you didn't tell me what options you had on the filesystem
> mounted when you did the test without MFS, because it matters. If you had
> your filesystem mounted ASYNC it would have exhibited the same behavior.
> The issue is due to the async mount, which MFS does by default, not
> strictly to do with MFS.

tmpfs was just not mounted so it was the option of the underlying /home

/dev/sd0d on /home type ffs (local, nodev, nosuid)

So far the above patch improves the situation drastically

I will now perform a test in the original device.
Re: Stuck in Needbuf state, trying to understand (6.7)
> Awesome, thanks!
>
> I will test that, ASAP,
> do not hesitate to slay dragon,
> i heard the bathing in the blood pool is good for the skin
>
> Little concern, I did the test without the MFS and ran into issues,
> anyway i get back to you (or list ?) when i have test report with patched
> kernel

Yes, however, you didn't tell me what options you had on the filesystem
mounted when you did the test without MFS, because it matters. If you had
your filesystem mounted ASYNC it would have exhibited the same behavior.
The issue is due to the async mount, which MFS does by default, not strictly
to do with MFS.

> Again thanks for helping.
>
> --
> Knowing is not enough; we must apply. Willing is not enough; we must do
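Since the mount options are what matters here, a small helper that checks one line of mount(8) output for a given option can make the check mechanical. This is a minimal sketch: the function name `sk_mount_line_has_option` is hypothetical, and the assumption that an async mount shows up as the word `asynchronous` inside the parentheses should be verified against your own mount(8) output.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if a mount(8) output line lists `opt` between its final
 * pair of parentheses, e.g.
 *   "/dev/sd0d on /home type ffs (local, nodev, nosuid)"
 * Only whole option names match; "nosu" does not match "nosuid".
 */
bool
sk_mount_line_has_option(const char *line, const char *opt)
{
	const char *lparen, *rparen, *p, *end, *trim;
	size_t optlen;

	if (line == NULL || opt == NULL)
		return false;

	optlen = strlen(opt);
	lparen = strrchr(line, '(');
	rparen = lparen != NULL ? strchr(lparen, ')') : NULL;
	if (lparen == NULL || rparen == NULL || optlen == 0)
		return false;

	p = lparen + 1;
	while (p < rparen) {
		/* Skip the ", " separators between options. */
		while (p < rparen && (*p == ',' || *p == ' '))
			p++;
		/* Find the end of this option. */
		end = p;
		while (end < rparen && *end != ',')
			end++;
		/* Ignore trailing blanks before the comma or parenthesis. */
		trim = end;
		while (trim > p && trim[-1] == ' ')
			trim--;
		if ((size_t)(trim - p) == optlen && strncmp(p, opt, optlen) == 0)
			return true;
		p = end;
	}
	return false;
}
```

Feeding it the /home line quoted in this thread with the option `asynchronous` returns false, which matches the report that /home was not mounted async.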
Re: Stuck in Needbuf state, trying to understand (6.7)
On Mon, Jun 29, 2020 at 11:44 AM Bob Beck wrote:

> On Sun, Jun 28, 2020 at 12:18:06PM -0400, sven falempin wrote:
> > On Sun, Jun 28, 2020 at 2:40 AM Bryan Linton wrote:
> >
> > > On 2020-06-27 19:29:31, Bob Beck wrote:
> > > >
> > > > No.
> > > >
> > > > I know *exactly* what needbuf is but to attempt to diagnose what your
> > > > problem is we need exact details. Especially:
> > > >
> > > > 1) The configuration of your system including all the details of the
> > > > filesystems you have mounted, all options used, etc.
> > > >
> > > > 2) The script you are using to generate the problem (Not a
> > > > paraphrasing of what you think the script does) What filesystems it
> > > > is using.
> > > >
> > >
> > > Not the OP, but this problem sounds almost exactly like the bug I
> > > reported last year.
> > >
> > > There is a detailed list of steps I used to reproduce the bug in
> > > the following bug report.
> > >
> > > https://marc.info/?l=openbsd-bugs&m=156412299418191
> > >
> > > I was even able to bisect and identify the commit which first
> > > caused the breakage for me.
> > >
> > > ---8<---
> > >
> > > CVSROOT: /cvs
> > > Module name: src
> > > Changes by: b...@cvs.openbsd.org 2019/05/08 06:40:57
> > >
> > > Modified files:
> > > sys/kern : vfs_bio.c vfs_biomem.c
> > >
> > > Log message:
> > > Modify the buffer cache to always flip recovered DMA buffers high.
> > >
> > > This also modifies the backoff logic to only back off what is requested
> > > and not a "minimum" amount. Tested by me, benno@, tedu@ and a ports
> > > build by naddy@.
> > >
> > > ok tedu@
> > >
> > > ---8<---
> > >
> > > However, I have since migrated away from using vnd(4)s since I was
> > > able to find other solutions that worked for my use cases. So I
> > > may not be able to provide much additional information other than
> > > what is contained in the above bug report.
> > >
> > > --
> > > Bryan
> > >
> >
> > Reproduction of BUG.
> >
> > # optional
> > mkdir /tmpfs
> > mount_mfs -o rw -s 2500M swap /tmpfs # i mounted through fstab so this line is not tested
> > #the bug
> > /bin/dd if=/dev/zero of=/tmpfs/img.dd count=0 bs=1 seek=25
> > vnconfig vnd3 /tmpfs/img.dd
> > printf "a a\n\n\n\nw\nq\n" | disklabel -E vnd3
> > newfs vnd3a
> > mount /dev/vnd3a /mnt
> > cd /tmp && ftp https://cdn.openbsd.org/pub/OpenBSD/6.7/amd64/base67.tgz
> > cd /mnt
> > #will occur here (the mkdir was in needbuf state for a while ...
> > for v in 1 2 3 4 5 6 7 8 9; do mkdir /mnt/$v; tar xzvf /tmp/base67.tgz -C /mnt/$v; done
> >
> > Ready to test patches.
> >
>
> So, your problem is that you have your vnd created in an mfs
> filesystem. When I run your test with the vnd backed by a regular
> filesystem (with softdep even) it works fine.
>
> The trouble happens when your VND has buffers cached in its
> "filesystem" but then is not flushing them out to the underlying file
> (vnode) that you have your vnd backed by. On normal filesystems this
> works fine, since vnd tells the lower layer to not cache the writes
> and to do them synchronously, to avoid an explosion of delayed writes
> and dependencies of buffers.
>
> The problem happens when we convert synchronous bwrites to
> asynchronous bdwrites if the filesystem is mounted ASYNC, which,
> curiously, MFS always is (I don't know why, it doesn't really make any
> sense, and I might even look at changing that). All the writes you do
> end up being delayed and chewing up more buffer space. And they are
> all tied to one vnode (your image). Once you exhaust the buffer
> space, the cleaner runs, but as you have noticed can't clean out your
> vnode until the syncer runs (every 60 seconds). This is why your
> thing "takes a long time", and things stall in need buffer. softdep
> has deep dark voodoo in it to avoid this problem and therefore when I
> use a softdep filesystem instead of an ASYNC filesystem it works.
>
> Anyway, what's below fixes your issue on my machine. I'm not sure I'm
> happy that it's the final fix but it does fix it. There are many other
> dragons lurking in here.
>
> Index: sys/kern/vfs_bio.c
> ===
> RCS file: /cvs/src/sys/kern/vfs_bio.c,v
> retrieving revision 1.200
> diff -u -p -u -p -r1.200 vfs_bio.c
> --- sys/kern/vfs_bio.c	29 Apr 2020 02:25:48 -	1.200
> +++ sys/kern/vfs_bio.c	29 Jun 2020 15:18:21 -
> @@ -706,8 +706,14 @@ bwrite(struct buf *bp)
>  	 */
>  	async = ISSET(bp->b_flags, B_ASYNC);
>  	if (!async && mp && ISSET(mp->mnt_flag, MNT_ASYNC)) {
> -		bdwrite(bp);
> -		return (0);
> +		/*
> +		 * Don't convert writes from VND on async filesystems
> +		 * that already have delayed writes in the upper layer.
> +		 */
> +		if (!ISSET(bp->b_flags, B_NOCACHE)) {
> +			bdwrite(bp);
Re: Stuck in Needbuf state, trying to understand (6.7)
On Sun, Jun 28, 2020 at 12:18:06PM -0400, sven falempin wrote:
> On Sun, Jun 28, 2020 at 2:40 AM Bryan Linton wrote:
>
> > On 2020-06-27 19:29:31, Bob Beck wrote:
> > >
> > > No.
> > >
> > > I know *exactly* what needbuf is but to attempt to diagnose what your
> > > problem is we need exact details. Especially:
> > >
> > > 1) The configuration of your system including all the details of the
> > > filesystems you have mounted, all options used, etc.
> > >
> > > 2) The script you are using to generate the problem (Not a paraphrasing
> > > of what you think the script does) What filesystems it is using.
> > >
> >
> > Not the OP, but this problem sounds almost exactly like the bug I
> > reported last year.
> >
> > There is a detailed list of steps I used to reproduce the bug in
> > the following bug report.
> >
> > https://marc.info/?l=openbsd-bugs&m=156412299418191
> >
> > I was even able to bisect and identify the commit which first
> > caused the breakage for me.
> >
> > ---8<---
> >
> > CVSROOT: /cvs
> > Module name: src
> > Changes by: b...@cvs.openbsd.org 2019/05/08 06:40:57
> >
> > Modified files:
> > sys/kern : vfs_bio.c vfs_biomem.c
> >
> > Log message:
> > Modify the buffer cache to always flip recovered DMA buffers high.
> >
> > This also modifies the backoff logic to only back off what is requested
> > and not a "minimum" amount. Tested by me, benno@, tedu@ and a ports build
> > by naddy@.
> >
> > ok tedu@
> >
> > ---8<---
> >
> > However, I have since migrated away from using vnd(4)s since I was
> > able to find other solutions that worked for my use cases. So I
> > may not be able to provide much additional information other than
> > what is contained in the above bug report.
> >
> > --
> > Bryan
> >
>
> Reproduction of BUG.
>
> # optional
> mkdir /tmpfs
> mount_mfs -o rw -s 2500M swap /tmpfs # i mounted through fstab so this line is not tested
> #the bug
> /bin/dd if=/dev/zero of=/tmpfs/img.dd count=0 bs=1 seek=25
> vnconfig vnd3 /tmpfs/img.dd
> printf "a a\n\n\n\nw\nq\n" | disklabel -E vnd3
> newfs vnd3a
> mount /dev/vnd3a /mnt
> cd /tmp && ftp https://cdn.openbsd.org/pub/OpenBSD/6.7/amd64/base67.tgz
> cd /mnt
> #will occur here (the mkdir was in needbuf state for a while ...
> for v in 1 2 3 4 5 6 7 8 9; do mkdir /mnt/$v; tar xzvf /tmp/base67.tgz -C /mnt/$v; done
>
> Ready to test patches.
>

So, your problem is that you have your vnd created in an mfs
filesystem. When I run your test with the vnd backed by a regular
filesystem (with softdep even) it works fine.

The trouble happens when your VND has buffers cached in its
"filesystem" but then is not flushing them out to the underlying file
(vnode) that you have your vnd backed by. On normal filesystems this
works fine, since vnd tells the lower layer to not cache the writes
and to do them synchronously, to avoid an explosion of delayed writes
and dependencies of buffers.

The problem happens when we convert synchronous bwrites to
asynchronous bdwrites if the filesystem is mounted ASYNC, which,
curiously, MFS always is (I don't know why, it doesn't really make any
sense, and I might even look at changing that). All the writes you do
end up being delayed and chewing up more buffer space. And they are
all tied to one vnode (your image). Once you exhaust the buffer
space, the cleaner runs, but as you have noticed can't clean out your
vnode until the syncer runs (every 60 seconds). This is why your
thing "takes a long time", and things stall in need buffer. softdep
has deep dark voodoo in it to avoid this problem and therefore when I
use a softdep filesystem instead of an ASYNC filesystem it works.

Anyway, what's below fixes your issue on my machine. I'm not sure I'm
happy that it's the final fix but it does fix it. There are many other
dragons lurking in here.

Index: sys/kern/vfs_bio.c
===
RCS file: /cvs/src/sys/kern/vfs_bio.c,v
retrieving revision 1.200
diff -u -p -u -p -r1.200 vfs_bio.c
--- sys/kern/vfs_bio.c	29 Apr 2020 02:25:48 -	1.200
+++ sys/kern/vfs_bio.c	29 Jun 2020 15:18:21 -
@@ -706,8 +706,14 @@ bwrite(struct buf *bp)
 	 */
 	async = ISSET(bp->b_flags, B_ASYNC);
 	if (!async && mp && ISSET(mp->mnt_flag, MNT_ASYNC)) {
-		bdwrite(bp);
-		return (0);
+		/*
+		 * Don't convert writes from VND on async filesystems
+		 * that already have delayed writes in the upper layer.
+		 */
+		if (!ISSET(bp->b_flags, B_NOCACHE)) {
+			bdwrite(bp);
+			return (0);
+		}
 	}
 
 	/*
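The decision the patch changes can be modelled outside the kernel. The sketch below is not the kernel code: the `SK_*` flag values, the enum, and both function names are hypothetical stand-ins chosen for illustration, and only the branching mirrors the posted diff (a synchronous write on an async mount becomes a delayed write, unless the buffer is marked no-cache).

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch; the real values live in kernel headers. */
#define SK_B_ASYNC	0x01u	/* stands in for B_ASYNC */
#define SK_B_NOCACHE	0x02u	/* stands in for B_NOCACHE */
#define SK_MNT_ASYNC	0x01u	/* stands in for MNT_ASYNC */

enum sk_write_kind { SK_WRITE_SYNC, SK_WRITE_ASYNC, SK_WRITE_DELAYED };

/* Before the patch: every sync write on an async mount is delayed. */
enum sk_write_kind
sk_classify_old(unsigned b_flags, bool has_mount, unsigned mnt_flag)
{
	bool async = (b_flags & SK_B_ASYNC) != 0;

	if (!async && has_mount && (mnt_flag & SK_MNT_ASYNC) != 0)
		return SK_WRITE_DELAYED;
	return async ? SK_WRITE_ASYNC : SK_WRITE_SYNC;
}

/* After the patch: no-cache buffers (as issued by vnd) stay synchronous. */
enum sk_write_kind
sk_classify_new(unsigned b_flags, bool has_mount, unsigned mnt_flag)
{
	bool async = (b_flags & SK_B_ASYNC) != 0;

	if (!async && has_mount && (mnt_flag & SK_MNT_ASYNC) != 0) {
		if ((b_flags & SK_B_NOCACHE) == 0)
			return SK_WRITE_DELAYED;
	}
	return async ? SK_WRITE_ASYNC : SK_WRITE_SYNC;
}
```

The only input for which the two functions disagree is a synchronous no-cache buffer on an async mount, which is the case the thread identifies for a vnd backed by a file on MFS.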
Re: Stuck in Needbuf state, trying to understand (6.7)
On Sun, Jun 28, 2020 at 2:40 AM Bryan Linton wrote:

> On 2020-06-27 19:29:31, Bob Beck wrote:
> >
> > No.
> >
> > I know *exactly* what needbuf is but to attempt to diagnose what your
> > problem is we need exact details. Especially:
> >
> > 1) The configuration of your system including all the details of the
> > filesystems you have mounted, all options used, etc.
> >
> > 2) The script you are using to generate the problem (Not a paraphrasing
> > of what you think the script does) What filesystems it is using.
> >
>
> Not the OP, but this problem sounds almost exactly like the bug I
> reported last year.
>
> There is a detailed list of steps I used to reproduce the bug in
> the following bug report.
>
> https://marc.info/?l=openbsd-bugs&m=156412299418191
>
> I was even able to bisect and identify the commit which first
> caused the breakage for me.
>
> ---8<---
>
> CVSROOT: /cvs
> Module name: src
> Changes by: b...@cvs.openbsd.org 2019/05/08 06:40:57
>
> Modified files:
> sys/kern : vfs_bio.c vfs_biomem.c
>
> Log message:
> Modify the buffer cache to always flip recovered DMA buffers high.
>
> This also modifies the backoff logic to only back off what is requested
> and not a "minimum" amount. Tested by me, benno@, tedu@ and a ports build
> by naddy@.
>
> ok tedu@
>
> ---8<---
>
> However, I have since migrated away from using vnd(4)s since I was
> able to find other solutions that worked for my use cases. So I
> may not be able to provide much additional information other than
> what is contained in the above bug report.
>
> --
> Bryan
>

Reproduction of BUG.

# optional
mkdir /tmpfs
mount_mfs -o rw -s 2500M swap /tmpfs # i mounted through fstab so this line is not tested
#the bug
/bin/dd if=/dev/zero of=/tmpfs/img.dd count=0 bs=1 seek=25
vnconfig vnd3 /tmpfs/img.dd
printf "a a\n\n\n\nw\nq\n" | disklabel -E vnd3
newfs vnd3a
mount /dev/vnd3a /mnt
cd /tmp && ftp https://cdn.openbsd.org/pub/OpenBSD/6.7/amd64/base67.tgz
cd /mnt
#will occur here (the mkdir was in needbuf state for a while ...
for v in 1 2 3 4 5 6 7 8 9; do mkdir /mnt/$v; tar xzvf /tmp/base67.tgz -C /mnt/$v; done

Ready to test patches.

--
Knowing is not enough; we must apply. Willing is not enough; we must do
Re: Stuck in Needbuf state, trying to understand (6.7)
On 2020-06-27 19:29:31, Bob Beck wrote:
>
> No.
>
> I know *exactly* what needbuf is but to attempt to diagnose what your
> problem is we need exact details. Especially:
>
> 1) The configuration of your system including all the details of the
> filesystems you have mounted, all options used, etc.
>
> 2) The script you are using to generate the problem (Not a paraphrasing
> of what you think the script does) What filesystems it is using.
>

Not the OP, but this problem sounds almost exactly like the bug I
reported last year.

There is a detailed list of steps I used to reproduce the bug in
the following bug report.

https://marc.info/?l=openbsd-bugs&m=156412299418191

I was even able to bisect and identify the commit which first
caused the breakage for me.

---8<---

CVSROOT: /cvs
Module name: src
Changes by: b...@cvs.openbsd.org 2019/05/08 06:40:57

Modified files:
sys/kern : vfs_bio.c vfs_biomem.c

Log message:
Modify the buffer cache to always flip recovered DMA buffers high.

This also modifies the backoff logic to only back off what is requested
and not a "minimum" amount. Tested by me, benno@, tedu@ and a ports build
by naddy@.

ok tedu@

---8<---

However, I have since migrated away from using vnd(4)s since I was
able to find other solutions that worked for my use cases. So I
may not be able to provide much additional information other than
what is contained in the above bug report.

--
Bryan

> On Sat, Jun 27, 2020 at 08:09:18PM -0400, sven falempin wrote:
> > On Fri, Jun 26, 2020 at 7:35 PM sven falempin wrote:
> >
> > > On Fri, Jun 26, 2020 at 5:22 PM Stuart Henderson wrote:
> > >
> > >> On 2020/06/26 15:30, sven falempin wrote:
> > >> > behavior confirmed on current.
> > >> >
> > >> > Once the process stalls, ( could be anything writing to the vnconfig
> > >> > disk, cp , umount )
> > >> > a few other calls like df , or ps, etc may hang, never the same
> > >> > sp or mp kernel, reproduced on today's snapshots.
> > >>
> > >> vnconfig is used as part of "make release", many builds are done every
> > >> week using this so it's not a general problem with vnconfig.
> > >>
> > >> Can you show some commands or a script to trigger the behaviour?
> > >>
> > >
> > > the perl script uses system to call:
> > >
> > > vnconfig.
> > > mount.
> > > umount. <- saw hanged
> > > cp. <- saw hanged
> > > tar. <- saw hanged
> > > svn up. <- saw hanged
> > > and dd.
> > > newfs.
> > >
> > > really nothing fancy, only stuff writing to disk got stuck.
> > >
> > > At one point it does a chroot but it never hangs near that, most of the
> > > time it hangs before.
> > >
> > > The script has been used like 1000 times on 6.0 and maybe twice more on
> > > 6.4.
> > >
> > > I have absolutely no idea what the 'needbuf' of top is.
> > >
> > > the script hangs at random position, always writing into vnconfig.
> > >
> > > I have no idea how to reproduce outside the perl script, so maybe it is
> > > related to some devious perl stdin/stdout buffer.
> > >
> > > Nevertheless there's like a 5% chance that the script will work (slowly)
> > >
> > > Most of the system calls are inside a routine to log
> > >
> > > sub debug_system {
> > >     $logger->debug('running: '.join(' ', @_));
> > >     return system(@_);
> > > }
> > >
> > > so i can easily put things inside to try to understand the issue.
> > >
> > > It is really a strange behavior, and the device must be shut down
> > > electrically.
> > > Something really odd, i run syslogd on a buffer, and syslogc buffer is
> > > stuck too when the device is stuck (but it is supposed to be mostly
> > > already allocated memory).
> > >
> > > It's really like the vm does not want to give anymore bucket (<- i
> > > don't know what i m talking about here,
> > > but it looks like anything that doesn't malloc is ok, computer replies
> > > to ping, can do a few things for a while, and then complete
> > > hang)
> > >
> > > I ran the 6.7 release on a VM somewhere and another device with many perl
> > > scripts and they work.
> > >
> > > Only this fails 95% of the time and is VERY VERY slow when ok.
> > > compared to what i saw in /usr/src the vnconfig is big, (forgot to copy
> > > df -h),
> > > like 2GB
> > >
> >
> > i put ktrace in front of the perl system call
> >
> > And I was able to recover a 800MB trace
> >
> > $ kdump -f ./trace.out | tail -20
> > kdump: realloc: Cannot allocate memory
> > 25955 UNKNOWN(1634890859)
> > 72466 ? CALL syscall()
> >
> > could that be of some use ?
> >
> > --
> > Knowing is not enough; we must apply. Willing is not enough; we must do
Re: Stuck in Needbuf state, trying to understand (6.7)
No.

I know *exactly* what needbuf is but to attempt to diagnose what your
problem is we need exact details. Especially:

1) The configuration of your system including all the details of the
filesystems you have mounted, all options used, etc.

2) The script you are using to generate the problem (Not a paraphrasing
of what you think the script does) What filesystems it is using.

On Sat, Jun 27, 2020 at 08:09:18PM -0400, sven falempin wrote:
> On Fri, Jun 26, 2020 at 7:35 PM sven falempin wrote:
>
> > On Fri, Jun 26, 2020 at 5:22 PM Stuart Henderson wrote:
> >
> >> On 2020/06/26 15:30, sven falempin wrote:
> >> > behavior confirmed on current.
> >> >
> >> > Once the process stalls, ( could be anything writing to the vnconfig
> >> > disk, cp , umount )
> >> > a few other calls like df , or ps, etc may hang, never the same
> >> > sp or mp kernel, reproduced on today's snapshots.
> >>
> >> vnconfig is used as part of "make release", many builds are done every
> >> week using this so it's not a general problem with vnconfig.
> >>
> >> Can you show some commands or a script to trigger the behaviour?
> >>
> >
> > the perl script uses system to call:
> >
> > vnconfig.
> > mount.
> > umount. <- saw hanged
> > cp. <- saw hanged
> > tar. <- saw hanged
> > svn up. <- saw hanged
> > and dd.
> > newfs.
> >
> > really nothing fancy, only stuff writing to disk got stuck.
> >
> > At one point it does a chroot but it never hangs near that, most of the
> > time it hangs before.
> >
> > The script has been used like 1000 times on 6.0 and maybe twice more on
> > 6.4.
> >
> > I have absolutely no idea what the 'needbuf' of top is.
> >
> > the script hangs at random position, always writing into vnconfig.
> >
> > I have no idea how to reproduce outside the perl script, so maybe it is
> > related to some devious perl stdin/stdout buffer.
> >
> > Nevertheless there's like a 5% chance that the script will work (slowly)
> >
> > Most of the system calls are inside a routine to log
> >
> > sub debug_system {
> >     $logger->debug('running: '.join(' ', @_));
> >     return system(@_);
> > }
> >
> > so i can easily put things inside to try to understand the issue.
> >
> > It is really a strange behavior, and the device must be shut down
> > electrically.
> > Something really odd, i run syslogd on a buffer, and syslogc buffer is
> > stuck too when the device is stuck (but it is supposed to be mostly
> > already allocated memory).
> >
> > It's really like the vm does not want to give anymore bucket (<- i
> > don't know what i m talking about here,
> > but it looks like anything that doesn't malloc is ok, computer replies
> > to ping, can do a few things for a while, and then complete
> > hang)
> >
> > I ran the 6.7 release on a VM somewhere and another device with many perl
> > scripts and they work.
> >
> > Only this fails 95% of the time and is VERY VERY slow when ok.
> > compared to what i saw in /usr/src the vnconfig is big, (forgot to copy
> > df -h),
> > like 2GB
> >
>
> i put ktrace in front of the perl system call
>
> And I was able to recover a 800MB trace
>
> $ kdump -f ./trace.out | tail -20
> kdump: realloc: Cannot allocate memory
> 25955 UNKNOWN(1634890859)
> 72466 ? CALL syscall()
>
> could that be of some use ?
>
> --
> Knowing is not enough; we must apply. Willing is not enough; we must do
Re: Stuck in Needbuf state, trying to understand (6.7)
On Fri, Jun 26, 2020 at 7:35 PM sven falempin wrote:

> On Fri, Jun 26, 2020 at 5:22 PM Stuart Henderson wrote:
>
>> On 2020/06/26 15:30, sven falempin wrote:
>> > behavior confirmed on current.
>> >
>> > Once the process stalls, ( could be anything writing to the vnconfig
>> > disk, cp , umount )
>> > a few other calls like df , or ps, etc may hang, never the same
>> > sp or mp kernel, reproduced on today's snapshots.
>>
>> vnconfig is used as part of "make release", many builds are done every
>> week using this so it's not a general problem with vnconfig.
>>
>> Can you show some commands or a script to trigger the behaviour?
>>
>
> the perl script uses system to call:
>
> vnconfig.
> mount.
> umount. <- saw hanged
> cp. <- saw hanged
> tar. <- saw hanged
> svn up. <- saw hanged
> and dd.
> newfs.
>
> really nothing fancy, only stuff writing to disk got stuck.
>
> At one point it does a chroot but it never hangs near that, most of the
> time it hangs before.
>
> The script has been used like 1000 times on 6.0 and maybe twice more on
> 6.4.
>
> I have absolutely no idea what the 'needbuf' of top is.
>
> the script hangs at random position, always writing into vnconfig.
>
> I have no idea how to reproduce outside the perl script, so maybe it is
> related to some devious perl stdin/stdout buffer.
>
> Nevertheless there's like a 5% chance that the script will work (slowly)
>
> Most of the system calls are inside a routine to log
>
> sub debug_system {
>     $logger->debug('running: '.join(' ', @_));
>     return system(@_);
> }
>
> so i can easily put things inside to try to understand the issue.
>
> It is really a strange behavior, and the device must be shut down
> electrically.
> Something really odd, i run syslogd on a buffer, and syslogc buffer is
> stuck too when the device is stuck (but it is supposed to be mostly
> already allocated memory).
>
> It's really like the vm does not want to give anymore bucket (<- i
> don't know what i m talking about here,
> but it looks like anything that doesn't malloc is ok, computer replies
> to ping, can do a few things for a while, and then complete
> hang)
>
> I ran the 6.7 release on a VM somewhere and another device with many perl
> scripts and they work.
>
> Only this fails 95% of the time and is VERY VERY slow when ok.
> compared to what i saw in /usr/src the vnconfig is big, (forgot to copy
> df -h),
> like 2GB
>

i put ktrace in front of the perl system call

And I was able to recover a 800MB trace

$ kdump -f ./trace.out | tail -20
kdump: realloc: Cannot allocate memory
25955 UNKNOWN(1634890859)
72466 ▒▒▒ CALL syscall()

could that be of some use ?

--
Knowing is not enough; we must apply. Willing is not enough; we must do
Re: Stuck in Needbuf state, trying to understand (6.7)
On Fri, Jun 26, 2020 at 5:22 PM Stuart Henderson wrote:

> On 2020/06/26 15:30, sven falempin wrote:
> > behavior confirmed on current.
> >
> > Once the process stalls, ( could be anything writing to the vnconfig
> > disk, cp , umount )
> > a few other calls like df , or ps, etc may hang, never the same
> > sp or mp kernel, reproduced on today's snapshots.
>
> vnconfig is used as part of "make release", many builds are done every
> week using this so it's not a general problem with vnconfig.
>
> Can you show some commands or a script to trigger the behaviour?
>

the perl script uses system to call:

vnconfig.
mount.
umount. <- saw hanged
cp. <- saw hanged
tar. <- saw hanged
svn up. <- saw hanged
and dd.
newfs.

really nothing fancy, only stuff writing to disk got stuck.

At one point it does a chroot but it never hangs near that, most of the
time it hangs before.

The script has been used like 1000 times on 6.0 and maybe twice more on
6.4.

I have absolutely no idea what the 'needbuf' of top is.

the script hangs at random position, always writing into vnconfig.

I have no idea how to reproduce outside the perl script, so maybe it is
related to some devious perl stdin/stdout buffer.

Nevertheless there's like a 5% chance that the script will work (slowly)

Most of the system calls are inside a routine to log

sub debug_system {
    $logger->debug('running: '.join(' ', @_));
    return system(@_);
}

so i can easily put things inside to try to understand the issue.

It is really a strange behavior, and the device must be shut down
electrically.
Something really odd, i run syslogd on a buffer, and syslogc buffer is
stuck too when the device is stuck (but it is supposed to be mostly
already allocated memory).

It's really like the vm does not want to give anymore bucket (<- i
don't know what i m talking about here,
but it looks like anything that doesn't malloc is ok, computer replies
to ping, can do a few things for a while, and then complete
hang)

I ran the 6.7 release on a VM somewhere and another device with many perl
scripts and they work.

Only this fails 95% of the time and is VERY VERY slow when ok.
compared to what i saw in /usr/src the vnconfig is big, (forgot to copy
df -h),
like 2GB

--
Knowing is not enough; we must apply. Willing is not enough; we must do
Re: Stuck in Needbuf state, trying to understand (6.7)
On 2020/06/26 15:30, sven falempin wrote:
> behavior confirmed on current.
>
> Once the process stalls, ( could be anything writing to the vnconfig disk,
> cp , umount )
> a few other calls like df , or ps, etc may hang, never the same
> sp or mp kernel, reproduced on today's snapshots.

vnconfig is used as part of "make release", many builds are done every
week using this so it's not a general problem with vnconfig.

Can you show some commands or a script to trigger the behaviour?