Re: Complete lock-up from using pkgsrc/net/darkstat
On 27/05/22 06:00, John Klos wrote:
> So here's an interesting problem: On NetBSD 8, 9, current, with both
> ipfilter and with npf, with different kinds of ethernet interfaces
> (re*, wm*), run pkgsrc/net/darkstat. Pass a lot of traffic (like a
> week's worth of Internet traffic). Stop darkstat. Machine locks.

I had a go at reproducing this on NetBSD 9.99.79 with no luck. I was just pumping several TB of data into nc running on the host, so no IP forwarding or anything.

I do recall seeing a message about UDP buffer problems from the DNS lookup child, and I've had named running on small systems fail with similar-sounding error messages. Maybe the problem is related to the number of DNS lookups the child process is doing rather than the number of TB the parent process is counting?

Cheers, Lloyd
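For anyone else who wants to try reproducing, the "pumping data into nc" setup might look something like the following. The port number and transfer size are arbitrary, and nc option handling varies between netcat implementations, so treat this as a sketch rather than a recipe:

```shell
# Sink: listen on TCP and discard everything received. (Some netcat
# variants want 'nc -l -p 12345' rather than 'nc -l 12345'.)
nc -l 12345 > /dev/null &
sleep 1

# Source: stream pseudo-random data through the stack so darkstat has
# traffic to count. Scale bs and count up toward the multi-TB range
# for a real reproduction attempt; this invocation is only a 16 MB
# smoke test.
dd if=/dev/urandom bs=1M count=16 2>/dev/null | nc -w 2 127.0.0.1 12345
wait
```

Running the sender from another machine, rather than over loopback, would exercise the actual ethernet interface (re*, wm*) that darkstat is attached to.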
Re: regarding the changes to kernel entropy gathering
With some trepidation, I'm going to dip into this conversation even though I haven't read all of it; I don't have the mental fortitude for that. I have two suggestions, one short and one long.

Firstly, we could just have an rc.d script that checks to see whether the system has /var/db/entropy-file or an rng device, and if not, prints a warning and then generates some simplistic entropy with:

    ls -lR /
    dd if=/dev/urandom of=/dev/random bs=32 count=1
    sysctl -w kern.entropy.consolidate=1

The system owner has been warned, and the system proceeds to run.

Secondly, we could fix what I see as the biggest problem with the new implementation right now: it is unreasonably difficult for people to work out how to make their system go forwards once it has stopped. Note that making the system go forwards is easy; it's working out what to do that's hard. We can fix that.

The current implementation prints a message whenever it blocks a process that wants randomness, which immediately makes this implementation superior to all others that I have ever seen. The number of times over the past 20 years that I've logged into systems that have stalled on boot and made them finish booting by running "ls -lR /" is too many to count. I don't know if I just needed to wait longer for the boot to finish, or if generating entropy was the fix, and I will never know. This is nuts.

We can use the message to point the system administrator to a manual page that tells them what to do, and by "tells them what to do" I mean in plain, simple language, right at the top of the page, without scaring them. How about this:

    entropy: pid %d (%s) blocking due to lack of entropy, see entropy(4)

and then in entropy(4) we can start with something like:

    If you are reading this because you have read a kernel message
    telling you that a process is blocking due to a lack of entropy,
    then it is almost certainly because your hardware doesn't have a
    reliable source of randomness.

    If you have no particular requirements for cryptographic security
    on your system, you can generate some entropy and then tell the
    kernel that this entropy is 'enough' with the commands:

        ls -lR /
        dd if=/dev/urandom of=/dev/random bs=32 count=1
        sysctl -w kern.entropy.consolidate=1

    If you have strong requirements for cryptographic security on your
    system, then you should run 'rndctl -S /root/seed' on a system with
    a hardware random number generator (most modern CPUs), copy the
    seed file over to this system as /var/db/entropy-file, and then run
    'rndctl -L /var/db/entropy-file'. This only needs to be done once,
    since scripts in rc.d will take care of saving and restoring system
    entropy in /var/db/entropy-file across reboots.

We could even do both of these things.
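A minimal sketch of the first suggestion as shell functions follows. This is a hypothetical illustration, not an actual NetBSD rc.d script: the grep over rndctl output is an assumption about its format, and a real script would need the usual rcvar boilerplate.

```shell
#!/bin/sh
# Hypothetical boot-time check: if there is neither a saved seed nor a
# hardware RNG, warn the owner and generate some simplistic entropy.

SEEDFILE=${SEEDFILE:-/var/db/entropy-file}

have_entropy_source()
{
	# A seed saved by a previous shutdown is sufficient...
	[ -s "$SEEDFILE" ] && return 0
	# ...as is a hardware RNG known to the kernel (grepping rndctl
	# output like this is an assumption, not a documented interface).
	rndctl -l 2>/dev/null | grep -qi rng && return 0
	return 1
}

seed_weak_entropy()
{
	echo "WARNING: no entropy seed or RNG device found;" >&2
	echo "generating simplistic, non-cryptographic entropy." >&2
	ls -lR / >/dev/null 2>&1
	dd if=/dev/urandom of=/dev/random bs=32 count=1 2>/dev/null
	sysctl -w kern.entropy.consolidate=1
}

# In rc.d this would run once at boot:
#   have_entropy_source || seed_weak_entropy
```

The owner gets the warning on the console, the system comes up anyway, and anyone with real cryptographic requirements can follow the entropy(4) seed-file procedure instead.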
Re: 5.x filesystem performance regression
??? Why is this in tech-net?

On 5/06/2011, at 2:40 AM, Edgar Fuß wrote:
> Having fixed my performace-critical RAID configuration, I think there's some
> serious filesystem performance regression from 4.x to 5.x.
>
> I've tested every possible combination of 4.0.1 vs. 5.1, softdep vs. WAPBL,
> parity maps enabled vs. disabled, bare disc vs. RAID 1 vs. RAID 5.

Excellent.

> The test case was extracting the 5.1 src.tgz set onto the filesystem under
> test. The extraction was done twice (having deleted the extracted tree in
> between);

I always reboot between such tests to ensure that the buffer cache has been cleared out. If I ever get around to running RAID benchmarks again, I'll script it all in /etc/rc.d with reboots between each run, so that I can get a number of runs without having to run anything by hand.

> in some cases the time for the first run is missing because I forgot to time
> the tar command.

That's a problem, because that is what is required to show the effect of the buffer cache.

> So, almost everywhere, 4.0.1 is three to fiveteen times as fast as 5.1.

I'm afraid it isn't even close to almost everywhere, because there are so many missing measurements. If we ignore all of the second runs because of the buffer cache issue, we only have two columns that contain enough data. The first is the plain disc column, and it shows things looking pretty good for 5.1. The second is RAID 5 32k, which doesn't look so good. For some reason, RAID 5 appears to be very slow and it needs looking at.

If we want to look at the second runs in order to work out why 5.1 looks so much worse there, we still only have enough data in the plain disc and RAID 5 32k columns. For the plain disc, 5.1 does perform better in the second run than in the first, just nowhere near as well as 4.0.1. My guess is that the VM parameters changed between 4.0.1 and 5.1 (they did change, I just can't remember when). Try comparing the output of "sysctl vm" on the two versions of NetBSD.

My experience is that the VM settings need adjusting in order to get acceptable performance from any specialised workload, and I suspect that under 4.0.1 your file set fits in memory, but under 5.1 it doesn't fit in the allowed file memory. Once again: RAID 5 appears to be very slow and it needs looking at.

Cheers, Lloyd
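The reboot-between-runs idea could be sketched roughly as below. Everything here is a hypothetical illustration (the state file, log path, tarball location, and target filesystem are all placeholders), not an actual rc.d script:

```shell
#!/bin/sh
# Hypothetical benchmark driver: run one timed extraction per boot so
# every run starts with a cold buffer cache, then reboot, until RUNS
# runs have been logged.

STATE=${STATE:-/var/db/bench-count}   # how many runs completed so far
LOG=${LOG:-/var/log/bench.log}        # where the timings accumulate
RUNS=${RUNS:-5}

runs_done()
{
	[ -f "$STATE" ] && cat "$STATE" || echo 0
}

record_run()
{
	n=$(( $(runs_done) + 1 ))
	echo "$n" > "$STATE"
	echo "$n"
}

bench_once()
{
	# One cold-cache extraction, timed; delete the tree between runs
	# as in the original test. Paths are placeholders.
	rm -rf /mnt/test/usr
	( time tar -xzf /src.tgz -C /mnt/test ) 2>> "$LOG"
}

# Hooked into rc.d, the driver would be:
#   if [ "$(runs_done)" -lt "$RUNS" ]; then
#       bench_once
#       record_run > /dev/null
#       shutdown -r now
#   fi
```

After the final boot the state file stops the loop, and all the timings are sitting in the log with no hand-run commands and no warm-cache contamination.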