Re: to configure HammerFS
:Hello!
:
:In order to gain performance in some peculiar cases:
:To what extent can HammerFS be tuned?
:Are there any configurable options in HammerFS?
:How do I access them?

    Most of the information is in 'man hammer'.  There are numerous
    sysctls under vfs.hammer (sysctl vfs.hammer) but they are already
    tuned.  The only real adjustments you might want to make would be if
    you were also running swapcache with an SSD.

					-Matt
					Matthew Dillon
Re: modifying nullfs
:honestly, i think about some kind of abstraction layer over HammerFS,
:that's why a stackable FS impressed me.

    Stackable FS's are always interesting, but they are also full of
    problems.  The NFS server implementation is a good example... when you
    export a filesystem via NFS the NFS client has to talk to the NFS
    server, and that's essentially creating a stacking layer on top of the
    original filesystem being exported by the server.

    There are three primary problems with any stacking filesystem:

    * Coherency.  If someone goes and does something to a file or
      directory (like remove or rename it) via the underlying filesystem,
      the stacked filesystem doesn't know about it.

    * Tracking the vnode associations is particularly difficult because
      you can't just keep the pairs of vnodes (the overlayed vnode and the
      underlying vnode) referenced all the time.  There are too many.  In
      particular, even just leaving the underlying vnodes referenced
      creates a real problem for the kernel's vnode cache management code
      because it can only hold a limited number of vnodes.

      (The NFS server handles this by not keeping a permanent ref on the
      vnodes requested by clients.  Instead it can force clients to
      re-lookup the filename and re-establish any vnode association it had
      removed from the cache.  It works for most cases but does not work
      well for the open-descriptor-after-unlinking case and can cause
      serious confusion when multiple NFS clients rename the same file or
      directory.)

    * And overhead.  When you have a stacked filesystem (such as a NFS
      server), versus a filesystem alias (such as NULLFS), the stacked
      filesystem has considerable kernel memory overhead to track the
      stacking, which creates a memory management issue if you try to
      stack very large filesystems.

    Another example of a stacked filesystem would be the UFS union mount
    (unionfs) in FreeBSD.  It was removed from DragonFly and has had
    endemic problems in FreeBSD for, oh, ever since it was written.  It
    depended heavily on the 'VOP_WHITEOUT' feature, which is something
    only UFS really supports, and not very well at that because
    directory-entry whiteouts can't really be backed up.  The union
    filesystem tried to stack two real filesystems and present essentially
    a 'writable snapshot' as the mount.

    So it's a very interesting area, but complex and difficult to
    implement properly under any circumstances.

					-Matt
					Matthew Dillon
Re: modifying nullfs
    Most nullfs VOP's are going to go directly to the underlying
    filesystem and NOT run through nullfs itself.  In DragonFly we don't
    have to replicate the vnode infrastructure for directory nodes in
    nullfs because we track { mp, vnode } instead of just { vnode }.
    nullfs is basically only used to track the mount structure.  Our
    namecache code handles mount points via the chaining within the mount
    structures and NOT via chaining within directory vnodes.

    In other words, in DragonFly a nullfs mount is just as good as the
    underlying filesystem mount, with no added overhead to use it.

					-Matt
					Matthew Dillon
Re: Errors on SSD
    Also note that you may be able to get more detailed information on
    the problem using smartctl:

	pkg_radd smartmontools
	smartctl -d sat -a /dev/daXXX

    (where daXXX is the correct device for the SSD).  In particular look
    at the wear indicator, which is typically attribute 233, and available
    reserved space, which is typically attribute 232 (but it depends on
    the SSD).

					-Matt
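For scripting, the attribute values can be pulled out of the smartctl output with awk.  A minimal sketch — the sample line below is a hypothetical excerpt of `smartctl -a` output, and both the column layout and the attribute numbering vary by SSD vendor, so check your actual output first:

```shell
# Hypothetical sample of one smartctl attribute-table line; real output
# and attribute numbers differ between SSD vendors.
sample='233 Media_Wearout_Indicator 0x0032 097 097 000 Old_age Always - 0'

# Extract the normalized value (4th column) of attribute 233, the
# typical wear indicator: 100 means new, lower means more worn.
wear=$(printf '%s\n' "$sample" | awk '$1 == 233 { print $4 + 0 }')
echo "wear indicator: $wear"
```

A value trending toward the attribute's threshold column is the signal that the drive is nearing the end of its rated write endurance.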
Re: HAMMER2 progress report - 07-Aug-2012
:
:On Wed, Aug 8, 2012 at 10:44 AM, Matthew Dillon wrote:
:
:> Full graph spanning tree protocol so there can be loops, multiple ways
:> to get to the same target, and so on and so forth.  The SPANs propagate
:> the best N (one or two) paths for each mount.
:
:Can this be tested in development?
:What commands should be used? :-)
:
:--Thanks

    It's a bit opaque, but basically you create the /etc/hammer2
    directory infrastructure and then set up some hammer2 filesystems.

    In order to be able to connect to the service daemon, and have the
    daemon be able to connect to other service daemons, you need to set up
    encryption on the machine:

	hammer2 rsainit
	mkdir /etc/hammer2/remote
	cd /etc/hammer2/remote
	(create .pub or .none files)

    Essentially you copy the rsa.pub key from the source machine into
    /etc/hammer2/remote/.pub on the target machine.  You can then connect
    from the source machine to the target machine as described below.

    Normally you also create a localhost link and, for testing purposes,
    it isn't root protected and you can tell it not to use encryption:

	touch /etc/hammer2/remote/127.0.0.1.none

    Then set up some hammer2 filesystems:

	#!/bin/csh
	#
	# example disk by serial number
	set disk = ".s1d"
	newfs_hammer2 /dev/serno/$disk
	mount /dev/serno/$disk@ROOT /mnt
	cd /mnt
	hammer2 pfs-create TEST1
	set uuid = `hammer2 pfs-clid TEST1`
	echo "cluster uuid $uuid"
	foreach i ( TEST2 TEST3 TEST4 TEST5 TEST6 TEST7 TEST8 TEST9 )
	    hammer2 -u $uuid pfs-create $i
	    mkdir -p /test$i
	    mount /dev/serno/$disk@TEST$i /test$i
	end

    The mounts will start up a hammer2 service daemon which connects to
    each mount.  You can kill the daemon and start it manually and it will
    reconnect automatically.  The service daemon runs in the background.
    To see all the debug output, kill it and start it in the foreground
    with -d:

	killall hammer2
	hammer2 -d service

    I usually do this on each test machine.  Then I connect the service
    daemons to each other in various ways:

	# hammer2 -s /mnt status
	# hammer2 -s /mnt connect span:test28.fu.bar.com
	# hammer2 -s /mnt status
	CPYID LABEL           STATUS PATH
	    1 ---.00                 span:test28.fu.bar.com

    (You can 'disconnect' a span as well.  The spans will attempt to
    reconnect every 5 seconds, forever, while in the table.)

    (The connection table is on the media, thus persistent.)

    You can also connect a 'shell' to a running service daemon, as long
    as /etc/hammer2/remote allows it:

	hammer2 shell 127.0.0.1
	(do various commands)

    Only the 'tree' command is really useful here, though you can also
    manually connect spans.  You can't kill them though.

    In any case, that's the gist of it for the moment.  The 'tree'
    command from the shell gives you a view of the spans from the point of
    view of whatever machine you are connecting to.

    Remember that the HAMMER2 filesystem itself is not production
    ready... it can't free blocks, for example (kinda like a PROM atm).
    And, of course, there are no protocols running on the links yet.  I
    haven't gotten routing working yet.  The core is fairly advanced...
    encryption, messaging transactions, notification of media config
    changes, automatic reconnect on connection failure, etc.

					-Matt
					Matthew Dillon
Re: fails to mount root
:On Monday 13 August 2012 15:38:46 Matthew Dillon wrote:
:> If you have a DDB> prompt you can hit the scroll-lock button and then
:> cursor up.
:
:HAMMER(ROOT) recovery check seqno=7a824b53
:recovery range 308735b0-30878e48
:recovery nexto 30878e48 endseqno=7a824c80
:recovery undo 308735b0-30878e48 (22680 bytes)(RW)
:Found REDO_SYNC 308516b0
:Ignoring extra REDO_SYNC records in UNDO/REDO FIFO. (9 times)
:recovery complete
:recovery redo 308735b0-30878e48 (22680 bytes)(RW)
:Find extended redo 308516b0, 139008 exbytes
:panic: lockmgr: LK_RELEASE: no lock held
:cpuid = 0
:
:Pierre
:--
:I believe in Yellow when I'm in Sweden and in Black when I'm in Wales.

    Hmm.  Well, that's clearly a software bug, but I'm not sure what is
    causing the lock to be lost.  The debugger backtrace isn't consistent.
    I would love to get a kernel core out of this but it's too early in
    the boot sequence.

    Antonio will have a patch for a boot-time tunable that will bypass
    the hammer2 recovery code tomorrow sometime.

					-Matt
					Matthew Dillon
Re: fails to mount root
:
:On Monday 13 August 2012 12:30:05 Matthew Dillon wrote:
:> Well, a 2.8 CD wouldn't have worked but you now burned a more recent
:> CD and you are getting the panic again?  The question is what is the
:> console output above the 'lockmgr' line?  i.e. all I see there is
:> part of the backtrace, and not the actual reason for the panic.  It's
:> quite possible that it is a software bug that is exploding it.
:
:"lockmgr" was on the top line.  When I mount it with the CD booted, the top
:line is "panic", which is one above "lockmgr", and I can't scroll back.  Is
:there a way to make more lines on the console?
:
:Pierre
:--
:ve ka'a ro klaji la .romas. se jmaji

    If you have a DDB> prompt you can hit the scroll-lock button and
    then cursor up.

					-Matt
					Matthew Dillon
Re: fails to mount root
    Well, a 2.8 CD wouldn't have worked, but you now burned a more
    recent CD and you are getting the panic again?  The question is what
    is the console output above the 'lockmgr' line?  i.e. all I see there
    is part of the backtrace, and not the actual reason for the panic.
    It's quite possible that it is a software bug that is exploding it.

    Antonio had a similar issue with the hammer stage 2 mount where it
    couldn't run the REDOs (which is stage 2) and then refused to mount
    because of that.  It didn't crash, but it did refuse to mount, and I
    think Antonio is working on a boot-time setting to tell hammer to
    ignore a stage 2 failure and continue on with the mount anyway.

    However, your crash is slightly different... it's an actual crash,
    which means that there's a software bug somewhere.  The hammer volume
    might still not be able to mount without the variable Antonio has been
    working on, but we should be able to track down and fix the panic.

					-Matt
					Matthew Dillon
HAMMER2 progress report - 07-Aug-2012
    Hammer2 continues to progress.  I've been working on the userland
    spanning tree protocol.

    * The socket/messaging system now connects, handshakes with public
      key encryption, and negotiates AES keys for the session data stream.

    * The low level transactional messaging subsystem is pretty solid
      now.

    * The initial spanning tree protocol implementation is propagating
      node information across the cluster and is handling
      connect/disconnect events properly.

    So far I've only tested two hosts x 10 mounts, but pretty soon now I
    will start using vkernels to create much larger topologies for testing
    purposes.  Essentially the topology is (ascii art):

	Host #1                      Any cluster (graph) topology
	                                      __    __
	                                     /  \__/  \
	PFS mount --\
	PFS mount --\\               /---(TCP)-- Host #2 --(TCP)\
	PFS mount -- hammer2 service ----(TCP)-- Host #3
	PFS mount --//               \---(TCP)-- Host #4 --(TCP)/
	PFS mount --/

    Full graph spanning tree protocol, so there can be loops, multiple
    ways to get to the same target, and so on and so forth.  The SPANs
    propagate the best N (one or two) paths for each mount.

    Any given mount is just a HAMMER2 PFS, so there will be immense
    flexibility in what a 'mount' means.  i.e. is it a master node?  Is it
    a slave?  Is it a cache-only node?  Maybe it's a diskless client-only
    node (no persistent storage at all), etc.

    Because each node is a PFS, and PFS's can be trivially created (any
    single physical HAMMER2 filesystem can contain any number of
    PFS's)... because of that there will be immense flexibility in how
    people construct their clusters.

    The low level messaging subsystem is solid.  Message relaying is
    next on my TODO list (using the spanning tree to relay messages).
    After that I'll have to get automatic reconnection working properly.
    Once the low level messaging subsystem is solid I will be able to
    start working on the higher-level protocols, which is the fun part.

    There is still a very long way to go.  Ultimately the feature set is
    going to be huge, which is one reason why there is so much work left
    to do.

    For example, we want to be able to have millions of diskless or
    cache-only clients be able to connect into a cluster and have it
    actually work... which means that the topology would have to support
    'satellite' hosts to aggregate the clients and implement a proxy
    protocol to the core of the topology, without having to propagate
    millions of spanning tree nodes.  Ultimately the topology has to allow
    for proxy operation, otherwise the spanning tree overhead becomes
    uncontrolled.  This would also make it possible to have
    internet-facing hosts without compromising the cluster's core.

    Also note that dealing with multiple physical disks and failures
    will also be part of the equation.  The cluster mechanic described
    above is an abstraction for having multiple copies of the same
    filesystem in different places, with varying amounts of data, and thus
    gaining redundancy.  But we ALSO want to be able to have a SINGLE copy
    of the filesystem (homed at a particular machine) use the SAME
    mechanism to glue together all of its physical storage into a single
    entity (plus with a copies mechanic for redundancy), and then allow
    that filesystem to take part in the multi-master cluster as one of the
    masters.

    All of these vastly different feature sets will use the same
    underlying transactional messaging protocol.

    x bazillion more features and that's my goal.

					-Matt
Re: solid-state drives
    Well, dedup has fairly low overhead, so that would be fine on a SSD
    too, but because SSDs tend to be smaller than HDDs there also tends to
    be not so much data to dedup, so you might not get much out of
    enabling it.

    --

    The SSD's biggest benefit is as a cache, though I don't discount the
    wonderfully fast boots I get with SSD-based roots on my laptops.

    Random access read I/O on a SSD is several orders of magnitude
    faster than on a HDD (e.g. 20,000+ iops vs 250-400 iops)... that's a
    50x factor, and a 15K rpm HDD won't help much.

    Random write I/O is a bit more problematic and depends on many
    factors, mainly related to how well the SSD is able to write-combine
    the I/O requests and the size of the blocks being written.  I haven't
    run any tests in this regard, but something like the OCZs with their
    massive ram caches (and higher power requirements) will likely do
    better with random writes than, e.g., the Intel SSDs, which have very
    little ram.

    Linear read and write I/O between a SSD and a HDD are closer.  The
    SSD will be 2x-4x faster on the linear read I/O (instead of 50x
    faster), and maybe 1.5-2x faster for linear write I/O.

    NOTE!  This is for a reasonably-sized SSD, 200GB or larger.  SSD
    performance is DIRECTLY related to the actual number of flash chips in
    the SSD, so there is a huge difference in the performance of, say, a
    200GB SSD versus the performance of a 40GB SSD.  A 40GB SSD can be
    limited to e.g. 40 MBytes/sec writing.  A 200GB SSD with a 6GBit/sec
    SATA phy can do 400 MBytes/sec writing and exceed 500 MBytes/sec
    reading.  Big difference.

					-Matt
					Matthew Dillon
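The 50x figure above is just the ratio of the quoted iops numbers; a quick sanity check of the arithmetic, using the upper end of the HDD range:

```shell
# Random-read iops quoted above: SSD ~20,000+, HDD ~250-400.
ssd_iops=20000
hdd_iops=400            # upper end of the quoted HDD range
factor=$((ssd_iops / hdd_iops))
echo "SSD random-read advantage: ${factor}x"
```

Against the lower end of the HDD range (250 iops) the ratio works out even larger, so 50x is the conservative end of the claim.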
Re: solid-state drives
:On Wed, Aug 01, 2012 at 06:16:13PM -0400, Pierre Abbat wrote:
:> This is a spinoff of the Aleutia question, since Aleutia puts SSDs in
:> computers.  How does the periodic Hammer job handle SSDs?  Does reblocking do
:> anything different than on an HDD?  If a computer has an SSD and an HDD, which
:> should get the swap space?
:>
:> Pierre
:
:On my workstation I use an SSD for the root filesystem, swapcache and
:/usr/pkg.
:
:The current configuration has snapshots set to 1d (10d retention time)
:and reblocking is set to 7d (1m runtime).  All other options (prune,
:rebalance, dedup and recopy) are disabled.
:
:Currently it is running fine, but in my opinion running swapcache on a
:workstation that just runs for a couple of hours is not always
:necessary.  I'm just running this setup to play with the swapcache and
:the SSD, because I think it is a very nice feature.
:
:Regards,
:Sven

    You will definitely want to turn pruning on.  It doesn't do all that
    much I/O, and it's needed to clean up the fine-grained snapshots.
    Rebalance, dedup, and recopy can be left turned off.

					-Matt
					Matthew Dillon
Re: frequency scaling on D525MW not working properly
Also on the D5* atoms on FreeBSD it would be nice to check that it actually works as advertised, by running a few cpu-bound processes (i.e. for (;;); ) and measuring the watts being burned at different frequencies. That's the real proof that the frequency scaling is doing something real. -Matt
Re: solid-state drives
:This is a spinoff of the Aleutia question, since Aleutia puts SSDs in
:computers.  How does the periodic Hammer job handle SSDs?  Does reblocking do
:anything different than on an HDD?  If a computer has an SSD and an HDD, which
:should get the swap space?
:
:Pierre
:--
:lo ponse be lo mruli po'o cu ga'ezga roda lo ka dinko

    It depends on the purpose.  I run several laptops with both swap and
    root on the SSD.  Obviously swapcache is turned off in that situation.
    I usually adjust the hammer config to turn off the 'recopy' feature
    and I usually set the reblocker to run less often, but that's about
    it.

    I only suggest running a hammer filesystem on a SSD under carefully
    controlled conditions... that is, not if you are going to be
    manipulating any large databases that could blow out the SSD.  Normal
    laptop use should work fine, but one always has to be cognizant of the
    SSD's limited write cycles.

    --

    For machines that are working harder or which need a lot of disk
    space I run hammer on the HDD and put swap on the SSD, and enable
    swapcache.  The SSD works very well as a cache under these
    circumstances.  You still need to avoid heavy paging due to running
    programs which are too big to fit into memory, since that can wear out
    the SSD.

    This is my preferred setup.  All the DragonFly boxes run with
    SSD-based swap and swapcache turned on.

					-Matt
					Matthew Dillon
Re: Latest 3.1 development version core dumps while destroying master PFS
:Hi,
:
:I tried to destroy the PFS after unmounting
:
:1. after downgrading
:2. with latest dev snapshot usb stick
:3. in single user mode
:4. after creating a link
:DataNew -> @@-1:8
:
:The system always core dumps.
:
:I guess Matt will be busy working on HAMMER2 and wonder if i should
:keep waiting till the bug is fixed.
:Since this is my Main backup Server Should I just re-install the whole
:thing and move forward?
:
:Thanks
:
:--Siju

    Well, the media looks corrupted to me.  It hit a fairly serious
    assertion.  If you need the data on that media you should be able to
    'hammer recover' it to another filesystem on a different partition,
    but that particular filesystem looks like it is toast to me.

					-Matt
					Matthew Dillon
Re: Unable to mount hammer file system Undo failed
:I have PFS slaves on a second disk.
:I have already fitted a new disk and the OS installation is complete.
:I will upgrade the slaves to master and then configure slaves for
:them so there is no problem.
:
:But I have lost the snapshot symlinks :-(
:In the PFSes I snapshotted every 5 minutes I have a lot of symlinks.
:
:Is there any easy way to recreate those symlinks from the snapshot IDs?
:
:Thanks
:
:Siju

    Try 'hammer snapls'.  The snapshots are recorded in meta-data, so if
    they're still there you can get them back.  You may have to write a
    script to recreate the softlinks from the output.

					-Matt
					Matthew Dillon
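A sketch of such a script, assuming snapls-style lines that begin with a 0x transaction id followed by a timestamp.  The PFS path and the sample output below are hypothetical, and the real column layout of 'hammer snapls' should be checked on your system before piping anything to sh:

```shell
# Hypothetical PFS path and fake snapls-style output for illustration.
pfs=/pfs/data
snapls_output='0x00000001061a8ba0 2012-08-01 04:00:01 PDT -
0x00000001062b11e0 2012-08-02 04:00:01 PDT -'

# Emit (but do not run) one 'ln -s' per snapshot, using the usual
# HAMMER <path>@@0x<transaction-id> softlink target form.
cmds=$(printf '%s\n' "$snapls_output" | awk -v pfs="$pfs" '
    $1 ~ /^0x/ {
        d = $2; t = $3
        gsub(/-/, "", d); gsub(/:/, "", t)
        printf "ln -s %s@@%s %s/snap-%s-%s\n", pfs, $1, pfs, d, t
    }')
printf '%s\n' "$cmds"
```

Review the emitted commands first and only then pipe them to sh; the snap-YYYYMMDD-HHMMSS naming is just one choice, not anything HAMMER requires.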
Re: Unable to mount hammer file system Undo failed
    People who use HAMMER also tend to backup their filesystems using
    the streaming mirroring feature.  You need a backup anyway, regardless
    of the filesystem; HAMMER just makes it easy, and this is the
    recommended method for dealing with media faults on HDDs not backed by
    hardware RAID (and even if they are).  Even ZFS's 'copies' feature has
    its limits, due to the fact that the copies are all being managed from
    the same machine.

    FreeBSD's background fsck, and mounting without an fsck (depending
    on softupdates), has NEVER been well vetted to ensure that it works in
    all situations.  There have been lots of complaints about related
    failures over the years, mostly blamed on failed writes to disks or
    people not having UPSs (since UFS was never designed to work with a
    disk synchronization command, crashes from e.g. power failures could
    seriously corrupt disks above and beyond lost sectors).  They can
    claim it works better now, but I would never trust it.  Background
    fsck itself can render a server unusable due to lost performance.

    HAMMER has a 'hammer recover' command meant to be used when all else
    fails.  It can be used directly with the bad/corrupted disk as the
    source and a new disk as the destination.  It scans the disk, yes.  A
    full fsck on a very large (2TB+) filled filesystem is almost as bad
    when it starts having to seek around.

    I have had numerous failed disks over the years and have never had
    to actually use the recover command.  I always initialize a
    replacement from one of the several live backups I keep.

    HAMMER2 will have some more interesting features that flesh out the
    live backup mechanic a bit better, making it possible to e.g.
    initialize a replacement disk locally and leave the filesystem live,
    using a remotely served backup as the replacement is reloaded from the
    backup.  But it isn't possible with HAMMER1, sorry.

					-Matt
					Matthew Dillon
Re: questions from FreeBSD user
:On Sun, Jul 15, 2012 at 5:02 PM, Wojciech Puchar wrote:
:> i have few questions. i am currently using FreeBSD, dragonfly was just
:> tried.
:>
:> 1) why on amd64 platform swapcache is said to be limited to 512GB? actually
:> it may be real limit on larger setup with more than one SSD.

    It seemed like a reasonable limit for the KVM overhead involved,
    though I don't remember the exact reason I chose it originally.

    The practical limitation for swap is 4096GB (4TB), due to the use of
    32 bit block numbers coupled with internal arithmetic overflows in the
    swap algorithms which eat another 2 bits.  We do not want to increase
    the size of the radix tree element because the larger structure size
    would double the per-swap-block physical memory overhead, and the
    physical memory overhead is already fairly significant... around 1MB
    of physical memory is needed per 1GB of swap.

    There are a maximum of 4 swap devices (w/512GB limit by default in
    total, with the per-device limit 1/4 of that).  Devices are
    automatically interleaved and can be added and removed on the fly.
    The maximum can be increased with a kernel rebuild, but it is not
    recommended... you generally won't get more performance once you get
    past 4 devices.

:> 2) it is said that you are limited to cache about 40 inodes unless you
:> use sysctl setting vfs.hammer.doublebuffer or so.
:>
:> in the same time it is said to be able to cache any filesystem.
:>
:> Can UFS be cached efficiently with millions of files?

    32 bit systems will be limited to ~100,000 inodes or so.  64 bit
    systems calculate a default limit (kern.maxvnodes) based on available
    ram, with no cap, so values > 1 million will be common.  And you can
    always raise this value via the sysctl.

    UFS ought to be cached by swapcache, but there's no point using it
    on DragonFly.  You should use HAMMER.

:> 3) how about reboots? From my understanding reboot, even clean, means losing
:> ALL cached data. am i right?

    All swapcache-cached data is lost on reboot.
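The ~1MB-per-1GB radix tree cost quoted above is easy to work through for the default and maximum swap configurations; a quick sketch of the arithmetic:

```shell
# ~1MB of wired physical memory per 1GB of configured swap, per the
# figure quoted above.
overhead_mb_per_gb=1
for swap_gb in 512 4096; do
    echo "${swap_gb}GB swap -> ~$((swap_gb * overhead_mb_per_gb))MB of physical memory"
done
```

So the 512GB default costs roughly half a gigabyte of wired ram, and a full 4TB configuration would wire down around 4GB, which is part of why the default cap is conservative.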
:> In spite of HAMMER being far far far better implementation of filesystem
:> that ZFS, i don't want to use any of them for the same reasons.
:>
:> UFS is safe.

    A large, full UFS filesystem can take hours to fsck, meaning that a
    crash/reboot of the system could end up not coming back on line for a
    long, long time.  On 32-bit systems the UFS fsck can even run the
    system out of memory and not be able to complete.  On 64-bit systems
    this won't happen, but the system can still end up paging heavily
    depending on how much ram it has.

    In contrast, HAMMER is instant-up and has no significant physical
    memory limitations (very large HAMMER filesystems can run on systems
    with small amounts of memory).

:> 4) will virtualbox or qemu-kvm or similar tool be ported ever to DragonFly?
:...
:> i am not fan of virtualizing everything, which is pure marketing nonsense,
:> but i do some virtualize few windows sessions on server.
:>
:> thanks

    With some work people have had mixed results, but DragonFly is
    designed to run on actual hardware and not under virtualization.

					-Matt
					Matthew Dillon
Re: machine won't start
:> I consider it almost a lost cause.
:
:Don't get it: trying to fix this is a lost cause?

    Yah, because if we fix it for one BIOS we break it for another.
    Hence, a lost cause.  There is no single fix which covers all BIOSs.

					-Matt
Re: machine won't start
:Thanks Matt for the explanation and tip.
:
:It did of course hang when I tried to DEL into the BIOS.
:What worked is pulling out the sata connector, entering
:the BIOS, putting it back and then detecting the disk.
:Interesting the auto detection then worked. I've explicitly
:set it to LARGE and now I can boot a rescue cd.
:
:How many bytes should I zero out for the disk to be
:"normal" again? 512 bytes? 4 megs?

    The BIOS is basically just accessing the slice table in the first
    512 bytes of the disk.  If I want to completely wipe a non-GPT
    formatted disk I usually zero out (with dd) the first ~32MB or so, to
    catch both the slice table and the stage-2 boot and the disklabel and
    the likely filesystem header.

    Destroying a GPT disk requires (to be safe) zeroing out both the
    first AND the last X bytes of the physical media, to ensure that the
    backup GPT table is also scrapped.  Again, to be safe, I zero out
    around 32MB at the beginning and 32MB at the end w/dd (if it's GPT).

    This will effectively destroy everything on the disk from the point
    of view of probing, so please note that these instructions are NOT
    going to leave multi-OS installations intact.

					-Matt
					Matthew Dillon
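The wipe can be rehearsed on a scratch file before pointing dd at a real device.  In the sketch below the file stands in for the disk; on real hardware you would substitute the device path (e.g. of=/dev/da0, a placeholder name) and a count of around 32MB:

```shell
# Stand-in "disk": a 4MB file of random data instead of a real device.
img=$(mktemp /tmp/fakedisk.XXXXXX)
dd if=/dev/urandom of="$img" bs=1024k count=4 2>/dev/null

# Zero the first 1MB in place (conv=notrunc keeps the file size); the
# same shape of command would be aimed at the start of a real disk.
dd if=/dev/zero of="$img" bs=1024k count=1 conv=notrunc 2>/dev/null

# Verify: no nonzero bytes remain in the first 1MB.
nz=$(head -c 1048576 "$img" | tr -d '\000' | wc -c)
echo "nonzero bytes in first 1MB: $nz"
rm -f "$img"
```

For the GPT case you would run a second dd with seek= pointed at the last ~32MB of the media to scrub the backup table as well.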
Re: machine won't start
    Normally this issue can be fixed by setting the BIOS to access the
    disk in LBA or LARGE mode.  The problem is due to a bug in the BIOS's
    attempt to interpret the slice table in CHS mode instead of logical
    block mode.  It's a BIOS bug.  These old BIOS's make a lot of
    assumptions w/regards to the contents of the slice table, including
    making explicit checks for particular OS types in the table.

    I've only ever seen the problem on old machines, and I've always
    been able to solve it by setting the BIOS access mode.  I've never,
    ever found a slice table format that works properly across all BIOSs.
    At this juncture we are using only newer (newer being 'only' 25+ years
    old) slice table formats (aka LBA layouts, using proper capped values
    for hard drives that are larger than the 32-bit LBA layout can
    handle).

    Ultimately we will want to start formatting things w/GPT, but that
    opens up a whole new can of worms... old BIOSes can explode even more
    easily when presented with a GPT's compat slice format, at least as
    defined by GPT.  Numerous vendors such as Apple modified their GPT to
    try to work around the even larger number of BIOS bugs related to GPT
    formatting than were present for the older LBA formatting.  I consider
    it almost a lost cause.

					-Matt
Re: pkgsrcv2.git stopped syncing?
:I did that - rm'd the checkout.  I was going to write an email about
:this as soon as I saw whether it worked again after the next 'normal'
:checkout.  Looking at gitweb, I see the conversion commit, but I don't
:see any subsequent commits... but I don't know if there are any yet.

    Ah ha!  The ghost in the machine strikes again!  We should probably
    modify the script to blow the directory away once a week, just to make
    sure it can auto-recover from that situation.

					-Matt
					Matthew Dillon
Re: pkgsrcv2.git stopped syncing?
::Hi.
::
::The latest commit on pkgsrcv2.git is 8ce625e3, which is from
::9 days ago.  But I see more commits after this date on
::pkgsrc-changes@.
:: http://mail-index.netbsd.org/pkgsrc-changes/
::
::Could someone take care of it?
::
::Best Regards,
::YONETANI Tomokazu.
:
:Ok, working on it.  Grr, that thing is getting more fragile.
:It's probably an incremental update failure of the CVS repo.
:
:					-Matt

    It should be fixed now (it fixed itself this morning, I didn't have
    to do anything).  The script was failing trying to do the incremental
    cvs checkout.  e.g. from the logs:

	U cvs-master/emulators/ucon64/patches/patch-af
	U cvs-master/net/xymon/PLIST
	U cvs-master/net/xymon/options.mk
	cvs checkout: move away `cvs-master/print/LPRng/Makefile'; it is in the way
	C cvs-master/print/LPRng/Makefile
	cvs checkout: move away `cvs-master/print/LPRng-core/MESSAGE'; it is in the way
	C cvs-master/print/LPRng-core/MESSAGE
	... (repeat a hundred times) ...

    I'm not sure why this happens.  CVS somehow gets confused over what
    files are in the repo and what files are not, possibly due to catching
    an update in the middle, or surgery done in the master cvs repo.  This
    time it fixed itself.  Sometimes I have to blow the checkout away and
    let it re-checkout everything over again.

    Theoretically I could do a fresh rm -rf and checkout every time, but
    that seems really wasteful of crater's disk and time.  It already
    takes crater between 1 and 2 hours to run the cvs->git script, so for
    now I am still leaving it set to do an incremental checkout.

					-Matt
					Matthew Dillon
Re: pkgsrcv2.git stopped syncing?
:Hi.
:
:The latest commit on pkgsrcv2.git is 8ce625e3, which is from
:9 days ago.  But I see more commits after this date on
:pkgsrc-changes@.
: http://mail-index.netbsd.org/pkgsrc-changes/
:
:Could someone take care of it?
:
:Best Regards,
:YONETANI Tomokazu.

    Ok, working on it.  Grr, that thing is getting more fragile.  It's
    probably an incremental update failure of the CVS repo.

					-Matt
Re: help with a failed cpdup assert
:I am trying to sync some pretty similar directories with cpdup over ssh
:and three of them are syncing, but the fourth fails with this:
:cpdup: hclink.c:343: hcc_leaf_data: Assertion `trans->windex +
:sizeof(*item) + bytes < 65536' failed.
:
:Additionally it only fails when
:source (cpdup slave) <=pull= destination (cpdup master)
:and not when
:source (cpdup master) =push=> destination (cpdup slave)
:
:What could be causing this?
:
:- Nikolai

    The assertion was incorrect.  That's an old version of cpdup;
    updating to the latest should solve the problem.

					-Matt
					Matthew Dillon
Re: How to suppress kernel hammer debug messages.
:Hello,
:
:I am new to this mailing list and was wondering if anyone could
:help me figure out how to suppress or otherwise disable the logging of these
:apparently benign debug messages that are filling up my syslog file.
:.
:hammer: debug: forcing async flush ip 0001093483e9...

    The debugging message was added to verify that a particular bug was
    being caught and fixed.  It's one of several unconditional debugging
    kprintf()'s that could probably be stripped out of the code.  There's
    no conditionalization on it.

    I will push a conditionalization of this particular message to
    master and the 3.0 release branch.  Getting rid of them will require
    recompiling the kernel w/updated sources.  Or you can just strip the
    related kprintf out yourself and recompile your kernel (the three
    lines at line 2438 of /usr/src/sys/vfs/hammer/hammer_inode.c, if you
    have unpacked the sources, should be where this kprintf() resides).

					-Matt
					Matthew Dillon
Re: Install DragonFlyBSD on 48 MB RAM
:One of our developers tested with snapshots; it looks like the DMA
:reserve commit is the one that made DF no longer run w/ 48MB.  That
:makes sense, as 16MB of physical memory is locked up by that commit.
:You should be able to boot with a loader variable set to reserve less
:physical memory.
:
:We someday need a better physmem allocator; the 16MB reserve is a good
:step, but a low-fragmentation allocator would be better.
:
:-- vs;

    It should be reserving less space on low-memory machines:

	if (vm_dma_reserved == 0) {
		vm_dma_reserved = 16 * 1024 * 1024;	/* 16MB */
		if (vm_dma_reserved > total / 16)
			vm_dma_reserved = total / 16;
	}

    We could try zeroing it.  Or perhaps the calculation is wrong...
    maybe it should be basing the test on 'npages' instead of 'total',
    e.g. ((vm_paddr_t)npages * PAGE_SIZE / 16) instead of (total / 16).

    However, we really don't support machines with so little memory,
    even if the thing manages to boot.  If a simple change makes it work
    then fine, but otherwise I'm skeptical of the value.

    This variable is a tunable.  Try setting 'vm.dma_reserved=0' in the
    boot loader.

					-Matt
					Matthew Dillon
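For reference, the tunable can also be set persistently from the loader configuration file rather than typed at the loader prompt each boot.  A sketch of the fragment (verify the variable name and quoting against loader.conf(5) on your system before relying on it):

```
# /boot/loader.conf -- reserve no memory for the DMA pool at boot
vm.dma_reserved="0"
```

The equivalent one-shot form at the interactive loader prompt is 'set vm.dma_reserved=0' before booting.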
Re: Install DragonFlyBSD on 48 MB RAM
I think the answer is probably 'no'. We don't try to make the system work with such a small amount of memory. It should be able to boot with 128MB of ram or more, though to really be decent a more contemporary machine is necessary. It might boot on less memory... in fact it will, but I don't think I've ever tried to boot it with less than 64M and even 64M is probably too little. -Matt
DESIGN document for HAMMER2 (08-Feb-2012 update)
This is the current design document for HAMMER2. It lists every feature I intend to implement for HAMMER2. Everything except the freemap and cluster protocols (which are both big ticket items) has been completely specced out. There are many additional features versus the original document, including hardlinks. HAMMER2 is all I am working on this year so I expect to make good progress, but it will probably still be July before we have anything usable, and well into 2013 before the whole mess is implemented and even later before the clustering is 100% stable. However, I expect to be able to stabilize all non-cluster related features in fairly short order. Even though HAMMER2 has a lot more features than HAMMER1 the actual design is simpler than HAMMER1, with virtually no edge cases to worry about (I spent 12+ months working edge cases out in HAMMER1's B-Tree, for example... that won't be an issue for HAMMER2 development). The work is being done in the 'hammer2' branch off the main dragonfly repo in appropriate subdirs. Right now it's just vsrinivas and me, but hopefully enough will get fleshed out in a few months that other people can help too. Ok, here's what I have got. HAMMER2 DESIGN DOCUMENT Matthew Dillon 08-Feb-2012 dil...@backplane.com * These features have been specced in the media structures. * Implementation work has begun. * A working filesystem with some features implemented is expected by July 2012. * A fully functional filesystem with most (but not all) features is expected by the end of 2012. * All elements of the filesystem have been designed except for the freemap (which isn't needed for initial work). 8MB per 2GB of filesystem storage has been reserved for the freemap. The design of the freemap is expected to be completely specced by mid-year. * This is my only project this year. I'm not going to be doing any major kernel bug hunting this year. Feature List * Multiple roots (allowing snapshots to be mounted). This is implemented via the super-root concept. 
When mounting a HAMMER2 filesystem you specify a device path and a directory name in the super-root. * HAMMER1 had PFS's. HAMMER2 does not. Instead, in HAMMER2 any directory in the tree can be configured as a PFS, causing all elements recursively underneath that directory to become a part of that PFS. * Writable snapshots. Any subdirectory tree can be snapshotted. Snapshots show up in the super-root. It is possible to snapshot a subdirectory and then later snapshot a parent of that subdirectory... really there are no limitations here. * Directory sub-hierarchy based quotas and space and inode usage tracking. Any directory sub-tree, whether at a mount point or not, tracks aggregate inode use and data space use. This is stored in the directory inode all the way up the chain. * Incremental queueless mirroring / mirroring-streams. Because HAMMER2 is block-oriented and copy-on-write each blockref tracks both direct modifications to the referenced data via (modify_tid) and indirect modifications to the referenced data or any sub-tree via (mirror_tid). This makes it possible to do an incremental scan of meta-data that covers only changes made since the mirror_tid recorded in a prior-run. This feature is also intended to be used to locate recently allocated blocks and thus be able to fixup the freemap after a crash. HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in that HAMMER2 does not keep track of 'deleted' records. Instead any recursion by the mirroring code which finds that (modify_tid) has been updated must also send the direct block table or indirect block table state it winds up recursing through so the target can check similar key ranges and locate elements to be deleted. This can be avoided if the mirroring stream is mostly caught up in that very recent deletions will be cached in memory and can be queried, allowing shorter record deletions to be passed in the stream instead. 
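The incremental scan that mirror_tid enables can be shown with a toy model (illustrative only; the structure here is a simplified stand-in for the real blockref, and the field names are the only thing taken from the text above):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the queueless incremental scan: mirror_tid covers a
 * whole sub-tree, so any branch whose mirror_tid is <= the tid
 * recorded by the prior run can be pruned entirely, while modify_tid
 * flags direct changes to the referenced data.
 */
struct blockref {
    uint64_t modify_tid;        /* direct modification tid */
    uint64_t mirror_tid;        /* max tid anywhere in sub-tree */
    struct blockref *children;
    int nchildren;
};

/* Count blockrefs modified since 'since', pruning unchanged trees */
static int
scan_incremental(const struct blockref *br, uint64_t since)
{
    int count = 0;
    int i;

    if (br->mirror_tid <= since)        /* whole sub-tree unchanged */
        return 0;
    if (br->modify_tid > since)
        ++count;
    for (i = 0; i < br->nchildren; ++i)
        count += scan_incremental(&br->children[i], since);
    return count;
}

/* Fixed demo tree: root with one recently-modified child and one
 * child untouched since tid 3. */
static int
demo_scan(uint64_t since)
{
    static struct blockref kids[2] = {
        { 10, 10, NULL, 0 },    /* modified at tid 10 */
        { 3, 3, NULL, 0 },      /* untouched since tid 3 */
    };
    static struct blockref root = { 2, 10, kids, 2 };

    return scan_incremental(&root, since);
}
```

A mirroring run with since=5 visits only the modified branch; a full run (since=0) visits everything.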
* Will support multiple compression algorithms configured on a subdirectory-tree basis and on a file basis. Up to 64K block compression will be used. Only compression ratios near powers of 2 that are at least 2:1 (e.g. 2:1, 4:1, 8:1, etc) will work in this scheme because physical block allocations in HAMMER2 are always power-of-2. Compression algorithm #0 will mean no compression and no zero-checking. Compression algorithm #1 will mean zero-checking but no other compression. Real compression will be supported starting with algorithm 2. * Zero detection on write (writing all-zeros), which requires the data buffer to be scanned, will be supported as compression algorithm #1.
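The power-of-2 constraint on useful compression ratios can be illustrated with a small sketch (not HAMMER2 source; the 1K minimum physical block size is an assumption for the example):

```c
#include <stdint.h>

/*
 * Why only ~power-of-2 compression ratios help: physical allocations
 * are power-of-2 sized, so a compressed result only saves space once
 * it fits in a block half (or less) the size of the logical block.
 */
static uint32_t
round_pow2(uint32_t bytes)
{
    uint32_t n = 1024;          /* assumed 1K minimum physical block */

    while (n < bytes)
        n <<= 1;
    return n;
}

/*
 * Physical bytes needed to store 'csize' compressed bytes from an
 * 'lsize' logical block; falls back to lsize when compression loses.
 */
static uint32_t
phys_alloc_size(uint32_t lsize, uint32_t csize)
{
    uint32_t n = round_pow2(csize);

    return (n < lsize) ? n : lsize;
}
```

Compressing a 64K block to 30000 bytes earns a 32K allocation (2:1); compressing it to 40000 bytes still rounds up to 64K and earns nothing.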
hammer2 branch in dragonfly repo created - won't be operational for 6-12 months.
I have created a hammer2 branch in the main repo so related commit messages are going to start showing up in the commits@ list. This branch will loosely track master but also contain the hammer2 bits that we are working on. The initial commit on this branch contains mostly non-compilable specification work and header files. hammer2 is NOT expected to be operational for at least 6 months, so don't get your hopes up for it becoming available any time soon. Once it becomes operational most of the features are NOT expected to be in place until the end of the year (hardlinks probably being one of those features that will happen last). At some point starting at around 6 months, when all the basics are working and the media structures are stable, it will be possible to split the workload up for remaining features. I'll be posting another followup in a few minutes on the design work done since the last posting. -Matt Matthew Dillon
Re: File corrupted on crash reboot. Can someone help diagnose?
:I had an email that I was writing to a few people. The computer rebooted :itself. I restarted Kmail and found the message window empty. I cd'ed into the :directory where it keeps autosaved copies of email being composed and found :that it had been overwritten with zero bytes. Fortunately I could recover the :content with undo (I've had this happen on Linux and was out of luck). Can :someone receive the undo output and the reboot times and figure out what :happened? I don't want to post it publicly, as it's a personal email, but I :can send it privately to a developer. : :Pierre :-- :li fi'u vu'u fi'u fi'u du li pa The file might not be recoverable if it wasn't fsynced to disk. It might have still been in the memory cache for the filesystem. You can try running 'undo -i ' but you may be out of luck if the file contents isn't available with any of the transaction ids it lists. -Matt Matthew Dillon
Re: top command
:Hello! : :Why 'PRES' in 'top' is 0 for all processes? : :-- :Vitaly I had to disable statistics collection for PRES because there were some serious SMP concurrency issues when the VM subsystem was changed over to using fine-grained locks. -Matt Matthew Dillon
New colo box installed, kronos.dragonflybsd.org
Kronos will mirror most of avalon's services, act as another off-site DNS server, help with our off-site backups, and probably also eventually host our web site. We'll be working it up over the next few weeks. It's a ridiculously overpowered box with 16G of ram and a 200G SSD for swapcache. -Matt Matthew Dillon
Mailing list archive operational again, nntp service discontinued
* The mailing list archive is operational again and most/all of the lost messages have been fed into it. * Our nntp service has been discontinued, superseded by things like gmail and such which provide really nice multi-device interfaces for threaded list mail. Its time has come. * Work on a new, better web-based mailing list management interface is ongoing. We know the old mail-based bestserv stuff has gotten a bit too crufty. -Matt Matthew Dillon
Re: disable lpr
:I installed cups, which has its own lpr program, and deleted the lpr that is :in world. If I rebuild world, how do I tell it not to install lpr? I know I :did this for sendmail, but I forgot where the configuration is. : :Pierre :-- :lo ponse be lo mruli po'o cu ga'ezga roda lo ka dinko I'm running cups on my workstation too, talking to a Canon printer. Instead of disabling lpr I just reworked the PATH environment variable to put /usr/pkg/bin before /usr/bin. Another trick I use if the above is too sneaky is to put /usr/local/bin first in the PATH and create a script called lpr to exec the one from /usr/pkg/bin. -- If we really wanted to make things easy we could make /usr/bin/lpr recognize an environment variable to tell it to forward to another lpr (aka /usr/pkg/bin/lpr).. though we'd have to be careful since /usr/bin/lpr is suid and sgid. Maybe a simple 'LPR_USE_PKGSRC' env variable that could be set to '1'. -Matt Matthew Dillon
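The env-variable forwarding idea could be sketched like this (purely hypothetical code, not anything that exists; the variable name follows the suggestion above, and because /usr/bin/lpr is suid/sgid the sketch only accepts an exact "1" and hardwires the forward target rather than taking a path from the environment):

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical LPR_USE_PKGSRC check.  Returns the hardwired pkgsrc
 * lpr path when the variable is exactly "1", else NULL (meaning:
 * run the native lpr code path).  A suid program must never take
 * an exec path itself from the environment.
 */
static const char *
lpr_forward_path(const char *env_val)
{
    if (env_val != NULL && strcmp(env_val, "1") == 0)
        return "/usr/pkg/bin/lpr";  /* hardwired, never user-supplied */
    return NULL;
}
```

In a hypothetical main() this would look like: if lpr_forward_path(getenv("LPR_USE_PKGSRC")) returns non-NULL, execv() that path with the original argv after dropping privileges; otherwise continue as the stock lpr.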
Re: Request for suggestion for setting up a server with 4 HDDs
: :Thanks for the pointer, but again the dragonflybsd site is down (GMT :08:53:45 Decemeber 27, 2011) to access the link Justin pointed to: :leaf.dragonflybsd.org/mailarchive/commits/2009-12/msg00068.html. :-( Insofar as I can tell the site is up, accessed from the outside internet. -Matt
Re: Dragonflybsd site seems to go down frequently!
: :Again the site is down (GMT 08:53:45 Decemeber 27, 2011). Make me :worry whether I could really go for a dfbsd production server?!!! And there will probably be downtime in the future. The machines behind our web site typically run the absolute latest development code and we expect there to be crashes or other issues, which we then go and fix. Since the project is small this is the only real way we can test the system. -Matt
Re: Which is ideal with HAMMER? softraid or hammer volume_add
Definitely not hammer volume add, that's too experimental. Soft-raid is a bit of a joke in my view, since it typically ties you to a particular motherboard and bios (making it difficult to physically move disks to another machine if the mobo or psu dies), and as with all soft-raid systems any sort of power failure during a write is likely to cause unrecoverable data loss. Honestly I don't know of a single system that ever had fewer failures with soft-raid than with single disks w/ near real-time backup streams. For HAMMER1 the best set-up is either a real raid system or no raid at all and a master/slave server setup, depending on what is being served. Unfortunately nothing in BSD really approaches Linux's block level clustering and VZ container system at the moment (which is a bit of a joke too when it comes to multiple failover events but works pretty well otherwise). If you have a small system then there's no point running RAID. If you have a larger system then there's no point running a single server. And running RAID on multiple servers eats a lot of power so for storage needs less than what conveniently fits on one or two disks there's no point running RAID at all... you run redundant servers instead and use a SSD as a caching layer in front of the slower hard drive. For larger single-volume storage needs multiple real raid system for primary and backup with all the insundry fallback hardware is the only way to go. Soft-raid won't cut it. -Matt Matthew Dillon
Re: bug in du: truncates filenames
:I'm running du on snapshots to see how much space is taken by work directories :(which will stick around for over another month; the downloaded tarballs will :disappear in just a few days). I got this error: : :# du -s /var/hammer/usr/snap-20111?11*/pkgsrc/ :du: /var/hammer/usr/snap-2011-0501/pkgsrc/x11/xterm/work/xterm-259/xtermcfg.h: :No such file or directory :du: /var/hammer/usr/snap-2011-0501/pkgsrc/x11/xterm/work/xterm-259/Makefile: :No such file or directory : :I checked the directory; the files are actually Makefile.in and xtermcfg.hin . :It's not simply truncating the filename to a fixed length, since there's a :file named xterm.log.html , which is longer. Any idea what's going on? : :Pierre What's probably happening is the snapshot caught a flush in between its directory entry creation and its inode creation. There is probably a directory entry for the files in question but no inode. It isn't supposed to happen but does sometimes. It's a bug in HAMMER that I haven't found yet. If you cd into the snapshot and run ls: cd /var/hammer/usr/snap-2011-0501/pkgsrc/x11/xterm/work/xterm-259 ; ls You should see the 'ls' program complain about a missing 'Makefile'. -Matt
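The failure mode can be modeled in a few lines (a toy model, not HAMMER code): a directory entry whose target inode was never flushed produces exactly the ENOENT that du and ls report.

```c
#include <stddef.h>

/*
 * Toy model: a snapshot that caught a flush between dirent creation
 * and inode creation contains dirents pointing at inodes that do not
 * exist, so stat() through such an entry fails with ENOENT.
 */
struct dirent_model {
    const char *name;
    unsigned long inum;         /* inode number the entry points at */
};

/* Returns nonzero if 'inum' exists in the inode table */
static int
inode_exists(const unsigned long *inodes, int ninodes, unsigned long inum)
{
    int i;

    for (i = 0; i < ninodes; ++i)
        if (inodes[i] == inum)
            return 1;
    return 0;
}

/* Count entries that would produce ENOENT on lookup */
static int
count_orphans(const struct dirent_model *ents, int nents,
              const unsigned long *inodes, int ninodes)
{
    int i, orphans = 0;

    for (i = 0; i < nents; ++i)
        if (!inode_exists(inodes, ninodes, ents[i].inum))
            ++orphans;
    return orphans;
}

/* Demo: two complete files plus one dirent whose inode is missing */
static int
demo_orphans(void)
{
    static const struct dirent_model ents[3] = {
        { "Makefile.in", 10 },
        { "xtermcfg.hin", 11 },
        { "Makefile", 99 },     /* dirent flushed, inode was not */
    };
    static const unsigned long inodes[2] = { 10, 11 };

    return count_orphans(ents, 3, inodes, 2);
}
```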
Re: Dragonflybsd site seems to go down frequently!
:Is it only me or others also experience frequent downtime with the :downtime. I experienced downtime several times and right now (GMT :13:43 December 26, 2011) when I tried to access the official :dragonflybsd.org site! No, we had some issues overnight, primarily a cpu hogging bug on avalon (which routes dragonfly's internal network via openvpn) which I thought I had fixed but hadn't. The site should be accessible again. -Matt Matthew Dillon
Merry X-Mas and 3.0 release after the holidays - date not yet decided
Hello everyone! First, I apologize for the aborted 2.12 release. We got as far as rolling it but I decided to make a real push to try to fix the occasional random seg-fault bug that we were still seeing on 64-bit at the time. The seg-fault issue has now been resolved, I posted an exhaustive synopsis to the kernel@ list just a moment ago. Basically it appears to be an AMD cpu bug and not a DragonFly bug. We don't have final confirmation that it isn't a DragonFly bug because it is so sensitive to %rip and %rsp values that reproducing the environment to test it on other OSs (even FreeBSD) is difficult, but I'm 99% certain it's an AMD bug. Adding a single NOP instruction to the end of one routine in the gcc-4.4 codebase appears to work around the bug. So moving on to rolling an official release... (1) From past experience, we will NOT do a release during the holidays! So everyone please enjoy Christmas and New Years! (2) I would like to call the release 3.0. Why? Because while spending the last ~1-2 months tracking down the cpu bug a whole lot of other work has gone into the kernel including major network protocol stack work and major SMP work. My contribution to the SMP work was to completely rewrite the 64-bit pmap, VM object handling code, and VM fault handling code, as well as some other stuff. This has resulted in a phenomenal improvement in concurrency, in particular for concurrent compilations or anything that takes a lot of page faults. SMP contention was completely removed from the page fault path and for most VM related operations, and almost completely removed from the exec*() path. Other related work has continued to improve mysql and postgresql numbers as well. (3) Release date is as-yet undecided. It will probably be mid-February to end-February in order to synchronize with the pkgsrc 2011-Q4 release and give things time to settle. The release meisters will be discussing it on IRC. I will say that there are NO serious showstoppers this time. 
I'd like us to take our time and make this the best release we've ever done! -Matt
Concurrent buildworld -j N heads up - update both install and mkdir
We are still stabilizing the new buildworld -j N changes. In addition to the install utility needing some internal fixes, the mkdir(1) utility also needed an internal fix. In order to bootstrap being able to run buildworld -j N you may have to update both of these utilities manually as follows (after updating your sources to the latest master), before running your buildworld: cd /usr/src/usr.bin/xinstall make clean; make obj; make all install cd /usr/src/bin/mkdir make clean; make obj; make all install I still expect there to be a few more races that crop up every once in a while and we will continue to fix them as they pop up. -- Also note that the higher concurrency will of course also use more memory, including potentially a lot more memory during the GCC build. When running on machines with limited ram you may have to reduce the -j N value you used in the past. As before, machines with very little ram (e.g. less than 1G) will probably page to swap during a -j N build even with N as low as 4. -Matt Matthew Dillon
Significantly faster concurrent buildworld times
I did a pass on the buildworld infrastructure and added new features to allow SUBDIR recursions to run concurrently. This should improve buildworld -j 12 (or similar) significantly. I was able to get a 28% improvement on our quad-core (8 thread) Xeons (1075 -> 769 seconds). This is still a bit experimental in that there may be build dependencies that we haven't ferreted out yet. In particular, you might have to update your 'install' program to the latest in master to avoid a race inside its mkdir() function which could error-out the build (only if you are doing make -j N on your buildworlds). This work cleaned up probably 70-80% of the bottlenecks we had in the buildworld. There are far fewer periods showing idle cpu during the build with these changes. http://gitweb.dragonflybsd.org/dragonfly.git/commit/d2e9c9d8664f753a0d599eceed1dd98ffa7ef479 http://gitweb.dragonflybsd.org/dragonfly.git/commit/67be553814c6242d4a801d26dc2f6e5ca4b1aa8a http://gitweb.dragonflybsd.org/dragonfly.git/commit/0e6b9ee838cf6370c2c8e3ea723839c383de96cc http://gitweb.dragonflybsd.org/dragonfly.git/commit/6e73105ec5492ebba66b83ced8a62e16f87e0498 -Matt Matthew Dillon
leaf upgrade status
* Leaf has been upgraded to 64-bits and all-new hardware. The new machine is about 30% faster than the old one and has five times the ram (16G total). * Most services are operational. However, our web front-page still has an issue with the embedded digest and the bugtracker is currently non-operational. Expect instability for the next few days. * There may be pkgsrc packages missing that developers need. If you need a package installed get onto our efnet IRC channel (#dragonflybsd) and tell us, we will add it back in. * Any local binaries developers have compiled in their leaf accounts will have to be recompiled. Our repository box will probably be upgraded Thursday afternoon. -Matt Matthew Dillon
Re: Can someone upgrade tor in Q3?
Sorry folks, the cvs2git scripts were running only the base conversion for the 2011 pkgsrc branches and not running the synchronization pass to fixup missing bits. I've added the synchronization pass for all 2011 branches. It will be a few hours before it gets them synced up. -Matt
heads up - Machine upgrades this week.
Both crater and leaf will be upgraded this week. Either Wednesday or Thursday. We'll try to make it as painless as possible but because we are upgrading the boxes from 32 bits to 64 bits there will be services downtime. * Probable web site down time for a few hours (up to 6). * Commits (for developers) may be disallowed for a few hours. * Documentation, Mailing list, mailing list archive, and news services may be down for a few hours. * pkgsrc mirrors will NOT be affected. Poor crater is running cvs and git conversion scripts on almost a hundred gigabytes of material four times a day and its lowly 2G of ram isn't enough. The only reason it still works 'ok' is its 100G of SSD swapcache. Poor leaf is in similar straits, handling cgit and gitweb requests from various search engines (which we want) and having to deal with a multitude of concurrent 300MB+ process images, not to mention developer git repos and vkernel images. With only 3G of ram only its SSD swapcache allows it to continue to function. Both will be upgraded to Xeon E3 (Sandybridge) based boxes w/16G of ECC ram each. Should be really nice after that, particularly for developers who use leaf regularly. This will occur Wednesday and/or Thursday if all goes well. -Matt Matthew Dillon
Performance results / VM related SMP locking work - committed (3)
he same time, 2500 seconds after all four were started) is only 500 seconds slower than for one, meaning that we are getting very good concurrency now. BUILDKERNEL NO_MODULES=YES TESTS This set of tests is using a buildkernel without modules, which has much greater compiler concurrency versus a buildworld test since the make can keep N gcc's running most of the time. 137.95 real 277.44 user 155.28 sys monster -j4 (prepatch) 143.44 real 276.47 user 126.79 sys monster -j4 (patch) 122.24 real 281.13 user 97.74 sys monster -j4 (commit) 127.16 real 274.20 user 108.37 sys monster -j4 (commit 3) 89.61 real 196.30 user 59.04 sys test29 -j4 (patch) 86.55 real 195.14 user 49.52 sys test29 -j4 (commit) 93.77 real 195.94 user 67.68 sys test29 -j4 (commit 3) 167.62 real 360.44 user 4148.45 sys monster -j48 (prepatch) 110.26 real 362.93 user 1281.41 sys monster -j48 (patch) 101.68 real 380.67 user 1864.92 sys monster -j48 (commit 1) 59.66 real 349.45 user 208.59 sys monster -j48 (commit 3) <<< 96.37 real 209.52 user 63.77 sys test29 -j48 (patch) 85.72 real 196.93 user 52.08 sys test29 -j48 (commit 1) 90.01 real 196.91 user 70.32 sys test29 -j48 (commit 3) Kernel build results are as expected for the most part. -j 48 build times on the many-core monster are GREATLY improved, from 101 seconds to 59.66 seconds (and down from 167 seconds before this work began). That's a +181% improvement, almost 3x faster. The -j 4 build and the quad-core test29 build were not expected to show any improvement since there isn't really any spinlock contention with only 4 cores. There was a slight nerf on test29 (the quad-core box) but that might be related to some of the lwkt_yield()s added and not so much the PQ_INACTIVE/PQ_ACTIVE vm_page_queues[] changes. -Matt Matthew Dillon
Performance results / VM related SMP locking work - committed (2)
Here is an update with the quad buildworld tests on monster, with the latest commit. Historical data. Note that I fixed the improvement numbers, I was calculating them wrong. 100% improvement means half the time. Tests on monster (48 core opteron w/64G ram) :monster buildworld -j 40 timings 4x prepatch: (baseline) : 8302.17 real 4629.97 user 17617.84 sys : 8308.01 real 4716.70 user 22330.26 sys :monster buildworld -j 40 timings 4x postpatch 1: (41.4% improvement) : 5799.53 real 5254.76 user 23651.73 sys : 5800.49 real 5314.23 user 23499.59 sys :monster buildworld -j 40 timings 4x COMMIT#1: (93.4% improvement) : 4207.85 real 4869.90 user 20673.71 sys : 4248.45 real 4899.08 user 21697.11 sys And with the latest commit today (on first run after boot even!) :monster buildworld -j 40 timings 4x COMMIT#2: (108% improvement) <<< 3943.25 real 4630.76 user 21062.91 sys And I expect to do better yet. Current system status: We have one issue with a reproducible seg-fault/bus-fault related to mmap which I hope to squash today. -Matt
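For reference, the improvement convention used in these numbers (where a 100% improvement means half the wall time) works out to the ratio of old to new time, minus one. A sketch with generic numbers (integer math, so fractional percentages truncate; this is just the formula, not anything from the build scripts):

```c
/*
 * improvement% = (old/new - 1) * 100, so halving the time is a
 * 100% improvement and equal times are a 0% improvement.
 */
static long
improvement_pct(long old_secs, long new_secs)
{
    return (old_secs * 100) / new_secs - 100;
}
```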
Performance results / VM related SMP locking work - committed
real 209.52 user 63.77 sys test29 -j48 (patch) 85.72 real 196.93 user 52.08 sys test29 -j48 (commit) <<< For the kernel build, an 11.4% improvement -j4 on monster (only utilizing 4 of the 48 cores, well, at least as far as make -j goes). For the kernel build, a 39.4% improvement -j48 on monster, utilizing all 48 cores. On the test29 quad-core the numbers weren't expected to improve a whole lot, and they didn't, because single-chip multi-core spin locks are very, very fast. Surprisingly though the -j48 build improved performance by quite a bit, around 11%. The real improvements are on systems with more cores. Monster, with 48-cores, made for a very good test case. -Matt Matthew Dillon
Performance results / VM related SMP locking work - prelim VM patch
52 user 63.77 sys test29 -j48 (patch) * The -j 4 builds don't have much contention before or after the patch, monster was actually slightly faster pre-patch (though I'm not being scientific here, it's within the realm of error on my part). * Keeping in mind that test29 is a quad-core the parallelism is really only 4, I use a -j 48 build to approximate other non-SMP related overheads on test29 for comparison to monster's -j 48 build in the last two results. Of course, -j 48 on monster is a different story entirely. That will use all 48 cores in this test. * test29 is 1.5x faster than monster, hence the 4-core limited results make sense (89.61 vs 143.44 seconds, which is 1.60x). * The monster -j 48 kernel build without modules has better compiler concurrency vs a buildworld. So the final two lines show how the contention affects the build. Monster was able to reduce build times from 143 to 110 seconds with -j 48 but as you can see the system time ballooned up massively due to contention that is still present. Monster -j 48 pre-patch vs post-patch shows how well contention was reduced in the patch. 167 seconds vs 110 seconds, a 34.1% improvement! system time was reduced 4148 seconds to 1281 seconds. The interesting thing to note here is the 1281 seconds of system time the 48-core 48-process compiler concurrency test ate. This clearly shows what contention still remains. From the ps output (not shown) it's still mostly associated with the vm_token (probably the vm_page allocation and freeing path) and vmobj_token (probably the vm_fault path through the vm_map and vm_object chain). I'll focus on these later on once I've stabilized what I have already. Even a -j N kernel build with NO_MODULES=TRUE has two major bottlenecks: the make depend and the link line at the end, which together account for (off the cuff) somewhere around ~45 seconds of serialized single-core cpu on monster. 
So even in the ideal case monster probably couldn't do this build in less than about ~55 seconds or so. -Matt Matthew Dillon
Re: Why /dev/serno doesn't show USB disk? also SMP version
:I pulled the laptop disk out of the USB adapter and put it in the laptop :(which didn't work, apparently the laptop doesn't recognize a disk bigger :than 128 GB). I then put it back in the adapter and plugged it in. It came up :as da9. I tried to mount it and got an error, because fstab says it's da8 (I :had forgotten to turn off cryptsetup). /dev/serno doesn't show any entry for :it. How come? : :The kernel version I'm running is: :DragonFly darner.ixazon.lan 2.11-DEVELOPMENT DragonFly :v2.11.0.203.g0e5ac-DEVELOPMENT #2: Mon May 16 09:57:06 UTC 2011 :r...@darner.ixazon.lan:/usr/obj/usr/src/sys/GENERIC_SMP i386 :This is one day after Sephe announced "SMP kernel now boots UP system", but :that's the compilation date. If I boot the kernel I'm running on a UP system, :what'll happen? Should I recompile? : :Pierre Probing serial numbers over USB attachments often either return no serial number or lock up the USB stick. Most USB hard drives run through a USB<->SATA/Firewire bridge chip and most of these bridge chips can't handle serial number queries. So for USB attachments one mostly has to depend on the fact that USB attachments start at /dev/da8 and go from there. And hopefully not have more than 8 normal SATA drives. In any case, /dev/da8 didn't detach probably because the device was still referenced by the crypto code. If you completely dereference the device it should detach from that attachment point so the next plug-in reuses the same attachment point. I don't think there's a good solution for the usb serial number issue atm, other than to not use USB for any serious hard drive attachments. For a laptop which only has USB I guess there's no choice, and in that case you just have to hardwire it to /dev/da8 or something like that. -Matt
Re: process flips between CPUs
A process which is sleeping most of the time will tend to be scheduled on whatever cpu is available. From the perspective of the scheduler which may switch between user processes on a 1/100 second clock a process which uses the cpu heavily will tend to be scheduled on the same cpu. However, from the human perspective observing the top or ps output, even a heavily cpu-bound program will switch between cpus every so often. Normally locking a process to a particular cpu is not necessary. -Matt
Re: Recover slave PFS
It is a bug, it shouldn't have removed the softlink for the PFS. However, the only way to destroy a pfs is with pfs-destroy and since you didn't do that the PFS is still intact. All you have to do is re-create the softlink. The PFS softlink points to "@@-1:n" Where 'n' is the pfs number. For example, PFS #5 would be: "@@-1:5" The format must be precise. If you recreate the softlink for the missing pfs in your /pfs directory you should be able to CD into it and get it back. If you don't know the PFS number look at the PFS numbers for the existing PFS's and guess at the ones that might be missing. -Matt
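Rebuilding the link target can be sketched in C as well (the "@@-1:n" format is exactly as described above; the helper names are made up for the example):

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the softlink target for a HAMMER PFS: the link must point at
 * exactly "@@-1:n" where n is the PFS number.
 */
static void
pfs_link_target(char *buf, size_t bufsz, int pfs_no)
{
    snprintf(buf, bufsz, "@@-1:%d", pfs_no);
}

/* Helper for checking the generated target string */
static int
check_target(int pfs_no, const char *expect)
{
    char buf[16];

    pfs_link_target(buf, sizeof(buf), pfs_no);
    return strcmp(buf, expect) == 0;
}
```

In practice the recovery is just the equivalent ln -s: generate the target for, say, PFS #5 and symlink(2) it back into the /pfs directory.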
hammer dedup in HEAD now has a memory limiting option
The hammer dedup and dedup-simulate directives in HEAD now accept a memory use limit option, e.g. '-m 200m'. The default memory use limit is 1G. If the dedup code exceeds the memory limit it will automatically restrict the CRC range it collects information on and will loop as many times as necessary to dedup the whole disk. This should make dedup viable inside qemu or other virtual environments. A few minor I/O optimizations were also made to try to pre-cache the b-tree metadata blocks and to allow the dedup code to get past areas already dedupped more quickly. Initial dedups will still take a long time. ^C and ^T are also now supported during hammer dedup runs so you can see the progress. It has to pre-scan the b-tree but once it actually gets into dedupping stuff ^T will give you a good indication of its progress. ^C was being ignored before and now works as well. -Matt Matthew Dillon
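The range-restriction loop can be modeled roughly as follows (illustrative only; the real hammer code restricts the CRC key range it collects on and loops until the whole disk is covered, rather than computing a pass count up front — this just shows the memory/pass tradeoff):

```c
#include <stdint.h>

/*
 * Toy model: halve the CRC range covered per pass (doubling the
 * number of passes) until the estimated per-pass dedup state fits
 * under the memory limit.
 */
static int
dedup_passes(uint64_t est_state_bytes, uint64_t mem_limit)
{
    int passes = 1;

    while (est_state_bytes / passes > mem_limit && passes < (1 << 16))
        passes <<= 1;
    return passes;
}
```

With, say, 800MB of estimated state and a '-m 200m' limit this model needs 4 passes; when everything fits under the limit it stays at a single pass.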
Re: pkgsrc-update failes with core dumps
I think master currently has a VM issue somewhere (in software). I'm sometimes getting an internal compiler error when building the world, too. -Matt Matthew Dillon
Re: pkgsrcv2.git not syncing correctly; around 400 missing files
:Hi Matt, :It looks much better now. :All the "MISSING" files have been restored. :There are still some "DIFF" files making it through the script. I :increased the regex to filter out $Revision[:$] and $Date[:$] as well as :$Id[:$] and $NetBSD[:$], and the attached file shows what is left. : :The remaining files on the list feature the $Log$ CVSID and others, so :the git pkgsrc repository looks 100% synchronized to me! : :John Yes, this is because CVS $variable expansions are not formally stored as patches in the CVS archive. Instead the variable-expansion is done after the file is checked out. The version of the file in the CVS archive will often contain the variable expansions related to the previous version rather than that particular version. So anything related to variable expansion will be broken no matter what we do. The git conversion scripts effectively have to tell cvs not to expand anything and work just with the pure CVS archive (which contains the broken expansions associated with the version previous to the one being checked out), otherwise incremental patches will not work properly. My pkgsrc cvs->git conversion script is ridiculously complex. Not only can the cvs2git conversion not always work properly, the rsync of the cvs repo itself can catch a cvs commit in the middle so the script has to loop the rsync until it detects the topology hasn't changed recently (i.e. is stable). And even then it doesn't always stay in sync so my script then does a catch-all cvs checkout, git checkout, and diff/patch, then a forced git commit to clean up the loose ends. Of course, it is all for naught if rsync itself breaks like it just did :-( -Matt
Re: pkgsrcv2.git not syncing correctly; around 400 missing files
Ok, I upgraded rsync to the latest version and it appears to work now. I think it might have been a protocol incompatibility between the older rsync crater was running (2.something) versus the current version 3.0.8. I will manually run the pkgsrc updating script, please check in about an hour to see if the repo has been corrected. -Matt
Re: pkgsrcv2.git not syncing correctly; around 400 missing files
:I have been using pkgsrc from our git mirror (pkgsrcv2), but I recently :noticed some patches were missing as it caused me to submit a bad patch :to pkgsrc while fixing multimedia/xine-lib port, and since then I've :found many missing files. : :I pulled pkgsrc via CVS and created a script to compare both :repositories. I had to tell diff to ignore differences that we caused :by CVSID tags (e.g. $NetBSD$ and $Id$) because for some reason these :CVSIDs were the only difference in hundreds of files. : :The result is attached. :367 files are shown as missing and the remaining 36 are shown as different. : :At the very least, this report could be used to manually sync :pkgsrcv2.git, but it appears something systematic is amiss due to the :large number of missing patches. Hopefully this can be fixed? : :Regards, :John Hmm. It looks like the rsync our script is running to get the CVS archive is failing. I'm getting tons of these sorts of messages in the logs: rsync: recv_generator: failed to stat "/archive/NetBSD-CVS/xsrc/external/mit/xwininfo/dist/man/xwininfo.man,v": Unknown error: 0 (0) ... I'm not sure what is going on. The directory structure looks ok. The lstat() it is failing on, when I ktrace, is returning a proper ENOENT error code. If I start with a clean, empty target directory I get the same problem. rsync is trying to stat stuff which doesn't exist and is then complaining about it. It thinks the error code is 0 when it isn't. This is blasted confusing. I am running this rsync: /usr/pkg/bin/rsync -aHS --delete --exclude '#cvs.lock' rsync://anoncvs.NetBSD.org/cvsroot /archive/NetBSD-CVS ... 
13690 rsync 0.07 CALL  lstat(0xbfbff2f0,0xbfbfe9e0)
13690 rsync 0.03 NAMI  "CVSROOT/config"
13690 rsync 0.16 RET   lstat -1 errno 2 No such file or directory
13690 rsync 0.74 CALL  write(0x2,0xbfbfd470,0x60)
13690 rsync 0.14 GIO   fd 2 wrote 96 bytes "rsync: recv_generator: failed to stat "/archive/NetBSD-CVS/CVSROOT/config": Unknown error: 0 (0)"
13690 rsync 0.05 RET   write 96/0x60

I have verified that it does not try to create the file beforehand in the ktrace. Insofar as I can tell there's nothing wrong with HAMMER or the directory structure. rsync's memory use does hit around 32MB, then stabilizes, then a short time later it starts spewing out tons of these errors. I wonder if there is an issue with rsync's memory use? -Matt
Re: Running OpenGrok on DragonFly
I've always wanted to run OpenGrok on our /archive, using Leaf. The last time I tried I got stuck on the JDK dependency too. Now that you've got the JDK working I may try setting it up again. -Matt Matthew Dillon
Re: Real World DragonFlyBSD Hammer DeDup figures from HiFX - Reclaiming more than 1/4th ( 30% ) Disk Space from an Almost Full Drive
:... :> >> take? :> >> :> > :> > I ran them one by one. at my own pace but the biggest two :> > simultaneously did not take more than 2 hrs. :> > So I guess 2-3 hrs would be a nice approximation :-) :> :> My experiences were different on a file system containing a lot of data :> (>2TB). :> :> I didn't try dedup itself but a dedup-simulate already ran for more than :> two days (consuming a lot of memory in the process) before I finally :> cancelled it. : : Most odd - I just tried a dedup-simulate on a 2TB filesystem with :about 840GB used, it finished in about 30 seconds and reported a ratio of :1.01 (dedup has been running automatically every night on this FS). : :-- :Steve O'Hara-Smith | Directable Mirror Arrays I think this could be a case of: the more CRC collisions there are, the more I/O dedup (or dedup-simulate) has to issue to determine whether each collision is an actual dup or just a CRC collision where the data is different. The memory use can be bounded with some additional work on the software, if someone wants to have a go at it. Basically the way you limit memory use is by dynamically limiting the CRC range that you observe in a pass. As you reach a self-imposed memory limit you reduce the CRC range and throw away out-of-range records. Once the pass is done you start a new pass with the remaining range. Rinse, repeat until the whole thing is done. That would make it possible to run de-dup with bounded memory. However, the extra I/Os required to verify duplicate data cannot be avoided. -Matt Matthew Dillon
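The pass-limited scan described above might look like this in pseudocode. Nothing like it exists in the hammer utility today; the names and the eviction policy are invented purely to show the shape of the idea:

```
lo, hi = 0x00000000, 0xffffffff        # CRC window observed this pass
while lo <= 0xffffffff:
    table = empty hash table keyed by data-record CRC
    for each data record in the B-Tree scan:
        if record.crc < lo or record.crc > hi: skip it
        add record to table[record.crc]
        if memory(table) > self-imposed limit:
            hi = narrow the window; evict entries with crc > hi
    for each CRC with more than one record:
        byte-compare the candidate blocks   # a CRC match may be a collision
        dedup only the true duplicates      # the extra I/O that cannot be avoided
    lo, hi = hi + 1, 0xffffffff            # next pass covers what remains
```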
Re: cache_lock: blocked & unblocked
:Hi, : :I got these on my DragonFly v2.11.0.247.gda17d9-DEVELOPMENT : :[diagnostic] cache_lock: blocked on 0xdacafa28 "2.0" :[diagnostic] cache_lock: unblocked 2.0 after 9 secs :[diagnostic] cache_lock: blocked on 0xdc244c18 "2.0" :[diagnostic] cache_lock: unblocked 2.0 after 2 secs :[diagnostic] cache_lock: blocked on 0xd9968ea8 "mail" :[diagnostic] cache_lock: unblocked mail after 15 secs :[diagnostic] cache_lock: blocked on 0xc46a7378 "" :[diagnostic] cache_lock: blocked on 0xc46a7378 "" :[diagnostic] cache_lock: unblocked after 0 secs :[diagnostic] cache_lock: unblocked after 0 secs : :is there any thing I should chek out? : :Thanks : :--Siju No, as long as the blockages unblock at some point it's ok. The blockages are likely due to hammer's flusher. I have a patch under test (related to the blogbench thread) that also seems to reduce the namecache stalls. -Matt Matthew Dillon
Re: cpdup /pfs
The problem here is that cpdup'ing /pfs will result in the wrong symlinks on the target filesystem because the PFS IDs are different on the target filesystem. There is nothing cpdup can do here to help, you have to tell it to ignore the pfs directory (see -x option to cpdup and the use of a file containing a list of exclusions). -Matt
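A hedged example of the exclusion approach (flag behavior from memory, so double-check cpdup(1); the mount points are hypothetical):

```sh
# list the directory to skip in a .cpignore file at the source root;
# cpdup -x tells cpdup to honor .cpignore files it finds
printf 'pfs\n' > /master/.cpignore
cpdup -x /master /backup
```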
Re: newfs_hammer doesn't set dedup time
: :I made a new Hammer filesystem on the laptop disk and looked at the output of :hammer cleanup, which shows no dedup. I ran hammer config on it, and there is :a dedup line, but it's commented out. The PFSs have no config. How come? : :Pierre :-- dedup isn't turned on by default. dedup is being used regularly now, but deduplication in general can lead to I/O fragmentation so it isn't the default. Not everyone needs dedup. hammer cleanup should have installed a default config for each PFS. -Matt Matthew Dillon
HEADS UP - Dragonfly network renumbering
The DragonFly network is being renumbered. Hopefully it will be painless but we're doing it in stages and there may be some disruption. -Matt Matthew Dillon
Re: md5 sums and hammerfs encryption
:http://www.dragonflybsd.org/release210/ has MD5 sums listed there. I :don't have access to crater to update the md5.txt file, though. Ok, I pasted them into md5.txt. -Matt
Re: Hammer on multiple hot-swappable disks
:I'm thinking of founding an ISP and running it with a mix of DragonFly and :Linux boxes. My current boss showed me a rack-mountable server which he uses. :If I understood him right, it has three bays where hot-swappable SCSI drives :can be inserted. I was thinking about how to handle disks that are about to :fail, or whose filesystems are getting too big. : :Suppose I have a bunch of disks all partitioned like this: :da#s1a 768 MB ufs /boot :da#s1b 1 GB swap :da#s1d hammer :da#s1e luks hammer. :I have da0 and da1 in the server and I want to insert a disk into da2 and pull :out the one in da1. Can I do this with the "hammer volume-add da2s1d; hammer :volume-del da1s1d"? How long will this take? Do I run cryptsetup on da2s1e :before adding the volume? : :Pierre No, unfortunately there is still one sticking point preventing that from working. The volume delete code can't remove the root volume (in a multi-volume hammer mount one is designated as the root volume. In a single-volume hammer mount that volume IS the root volume for the mount). -Matt
Re: Updating Development Version on Slow machines from another Fast machine
:Siju, : :I NFS mount /usr/src and /usr/obj in the slow machine (being the NFS :server the faster machine) and then I issue the usual :installkernel/installworld/upgrade commands. : :Cheers, :Antonio Huete I do the same thing. In fact, sometimes I even NFS-mount /usr/obj across the internet and make installworld is still faster than compiling it up locally on the slow box :-) -Matt
Re: What does this mean ?
: Hi, : : Message on console not too frequent, possibly associated with heavy :disk usage: : :thr_umtx_wait FAULT VALUE CHANGE 7162->7165 oncond 0x800990104 : : What does it mean, and should I worry ? : :-- :Steve O'Hara-Smith | Directable Mirror Arrays No, it just means a block of memory being used for mutexes suffered from a copy-on-write (probably due to a fork()). The mutex code in the kernel deals with this situation automatically. It was just some old debugging cruft. Matthew Dillon
Re: Intel vs AMD DragonFly 2.11 parallel kernel build tests
Here is a fun statistic. For running a server 24x7 how many days do you have to run the Intel i7 vs the Phenom II to make up for the $100 difference in the price tag? Using a generous 65W for the AMD and 33W for the Intel, assuming a mostly idle server, and $0.25/kWh, you get $0.192/day savings with the i7. With a $100 difference in price that comes to 520 days. So if you are running a server 24x7 that is mostly idle, the Intel i7 pays for its higher price in 520 days (a bit over 1.5 years). If you are running under load the i7 will pay for its higher price tag more quickly. This is ignoring the lack of ECC issue with the i7 though, and a Xeon system will be more expensive. -Matt
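Spelled out, the arithmetic is (same numbers as the paragraph above):

```sh
awk 'BEGIN {
    watts_saved = 65 - 33                          # AMD draw minus Intel draw
    usd_per_day = watts_saved * 24 / 1000 * 0.25   # kWh/day at $0.25/kWh
    days        = 100 / usd_per_day                # $100 price difference
    printf "savings $%.3f/day, break-even %d days\n", usd_per_day, int(days)
}'
```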
Intel vs AMD DragonFly 2.11 parallel kernel build tests
es about data integrity will care. Other than that I would happily replace all my servers w/Sandybridge today. As it stands though I don't actually need a ton of horsepower on the servers. Our build boxes are the only things that really need the horsepower of a Sandybridge. The reduced power consumption is very provocative but it's a non-starter without ECC. And AMD has saved me a ton of money over the years with their AM2+/AM3 socket compatibility. I've gone through three major generational cycles on cpus with the same mobos just by buying a new cpu. Intel suffers from too much socketmania and it gets expensive when you have to replace the mobo, the memory, AND the cpu whenever you upgrade. So for the moment I am willing to wait for AMD to come out with something better. It doesn't have to beat Intel, but it does have to get within shouting distance and 30% ain't within shouting distance. Even factoring in a current higher-end AMD cpu we still aren't going to get more than another 7% improvement (23% is still too much). If AMD can get within 15% in the next year or so I'll happily stick with them on principle. But if they can't then I will grudgingly pay Intel's premium. (And, p.s. this is why I invest in Intel and not AMD. Intel has the monopoly and intentionally keeps AMD as a poor second cousin to keep the anti-trust hounds at bay. Sorry AMD, I love you but I can only support you in some ways :-( ) -Matt Matthew Dillon
Re: System on SSD
:Hi, : :I just bought an 60 GB SSD (OCZ Vertex 2). I want :to use about 20 GB for swapcache. But I think about :putting the system also on this SSD. To reduce writes :I want to disable history keeping and mount the pfs :with noatime. I also want to move /usr/src and :/usr/pkgsrc and the build directories to a normal HDD. : :Are there any issues to keep in mind? Any suggestion? : :Thanks a lot. : :Sven If you are going to run HAMMER on the SSD then you also have to manage the reblocking operation(s) on the PFSs. I would completely disable the 'recopy' function by commenting it out and I would adjust the reblocking parameters to spend 1 minute every 5 days instead of 5 minutes every 1 day. Everything else can be left as-is. You can also leave history enabled. nohistory will actually generate more write activity. Though you do have to be careful about the retention time due to the limited amount of space available on the SSD, so you might want to adjust the snapshot interval down from 60d to 10d or something like that. History is one of HAMMER's most important features, it is best to leave it on for all primary information storage. I usually turn it off only for things like /usr/obj. Most of these parameters are controlled via 'hammer viconfig '. You want to adjust the config for each mounted PFS and for '/'. -- In terms of layout you will want around a ~1G 'a' partition for /boot, which must be UFS, then I recommend a 32G 'b' swap partition and the remainder for a HAMMER 'd' partition. I usually leave ~4-8G unpartitioned (I setup a dummy 'e' partition that is 4-8G in size which is left unused), assuming a pristine SSD. -- In terms of putting the root filesystem on the SSD and not the HDD, I think it is reasonable to do and if you do you will almost certainly want to put any heavily modified subdirectories on the HDD. /usr/src, /usr/pkgsrc, possibly also /home and /usr/pkg, but it is up to you. 
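Concretely, the per-PFS config described above might end up looking something like this. This is only a sketch: the two-field period/runtime directive syntax is from hammer(8), and the values are just the ones suggested in this message (recopy commented out, reblocking throttled to 1 minute every 5 days, snapshot retention trimmed for the small SSD):

```
snapshots 1d 10d
prune     1d 5m
rebalance 1d 5m
reblock   5d 1m
#recopy   30d 10m
```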
Usually it is easier just to use the SSD as your 'a' boot + 'b' swap and put your root on your HDD. You can use the remaining space on the SSD as an emergency 'd' root. The reason it is generally better to put the normal root on the HDD is that you don't have to worry about fine tuning the space and you don't have to worry about write activity. You can still use swapcache to cache a great deal of the stuff on the HDD onto the SSD via the swap partition on the SSD. Booting w/root on the HDD will be slightly slower but not unduly so, and once stuff gets cached on the SSD things get pretty snappy. -- Finally, i386 vs x86-64. If you are running a 32 bit kernel the maximum (default) swap space is 32G. With a 64 bit kernel the maximum is 512G. swapcache works very nicely either way but if you intend to run a 64 bit kernel you might want to consider configuring a larger swap space and essentially dedicating the SSD to just boot + swap. It depends a lot on how much information needs to be cached for all of the system's nominal operations. With swapcache you will universally want to cache meta-data. Caching file data depends on the situation. -Matt Matthew Dillon
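As a sketch, the SSD layout suggested above (a ~1G 'a' boot partition, a 32G 'b' swap partition, the remainder HAMMER, and a few gigabytes left unused) might look like this in disklabel64 terms for a 60G drive. Sizes and fstype names here are illustrative only:

```
a:    1G    0    4.2BSD     # /boot, must be UFS
b:   32G    *    swap       # used by swapcache
d:   19G    *    HAMMER     # HAMMER filesystem
e:    4G    *    unused     # dummy partition, never written (wear leveling)
```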
Re: Git core dumped
On the other core dumps, I'm not sure what is going on but make sure the repo and the source tree is fully owned by the user (you or root) doing the git operations. I don't rebase often myself. One possibility is that the pthreads per-thread stack is too small, and the complexity of the operation is blowing it out. Git appears to set the stack size to 65536 bytes (the default is ~1MB). If so this would be a bug in git. DragonFly creates a stack guard at the bottom of every thread stack so it might be catching a condition that other OSs are not. -Matt
Re: Git core dumped
:Hello! :I can't download pkgsrc-repository via git. I got such message: :* [new branch] dragonfly-2010Q3 -> origin/dragonfly-2010Q3 : :May 4 17:24:10 kernel: pid 801 (git), uid 0: exited on signal 10 (core :dumped) :*** Signal 10 :Stop in /usr : :What's wrong? Check your /usr/lib/libpthread* and see if it is linked to the wrong threading library:

    ls -la /usr/lib/libpthread*

If it is linked to libc_r that is the problem. It needs to be linked to libthread_xu instead:

    cd /usr/lib
    ln -fs thread/libthread_xu.a libpthread.a
    ln -fs thread/libthread_xu.so libpthread.so.0

I will fix installworld. Was your system originally installed from a fairly old DragonFly? -Matt
Re: dntpd
:Is there a way in dntpd.conf to specify from which hosts dntpd will accept :time requests? : :Pierre dntpd is client-only (though I think it would be fairly easy to have it serve requests if someone wanted to add that). It pulls the time from the hosts specified in /etc/dntpd.conf. /etc/dntpd.conf is installed by default with {0,1,2}.pool.ntp.org. -Matt Matthew Dillon
Re: Buffer strategy message?
:I see this message on halt/reboot occasionally. Is it something I need to :worry about? : :Synching disks... :done :No strategy for buffer at 0xffe056aabf00 :: 0xffe0840876a8: type VBAD, sysrefs 1, writecount 0, holdcnt 0, :Uptime: 12h9m53s :the operating system has halted :\ : :Tim It's 'probably' ok, but it isn't desirable. It means one part of the system detached while another part of the system was still using it. -Matt
DragonFly 2.10 RELEASED!
Hello everyone! 2.10 has finally been released. Our mirrors are still pulling the hot press. Our main mirror site, avalon, has the goods if your favorite mirror doesn't yet. Here's a quick smattering of links:

    http://www.dragonflybsd.org/
    http://www.dragonflybsd.org/release210/
    http://www.dragonflybsd.org/mirrors/
    http://avalon.dragonflybsd.org/iso-images/

Both 32-bit and 64-bit USB and ISO images are available. My recommendation is to use the 64-bit usb image or the 64-bit gui usb image (if your machine can handle 64-bits). That is, dfly-x86_64-gui-2.10.1_REL.img.bz2. Note that linux emulation only works w/ the 32 bit image so if you need linux emulation you have to go with 32 bits. The gui images contain a full X environment and the git repos for /usr/src and /usr/pkgsrc, and are recommended if you have the bandwidth to pull one down. Each is approximately 1.2GB. -- This release contains many features, see the release page for an exhaustive list. The big ticket items are significantly better MP performance and significantly better filesystem performance with the AHCI and SILI drivers. -Matt Matthew Dillon
2.10 Release scheduled for Monday.
My weekend schedule is too crowded so we will be doing the official release Monday evening. HEAD is now 2.11 and we have a 2.10 release branch. 2011Q1 packages have been built though some work is still ongoing. Preliminary nrelease builds have succeeded and testing continues. We couldn't quite fit a moderate unrolling of the global VM system token into the release but Venkatesh will be working on it with a vengeance in HEAD after the release. Except for the VM subsystem, all other critical paths are MPSAFE. The AHCI/CAM driver enhancements have made it into the release, so significant improvements in concurrent random disk I/O for AHCI-attached devices should be noticeable. -Matt Matthew Dillon
Re: Hammer deduplication needs for RAM size
:Hi all, : :can someone compare/describe need of RAM size by deduplication in :Hammer? There's something interesting about deduplication in ZFS :http://openindiana.org/pipermail/openindiana-discuss/2011-April/003574.html : :Thx The RAM is basically needed to store matching CRCs. The on-line dedup uses a limited fixed-sized hash table to remember CRCs, designed to match recently read data with future written data (e.g. 'cp'). The off-line dedup (when you run 'hammer dedup ...' or 'hammer dedup-simulate ...') will keep track of ALL data CRCs when it scans the filesystem B-Tree. It will happily use lots of swap space if it comes down to it, which is probably a bug. But that's how it works now. Actual file data is not persistently cached in memory. It is read only when the dedup locates a potential match and sticks around in a limited cache before getting thrown away, and will be re-read as needed. -Matt Matthew Dillon
Re: 2.10 Release schedule - Release will be April 23rd 2011
The only issue w/ using dedup is you may need to upgrade the hammer filesystem to at least version 5. It's best that all mirrors be running the same version. If you update to version 6 and use mirroring then both sides have to be version 6 for sure because the directory hash algorithm changes. That's the only issue w/ regards to upgrading. -Matt Matthew Dillon
Recent concurrency improvements in the AHCI driver and CAM need testing
I've pushed some serious changes to the AHCI SATA driver and CAM. One fixes issues where the tags were not being utilized to their fullest extent... well, really they weren't being utilized at all. I'm not sure how I missed the problem before, but it is fixed now. The second ensures that read requests cannot saturate all available tags and cause writes to stall, and vice versa, and also separates out the read and write BIO streams and treats them as separate entities, which means that reads can continue to be dispatched even if writes saturate the drive's cache and writes can continue to be dispatched even if concurrent read(s) would otherwise eat all available tags. The reason the read/write saturation fixes are important is because writes are usually completed instantly since they just go to the drive cache, so even if reads are saturated there's no reason not to push writes to the drive. Plus when the HD's cache becomes saturated writes no longer complete instantly and would prevent reads from being dispatched if all the tags were used to hold the writes. -- With these fixes I am getting much better numbers with concurrency tests: I now get around 37000 IOPS doing random 512-byte sector reads with a Crucial C300 SSD, versus ~8000 or so before the fix. And I now get around ~365 IOPS with the same test on a hard drive, versus ~150 IOPS before (remember these are random reads!). blogbench also appears to have much better write/read parallelism against the swapcache with the SSD/HD combo. Memory caches blow out at around blog #1300 on my test boxes. With the changes blogbench write performance is maintained through blog #1600 or so, without the changes it drops off at #1300. With the changes the swapcache SSD is pushing ~1400 IOPS or so satisfying random read requests. Without the changes the swapcache SSD is only pushing ~130 IOPS. With the changes blogbench is able to maintain a ~6 article read rate at the end of the test. 
Without the changes the read rate is more like ~1 at the end of the test. At this stage swapcache has cached a significant chunk of the data in the SSD so the I/O activity is mixed random SSD and HD reads. -- Ok, so I feel a bit sheepish that I missed the fact that the AHCI driver wasn't utilizing its tags properly before. The difference in performance is phenomenal. Maybe we will start winning some of those I/O benchmark tests now. -Matt Matthew Dillon
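The tag-reservation idea in this message can be sketched as pseudocode. The reserve sizes and names are invented; this is just the shape of the policy, not the actual CAM/AHCI code:

```
NTAGS   = 32          # NCQ tags the device exposes
RD_RESV = 4           # tags that writes may never consume
WR_RESV = 4           # tags that reads may never consume

dispatch(bio):
    if bio is a read:
        if read_tags_in_use < NTAGS - WR_RESV: issue(bio)
        else: queue bio on the read stream
    else:
        if write_tags_in_use < NTAGS - RD_RESV: issue(bio)
        else: queue bio on the write stream
    # reads keep flowing even when writes saturate the drive cache,
    # and vice versa, because each side has a guaranteed reserve
```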
2.10 Release schedule - Release will be April 23rd 2011
Saturday Apr 9 - We branch Saturday Apr 23 - We release 2.10 This gives us two weeks to stabilize the release and build 2011Q1 packages. Developers need to pounce on showstopper bugs such as rebooting issues, mbuf leaks, panics, and so forth. There will be a ton of features in this release, including major compiler toolchain updates, better acpi, better swapcache, PF upgrade, HAMMER live dedup, and many many other goodies. SMP has progressed significantly in this release. All nominal kernel paths are MPSAFE. The VM system is still using a global token and is the only real bottleneck left. -Matt Matthew Dillon
Improvements in swapcache's ability to cache data using HAMMER double_buffer mode.
Normally data is only cached via the file vnode which means the cache is blown away when the vnode gets cycled out of the vnode cache. With kern.maxvnodes around ~100,000 on 32 bit systems and ~400,000 on 64 bit systems any filesystem which exceeds the limit will cause vnode recycling to occur. Nearly all filesystems these days exceed these limits, particularly on 32 bit systems. And on 64-bit systems files are often not large enough to utilize available memory before hitting the vnode limit and causing the data to be thrown away despite there being plenty of free ram. It is now possible to bypass these limitations in DragonFly master by enabling both the HAMMER double_buffer feature (vfs.hammer.double_buffer=1) AND the swapcache data caching feature (vm.swapcache.data_enable=1). See 'man swapcache' for additional information on swapcache. When both features are enabled together swapcache will cache file data via HAMMER's block device instead of via individual file vnodes, making the swapcache'd data immune to vnode recyclement. Swapcache is thus able to cache the data for potentially millions of files up to 75% of available swap (normally configured up to 32G on 32-bit systems and up to 512G on 64-bit systems). -- Now add the fact that SATA-III is now widely available on motherboards and SATA-III SSDs are now in mass production. Intel's 510 series, OCZ's Vertex III, and Crucial's C300 and M4 series are capable of delivering 300-500 MBytes/sec reading and 200-400 MBytes/sec writing from a single device. Crucial's C300 series is very cost effective w/64GB at SATA-III speeds for $160. Compare this to the measly 2-5MBytes/sec a hard drive can do in a random seek/read environment. We're talking 100x the performance already with just a single SSD swap device. With swapcache this means being able to shrink the cost and the size of what we might consider to be a 'server' by a factor of three or more. 
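The two knobs named above can be made persistent via /etc/sysctl.conf. A minimal sketch (the meta_enable line is an addition here, since swapcache(8) recommends caching meta-data universally):

```
vfs.hammer.double_buffer=1
vm.swapcache.data_enable=1
vm.swapcache.meta_enable=1
```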
-- The only downside to the new feature is that data is double-buffered in ram. That is, file data is cached via the block device AND also via the file vnode, and there is really no way to get around this other than to expire one of the copies of the cached data more quickly (which we try to do). I still consider the feature a bit experimental due to these inefficiencies. We are definitely on the right track and regardless of the memory inefficiency the HD accesses go away for real when swapcache SSD can take the load instead. On one of our older servers I can now grep through 950,000 files (~15GB worth of file data) at ~2000-4000 files per second pulling 40-50 MBytes/sec from the SSD and *zero* activity on the HD. That is a big deal that only a big whopping RAID system or a ton of ram could compete with prior to the advent of SSDs... all from a little $700 box with an older $100 SSD in it. -Matt Matthew Dillon
Re: ACPI based interrupt routing and new ACPI code ready for testing
:Hi Sephe. Great work :) : :Seems to boot fine on my x86_64 UP box. Anything you want tested with it :running? Verbose dmesg: http://leaf.dragonflybsd.org/~mh/vdmesg_acpi_randy : :However, it makes my graphics card (an ati 9200 agp) lose some speed. It :usually gets ~1600 fps with glxgears and with 2) enabled it drops to ~20 :fps. From what I can see it gives no error about this problem. : ://Max What about the rest of the system? Run some simple cpu benchmarks. If those are slow also this could be an indication of an interrupt storm (and possibly even a SMI storm, similar to ruse39's issue). -Matt Matthew Dillon
Dragonflybsd.org IP space renumbered
The IP space for the primary dragonflybsd.org network has been reworked, please report any problems! -Matt Matthew Dillon
Re: Home stretch on new network - if_bridge looking better
:On 02/24/11 11:50, Matthew Dillon wrote: : :> http://apollo-vc.backplane.com/DFlyMisc/bridge1.txt :> http://apollo-vc.backplane.com/DFlyMisc/bridge2.txt : :So - reading over this - is it correct that the setup is roughly like: : :- assign a local interface (lan0) to a network :- add this network to the bridge :- create openvpn 'bridged' mode tunnels :- add these to the bridge In the case of my current setup, lan0, uverse0, comcast0, and aerio0 are all physical ethernet ports. lan0 is the LAN, and the other three connect to the three different WAN services I have. Only lan0 and the tunnels (tap0, tap1, tap2) are associated with the bridge. The other physical ethernet ports (uverse0, comcast0, and aerio0) each have a different IP and a different default route and I use IPFW to associate packets sourced from the IP to the default route for each port. Currently uverse0 and comcast0 are both dynamic while aerio0 is a static IP (the old DragonFly net /26). The OpenVPN tunnels are built using these IPs and back the tap devices. The tap devices are then associated with the bridge and the main LAN. The tap devices themselves, and the bridge, have *NO* IPs associated with them. All the local IP spaces are on lan0, including some local NATted spaces (10.x.x.x). The bridge code and the ARP code deal with the inconsistencies and provide a consistent ARP for the bridge members. Also, not shown here, is that I have a massive set of PF rules and ALTQs on each of the TAP interfaces (tap0, tap1, and tap2). In particular I'm running the ALTQs on the TAP devices with fair-share scheduling and tuned to the bandwidth of each WAN so ping times will be low no matter what topology the bridge is using. (Of course I can't do fair-share scheduling on the WAN ports, uverse0, comcast0, and aerio0, because the only thing running over them is the OpenVPN UDP packets and it can't dig into them to see what they represent). 
:so the L2 bridge / STP will 'map' according to the state of :the ethernet bridging, which in turn relates to the openvpn tunnel :state? Exactly. The if_bridge module does its own 'pinging' using STP config packets so it can detect when a link goes down. OpenVPN itself also has a ping/restart feature. I use both. OpenVPNs internal keepalive auto-restarts openvpn on failure, and the if_bridge's pinging is used to detect actual good flow across the link and controls the failover. :Without diverging any security sensitive whatnot, :Is the VPN tunnel created to the ISP or to say, the colo space? :(I'd assume the latter) Yes, a colo space that the DragonFly project controls, provided by Peter Avalos. OpenVPN itself is running encrypted UDP packets. Very easy to set up. The colo has around 10 MBytes/sec of bandwidth which is plenty for our project. :Have been working on my own openvpn (routing mode) fun to a pair :of VPS's as well over the last few days so this is of interest :D : :also - I note in the "bridge2.txt" file you 'cd /usr/pkg/etc/openvpn' :before running - is this so openvpn can find the config files? Yes, that's actually a bit broken. I've since changed it to put a 'cd' directive in the config file itself and then just run openvpn with the full path to the config file. Openvpn has problems restarting itself if you don't do this (it winds up getting confused and not being able to find the key files if it restarts). :if so - to note, you can add a 'cd /path/to/configdir' within the :config files.. Yah, found that :-) :also - assuming you have statics on both end of the tunnels - :why did you choose openvpn ethernet bridging over say IP layer + ipsec? :(or even openvpn 'routing' mode) with something like OSPF or similar : :and - do you have hw crypto cards on either endpoint? 
I originally attempted to route a subnet but the problem is we have a full class C at the colo, but DragonFly isn't really designed to operate with two different subnets where one subnet overlaps the other. Ethernet switching turned out to be the better solution. The colocated box itself is ON the class C, it doesn't have a separate IP outside the class C space. So there was no easy way to swing a routed network. I wouldn't even consider something as complex as OSPF for a simple setup like this, even with a routed solution. :(my soekris 486 gets a little bogged down by the crypto, which is why I ask) : :ok enough questions ;) : :its definitely fun trying to convert consumer internet into a 'real :connection' :D : :- Chris : :(from a gigabit LAN piggybacked on a sometimes 56k wifi link) OpenVPN has options to run in the clear after authentication is
Re: Home stretch on new network - if_bridge looking better
: :Great news! : :Is there any chance to support more features in the bridge code? RSTP, :span port , filtering based on mac address . : :Godot RSTP would be doable as a GSOC project, I think it would be very easy to implement. Perhaps almost too easy but if someone were to do it I would require significant testing to make sure the protocol operates properly. I have to move onto other things myself. (RSTP is STP with a faster recovery time in case of link failure. STP takes about 30 seconds to transition to a new topology while RSTP takes about 10 seconds). The span port is theoretically operational but it has NOT been tested in any way, so something might blow if you try to use it. This would be more of a bug-fix type of thing, not worthy of a GSOC project. MAC based filtering would be worthy of a GSOC project. We don't have it now but IPFW at least already has hooks for ethernet-level firewalling. Doing it w/PF would be a lot more difficult as PF is designed as a routed packet filter (routing vs switching). -Matt
Home stretch on new network - if_bridge looking better
I'm in the home stretch of finishing up the new DragonFly network! It's been pretty unstable the last week or so as I struggled first with the (now failed) attempt at using an at&t static block with U-Verse and then gave up on that and started working on running a VPN over a dynamic-IP based at&t U-Verse + comcast internet. I wanted bonding with failover. Most of my struggles with U-Verse were in dealing with the stateful firewall at&t has that cannot be turned off, even for the static IP block. It had serious issues dealing with many concurrent connections and would drop connections randomly (it would send a RST!). The VPN bypasses the whole mess. The last few days have been spent essentially rewriting half of if_bridge so it would work properly, and testing it while I am still triple-homed (DSL, U-Verse, and ComCast). Well, it caused a lot of havoc on my network while I was beating it into shape and that's putting it mildly! But I think I now have if_bridge and openvpn and my ipfw and PF rules smacked into shape. I am going to implement line bonding in if_bridge today (on top of the spanning tree and failover which now works) and track down one or two remaining ARP issues and then I'll call it done. The basic setup is as shown below: http://apollo-vc.backplane.com/DFlyMisc/bridge1.txt http://apollo-vc.backplane.com/DFlyMisc/bridge2.txt + There are PF rules and ALTQs on each TAP interface to manage its outgoing bandwidth and keep network latencies down (on both sides of the VC). + IPFW forwarding (fwd) rules to manage multiple default routes based on the source IP. The spanning tree appears to be working properly with the 2x2 and the 3x3 'real' configuration I'm testing it with. Once I get line bonding working I expect my downlink to achieve ~30MBits+ and my uplink will be 4.8MBits. I'm seriously considering keeping both U-Verse and ComCast and just paring the service levels down a little (top tier isn't needed).
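The per-TAP shaping and source-based default routes described above could look roughly like the following. This is only a sketch; the interface names, bandwidth figure, subnets, and gateway addresses are made-up placeholders, not the actual rules:

```conf
# pf.conf sketch: ALTQ on a TAP interface to cap its outgoing bandwidth
altq on tap0 cbq bandwidth 4Mb queue { q_def }
queue q_def bandwidth 100% cbq(default)
pass out on tap0 keep state queue q_def

# ipfw sketch: choose a default route per source subnet (hypothetical IPs)
#   ipfw add 100 fwd 10.0.1.1 ip from 10.0.1.0/24 to any out
#   ipfw add 110 fwd 10.0.2.1 ip from 10.0.2.0/24 to any out
```

The idea is that PF/ALTQ keeps each tunnel's uplink from saturating (holding latency down), while the ipfw fwd rules let hosts on different source subnets exit through different uplinks.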
The poor old DSL with its 600KBit uplink is going to hit the trash heap. It might have been slow, but that ISP served my old /26 static block fairly well for many years. -Matt Matthew Dillon
Re: Can't mount my hammer filesystem
: :Thanks for your reply. : :I don't remember if I installed it on a disklabel or a slice. I will :be able to know what I did once I get the usb flash disk with the :system and look at the fstab. : :Hopefully, I didn't lose data because I did several backups before :-) Ok, if the data is important we *can* recover it, so don't throw it away, but it might require you making the whole image available to me. I would need to add another option to the hammer recover directive to supply the missing info (if the volume header is truly blown away) and experiment a bit to figure out what the offset is in the image. I've been meaning to add the option for a while now but that isn't the real problem. The real problem is that the volume header contains a single piece of info, the data zone offset relative to the base of the hammer filesystem, and it's a bit non-trivial to 'guess' it. -Matt
Re: Hammer recover question
:This was a 1.8.2 system. Having a 1.9 system handy, I plugged the drive :(300GB IDE) into it and tried hammer recover for the first time to see what :I could save. The good news is that it's recovering a ton of data! The bad :news is that it's taking an incredible amount of time. So far it's been :running 24 hours. Is that to be expected? The bad disk had approximately :50GB on it, as reported by the df utility, but I don't know how much of that :is snapshots. : :Tim It scans the entire disk linearly, so however long that takes is how long recover takes to run. -Matt
Dragonfly network changes - U-Verse almost a complete failure
Hahaha... ok, well, I spoke too soon. U-Verse is a piece of crap. That's my conclusion. Here's some detail: * The physical infrastructure is fine, as long as you make sure there's no packet loss. To make sure you have to upload and download continuously at the same time and look for glitching and stalls. * The AT&T iNID/RG router is a piece of crap, and it's impossible to replace it with anything else because it also takes the VDSL2 from the street. The iNID/RG router basically has a fully stateful firewall in it WHICH CANNOT BE TURNED OFF for either static or dynamic IPs. There are lots of instructions on how to set up a static IP and how to 'open' the firewall to let everything through. All lies. No matter what you do, the firewall's stateful tracking is turned on even for your static block. It tries to track every single 'connection' running through it even when the Firewall has been turned 'off' in the config. Worse, it is buggy as hell. It drops connections (as in sends a TCP RESET!!! to either my end or the remote end) ALL THE TIME. It loses packets. It drops critical ICMP packets and gets confused about normal ICMP packets. It gets confused when lots of connections are opened all at once (for example, running a simple iPad video app such as CrunchyRoll)... or running an actual business with servers. It can't handle third-party NATs... It can BARELY handle its own NAT but even its own wireless/NAT (bypassing all my stuff and tying my iPad directly into the iNID/RG over the RG's wireless) drops connections noticeably. On top of that the uverse router/firewall uses MAC-based security and only allows one IP assignment per MAC. This means that your 'network' cannot be routed, it can only be bridged, and you can't mix private and public IPs on the same MAC (which is a very common setup). If the uverse router/firewall gets packets from the same IP but different MACs, it blows up... it drops connections, it refuses to route packets, it gets confused.
I spent a long time with PF and if_bridge and 'fixed' the MAC issue with filters, and verified that only the correct MACs were getting through, but I *STILL* get connection drops for no reason. -- Ok, so what does work? Drilling a PPTP through to a provider works. That is what I finally did. I drilled PPTP through the U-Verse to my old provider, so my *original* IP block from my old ISP (who I still have the DSL line with as a backup) is now running through U-Verse. Let me repeat that... running my iPad test through my own NAT and wireless network through the PPTP link to bypass the U-Verse router crap and to my old provider, who has LESS bandwidth than the U-Verse link I'm drilling through, works BETTER than running the iPad test directly on U-Verse through the U-Verse iNID/RG/wireless (bypassing all my own gear). That's it. That's all that works. Even if you were to get a normal u-verse link with dynamic IP and no static IP you are still SEVERELY restricted in what you can do. Your own NAT servers will simply not work well. You would HAVE to use AT&T's NAT & RG/wireless. You would HAVE to be on a simple bridged network with no other firewall beyond the AT&T iNID/RG. You would HAVE to have just one IP assignment for each machine. In other words, only the simplest of network configurations will work. Nothing else will work very well. -- It isn't ideal, my old ISP can't push 2 MBytes/sec downlink to me through the PPTP link. But neither does it drop connections. And my uplink speed is still good which is the main thing I care about for the DragonFly network. I'm going to stick with the U-Verse so I can get rid of the much costlier COMCAST. However, I am going to cancel the static IP block and stick with drilling the PPTP through to my old ISP (which I'm keeping for the backup DSL line anyway). Sigh. You'd think AT&T would be smart enough to do this properly, but after 5 years of trying they are still clueless about IP networks.
Maybe in another year or two they will fix their stuff. Or not. -Matt
Re: Can't mount my hammer filesystem
:Hi, : :So I decided to format the master drive to install the system on and :then get back my data from the slave. But, that's not cool, when I try :to mount I get this message "Not a valid HAMMER filesystem". : :Did I destroy the filesystem by installing the bootblock on both disks? :Can I get my data back? How? : :I tried some commands unsuccessfully : : :# hammer -f /dev/serno/S1PZJ1DQ508109.s4 recover /media/dd2/ :hammer: setup_volume: /dev/serno/S1PZJ1DQ508109.s4: Header does not :indicate that this is a hammer volume s4 ? Not s4d ? Did you accidentally install HAMMER directly on a slice and not install it in a disklabeled partition? Installing boot blocks would have wiped the header if you installed HAMMER in a slice instead of a partition. The hammer recover code needs information from the volume header at the moment. That's the only piece of the disk it needs to be able to do a recovery scan. It's a design bug that will require a media format change to fix. -Matt
Dragonfly network changes
Various DragonFly machines are now running on a much faster network thanks to AT&T U-Verse, and despite the utterly horrid disaster that at&t's little router box is, I am slowly managing to thrash it into shape. Our main web site is now on the new network (www, gitweb, wiki, and bugs). http://www.dragonflybsd.org/ Developer access to leaf via the new network will work if you use 'leaf-uv.dragonflybsd.org'. leaf.dragonflybsd.org will continue to use the old network for a while. Our nameserver topology has been revamped a bit to remove old cruft and dual-home the networks. I will not be renumbering until I can get the reverse DNS operational (lots of phone tag with AT&T), plus give the new network a good burn-in. -- For those interested this is AT&T Business U-Verse. Downlink speed is around 16 MBits and uplink speed is around 2 MBits with their highest-grade service. My comcast cable internet (which I will be getting rid of soon), also the highest grade service, has a faster downlink speed of around 30 MBits, but around the same uplink speed of 2 MBits. Of course, I only really care about uplink speed here, since I'm serving data out. However, the AT&T service so far does seem a bit more consistent and I will test it vs my comcast internet (before I get rid of it) with hulu et al. -Matt Matthew Dillon
New ps feature -R
master now has a new feature to /bin/ps, -R, which sub-sorts by parent/child association and follows the chain in the output, indenting the command to make it obvious. Sample usage: ps axlR This is a pretty cool feature I think. I had written something similar 15 years ago and really began to miss it once I started doing parallel pkgsrc bulkbuild tests. -Matt Matthew Dillon
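For those curious what the sub-sort amounts to, here is a tiny illustrative sh sketch. It uses a hypothetical static (pid, ppid, command) list rather than live ps output, and simply prints each command indented under its parent the way -R's output looks:

```shell
#!/bin/sh
# Hypothetical process table standing in for live ps output: pid ppid command
procs="1 0 init
100 1 sshd
200 100 sh
300 200 make
101 1 cron"

# show <ppid> <indent>: print every child of <ppid>, then recurse into it
show() {
    echo "$procs" | while read pid ppid cmd; do
        if [ "$ppid" = "$1" ]; then
            printf '%s%s\n' "$2" "$cmd"
            show "$pid" "$2  "
        fi
    done
}

show 0 ""
# prints:
# init
#   sshd
#     sh
#       make
#   cron
```

The real -R flag does this on the live process list after the normal sort, which is what makes the parent/child chains obvious in `ps axlR` output.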
Re: How to tell where/how space is used?
Not to mention the fact that I just upped the minimum UNDO/REDO fifo size to 500M to help deal with an overflow issue. I may reduce the minimum to 300MB, but 100MB is just too small since most systems these days have at least 1GB of ram and on-media fifo use under stress tends to be related to the amount of ram on the box. HAMMER is not designed to be run on 2G partitions. When it says that 50G is the minimum it really means it. You can get away with smaller sizes with care, e.g. I run HAMMER just fine on a 40G SSD, but I wouldn't go much below that. For small partitions, UFS is just fine. UFS's fsck runs in just a few seconds on filesystems that small. -Matt
Re: ad1 renumbered to ad0
:I just rebooted (after running into the "alt-ctrl-F1 hangs with beeper on" bug :again), went into the BIOS setup, and enabled audio and SATA (because my :friend is talking about getting a big SATA disk). On reaching cryptdisks, it :said "device ad1s1e is not a luks device". I checked /dev/ad* and found that :it's now ad0. Some months ago, when I upgraded the kernel on the laptop, ad0 :changed to ad1. What's going on? : :Pierre Device probe order can change due to BIOS adjustments, which is why you should always reference your disk drives by their serial number instead of by the device name & unit number. ls /dev/serno dmesg | less   <--- look for the device, it should also print out the serial number nearby. For example, on one of my machines the swap partition is: /dev/serno/CVGB951400U5040GGN.s1b -Matt Matthew Dillon
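In practice that means putting the serno paths in /etc/fstab so a probe-order change can never break the mounts. A sketch, reusing the serial number from the example above; the partition letters and mount points are hypothetical:

```conf
# /etc/fstab sketch using serial-number device paths instead of adN names
/dev/serno/CVGB951400U5040GGN.s1d   /      hammer  rw   1 1
/dev/serno/CVGB951400U5040GGN.s1b   none   swap    sw   0 0
```

Since /dev/serno entries are keyed to the drive itself, the same lines work whether the kernel probes the disk as ad0 or ad1.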
Re: avalon down
I power cycled avalon. It crashed due to a known bug which I thought I had fixed but I missed a case. I'm testing a new fix now. -Matt Matthew Dillon
Re: hyperthreaded?
::The guy who gave me the box says he has another one like it, but one is ::hyperthreaded and the other isn't. Here's the beginning of dmesg. Is it ::hyperthreaded, and if so, should I compile a kernel to take advantage of it? :: ::CPU: Intel(R) Pentium(R) 4 CPU 2.80GHz (2793.02-MHz 686-class CPU) :: Logical CPUs per core: 2 : :Yes. It has 2 real cpus and 2 hyper-threads per real cpu (4 total). : :Definitely worth running an SMP kernel. Even things like the atom :with one real cpu and 2 hyperthreads are worth running an SMP kernel :on. : Oops, I've been corrected. That baby has 1 core and 2 hyperthreads. In any case, it is worth running an SMP kernel on it. -Matt Matthew Dillon
Re: hyperthreaded?
:The guy who gave me the box says he has another one like it, but one is :hyperthreaded and the other isn't. Here's the beginning of dmesg. Is it :hyperthreaded, and if so, should I compile a kernel to take advantage of it? : :CPU: Intel(R) Pentium(R) 4 CPU 2.80GHz (2793.02-MHz 686-class CPU) : Logical CPUs per core: 2 Yes. It has 2 real cpus and 2 hyper-threads per real cpu (4 total). Definitely worth running an SMP kernel. Even things like the atom with one real cpu and 2 hyperthreads are worth running an SMP kernel on. -Matt
Re: System has insufficient buffers to rebalance the tree
:On Sunday 30 January 2011 22:43:01 Matthew Dillon wrote: :> :I still get this warning. Is it ever going to be fixed? :> : :> :Pierre :> :> I could remove the kprintf I guess and have it just reported by :> hammer cleanup. : :Is the number of buffers something I can change, or is it determined by the :kernel based on memory size? : :Pierre It's based on memory size and while it is possible to change it the problem is that your system doesn't have enough memory for what the hammer rebalance code really needs to operate. Part of this is that the rebalance algorithm simply requires a huge number of concurrent buffers. Ultimately the fix would be to change the algorithm but it isn't easy to reformulate and I don't really want to break it open because one mistake in that code could blow up the filesystem. -Matt Matthew Dillon
Re: System has insufficient buffers to rebalance the tree
:I still get this warning. Is it ever going to be fixed? : :Pierre I could remove the kprintf I guess and have it just reported by hammer cleanup. -Matt Matthew Dillon
Re: 2.8.3 coming?
:Hi, :I had read in november that there were plans for a 2.8.3 release coming. :I plan to install a server next week with 2.8.2, but i will delay the install if :2.8.3 is coming in a few more weeks. :Does someone have an estimate of the release date? :Thanks :Damian : :-- :http://dfbsd.trackbsd.org.ar It's not looking like it. There just isn't enough time to figure out what more needs to be merged in from master to make a 2.8.3 release, versus just compiling whatever is the latest on the 2.8.x branch. -Matt