Re: to configure HammerFS

2012-09-08 Thread Matthew Dillon

:In order to gain performance in some peculiar cases:
:To what extent HammerFS is able to be tuned?
:Are there any configurable options in HammerFS?
:How to access them?

Most of the information will be in 'man hammer'.  There are numerous
sysctls under vfs.hammer (sysctl vfs.hammer) but they are already tuned.
The only real adjustments you might want to make would be if you also
were running swapcache with a SSD.

Matthew Dillon

Re: modifying nullfs

2012-09-07 Thread Matthew Dillon
Most nullfs VOP's are going to go directly to the underlying
filesystem and NOT run through nullfs itself.

In DragonFly we don't have to replicate the vnode infrastructure for
directory nodes in nullfs because we track { mp, vnode } instead of
just { vnode }.  nullfs is basically only used to track the
mount structure.  Our namecache code handles mount points via the
chaining within the mount structures and NOT via chaining within
directory vnodes.

In other words, in DragonFly a nullfs mount is just as good as the
underlying filesystem mount, with no added overhead to use it.

Matthew Dillon

Re: modifying nullfs

2012-09-07 Thread Matthew Dillon
:honestly, i think about some kind of abstraction layer over HammerFS, 
:that's why a stackable FS impressed me.

Stackable FS's are always interesting, but they are also full of
pitfalls.

The NFS server implementation is a good example.  When you export
a filesystem via NFS the NFS client has to talk to the NFS server
and that's essentially creating a stacking layer on top of the
original filesystem being exported by the server.

There are three primary problems with any stacking filesystem:

* Coherency if someone goes and does something to a file or directory
  (like remove or rename it) via the underlying filesystem.  The
  stacked filesystem doesn't know about it.

* Tracking the vnode associations is particularly difficult because
  you can't just keep the pairs of vnodes (the overlayed vnode and
  the underlying vnode) referenced all the time.  There are too many.
  In particular, even just leaving the underlying vnodes referenced
  creates a real problem for the kernel's vnode cache management code
  because it can only hold a limited number of vnodes.

  (The NFS server handles this by not keeping a permanent ref on the
  vnodes requested by clients.  Instead it can force clients to
  re-lookup the filename and re-establish any vnode association it
  had removed from the cache.  It works for most cases but does not
  work well for the open-descriptor-after-unlinking case and can cause
  serious confusion when multiple NFS clients rename the same file or
  directory at the same time.)

* And overhead.  When you have a stacked filesystem (such as a NFS
  server), versus a filesystem alias (such as NULLFS), the stacked
  filesystem has considerable kernel memory overhead to track the
  stacking which creates a memory management issue if you try to
  stack very large filesystems.

Another example of a stacked filesystem would be the UFS union mount
(unionfs) in FreeBSD.  It was removed from DragonFly and has had
endemic problems in FreeBSD for, oh, ever since it was written.  It
depended heavily on the 'VOP_WHITEOUT' feature which is something only
UFS really supports, and not very well at that because directory-entry
whiteouts can't really be backed up.  The union filesystem tried to
stack two real filesystems and present essentially a 'writable snapshot'
as the mount.

So it's a very interesting area but complex and difficult to implement
properly under any circumstances.

Matthew Dillon

Re: Errors on SSD

2012-08-26 Thread Matthew Dillon
Also note that you may be able to get more detailed information
on the problem using smartctl:

pkg_radd smartmontools

smartctl -d sat -a /dev/daXXX   (where daXXX is the correct device for
 the SSD).

In particular look at the wear indicator, which is typically attribute 233,
and available reserved space, which is typically attribute 232 (but it
depends on the SSD).
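To illustrate pulling just those two attributes out, here is a small awk filter.  The attribute table below is made up to mimic smartmontools' layout (the real column order can vary by version and drive); on a live system you would pipe `smartctl -d sat -a /dev/daXXX` into the awk instead of the sample function.

```shell
# Illustrative only: extract attributes 233 (wear) and 232 (available
# reserved space) from smartctl-style output.  The sample table is a
# stand-in for real 'smartctl -a' output.
sample_smart() {
    cat <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     RAW_VALUE
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail 0
233 Media_Wearout_Indicator 0x0032   097   097   000    Old_age  0
EOF
}
# print attribute id, name, and normalized VALUE column
sample_smart | awk '$1 == 232 || $1 == 233 { print $1, $2, $4 }'
```

A normalized VALUE near 100 means little wear; drives typically fail the attribute when it decays to the THRESH column.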


Re: HAMMER2 progress report - 07-Aug-2012

2012-08-16 Thread Matthew Dillon

:On Wed, Aug 8, 2012 at 10:44 AM, Matthew Dillon wrote:
:   Full graph spanning tree protocol so there can be loops, multiple ways
:   to get to the same target, and so on and so forth.  The SPANs propagate
:   the best N (one or two) paths for each mount.
:Can this be tested in development?
:What commands should be used? :-)

It's a bit opaque but basically you create the /etc/hammer2 directory
infrastructure and then setup some hammer2 filesystems.

hammer2 rsainit
mkdir /etc/hammer2/remote
cd /etc/hammer2/remote
(create or IPADDR.none files)

Essentially you copy the key from the source machine
into /etc/hammer2/remote/ on the target machine.
You can then connect from the source machine to the target
machine as described below.

Normally you also create a localhost link and, for testing purposes,
it isn't root protected and you can tell it not to use encryption:

touch /etc/hammer2/remote/

In order to be able to connect to the service daemon and have the
daemon be able to connect to other service daemons you need to
set up encryption on the machine:


# example disk by serial number
set disk = SERNO.s1d
newfs_hammer2 /dev/serno/$disk
mount /dev/serno/$disk@ROOT /mnt

cd /mnt
hammer2 pfs-create TEST1
set uuid = `hammer2 pfs-clid TEST1`
echo cluster uuid $uuid

hammer2 -u $uuid pfs-create TEST$i
mkdir -p /test$i
mount /dev/serno/$disk@TEST$i /test$i

The mounts will start up a hammer2 service daemon which connects
to each mount.  You can kill the daemon and start it manually and
it will reconnect automatically.  The service daemon runs in the
background.  To see all the debug output kill it and start it in
the foreground with -d:

killall hammer2
hammer2 -d service

I usually do this on each test machine.  Then I connect the service
daemons to each other in various ways.

# hammer2 -s /mnt status
# hammer2 -s /mnt connect
# hammer2 -s /mnt status
1 ---.00

(you can 'disconnect' a span as well.  The spans will attempt to
reconnect every 5 seconds forever while in the table).

(the connection table is on the media, thus persistent)

You can also connect a 'shell' to a running service daemon, as long
as /etc/hammer2/remote allows it:

hammer2 shell
(do various commands)

Only the 'tree' command is really useful here, though you can
also manually connect spans.  You can't kill them, however.

In any case, that's the gist of it for the moment.  The 'tree' command
from the shell gives you a view of the spans from the point of view of
whatever machine you are connecting to.

Remember that the HAMMER2 filesystem itself is not production ready...
it can't free blocks, for example (kinda like a PROM atm).  And, of
course, there are no protocols running on the links yet.  I haven't
gotten routing working yet.

The core is fairly advanced... encryption, messaging transactions,
notification of media config changes, automatic reconnect on connection
failure, etc.

Matthew Dillon

Re: fails to mount root

2012-08-13 Thread Matthew Dillon

:On Monday 13 August 2012 12:30:05 Matthew Dillon wrote:
: Well, a 2.8 CD wouldn't have worked but you now burned a more recent
: CD and you are getting the panic again?  The question is what is the
: console output above the 'lockmgr' line ?  i.e. all I see there is
: part of the backtrace, and not the actual reason for the panic.  It's
: quite possible that it is a software bug that is exploding it.
:lockmgr was on the top line. When I mount it with the CD booted, the top 
:line is panic, which is one above lockmgr, and I can't scroll back. Is 
:there a way to make more lines on the console?
:ve ka'a ro klaji la .romas. se jmaji

   If you have a DDB prompt you can hit the scroll-lock button and then
   cursor up.

Matthew Dillon

Re: fails to mount root

2012-08-13 Thread Matthew Dillon

:On Monday 13 August 2012 15:38:46 Matthew Dillon wrote:
:If you have a DDB prompt you can hit the scroll-lock button and then
:cursor up.
:HAMMER(ROOT) recovery check seqno=7a824b53
:recovery range 308735b0-30878e48
:recovery nexto 30878e48 endseqno=7a824c80
:recovery undo 308735b0-30878e48 (22680 bytes)(RW)
:Found REDO_SYNC 308516b0
:Ignoring extra REDO_SYNC records in UNDO/REDO FIFO. (9 times)
:recovery complete
:recovery redo 308735b0-30878e48 (22680 bytes)(RW)
:Find extended redo 308516b0, 139008 exbytes
:panic: lockmgr: LK_RELEASE: no lock held
:cpuid = 0
:I believe in Yellow when I'm in Sweden and in Black when I'm in Wales.

Hmm.  Well, that's clearly a software bug but I'm not sure what
is causing the lock to be lost.  The debugger backtrace isn't
consistent.  I would love to get a kernel core out of this but
it's too early in the boot sequence.

Antonio will have a patch for a boot-time tunable that will bypass
the hammer2 recovery code tomorrow sometime.

Matthew Dillon

HAMMER2 progress report - 07-Aug-2012

2012-08-07 Thread Matthew Dillon
Hammer2 continues to progress.  I've been working on the userland
spanning tree protocol.

* The socket/messaging system now connects, handshakes with public
  key encryption, and negotiates AES keys for the session data stream.

* The low level transactional messaging subsystem is pretty solid now.

* The initial spanning tree protocol implementation is propagating
  node information across the cluster and is handling connect/disconnect
  events properly.  So far I've only tested two hosts x 10 mounts,
  but pretty soon now I will start using vkernels to create much
  larger topologies for testing purposes.

  Essentially the topology is (ascii art):

   Host #1                      Any cluster (graph) topology
   __  __
  /  \/  \
  PFS mount --\
  PFS mount --\\ /---(TCP)-- Host #2 --(TCP)\
  PFS mount -- hammer2 service ---(TCP)- Host #3
  PFS mount --// \---(TCP)-- Host #4 --(TCP)/
  PFS mount --/

  Full graph spanning tree protocol so there can be loops, multiple ways
  to get to the same target, and so on and so forth.  The SPANs propagate
  the best N (one or two) paths for each mount.

  Any given mount is just a HAMMER2 PFS, so there will be immense
  flexibility in what a 'mount' means.  i.e. is it a master node?  Is
  it a slave?  Is it a cache-only node?  Maybe its a diskless client-only
  node (no persistent storage at all), etc.

  Because each node is a PFS, and PFS's can be trivially created
  (any single physical HAMMER2 filesystem can contain any number
  of PFS's), people will have a great deal of flexibility in how
  they construct their clusters.

* The low level messaging subsystem is solid.  Message relaying is next
  on my TODO list (using the spanning tree to relay messages).  After
  that I'll have to get automatic-reconnection working properly.

Once the low level messaging subsystem is solid I will be able to start
working on the higher-level protocols, which is the fun part.  There is
still a very long ways to go.

Ultimately the feature set is going to be huge, which is one reason why
there is so much work left to do.  For example, we want to be able to
have millions of diskless or cache-only clients be able to connect into
a cluster and have it actually work... which means that the topology
would have to support 'satellite' hosts to aggregate the clients and
implement a proxy protocol to the core of the topology without having
to propagate millions of spanning tree nodes.  Ultimately the topology
has to allow for proxy operation, otherwise the spanning tree overhead
becomes uncontrolled.  This would also make it possible to have
internet-facing hosts without compromising the cluster's core.

Also note that dealing with multiple physical disks and failures will
also be part of the equation.  The cluster mechanic described above is
an abstraction for having multiple copies of the same filesystem in
different places, with varying amounts of data and thus gaining
redundancy.

But we ALSO want to be able to have a SINGLE copy of the filesystem
(homed at a particular machine) to use the SAME mechanism to glue
together all of its physical storage into a single entity (plus with
a copies mechanic for redundancy), and then allow that filesystem to
take part in the multi-master cluster as one of the masters.

All of these vastly different feature sets will use the same underlying
transactional messaging protocol.

x bazillion more features and that's my goal.


Re: solid-state drives

2012-08-03 Thread Matthew Dillon
Well, dedup has fairly low overhead so that would be fine on a SSD
too, but because SSD's tend to be smaller than HDDs there also tends to
be not so much data to dedup, so you might not get much out of enabling
it.

The SSD's biggest benefit is as a cache, though I don't discount the
wonderfully fast boots I get with SSD-based roots on my laptops.
Random access read I/O on a SSD is several orders of magnitude faster
than on a HDD (e.g.  20,000+ iops vs 250-400 iops)... that's a 50x
factor and a 15K rpm HDD won't help much.

Random write I/O is a bit more problematic and depends on many
factors, mainly related to how well the SSD is able to write-combine
the I/O requests and the size of the blocks being written.  I haven't
run any tests in this regard, but something like the OCZ's with their
massive ram caches (and higher power requirements) will likely do better
with random writes than, e.g. the Intel SSDs which have very little ram.

Linear read and write I/O between a SSD and a HDD are closer.  The SSD
will be 2x-4x faster on the linear read I/O (instead of 50x faster),
and maybe 1.5-2x faster for linear write I/O.

NOTE!  This is for a reasonably-sized SSD, 200GB or larger.  SSD
performance is DIRECTLY related to the actual number of flash chips in
the SSD, so there is a huge difference in the performance of, say,
a 200GB SSD versus the performance of a 40GB SSD.

A 40GB SSD can be limited to e.g. 40 MBytes/sec writing.  A 200GB SSD
with a 6GBit/sec SATA phy can do 400 MBytes/sec writing and exceed
500 MBytes/sec reading.  Big difference.

Matthew Dillon

Re: solid-state drives

2012-08-02 Thread Matthew Dillon

:On Wed, Aug 01, 2012 at 06:16:13PM -0400, Pierre Abbat wrote:
: This is a spinoff of the Aleutia question, since Aleutia puts SSDs in 
: computers. How does the periodic Hammer job handle SSDs? Does reblocking do 
: anything different than on an HDD? If a computer has an SSD and an HDD, 
: which should get the swap space?
: Pierre
:On my workstation I use an SSD for the root filesystem, swapcache and
:The current configuration has snapshots set to 1d (10d retention time)
:and reblocking is set to 7d (1m runtime). All other option (prune,
:rebalance, dedup and recopy) are disabled.
:Currently it is running fine, but in my opinion running swapcache on a
:workstation that just runs for a couple of hours is not always
:necessary. I'm just running this setup to play with the swapcache and
:the SSD, because I think it is a very nice feature.

You will definitely want to turn pruning on, it doesn't do all that much
I/O and its needed to clean up the fine-grained snapshots.  Rebalance,
dedup, and recopy can be left turned off.

Matthew Dillon

Re: frequency scaling on D525MW not working properly

2012-08-02 Thread Matthew Dillon
Also on the D5* atoms on FreeBSD it would be nice to check that it
actually works as advertised, by running a few cpu-bound processes
(i.e. for (;;); ) and measuring the watts being burned at different
frequencies.  That's the real proof that the frequency scaling is doing
something real.


Re: solid-state drives

2012-08-01 Thread Matthew Dillon

:This is a spinoff of the Aleutia question, since Aleutia puts SSDs in 
:computers. How does the periodic Hammer job handle SSDs? Does reblocking do 
:anything different than on an HDD? If a computer has an SSD and an HDD, which 
:should get the swap space?
:lo ponse be lo mruli po'o cu ga'ezga roda lo ka dinko

It depends on the purpose.  I run several laptops with both swap and
root on the SSD.  Obviously swapcache is turned off in that situation.
I usually adjust the hammer config to turn off the 'recopy' feature
and I usually set the reblocker to run less often, but that's about
it.

I only suggest running a hammer filesystem on a SSD under carefully
controlled conditions... that is, not if you are going to be manipulating
any large databases that could blow out the SSD.  Normal laptop use
should work fine but one always has to be cognizant of the SSD's limited
write cycles.


For machines that are working harder or which need a lot of disk space
I run hammer on the HDD and put swap on the SSD, and enable swapcache.
The SSD works very well as a cache under these circumstances.

You still need to avoid heavy paging due to running programs which
are too big to fit into memory, since that can wear out the SSD.

This is my preferred setup.  All the DragonFly boxes run with SSD-based
swap and swapcache turned on.

Matthew Dillon

Re: Latest 3.1 development version core dumps while destroying master PFS

2012-07-25 Thread Matthew Dillon

:I tried to destroy the PFS after unmounting
:1. after downgrading
:2. with latest dev snapshot usb stick
:3. in single user mode
:4. after creating a link
:DataNew - @@-1:8
:The system always core dumps.
:I guess Matt will be busy working on  HAMMER2 and wonder if i should
:keep waiting till the bug is fixed.
:Since this is my Main backup Server Should I just re-install the whole
:thing and move forward?

Well, the media looks corrupted to me.  It hit a fairly serious
assertion.  If you need the data on that media you should be able
to 'hammer recover' it to another filesystem on a different partition,
but that particular filesystem looks like it is toast to me.

Matthew Dillon

Re: Unable to mount hammer file system Undo failed

2012-07-19 Thread Matthew Dillon
People who use HAMMER also tend to backup their filesystems using
the streaming mirroring feature.  You need a backup anyway, regardless.
HAMMER makes it easy, and this is the recommended method for dealing
with media faults on HDDs not backed by hardware RAID (and even if
they are).  You need to backup your data anyway, after all, regardless
of the filesystem (even ZFS's 'copies' feature has its limits due to
the fact that the copies are all being managed from the same machine).

FreeBSD's background fsck and mounting without an fsck (depending on
softupdates) has NEVER been well vetted to ensure that it works in
all situations.  There have been lots of complaints about related
failures over the years, mostly blamed on failed writes to disks or
people not having UPS's (since UFS was never designed to work with
a disk synchronization command, crashes from e.g. power failures could
seriously corrupt disks above and beyond lost sectors).  They can
claim it works better now, but I would never trust it.  Background fsck
itself can render a server unusable due to lost performance.

HAMMER has a 'hammer recover' command meant to be used when all else
fails.  It can be used directly with the bad/corrupted disk as the source
and a new disk as the destination.  It scans the disk, yes.  A full
fsck on a very large (2TB+) filled filesystem is almost as bad when it
starts having to seek around.

I have had numerous failed disks over the years and have never had to
actually use the recover command.  I always initialize a replacement
from one of the several live backups I keep.

HAMMER2 will have some more interesting features that flesh out the
live backup mechanic a bit better, making it possible to e.g. initialize
a replacement disk locally and leave the filesystem live using a remotely
served backup as the replacement is reloaded from the backup.  But it
isn't possible with HAMMER1, sorry.

Matthew Dillon

Re: Unable to mount hammer file system Undo failed

2012-07-19 Thread Matthew Dillon

:I have PFS slaves on a second disk.
:I have already fitted a new disk and the OS installation is complete.
: I will upgrade the Slaves to Master and then configure slaves for
:them so there is no problem.
:But I have lost the snapshot symlinks :-(
:In the PFSes I snapshotted every 5 minutes I have a lot of symlinks.
:Is there any easy way to recreate those symlinks from the snapshot IDs ?

Try 'hammer snapls mountpt'.  The snapshots are recorded in meta-data
so if they're still there you can get them back.  You may have to
write a script to recreate the softlinks from the output.
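Such a script could look roughly like this.  The sample text below only mimics snapls' transaction-id listing (the real field layout may differ, so adjust the awk fields); on a live system you would run 'hammer snapls /mountpt' in place of the sample function.

```shell
# Hypothetical sketch: regenerate snapshot softlinks from 'hammer snapls'
# style output.  Sample text stands in for the real command here.
snapls_output() {
    cat <<'EOF'
Snapshots on /home      PFS #3
Transaction ID          Timestamp               Note
0x00000001061a8ba0      2012-07-19 02:21:13 PDT -
0x00000001061aa120      2012-07-19 03:00:04 PDT -
EOF
}
linkdir=/tmp/snaps-demo
mkdir -p "$linkdir"
# take the transaction id and date_time fields from each snapshot line
snapls_output | awk '/^0x/ { print $1, $2 "_" $3 }' |
while read -r tid stamp; do
    # HAMMER snapshot softlinks point at filesystem@@transaction-id
    ln -sfn "/home@@${tid}" "$linkdir/snap-${stamp}"
done
ls "$linkdir"
```

Each recreated link then behaves like the originals: cd'ing through it gives you the filesystem as of that transaction id.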

Matthew Dillon

Re: questions from FreeBSD user

2012-07-15 Thread Matthew Dillon
:On Sun, Jul 15, 2012 at 5:02 PM, Wojciech Puchar wrote:
: i have few questions. i am currently using FreeBSD, dragonfly was just
: tried.
: 1) why on amd64 platform swapcache is said to be limited to 512GB? actually
: it may be real limit on larger setup with more than one SSD.

It seemed like a reasonable limit for the KVM overhead involved,
though I don't remember the exact reason I chose it originally.

The practical limitation for swap is 4096GB (4TB) due to the use
of 32 bit block numbers coupled with internal arithmetic overflows
in the swap algorithms which eats another 2 bits.

We do not want to increase the size of the radix tree element because
the larger structure size would double the per-swap-block physical memory
overhead, and physical memory overhead is already fairly significant...
around  1MB of physical memory is needed per 1GB of swap.
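The 4TB figure falls straight out of that block-number arithmetic: 32-bit block numbers lose 2 bits to overflow headroom, and each remaining block number addresses one 4KB page.  A quick sketch of the math:

```shell
# 32-bit swap block numbers minus 2 bits of internal overflow headroom,
# each block addressing one PAGE_SIZE (4KB) page:
usable_bits=$(( 32 - 2 ))
blocks=$(( 1 << usable_bits ))                  # 2^30 usable block numbers
page=4096                                       # PAGE_SIZE in bytes
max_swap_tb=$(( blocks * page / (1024 * 1024 * 1024 * 1024) ))
echo "max swap: ${max_swap_tb}TB"               # prints 'max swap: 4TB'
```

The ~1MB-of-RAM-per-1GB-of-swap overhead is what makes shrinking the radix tree element attractive, and doubling the element size would double exactly that cost.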

There are a maximum of 4 swap devices (w/512GB limit by default in total,
with the per-device limit 1/4 of that).  Devices are automatically
interleaved and can be added and removed on the fly.  The maximum can
be increased with a kernel rebuild but it is not recommended... you
generally won't get more performance once you get past 4 devices.

: 2) it is said that you are limited to cache about 40 inodes unless you
: use sysctl setting vfs.hammer.doublebuffer or so.
: in the same time it is said to be able to cache any filesystem.
: Can UFS be cached efficiently with millions of files?

32 bit systems will be limited to ~100,000 inodes or so.

64 bit systems calculate a default limit (kern.maxvnodes) based on
available ram, with no cap.  So values > 1 million will be common.

And you can always raise this value via the sysctl.

UFS ought to be cacheable by swapcache but there's no point using it
on DragonFly.  You should use HAMMER.

: 3) how about reboots? From my understanding reboot, even clean, means losing
: ALL cached data. am i right?

All swapcache-cache data is lost on reboot.

: In spite of HAMMER being far far far better implementation of filesystem
: that ZFS, i don't want to use any of them for the same reasons.
: UFS is safe.

A large, full UFS filesystem can take hours to fsck, meaning that a
crash/reboot of the system could end up not coming back on line for
a long, long time.  On 32-bit systems the UFS fsck can even run the
system out of memory and not be able to complete.  On 64-bit systems
this won't happen but the system can still end up paging heavily
depending on how much ram it has.

In contrast, HAMMER is instant-up and has no significant physical
memory limitations (very large HAMMER filesystems can run on systems
with small amounts of memory).

: 4) will virtualbox or qemu-kvm or similar tool be ported ever to DragonFly?
: i am not fan of virtualizing everything, which is pure marketing nonsense,
: but i do some virtualize few windows sessions on server.
: thanks

With some work, people have had mixed results, but DragonFly is designed
to run on actual hardware and not under virtualization.

Matthew Dillon

Re: machine won't start

2012-07-04 Thread Matthew Dillon
Normally this issue can be fixed by setting the BIOS to access the
disk in LBA or LARGE mode.  The problem is due to a bug in the BIOS's
attempt to interpret the slice table in CHS mode instead of logical
block mode.  It's a BIOS bug.  These old BIOS's make a lot of assumptions
w/regards to the contents of the slice table, including making explicit
checks for particular OS types in the table.

I've only ever seen the problem on old machines, and I've always
been able to solve it by setting the BIOS access mode.

I've never, ever found a slice table format that works properly across
all BIOSs.  At this juncture we are using only newer (newer being 'only'
25+ years old) slice table formats (aka LBA layouts and using proper
capped values for hard drives that are larger than the 32-bit LBA layout
can handle).

Ultimately we will want to start formatting things w/GPT, but that opens
up a whole new can of worms... old BIOSes can explode even more easily
when presented with a GPT's compat slice format, at least as defined
by GPT.  Numerous vendors such as Apple modified their GPT to try
to work around the even larger number of BIOS bugs related to GPT
formatting than were present for the older LBA formatting.

I consider it almost a lost cause.


Re: machine won't start

2012-07-04 Thread Matthew Dillon

:Thanks Matt for the explanation and tip.
:It did of course hang when I tried to DEL into the BIOS.
:What worked is pulling out the sata connector, entering
:the BIOS putting it back and then detecting the disk.
:Interesting the auto detection then worked. I've explicitly
:set it to LARGE and now I can boot a rescue cd.
:How many bytes should I zero out for the disk to be
:normal again? 512bytes? 4megs?

The BIOS is basically just accessing the slice table in the first
512 bytes of the disk.  If I want to completely wipe a non-GPT
formatted disk I usually zero-out (with dd) the first ~32MB or so
to catch both the slice table and the stage-2 boot and the disklabel
and the likely filesystem header.

Destroying a GPT disk requires (to be safe) zero'ing out both the first
AND the last X bytes of the physical media to also ensure that the
backup GPT table is also scrapped.  Again, to be safe I zero-out around
32MB at the beginning and 32MB at the end w/dd (if it's GPT).
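A scaled-down demo of that wipe, run against an image file standing in for the raw device (sizes shrunk from 32MB to 32KB so it is harmless to try; on real media you would target the /dev node with a 32MB count, and the device path is of course your own):

```shell
# Demo: zero the first and last chunk of a "disk" (here a 1MB image file).
img=/tmp/wipe-demo.img
dd if=/dev/urandom of="$img" bs=1024 count=1024 2>/dev/null   # fake disk
wipe_kb=32                               # stands in for 32MB on real media
size_kb=$(( $(wc -c < "$img") / 1024 ))
# zero the front: slice table, stage-2 boot, disklabel, filesystem header
dd if=/dev/zero of="$img" bs=1024 count=$wipe_kb conv=notrunc 2>/dev/null
# zero the tail: the backup GPT table lives at the end of the media
dd if=/dev/zero of="$img" bs=1024 seek=$(( size_kb - wipe_kb )) \
    count=$wipe_kb conv=notrunc 2>/dev/null
```

conv=notrunc matters when practicing on a file (so dd doesn't truncate it); on a raw device it is a no-op.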

This will effectively destroy everything on the disk from the point
of view of probing, so please note that these instructions are NOT
going to leave multi-OS installations intact.

Matthew Dillon

Re: machine won't start

2012-07-04 Thread Matthew Dillon
: I consider it almost a lost cause.
:Don't get it: trying to fix this is a lost cause?

Yah, because if we fix it for one BIOS we break it for another.
Hence, a lost cause.  There is no single fix which covers all BIOSs.


Re: pkgsrcv2.git stopped syncing?

2012-05-12 Thread Matthew Dillon
:The latest commit on pkgsrcv2.git is 8ce625e3, which is from
:9 days ago.  But I see more commits after this date on
:Could someone take care of it?
:Best Regards,
:YONETANI Tomokazu.

Ok, working on it.  Grr, that thing is getting more fragile.
It's probably an incremental update failure of the CVS repo.


Re: pkgsrcv2.git stopped syncing?

2012-05-12 Thread Matthew Dillon

::The latest commit on pkgsrcv2.git is 8ce625e3, which is from
::9 days ago.  But I see more commits after this date on
::Could someone take care of it?
::Best Regards,
::YONETANI Tomokazu.
:Ok, working on it.  Grr, that thing is getting more fragile.
:It's probably an incremental update failure of the CVS repo.
:   -Matt

It should be fixed now (it fixed itself this morning, I didn't have
to do anything).  The script was failing trying to do the incremental
cvs checkout.

e.g. from the logs:

U cvs-master/emulators/ucon64/patches/patch-af
U cvs-master/net/xymon/PLIST
U cvs-master/net/xymon/
cvs checkout: move away `cvs-master/print/LPRng/Makefile'; it is in the way
C cvs-master/print/LPRng/Makefile
cvs checkout: move away `cvs-master/print/LPRng-core/MESSAGE'; it is in the way
C cvs-master/print/LPRng-core/MESSAGE
... (repeat a hundred times) ...

I'm not sure why this happens.  CVS somehow gets confused over what
files are in the repo and what files are not, possibly due to catching
an update in the middle or surgery done in the master cvs repo.

This time it fixed itself.  Sometimes I have to blow the checkout away
and let it re-checkout everything over again.

Theoretically I could do a fresh rm -rf and checkout every time, but
that seems really wasteful of crater's disk and time.  It already takes
crater between 1 and 2 hours to run the cvs-git script so for now I
am still leaving it set to do an incremental checkout.

Matthew Dillon

Re: pkgsrcv2.git stopped syncing?

2012-05-12 Thread Matthew Dillon

:I did that - rm'd the checkout.  I was going to write an email about
:this as soon as I saw whether it worked again after the next 'normal'
:checkout.  Looking at gitweb, I see the conversion commit, but I don't
:see any subsequent commits... but I don't know if there are any yet.

Ah ha!  The ghost in the machine strikes again!

We should probably modify the script to blow the directory away once
a week just to make sure it can auto-recover from that situation.
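A hypothetical sketch of that weekly blow-away (the demo path and the 7-day threshold are made up; the real script would point at the cvs checkout tree on crater):

```shell
# If the checkout directory is more than a week old, remove it so the
# next run does a fresh 'cvs checkout' instead of an incremental one.
checkout=/tmp/cvs-master-demo
mkdir -p "$checkout"
touch -d '10 days ago' "$checkout"       # simulate a stale tree (GNU touch)
if [ -n "$(find "$checkout" -maxdepth 0 -mtime +7)" ]; then
    rm -rf "$checkout"
fi
```

Keying off the directory's own mtime keeps the logic self-contained; a timestamp file updated after each successful full checkout would be a more robust trigger.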

Matthew Dillon

Re: help with a failed cpdup assert

2012-04-26 Thread Matthew Dillon

:I am trying to sync some pretty similar directories with cpdup over ssh  
:and three of them are syncing, but the fourth fails with this:
:cpdup: hclink.c:343: hcc_leaf_data: Assertion `trans->windex +
:sizeof(*item) + bytes < 65536' failed.
:Additionally it only fails when
:source (cpdup slave) =pull= destination (cpdup master)
:and not when
:source (cpdup master) =push= destination (cpdup slave)
:What could be causing this?
:- Nikolai

The assertion was incorrect.  That's an old version of cpdup, updating
to the latest should solve the problem.

Matthew Dillon

Re: How to suppress kernel hammer debug messages.

2012-03-08 Thread Matthew Dillon

:I am new to this mailing list and was wondering if anyone could
:help me figure out how to suppress or otherwise disable the logging of these
:apparently benign debug messages that are filling up my syslog file.
:hammer: debug: forcing async flush ip 0001093483e9...

The debugging message was added to verify that a particular bug
was being caught and fixed.  It's one of several unconditional
debugging kprintf()'s that could probably be stripped out from
the code.

There's no conditionalization on it.  I will push a conditionalization
of this particular message to master and the 3.0 release branches.
Getting rid of them will require recompiling the kernel w/updated sources.

Or you can just strip the related kprintf out yourself and recompile
your kernel (the three lines at line 2438 of
/usr/src/sys/vfs/hammer/hammer_inode.c if you have unpacked the sources
should be where this kprintf() resides).

Matthew Dillon

Re: Install DragonFlyBSD on 48 MB RAM

2012-03-01 Thread Matthew Dillon
:One of our developers tested with snapshots; it looks like the DMA
:reserve commit is the one that made DF no longer run w/ 48MB. That
:makes sense, as 16MB of physical memory is locked up by that commit.
:You should be able to boot with a loader variable set to reserve less
:physical memory.
:We someday need a better physmem allocator; the 16MB reserve is a good
:step, but a low-fragmentation allocator would be better.
:-- vs;

   It should be reserving less space on low-memory machines.

if (vm_dma_reserved == 0) {
        vm_dma_reserved = 16 * 1024 * 1024;     /* 16MB */
        if (vm_dma_reserved > total / 16)
                vm_dma_reserved = total / 16;
}

   We could try zeroing it.  Or perhaps the calculation is wrong...
   maybe it should be basing the test on 'npages' instead of 'total'.
   e.g. ((vm_paddr_t)npages * PAGE_SIZE / 16) instead of (total / 16).

   However, we really don't support machines with so little memory,
   even if the thing manages to boot.  If a simple change makes it work
   then fine, but otherwise I'm skeptical of the value.

   This variable is a tunable.  Try setting 'vm.dma_reserved=0' in the
   boot loader.

Matthew Dillon

Re: Install DragonFlyBSD on 48 MB RAM

2012-02-24 Thread Matthew Dillon
I think the answer is probably 'no'.  We don't try to make the system
work with such a small amount of memory.  It should be able to boot
with 128MB of ram or more, though to really be decent a more contemporary
machine is necessary.

It might boot on less memory... in fact it will, but I don't think I've
ever tried to boot it with less than 64M, and even 64M is probably too
little.


hammer2 branch in dragonfly repo created - won't be operational for 6-12 months.

2012-02-08 Thread Matthew Dillon
I have created a hammer2 branch in the main repo so related commit messages
are going to start showing up in the commits@ list.  This branch will loosely
track master but also contain the hammer2 bits that we are working on.

The initial commit this branch contains mostly non-compilable specifications
work and header files.

hammer2 is NOT expected to be operational for at least 6 months, so don't get
your hopes up for it becoming available any time soon.  Once it becomes
operational most of the features are NOT expected to be in place until the
end of the year (hardlinks probably being one of those features that will
happen last).

At some point starting at around 6 months, when all the basics are working
and the media structures are stable, it will be possible to split the
workload up for remaining features.  I'll be posting another followup in
a few minutes on the design work done since the last posting.

Matthew Dillon

DESIGN document for HAMMER2 (08-Feb-2012 update)

2012-02-08 Thread Matthew Dillon
This is the current design document for HAMMER2.  It lists every feature
I intend to implement for HAMMER2.  Everything except the freemap and
cluster protocols (which are both big ticket items) has been completely
speced out.

There are many additional features versus the original document,
including hardlinks.

HAMMER2 is all I am working on this year so I expect to make good
progress, but it will probably still be July before we have anything
usable, and well into 2013 before the whole mess is implemented and
even later before the clustering is 100% stable.

However, I expect to be able to stabilize all non-cluster related features
in fairly short order.  Even though HAMMER2 has a lot more features than
HAMMER1 the actual design is simpler than HAMMER1, with virtually no edge
cases to worry about (I spent 12+ months working edge cases out in
HAMMER1's B-Tree, for example... that won't be an issue for HAMMER2).

The work is being done in the 'hammer2' branch off the main dragonfly
repo in appropriate subdirs.  Right now it's just vsrinivas and me, but
hopefully enough will get fleshed out in a few months that other people
can help too.

Ok, here's what I have got.


Matthew Dillon

* These features have been speced in the media structures.

* Implementation work has begun.

* A working filesystem with some features implemented is expected by July 2012.

* A fully functional filesystem with most (but not all) features is expected
  by the end of 2012.

* All elements of the filesystem have been designed except for the freemap
  (which isn't needed for initial work).  8MB per 2GB of filesystem
  storage has been reserved for the freemap.  The design of the freemap
  is expected to be completely speced by mid-year.

* This is my only project this year.  I'm not going to be doing any major
  kernel bug hunting this year.

Feature List

* Multiple roots (allowing snapshots to be mounted).  This is implemented
  via the super-root concept.  When mounting a HAMMER2 filesystem you specify
  a device path and a directory name in the super-root.

* HAMMER1 had PFS's.  HAMMER2 does not.  Instead, in HAMMER2 any directory
  in the tree can be configured as a PFS, causing all elements recursively
  underneath that directory to become a part of that PFS.

* Writable snapshots.  Any subdirectory tree can be snapshotted.  Snapshots
  show up in the super-root.  It is possible to snapshot a subdirectory
  and then later snapshot a parent of that subdirectory... really there are
  no limitations here.

* Directory sub-hierarchy based quotas and space and inode usage tracking.
  Any directory sub-tree, whether at a mount point or not, tracks aggregate
  inode use and data space use.  This is stored in the directory inode all
  the way up the chain.

* Incremental queueless mirroring / mirroring-streams.  Because HAMMER2 is
  block-oriented and copy-on-write each blockref tracks both direct
  modifications to the referenced data via (modify_tid) and indirect
  modifications to the referenced data or any sub-tree via (mirror_tid).
  This makes it possible to do an incremental scan of meta-data that covers
only changes made since the mirror_tid recorded in a prior run.

  This feature is also intended to be used to locate recently allocated
  blocks and thus be able to fixup the freemap after a crash.

  HAMMER2 mirroring works a bit differently than HAMMER1 mirroring in
  that HAMMER2 does not keep track of 'deleted' records.  Instead any
  recursion by the mirroring code which finds that (modify_tid) has
  been updated must also send the direct block table or indirect block
  table state it winds up recursing through so the target can check
  similar key ranges and locate elements to be deleted.  This can be
  avoided if the mirroring stream is mostly caught up in that very recent
  deletions will be cached in memory and can be queried, allowing shorter
  record deletions to be passed in the stream instead.

* Will support multiple compression algorithms configured on a subdirectory
  tree basis and on a file basis.  Up to 64K block compression will be used.
  Only compression ratios near powers of 2 that are at least 2:1 (e.g. 2:1,
  4:1, 8:1, etc) will work in this scheme because physical block allocations
  in HAMMER2 are always power-of-2.

  Compression algorithm #0 will mean no compression and no zero-checking.
  Compression algorithm #1 will mean zero-checking but no other compression.
  Real compression will be supported starting with algorithm 2.

* Zero detection on write (writing all-zeros), which requires the data
  buffer to be scanned, will be supported as compression algorithm #1.
  This allows the writing of 0

Re: File corrupted on crash reboot. Can someone help diagnose?

2012-02-03 Thread Matthew Dillon

:I had an email that I was writing to a few people. The computer rebooted 
:itself. I restarted Kmail and found the message window empty. I cd'ed into the 
:directory where it keeps autosaved copies of email being composed and found 
:that it had been overwritten with zero bytes. Fortunately I could recover the 
:content with undo (I've had this happen on Linux and was out of luck). Can 
:someone receive the undo output and the reboot times and figure out what 
:happened? I don't want to post it publicly, as it's a personal email, but I 
:can send it privately to a developer.
:li fi'u vu'u fi'u fi'u du li pa

The file might not be recoverable if it wasn't fsynced to disk.  It
might have still been in the memory cache for the filesystem.  You
can try running 'undo -i filename' but you may be out of luck if
the file contents aren't available under any of the transaction ids it
knows about.

Matthew Dillon

Re: top command

2012-01-27 Thread Matthew Dillon

:Why 'PRES' in 'top' is 0 for all processes?

I had to disable statistics collection for PRES because there were some
serious SMP concurrency issues when the VM subsystem was changed over
to using fine-grained locks.

Matthew Dillon

New colo box installed,

2012-01-15 Thread Matthew Dillon
Kronos will be mirroring most of avalon's services, act as another
off-site DNS server, help with our off-site backups, and probably
also eventually host our web site.

We'll be working it up over the next few weeks.  It's a ridiculously
overpowered box with 16G of ram and a 200G SSD for swapcache.

Matthew Dillon

Mailing list archive operational again, nntp service discontinued

2012-01-11 Thread Matthew Dillon
* The mailing list archive is operational again and most/all of the
  lost messages have been fed into it.

* Our nntp service has been discontinued, superseded by things like
  gmail and such which provide really nice multi-device interfaces for
  threaded list mail.  Its time has come.

* Work on a new, better web-based mailing list management interface is
  ongoing.  We know the old mail-based bestserv stuff has gotten a bit
  too crufty.

Matthew Dillon

Re: disable lpr

2012-01-03 Thread Matthew Dillon

:I installed cups, which has its own lpr program, and deleted the lpr that is 
:in world. If I rebuild world, how do I tell it not to install lpr? I know I 
:did this for sendmail, but I forgot where the configuration is.
:lo ponse be lo mruli po'o cu ga'ezga roda lo ka dinko

I'm running cups on my workstation too, talking to a Canon printer.

Instead of disabling lpr I just reworked the PATH environment variable
to put /usr/pkg/bin before /usr/bin.

Another trick I use if the above is too sneaky is to put /usr/local/bin
first in the PATH and create a script called lpr to exec the one
from /usr/pkg/bin.


If we really wanted to make things easy we could make /usr/bin/lpr
recognize an environment variable to tell it to forward to another
lpr (aka /usr/pkg/bin/lpr)... though we'd have to be careful since
/usr/bin/lpr is suid and sgid.  Maybe a simple 'LPR_USE_PKGSRC'
env variable that could be set to '1'.

Matthew Dillon

Re: Dragonflybsd site seems to go down frequently!

2011-12-27 Thread Matthew Dillon

:Again the site is down (GMT 08:53:45 Decemeber 27, 2011). Make me
:worry whether I could really go for a dfbsd production server?!!!

And there will probably be downtime in the future.  The machines behind
our web site typically run the absolute latest development code and
we expect there to be crashes or other issues, which we then go and fix.
Since the project is small this is the only real way we can test the


Re: Which is ideal with HAMMER? softraid or hammer volume_add

2011-12-27 Thread Matthew Dillon
Definitely not hammer volume add, that's too experimental. Soft-raid
is a bit of a joke in my view, since it typically ties you to a
particular motherboard and bios (making it difficult to physically
move disks to another machine if the mobo or psu dies), and as with
all soft-raid systems any sort of power failure during a write is
likely to cause unrecoverable data loss.  Honestly I don't know of a
single system that ever had fewer failures with soft-raid than with
single disks w/ near real-time backup streams.

For HAMMER1 the best set-up is either a real raid system or no raid
at all and a master/slave server setup, depending on what is being
served.  Unfortunately nothing in BSD really approaches Linux's block
level clustering and VZ container system at the moment (which is a bit of
a joke too when it comes to multiple failover events but works pretty
well otherwise).

If you have a small system then there's no point running RAID.  If you
have a larger system then there's no point running a single server.
And running RAID on multiple servers eats a lot of power so for storage
needs less than what conveniently fits on one or two disks there's no
point running RAID at all... you run redundant servers instead and use
a SSD as a caching layer in front of the slower hard drive.

For larger single-volume storage needs multiple real raid systems for
primary and backup with all the sundry fallback hardware is the only
way to go.  Soft-raid won't cut it.

Matthew Dillon

Re: Request for suggestion for setting up a server with 4 HDDs

2011-12-27 Thread Matthew Dillon

:Thanks for the pointer, but again the dragonflybsd site is down (GMT
:08:53:45 Decemeber 27, 2011) to access the link Justin pointed to: :-(

   Insofar as I can tell the site is up, accessed from the outside internet.


Re: Dragonflybsd site seems to go down frequently!

2011-12-26 Thread Matthew Dillon
:Is it only me or others also experience frequent downtime with the
:downtime. I experienced downtime several times and right now (GMT
:13:43 December 26, 2011) when I tried to access the official site!

No, we had some issues overnight, primarily a cpu hogging bug on
avalon (which routes dragonfly's internal network via openvpn) which
I thought I had fixed but hadn't.  The site should be accessible again.

Matthew Dillon

Re: bug in du: truncates filenames

2011-12-26 Thread Matthew Dillon
:I'm running du on snapshots to see how much space is taken by work directories 
:(which will stick around for over another month; the downloaded tarballs will 
:disappear in just a few days). I got this error:
:# du -s /var/hammer/usr/snap-20111?11*/pkgsrc/
:No such file or directory
:No such file or directory
:I checked the directory; the files are actually and xtermcfg.hin . 
:It's not simply truncating the filename to a fixed length, since there's a 
:file named xterm.log.html , which is longer. Any idea what's going on?

What's probably happening is the snapshot caught a flush in between its
directory entry creation and its inode creation.  There is probably a
directory entry for the files in question but no inode.

It isn't supposed to happen but does sometimes.  It's a bug in HAMMER
that I haven't found yet.

If you cd into the snapshot and run ls:

cd /var/hammer/usr/snap-2011-0501/pkgsrc/x11/xterm/work/xterm-259

You should see the 'ls' program complain about a missing 'Makefile'.


Merry X-Mas and 3.0 release after the holidays - date not yet decided

2011-12-25 Thread Matthew Dillon
Hello everyone!  First, I apologize for the aborted 2.12 release.  We
got as far as rolling it but I decided to make a real push to try to
fix the occasional random seg-fault bug that we were still seeing
on 64-bit at the time.

The seg-fault issue has now been resolved, I posted an exhaustive
synopsis to the kernel@ list just a moment ago.  Basically it appears
to be an AMD cpu bug and not a DragonFly bug.  We don't have final
confirmation that it isn't a DragonFly bug because it is so sensitive
to %rip and %rsp values that reproducing the environment to test it
on other OSs (even FreeBSD) is difficult, but I'm 99% certain it's an
AMD bug.  Adding a single NOP instruction to the end of one routine in
the gcc-4.4 codebase appears to work around the bug.

So moving on to rolling an official release...

(1) Through past experience we will NOT do a release during the
holidays!  So everyone please enjoy Christmas and New Years!

(2) I would like to call the release 3.0.  Why?  Because while
spending the last ~1-2 months tracking down the cpu bug a whole lot
of other work has gone into the kernel including major network
protocol stack work and major SMP work.  My contribution to the SMP
work was to completely rewrite the 64-bit pmap, VM object handling
code, and VM fault handling code, as well as some other stuff.

This has resulted in a phenomenal improvement in concurrency and
in particular concurrent compilations or anything that takes a lot
of page faults.  SMP contention was completely removed from the
page fault path and for most VM related operations, and almost
completely removed from the exec*() path.

Other related work has continued to improve mysql and postgresql
numbers as well.

(3) Release date is as-yet undecided.  It will probably be mid-February
to end-February in order to synchronize with the pkgsrc 2011-Q4
release and give things time to settle.  The release meisters will
be discussing it on IRC.

I will say that there are NO serious showstoppers this time.  I'd
like us to take our time and make this the best release we've ever
had.


Concurrent buildworld -j N heads up - update both install and mkdir

2011-11-30 Thread Matthew Dillon
We are still stabilizing the new buildworld -j N changes.  In addition
to the install utility needing some internal fixes, the mkdir(1) utility
also needed an internal fix.  In order to bootstrap being able to run
buildworld -j N you may have to update both of these utilities manually
as follows (after updating your sources to the latest master), before
running your buildworld:

cd /usr/src/usr.bin/xinstall
make clean; make obj; make all install
cd /usr/src/bin/mkdir
make clean; make obj; make all install

I still expect there to be a few more races that crop up every once in
a while and we will continue to fix them as they pop up.


Also note that the higher concurrency will of course also use more
memory, including potentially a lot more memory during the GCC build.
When running on machines with limited ram you may have to reduce the
-j N value you used in the past.  As before, machines with very little
ram (e.g. less than 1G) will probably page to swap during a -j N build
even with N as low as 4.

Matthew Dillon

Significantly faster concurrent buildworld times

2011-11-29 Thread Matthew Dillon
I did a pass on the buildworld infrastructure and added a new
feature to allow SUBDIR recursions to run concurrently.  This
should improve buildworld -j 12 (or similar) significantly.  I
was able to get a 28% improvement on our quad-core (8 thread) Xeons
(1075 -> 769 seconds).

This is still a bit experimental in that there may be build dependencies
that we haven't ferreted out yet.  In particular, you might have to
update your 'install' program to the latest in master to avoid a race
inside its mkdir() function which could error-out the build (only if you
are doing make -j N on your buildworlds).

This work cleaned up probably 70-80% of the bottlenecks we had in the
buildworld.  There are far fewer periods showing idle cpu during the
build with these changes.

Matthew Dillon

leaf upgrade status

2011-11-10 Thread Matthew Dillon
* Leaf has been upgraded to 64-bits and all-new hardware.  The new
  machine is about 30% faster than the old one and has five times the
  ram (16G total).

* Most services are operational.  However, our web front-page still has
  an issue with the embedded digest and the bugtracker is currently
  non-operational.  Expect instability for the next few days.

* There may be pkgsrc packages missing that developers need.  If you
  need a package installed get onto our efnet IRC channel (#dragonflybsd)
  and tell us, we will add it back in.

* Any local binaries developers have compiled in their leaf accounts
  will have to be recompiled.

Our repository box will probably be upgraded Thursday afternoon.

Matthew Dillon

Re: Can someone upgrade tor in Q3?

2011-11-08 Thread Matthew Dillon
Sorry folks, the cvs2git scripts were running only the base
conversion for the 2011 pkgsrc branches and not running the
synchronization pass to fixup missing bits.

I've added the synchronization pass for all 2011 branches.  It will
be a few hours before it gets them synced up.


heads up - Machine upgrades this week.

2011-11-08 Thread Matthew Dillon
Both crater and leaf will be upgraded this week.  Either Wednesday or
Thursday.  We'll try to make it as painless as possible but because we
are upgrading the boxes from 32 bits to 64 bits there will be some
service disruptions:

* Probable web site down time for a few hours (up to 6).

* Commits (for developers) may be disallowed for a few hours.

* Documentation, Mailing list, mailing list archive, and news services
  may be down for a few hours.

* pkgsrc mirrors will NOT be affected.

Poor crater is running cvs and git conversion scripts on almost a hundred
gigabytes of material four times a day and its lowly 2G of ram isn't
enough.  The only reason it still works 'ok' is its 100G of SSD
swapcache.

Poor leaf is in similar straits, handling cgit and gitweb requests
from various search engines (which we want) and having to deal with a
multitude of concurrent 300MB+ process images, not to mention developer
git repos and vkernel images.  With only 3G of ram only its SSD swapcache
allows it to continue to function.

Both will be upgraded to Xeon E3 (Sandybridge) based boxes w/16G of
ECC ram each.  Should be really nice after that, particularly for
developers who use leaf regularly.

This will occur Wednesday and/or Thursday if all goes well.

Matthew Dillon

Performance results / VM related SMP locking work - committed (3)

2011-10-28 Thread Matthew Dillon
) is only 500 seconds slower than for one, meaning that we are
getting very good concurrency now.


This set of tests is using a buildkernel without modules, which has
much greater compiler concurrency versus a buildworld test since
the make can keep N gcc's running most of the time.

 137.95 real   277.44 user   155.28 sys  monster -j4 (prepatch)
 143.44 real   276.47 user   126.79 sys  monster -j4 (patch)
 122.24 real   281.13 user    97.74 sys  monster -j4 (commit)
 127.16 real   274.20 user   108.37 sys  monster -j4 (commit 3)

  89.61 real   196.30 user    59.04 sys  test29 -j4 (patch)
  86.55 real   195.14 user    49.52 sys  test29 -j4 (commit)
  93.77 real   195.94 user    67.68 sys  test29 -j4 (commit 3)

 167.62 real   360.44 user  4148.45 sys  monster -j48 (prepatch)
 110.26 real   362.93 user  1281.41 sys  monster -j48 (patch)
 101.68 real   380.67 user  1864.92 sys  monster -j48 (commit 1)
  59.66 real   349.45 user   208.59 sys  monster -j48 (commit 3)

  96.37 real   209.52 user    63.77 sys  test29 -j48 (patch)
  85.72 real   196.93 user    52.08 sys  test29 -j48 (commit 1)
  90.01 real   196.91 user    70.32 sys  test29 -j48 (commit 3)

Kernel build results are as expected for the most part.  -j 48 build
times on the many-cores monster are GREATLY improved, from 101 seconds
to 59.66 seconds (and down from 167 seconds before this work began).

That's a +181% improvement, almost 3x faster.

The -j 4 build and the quad-core test29 build were not expected to show
any improvement since there isn't really any spinlock contention with
only 4 cores.  There was a slight nerf on test29 (the quad-core box) but
that might be related to some of the lwkt_yield()s added and not so
much the PQ_INACTIVE/PQ_ACTIVE vm_page_queues[] changes.

Matthew Dillon

Re: process flips between CPUs

2011-08-20 Thread Matthew Dillon
A process which is sleeping most of the time will tend to be scheduled
on whatever cpu is available.  From the perspective of the scheduler
which may switch between user processes on a 1/100 second clock a
process which uses the cpu heavily will tend to be scheduled on the
same cpu.  However, from the human perspective observing the top or ps
output, even a heavily cpu-bound program will switch between cpus every
so often.

Normally locking a process to a particular cpu is not necessary.


Re: Recover slave PFS

2011-08-06 Thread Matthew Dillon
It is a bug, it shouldn't have removed the softlink for the PFS.  However,
the only way to destroy a pfs is with pfs-destroy, and since you didn't do
that the PFS is still intact.

All you have to do is re-create the softlink.

The PFS softlink points to @@-1:n where 'n' is the pfs number.  For
PFS #5 it would be: @@-1:5

The format must be precise.  If you recreate the softlink for the missing
pfs in your /pfs directory you should be able to cd into it and get it
back.

If you don't know the PFS number look at the PFS numbers for the existing 
PFS's and
guess at the ones that might be missing.


hammer dedup in HEAD now has a memory limiting option

2011-08-03 Thread Matthew Dillon
The hammer dedup and dedup-simulate directives in HEAD now accept
a memory use limit option, e.g. '-m 200m'.  The default memory use
limit is 1G.  If the dedup code exceeds the memory limit it will
automatically restrict the CRC range it collects information on and
will loop as many times as necessary to dedup the whole disk.

This should make dedup viable inside qemu or other virtual machines.

A few minor I/O optimizations were also made to try to pre-cache the
b-tree metadata blocks and to allow the dedup code to get past areas
already dedupped more quickly.  Initial dedups will still take a long
time, though.

^C and ^T are also now supported during hammer dedup runs so you can
see the progress.  It has to pre-scan the b-tree but once it actually
gets into dedupping stuff ^T will give you a good indication of its
progress.  ^C was being ignored before and now works as well.

Matthew Dillon

Re: pkgsrc-update failes with core dumps

2011-07-30 Thread Matthew Dillon
I think master currently has a VM issue somewhere (in software).  I'm
sometimes getting an internal compiler error when building the world.

Matthew Dillon

Re: pkgsrcv2.git not syncing correctly; around 400 missing files

2011-07-25 Thread Matthew Dillon
:I have been using pkgsrc from our git mirror (pkgsrcv2), but I recently
:noticed some patches were missing as it caused me to submit a bad patch
:to pkgsrc while fixing multimedia/xine-lib port, and since then I've
:found many missing files.
:I pulled pkgsrc via CVS and created a script to compare both
:repositories.  I had to tell diff to ignore differences that we caused
:by CVSID tags (e.g. $NetBSD$ and $Id$) because for some reason these
:CVSIDs were the only difference in hundreds of files.
:The result is attached.
:367 files are shown as missing and the remaining 36 are shown as different.
:At the very least, this report could be used to manually sync
:pkgsrcv2.git, but it appears something systematic is amiss due to the
:large number of missing patches.  Hopefully this can be fixed?

Hmm.  It looks like the rsync our script is running to get the CVS
archive is failing.  I'm getting tons of these sorts of messages
in the logs:

rsync: recv_generator: failed to stat 
Unknown error: 0 (0)

I'm not sure what is going on.  The directory structure looks ok.
The lstat() it is failing on, when I ktrace, is returning a proper
ENOENT error code.

If I start with a clean, empty target directory I get the same
problem.  rsync is trying to stat stuff which doesn't exist and
is then complaining about it.  It thinks the error code is 0 when
it isn't.  This is blasted confusing.  I am running this rsync:

/usr/pkg/bin/rsync -aHS --delete --exclude '#cvs.lock' 
rsync:// /archive/NetBSD-CVS

13690 rsync0.07 CALL  lstat(0xbfbff2f0,0xbfbfe9e0)
13690 rsync0.03 NAMI  CVSROOT/config
13690 rsync0.16 RET   lstat -1 errno 2 No such file or directory
13690 rsync0.74 CALL  write(0x2,0xbfbfd470,0x60)
13690 rsync0.14 GIO   fd 2 wrote 96 bytes
   rsync: recv_generator: failed to stat 
/archive/NetBSD-CVS/CVSROOT/config: Unknown error: 0 (0)
13690 rsync0.05 RET   write 96/0x60

I have verified that it does not try to create the file beforehand
in the ktrace.  Insofar as I can tell there's nothing wrong with
HAMMER or the directory structure.

rsync's memory use does hit around 32MB, then stabilizes, then a short
time later it starts spewing out tons of these errors.  I wonder if
there is an issue with rsync's memory use?


Re: pkgsrcv2.git not syncing correctly; around 400 missing files

2011-07-25 Thread Matthew Dillon
Ok, I upgraded rsync to the latest version and it appears to work now.
I think it might have been a protocol incompatibility between the
older rsync crater was running (2.something) versus the current version.

I will manually run the pkgsrc updating script, please check in about
an hour to see if the repo has been corrected.


Re: pkgsrcv2.git not syncing correctly; around 400 missing files

2011-07-25 Thread Matthew Dillon

:Hi Matt,
:It looks much better now.
:All the MISSING files have been restored.
:There are still some DIFF files making it through the script. I
:increased the regex to filter out $Revision[:$] and $Date[:$] as well as
:$Id[:$] and $NetBSD[:$], and the attached file shows what is left.
:The remaining files on the list feature the $Log$ CVSID and others, so
:the git pkgsrc repository looks 100% synchronized to me!

Yes, this is because CVS $variable expansions are not formally stored
as patches in the CVS archive.  Instead the variable-expansion is done
after the file is checked out.  The version of the file in the CVS
archive will often contain the variable expansions related to the
previous version rather than that particular version.

So anything related to variable expansion will be broken no matter what
we do.  The git conversion scripts effectively have to tell cvs not
to expand anything and work just with the pure CVS archive (which
contains the broken expansions associated with the version previous to
the one being checked out), otherwise incremental patches will not work

My pkgsrc cvs-git conversion script is ridiculously complex.  Not only
can the cvs2git conversion not always work properly, the rsync of the
cvs repo itself can catch a cvs commit in the middle so the script has
to loop the rsync until it detects the topology hasn't changed recently
(i.e. is stable).  And even then it doesn't always stay in sync so my
script then does a catch-all cvs checkout, git checkout, and diff/patch,
then a forced git commit to clean up the loose ends.

Of course, it is all for naught if rsync itself breaks like it just
did :-(


Re: Running OpenGrok on DragonFly

2011-07-23 Thread Matthew Dillon
I've always wanted to run OpenGrok on our /archive, using Leaf.
The last time I tried I got stuck on the JDK dependency too.  Now
that you've got the JDK working I may try setting it up again.

Matthew Dillon

Re: cache_lock: blocked unblocked

2011-07-20 Thread Matthew Dillon

:I got these on my DragonFly v2.11.0.247.gda17d9-DEVELOPMENT
:[diagnostic] cache_lock: blocked on 0xdacafa28 2.0
:[diagnostic] cache_lock: unblocked 2.0 after 9 secs
:[diagnostic] cache_lock: blocked on 0xdc244c18 2.0
:[diagnostic] cache_lock: unblocked 2.0 after 2 secs
:[diagnostic] cache_lock: blocked on 0xd9968ea8 mail
:[diagnostic] cache_lock: unblocked mail after 15 secs
:[diagnostic] cache_lock: blocked on 0xc46a7378 
:[diagnostic] cache_lock: blocked on 0xc46a7378 
:[diagnostic] cache_lock: unblocked  after 0 secs
:[diagnostic] cache_lock: unblocked  after 0 secs
:is there any thing I should chek out?

No, as long as the blockages unblock at some point it's ok.  The
blockages are likely due to hammer's flusher.  I have a patch under
test (related to the blogbench thread) that also seems to reduce
the namecache stalls.

Matthew Dillon

Re: cpdup /pfs

2011-06-17 Thread Matthew Dillon
The problem here is that cpdup'ing /pfs will result in the wrong
symlinks on the target filesystem because the PFS IDs are different on
the target filesystem.  There is nothing cpdup can do here to help,
you have to tell it to ignore the pfs directory (see -x option to cpdup
and the use of a file containing a list of exclusions).


Re: newfs_hammer doesn't set dedup time

2011-06-15 Thread Matthew Dillon

:I made a new Hammer filesystem on the laptop disk and looked at the output of 
:hammer cleanup, which shows no dedup. I ran hammer config on it, and there is 
:a dedup line, but it's commented out. The PFSs have no config. How come?

dedup isn't turned on by default.  dedup is being used regularly now
but deduplication in general can lead to I/O fragmentation so it isn't
the default.  Not everyone needs dedup.

hammer cleanup should have installed a default config for each PFS.

Matthew Dillon

HEADS UP - Dragonfly network renumbering

2011-06-03 Thread Matthew Dillon
The DragonFly network is being renumbered.  Hopefully it will be painless
but we're doing it in stages and there may be some disruption.

Matthew Dillon

Re: md5 sums and hammerfs encryption

2011-05-24 Thread Matthew Dillon

: has MD5 sums listed there.  I
:don't have access to crater to update the md5.txt file, though.

Ok, I pasted them into md5.txt.


Re: Hammer on multiple hot-swappable disks

2011-05-23 Thread Matthew Dillon

:I'm thinking of founding an ISP and running it with a mix of DragonFly and 
:Linux boxes. My current boss showed me a rack-mountable server which he uses. 
:If I understood him right, it has three bays where hot-swappable SCSI drives 
:can be inserted. I was thinking about how to handle disks that are about to 
:fail, or whose filesystems are getting too big.
:Suppose I have a bunch of disks all partitioned like this:
:da#s1a 768 MB ufs /boot
:da#s1b 1 GB swap
:da#s1d hammer
:da#s1e luks hammer.
:I have da0 and da1 in the server and I want to insert a disk into da2 and pull 
:out the one in da1. Can I do this with the hammer volume-add da2s1d; hammer 
:volume-del da1s1d? How long will this take? Do I run cryptsetup on da2s1e 
:before adding the volume?

No, unfortunately there is still one sticking point preventing that
from working.  The volume delete code can't remove the root volume
(in a multi-volume hammer mount one is designated as the root volume.
In a single-volume hammer mount that volume IS the root volume for
the mount).


Re: What does this mean ?

2011-05-13 Thread Matthew Dillon

:   Hi,
:   Message on console not too frequent, possibly associated with heavy
:disk usage:
:thr_umtx_wait FAULT VALUE CHANGE 7162-7165 oncond 0x800990104
:   What does it mean, and should I worry ?
:Steve O'Hara-Smith  |   Directable Mirror Arrays

No, it just means a block of memory being used for mutexes suffered
from a copy-on-write (probably due to a fork()).  The mutex code
in the kernel deals with this situation automatically.  It was just
some old debugging cruft.

Matthew Dillon

Re: Updating Development Version on Slow machines from another Fast machine

2011-05-13 Thread Matthew Dillon

:I NFS mount /usr/src and /usr/obj in the slow machine (being the NFS
:server the faster machine) and then I issue the usual
:installkernel/installworld/upgrade commands.
:Antonio Huete

I do the same thing.  In fact, sometimes I even NFS-mount /usr/obj
across the internet and make installworld is still faster than compiling
it up locally on the slow box :-)


Intel vs AMD DragonFly 2.11 parallel kernel build tests

2011-05-12 Thread Matthew Dillon
 than that I would happily replace all my servers w/Sandybridge
today.  As it stands though I don't actually need a ton of horsepower on
the servers.  Our build boxes are the only things that really need the
horsepower of a Sandybridge.  The reduced power consumption is very
provocative but it's a non-starter without ECC.

And AMD has saved me a ton of money over the years with their AM2+/AM3
socket compatibility.  I've gone through three major generational cycles
on cpus with the same mobos just by buying a new cpu.  Intel suffers
from too much socketmania and it gets expensive when you have to replace
the mobo, the memory, AND the cpu whenever you upgrade.

So for the moment I am willing to wait for AMD to come out with
something better.  It doesn't have to beat Intel, but it does have to
get within shouting distance and 30% ain't within shouting distance.
Even factoring in a current higher-end AMD cpu we still aren't going to
get more than another 7% improvement (23% is still too much).  If AMD can
get within 15% in the next year or so I'll happily stick with them on
principle.  But if they can't then I will grudgingly pay Intel's premium.

(And, p.s. this is why I invest in Intel and not AMD.  Intel has the
monopoly and intentionally keeps AMD as a poor second cousin to keep
the anti-trust hounds at bay.  Sorry AMD, I love you but I can only
support you in some ways :-( )

Matthew Dillon

Re: Intel vs AMD DragonFly 2.11 parallel kernel build tests

2011-05-12 Thread Matthew Dillon
Here is a fun statistic.  For running a server 24x7 how many days
do you have to run the Intel i7 vs the phenom II to make up for the
$100 difference in the price tag?

Using a generous 65W for the AMD and 33W for the intel, assuming
a mostly idle server, and $0.25/kWh, you get $0.192/day savings
with the i7.  With a $100 difference in price that comes to 520 days.

So if you are running a server 24x7 that is mostly idle, the Intel-i7
pays for its higher price in 520 days (a bit over 1.5 years).
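That break-even figure can be reproduced in a couple of lines (the
wattages and the $0.25/kWh rate are the rough assumptions above, not
measurements):

```shell
# 32W saved (65W AMD vs 33W Intel, mostly idle), $0.25/kWh, $100 price gap
savings=$(awk 'BEGIN { printf "%.3f", 32/1000 * 24 * 0.25 }')
days=$(awk -v s="$savings" 'BEGIN { printf "%d", 100/s }')
echo "savings: \$${savings}/day, payback: ${days} days"
# -> savings: $0.192/day, payback: 520 days
```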

If you are running under load the i7 will pay for its higher price
tag more quickly.  This is ignoring the lack of ECC issue with the
i7 though, and a Xeon system will be more expensive.


Re: System on SSD

2011-05-10 Thread Matthew Dillon

:I just bought an 60 GB SSD (OCZ Vertex 2). I want
:to use about 20 GB for swapcache. But I think about
:putting the system also on this SSD. To reduce writes
:I want to disable history keeping and mount the pfs
:with noatime. I also want to move /usr/src and
:/usr/pkgsrc and the build directories to a normal HDD.
:Are there any issues to keep in mind? Any suggestion?
:Thanks a lot.

If you are going to run HAMMER on the SSD then you also have to
manage the reblocking operation(s) on the PFSs.  I would completely
disable the 'recopy' function by commenting it out and I would adjust
the reblocking parameters to spend 1 minute every 5 days instead of
5 minutes every 1 day.

Everything else can be left as-is.  You can also leave history enabled.
nohistory will actually generate more write activity.  Though you do have
to be careful about the retention time due to the limited amount of space
available on the SSD, so you might want to adjust the snapshot interval
down from 60d to 10d or something like that.  History is one of HAMMER's
most important features, it is best to leave it on for all primary
information storage.  I usually turn it off only for things like

Most of these parameters are controlled via 'hammer viconfig pfs'.
You want to adjust the config for each mounted PFS and for '/'.
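Put together, the adjustments suggested above might look like this in
'hammer viconfig' (a sketch only; the exact field syntax is documented
in 'man hammer'):

	snapshots 1d 10d       # retention pared down from 60d for the SSD
	prune     5m 1d
	reblock   1m 5d        # 1 minute every 5 days instead of 5m every 1d
	#recopy   30m 30d      # disabled entirely by commenting it out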


In terms of layout you will want around a ~1G 'a' partition for /boot,
which must be UFS, then I recommend a 32G 'b' swap partition and the
remainder for a HAMMER 'd' partition.

I usually leave ~4-8G unpartitioned (I set up a dummy 'e' partition that
is 4-8G in size which is left unused), assuming a pristine SSD.
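As a concrete sketch, a 60G SSD laid out per the above (sizes rounded,
purely illustrative):

	a:    1G     UFS       /boot
	b:    32G    swap      (doubles as swapcache)
	d:    ~21G   HAMMER
	e:    4-8G   dummy     (left unused, assuming a pristine SSD)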


In terms of putting the root filesystem on the SSD and not the HDD, I
think it is reasonable to do and if you do you will almost certainly want
to put any heavily modified subdirectories on the HDD.  /usr/src,
/usr/pkgsrc, possibly also /home and /usr/pkg, but it is up to you.

Usually it is easier just to use the SSD as your 'a' boot + 'b' swap and
put your root on your HDD.  You can use the remaining space on the SSD
as an emergency 'd' root.  The reason it is generally better to put the
normal root on the HDD is that you don't have to worry about fine tuning
the space and you don't have to worry about write activity.

You can still use swapcache to cache a great deal of the stuff on the HDD
onto the SSD via the swap partition on the SSD.

Booting w/root on the HDD will be slightly slower but not unduly so, and
once stuff gets cached on the SSD things get pretty snappy.


Finally, i386 vs x86-64.  If you are running a 32 bit kernel the maximum
(default) swap space is 32G.  With a 64 bit kernel the maximum is 512G.
swapcache works very nicely either way but if you intend to run a 64 bit
kernel you might want to consider configuring a larger swap space and
essentially dedicating the SSD to just boot + swap.

It depends a lot on how much information needs to be cached for all of
the system's nominal operations.  With swapcache you will universally
want to cache meta-data.  Caching file data depends on the situation.
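The two knobs in question live under vm.swapcache; as an /etc/sysctl.conf
fragment (a common starting point, not a universal recommendation):

	vm.swapcache.meta_enable=1    # meta-data caching: almost always wanted
	vm.swapcache.data_enable=1    # file data caching: depends on the workload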

Matthew Dillon

Re: Git core dumped

2011-05-06 Thread Matthew Dillon

:I can't download pkgsrc-repository via git. I got such message:
:* [new branch] dragonfly-2010Q3 - origin/dragonfly-2010Q3
:May 4 17:24:10  kernel: pid 801 (git), uid 0: exited on signal 10 (core
:*** Signal 10
:Stop in /usr
:What's wrong?

Check your /usr/lib/libpthread* and see if it is linked to the wrong
threading library:

ls -la /usr/lib/libpthread*

If it is linked to libc_r that is the problem.  It needs to be linked
to libthread_xu instead.

cd /usr/lib
ln -fs thread/libthread_xu.a libpthread.a
ln -fs thread/

I will fix installworld.  Was your system originally installed from
a fairly old DragonFly?


Re: Git core dumped

2011-05-06 Thread Matthew Dillon
On the other core dumps, I'm not sure what is going on but make sure
the repo and the source tree is fully owned by the user (you or root)
doing the git operations.

I don't rebase often myself.

One possibility is that the pthreads per-thread stack is too small,
and the complexity of the operation is blowing it out.  Git appears to
set the stack size to 65536 bytes (the default is ~1MB).  If so this
would be a bug in git.

DragonFly creates a stack guard at the bottom of every thread stack
so it might be catching a condition that other OSs are not.


Re: Buffer strategy message?

2011-05-01 Thread Matthew Dillon

:I see this message on halt/reboot occasionally.  Is it something I need to
:worry about?
:Synching disks...
:No strategy for buffer at 0xffe056aabf00
:: 0xffe0840876a8: type VBAD, sysrefs 1, writecount 0, holdcnt 0,
:Uptime: 12h9m53s
:the operating system has halted

It's 'probably' ok, but it isn't desirable.  It means one part of
the system detached while another part of the system was still using


Re: dntpd

2011-05-01 Thread Matthew Dillon

:Is there a way in dntpd.conf to specify from which hosts dntpd will accept 
:time requests?

dntpd is client-only (though I think it would be fairly easy to have
it serve requests if someone wanted to add that).  It pulls the time
from the hosts specified in /etc/dntpd.conf.

/etc/dntpd.conf is installed by default with {0,1,2}

Matthew Dillon

DragonFly 2.10 RELEASED!

2011-04-26 Thread Matthew Dillon
Hello everyone!  2.10 has finally been released.  Our mirrors are still
pulling it hot off the press.  Our main mirror site, avalon, has the goods if
your favorite mirror doesn't yet.  Here's a quick smattering of links:

Both 32-bit and 64-bit USB and ISO images are available.  My
recommendation is to use the 64-bit usb image or the 64-bit gui
usb image (if your machine can handle 64-bits).  That is,
dfly-x86_64-gui-2.10.1_REL.img.bz2.  Note that linux emulation
only works w/ the 32 bit image so if you need linux emulation
you have to go with 32 bits.

The gui images contain a full X environment and the git repos
for /usr/src and /usr/pkgsrc, and are recommended if you have the
bandwidth to pull them down.  Each is approximately 1.2GB.


This release contains many features, see the release page for an
exhaustive list.  The big ticket items are significantly better
MP performance and significantly better filesystem performance with
the AHCI and SILI drivers.

Matthew Dillon

Re: Hammer deduplication needs for RAM size

2011-04-22 Thread Matthew Dillon

:Hi all,
:can someone compare/describe need of RAM size by deduplication in
:Hammer? There's something interesting about deduplication in ZFS

The ram is basically needed to store matching CRCs.  The on-line dedup
uses a limited fixed-sized hash table to remember CRCs, designed to
match recently read data with future written data (e.g. 'cp').

The off-line dedup (when you run 'hammer dedup ...' or
'hammer dedup-simulate ...') will keep track of ALL data CRCs when
it scans the filesystem B-Tree.  It will happily use lots of swap
space if it comes down to it, which is probably a bug.  But that's
how it works now.

Actual file data is not persistently cached in memory.  It is read only
when the dedup locates a potential match and sticks around in a limited
cache before getting thrown away, and will be re-read as needed.

Matthew Dillon

2.10 Release scheduled for Monday.

2011-04-22 Thread Matthew Dillon
My weekend schedule is too crowded so we will be doing the official
release Monday evening.

HEAD is now 2.11 and we have a 2.10 release branch.  2011Q1 packages
have been built though some work is still ongoing.  Preliminary nrelease
builds have succeeded and testing continues.

We couldn't quite fit a moderate unrolling of the global VM system token
into the release but Venkatesh will be working on it with a vengeance
in HEAD after the release.  Except for the VM subsystem, all other
critical paths are MPSAFE.

The AHCI/CAM driver enhancements have made it into the release, so
significant improvements in concurrent random disk I/O for AHCI-attached
devices should be noticeable.

Matthew Dillon

Re: 2.10 Release schedule - Release will be April 23rd 2011

2011-04-21 Thread Matthew Dillon
The only issue w/ using dedup is you may need to
upgrade the hammer filesystem to at least version 5.
It's best that all mirrors be running the same version.

If you update to version 6 and use mirroring then
both sides have to be version 6 for sure because
the directory hash algorithm changes.
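The check and upgrade look roughly like this (a sketch; the PFS path is
made up, and version-upgrade is one-way, so do the mirror targets as well):

	hammer version /home              # show current and supported versions
	hammer version-upgrade /home 6    # upgrade in place (not reversible)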

That's the only issue w/ regards to upgrading.

Matthew Dillon

Recent concurrency improvements in the AHCI driver and CAM need testing

2011-04-09 Thread Matthew Dillon
I've pushed some serious changes to the AHCI SATA driver and CAM.

One fixes issues where the tags were not being utilized to their fullest
extent... well, really they weren't being utilized at all.  I'm not
sure how I missed the problem before, but it is fixed now.

The second ensures that read requests cannot saturate all available
tags and cause writes to stall, and vice-versa, and also separates
out the read and write BIO streams and treats them as separate entities,
which means that reads can continue to be dispatched even if writes
saturate the drive's cache and writes can continue to be dispatched
even if concurrent read(s) would otherwise eat all available tags.

The reason the read/write saturation fixes are important is because
writes are usually completed instantly since they just go to the drive
cache, so even if reads are saturated there's no reason not to push
writes to the drive.  Plus when the HD's cache becomes saturated writes
no longer complete instantly and would prevent reads from being
dispatched if all the tags were used to hold the writes.


With these fixes I am getting much better numbers with concurrency

I now get around 37000 IOPS doing random 512-byte sector reads with
a Crucial C300 SSD, versus ~8000 or so before the fix.

And I now get around ~365 IOPS with the same test on a hard drive,
versus ~150 IOPS before (remember these are random reads!).

blogbench also appears to have much better write/read parallelism
against the swapcache with the SSD/HD combo.  Memory caches blow
out at around blog #1300 on my test boxes.

With the changes blogbench write performance is maintained through
blog #1600 or so, without the changes it drops off at #1300.

With the changes the swapcache SSD is pushing ~1400 IOPS or so
satisfying random read requests.  Without the changes the swapcache
SSD is only pushing ~130 IOPS.

With the changes blogbench is able to maintain a ~6 article
read rate at the end of the test.  Without the changes the
read rate is more around ~1 at the end of the test.  At this
stage swapcache has cached a significant chunk of the data
in the SSD so the I/O activity is mixed random SSD and HD reads.


Ok, so I feel a bit sheepish that I missed the fact that the AHCI
driver wasn't utilizing its tags properly before.  The difference
in performance is phenomenal.  Maybe we will start winning some
of those I/O benchmark tests now.

Matthew Dillon

2.10 Release schedule - Release will be April 23rd 2011

2011-04-06 Thread Matthew Dillon
Saturday Apr 9  - We branch
Saturday Apr 23 - We release 2.10

This gives us two weeks to stabilize the release and build 2011Q1
packages.  Developers need to pounce on showstopper bugs such as
rebooting issues, mbuf leaks, panics, and so forth.

There will be a ton of features in this release, including major
compiler toolchain updates, better acpi, better swapcache, PF upgrade,
HAMMER live dedup, and many many other goodies.

SMP has progressed significantly in this release.  All nominal kernel
paths are MPSAFE.  The VM system is still using a global token and is
the only real bottleneck left.

Matthew Dillon

Improvements in swapcache's ability to cache data using HAMMER double_buffer mode.

2011-04-04 Thread Matthew Dillon
Normally data is only cached via the file vnode which means the cache
is blown away when the vnode gets cycled out of the vnode cache.  With
kern.maxvnodes around ~100,000 on 32 bit systems and ~400,000 on 64 bit
systems any filesystem which exceeds the limit will cause vnode recycling
to occur.  Nearly all filesystems these days exceed these limits,
particularly on 32 bit systems.  And on 64-bit systems files are often
not large enough to utilize available memory before hitting the vnode
limit and causing the data to be thrown away despite there being plenty
of free ram.

It is now possible to bypass these limitations in DragonFly master
by enabling both the HAMMER double_buffer feature
(vfs.hammer.double_buffer=1) AND the swapcache data caching
feature (vm.swapcache.data_enable=1).  See 'man swapcache' for
additional information on swapcache.
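Both knobs straight from the above, as an /etc/sysctl.conf fragment:

	vfs.hammer.double_buffer=1    # cache file data via HAMMER's block device
	vm.swapcache.data_enable=1    # allow swapcache to cache that data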

When both features are enabled together swapcache will cache file data
via HAMMER's block device instead of via individual file vnodes, making
the swapcache'd data immune to vnode recyclement.  Swapcache is thus
able to cache the data for potentially millions of files up to 75%
of available swap (normally configured up to 32G on 32-bit systems and
up to 512G on 64-bit systems).


Now add the fact that Sata-III is now widely available on motherboards
and Sata-III SSDs are now in mass production.  Intel's 510 series,
OCZ's Vertex III, and Crucial's C300 and M4 series are capable of
delivering 300-500 MBytes/sec reading and 200-400 MBytes/sec writing
from a single device.  Crucial's C300 series is very cost effective
w/64GB at SATA-III speeds for $160.  Compare this to the measly
2-5MBytes/sec a hard drive can do in a random seek/read environment.
We're talking 100x the performance already with just a single SSD
swap device.

With swapcache this means being able to shrink the cost and the size of
what we might consider to be a 'server' by a factor of three or more.


The only downside to the new feature is that data is double-buffered in
ram.  That is, file data is cached via the block device AND also via
the file vnode, and there is really no way to get around this other than to
expire one of the copies of the cached data more quickly (which we try
to do).  I still consider the feature a bit experimental due to these
inefficiencies.  We are definitely on the right track and regardless of
the memory inefficiency the HD accesses go away for real when swapcache
SSD can take the load instead.

On one of our older servers I can now grep through 950,000 files
(~15GB worth of file data) at ~2000-4000 files per second pulling
40-50 MBytes/sec from the SSD and *zero* activity on the HD.  That is
a big deal that only a big whopping RAID system or a ton of ram could
compete with prior to the advent of SSDs... all from a little $700 box
with an older $100 SSD in it.

Matthew Dillon

Re: ACPI based interrupt routing and new ACPI code ready for testing

2011-03-24 Thread Matthew Dillon
:Hi Sephe. Great work :)
:Seems to boot fine on my x86_64 UP box. Anything you want tested with it 
:running? Verbose dmesg:
:However, it makes my graphics card (an ati 9200 agp) lose some speed. It 
:usually gets ~1600 fps with glxgears and with 2) enabled it drops to ~20 
:fps. From what I can see it gives no error about this problem.

What about the rest of the system?  Run some simple cpu benchmarks.
If those are slow also this could be an indication of an interrupt
storm (and possibly even a SMI storm, similar to ruse39's issue).

Matthew Dillon IP space renumbered

2011-03-06 Thread Matthew Dillon
The IP space for the primary network has been
reworked, please report any problems!

Matthew Dillon

Home stretch on new network - if_bridge looking better

2011-02-24 Thread Matthew Dillon
I'm in the home stretch of finishing up the new DragonFly network!
It's been pretty unstable the last week or so as I struggled first
with the (now failed) attempt at using an att static block with
U-Verse and then gave up on that and started working on running
a VPN over a dynamic-IP based att U-Verse + comcast internet.
I wanted bonding with failover.

Most of my struggles with U-Verse were in dealing with the stateful
firewall att has that cannot be turned off, even for the static
IP block.  It had serious issues dealing with many concurrent
connections and would drop connections randomly (it would send a
RST!).  The VPN bypasses the whole mess.

The last few days have been spent essentially rewriting half of
if_bridge so it would work properly, and testing it while I am
still triple-homed (DSL, U-Verse, and ComCast).  Well, it caused
a lot of havoc on my network while I was beating it into shape
and that's putting it mildly!

But I think I now have if_bridge and openvpn and my ipfw and PF
rules smacked into shape.  I am going to implement line bonding
in if_bridge today (on top of the spanning tree and failover
which now works) and track down one or two remaining ARP issues
and then I'll call it done.  The basic setup is as shown below:

+ There are PF rules and ALTQs on each TAP interface to manage
  its outgoing bandwidth and keep network latencies down (on
  both sides of the VC).

+ IPFW forwarding (fwd) rules to manage multiple default routes
  based on the source IP.
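Purely illustrative fragments of those two pieces (interface names,
bandwidths, and addresses are all made up):

	# pf.conf: ALTQ capping a tap interface's outgoing bandwidth
	altq on tap0 cbq bandwidth 4Mb queue { std }
	queue std bandwidth 100% cbq(default)

	# ipfw: choose a default route based on the packet's source IP
	ipfw add fwd 192.0.2.1 ip from 192.0.2.0/28 to any out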

The spanning tree appears to be working properly with the 2x2 and
the 3x3 'real' configuration I'm testing it with.  Once I get
line bonding working I expect my downlink to achieve ~30MBits+
and my uplink will be 4.8MBits.  I'm seriously considering keeping
both U-Verse and ComCast and just paring the service levels down
a little (top tier isn't needed).  The poor old DSL with its 600KBit
uplink is going to hit the trash heap.  It might have been slow, but
that ISP served my old /26 static block fairly well for many years.

Matthew Dillon

Re: Home stretch on new network - if_bridge looking better

2011-02-24 Thread Matthew Dillon

:Great news!
:Is there any chance to support more features in the bridge code? RSTP,
:span port , filtering based on mac address ….

RSTP would be doable as a GSOC project, I think it would be
very easy to implement.  Perhaps almost too easy but if someone
were to do it I would require significant testing to make sure the
protocol operates properly.  I have to move onto other things myself.

(RSTP is STP with a faster recovery time in case of link failure.
STP takes about 30 seconds to transition to a new topology while
RSTP takes about 10 seconds).

The span port is theoretically operational but it has NOT been tested
in any way, so something might blow if you try to use it.  This would
be more of a bug-fix type of thing, not worthy of a GSOC project.

MAC based filtering would be worthy of a GSOC project.  We don't have
it now but IPFW at least already has hooks for ethernet-level
firewalling.  Doing it w/PF would be a lot more difficult as PF is
designed as a routed packet filter (routing vs switching).


Re: Home stretch on new network - if_bridge looking better

2011-02-24 Thread Matthew Dillon

:On 02/24/11 11:50, Matthew Dillon wrote:
:So - reading over this - is it correct that the setup is roughly like:
:- assign a local interface (lan0) to a network
:- add this network to the bridge
:- create openvpn 'bridged' mode tunnels
:- add these to the bridge

In the case of my current setup, lan0, uverse0, comcast0, and
aerio0 are all physical ethernet ports.  lan0 is the LAN, and
the other three connect to the three different WAN services I

Only lan0 and the tunnels (tap0, tap1, tap2) are associated with
the bridge.

The other physical ethernet ports (uverse0, comcast0, and aerio0)
each have a different IP and a different default route and I use
IPFW to associate packets sourced from the IP to the default route
for each port.  Currently uverse0 and comcast0 are both dynamic
while aerio0 is a static IP (the old DragonFly net /26).

The OpenVPN tunnels are built using these IPs and back the tap devices.
The tap devices are then associated with the bridge and the main LAN.
The tap devices themselves, and the bridge, have *NO* IPs associated
with them.  All the local IP spaces are on lan0, including some local
NATted spaces (10.x.x.x).  The bridge code and the ARP code deal with
the inconsistencies and provide a consistent ARP for the bridge members.

Also, not shown here, is that I have a massive set of PF rules and ALTQs
on each of the TAP interfaces (tap0, tap1, and tap2). In particular I'm
running the ALTQs on the TAP devices with fair-share scheduling and
tuned to the bandwidth of each WAN so ping times will be low no matter
what topology the bridge is using.  (Of course I can't do fair-share
scheduling on the WAN ports, uverse0, comcast0, and aerio0, because
the only thing running over them is the OpenVPN UDP packets and it
can't dig into them to see what they represent).

:so the L2 bridge / STP will 'map' according to the state of
:the ethernet bridging, which in turn relates to the openvpn tunnel

Exactly.  The if_bridge module does its own 'pinging' using STP
config packets so it can detect when a link goes down.  OpenVPN
itself also has a ping/restart feature.  I use both.  OpenVPNs internal
keepalive auto-restarts openvpn on failure, and the if_bridge's
pinging is used to detect actual good flow across the link and controls
the failover.

:Without diverging any security sensitive whatnot,
:Is the VPN tunnel created to the ISP or to say, the colo space?
:(I'd assume the latter)

Yes, a colo space that the DragonFly project controls, provided by
Peter Avalos.  OpenVPN itself is running encrypted UDP packets.
Very easy to set up.  The colo has around 10 MBytes/sec of bandwidth
which is plenty for our project.

:Have been working on my own openvpn (routing mode) fun to a pair
:of VPS's as well over the last few days so this is of interest :D
:also - I note in the bridge2.txt file you 'cd /usr/pkg/etc/openvpn'
:before running - is this so openvpn can find the config files?

Yes, that's actually a bit broken.  I've since changed it to put a
'cd' directive in the config file itself and then just run openvpn
with the full path to the config file.  Openvpn has problems restarting
itself if you don't do this (it winds up getting confused and not being
able to find the key files if it restarts).

:if so - to note, you can add a 'cd /path/to/configdir' within the
:config files..

Yah, found that :-)

:also - assuming you have statics on both end of the tunnels -
:why did you choose openvpn ethernet bridging over say IP layer + ipsec?
:(or even openvpn 'routing' mode) with something like OSPF or similar
:and - do you have hw crypto cards on either endpoint?

I originally attempted to route a subnet but the problem is we have
a full class C at the colo, but DragonFly isn't really designed to
operate with two different subnets where one subnet overlaps the other.
Ethernet switching turned out to be the better solution.  The colocated
box itself is ON the class C, it doesn't have a separate IP outside
the class C space.  So there was no easy way to swing a routed network.

I wouldn't even consider something as complex as OSPF for a simple
setup like this, even with a routed solution.

:(my soekris 486 gets a little bogged down by the crypto, which is why I ask)
:ok enough questions ;)
:its definitely fun trying to convert consumer internet into a 'real 
:connection' :D
:- Chris
:(from a gigabit LAN piggybacked on a sometimes 56k wifi link)

OpenVPN has options to run in the clear after authentication is complete,
I think, but I highly recommend using the crypto TLS support.  The
instructions on setting up all the files are pretty clear (you can find

Re: Can't mount my hammer filesystem

2011-02-20 Thread Matthew Dillon

:So I deciced to format the master drive to install the system on and
:then get back my data from the slave. But, that's not cool, when I try
:to mount I get this message Not a valid HAMMER filesystem.
:Did I destroyed the filesystem by installing the bootblock on both disks ?
:Can I get my data back ? How ?
:I tried some commands unsuccessfully :
:# hammer -f /dev/serno/S1PZJ1DQ508109.s4 recover /media/dd2/
:hammer: setup_volume: /dev/serno/S1PZJ1DQ508109.s4: Header does not
:indicate that this is a hammer volume

s4 ?  Not s4d ?  Did you accidentally install HAMMER directly on a slice
and not install it in a disklabeled partition?

Installing boot blocks would have wiped the header if you installed
HAMMER in a slice instead of a partition.

The hammer recover code needs information from the volume header at the
moment.  That's the only piece of the disk it needs to be able to do
a recovery scan.  It's a design bug that will require a media format
change to fix.


Dragonfly network changes - U-Verse almost a complete failure

2011-02-20 Thread Matthew Dillon
Hahaha... ok, well, I spoke too soon.  U-Verse is a piece of crap.
That's my conclusion.  Here's some detail:

* The physical infrastructure is fine, as long as you make sure
  there's no packet loss.  To make sure you have to upload and
  download continuously at the same time and look for glitching
  and stalls.

* The ATT iNID/RG router is a piece of crap, and it's impossible
  to replace it with anything else because it also takes the VDSL2
  from the street.

The iNID/RG router basically has a fully stateful firewall in it
WHICH CANNOT BE TURNED OFF for either static or dynamic IPs.  There
are lots of instructions on how to setup static IP and how to 'open'
the firewall to let everything through.

All lies.  No matter what you do, the firewall's stateful tracking is
turned on even for your static block.  It tries to track every single
'connection' running through it even when the Firewall has been turned
'off' in the config.  Worse, it is buggy as hell.  It drops connections
(as in sends a TCP RESET!!! to either my end or the remote end)
ALL THE TIME.  It loses packets.  It drops critical ICMP packets and
gets confused about normal ICMP packets.  It gets confused when lots of
connections are opened all at once (for example, running a simple iPad
video app such as CrunchyRoll)... or running an actual business with
servers.  It can't handle third-party NATs...

It can BARELY handle its own NAT but even its own wireless/NAT
(bypassing all my stuff and tying my iPad directly into the iNID/RG
over the RG's wireless) drops connections noticeably.

On top of that the uverse router/firewall uses MAC-based security and
only allows one IP assignment per MAC.  This means that your 'network'
cannot be routed, it can only be bridged, and you can't mix private and
public IPs on the same MAC (which is a very common setup).  If the
uverse router/firewall gets packets from the same IP but different MACs,
it blows up... it drops connections, it refuses to route packets, it
gets confused.

I spent a long time with PF and if_bridge and 'fixed' the MAC issue with
filters, and verified that only the correct MACs were getting through,
but I *STILL* get connection drops for no reason.


Ok, so what does work?  Drilling a PPTP through to a provider works.
That is what I finally did.  I drilled PPTP through the U-Verse to my
old provider, so my *original* IP block from my old ISP (who I still
have the DSL line with as a backup) is now running through U-Verse.

Let me repeat that... running my iPad test through my own NAT and
wireless network through the PPTP link to bypass the U-Verse router
crap and to my old provider, who has LESS bandwidth than the U-Verse
link I'm drilling through, works BETTER than running the iPad test
directly on U-Verse through the U-Verse iNID/RG/wireless (bypassing
all my own gear).

That's it.  That's all that works.  Even if you were to get a normal
u-verse link with dynamic IP and no static IP you are still SEVERELY
restricted in what you can do.  Your own NAT servers will simply not work
well.  You would HAVE to use ATT's NAT & RG/wireless.  You would HAVE
to be on a simple bridged network with no other firewall beyond the
ATT iNID/RG.  You would HAVE to have just one IP assignment for each

In other words, the simplest of network configurations will work.
Nothing else will work very well.


It isn't ideal, my old ISP can't push 2 MBytes/sec downlink to me through
the PPTP link.  But neither does it drop connections.  And my uplink
speed is still good which is the main thing I care about for the DragonFly

I'm going to stick with the U-Verse so I can get rid of the much
costlier COMCAST.  However, I am going to cancel the static IP block
and stick with drilling the PPTP through to my old ISP (which I'm
keeping for the backup DSL line anyway).

Sigh.  You'd think AT&T would be smart enough to do this properly, but
after 5 years of trying they are still clueless about IP networks.  Maybe
in another year or two they will fix their stuff.  Or not.

Re: Hammer recover question

2011-02-20 Thread Matthew Dillon
:This was a 1.8.2 system.  Having a 1.9 system handy, I plugged the drive
:(300GB IDE) into it and tried hammer recover for the first time to see what
:I could save.  The good news is that it's recovering a ton of data!  The bad
:news is that it's taking an incredible amount of time.  So far it's been
:running 24 hours.  Is that to be expected?  The bad disk had approximately
:50GB on it, as reported by the df utility, but I don't know how much of that
:is snapshots.

It scans the entire disk linearly, so however long that takes is how
long recover takes to run.
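For reference, a full-media recover run looks something like this
(the device node and target directory here are made-up placeholders,
not taken from the original report):

```shell
# Exhaustively scan the raw volume and rebuild whatever B-Tree records
# it finds into a fresh directory on a *different*, working HAMMER
# filesystem.  /dev/ad4 and /recovered are placeholder names.
hammer -f /dev/ad4 recover /recovered
```

Because the scan is strictly linear over the raw media, the runtime is
governed by the physical size of the disk, not by the amount of live
data df reports.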


Re: Can't mount my hammer filesystem

2011-02-20 Thread Matthew Dillon

:Thanks for your reply.
:I don't remember if I installed it on a disklabel or a slice. I will
:be able to know what I did once I get the usb flash disk with the
:system and look at the fstab.
:Hopefully, I didn't lose data because I did several backups before :-)

Ok, if the data is important we *can* recover it, so don't throw it
away, but it might require you making the whole image available to me.

I would need to add another option to the hammer recover directive to
supply the missing info (if the volume header is truly blown away) and
experiment a bit to figure out what the offset is in the image.

I've been meaning to add the option for a while now but that isn't the
real problem.  The real problem is that the volume header contains a
single piece of info, the data zone offset relative to the base of the
hammer filesystem, and it's a bit non-trivial to 'guess' it.


Dragonfly network changes

2011-02-17 Thread Matthew Dillon
Various DragonFly machines are now running on a much faster network
thanks to AT&T U-Verse, and despite the utterly horrid disaster that
AT&T's little router box is, I am slowly managing to thrash it into

Our main web site is now on the new network (www, gitweb, wiki,
and bugs).

Developer access to leaf via the new network will work if you
use ''. will continue
to use the old network for a while.

Our nameserver topology has been revamped a bit to remove old cruft
and dual-home the networks.

I will not be renumbering until I can get the reverse DNS operational
(lots of phone tag with AT&T), plus give the new network a good burn-in.


For those interested this is AT&T Business U-Verse.  Downlink speed is
around 16 MBits and uplink speed is around 2 MBits with their
highest-grade service.  My comcast cable internet (which I will be
getting rid of soon), also the highest grade service, has a faster
downlink speed of around 30 MBits, but around the same uplink speed
of 2 MBits.

Of course, I only really care about uplink speed here, since I'm
serving data out.

However, the AT&T service so far does seem a bit more consistent
and I will test it vs my comcast internet (before I get rid of it)
with Hulu et al.

Matthew Dillon

New ps feature -R

2011-02-14 Thread Matthew Dillon
master now has a new feature to /bin/ps, -R, which sub-sorts by
parent/child association and follows the chain in the output,
indenting the command to make it obvious.  Sample usage:  ps axlR

This is a pretty cool feature I think.  I had written something
similar 15 years ago and really began to miss it once I started
doing parallel pkgsrc bulkbuild tests.

Matthew Dillon

Re: ad1 renumbered to ad0

2011-02-09 Thread Matthew Dillon
:I just rebooted (after running into the alt-ctrl-F1 hangs with beeper on bug 
:again), went into the BIOS setup, and enabled audio and SATA (because my 
:friend is talking about getting a big SATA disk). On reaching cryptdisks, it 
:said device ad1s1e is not a luks device. I checked /dev/ad* and found that 
:it's now ad0. Some months ago, when I upgraded the kernel on the laptop, ad0 
:changed to ad1. What's going on?

Device probe order can change due to BIOS adjustments, which is why
you should always reference your disk drives by their serial number
instead of by the device name & unit number.

ls /dev/serno

dmesg | less    --- look for the device, it should also print out
                    the serial number nearby.

For example, on one of my machines the swap partition is:


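A hedged sketch of the idea (the serial number below is invented; use
whatever actually shows up under /dev/serno on your own box):

```shell
# Nodes under /dev/serno are stable across BIOS/probe-order changes.
ls /dev/serno

# An fstab swap entry referencing the drive by serial number instead
# of by unit number -- S1234567 is a made-up placeholder serial:
#   /dev/serno/S1234567.s1b   none   swap   sw   0   0
```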
Matthew Dillon

Re: hyperthreaded?

2011-02-02 Thread Matthew Dillon

:The guy who gave me the box says he has another one like it, but one is 
:hyperthreaded and the other isn't. Here's the beginning of dmesg. Is it 
:hyperthreaded, and if so, should I compile a kernel to take advantage of it?
:CPU: Intel(R) Pentium(R) 4 CPU 2.80GHz (2793.02-MHz 686-class CPU)
:  Logical CPUs per core: 2

Yes.  It has 2 real cpus and 2 hyper-threads per real cpu (4 total).

Definitely worth running a SMP kernel.  Even something like the Atom,
with one real cpu and 2 hyperthreads, is worth running a SMP kernel on.


Re: hyperthreaded?

2011-02-02 Thread Matthew Dillon

::The guy who gave me the box says he has another one like it, but one is 
::hyperthreaded and the other isn't. Here's the beginning of dmesg. Is it 
::hyperthreaded, and if so, should I compile a kernel to take advantage of it?
::CPU: Intel(R) Pentium(R) 4 CPU 2.80GHz (2793.02-MHz 686-class CPU)
::  Logical CPUs per core: 2
:Yes.  It has 2 real cpus and 2 hyper-threads per real cpu (4 total).
:Definitely worth running a SMP kernel.  Even things like the atom
:with one real cpu and 2 hyperthreads is worth running a SMP kernel

Oops, I've been corrected.  That baby has 1 core and 2 hyperthreads.

In any case, it is worth running a SMP kernel on it.

Matthew Dillon

Re: System has insufficient buffers to rebalance the tree

2011-01-31 Thread Matthew Dillon

:On Sunday 30 January 2011 22:43:01 Matthew Dillon wrote:
: :I still get this warning. Is it ever going to be fixed?
: :
: :Pierre
: I could remove the kprintf I guess and have it just reported by
: hammer cleanup.
:Is the number of buffers something I can change, or is it determined by the 
:kernel based on memory size?

   It's based on memory size and while it is possible to change it the
   problem is that your system doesn't have enough memory for what the
   hammer rebalance code really needs to operate.

   Part of this is that the rebalance algorithm simply requires a huge number
   of concurrent buffers.  Ultimately the fix would be to change the algorithm
   but it isn't easy to reformulate and I don't really want to break it open
   because one mistake in that code could blow up the filesystem.

Matthew Dillon

Re: 2.8.3 coming?

2011-01-29 Thread Matthew Dillon

:I had read in november that there was plans for 2.8.3 release coming.
:I plan to install a server next week with 2.8.2, but i will delay install if
:2.8.3 is coming in few more weeks.
:Someone got an estimation of release date?

It's not looking like it.  There just isn't enough time to figure out
what more needs to be merged in from master to make a 2.8.3 release,
versus just compiling whatever is the latest on the 2.8.x branch.


Re: Firefox crashes

2011-01-27 Thread Matthew Dillon
:$ firefox
:Any ideas? I had Firefox working fine on the previous DFly installation. I'm 
:running it over SSH (the X server isn't set up yet).

Try using ssh -Y and see if firefox will run.  If that doesn't
work you may have to do it via a direct connection using 
xhost +client on your server and then setenv DISPLAY server:0.0
on your client.

Even though the link is remote both machines might still need
at least the fonts installed.
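Spelled out, the two approaches look like this (hostnames are
placeholders, and the setenv form assumes a csh-style shell as in the
original advice):

```shell
# Option 1: trusted X11 forwarding over ssh, run from the machine
# with the X server:
ssh -Y user@server firefox

# Option 2: direct X connection.
# On the machine running the X server ("server" in the text above):
xhost +client
# On the machine that will run firefox ("client"), csh/tcsh syntax:
setenv DISPLAY server:0.0
firefox
```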

Matthew Dillon

Re: Time to let go of ipfilter

2011-01-20 Thread Matthew Dillon

:Hi all,
:ipfilter is not maintained in dragonfly at all, I plan to remove it.
:Best Regards,
:Tomorrow Will Never Die

As long as it doesn't interfere with ipfw[2], which I still have to use
(due to pf's lack of a reinjection feature) then I'm fairly sure
ipfilter can be removed.

Matthew Dillon

Re: Avalon maintainance update

2011-01-18 Thread Matthew Dillon
Avalon is now back in a colo and online after its upgrade and
appears to be happy as a clam!

With meta-data cached on the SSD and the bulk build disk separated
from the packages/repo archive it is having an easier time handling
the various tasks assigned to it.

We're keeping a watch on its stability.

Matthew Dillon

Re: Avalon maintainance update

2011-01-14 Thread Matthew Dillon
Avalon has been shipped back and should be online again by Monday.
It's been upgraded with an 80G SSD and a 1TB and 2TB HD, in addition
to the 750G Seagate.  The other 750G HD (with the read error) has been
removed.  The SSD is set up for meta-data caching and the bulk build
storage has been split off from the root drive, which should make Avalon
a lot more responsive once it comes up again.  The remaining space will
be used for off-site backups.

At the moment I'm leaving Avalon as i386 but I've reserved partition
space for a future upgrade to 64-bits and set it up to boot from the SSD.
We will want to do that at some point to make use of the RAM in the
box (which is 8G but i386 can only use ~3G of that), but it's a bit
complex because we'd have to switch around our bulk building machines.


Re: HEADS UP: tcp wrongly persist timer detection

2011-01-13 Thread Matthew Dillon
:Hi all,
:HEAD users only.
:It could panic your system upon TCP activities, so please backup your
:working kernel :).  If the panic happens, please send us the link to
:the core dumps.
:Thank you for your help in advance.
:Best Regards,

Crater crunched on this.  I could not get a core dump but I was
able to get a backtrace.

In the particular crash I got it appears that a tcp timer callout
occurs (I believe it was the persist timer but now I'm not sure) and
the connection is dropped.  tcp_drop() is called which:

tp->t_state = TCPS_CLOSED;

And that triggered the panic:

tcp_setpersist: not established yet

I do not know what state the connection was in prior to that point. 

I have adjusted the conditionals in my local source tree to not panic
if the connection is in TCPS_CLOSED but I haven't committed this as I
do not know whether it is correct or not.

Unfortunately I don't know how the persist was meant to work (or not)
prior to entering an ESTABLISHED state.


Re: Avalon maintainance update

2011-01-11 Thread Matthew Dillon
I'd rather not change the DNS, it could create confusion for the
mirrors.  And it will probably confuse the hell out of crater and
pkgbox64 too.


Re: HAMMER and bad disk sectors

2011-01-10 Thread Matthew Dillon
:What does HAMMER do about bad sectors? Does it mark them as corrupt, and
:allow system to function normally?

You will get read errors.  EIO for any related files/directories.
It may or may not be able to reallocate them but HAMMER does not keep
a bad-block table.  media errors generally mean the media must be
replaced.  If the media error occurs in a bad place, like a bitmap or
undo record, the filesystem will be knocked down to read-only mode.

If the media errors prevent you from being able to backup the filesystem
you can use the hammer recover directive to exhaustively locate and copy
the directory tree to another hammer filesystem on different media.

Matthew Dillon

Re: CRC data failed?

2011-01-06 Thread Matthew Dillon

:It sounds like the filesystem image got corrupted, your best bet is
:to wipe it completely (w/ newfs_hammer).
:Only a handfull of files seem to have a problem, so for now I will leave as 
:it. I guess I could try deleting those files.
:  It is possible to recover the data to another filesystem (not in-place) 
:if you need the data.
:Will do. I will setup another VM. The original data was downloaded over DSL 
:and took almost a week to download.
:Checking the entire filesystem's CRCs can be done by issuing a 
:mirror-read redirected to /dev/null and seeing if any errors pop
:up.  This will wind up reading the entire disk and could take some
:time depending on the size of the filesystem.
:Where can I read about how to do that?

'man hammer' gives you a lot of information.  The integrity checks are
integrated into the mirror-read code but we haven't wrapped it with
something more user friendly.
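The check described above amounts to a one-liner (the mount point is a
placeholder for your actual HAMMER filesystem):

```shell
# Stream the whole filesystem through the mirroring code and discard
# the output; any CRC failures are reported as errors along the way.
# Expect it to read essentially the entire disk.
hammer mirror-read /home > /dev/null
```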

:Also, is the crash possibly due to the fact it is in a VM? This is my first 
:production test of DragonFly, so I am trying to understand what happened. If 
:this was a sever with original data, I would be pretty worried about the 

The problem is basically that the VM lies to the kernel when the kernel
tries to flush the disk cache, telling the kernel that the disk cache
has been flushed when the data has not actually been synced to the
physical media.  This means that when a crash of the host occurs
(versus the guest), the state of the data on the physical media winds
up being different than what the HAMMER recovery code expects.

On a real system the disk flush command actually works properly.

Matthew Dillon

Re: CRC data failed?

2011-01-05 Thread Matthew Dillon

:In the screen of the VM where I am running dragonfly, the console in this 
:case, there is an error about CRC data failed.
:What is that about?

Well, when it's on a VM it usually means a crash occurred and the VM
was lying to the operating system when the OS told it to flush its disk
caches.  What VM were you running it on?

Matthew Dillon
